Python Scraper + Text Mining (SEDAR+/TSX, 2013–2025)

Title
Canadian Corporate Digitalization Dataset (2013–2025): Scraping SEDAR+/TSX, Text Extraction, Digitalization Index & Topic Modeling

Project Overview
I need a freelancer to build a dataset of corporate digitalization disclosure for all Canadian listed companies (approx. 3,476 issuers) over the period 2013–2025. The work requires:
1. Scraping MD&A, Annual Reports, and AIF from SEDAR+ / TSX.
2. Extracting & cleaning text from reports.
3. Measuring a Digitalization Index (dictionary-based, using keywords from prior academic literature).
4. Conducting Topic Modeling (LDA/STM) to identify digitalization themes.
5. Delivering structured firm–year CSV files and reproducible Python code.

Tasks & Deliverables

1. Scraping (2013–2025)
• Collect issuer list (CSV provided, ~3,476 firms).
• For each issuer × year, download available:
  o MD&A (Management's Discussion & Analysis)
  o Annual Report
  o Annual Information Form (AIF)
• Save PDFs under:
  data/reports_raw/{FirmName}/{Year}/document.pdf
• Provide a manifest (CSV) with: firm, ticker, year, document type, source URL, download date, file path, checksum.

2. Text Extraction & Cleaning
• Convert PDFs → text (pdfminer.six, PyPDF2, OCR fallback).
• Clean text: remove headers, tables, footers, page numbers.
• Save under:
  data/reports_txt/{FirmName}/{Year}/document.txt

3. Digitalization Index (Mandatory)
Use a dictionary-based approach with the following keywords compiled from prior academic literature:

Core Digitalization
• digitalization, digitization, digital transformation, digital economy, information technology, information systems
(Bharadwaj et al., 2013; Li et al., 2021)

Technologies
• artificial intelligence, AI, machine learning, ML, deep learning, DL, natural language processing, NLP, computer vision
• robotics, robotic process automation, RPA
• cloud computing, SaaS, PaaS, IaaS, cloud
• blockchain, distributed ledger, DLT
• fintech
• internet of things, IoT, industrial internet
• big data, data analytics
• edge computing
• digital twin
(Verhoef et al., 2021; Chen et al., 2022)

Business Models & Finance
• digital platform, e-commerce, online marketplace
• open banking, mobile banking, mobile payments
• digital banking, neobanking
• application programming interface, API
• microservices
• fintech innovation
(Gomber et al., 2018; Vial, 2019)

Organizational Processes
• digital strategy, IT capability, IT infrastructure
• enterprise resource planning, ERP
• customer relationship management, CRM
• business process automation
• data warehouse, data lake
• omnichannel, multichannel
(Matt et al., 2015; Susanti et al., 2023)

Scoring rules
• Count frequency of these keywords per document.
• Normalize by total word count (per 10,000 words).
• For each firm–year, calculate:
  o dict_raw_count (sum of matches)
  o dict_score_per_10k (normalized index).
• Save in firm_year_summary.csv.

4. Topic Modeling (Mandatory)
• Apply Latent Dirichlet Allocation (LDA) or a Structural Topic Model (STM) across the corpus.
• Identify digitalization-related latent topics.
• Report:
  o Topic distributions for each firm–year.
  o Top 10 words per topic with coherence scores.
• Deliver:
  o firm_year_topics.csv (firm, year, topic_1_share, …).
  o topic_keywords.csv (topic_id, top_words, coherence_score).

5. Final Dataset
Deliver three structured CSV files:
1. firm_year_manifest.csv → metadata for every document.
2. firm_year_summary.csv → aggregated per firm–year with Digitalization Index.
3. firm_year_topics.csv → topic shares per firm–year.

6. Code & Documentation
• All scripts in /src.
• requirements.txt for dependencies.
• README.md with instructions to rerun pipeline.
• Config file (config.yaml) for paths, years, scoring settings.

Example Output

firm_year_summary.csv
firm_name            ticker  year  total_word_count  dict_raw_count  dict_score_per_10k  dominant_topic  digitalization_topic_share
Bank of Nova Scotia  BNS     2019  82,134            245             29.8                4 (FinTech)     0.32
Shopify Inc.         SHOP    2021  61,255            432             70.5                2 (Cloud)       0.55

topic_keywords.csv
topic_id  top_words                           coherence_score
1         risk, credit, impairment, exposure  0.46
2         cloud, platform, saas, software     0.51
4         fintech, digital, payment, ai       0.54

Application Instructions
Please include in your proposal:
1. Your experience scraping large regulatory datasets (e.g., SEDAR+, EDGAR).
2. Python/NLP experience (dictionary scoring, TF-IDF, topic modeling).
3. How you will handle scanned PDFs.
4. Links to GitHub/portfolio if available.
5. Confirmation you will deliver both Digitalization Index and Topic Modeling outputs as described.
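
Illustrative scoring sketch
To make the Digitalization Index scoring rules concrete for applicants, here is a minimal Python sketch of how dict_raw_count and dict_score_per_10k could be computed for one document. It is not a required implementation. The names PHRASES, ACRONYMS, per_10k and score_document are hypothetical, the keyword lists are only a small subset of the full dictionary, and several matching choices are assumptions on my side that the final code may handle differently: phrases are matched case-insensitively, acronyms case-sensitively, only exact word forms are counted (no plurals), words are counted by whitespace splitting, and longer phrases take priority so that "cloud computing" is not also counted as "cloud".

```python
import re

# Small illustrative subset of the keyword dictionary (assumption: the real
# pipeline would load the full list from config.yaml or a keyword file).
PHRASES = ["digital transformation", "machine learning", "cloud computing",
           "big data", "cloud"]
ACRONYMS = ["AI", "ML", "IoT", "SaaS", "API"]


def _compile(terms, flags=0):
    """Build one regex for all terms, longest first, with word boundaries."""
    ordered = sorted(terms, key=len, reverse=True)
    # Allow any run of whitespace between the words of a multi-word phrase,
    # since PDF extraction often breaks phrases across lines.
    alternatives = [r"\s+".join(re.escape(w) for w in t.split()) for t in ordered]
    body = "|".join(alternatives)
    # Reject matches glued to other word characters or hyphens.
    return re.compile(rf"(?<![\w-])(?:{body})(?![\w-])", flags)


PHRASE_RE = _compile(PHRASES, re.IGNORECASE)
ACRONYM_RE = _compile(ACRONYMS)  # case-sensitive so "said" never matches "AI"


def per_10k(raw_count, total_words):
    """Keyword matches per 10,000 words; 0.0 for an empty document."""
    return 10_000 * raw_count / total_words if total_words else 0.0


def score_document(text):
    """Return the three scoring fields for a single cleaned report text."""
    total_words = len(text.split())
    raw = len(PHRASE_RE.findall(text)) + len(ACRONYM_RE.findall(text))
    return {
        "total_word_count": total_words,
        "dict_raw_count": raw,
        "dict_score_per_10k": per_10k(raw, total_words),
    }


if __name__ == "__main__":
    sample = ("Our digital transformation relies on cloud computing and AI. "
              "The cloud team said so.")
    print(score_document(sample))
```

Where a firm–year has several documents (MD&A, Annual Report, AIF), one option is to sum the match counts and word counts across documents before normalizing. Please state in your proposal how you would aggregate and how you would treat overlapping terms and plural forms.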