{ "cells": [ { "cell_type": "markdown", "id": "title-cell", "metadata": {}, "source": [ "# SEC Filing Processor\n", "## Morningstar RAG Pipeline — Phase 2b\n", "\n", "This notebook processes **Apple SEC filings** (10-K, 10-Q, 8-K) from raw HTML into structured JSON, following the same pattern as `01_pdf_processing.ipynb` for Morningstar PDFs.\n", "\n", "### Why a separate processor?\n", "\n", "| | Morningstar PDFs | SEC HTML filings |\n", "|---|---|---|\n", "| Input format | PDF | HTML (`.htm`) |\n", "| Page numbers | Yes | No — HTML has no pages |\n", "| Noise filter | Page-based (remove pages 13-14) | Cover-section boilerplate |\n", "| Metadata | company, ticker, doc_type | fiscal_year, accession, filing_date |\n", "| Docling model | DocLayNet layout detection | HTML structure parsing |\n", "\n", "Both use the **same Docling converter** and both save a `_docling.json` — so **HybridChunker in Phase 3 works identically** for PDFs and HTML.\n", "\n", "### Filings we are processing\n", "| Type | Count | Description |\n", "|---|---|---|\n", "| 10-K | 3 | Annual reports (2023, 2024, 2025) — large |\n", "| 10-Q | 6 | Quarterly reports — medium |\n", "| 8-K | 5 | Current reports / earnings releases — small |\n", "\n", "### Steps in this notebook\n", "```\n", "STEP 1 — Imports & Paths\n", "STEP 2 — Configure Docling for HTML\n", "STEP 3 — Parse One Filing (8-K demo — fastest to run)\n", "STEP 4 — Inspect Raw Output\n", "STEP 5 — Extract Sections + Boilerplate Detection\n", "STEP 6 — Extract Tables\n", "STEP 7 — Attach Metadata\n", "STEP 8 — Save JSON + _docling.json\n", "STEP 9 — Batch Process All SEC Filings\n", "STEP 10 — Verify Output\n", "```" ] }, { "cell_type": "markdown", "id": "step1-hdr", "metadata": {}, "source": [ "## STEP 1 — Imports & Paths" ] }, { "cell_type": "code", "execution_count": 1, "id": "step1-code", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Project root : /home/pushkardeshpand/Documents/Morningstar RAG 
Pipeline\n", "Raw SEC dir : /home/pushkardeshpand/Documents/Morningstar RAG Pipeline/data/raw/sec_filings/AAPL\n", "Output dir : /home/pushkardeshpand/Documents/Morningstar RAG Pipeline/data/processed/sec_filings/AAPL\n", "\n", "10-K (3): ['2023', '2024', '2025']\n", "10-Q (6): ['2024_Q3', '2025_Q1', '2025_Q2', '2025_Q3', '2026_Q1', '2026_Q2']\n", "8-K (5): ['2026-01-02', '2026-01-29', '2026-02-24', '2026-04-20', '2026-04-30']\n" ] } ], "source": [ "import re\n", "import json\n", "import sys\n", "import time\n", "from pathlib import Path\n", "from datetime import datetime, timezone\n", "\n", "# ── Project paths ──────────────────────────────────────────────────────────────\n", "NOTEBOOK_DIR = Path().resolve()\n", "PROJECT_ROOT = NOTEBOOK_DIR.parent\n", "SRC_DIR = PROJECT_ROOT / \"src\"\n", "RAW_SEC_DIR = PROJECT_ROOT / \"data\" / \"raw\" / \"sec_filings\" / \"AAPL\"\n", "SEC_OUT_DIR = PROJECT_ROOT / \"data\" / \"processed\" / \"sec_filings\" / \"AAPL\"\n", "\n", "sys.path.insert(0, str(SRC_DIR))\n", "\n", "print(f\"Project root : {PROJECT_ROOT}\")\n", "print(f\"Raw SEC dir : {RAW_SEC_DIR}\")\n", "print(f\"Output dir : {SEC_OUT_DIR}\")\n", "print()\n", "\n", "# Inventory the raw filings\n", "for doc_type in [\"10-K\", \"10-Q\", \"8-K\"]:\n", "    type_dir = RAW_SEC_DIR / doc_type\n", "    if not type_dir.exists():\n", "        continue\n", "    periods = sorted(p.name for p in type_dir.iterdir() if (p / \"filing.htm\").exists())\n", "    print(f\"{doc_type:5s} ({len(periods)}): {periods}\")" ] }, { "cell_type": "markdown", "id": "step2-hdr", "metadata": {}, "source": [ "## STEP 2 — Configure Docling for HTML\n", "\n", "Docling handles HTML natively — the same `DocumentConverter` used for PDFs works for `.htm` files.\n", "\n", "For HTML, Docling:\n", "- Parses `