{ "cells": [ { "cell_type": "markdown", "id": "title-cell", "metadata": {}, "source": [ "# SEC Filing Processor\n", "## Morningstar RAG Pipeline — Phase 2b\n", "\n", "This notebook processes **Apple SEC filings** (10-K, 10-Q, 8-K) from raw HTML into structured JSON, following the same pattern as `01_pdf_processing.ipynb` for Morningstar PDFs.\n", "\n", "### Why a separate processor?\n", "\n", "| | Morningstar PDFs | SEC HTML filings |\n", "|---|---|---|\n", "| Input format | PDF | HTML (`.htm`) |\n", "| Page numbers | Yes | No — HTML has no pages |\n", "| Noise filter | Page-based (remove pages 13-14) | Cover-section boilerplate |\n", "| Metadata | company, ticker, doc_type | fiscal_year, accession, filing_date |\n", "| Docling model | DocLayNet layout detection | HTML structure parsing |\n", "\n", "Both use the **same Docling converter** and both save a `_docling.json` — so **HybridChunker in Phase 3 works identically** for PDFs and HTML.\n", "\n", "### Filings we are processing\n", "| Type | Count | Description |\n", "|---|---|---|\n", "| 10-K | 3 | Annual reports (2023, 2024, 2025) — large |\n", "| 10-Q | 6 | Quarterly reports — medium |\n", "| 8-K | 5 | Current reports / earnings releases — small |\n", "\n", "### Steps in this notebook\n", "```\n", "STEP 1 — Imports & Paths\n", "STEP 2 — Configure Docling for HTML\n", "STEP 3 — Parse One Filing (8-K demo — fastest to run)\n", "STEP 4 — Inspect Raw Output\n", "STEP 5 — Extract Sections + Boilerplate Detection\n", "STEP 6 — Extract Tables\n", "STEP 7 — Attach Metadata\n", "STEP 8 — Save JSON + _docling.json\n", "STEP 9 — Batch Process All SEC Filings\n", "STEP 10 — Verify Output\n", "```" ] }, { "cell_type": "markdown", "id": "step1-hdr", "metadata": {}, "source": [ "## STEP 1 — Imports & Paths" ] }, { "cell_type": "code", "execution_count": 1, "id": "step1-code", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Project root : /home/pushkardeshpand/Documents/Morningstar RAG 
Pipeline\n", "Raw SEC dir : /home/pushkardeshpand/Documents/Morningstar RAG Pipeline/data/raw/sec_filings/AAPL\n", "Output dir : /home/pushkardeshpand/Documents/Morningstar RAG Pipeline/data/processed/sec_filings/AAPL\n", "\n", "10-K (3): ['2023', '2024', '2025']\n", "10-Q (6): ['2024_Q3', '2025_Q1', '2025_Q2', '2025_Q3', '2026_Q1', '2026_Q2']\n", "8-K (5): ['2026-01-02', '2026-01-29', '2026-02-24', '2026-04-20', '2026-04-30']\n" ] } ], "source": [ "import re\n", "import json\n", "import sys\n", "import time\n", "from pathlib import Path\n", "from datetime import datetime, timezone\n", "\n", "# ── Project paths ──────────────────────────────────────────────────────────────\n", "NOTEBOOK_DIR = Path().resolve()\n", "PROJECT_ROOT = NOTEBOOK_DIR.parent\n", "SRC_DIR = PROJECT_ROOT / \"src\"\n", "RAW_SEC_DIR = PROJECT_ROOT / \"data\" / \"raw\" / \"sec_filings\" / \"AAPL\"\n", "SEC_OUT_DIR = PROJECT_ROOT / \"data\" / \"processed\" / \"sec_filings\" / \"AAPL\"\n", "\n", "sys.path.insert(0, str(SRC_DIR))\n", "\n", "print(f\"Project root : {PROJECT_ROOT}\")\n", "print(f\"Raw SEC dir : {RAW_SEC_DIR}\")\n", "print(f\"Output dir : {SEC_OUT_DIR}\")\n", "print()\n", "\n", "# Inventory the raw filings\n", "for doc_type in [\"10-K\", \"10-Q\", \"8-K\"]:\n", " type_dir = RAW_SEC_DIR / doc_type\n", " if not type_dir.exists():\n", " continue\n", " periods = sorted(p.name for p in type_dir.iterdir() if (p / \"filing.htm\").exists())\n", " print(f\"{doc_type:5s} ({len(periods)}): {periods}\")" ] }, { "cell_type": "markdown", "id": "step2-hdr", "metadata": {}, "source": [ "## STEP 2 — Configure Docling for HTML\n", "\n", "Docling handles HTML natively — the same `DocumentConverter` used for PDFs works for `.htm` files.\n", "\n", "For HTML, Docling:\n", "- Parses `

<h1>`–`<h6>` tags → `SectionHeaderItem` (heading hierarchy preserved automatically)\n", "- Parses `<table>` tags → `TableItem`, with rows and columns read directly from the HTML markup (TableFormer is only used in the PDF pipeline)\n", "- Parses `

`, `

` text → `TextItem`\n", "- No OCR, no layout detection (not needed for HTML)\n", "\n", "The resulting `DoclingDocument` object is identical in structure to the one produced from a PDF — so **HybridChunker works on both without any changes**." ] }, { "cell_type": "code", "execution_count": 2, "id": "step2-code", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING: All log messages before absl::InitializeLog() is called are written to STDERR\n", "I0000 00:00:1782117419.440624 3132948 port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.\n", "I0000 00:00:1782117419.473381 3132948 cpu_feature_guard.cc:227] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\n", "To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n", "WARNING: All log messages before absl::InitializeLog() is called are written to STDERR\n", "I0000 00:00:1782117420.228651 3132948 port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. 
To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Docling converter ready.\n", " Table structure : True\n", " OCR : False\n", " Picture images : False\n" ] } ], "source": [ "import warnings\n", "warnings.filterwarnings(\"ignore\")\n", "\n", "from docling.document_converter import DocumentConverter, PdfFormatOption\n", "from docling.datamodel.pipeline_options import PdfPipelineOptions\n", "from docling.datamodel.base_models import InputFormat\n", "\n", "# NOTE: PdfPipelineOptions only affect PDF inputs; the .htm filings are converted\n", "# by Docling's HTML backend with default settings, so these flags do not change them.\n", "opts = PdfPipelineOptions()\n", "opts.do_table_structure = True # reconstruct <table> rows/columns\n", "opts.do_ocr = False # HTML — no OCR needed\n", "opts.generate_picture_images = False # skip figure extraction\n", "\n", "converter = DocumentConverter(\n", "    format_options={\n", "        InputFormat.PDF: PdfFormatOption(pipeline_options=opts)\n", "    }\n", ")\n", "\n", "print(\"Docling converter ready.\")\n", "print(f\" Table structure : {opts.do_table_structure}\")\n", "print(f\" OCR : {opts.do_ocr}\")\n", "print(f\" Picture images : {opts.generate_picture_images}\")" ] },
{ "cell_type": "markdown", "id": "step3-hdr", "metadata": {}, "source": [ "## STEP 3 — Parse One Filing (8-K Demo)\n", "\n", "We start with a recent 8-K (earnings press release) — it is the smallest filing type and parses quickly.\n", "\n", "After this walkthrough, STEP 9 will batch-process **all 14 filings** automatically.\n", "\n", "> **Note on processing time** (per filing, as measured in the run recorded in this notebook):\n", "> - 8-K: under 1 second (small: earnings release text)\n", "> - 10-Q: roughly 6–12 seconds (medium: quarterly financial statements)\n", "> - 10-K: roughly 24–28 seconds per file (large: full annual report with 60+ tables)\n", ">\n", "> Expect longer times on slower hardware or larger filings." ] },
{ "cell_type": "code", "execution_count": 3, "id": "step3-code", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Filing : filing.htm\n", "Size : 37 KB\n", "Form type : 8-K\n", "Filing date : 2026-04-30\n", "Accession : 0000320193-26-000011\n", "\n", "Parsing with Docling (this may take 10–30 seconds for an 8-K)...\n", "\n", "Parsing complete in 0.1s\n", "Document type : DoclingDocument\n", "Tables found : 14\n" ] } ], "source": [ "# Use the most recent 8-K as the demo filing\n", "HTM_PATH = RAW_SEC_DIR / \"8-K\" / \"2026-04-30\" / \"filing.htm\"\n", "\n", "with open(RAW_SEC_DIR / \"8-K\" / \"2026-04-30\" / \"metadata.json\") as f:\n", "    file_meta = json.load(f)\n", "\n", "print(f\"Filing : {HTM_PATH.name}\")\n", "print(f\"Size : {HTM_PATH.stat().st_size / 1024:.0f} KB\")\n", "print(f\"Form type : {file_meta['form']}\")\n", "print(f\"Filing date : 
{file_meta['filing_date']}\")\n", "print(f\"Accession : {file_meta['accession']}\")\n", "print()\n", "print(\"Parsing with Docling (this may take 10–30 seconds for an 8-K)...\")\n", "\n", "start = time.time()\n", "result = converter.convert(str(HTM_PATH))\n", "doc = result.document\n", "elapsed = time.time() - start\n", "\n", "print(f\"\\nParsing complete in {elapsed:.1f}s\")\n", "print(f\"Document type : {type(doc).__name__}\")\n", "print(f\"Tables found : {len(doc.tables)}\")" ] }, { "cell_type": "markdown", "id": "step4-hdr", "metadata": {}, "source": [ "## STEP 4 — Inspect Raw Output\n", "\n", "Let's see what element types Docling produced from the HTML." ] }, { "cell_type": "code", "execution_count": 4, "id": "step4-code", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Element types in document:\n", " TextItem 36\n", " TableItem 14\n", " PictureItem 1\n", "\n", "First 12 elements:\n", "----------------------------------------------------------------------\n", " [TableItem ] lvl=1 \"\"\n", " [TextItem ] lvl=1 \"UNITED STATES\"\n", " [TextItem ] lvl=1 \"SECURITIES AND EXCHANGE COMMISSION\"\n", " [TextItem ] lvl=1 \"Washington, D.C. 
20549\"\n", " [TableItem ] lvl=1 \"\"\n", " [TextItem ] lvl=1 \"FORM 8-K\"\n", " [TextItem ] lvl=1 \"CURRENT REPORT\"\n", " [TextItem ] lvl=1 \"Pursuant to Section 13 OR 15(d) of The Securities Exchange Act of 1934\"\n", " [TextItem ] lvl=1 \"April 30, 2026\"\n", " [TextItem ] lvl=1 \"Date of Report (Date of earliest event reported)\"\n", " [TableItem ] lvl=1 \"\"\n", " [TextItem ] lvl=1 \"g325078g0426062022046a24.jpg\"\n" ] } ], "source": [ "from docling.datamodel.document import TextItem, SectionHeaderItem, TableItem\n", "\n", "# Count element types\n", "type_counts = {}\n", "for item, level in doc.iterate_items():\n", " t = type(item).__name__\n", " type_counts[t] = type_counts.get(t, 0) + 1\n", "\n", "print(\"Element types in document:\")\n", "for t, n in sorted(type_counts.items(), key=lambda x: -x[1]):\n", " print(f\" {t:30s} {n}\")\n", "\n", "print()\n", "print(\"First 12 elements:\")\n", "print(\"-\" * 70)\n", "for i, (item, level) in enumerate(doc.iterate_items()):\n", " if i >= 12:\n", " break\n", " text = getattr(item, \"text\", \"\")[:80]\n", " itype = type(item).__name__\n", " print(f\" [{itype:20s}] lvl={level} \\\"{text}\\\"\")" ] }, { "cell_type": "markdown", "id": "step5-hdr", "metadata": {}, "source": [ "## STEP 5 — Extract Sections + Boilerplate Detection\n", "\n", "SEC filings begin with a **cover section** containing administrative metadata — not financial content.\n", "\n", "This boilerplate includes:\n", "- `UNITED STATES` / `SECURITIES AND EXCHANGE COMMISSION`\n", "- `Commission File Number: 001-36743`\n", "- `(Exact name of Registrant as specified in its charter)`\n", "- Form checkboxes: `☒ ANNUAL REPORT PURSUANT TO SECTION 13(d)...`\n", "\n", "We tag these with `is_boilerplate: True` so downstream processes can identify them. 
We do **not** remove them from the `_docling.json` — HybridChunker handles the structure.\n", "\n", "> Unlike Morningstar PDFs where we removed entire pages, here we flag individual fragments because there are no page boundaries." ] }, { "cell_type": "code", "execution_count": 5, "id": "step5-code", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sections extracted : 36\n", " Boilerplate : 7 (cover-section administrative text)\n", " Content : 29\n", " Headers : 0\n", "\n", "Section headers found:\n", "\n", "First 5 content sections:\n", " [text ] parent=''\n", " text='CURRENT REPORT'\n", " [text ] parent=''\n", " text='Pursuant to Section 13 OR 15(d) of The Securities Exchange Act of 1934'\n", " [text ] parent=''\n", " text='April 30, 2026'\n", " [text ] parent=''\n", " text='Date of Report (Date of earliest event reported)'\n", " [text ] parent=''\n", " text='g325078g0426062022046a24.jpg'\n" ] } ], "source": [ "_BOILERPLATE_EXACT = {\n", " \"united states\",\n", " \"securities and exchange commission\",\n", " \"washington, d.c. 
20549\",\n", " \"(mark one)\",\n", " \"or\",\n", " \"for the transition period from to .\",\n", " \"\\u2612\", \"\\u2610\",\n", "}\n", "\n", "_BOILERPLATE_RE = re.compile(\n", " r\"^(\"\n", " r\"form \\d+[\\-/][a-z]+\"\n", " r\"|commission file\"\n", " r\"|irs employer\"\n", " r\"|state or other\"\n", " r\"|jurisdiction\"\n", " r\"|\\(exact name\"\n", " r\"|\\(zip code\"\n", " r\"|indicate by check\"\n", " r\"|securities registered\"\n", " r\"|aggregate market value\"\n", " r\"|number of shares\"\n", " r\"|\\u2612|\\u2610\"\n", " r\")\",\n", " re.IGNORECASE,\n", ")\n", "\n", "def _is_boilerplate(text: str) -> bool:\n", " t = text.strip().lower()\n", " if t in _BOILERPLATE_EXACT: return True\n", " if len(t) < 5: return True\n", " if _BOILERPLATE_RE.match(text.strip()): return True\n", " return False\n", "\n", "def clean_text(text: str) -> str:\n", " if not text: return \"\"\n", " text = text.replace(\"\\u00ad\", \"\").replace(\"\\u200b\", \"\")\n", " text = re.sub(r\"[ \\t]+\", \" \", text)\n", " text = re.sub(r\"\\n{3,}\", \"\\n\\n\", text)\n", " return text.strip()\n", "\n", "\n", "# Extract all sections\n", "sections = []\n", "current_header = \"\"\n", "\n", "for item, level in doc.iterate_items():\n", " text = getattr(item, \"text\", None)\n", " if not text or not text.strip():\n", " continue\n", " if isinstance(item, TableItem):\n", " continue\n", "\n", " raw = text.strip()\n", " is_hdr = isinstance(item, SectionHeaderItem)\n", "\n", " sections.append({\n", " \"type\" : \"header\" if is_hdr else \"text\",\n", " \"level\" : level,\n", " \"text\" : raw,\n", " \"cleaned_text\" : clean_text(raw),\n", " \"page_num\" : None,\n", " \"parent_header\" : current_header,\n", " \"is_boilerplate\": _is_boilerplate(raw),\n", " })\n", "\n", " if is_hdr:\n", " current_header = raw\n", "\n", "boilerplate = [s for s in sections if s[\"is_boilerplate\"]]\n", "content = [s for s in sections if not s[\"is_boilerplate\"]]\n", "headers = [s for s in sections if s[\"type\"] == 
\"header\"]\n", "\n", "print(f\"Sections extracted : {len(sections)}\")\n", "print(f\" Boilerplate : {len(boilerplate)} (cover-section administrative text)\")\n", "print(f\" Content : {len(content)}\")\n", "print(f\" Headers : {len(headers)}\")\n", "print()\n", "print(\"Section headers found:\")\n", "for h in headers:\n", " print(f\" [H{h['level']}] {h['text'][:80]}\")\n", "print()\n", "print(\"First 5 content sections:\")\n", "for s in content[:5]:\n", " print(f\" [{s['type']:6s}] parent='{s['parent_header'][:40]}'\")\n", " print(f\" text='{s['text'][:100]}'\")" ] }, { "cell_type": "markdown", "id": "step6-hdr", "metadata": {}, "source": [ "## STEP 6 — Extract Tables\n", "\n", "SEC filings contain financial statement tables — balance sheets, income statements, cash flow statements.\n", "\n", "Same approach as Morningstar PDFs:\n", "- `export_to_dataframe(doc)` → structured data\n", "- `export_to_markdown(doc)` → markdown for LLM context\n", "- `is_atomic = True` → never split these in chunking" ] }, { "cell_type": "code", "execution_count": 9, "id": "step6-code", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tables extracted: 14\n", "\n", " # Rows Cols Headers\n", "------------------------------------------------------------\n", " 0 2 3 ['0', '1', '2']\n", " 1 2 3 ['0', '1', '2']\n", " 2 2 3 ['0', '1', '2']\n", " 3 3 15 ['0', '1', '2']\n", " 4 2 6 ['0', '1', '2']\n", " 5 2 6 ['0', '1', '2']\n", " 6 2 6 ['0', '1', '2']\n", " 7 2 6 ['0', '1', '2']\n", " 8 2 3 ['0', '1', '2']\n", " 9 9 9 ['0', '1', '2']\n", " 10 2 3 ['0', '1', '2']\n", " 11 2 3 ['0', '1', '2']\n", " 12 5 9 ['0', '1', '2']\n", " 13 6 18 ['0', '1', '2']\n", "\n", "First table preview (markdown format — what LLM receives):\n", "=======================================================\n", "| | | | | | | | | | | | | | | | | | |\n", 
"|-------|-------|-------|----------------|----------------|----------------|----|----|----|------------|------------|------------|------------|------------|------------|------------------------------------------------|------------------|------------------|\n", "| Date: | Date: | Date: | April 30, 2026 | April 30, 2026 | April 30, 2026 | | | | Apple Inc. | Apple Inc. | Apple Inc. | Apple Inc. | Apple Inc. | Apple Inc. | Apple Inc. | Apple Inc. | Apple Inc. |\n", "| | | | | | | | | | | | | | | | | \n", "=======================================================\n" ] } ], "source": [ "tables = []\n", "\n", "for i, table in enumerate(doc.tables):\n", " try:\n", " df = table.export_to_dataframe(doc)\n", " markdown = table.export_to_markdown(doc)\n", "\n", " if df.empty or len(df) < 2:\n", " continue\n", "\n", " tables.append({\n", " \"index\" : i,\n", " \"page_num\" : None,\n", " \"markdown\" : markdown,\n", " \"headers\" : list(df.columns.astype(str)),\n", " \"rows\" : len(df),\n", " \"cols\" : len(df.columns),\n", " \"data\" : df.fillna(\"\").values.tolist(),\n", " \"is_atomic\": True,\n", " })\n", " except Exception as e:\n", " print(f\" Warning: table {i} skipped — {e}\")\n", "\n", "print(f\"Tables extracted: {len(tables)}\")\n", "print()\n", "print(f\"{'#':>3} {'Rows':>4} {'Cols':>4} Headers\")\n", "print(\"-\" * 60)\n", "for t in tables:\n", " print(f\"{t['index']:>3} {t['rows']:>4} {t['cols']:>4} {t['headers'][:3]}\")\n", "\n", "if tables:\n", " print()\n", " print(\"First table preview (markdown format — what LLM receives):\")\n", " print(\"=\" * 55)\n", " print(tables[13][\"markdown\"][:1000])\n", " print(\"=\" * 55)" ] }, { "cell_type": "markdown", "id": "step7-hdr", "metadata": {}, "source": [ "## STEP 7 — Attach Metadata\n", "\n", "Every chunk from this document will carry metadata that enables filtered retrieval:\n", "\n", "```python\n", "# Retrieve only Apple 10-K chunks\n", "vectorstore.query(query, filter={\"doc_type\": \"10-K\", \"ticker\": \"AAPL\"})\n", 
"\n", "# Retrieve only the 2024 annual report\n", "vectorstore.query(query, filter={\"doc_type\": \"10-K\", \"fiscal_year\": \"2024\"})\n", "\n", "# Retrieve all Apple financial data (any doc type)\n", "vectorstore.query(query, filter={\"ticker\": \"AAPL\"})\n", "```" ] }, { "cell_type": "code", "execution_count": 10, "id": "step7-code", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Document metadata:\n", " source : sec_edgar\n", " doc_type : 8-K\n", " ticker : AAPL\n", " company : Apple Inc.\n", " fiscal_year : 2026\n", " filing_date : 2026-04-30\n", " accession : 0000320193-26-000011\n", " file_name : filing.htm\n", " file_path : /home/pushkardeshpand/Documents/Morningstar RAG Pipeline/data/raw/sec_filings/AAPL/8-K/2026-04-30/filing.htm\n", " license : public\n", " access_level : public\n", " parsed_at : 2026-06-22T08:41:12.684682+00:00\n", " parser : docling\n", " total_pages : 0\n", " total_sections : 36\n", " total_tables : 14\n", " removed_pages : []\n" ] } ], "source": [ "doc_metadata = {\n", " \"source\" : \"sec_edgar\",\n", " \"doc_type\" : file_meta[\"form\"],\n", " \"ticker\" : file_meta[\"ticker\"],\n", " \"company\" : \"Apple Inc.\",\n", " \"fiscal_year\" : file_meta[\"fiscal_year\"],\n", " \"filing_date\" : file_meta[\"filing_date\"],\n", " \"accession\" : file_meta[\"accession\"],\n", " \"file_name\" : HTM_PATH.name,\n", " \"file_path\" : str(HTM_PATH),\n", " \"license\" : \"public\",\n", " \"access_level\": \"public\",\n", " \"parsed_at\" : datetime.now(timezone.utc).isoformat(),\n", " \"parser\" : \"docling\",\n", " \"total_pages\" : 0,\n", " \"total_sections\" : len(sections),\n", " \"total_tables\" : len(tables),\n", " \"removed_pages\" : [], # HTML has no page numbers\n", "}\n", "\n", "print(\"Document metadata:\")\n", "for k, v in doc_metadata.items():\n", " print(f\" {k:20s}: {v}\")" ] }, { "cell_type": "markdown", "id": "step8-hdr", "metadata": {}, "source": [ "## STEP 8 — Save JSON + `_docling.json`\n", 
"\n", "We save two files — same pattern as `pdf_processor.py`:\n", "\n", "| File | Purpose |\n", "|---|---|\n", "| `8-K_2026-04-30.json` | Structured JSON — sections, tables, metadata for inspection |\n", "| `8-K_2026-04-30_docling.json` | Native DoclingDocument — loaded by HybridChunker in Phase 3 |\n", "\n", "The `_docling.json` is what enables HybridChunker to understand the full HTML heading hierarchy — headings, reading order, table positions — rather than working on a flat text list." ] }, { "cell_type": "code", "execution_count": 11, "id": "step8-code", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Saved JSON : 8-K_2026-04-30.json (36.3 KB)\n", "Saved _docling : 8-K_2026-04-30_docling.json (142.3 KB)\n", "\n", "What HybridChunker will see when it loads _docling.json:\n", " Full document heading hierarchy preserved\n", " 36 text sections\n", " 14 tables (will be handled atomically)\n", " All HTML

<h1>-<h6>

tags → SectionHeaderItem → heading_path metadata\n" ] } ], "source": [ "DOC_STEM = f\"{file_meta['form']}_{file_meta['filing_date']}\"\n", "\n", "SEC_OUT_DIR.mkdir(parents=True, exist_ok=True)\n", "\n", "out_path = SEC_OUT_DIR / f\"{DOC_STEM}.json\"\n", "docling_path = SEC_OUT_DIR / f\"{DOC_STEM}_docling.json\"\n", "\n", "parsed = {\n", " \"metadata\" : doc_metadata,\n", " \"sections\" : sections,\n", " \"tables\" : tables,\n", "}\n", "\n", "# Save structured JSON\n", "with open(out_path, \"w\") as f:\n", " json.dump(parsed, f, indent=2, ensure_ascii=False, default=str)\n", "print(f\"Saved JSON : {out_path.name} ({out_path.stat().st_size/1024:.1f} KB)\")\n", "\n", "# Save native DoclingDocument (required for HybridChunker)\n", "with open(docling_path, \"w\") as f:\n", " f.write(doc.model_dump_json())\n", "print(f\"Saved _docling : {docling_path.name} ({docling_path.stat().st_size/1024:.1f} KB)\")\n", "\n", "print()\n", "print(\"What HybridChunker will see when it loads _docling.json:\")\n", "print(f\" Full document heading hierarchy preserved\")\n", "print(f\" {len(sections)} text sections\")\n", "print(f\" {len(tables)} tables (will be handled atomically)\")\n", "print(f\" All HTML

<h1>-<h6>

tags → SectionHeaderItem → heading_path metadata\")" ] }, { "cell_type": "markdown", "id": "step9-hdr", "metadata": {}, "source": [ "## STEP 9 — Batch Process All SEC Filings\n", "\n", "Now use the `SECProcessor` class from `src/sec_processor.py` to process all 14 filings.\n", "\n", "> **Processing time observed in the run recorded below:**\n", "> - 8-K (×5) : under 1 s each (the STEP 3 demo filing is skipped as already processed)\n", "> - 10-Q (×6) : ~6–12 s each → ~50 s total\n", "> - 10-K (×3) : ~24–28 s each → ~80 s total\n", ">\n", "> Total: **~2 minutes** on CPU. The batch runs sequentially, and most of the time per filing is spent after Docling's conversion step, on section and table extraction and on saving.\n", "> Already-processed files are skipped automatically (`force=False`)." ] },
{ "cell_type": "code", "execution_count": 12, "id": "step9-code", "metadata": { "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2026-06-22 14:17:09,896 INFO \n", "── 10-K filings ────────────────────────────\n", "2026-06-22 14:17:09,897 INFO Processing: 10-K_2023 (filing.htm)\n", "2026-06-22 14:17:09,920 INFO Docling converter ready.\n", "2026-06-22 14:17:09,923 INFO detected formats: []\n", "2026-06-22 14:17:10,112 INFO Going to convert document batch...\n", "2026-06-22 14:17:10,112 INFO Initializing pipeline for SimplePipeline with options hash 7d306d2d021deac65a97d1a5f925362a\n", "2026-06-22 14:17:10,113 INFO Processing document filing.htm\n", "2026-06-22 14:17:11,163 INFO Finished converting document filing.htm in 1.24 sec.\n", "2026-06-22 14:17:35,599 INFO Saved JSON : 10-K_2023.json (1213.1 KB)\n", "2026-06-22 14:17:35,707 INFO Saved _docling : 10-K_2023_docling.json (5302.1 KB)\n", "2026-06-22 14:17:35,707 INFO Sections: 1086 (boilerplate: 69) Tables: 66\n", "2026-06-22 14:17:35,717 INFO Processing: 10-K_2024 (filing.htm)\n", "2026-06-22 14:17:35,720 INFO detected formats: []\n", "2026-06-22 14:17:35,888 INFO Going to convert document batch...\n", "2026-06-22 14:17:35,888 INFO Processing document filing.htm\n", "2026-06-22 14:17:36,773 INFO Finished converting document filing.htm in 1.06 
sec.\n", "2026-06-22 14:17:59,804 INFO Saved JSON : 10-K_2024.json (1196.1 KB)\n", "2026-06-22 14:17:59,920 INFO Saved _docling : 10-K_2024_docling.json (5167.8 KB)\n", "2026-06-22 14:17:59,921 INFO Sections: 1077 (boilerplate: 70) Tables: 63\n", "2026-06-22 14:17:59,928 INFO Processing: 10-K_2025 (filing.htm)\n", "2026-06-22 14:17:59,944 INFO detected formats: []\n", "2026-06-22 14:18:00,130 INFO Going to convert document batch...\n", "2026-06-22 14:18:00,131 INFO Processing document filing.htm\n", "2026-06-22 14:18:01,064 INFO Finished converting document filing.htm in 1.14 sec.\n", "2026-06-22 14:18:28,078 INFO Saved JSON : 10-K_2025.json (1204.9 KB)\n", "2026-06-22 14:18:28,177 INFO Saved _docling : 10-K_2025_docling.json (5242.1 KB)\n", "2026-06-22 14:18:28,178 INFO Sections: 1107 (boilerplate: 70) Tables: 62\n", "2026-06-22 14:18:28,182 INFO \n", "── 10-Q filings ────────────────────────────\n", "2026-06-22 14:18:28,183 INFO Processing: 10-Q_2024_Q3 (filing.htm)\n", "2026-06-22 14:18:28,184 INFO detected formats: []\n", "2026-06-22 14:18:28,332 INFO Going to convert document batch...\n", "2026-06-22 14:18:28,333 INFO Processing document filing.htm\n", "2026-06-22 14:18:28,998 INFO Finished converting document filing.htm in 0.82 sec.\n", "2026-06-22 14:18:37,051 INFO Saved JSON : 10-Q_2024_Q3.json (597.8 KB)\n", "2026-06-22 14:18:37,108 INFO Saved _docling : 10-Q_2024_Q3_docling.json (3212.1 KB)\n", "2026-06-22 14:18:37,109 INFO Sections: 512 (boilerplate: 32) Tables: 39\n", "2026-06-22 14:18:37,111 INFO Processing: 10-Q_2025_Q1 (filing.htm)\n", "2026-06-22 14:18:37,114 INFO detected formats: []\n", "2026-06-22 14:18:37,201 INFO Going to convert document batch...\n", "2026-06-22 14:18:37,202 INFO Processing document filing.htm\n", "2026-06-22 14:18:37,563 INFO Finished converting document filing.htm in 0.45 sec.\n", "2026-06-22 14:18:42,913 INFO Saved JSON : 10-Q_2025_Q1.json (564.1 KB)\n", "2026-06-22 14:18:42,964 INFO Saved _docling : 
10-Q_2025_Q1_docling.json (3025.9 KB)\n", "2026-06-22 14:18:42,964 INFO Sections: 460 (boilerplate: 32) Tables: 37\n", "2026-06-22 14:18:42,967 INFO Processing: 10-Q_2025_Q2 (filing.htm)\n", "2026-06-22 14:18:42,969 INFO detected formats: []\n", "2026-06-22 14:18:43,076 INFO Going to convert document batch...\n", "2026-06-22 14:18:43,076 INFO Processing document filing.htm\n", "2026-06-22 14:18:43,498 INFO Finished converting document filing.htm in 0.53 sec.\n", "2026-06-22 14:18:50,899 INFO Saved JSON : 10-Q_2025_Q2.json (615.4 KB)\n", "2026-06-22 14:18:50,952 INFO Saved _docling : 10-Q_2025_Q2_docling.json (3159.3 KB)\n", "2026-06-22 14:18:50,952 INFO Sections: 513 (boilerplate: 32) Tables: 37\n", "2026-06-22 14:18:50,954 INFO Processing: 10-Q_2025_Q3 (filing.htm)\n", "2026-06-22 14:18:50,956 INFO detected formats: []\n", "2026-06-22 14:18:51,062 INFO Going to convert document batch...\n", "2026-06-22 14:18:51,062 INFO Processing document filing.htm\n", "2026-06-22 14:18:51,489 INFO Finished converting document filing.htm in 0.53 sec.\n", "2026-06-22 14:18:59,275 INFO Saved JSON : 10-Q_2025_Q3.json (601.6 KB)\n", "2026-06-22 14:18:59,327 INFO Saved _docling : 10-Q_2025_Q3_docling.json (3183.1 KB)\n", "2026-06-22 14:18:59,327 INFO Sections: 506 (boilerplate: 32) Tables: 38\n", "2026-06-22 14:18:59,330 INFO Processing: 10-Q_2026_Q1 (filing.htm)\n", "2026-06-22 14:18:59,332 INFO detected formats: []\n", "2026-06-22 14:18:59,422 INFO Going to convert document batch...\n", "2026-06-22 14:18:59,423 INFO Processing document filing.htm\n", "2026-06-22 14:18:59,799 INFO Finished converting document filing.htm in 0.47 sec.\n", "2026-06-22 14:19:05,928 INFO Saved JSON : 10-Q_2026_Q1.json (602.8 KB)\n", "2026-06-22 14:19:05,991 INFO Saved _docling : 10-Q_2026_Q1_docling.json (3345.3 KB)\n", "2026-06-22 14:19:05,992 INFO Sections: 474 (boilerplate: 32) Tables: 38\n", "2026-06-22 14:19:05,994 INFO Processing: 10-Q_2026_Q2 (filing.htm)\n", "2026-06-22 14:19:05,997 INFO detected 
formats: []\n", "2026-06-22 14:19:06,115 INFO Going to convert document batch...\n", "2026-06-22 14:19:06,116 INFO Processing document filing.htm\n", "2026-06-22 14:19:06,581 INFO Finished converting document filing.htm in 0.59 sec.\n", "2026-06-22 14:19:17,686 INFO Saved JSON : 10-Q_2026_Q2.json (733.2 KB)\n", "2026-06-22 14:19:17,759 INFO Saved _docling : 10-Q_2026_Q2_docling.json (3608.9 KB)\n", "2026-06-22 14:19:17,760 INFO Sections: 610 (boilerplate: 32) Tables: 39\n", "2026-06-22 14:19:17,763 INFO \n", "── 8-K filings ────────────────────────────\n", "2026-06-22 14:19:17,764 INFO Processing: 8-K_2026-01-02 (filing.htm)\n", "2026-06-22 14:19:17,764 INFO detected formats: []\n", "2026-06-22 14:19:17,775 INFO Going to convert document batch...\n", "2026-06-22 14:19:17,775 INFO Processing document filing.htm\n", "2026-06-22 14:19:17,809 INFO Finished converting document filing.htm in 0.05 sec.\n", "2026-06-22 14:19:17,890 INFO Saved JSON : 8-K_2026-01-02.json (22.8 KB)\n", "2026-06-22 14:19:17,892 INFO Saved _docling : 8-K_2026-01-02_docling.json (60.0 KB)\n", "2026-06-22 14:19:17,892 INFO Sections: 73 (boilerplate: 24) Tables: 2\n", "2026-06-22 14:19:17,893 INFO Processing: 8-K_2026-01-29 (filing.htm)\n", "2026-06-22 14:19:17,894 INFO detected formats: []\n", "2026-06-22 14:19:17,903 INFO Going to convert document batch...\n", "2026-06-22 14:19:17,903 INFO Processing document filing.htm\n", "2026-06-22 14:19:17,928 INFO Finished converting document filing.htm in 0.04 sec.\n", "2026-06-22 14:19:17,973 INFO Saved JSON : 8-K_2026-01-29.json (36.4 KB)\n", "2026-06-22 14:19:17,976 INFO Saved _docling : 8-K_2026-01-29_docling.json (142.3 KB)\n", "2026-06-22 14:19:17,977 INFO Sections: 36 (boilerplate: 7) Tables: 14\n", "2026-06-22 14:19:17,978 INFO Processing: 8-K_2026-02-24 (filing.htm)\n", "2026-06-22 14:19:17,979 INFO detected formats: []\n", "2026-06-22 14:19:17,994 INFO Going to convert document batch...\n", "2026-06-22 14:19:17,995 INFO Processing document 
filing.htm\n", "2026-06-22 14:19:18,046 INFO Finished converting document filing.htm in 0.07 sec.\n", "2026-06-22 14:19:18,170 INFO Saved JSON : 8-K_2026-02-24.json (30.7 KB)\n", "2026-06-22 14:19:18,175 INFO Saved _docling : 8-K_2026-02-24_docling.json (132.1 KB)\n", "2026-06-22 14:19:18,176 INFO Sections: 76 (boilerplate: 24) Tables: 8\n", "2026-06-22 14:19:18,179 INFO Processing: 8-K_2026-04-20 (filing.htm)\n", "2026-06-22 14:19:18,180 INFO detected formats: []\n", "2026-06-22 14:19:18,188 INFO Going to convert document batch...\n", "2026-06-22 14:19:18,189 INFO Processing document filing.htm\n", "2026-06-22 14:19:18,244 INFO Finished converting document filing.htm in 0.06 sec.\n", "2026-06-22 14:19:18,322 INFO Saved JSON : 8-K_2026-04-20.json (24.0 KB)\n", "2026-06-22 14:19:18,325 INFO Saved _docling : 8-K_2026-04-20_docling.json (57.9 KB)\n", "2026-06-22 14:19:18,325 INFO Sections: 73 (boilerplate: 24) Tables: 2\n", "2026-06-22 14:19:18,331 INFO SKIP 8-K_2026-04-30 (already processed → 8-K_2026-04-30.json)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "============================================================\n", "Batch complete in 2.1 minutes\n", "============================================================\n", "\n", " Filing Sections Tables\n", "------------------------------------------------------------\n", " 10-K_2023-11-03 1086 66\n", " 10-K_2024-11-01 1077 63\n", " 10-K_2025-10-31 1107 62\n", " 10-Q_2024-08-02 512 39\n", " 10-Q_2025-01-31 460 37\n", " 10-Q_2025-05-02 513 37\n", " 10-Q_2025-08-01 506 38\n", " 10-Q_2026-01-30 474 38\n", " 10-Q_2026-05-01 610 39\n", " 8-K_2026-01-02 73 2\n", " 8-K_2026-01-29 36 14\n", " 8-K_2026-02-24 76 8\n", " 8-K_2026-04-20 73 2\n", " 8-K_2026-04-30 36 14\n", "\n", " Total sections : 6639\n", " Total tables : 459\n" ] } ], "source": [ "from sec_processor import SECProcessor\n", "\n", "processor = SECProcessor(output_dir=SEC_OUT_DIR)\n", "\n", "batch_start = time.time()\n", "results = 
processor.process_all(raw_dir=RAW_SEC_DIR, force=False)\n", "batch_elapsed = time.time() - batch_start\n", "\n", "print(f\"\\n{'='*60}\")\n", "print(f\"Batch complete in {batch_elapsed/60:.1f} minutes\")\n", "print(f\"{'='*60}\")\n", "print(f\"\\n {'Filing':40s} {'Sections':>8} {'Tables':>6}\")\n", "print(\"-\" * 60)\n", "for r in results:\n", " m = r[\"metadata\"]\n", " stem = f\"{m['doc_type']}_{m['filing_date'] or m['fiscal_year']}\"\n", " print(f\" {stem:40s} {m['total_sections']:>8} {m['total_tables']:>6}\")\n", "\n", "print(f\"\\n Total sections : {sum(r['metadata']['total_sections'] for r in results)}\")\n", "print(f\" Total tables : {sum(r['metadata']['total_tables'] for r in results)}\")" ] }, { "cell_type": "markdown", "id": "step10-hdr", "metadata": {}, "source": [ "## STEP 10 — Verify Output\n", "\n", "Confirm all JSON and `_docling.json` files were created. Every `.json` must have a paired `_docling.json` — otherwise HybridChunker will skip it in Phase 3." ] }, { "cell_type": "code", "execution_count": 13, "id": "step10-code", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Processed JSONs : 14\n", "_docling.json : 14\n", "\n", "File JSON _docling Status\n", "--------------------------------------------------------------------------------\n", " 10-K_2023.json 1213 KB 5302 KB OK\n", " 10-K_2024.json 1196 KB 5168 KB OK\n", " 10-K_2025.json 1205 KB 5242 KB OK\n", " 10-Q_2024_Q3.json 598 KB 3212 KB OK\n", " 10-Q_2025_Q1.json 564 KB 3026 KB OK\n", " 10-Q_2025_Q2.json 615 KB 3159 KB OK\n", " 10-Q_2025_Q3.json 602 KB 3183 KB OK\n", " 10-Q_2026_Q1.json 603 KB 3345 KB OK\n", " 10-Q_2026_Q2.json 733 KB 3609 KB OK\n", " 8-K_2026-01-02.json 23 KB 60 KB OK\n", " 8-K_2026-01-29.json 36 KB 142 KB OK\n", " 8-K_2026-02-24.json 31 KB 132 KB OK\n", " 8-K_2026-04-20.json 24 KB 58 KB OK\n", " 8-K_2026-04-30.json 36 KB 142 KB OK\n", "\n", "All 14 filings have paired _docling.json files.\n", "HybridChunker will work for all documents in 
Phase 3.\n" ] } ], "source": [ "all_json = sorted(f for f in SEC_OUT_DIR.glob(\"*.json\") if not f.name.endswith(\"_docling.json\"))\n", "all_docling = sorted(SEC_OUT_DIR.glob(\"*_docling.json\"))\n", "\n", "print(f\"Processed JSONs : {len(all_json)}\")\n", "print(f\"_docling.json : {len(all_docling)}\")\n", "print()\n", "\n", "missing_docling = []\n", "print(f\"{'File':45s} {'JSON':>8} {'_docling':>10} Status\")\n", "print(\"-\" * 80)\n", "for jf in all_json:\n", " dl = jf.with_name(jf.stem + \"_docling.json\")\n", " has = dl.exists()\n", " j_kb = jf.stat().st_size / 1024\n", " dl_kb = dl.stat().st_size / 1024 if has else 0\n", " status = \"OK\" if has else \"MISSING _docling.json\"\n", " print(f\" {jf.name:43s} {j_kb:>6.0f} KB {dl_kb:>8.0f} KB {status}\")\n", " if not has:\n", " missing_docling.append(jf.name)\n", "\n", "if missing_docling:\n", " print(f\"\\nWARNING: {len(missing_docling)} files missing _docling.json:\")\n", " for f in missing_docling:\n", " print(f\" {f}\")\n", "else:\n", " print(f\"\\nAll {len(all_json)} filings have paired _docling.json files.\")\n", " print(\"HybridChunker will work for all documents in Phase 3.\")" ] }, { "cell_type": "code", "execution_count": null, "id": "e495e460-9995-4545-8aac-9ce2742f30eb", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.7" } }, "nbformat": 4, "nbformat_minor": 5 }