{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "title-cell",
   "metadata": {},
   "source": [
    "# SEC Filing Processor\n",
    "## Morningstar RAG Pipeline — Phase 2b\n",
    "\n",
    "This notebook processes **Apple SEC filings** (10-K, 10-Q, 8-K) from raw HTML into structured JSON, following the same pattern as `01_pdf_processing.ipynb` for Morningstar PDFs.\n",
    "\n",
    "### Why a separate processor?\n",
    "\n",
    "| | Morningstar PDFs | SEC HTML filings |\n",
    "|---|---|---|\n",
    "| Input format | PDF | HTML (`.htm`) |\n",
    "| Page numbers | Yes | No — HTML has no pages |\n",
    "| Noise filter | Page-based (remove pages 13-14) | Cover-section boilerplate |\n",
    "| Metadata | company, ticker, doc_type | fiscal_year, accession, filing_date |\n",
    "| Docling model | DocLayNet layout detection | HTML structure parsing |\n",
    "\n",
    "Both use the **same Docling converter** and both save a `_docling.json` — so **HybridChunker in Phase 3 works identically** for PDFs and HTML.\n",
    "\n",
    "### Filings we are processing\n",
    "| Type | Count | Description |\n",
    "|---|---|---|\n",
    "| 10-K | 3 | Annual reports (2023, 2024, 2025) — large |\n",
    "| 10-Q | 6 | Quarterly reports — medium |\n",
    "| 8-K | 5 | Current reports / earnings releases — small |\n",
    "\n",
    "### Steps in this notebook\n",
    "```\n",
    "STEP 1  — Imports & Paths\n",
    "STEP 2  — Configure Docling for HTML\n",
    "STEP 3  — Parse One Filing  (8-K demo — fastest to run)\n",
    "STEP 4  — Inspect Raw Output\n",
    "STEP 5  — Extract Sections + Boilerplate Detection\n",
    "STEP 6  — Extract Tables\n",
    "STEP 7  — Attach Metadata\n",
    "STEP 8  — Save JSON + _docling.json\n",
    "STEP 9  — Batch Process All SEC Filings\n",
    "STEP 10 — Verify Output\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "step1-hdr",
   "metadata": {},
   "source": [
    "## STEP 1 — Imports & Paths"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "step1-code",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Project root  : /home/pushkardeshpand/Documents/Morningstar RAG Pipeline\n",
      "Raw SEC dir   : /home/pushkardeshpand/Documents/Morningstar RAG Pipeline/data/raw/sec_filings/AAPL\n",
      "Output dir    : /home/pushkardeshpand/Documents/Morningstar RAG Pipeline/data/processed/sec_filings/AAPL\n",
      "\n",
      "10-K  (3): ['2023', '2024', '2025']\n",
      "10-Q  (6): ['2024_Q3', '2025_Q1', '2025_Q2', '2025_Q3', '2026_Q1', '2026_Q2']\n",
      "8-K   (5): ['2026-01-02', '2026-01-29', '2026-02-24', '2026-04-20', '2026-04-30']\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "import json\n",
    "import sys\n",
    "import time\n",
    "from pathlib import Path\n",
    "from datetime import datetime, timezone\n",
    "\n",
    "# ── Project paths ──────────────────────────────────────────────────────────────\n",
    "NOTEBOOK_DIR  = Path().resolve()\n",
    "PROJECT_ROOT  = NOTEBOOK_DIR.parent\n",
    "SRC_DIR       = PROJECT_ROOT / \"src\"\n",
    "RAW_SEC_DIR   = PROJECT_ROOT / \"data\" / \"raw\"   / \"sec_filings\" / \"AAPL\"\n",
    "SEC_OUT_DIR   = PROJECT_ROOT / \"data\" / \"processed\" / \"sec_filings\" / \"AAPL\"\n",
    "\n",
    "sys.path.insert(0, str(SRC_DIR))\n",
    "\n",
    "print(f\"Project root  : {PROJECT_ROOT}\")\n",
    "print(f\"Raw SEC dir   : {RAW_SEC_DIR}\")\n",
    "print(f\"Output dir    : {SEC_OUT_DIR}\")\n",
    "print()\n",
    "\n",
    "# Inventory the raw filings\n",
    "for doc_type in [\"10-K\", \"10-Q\", \"8-K\"]:\n",
    "    type_dir = RAW_SEC_DIR / doc_type\n",
    "    if not type_dir.exists():\n",
    "        continue\n",
    "    periods = sorted(p.name for p in type_dir.iterdir() if (p / \"filing.htm\").exists())\n",
    "    print(f\"{doc_type:5s} ({len(periods)}): {periods}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "step2-hdr",
   "metadata": {},
   "source": [
    "## STEP 2 — Configure Docling for HTML\n",
    "\n",
    "Docling handles HTML natively — the same `DocumentConverter` used for PDFs works for `.htm` files.\n",
    "\n",
    "For HTML, Docling:\n",
    "- Parses `<h1>`–`<h6>` tags → `SectionHeaderItem` (heading hierarchy preserved automatically)\n",
    "- Parses `<table>` tags → `TableItem` with structure reconstruction via TableFormer\n",
    "- Parses `<p>`, `<div>` text → `TextItem`\n",
    "- No OCR, no layout detection (not needed for HTML)\n",
    "\n",
    "The resulting `DoclingDocument` object is identical in structure to the one produced from a PDF — so **HybridChunker works on both without any changes**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "step2-code",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Docling converter ready.\n",
      "  Table structure : True\n",
      "  OCR             : False\n",
      "  Picture images  : False\n"
     ]
    }
   ],
   "source": [
    "import warnings\n",
    "warnings.filterwarnings(\"ignore\")\n",
    "\n",
    "from docling.document_converter import DocumentConverter, PdfFormatOption\n",
    "from docling.datamodel.pipeline_options import PdfPipelineOptions\n",
    "from docling.datamodel.base_models import InputFormat\n",
    "\n",
    "opts = PdfPipelineOptions()\n",
    "opts.do_table_structure      = True    # reconstruct <table> rows/columns\n",
    "opts.do_ocr                  = False   # HTML — no OCR needed\n",
    "opts.generate_picture_images = False   # skip figure extraction\n",
    "\n",
    "converter = DocumentConverter(\n",
    "    format_options={\n",
    "        InputFormat.PDF: PdfFormatOption(pipeline_options=opts)\n",
    "    }\n",
    ")\n",
    "\n",
    "print(\"Docling converter ready.\")\n",
    "print(f\"  Table structure : {opts.do_table_structure}\")\n",
    "print(f\"  OCR             : {opts.do_ocr}\")\n",
    "print(f\"  Picture images  : {opts.generate_picture_images}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "step3-hdr",
   "metadata": {},
   "source": [
    "## STEP 3 — Parse One Filing (8-K Demo)\n",
    "\n",
    "We start with a recent 8-K (earnings press release) — it is the smallest filing type and parses quickly.\n",
    "\n",
    "After this walkthrough, STEP 9 will batch-process **all 14 filings** automatically.\n",
    "\n",
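 **Note on processing">
    "> **Note on processing time** (conservative estimates; HTML conversion runs no layout or OCR models, so actual times are often much shorter, and the 8-K below parsed in 0.1 s):\n",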
    "> - 8-K: ~10–30 seconds (small — earnings release text)\n",
    "> - 10-Q: ~1–3 minutes (medium — quarterly financial statements)\n",
    "> - 10-K: ~5–15 minutes per file (large — full annual report with 60+ tables)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "step3-code",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Filing        : filing.htm\n",
      "Size          : 37 KB\n",
      "Form type     : 8-K\n",
      "Filing date   : 2026-04-30\n",
      "Accession     : 0000320193-26-000011\n",
      "\n",
      "Parsing with Docling (this may take 10–30 seconds for an 8-K)...\n",
      "\n",
      "Parsing complete in 0.1s\n",
      "Document type : DoclingDocument\n",
      "Tables found  : 14\n"
     ]
    }
   ],
   "source": [
    "# Use the most recent 8-K as the demo filing\n",
    "HTM_PATH = RAW_SEC_DIR / \"8-K\" / \"2026-04-30\" / \"filing.htm\"\n",
    "\n",
    "with open(RAW_SEC_DIR / \"8-K\" / \"2026-04-30\" / \"metadata.json\") as f:\n",
    "    file_meta = json.load(f)\n",
    "\n",
    "print(f\"Filing        : {HTM_PATH.name}\")\n",
    "print(f\"Size          : {HTM_PATH.stat().st_size / 1024:.0f} KB\")\n",
    "print(f\"Form type     : {file_meta['form']}\")\n",
    "print(f\"Filing date   : {file_meta['filing_date']}\")\n",
    "print(f\"Accession     : {file_meta['accession']}\")\n",
    "print()\n",
    "print(\"Parsing with Docling (this may take 10–30 seconds for an 8-K)...\")\n",
    "\n",
    "start  = time.time()\n",
    "result = converter.convert(str(HTM_PATH))\n",
    "doc    = result.document\n",
    "elapsed = time.time() - start\n",
    "\n",
    "print(f\"\\nParsing complete in {elapsed:.1f}s\")\n",
    "print(f\"Document type : {type(doc).__name__}\")\n",
    "print(f\"Tables found  : {len(doc.tables)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "step4-hdr",
   "metadata": {},
   "source": [
    "## STEP 4 — Inspect Raw Output\n",
    "\n",
    "Let's see what element types Docling produced from the HTML."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "step4-code",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Element types in document:\n",
      "  TextItem                        36\n",
      "  TableItem                       14\n",
      "  PictureItem                     1\n",
      "\n",
      "First 12 elements:\n",
      "----------------------------------------------------------------------\n",
      "  [TableItem           ]  lvl=1  \"\"\n",
      "  [TextItem            ]  lvl=1  \"UNITED STATES\"\n",
      "  [TextItem            ]  lvl=1  \"SECURITIES AND EXCHANGE COMMISSION\"\n",
      "  [TextItem            ]  lvl=1  \"Washington, D.C. 20549\"\n",
      "  [TableItem           ]  lvl=1  \"\"\n",
      "  [TextItem            ]  lvl=1  \"FORM 8-K\"\n",
      "  [TextItem            ]  lvl=1  \"CURRENT REPORT\"\n",
      "  [TextItem            ]  lvl=1  \"Pursuant to Section 13 OR 15(d) of The Securities Exchange Act of 1934\"\n",
      "  [TextItem            ]  lvl=1  \"April 30, 2026\"\n",
      "  [TextItem            ]  lvl=1  \"Date of Report (Date of earliest event reported)\"\n",
      "  [TableItem           ]  lvl=1  \"\"\n",
      "  [TextItem            ]  lvl=1  \"g325078g0426062022046a24.jpg\"\n"
     ]
    }
   ],
   "source": [
    "from docling.datamodel.document import TextItem, SectionHeaderItem, TableItem\n",
    "\n",
    "# Count element types\n",
    "type_counts = {}\n",
    "for item, level in doc.iterate_items():\n",
    "    t = type(item).__name__\n",
    "    type_counts[t] = type_counts.get(t, 0) + 1\n",
    "\n",
    "print(\"Element types in document:\")\n",
    "for t, n in sorted(type_counts.items(), key=lambda x: -x[1]):\n",
    "    print(f\"  {t:30s}  {n}\")\n",
    "\n",
    "print()\n",
    "print(\"First 12 elements:\")\n",
    "print(\"-\" * 70)\n",
    "for i, (item, level) in enumerate(doc.iterate_items()):\n",
    "    if i >= 12:\n",
    "        break\n",
    "    text  = getattr(item, \"text\", \"\")[:80]\n",
    "    itype = type(item).__name__\n",
    "    print(f\"  [{itype:20s}]  lvl={level}  \\\"{text}\\\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "step5-hdr",
   "metadata": {},
   "source": [
    "## STEP 5 — Extract Sections + Boilerplate Detection\n",
    "\n",
    "SEC filings begin with a **cover section** containing administrative metadata — not financial content.\n",
    "\n",
    "This boilerplate includes:\n",
    "- `UNITED STATES` / `SECURITIES AND EXCHANGE COMMISSION`\n",
    "- `Commission File Number: 001-36743`\n",
    "- `(Exact name of Registrant as specified in its charter)`\n",
    "- Form checkboxes: `☒ ANNUAL REPORT PURSUANT TO SECTION 13(d)...`\n",
    "\n",
    "We tag these with `is_boilerplate: True` so downstream processes can identify them. We do **not** remove them from the `_docling.json` — HybridChunker handles the structure.\n",
    "\n",
    "> Unlike Morningstar PDFs where we removed entire pages, here we flag individual fragments because there are no page boundaries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "step5-code",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Sections extracted : 36\n",
      "  Boilerplate      : 7  (cover-section administrative text)\n",
      "  Content          : 29\n",
      "  Headers          : 0\n",
      "\n",
      "Section headers found:\n",
      "\n",
      "First 5 content sections:\n",
      "  [text  ] parent=''\n",
      "           text='CURRENT REPORT'\n",
      "  [text  ] parent=''\n",
      "           text='Pursuant to Section 13 OR 15(d) of The Securities Exchange Act of 1934'\n",
      "  [text  ] parent=''\n",
      "           text='April 30, 2026'\n",
      "  [text  ] parent=''\n",
      "           text='Date of Report (Date of earliest event reported)'\n",
      "  [text  ] parent=''\n",
      "           text='g325078g0426062022046a24.jpg'\n"
     ]
    }
   ],
   "source": [
    "_BOILERPLATE_EXACT = {\n",
    "    \"united states\",\n",
    "    \"securities and exchange commission\",\n",
    "    \"washington, d.c. 20549\",\n",
    "    \"(mark one)\",\n",
    "    \"or\",\n",
    "    \"for the transition period from to .\",\n",
    "    \"\\u2612\", \"\\u2610\",\n",
    "}\n",
    "\n",
    "_BOILERPLATE_RE = re.compile(\n",
    "    r\"^(\"\n",
    "    r\"form \\d+[\\-/][a-z]+\"\n",
    "    r\"|commission file\"\n",
    "    r\"|irs employer\"\n",
    "    r\"|state or other\"\n",
    "    r\"|jurisdiction\"\n",
    "    r\"|\\(exact name\"\n",
    "    r\"|\\(zip code\"\n",
    "    r\"|indicate by check\"\n",
    "    r\"|securities registered\"\n",
    "    r\"|aggregate market value\"\n",
    "    r\"|number of shares\"\n",
    "    r\"|\\u2612|\\u2610\"\n",
    "    r\")\",\n",
    "    re.IGNORECASE,\n",
    ")\n",
    "\n",
    "def _is_boilerplate(text: str) -> bool:\n",
    "    t = text.strip().lower()\n",
    "    if t in _BOILERPLATE_EXACT: return True\n",
    "    if len(t) < 5:              return True\n",
    "    if _BOILERPLATE_RE.match(text.strip()): return True\n",
    "    return False\n",
    "\n",
    "def clean_text(text: str) -> str:\n",
    "    if not text: return \"\"\n",
    "    text = text.replace(\"\\u00ad\", \"\").replace(\"\\u200b\", \"\")\n",
    "    text = re.sub(r\"[ \\t]+\", \" \", text)\n",
    "    text = re.sub(r\"\\n{3,}\", \"\\n\\n\", text)\n",
    "    return text.strip()\n",
    "\n",
    "\n",
    "# Extract all sections\n",
    "sections       = []\n",
    "current_header = \"\"\n",
    "\n",
    "for item, level in doc.iterate_items():\n",
    "    text = getattr(item, \"text\", None)\n",
    "    if not text or not text.strip():\n",
    "        continue\n",
    "    if isinstance(item, TableItem):\n",
    "        continue\n",
    "\n",
    "    raw    = text.strip()\n",
    "    is_hdr = isinstance(item, SectionHeaderItem)\n",
    "\n",
    "    sections.append({\n",
    "        \"type\"          : \"header\" if is_hdr else \"text\",\n",
    "        \"level\"         : level,\n",
    "        \"text\"          : raw,\n",
    "        \"cleaned_text\"  : clean_text(raw),\n",
    "        \"page_num\"      : None,\n",
    "        \"parent_header\" : current_header,\n",
    "        \"is_boilerplate\": _is_boilerplate(raw),\n",
    "    })\n",
    "\n",
    "    if is_hdr:\n",
    "        current_header = raw\n",
    "\n",
    "boilerplate = [s for s in sections if s[\"is_boilerplate\"]]\n",
    "content     = [s for s in sections if not s[\"is_boilerplate\"]]\n",
    "headers     = [s for s in sections if s[\"type\"] == \"header\"]\n",
    "\n",
    "print(f\"Sections extracted : {len(sections)}\")\n",
    "print(f\"  Boilerplate      : {len(boilerplate)}  (cover-section administrative text)\")\n",
    "print(f\"  Content          : {len(content)}\")\n",
    "print(f\"  Headers          : {len(headers)}\")\n",
    "print()\n",
    "print(\"Section headers found:\")\n",
    "for h in headers:\n",
    "    print(f\"  [H{h['level']}] {h['text'][:80]}\")\n",
    "print()\n",
    "print(\"First 5 content sections:\")\n",
    "for s in content[:5]:\n",
    "    print(f\"  [{s['type']:6s}] parent='{s['parent_header'][:40]}'\")\n",
    "    print(f\"           text='{s['text'][:100]}'\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "step6-hdr",
   "metadata": {},
   "source": [
    "## STEP 6 — Extract Tables\n",
    "\n",
    "SEC filings contain financial statement tables — balance sheets, income statements, cash flow statements.\n",
    "\n",
    "Same approach as Morningstar PDFs:\n",
    "- `export_to_dataframe(doc)` → structured data\n",
    "- `export_to_markdown(doc)` → markdown for LLM context\n",
    "- `is_atomic = True` → never split these in chunking"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "step6-code",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Tables extracted: 14\n",
      "\n",
      "  #  Rows  Cols  Headers\n",
      "------------------------------------------------------------\n",
      "  0     2     3  ['0', '1', '2']\n",
      "  1     2     3  ['0', '1', '2']\n",
      "  2     2     3  ['0', '1', '2']\n",
      "  3     3    15  ['0', '1', '2']\n",
      "  4     2     6  ['0', '1', '2']\n",
      "  5     2     6  ['0', '1', '2']\n",
      "  6     2     6  ['0', '1', '2']\n",
      "  7     2     6  ['0', '1', '2']\n",
      "  8     2     3  ['0', '1', '2']\n",
      "  9     9     9  ['0', '1', '2']\n",
      " 10     2     3  ['0', '1', '2']\n",
      " 11     2     3  ['0', '1', '2']\n",
      " 12     5     9  ['0', '1', '2']\n",
      " 13     6    18  ['0', '1', '2']\n",
      "\n",
      "First table preview (markdown format — what LLM receives):\n",
      "=======================================================\n",
      "|       |       |       |                |                |                |    |    |    |            |            |            |            |            |            |                                                |                  |                  |\n",
      "|-------|-------|-------|----------------|----------------|----------------|----|----|----|------------|------------|------------|------------|------------|------------|------------------------------------------------|------------------|------------------|\n",
      "| Date: | Date: | Date: | April 30, 2026 | April 30, 2026 | April 30, 2026 |    |    |    | Apple Inc. | Apple Inc. | Apple Inc. | Apple Inc. | Apple Inc. | Apple Inc. | Apple Inc.                                     | Apple Inc.       | Apple Inc.       |\n",
      "|       |       |       |                |                |                |    |    |    |            |            |            |            |            |            |                                                |           \n",
      "=======================================================\n"
     ]
    }
   ],
   "source": [
    "tables = []\n",
    "\n",
    "for i, table in enumerate(doc.tables):\n",
    "    try:\n",
    "        df       = table.export_to_dataframe(doc)\n",
    "        markdown = table.export_to_markdown(doc)\n",
    "\n",
    "        if df.empty or len(df) < 2:\n",
    "            continue\n",
    "\n",
    "        tables.append({\n",
    "            \"index\"    : i,\n",
    "            \"page_num\" : None,\n",
    "            \"markdown\" : markdown,\n",
    "            \"headers\"  : list(df.columns.astype(str)),\n",
    "            \"rows\"     : len(df),\n",
    "            \"cols\"     : len(df.columns),\n",
    "            \"data\"     : df.fillna(\"\").values.tolist(),\n",
    "            \"is_atomic\": True,\n",
    "        })\n",
    "    except Exception as e:\n",
    "        print(f\"  Warning: table {i} skipped — {e}\")\n",
    "\n",
    "print(f\"Tables extracted: {len(tables)}\")\n",
    "print()\n",
    "print(f\"{'#':>3}  {'Rows':>4}  {'Cols':>4}  Headers\")\n",
    "print(\"-\" * 60)\n",
    "for t in tables:\n",
    "    print(f\"{t['index']:>3}  {t['rows']:>4}  {t['cols']:>4}  {t['headers'][:3]}\")\n",
    "\n",
    "if tables:\n",
    "    print()\n",
    "    print(\"First table preview (markdown format — what LLM receives):\")\n",
    "    print(\"=\" * 55)\n",
    "    print(tables[13][\"markdown\"][:1000])\n",
    "    print(\"=\" * 55)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "step7-hdr",
   "metadata": {},
   "source": [
    "## STEP 7 — Attach Metadata\n",
    "\n",
    "Every chunk from this document will carry metadata that enables filtered retrieval:\n",
    "\n",
    "```python\n",
    "# Retrieve only Apple 10-K chunks\n",
    "vectorstore.query(query, filter={\"doc_type\": \"10-K\", \"ticker\": \"AAPL\"})\n",
    "\n",
    "# Retrieve only the 2024 annual report\n",
    "vectorstore.query(query, filter={\"doc_type\": \"10-K\", \"fiscal_year\": \"2024\"})\n",
    "\n",
    "# Retrieve all Apple financial data (any doc type)\n",
    "vectorstore.query(query, filter={\"ticker\": \"AAPL\"})\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "step7-code",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Document metadata:\n",
      "  source              : sec_edgar\n",
      "  doc_type            : 8-K\n",
      "  ticker              : AAPL\n",
      "  company             : Apple Inc.\n",
      "  fiscal_year         : 2026\n",
      "  filing_date         : 2026-04-30\n",
      "  accession           : 0000320193-26-000011\n",
      "  file_name           : filing.htm\n",
      "  file_path           : /home/pushkardeshpand/Documents/Morningstar RAG Pipeline/data/raw/sec_filings/AAPL/8-K/2026-04-30/filing.htm\n",
      "  license             : public\n",
      "  access_level        : public\n",
      "  parsed_at           : 2026-06-22T08:41:12.684682+00:00\n",
      "  parser              : docling\n",
      "  total_pages         : 0\n",
      "  total_sections      : 36\n",
      "  total_tables        : 14\n",
      "  removed_pages       : []\n"
     ]
    }
   ],
   "source": [
    "doc_metadata = {\n",
    "    \"source\"      : \"sec_edgar\",\n",
    "    \"doc_type\"    : file_meta[\"form\"],\n",
    "    \"ticker\"      : file_meta[\"ticker\"],\n",
    "    \"company\"     : \"Apple Inc.\",\n",
    "    \"fiscal_year\" : file_meta[\"fiscal_year\"],\n",
    "    \"filing_date\" : file_meta[\"filing_date\"],\n",
    "    \"accession\"   : file_meta[\"accession\"],\n",
    "    \"file_name\"   : HTM_PATH.name,\n",
    "    \"file_path\"   : str(HTM_PATH),\n",
    "    \"license\"     : \"public\",\n",
    "    \"access_level\": \"public\",\n",
    "    \"parsed_at\"   : datetime.now(timezone.utc).isoformat(),\n",
    "    \"parser\"      : \"docling\",\n",
    "    \"total_pages\"    : 0,\n",
    "    \"total_sections\" : len(sections),\n",
    "    \"total_tables\"   : len(tables),\n",
    "    \"removed_pages\"  : [],   # HTML has no page numbers\n",
    "}\n",
    "\n",
    "print(\"Document metadata:\")\n",
    "for k, v in doc_metadata.items():\n",
    "    print(f\"  {k:20s}: {v}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "step8-hdr",
   "metadata": {},
   "source": [
    "## STEP 8 — Save JSON + `_docling.json`\n",
    "\n",
    "We save two files — same pattern as `pdf_processor.py`:\n",
    "\n",
    "| File | Purpose |\n",
    "|---|---|\n",
    "| `8-K_2026-04-30.json` | Structured JSON — sections, tables, metadata for inspection |\n",
    "| `8-K_2026-04-30_docling.json` | Native DoclingDocument — loaded by HybridChunker in Phase 3 |\n",
    "\n",
    "The `_docling.json` is what enables HybridChunker to understand the full HTML heading hierarchy — headings, reading order, table positions — rather than working on a flat text list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "step8-code",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Saved JSON     : 8-K_2026-04-30.json  (36.3 KB)\n",
      "Saved _docling : 8-K_2026-04-30_docling.json  (142.3 KB)\n",
      "\n",
      "What HybridChunker will see when it loads _docling.json:\n",
      "  Full document heading hierarchy preserved\n",
      "  36 text sections\n",
      "  14 tables (will be handled atomically)\n",
      "  All HTML <h1>-<h6> tags → SectionHeaderItem → heading_path metadata\n"
     ]
    }
   ],
   "source": [
    "DOC_STEM = f\"{file_meta['form']}_{file_meta['filing_date']}\"\n",
    "\n",
    "SEC_OUT_DIR.mkdir(parents=True, exist_ok=True)\n",
    "\n",
    "out_path     = SEC_OUT_DIR / f\"{DOC_STEM}.json\"\n",
    "docling_path = SEC_OUT_DIR / f\"{DOC_STEM}_docling.json\"\n",
    "\n",
    "parsed = {\n",
    "    \"metadata\" : doc_metadata,\n",
    "    \"sections\" : sections,\n",
    "    \"tables\"   : tables,\n",
    "}\n",
    "\n",
    "# Save structured JSON\n",
    "with open(out_path, \"w\") as f:\n",
    "    json.dump(parsed, f, indent=2, ensure_ascii=False, default=str)\n",
    "print(f\"Saved JSON     : {out_path.name}  ({out_path.stat().st_size/1024:.1f} KB)\")\n",
    "\n",
    "# Save native DoclingDocument (required for HybridChunker)\n",
    "with open(docling_path, \"w\") as f:\n",
    "    f.write(doc.model_dump_json())\n",
    "print(f\"Saved _docling : {docling_path.name}  ({docling_path.stat().st_size/1024:.1f} KB)\")\n",
    "\n",
    "print()\n",
    "print(\"What HybridChunker will see when it loads _docling.json:\")\n",
    "print(f\"  Full document heading hierarchy preserved\")\n",
    "print(f\"  {len(sections)} text sections\")\n",
    "print(f\"  {len(tables)} tables (will be handled atomically)\")\n",
    "print(f\"  All HTML <h1>-<h6> tags → SectionHeaderItem → heading_path metadata\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "step9-hdr",
   "metadata": {},
   "source": [
    "## STEP 9 — Batch Process All SEC Filings\n",
    "\n",
    "Now use the `SECProcessor` class from `src/sec_processor.py` to process all 14 filings.\n",
    "\n",
    "> **Expected processing time:**\n",
    "> - 8-K (×5)  : ~10–30s each  → ~2 min total\n",
    "> - 10-Q (×6) : ~1–3 min each → ~10 min total\n",
    "> - 10-K (×3) : ~5–15 min each → ~30 min total\n",
    ">\n",
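 Total:">
    "> Total: up to **~45 minutes** on CPU, since the batch runs sequentially. These are conservative estimates: in the run logged below, each 10-K finished in under 30 seconds and the 10-Qs shown took under 10 seconds each.\n",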
    "> Already-processed files are skipped automatically (`force=False`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "step9-code",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2026-06-22 14:17:09,896  INFO      \n",
      "── 10-K filings ────────────────────────────\n",
      "2026-06-22 14:17:09,897  INFO      Processing: 10-K_2023  (filing.htm)\n",
      "2026-06-22 14:17:09,920  INFO      Docling converter ready.\n",
      "2026-06-22 14:17:09,923  INFO      detected formats: [<InputFormat.HTML: 'html'>]\n",
      "2026-06-22 14:17:10,112  INFO      Going to convert document batch...\n",
      "2026-06-22 14:17:10,112  INFO      Initializing pipeline for SimplePipeline with options hash 7d306d2d021deac65a97d1a5f925362a\n",
      "2026-06-22 14:17:10,113  INFO      Processing document filing.htm\n",
      "2026-06-22 14:17:11,163  INFO      Finished converting document filing.htm in 1.24 sec.\n",
      "2026-06-22 14:17:35,599  INFO        Saved JSON     : 10-K_2023.json  (1213.1 KB)\n",
      "2026-06-22 14:17:35,707  INFO        Saved _docling : 10-K_2023_docling.json  (5302.1 KB)\n",
      "2026-06-22 14:17:35,707  INFO        Sections: 1086  (boilerplate: 69)  Tables: 66\n",
      "2026-06-22 14:17:35,717  INFO      Processing: 10-K_2024  (filing.htm)\n",
      "2026-06-22 14:17:35,720  INFO      detected formats: [<InputFormat.HTML: 'html'>]\n",
      "2026-06-22 14:17:35,888  INFO      Going to convert document batch...\n",
      "2026-06-22 14:17:35,888  INFO      Processing document filing.htm\n",
      "2026-06-22 14:17:36,773  INFO      Finished converting document filing.htm in 1.06 sec.\n",
      "2026-06-22 14:17:59,804  INFO        Saved JSON     : 10-K_2024.json  (1196.1 KB)\n",
      "2026-06-22 14:17:59,920  INFO        Saved _docling : 10-K_2024_docling.json  (5167.8 KB)\n",
      "2026-06-22 14:17:59,921  INFO        Sections: 1077  (boilerplate: 70)  Tables: 63\n",
      "2026-06-22 14:17:59,928  INFO      Processing: 10-K_2025  (filing.htm)\n",
      "2026-06-22 14:17:59,944  INFO      detected formats: [<InputFormat.HTML: 'html'>]\n",
      "2026-06-22 14:18:00,130  INFO      Going to convert document batch...\n",
      "2026-06-22 14:18:00,131  INFO      Processing document filing.htm\n",
      "2026-06-22 14:18:01,064  INFO      Finished converting document filing.htm in 1.14 sec.\n",
      "2026-06-22 14:18:28,078  INFO        Saved JSON     : 10-K_2025.json  (1204.9 KB)\n",
      "2026-06-22 14:18:28,177  INFO        Saved _docling : 10-K_2025_docling.json  (5242.1 KB)\n",
      "2026-06-22 14:18:28,178  INFO        Sections: 1107  (boilerplate: 70)  Tables: 62\n",
      "2026-06-22 14:18:28,182  INFO      \n",
      "── 10-Q filings ────────────────────────────\n",
      "2026-06-22 14:18:28,183  INFO      Processing: 10-Q_2024_Q3  (filing.htm)\n",
      "2026-06-22 14:18:28,184  INFO      detected formats: [<InputFormat.HTML: 'html'>]\n",
      "2026-06-22 14:18:28,332  INFO      Going to convert document batch...\n",
      "2026-06-22 14:18:28,333  INFO      Processing document filing.htm\n",
      "2026-06-22 14:18:28,998  INFO      Finished converting document filing.htm in 0.82 sec.\n",
      "2026-06-22 14:18:37,051  INFO        Saved JSON     : 10-Q_2024_Q3.json  (597.8 KB)\n",
      "2026-06-22 14:18:37,108  INFO        Saved _docling : 10-Q_2024_Q3_docling.json  (3212.1 KB)\n",
      "2026-06-22 14:18:37,109  INFO        Sections: 512  (boilerplate: 32)  Tables: 39\n",
      "2026-06-22 14:18:37,111  INFO      Processing: 10-Q_2025_Q1  (filing.htm)\n",
      "2026-06-22 14:18:37,114  INFO      detected formats: [<InputFormat.HTML: 'html'>]\n",
      "2026-06-22 14:18:37,201  INFO      Going to convert document batch...\n",
      "2026-06-22 14:18:37,202  INFO      Processing document filing.htm\n",
      "2026-06-22 14:18:37,563  INFO      Finished converting document filing.htm in 0.45 sec.\n",
      "2026-06-22 14:18:42,913  INFO        Saved JSON     : 10-Q_2025_Q1.json  (564.1 KB)\n",
      "2026-06-22 14:18:42,964  INFO        Saved _docling : 10-Q_2025_Q1_docling.json  (3025.9 KB)\n",
      "2026-06-22 14:18:42,964  INFO        Sections: 460  (boilerplate: 32)  Tables: 37\n",
      "2026-06-22 14:18:42,967  INFO      Processing: 10-Q_2025_Q2  (filing.htm)\n",
      "2026-06-22 14:18:42,969  INFO      detected formats: [<InputFormat.HTML: 'html'>]\n",
      "2026-06-22 14:18:43,076  INFO      Going to convert document batch...\n",
      "2026-06-22 14:18:43,076  INFO      Processing document filing.htm\n",
      "2026-06-22 14:18:43,498  INFO      Finished converting document filing.htm in 0.53 sec.\n",
      "2026-06-22 14:18:50,899  INFO        Saved JSON     : 10-Q_2025_Q2.json  (615.4 KB)\n",
      "2026-06-22 14:18:50,952  INFO        Saved _docling : 10-Q_2025_Q2_docling.json  (3159.3 KB)\n",
      "2026-06-22 14:18:50,952  INFO        Sections: 513  (boilerplate: 32)  Tables: 37\n",
      "2026-06-22 14:18:50,954  INFO      Processing: 10-Q_2025_Q3  (filing.htm)\n",
      "2026-06-22 14:18:50,956  INFO      detected formats: [<InputFormat.HTML: 'html'>]\n",
      "2026-06-22 14:18:51,062  INFO      Going to convert document batch...\n",
      "2026-06-22 14:18:51,062  INFO      Processing document filing.htm\n",
      "2026-06-22 14:18:51,489  INFO      Finished converting document filing.htm in 0.53 sec.\n",
      "2026-06-22 14:18:59,275  INFO        Saved JSON     : 10-Q_2025_Q3.json  (601.6 KB)\n",
      "2026-06-22 14:18:59,327  INFO        Saved _docling : 10-Q_2025_Q3_docling.json  (3183.1 KB)\n",
      "2026-06-22 14:18:59,327  INFO        Sections: 506  (boilerplate: 32)  Tables: 38\n",
      "2026-06-22 14:18:59,330  INFO      Processing: 10-Q_2026_Q1  (filing.htm)\n",
      "2026-06-22 14:18:59,332  INFO      detected formats: [<InputFormat.HTML: 'html'>]\n",
      "2026-06-22 14:18:59,422  INFO      Going to convert document batch...\n",
      "2026-06-22 14:18:59,423  INFO      Processing document filing.htm\n",
      "2026-06-22 14:18:59,799  INFO      Finished converting document filing.htm in 0.47 sec.\n",
      "2026-06-22 14:19:05,928  INFO        Saved JSON     : 10-Q_2026_Q1.json  (602.8 KB)\n",
      "2026-06-22 14:19:05,991  INFO        Saved _docling : 10-Q_2026_Q1_docling.json  (3345.3 KB)\n",
      "2026-06-22 14:19:05,992  INFO        Sections: 474  (boilerplate: 32)  Tables: 38\n",
      "2026-06-22 14:19:05,994  INFO      Processing: 10-Q_2026_Q2  (filing.htm)\n",
      "2026-06-22 14:19:05,997  INFO      detected formats: [<InputFormat.HTML: 'html'>]\n",
      "2026-06-22 14:19:06,115  INFO      Going to convert document batch...\n",
      "2026-06-22 14:19:06,116  INFO      Processing document filing.htm\n",
      "2026-06-22 14:19:06,581  INFO      Finished converting document filing.htm in 0.59 sec.\n",
      "2026-06-22 14:19:17,686  INFO        Saved JSON     : 10-Q_2026_Q2.json  (733.2 KB)\n",
      "2026-06-22 14:19:17,759  INFO        Saved _docling : 10-Q_2026_Q2_docling.json  (3608.9 KB)\n",
      "2026-06-22 14:19:17,760  INFO        Sections: 610  (boilerplate: 32)  Tables: 39\n",
      "2026-06-22 14:19:17,763  INFO      \n",
      "── 8-K filings ────────────────────────────\n",
      "2026-06-22 14:19:17,764  INFO      Processing: 8-K_2026-01-02  (filing.htm)\n",
      "2026-06-22 14:19:17,764  INFO      detected formats: [<InputFormat.HTML: 'html'>]\n",
      "2026-06-22 14:19:17,775  INFO      Going to convert document batch...\n",
      "2026-06-22 14:19:17,775  INFO      Processing document filing.htm\n",
      "2026-06-22 14:19:17,809  INFO      Finished converting document filing.htm in 0.05 sec.\n",
      "2026-06-22 14:19:17,890  INFO        Saved JSON     : 8-K_2026-01-02.json  (22.8 KB)\n",
      "2026-06-22 14:19:17,892  INFO        Saved _docling : 8-K_2026-01-02_docling.json  (60.0 KB)\n",
      "2026-06-22 14:19:17,892  INFO        Sections: 73  (boilerplate: 24)  Tables: 2\n",
      "2026-06-22 14:19:17,893  INFO      Processing: 8-K_2026-01-29  (filing.htm)\n",
      "2026-06-22 14:19:17,894  INFO      detected formats: [<InputFormat.HTML: 'html'>]\n",
      "2026-06-22 14:19:17,903  INFO      Going to convert document batch...\n",
      "2026-06-22 14:19:17,903  INFO      Processing document filing.htm\n",
      "2026-06-22 14:19:17,928  INFO      Finished converting document filing.htm in 0.04 sec.\n",
      "2026-06-22 14:19:17,973  INFO        Saved JSON     : 8-K_2026-01-29.json  (36.4 KB)\n",
      "2026-06-22 14:19:17,976  INFO        Saved _docling : 8-K_2026-01-29_docling.json  (142.3 KB)\n",
      "2026-06-22 14:19:17,977  INFO        Sections: 36  (boilerplate: 7)  Tables: 14\n",
      "2026-06-22 14:19:17,978  INFO      Processing: 8-K_2026-02-24  (filing.htm)\n",
      "2026-06-22 14:19:17,979  INFO      detected formats: [<InputFormat.HTML: 'html'>]\n",
      "2026-06-22 14:19:17,994  INFO      Going to convert document batch...\n",
      "2026-06-22 14:19:17,995  INFO      Processing document filing.htm\n",
      "2026-06-22 14:19:18,046  INFO      Finished converting document filing.htm in 0.07 sec.\n",
      "2026-06-22 14:19:18,170  INFO        Saved JSON     : 8-K_2026-02-24.json  (30.7 KB)\n",
      "2026-06-22 14:19:18,175  INFO        Saved _docling : 8-K_2026-02-24_docling.json  (132.1 KB)\n",
      "2026-06-22 14:19:18,176  INFO        Sections: 76  (boilerplate: 24)  Tables: 8\n",
      "2026-06-22 14:19:18,179  INFO      Processing: 8-K_2026-04-20  (filing.htm)\n",
      "2026-06-22 14:19:18,180  INFO      detected formats: [<InputFormat.HTML: 'html'>]\n",
      "2026-06-22 14:19:18,188  INFO      Going to convert document batch...\n",
      "2026-06-22 14:19:18,189  INFO      Processing document filing.htm\n",
      "2026-06-22 14:19:18,244  INFO      Finished converting document filing.htm in 0.06 sec.\n",
      "2026-06-22 14:19:18,322  INFO        Saved JSON     : 8-K_2026-04-20.json  (24.0 KB)\n",
      "2026-06-22 14:19:18,325  INFO        Saved _docling : 8-K_2026-04-20_docling.json  (57.9 KB)\n",
      "2026-06-22 14:19:18,325  INFO        Sections: 73  (boilerplate: 24)  Tables: 2\n",
      "2026-06-22 14:19:18,331  INFO      SKIP 8-K_2026-04-30  (already processed → 8-K_2026-04-30.json)\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "============================================================\n",
      "Batch complete in 2.1 minutes\n",
      "============================================================\n",
      "\n",
      "  Filing                                    Sections  Tables\n",
      "------------------------------------------------------------\n",
      "  10-K_2023-11-03                               1086      66\n",
      "  10-K_2024-11-01                               1077      63\n",
      "  10-K_2025-10-31                               1107      62\n",
      "  10-Q_2024-08-02                                512      39\n",
      "  10-Q_2025-01-31                                460      37\n",
      "  10-Q_2025-05-02                                513      37\n",
      "  10-Q_2025-08-01                                506      38\n",
      "  10-Q_2026-01-30                                474      38\n",
      "  10-Q_2026-05-01                                610      39\n",
      "  8-K_2026-01-02                                  73       2\n",
      "  8-K_2026-01-29                                  36      14\n",
      "  8-K_2026-02-24                                  76       8\n",
      "  8-K_2026-04-20                                  73       2\n",
      "  8-K_2026-04-30                                  36      14\n",
      "\n",
      "  Total sections : 6639\n",
      "  Total tables   : 459\n"
     ]
    }
   ],
   "source": [
    "from sec_processor import SECProcessor\n",
    "\n",
    "processor = SECProcessor(output_dir=SEC_OUT_DIR)\n",
    "\n",
    "batch_start = time.time()\n",
    "# force=False reuses existing outputs instead of reprocessing them (logged as SKIP)\n",
    "results = processor.process_all(raw_dir=RAW_SEC_DIR, force=False)\n",
    "batch_elapsed = time.time() - batch_start\n",
    "\n",
    "print(f\"\\n{'='*60}\")\n",
    "print(f\"Batch complete in {batch_elapsed/60:.1f} minutes\")\n",
    "print(f\"{'='*60}\")\n",
    "print(f\"\\n  {'Filing':40s}  {'Sections':>8}  {'Tables':>6}\")\n",
    "print(\"-\" * 60)\n",
    "for r in results:\n",
    "    m = r[\"metadata\"]\n",
    "    stem = f\"{m['doc_type']}_{m['filing_date'] or m['fiscal_year']}\"\n",
    "    print(f\"  {stem:40s}  {m['total_sections']:>8}  {m['total_tables']:>6}\")\n",
    "\n",
    "print(f\"\\n  Total sections : {sum(r['metadata']['total_sections'] for r in results)}\")\n",
    "print(f\"  Total tables   : {sum(r['metadata']['total_tables']   for r in results)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "step10-hdr",
   "metadata": {},
   "source": [
    "## STEP 10 — Verify Output\n",
    "\n",
    "Confirm that every processed `.json` file has a paired `_docling.json` in the output directory. HybridChunker in Phase 3 skips any filing whose `_docling.json` is missing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "step10-code",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Processed JSONs : 14\n",
      "_docling.json   : 14\n",
      "\n",
      "File                                               JSON    _docling  Status\n",
      "--------------------------------------------------------------------------------\n",
      "  10-K_2023.json                                 1213 KB      5302 KB  OK\n",
      "  10-K_2024.json                                 1196 KB      5168 KB  OK\n",
      "  10-K_2025.json                                 1205 KB      5242 KB  OK\n",
      "  10-Q_2024_Q3.json                               598 KB      3212 KB  OK\n",
      "  10-Q_2025_Q1.json                               564 KB      3026 KB  OK\n",
      "  10-Q_2025_Q2.json                               615 KB      3159 KB  OK\n",
      "  10-Q_2025_Q3.json                               602 KB      3183 KB  OK\n",
      "  10-Q_2026_Q1.json                               603 KB      3345 KB  OK\n",
      "  10-Q_2026_Q2.json                               733 KB      3609 KB  OK\n",
      "  8-K_2026-01-02.json                              23 KB        60 KB  OK\n",
      "  8-K_2026-01-29.json                              36 KB       142 KB  OK\n",
      "  8-K_2026-02-24.json                              31 KB       132 KB  OK\n",
      "  8-K_2026-04-20.json                              24 KB        58 KB  OK\n",
      "  8-K_2026-04-30.json                              36 KB       142 KB  OK\n",
      "\n",
      "All 14 filings have paired _docling.json files.\n",
      "HybridChunker will work for all documents in Phase 3.\n"
     ]
    }
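   ],
   "source": [
    "# Primary outputs: every .json that is not itself a _docling.json companion\n",
    "all_json     = sorted(f for f in SEC_OUT_DIR.glob(\"*.json\") if not f.name.endswith(\"_docling.json\"))\n",
    "# Companion files consumed by HybridChunker in Phase 3\n",
    "all_docling  = sorted(SEC_OUT_DIR.glob(\"*_docling.json\"))\n",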
    "\n",
    "print(f\"Processed JSONs : {len(all_json)}\")\n",
    "print(f\"_docling.json   : {len(all_docling)}\")\n",
    "print()\n",
    "\n",
    "missing_docling = []\n",
    "print(f\"{'File':45s}  {'JSON':>8}  {'_docling':>10}  Status\")\n",
    "print(\"-\" * 80)\n",
    "for jf in all_json:\n",
    "    dl  = jf.with_name(jf.stem + \"_docling.json\")\n",
    "    has = dl.exists()\n",
    "    j_kb  = jf.stat().st_size / 1024\n",
    "    dl_kb = dl.stat().st_size / 1024 if has else 0\n",
    "    status = \"OK\" if has else \"MISSING _docling.json\"\n",
    "    print(f\"  {jf.name:43s}  {j_kb:>6.0f} KB  {dl_kb:>8.0f} KB  {status}\")\n",
    "    if not has:\n",
    "        missing_docling.append(jf.name)\n",
    "\n",
    "if missing_docling:\n",
    "    print(f\"\\nWARNING: {len(missing_docling)} files missing _docling.json:\")\n",
    "    for f in missing_docling:\n",
    "        print(f\"  {f}\")\n",
    "else:\n",
    "    print(f\"\\nAll {len(all_json)} filings have paired _docling.json files.\")\n",
    "    print(\"HybridChunker will work for all documents in Phase 3.\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}