"""
pdf_processor.py
=================
Production-ready PDF preprocessing module using Docling.

What this module does:
  1. Loads a PDF using Docling's document converter
  2. Extracts text sections with their heading hierarchy
  3. Filters out noise pages (cover, TOC, disclaimers, legal boilerplate)
  4. Extracts tables as structured data (markdown + row/col data)
  5. Cleans and normalises text (whitespace, encoding issues)
  6. Attaches rich metadata to every element
  7. Saves the structured output as JSON

Why Docling over PyPDF / pdfplumber?
  - PyPDF gives a raw text dump, so tables become garbled single lines
  - pdfplumber is better but still struggles with multi-column layouts
  - Docling runs an AI layout model (trained on DocLayNet) that understands
    the visual structure of the page: columns, tables, headings, captions
  - For financial documents with income statements and data tables
    this structural understanding is non-negotiable

Usage (as a module):
    from src.pdf_processor import PDFProcessor
    processor = PDFProcessor()
    result = processor.process("data/raw/morningstar/ptc01302411420.pdf")

Usage (as a script):
    python src/pdf_processor.py
"""

import re
import json
import logging
from pathlib import Path
from datetime import datetime, timezone

# ── Logging ──────────────────────────────────────────────────────────────────
logging.basicConfig(
    level  = logging.INFO,
    format = "%(asctime)s  %(levelname)-8s  %(message)s"
)
log = logging.getLogger(__name__)

# ── Paths ────────────────────────────────────────────────────────────────────
BASE_DIR      = Path(__file__).parent.parent
RAW_DIR       = BASE_DIR / "data" / "raw" / "morningstar"
PROCESSED_DIR = BASE_DIR / "data" / "processed" / "morningstar"


# ════════════════════════════════════════════════════════════════════════════
# PREPROCESSING STEP 1: Build the Docling Converter
# ────────────────────────────────────────────────────────────────────────────
# We configure Docling with specific pipeline options before parsing.
# These options control which AI models run during parsing.
#
# Options we set:
#   do_table_structure = True
#       → Runs TableFormer model to reconstruct table rows/columns
#       → Without this, table cells are extracted as unordered text
#
#   do_ocr = False
#       → These PDFs are digital (not scanned images), so OCR is off
#       → Turning OCR on for digital PDFs wastes time and adds noise
#
#   generate_picture_images = False
#       → We don't need embedded chart/figure images
#       → Skipping this speeds up parsing significantly
# ════════════════════════════════════════════════════════════════════════════

def build_converter():
    """Build and return a configured Docling DocumentConverter."""
    from docling.document_converter import DocumentConverter, PdfFormatOption
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.datamodel.base_models import InputFormat

    opts = PdfPipelineOptions()
    opts.do_table_structure      = True   # reconstruct table rows/columns
    opts.do_ocr                  = False  # skip OCR: digital PDFs only
    opts.generate_picture_images = False  # skip figure image extraction

    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=opts)
        }
    )
    log.info("Docling converter initialised (table_structure=ON, OCR=OFF)")
    return converter


# ════════════════════════════════════════════════════════════════════════════
# PREPROCESSING STEP 2: Noise Page Filter
# ────────────────────────────────────────────────────────────────────────────
# Not every page in a financial PDF is useful. Pages we actively remove:
#
#   Cover / Title pages
#       → Just document title, author name, date
#       → Zero retrieval value: no financial content
#
#   Table of Contents / Index pages
#       → Lists section names and page numbers
#       → Section names are already captured in real section headers
#       → Page numbers refer to printed pages, useless in RAG
#
#   Disclaimer / Legal pages
#       → "Important Disclosure", "General Disclosure", "Risk Warning",
#         "Conflicts of Interest", copyright notices
#       → ACTIVELY HARMFUL: contains terms like "investment", "securities",
#         "risk" that match financial queries but return legal boilerplate
#       → Query "what are the risks?" should return risk analysis, NOT this
#
# Detection strategy (see is_noise_page):
#   → A page is noise if it is blank or near-blank, if ALL of its headers
#     are known boilerplate titles, or if boilerplate headers outnumber
#     content headers AND its body text is under 300 characters
# ════════════════════════════════════════════════════════════════════════════

# Known boilerplate section titles (after normalisation: no ® / ™ symbols)
# Used for the exact-match check in _is_noise_header()
NOISE_HEADERS = {
    "contents", "table of contents", "index",
    "important disclosure", "important disclosures",
    "general disclosure", "general disclosures",
    "risk warning", "risks", "conflicts of interest",
    "third-party distribution", "vaneck disclosures",
    "legal disclaimer", "disclaimer", "disclaimers",
    "about morningstar indexes", "about morningstar equity research",
}

# Regex: street address line  e.g. "22 West Washington Street Chicago, IL 60602 USA"
_ADDRESS_RE = re.compile(
    r"^\d+\s+\w+.*\b(street|st|avenue|ave|boulevard|blvd|road|rd|drive|dr|lane|ln|way)\b",
    re.IGNORECASE,
)


def _normalize_header(text: str) -> str:
    """
    Strip trademark symbols and collapse whitespace so that
    "About Morningstar® Equity Research TM" normalises to
    "about morningstar equity research".
    """
    t = text.strip().lower()
    t = t.replace("®", "").replace("™", "").replace("℠", "")
    # Remove standalone " tm" / "(tm)" suffixes
    t = re.sub(r"\s*\(?\btm\b\)?$", "", t)
    # Collapse runs of whitespace left by symbol removal
    t = re.sub(r"\s{2,}", " ", t).strip()
    return t


def _is_noise_header(raw_header: str) -> bool:
    """
    Return True if a single header line is boilerplate.

    Checks (in order):
      1. Exact match in NOISE_HEADERS after normalisation
      2. Header ends with 'disclosure', 'disclosures', 'disclaimer', or 'disclaimers'
         → catches doc-specific titles like "Wide Moat Focus Index Disclosures"
      3. Header looks like a postal address
         → "22 West Washington Street Chicago, IL 60602 USA"
    """
    norm = _normalize_header(raw_header)

    if norm in NOISE_HEADERS:
        return True

    # Pattern: ends with a disclosure/disclaimer keyword
    if re.search(r"\b(disclosures?|disclaimers?)\s*$", norm):
        return True

    # Pattern: street address
    if _ADDRESS_RE.match(raw_header.strip()):
        return True

    return False


def is_noise_page(page_sections: list[dict]) -> bool:
    """
    Return True if a page contains only boilerplate content.

    A page is considered noise if:
      - It has no text at all (blank/cover page), OR
      - Case A: ALL headers are noise → remove regardless of text length
        Catches multi-paragraph legal/disclaimer pages
      - Case B: Noise headers outnumber content headers AND text < 300 chars
        Catches mixed cover pages with one content title + several disclaimer headers
    """
    if not page_sections:
        return True   # blank page

    total_text = " ".join(s["text"] for s in page_sections).strip()

    # Blank or near-blank page (cover pages often have <50 chars)
    if len(total_text) < 50:
        return True

    raw_headers = [s["text"] for s in page_sections if s["type"] == "header"]
    text_blocks  = [s for s in page_sections if s["type"] == "text"]
    text_content = " ".join(s["text"] for s in text_blocks).strip()

    if not raw_headers:
        return False   # no headers: let content pages through

    noise_headers   = [h for h in raw_headers if     _is_noise_header(h)]
    content_headers = [h for h in raw_headers if not _is_noise_header(h)]

    # Case A: ALL headers on the page are noise
    if len(content_headers) == 0:
        return True

    # Case B: Noise headers outnumber content headers AND the page has little body text
    if len(noise_headers) > len(content_headers) and len(text_content) < 300:
        return True

    return False


def filter_noise_pages(sections: list[dict]) -> tuple[list[dict], list[int]]:
    """
    Remove sections that belong to noise pages.

    Returns:
        filtered_sections : sections with noise pages removed
        removed_pages     : list of page numbers that were filtered out
    """
    from collections import defaultdict

    # Group sections by page
    by_page = defaultdict(list)
    for s in sections:
        pg = s.get("page_num") or 0
        by_page[pg].append(s)

    removed_pages = []
    kept_sections = []

    for pg in sorted(by_page.keys()):
        if is_noise_page(by_page[pg]):
            removed_pages.append(pg)
        else:
            kept_sections.extend(by_page[pg])

    if removed_pages:
        log.info(f"  Filtered {len(removed_pages)} noise pages: {removed_pages}")

    return kept_sections, removed_pages


# ════════════════════════════════════════════════════════════════════════════
# PREPROCESSING STEP 3: Text Cleaning
# ────────────────────────────────────────────────────────────────────────────
# Raw text from PDFs often contains:
#   - Extra whitespace and blank lines between words
#   - Hyphenated line breaks ("competi-\ntive" → "competitive")
#   - Unicode noise characters (soft hyphens, zero-width spaces)
#   - Repeated whitespace inside sentences
#
# We apply a simple cleaning pipeline to fix these before chunking.
# Why clean BEFORE chunking?
#   → If we chunk first, each chunk inherits the noise
#   → The embedding model will encode noise as part of the meaning
#   → Clean text produces cleaner, more accurate embeddings
# ════════════════════════════════════════════════════════════════════════════

def clean_text(text: str) -> str:
    """
    Clean raw text extracted from a PDF.

    Steps:
      1. Fix hyphenated line breaks  ("competi-\\ntion" → "competition")
      2. Remove soft hyphens and zero-width spaces
      3. Collapse runs of spaces/tabs into one space
      4. Collapse three or more consecutive newlines into two
      5. Strip leading/trailing whitespace
    """
    if not text:
        return ""

    # Step 1: Fix hyphenated line breaks (common in PDFs)
    text = re.sub(r"-\n", "", text)

    # Step 2: Remove soft hyphens (U+00AD) and zero-width spaces (U+200B)
    text = text.replace("\u00ad", "").replace("\u200b", "")

    # Step 3: Collapse multiple spaces/tabs into single space
    text = re.sub(r"[ \t]+", " ", text)

    # Step 4: Collapse more than 2 consecutive newlines into 2
    text = re.sub(r"\n{3,}", "\n\n", text)

    return text.strip()
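

# Illustrative sketch, not used by the pipeline: clean_text() above removes
# every "-\n", which would also fuse a range such as "2023-\n2024" into
# "20232024". Assuming that matters for a given corpus, a more conservative
# variant could join only when lowercase letters sit on both sides of the
# break. The name conservative_dehyphenate is hypothetical.
def conservative_dehyphenate(text: str) -> str:
    """Join "-\\n" only between lowercase letters; leave other breaks alone."""
    import re  # local import keeps the sketch self-contained

    return re.sub(r"(?<=[a-z])-\n(?=[a-z])", "", text)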


# ════════════════════════════════════════════════════════════════════════════
# PREPROCESSING STEP 3b: Section Extraction
# ────────────────────────────────────────────────────────────────────────────
# Docling's document model organises content as a tree of items.
# We iterate over it and separate items into two types:
#
#   SectionHeaderItem → A heading (H1, H2, H3 etc.)
#   TextItem          → A paragraph of body text
#
# Why capture heading level?
#   → Heading level tells us where we are in the document hierarchy
#   → "Net Income" under H1 "Financial Statements" is different from
#     "Net Income" under H2 "Non-GAAP Reconciliation"
#   → We store this in metadata so retrieval can filter by section
#
# Why separate headers from text?
#   → Headers are short and don't chunk well alone
#   → We store each text block's parent header so chunks can be prefixed
#     with it for context
# ════════════════════════════════════════════════════════════════════════════

def extract_sections(doc) -> list[dict]:
    """
    Extract all text sections from a parsed Docling document.

    Returns a list of dicts:
      {type, level, text, cleaned_text, page_num, parent_header}
    """
    from docling.datamodel.document import TextItem, SectionHeaderItem

    sections = []
    current_header = ""   # track the last seen heading for context

    for item, level in doc.iterate_items():
        text = getattr(item, "text", None)
        if not text or not text.strip():
            continue

        page_num = item.prov[0].page_no if item.prov else None

        if isinstance(item, SectionHeaderItem):
            current_header = text.strip()
            sections.append({
                "type"        : "header",
                "level"       : level,
                "text"        : text.strip(),
                "cleaned_text": clean_text(text),
                "page_num"    : page_num,
                "parent_header": "",
            })
        else:
            sections.append({
                "type"        : "text",
                "level"       : level,
                "text"        : text.strip(),
                "cleaned_text": clean_text(text),
                "page_num"    : page_num,
                "parent_header": current_header,  # context from last heading
            })

    log.info(f"  Extracted {len(sections)} sections "
             f"({sum(1 for s in sections if s['type']=='header')} headers, "
             f"{sum(1 for s in sections if s['type']=='text')} text blocks)")

    # Remove cover, TOC, and disclaimer pages
    sections, removed = filter_noise_pages(sections)
    log.info(f"  After noise filter: {len(sections)} sections remain")

    return sections
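

# Illustrative sketch, not called by the pipeline: one way a downstream chunker
# might prefix a text block with its parent header, using the fields produced
# by extract_sections(). The function name and the "header: text" layout are
# assumptions.
def with_parent_header(section: dict) -> str:
    """Return cleaned text prefixed by its parent header, when it has one."""
    body = section.get("cleaned_text") or section.get("text", "")
    header = section.get("parent_header", "")
    if section.get("type") == "text" and header:
        return f"{header}: {body}"
    return body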


# ════════════════════════════════════════════════════════════════════════════
# PREPROCESSING STEP 4: Table Extraction
# ────────────────────────────────────────────────────────────────────────────
# Tables are the most important element in financial documents.
# Docling's TableFormer model reconstructs the row/column structure.
#
# For each table we extract:
#   markdown  → Human-readable, good for LLM context
#   data      → Raw list of lists for programmatic access
#   headers   → Column names for metadata tagging
#
# Why keep tables ATOMIC (never split)?
#   → A revenue table split across two chunks loses column alignment
#   → LLM receiving half a table gives wrong or hallucinated answers
#   → Each table is stored as ONE complete chunk, regardless of size
#
# Why convert to markdown?
#   → Markdown tables are easy for LLMs to read and parse
#   → They preserve column-row relationships in plain text
# ════════════════════════════════════════════════════════════════════════════

def extract_tables(doc, skip_pages: set | None = None) -> list[dict]:
    """
    Extract all tables from a parsed Docling document.

    Args:
        skip_pages: set of page numbers to skip (noise pages)

    Returns a list of dicts:
      {index, page_num, markdown, headers, rows, cols, data, is_atomic}
    """
    skip_pages = skip_pages or set()
    tables = []

    for i, table in enumerate(doc.tables):
        try:
            df       = table.export_to_dataframe(doc)
            markdown = table.export_to_markdown(doc)
            page_num = table.prov[0].page_no if table.prov else None

            # Skip tables on noise pages (cover, TOC, disclaimer)
            if page_num in skip_pages:
                continue

            # Skip empty or trivially small tables (1 row = probably a label)
            if df.empty or len(df) < 2:
                continue

            tables.append({
                "index"    : i,
                "page_num" : page_num,
                "markdown" : markdown,
                "headers"  : list(df.columns.astype(str)),
                "rows"     : len(df),
                "cols"     : len(df.columns),
                "data"     : df.fillna("").values.tolist(),
                "is_atomic": True,   # NEVER split this chunk
            })

        except Exception as e:
            log.warning(f"  Table {i} could not be extracted: {e}")

    log.info(f"  Extracted {len(tables)} tables")
    return tables
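

# Illustrative sketch, not called by the pipeline: rebuilding a markdown table
# from the "headers" and "data" fields of one extract_tables() entry, e.g. when
# a consumer only has the saved JSON. The helper name is hypothetical and the
# pipe escaping is an assumption about what the consumer needs.
def table_dict_to_markdown(table: dict) -> str:
    """Render a table dict (headers + list-of-lists data) as markdown."""
    def cell(value) -> str:
        # Escape pipes and flatten newlines so each row stays on one line
        return str(value).replace("|", "\\|").replace("\n", " ").strip()

    headers = [cell(h) for h in table["headers"]]
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in table["data"]:
        lines.append("| " + " | ".join(cell(v) for v in row) + " |")
    return "\n".join(lines)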


# ════════════════════════════════════════════════════════════════════════════
# PREPROCESSING STEP 5: Metadata Tagging
# ────────────────────────────────────────────────────────────────────────────
# Every element (section or table) gets a metadata dict attached.
# This metadata is stored alongside the vector in ChromaDB.
#
# Why metadata matters:
#   → Allows FILTERED retrieval ("only search 2024 10-K documents")
#   → Enables source citation ("found on page 12 of PTC report")
#   → Supports temporal queries ("Apple revenue in fiscal 2024")
#
# Fields we tag:
#   source        → which file this came from
#   doc_type      → research_report / 10-K / 10-Q / 8-K
#   company       → Apple / PTC / etc.
#   fiscal_year   → for time-aware retrieval
#   page_num      → for citations
#   section_title → which section this chunk belongs to
# ════════════════════════════════════════════════════════════════════════════

def build_metadata(pdf_path: Path, extra: dict | None = None) -> dict:
    """
    Build base metadata for a document from its file path and optional extras.
    """
    meta = {
        "file_name"   : pdf_path.name,
        "file_path"   : str(pdf_path),
        "source"      : "morningstar",
        "doc_type"    : "research_report",
        "license"     : "proprietary",
        "parsed_at"   : datetime.now(timezone.utc).isoformat(),
        "parser"      : "docling",
    }
    if extra:
        meta.update(extra)
    return meta
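

# Illustrative sketch, not called by the pipeline: process() later adds a list
# ("removed_pages") to this metadata. Assuming the vector store only accepts
# scalar metadata values (str, int, float, bool), one option is to flatten
# everything else to strings first. The helper name is hypothetical.
def flatten_metadata(meta: dict) -> dict:
    """Return a copy of meta whose values are all str, int, float or bool."""
    flat = {}
    for key, value in meta.items():
        if isinstance(value, (str, int, float, bool)):
            flat[key] = value
        elif isinstance(value, (list, tuple)):
            # Keep element order; join with commas so the value stays readable
            flat[key] = ",".join(str(v) for v in value)
        elif value is None:
            flat[key] = ""
        else:
            flat[key] = str(value)
    return flat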


# ════════════════════════════════════════════════════════════════════════════
# PREPROCESSING STEP 6: Full Document Export
# ────────────────────────────────────────────────────────────────────────────
# After extracting sections and tables, we also export the full document
# as a single markdown string.
#
# Why?
#   → Useful for quick inspection and debugging
#   → Can be used as a fallback if section-level chunking fails
#   → Gives the LLM a complete document view when needed
# ════════════════════════════════════════════════════════════════════════════

def export_full_markdown(doc) -> str:
    """Export the entire document as a single markdown string."""
    return doc.export_to_markdown()


# ════════════════════════════════════════════════════════════════════════════
# MAIN PROCESSOR CLASS
# ════════════════════════════════════════════════════════════════════════════

class PDFProcessor:
    """
    End-to-end PDF processor using Docling.

    Combines all preprocessing steps into a single callable interface.
    Idempotent: skips files that have already been processed (checks cache).
    """

    def __init__(self, output_dir: Path = PROCESSED_DIR):
        self.output_dir = output_dir
        self.output_dir.mkdir(parents=True, exist_ok=True)
        self._converter = None   # lazy load: only initialise when first needed

    @property
    def converter(self):
        if self._converter is None:
            log.info("Loading Docling converter (first use) ...")
            self._converter = build_converter()
        return self._converter

    def process(self, pdf_path: str | Path, extra_meta: dict | None = None,
                force: bool = False) -> dict:
        """
        Process a single PDF file through all preprocessing steps.

        Args:
            pdf_path   : Path to the PDF file
            extra_meta : Optional extra metadata (company, fiscal_year, etc.)
            force      : If True, re-process even if output already exists

        Returns:
            Parsed document dict with metadata, sections, tables, markdown
        """
        pdf_path = Path(pdf_path)
        out_path = self.output_dir / f"{pdf_path.stem}.json"

        # Check cache: skip if already processed
        if out_path.exists() and not force:
            log.info(f"SKIP {pdf_path.name} (already processed → {out_path.name})")
            # Read as UTF-8 to match how the file is written below
            with open(out_path, encoding="utf-8") as f:
                return json.load(f)

        log.info(f"Processing: {pdf_path.name}")

        # ── Step 1: Parse with Docling ──────────────────────────────────────
        result = self.converter.convert(str(pdf_path))
        doc    = result.document
        log.info("  Docling parse complete")

        # ── Step 2 + 3: Extract sections (text cleaning happens inside) ─────
        sections = extract_sections(doc)

        # ── Step 4: Extract tables ──────────────────────────────────────────
        # extract_sections() does not return the page numbers it dropped, so
        # rebuild the unfiltered section list and rerun the noise filter to
        # learn which pages to skip when extracting tables.
        from docling.datamodel.document import SectionHeaderItem

        raw_sections_for_filter = []
        for item, _level in doc.iterate_items():
            text = getattr(item, "text", None)
            # Same skip rule as extract_sections() so both passes agree
            if not text or not text.strip():
                continue
            page_num = item.prov[0].page_no if item.prov else None
            raw_sections_for_filter.append({
                "type"    : "header" if isinstance(item, SectionHeaderItem) else "text",
                "text"    : text.strip(),
                "page_num": page_num,
            })
        _, removed_pages = filter_noise_pages(raw_sections_for_filter)

        tables = extract_tables(doc, skip_pages=set(removed_pages))

        # ── Step 5: Build metadata ──────────────────────────────────────────
        metadata = build_metadata(pdf_path, extra_meta)
        metadata["total_sections"] = len(sections)
        metadata["total_tables"]   = len(tables)
        metadata["total_pages"]    = max(
            (s["page_num"] for s in sections if s["page_num"]), default=0
        )

        # ── Step 6: Full markdown export ────────────────────────────────────
        full_markdown = export_full_markdown(doc)

        # ── Assemble final output ───────────────────────────────────────────
        metadata["removed_pages"] = sorted(removed_pages)   # used by chunker

        parsed = {
            "metadata"     : metadata,
            "sections"     : sections,
            "tables"       : tables,
            "full_markdown": full_markdown,
        }

        # ── Save custom processed JSON ──────────────────────────────────────
        # ensure_ascii=False writes non-ASCII text as-is, so pin the encoding
        with open(out_path, "w", encoding="utf-8") as f:
            json.dump(parsed, f, indent=2, ensure_ascii=False, default=str)

        # ── Save native DoclingDocument (for HybridChunker in Phase 3) ──────
        # HybridChunker needs the original DoclingDocument object.
        # Docling's native format preserves full structural metadata
        # (heading hierarchy, table cell positions, reading order) that
        # our custom JSON does not capture.
        docling_path = out_path.with_name(out_path.stem + "_docling.json")
        with open(docling_path, "w", encoding="utf-8") as f:
            f.write(doc.model_dump_json())
        log.info(f"  Saved DoclingDocument → {docling_path.name}  "
                 f"({docling_path.stat().st_size / 1024:.1f} KB)")

        size_kb = out_path.stat().st_size / 1024
        log.info(f"  Saved → {out_path.name}  ({size_kb:.1f} KB)")
        log.info(f"  Summary: {metadata['total_pages']} pages | "
                 f"{metadata['total_sections']} sections | "
                 f"{metadata['total_tables']} tables")

        return parsed

    def process_all(self, pdf_dir: Path = RAW_DIR,
                    force: bool = False) -> list[dict]:
        """Process all PDFs in a directory."""
        pdfs = sorted(pdf_dir.glob("*.pdf"))
        log.info(f"Found {len(pdfs)} PDFs in {pdf_dir}")

        results = []
        for pdf in pdfs:
            result = self.process(pdf, force=force)
            results.append(result)

        log.info(f"Processing complete: {len(results)} documents")
        return results
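

# Illustrative sketch, not called by the pipeline: process() treats any
# existing output file as a cache hit, even if the PDF changed afterwards.
# Assuming file modification times are trustworthy, a staleness check could
# look like this. The helper name is hypothetical.
def is_cache_fresh(pdf_path, out_path) -> bool:
    """Return True if out_path exists and is at least as new as pdf_path."""
    from pathlib import Path  # local import keeps the sketch self-contained

    pdf_path, out_path = Path(pdf_path), Path(out_path)
    if not out_path.exists():
        return False
    return out_path.stat().st_mtime >= pdf_path.stat().st_mtime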


# ── Entry point ──────────────────────────────────────────────────────────────
if __name__ == "__main__":
    processor = PDFProcessor()
    results   = processor.process_all()

    print("\n" + "=" * 55)
    print("PROCESSING SUMMARY")
    print("=" * 55)
    for r in results:
        m = r["metadata"]
        print(f"\nFile    : {m['file_name']}")
        print(f"Pages   : {m['total_pages']}")
        print(f"Sections: {m['total_sections']}")
        print(f"Tables  : {m['total_tables']}")
        if r["tables"]:
            print("Tables found:")
            for t in r["tables"]:
                print(f"  Page {t['page_num']} - "
                      f"{t['rows']} rows × {t['cols']} cols | "
                      f"Headers: {t['headers'][:3]}")