"""
PDF/HTML extraction with layout-aware preference and graceful fallback.

Default path: **Docling** (IBM, layout-aware). Preserves heading hierarchy,
list ordering, table structure, and de-duplicates page-header repetition.
Critical for multi-column source documents (e.g. Physiotherapy Standards
framework PDF, where pymupdf scrambled bullet ordering).

Fallback path: **markitdown**. Used when Docling is not installed or when
its conversion fails on a specific source. Markitdown is faster and
lighter, but it does not handle multi-column layouts well.

Usage in build scripts::

    from extract_pdf import extract_to_markdown

    body = extract_to_markdown(cache_path)

The function handles both PDFs and HTML uniformly. PDFs go through Docling
by default; HTML always uses markitdown (Docling's HTML support is less
mature than its PDF support, and our HTML pages are mostly SilverStripe
with semantic markup that markitdown handles well).
"""

from __future__ import annotations

from pathlib import Path

# Optional dependency: try to import Docling once at module load and record
# whether it is available. markitdown is imported lazily inside its helper.
try:
    from docling.document_converter import DocumentConverter
    _DOCLING_AVAILABLE = True
except ImportError:
    _DOCLING_AVAILABLE = False
    DocumentConverter = None  # type: ignore[assignment]


def extract_to_markdown(path: str | Path, format_hint: str | None = None) -> str:
    """Convert a PDF or HTML file to markdown text.

    Args:
        path: filesystem path to the source file
        format_hint: optional "pdf" or "html". If omitted, inferred from file extension.

    Returns:
        Markdown text content (stripped of leading/trailing whitespace).

    Strategy:
        - .pdf → Docling preferred, markitdown fallback if Docling is missing or errors
        - .html / .htm → markitdown (Docling's HTML support less mature than PDF)
        - other → markitdown (general-purpose)
    """
    p = Path(path)
    # Normalise the hint the same way as the suffix, so "PDF" or ".pdf" also match.
    fmt = (format_hint or p.suffix).lstrip(".").lower()

    if fmt == "pdf" and _DOCLING_AVAILABLE:
        try:
            return _docling_extract(p)
        except Exception as e:
            print(f"  ⚠ Docling failed on {p.name}: {e!r}; falling back to markitdown")

    return _markitdown_extract(p)


def _docling_extract(path: Path) -> str:
    """Run Docling and return the markdown export."""
    converter = DocumentConverter()
    result = converter.convert(str(path))
    return result.document.export_to_markdown().strip()


def _markitdown_extract(path: Path) -> str:
    """Run markitdown and return the text content."""
    from markitdown import MarkItDown
    md = MarkItDown()
    result = md.convert(str(path))
    return result.text_content.strip()


def is_docling_available() -> bool:
    """Return True if Docling imported successfully, so build scripts can log which extractor is in use."""
    return _DOCLING_AVAILABLE
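

# A small sketch, not part of the original module: the routing rules of
# extract_to_markdown() restated as a pure function, so a build script could
# log the extractor it expects before converting. The name expected_extractor
# is hypothetical, and the function does not model the runtime fallback that
# happens when Docling raises on a specific file.
def expected_extractor(fmt: str, docling_available: bool) -> str:
    """Return "docling" or "markitdown" for a normalised format such as "pdf"."""
    if fmt == "pdf" and docling_available:
        return "docling"
    return "markitdown"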


# ---- Heading demotion --------------------------------------------------

import re as _re

_HEADING_LINE_RE = _re.compile(r"^(#{1,6}) ", _re.MULTILINE)


def demote_headings(text: str, levels: int = 1, max_depth: int = 6) -> str:
    """Demote markdown headings by ``levels`` levels, capped at H``max_depth``.

    Used by build scripts to nest extracted body content under a higher-level
    heading injected by the build script. Without demotion, Docling-extracted
    body H2s collide with the build script's source-level H2 wrapper at the
    same tree depth.

    Example::

        ## Source Title (build script wrapper)
        ## Introduction        <-- collides at H2 with other sources' Introductions
        Body...

    becomes::

        ## Source Title (build script wrapper)
        ### Introduction       <-- now a child of the wrapper
        Body...

    Multiple sources can each have their own "### Introduction" without
    colliding because each is scoped under its own H2 parent.

    Caps demotion at ``max_depth`` (default H6) since markdown beyond H6 is
    treated as paragraph text.
    """
    def _demote(m: "_re.Match[str]") -> str:
        hashes = m.group(1)
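        # Never shorten an existing run of hashes: a heading already deeper
        # than max_depth stays as it is instead of being promoted.
        new_count = max(len(hashes), min(len(hashes) + levels, max_depth))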
        return "#" * new_count + " "

    return _HEADING_LINE_RE.sub(_demote, text)
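

# A minimal sketch, not part of the original module: demote_headings() rewrites
# every line that starts with "#", including lines inside fenced code blocks
# (for example shell comments in an extracted snippet). The helper below is a
# hypothetical variant (its name and its simplified fence handling are
# assumptions) that leaves fenced content untouched. It only recognises fences
# of three or more backticks or tildes indented by at most three spaces, and it
# treats any later fence line using the same character as the closing fence.
import re

_FENCE_RE = re.compile(r"^ {0,3}(`{3,}|~{3,})")
_ATX_RE = re.compile(r"^(#{1,6}) ")


def demote_headings_outside_fences(text: str, levels: int = 1, max_depth: int = 6) -> str:
    """Demote ATX headings by ``levels``, skipping lines inside fenced code blocks."""
    out = []
    fence_char = None  # None while outside a fenced block
    for line in text.split("\n"):
        fence = _FENCE_RE.match(line)
        if fence:
            char = fence.group(1)[0]
            if fence_char is None:
                fence_char = char  # opening fence
            elif char == fence_char:
                fence_char = None  # closing fence
            out.append(line)
            continue
        if fence_char is None:
            heading = _ATX_RE.match(line)
            if heading:
                count = len(heading.group(1))
                new_count = max(count, min(count + levels, max_depth))
                line = "#" * new_count + line[count:]
        out.append(line)
    return "\n".join(out)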