# PDF ToC Generation Quick Start Optional: Run as App ```bash streamlit run app.py ``` This will open a local web page where you can upload a PDF, analyze fonts, and generate bookmarks with one click. ### Find Header Candidates If you don't know the font size/name of your chapters, this lists the top 25 largest text elements. ```bash python utils/list_longest_fonts.py ``` *Output: Font Name, Size, Physical Page, Logical Page Label.* ### Find Header by Context If you know a specific string (e.g., the first sentence of a chapter) but can't find the header itself, this finds the element *immediately preceding* that string. ```bash python utils/find_preceding.py "known text string" ``` ### Debug Text Artifacts If your bookmarks have weird characters (e.g., `??`), use this to see the raw byte codes (looking for soft hyphens `\xad`, non-breaking spaces `\xa0`, etc.). ```bash python utils/inspect_bytes.py "Problematic String" ``` --- ## Recipe Generation (pdfxmeta) Once you have identified the visual style of your headers (e.g., "Caslon 54pt"), you can inspect specific text or automatically generate recipe entries using `pdfxmeta`. ### Inspect Font Details To get the exact font name and size of a specific string (e.g., "Chapter 1"): ```bash pdfxmeta input.pdf "Chapter 1" ``` *Output will show `font.name`, `font.size`, etc.* ### Auto-Generate Recipe Entry To append a valid TOML filter directly to your recipe file (level 1 header): ```bash pdfxmeta -a 1 input.pdf "Chapter 1" >> recipe.toml ``` --- ## The Pipeline Run the full extraction and generation pipeline. ### Middleware: `modify_toc.py` We use a custom Python script to: 1. **Sanitize Text**: Removes soft hyphens (`\xad`) and cleans encodings. 2. **Format Labels**: Renames bookmarks to `001_Title_pgX`. 3. **Fix Encoding**: Forces UTF-8 handling to prevent pipe corruption. ### The Command **Git Bash** is recommended to avoid PowerShell encoding issues. ```bash pdftocgen -r recipe.toml input.pdf | python utils/modify_toc.py | pdftocio -o output.pdf input.pdf ```