Spaces:
Sleeping
Sleeping
| # PDF ToC Generation Quick Start | |
| Optional: Run as App | |
| ```bash | |
| streamlit run app.py | |
| ``` | |
| This will open a local web page where you can upload a PDF, analyze fonts, and generate bookmarks with one click. | |
| ### Find Header Candidates | |
| If you don't know the font size/name of your chapters, this lists the top 25 largest text elements. | |
| ```bash | |
| python utils/list_longest_fonts.py <input.pdf> | |
| ``` | |
| *Output: Font Name, Size, Physical Page, Logical Page Label.* | |
| ### Find Header by Context | |
| If you know a specific string (e.g., the first sentence of a chapter) but can't find the header itself, this finds the element *immediately preceding* that string. | |
| ```bash | |
| python utils/find_preceding.py <input.pdf> "known text string" | |
| ``` | |
| ### Debug Text Artifacts | |
| If your bookmarks have weird characters (e.g., `??`), use this to see the raw byte codes (looking for soft hyphens `\xad`, non-breaking spaces `\xa0`, etc.). | |
| ```bash | |
| python utils/inspect_bytes.py <input.pdf> "Problematic String" | |
| ``` | |
| --- | |
| ## Recipe Generation (pdfxmeta) | |
| Once you have identified the visual style of your headers (e.g., "Caslon 54pt"), you can inspect specific text or automatically generate recipe entries using `pdfxmeta`. | |
| ### Inspect Font Details | |
| To get the exact font name and size of a specific string (e.g., "Chapter 1"): | |
| ```bash | |
| pdfxmeta input.pdf "Chapter 1" | |
| ``` | |
| *Output will show `font.name`, `font.size`, etc.* | |
| ### Auto-Generate Recipe Entry | |
| To append a valid TOML filter directly to your recipe file (level 1 header): | |
| ```bash | |
| pdfxmeta -a 1 input.pdf "Chapter 1" >> recipe.toml | |
| ``` | |
| --- | |
| ## The Pipeline | |
| Run the full extraction and generation pipeline. | |
| ### Middleware: `modify_toc.py` | |
| We use a custom Python script to: | |
| 1. **Sanitize Text**: Removes soft hyphens (`\xad`) and cleans encodings. | |
| 2. **Format Labels**: Renames bookmarks to `001_Title_pgX`. | |
| 3. **Fix Encoding**: Forces UTF-8 handling to prevent pipe corruption. | |
| ### The Command | |
| **Git Bash** is recommended to avoid PowerShell encoding issues. | |
| ```bash | |
| pdftocgen -r recipe.toml input.pdf | python utils/modify_toc.py | pdftocio -o output.pdf input.pdf | |
| ``` | |