# PDF ToC Generation Quick Start
## Optional: Run as App
```bash
streamlit run app.py
```
This will open a local web page where you can upload a PDF, analyze fonts, and generate bookmarks with one click.
### Find Header Candidates
If you don't know the font name or size of your chapter headers, this script lists the 25 largest text elements in the PDF.
```bash
python utils/list_longest_fonts.py <input.pdf>
```
*Output: Font Name, Size, Physical Page, Logical Page Label.*
### Find Header by Context
If you know a specific string (e.g., the first sentence of a chapter) but can't find the header itself, this finds the element *immediately preceding* that string.
```bash
python utils/find_preceding.py <input.pdf> "known text string"
```
### Debug Text Artifacts
If your bookmarks contain stray characters (e.g., `??`), use this to see the raw byte codes and look for soft hyphens (`\xad`), non-breaking spaces (`\xa0`), etc.
```bash
python utils/inspect_bytes.py <input.pdf> "Problematic String"
```
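To show the kind of check involved, here is a minimal standalone sketch that lists the code point and Unicode name of every character in a string and flags the invisible ones. It is not the code of `utils/inspect_bytes.py`; the function names `describe_chars` and `find_invisible` are hypothetical, and it works on a string you paste in rather than on a PDF.

```python
import unicodedata


def describe_chars(text):
    """Return one (repr, code point, Unicode name) tuple per character."""
    rows = []
    for ch in text:
        # Control characters have no Unicode name, so supply a fallback.
        name = unicodedata.name(ch, "<unnamed>")
        rows.append((repr(ch), f"U+{ord(ch):04X}", name))
    return rows


def find_invisible(text):
    """Return (index, code point) for format/control characters and
    non-breaking spaces, the usual causes of garbled bookmarks."""
    suspicious = []
    for i, ch in enumerate(text):
        if unicodedata.category(ch) in ("Cf", "Cc") or ch == "\xa0":
            suspicious.append((i, f"U+{ord(ch):04X}"))
    return suspicious


if __name__ == "__main__":
    # Hypothetical bookmark title containing a soft hyphen and a
    # non-breaking space.
    sample = "Intro\xadduction\xa0One"
    for row in describe_chars(sample):
        print(*row)
    print(find_invisible(sample))
```

The soft hyphen falls in Unicode category `Cf` (format), which is why it is invisible in most viewers but still ends up in the bookmark text.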
---
## Recipe Generation (pdfxmeta)
Once you have identified the visual style of your headers (e.g., "Caslon 54pt"), you can inspect specific text or automatically generate recipe entries using `pdfxmeta`.
### Inspect Font Details
To get the exact font name and size of a specific string (e.g., "Chapter 1"):
```bash
pdfxmeta input.pdf "Chapter 1"
```
*Output will show `font.name`, `font.size`, etc.*
### Auto-Generate Recipe Entry
To append a valid TOML filter directly to your recipe file (level 1 header):
```bash
pdfxmeta -a 1 input.pdf "Chapter 1" >> recipe.toml
```
---
## The Pipeline
Run the full extraction and generation pipeline.
### Middleware: `modify_toc.py`
We use a custom Python script that does three things:
1. **Sanitize Text**: Removes soft hyphens (`\xad`) and cleans encodings.
2. **Format Labels**: Renames bookmarks to `001_Title_pgX`.
3. **Fix Encoding**: Forces UTF-8 handling to prevent pipe corruption.
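
The three steps above can be sketched as a small stdin-to-stdout filter. This is not the actual `modify_toc.py`. It assumes each ToC line on the pipe looks like optional indentation, a double-quoted title, a page number, and optional trailing fields, and that the counter in `001_Title_pgX` simply numbers bookmarks in order. The names `TOC_LINE`, `sanitize`, `format_label`, `rewrite`, and `main` are hypothetical.

```python
import re
import sys

# Assumed line shape: indentation, "quoted title", page number, anything else.
TOC_LINE = re.compile(
    r'^(?P<indent>\s*)"(?P<title>.*)"\s+(?P<page>\d+)(?P<rest>.*)$'
)


def sanitize(title):
    """Drop soft hyphens and turn non-breaking spaces into plain spaces."""
    return title.replace("\xad", "").replace("\xa0", " ").strip()


def format_label(index, title, page):
    """Build a label of the form 001_Title_pgX."""
    return f"{index:03d}_{title}_pg{page}"


def rewrite(lines):
    """Yield rewritten ToC lines; lines that do not match pass through."""
    index = 0
    for line in lines:
        line = line.rstrip("\n")
        match = TOC_LINE.match(line)
        if match is None:
            yield line
            continue
        index += 1
        label = format_label(index, sanitize(match["title"]), match["page"])
        # Keep indentation and trailing fields so the hierarchy is untouched.
        yield f'{match["indent"]}"{label}" {match["page"]}{match["rest"]}'


def main():
    # Force UTF-8 on both ends of the pipe, whatever the console code page is.
    sys.stdin.reconfigure(encoding="utf-8")
    sys.stdout.reconfigure(encoding="utf-8", newline="\n")
    for out in rewrite(sys.stdin):
        print(out)
```

A script built this way would end with a call to `main()`. Keeping `rewrite` as a generator over lines means the parsing logic can be tested without touching the real streams.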
### The Command
On Windows, **Git Bash** is recommended to avoid PowerShell's encoding issues with piped text.
```bash
pdftocgen -r recipe.toml input.pdf | python utils/modify_toc.py | pdftocio -o output.pdf input.pdf
```