PDF ToC Generation Quick Start
Optional: Run as App
streamlit run app.py
This will open a local web page where you can upload a PDF, analyze fonts, and generate bookmarks with one click.
Find Header Candidates
If you don't know the font size or name used for your chapter headings, this script lists the 25 largest text elements in the PDF.
python utils/list_longest_fonts.py <input.pdf>
Output: Font Name, Size, Physical Page, Logical Page Label.
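The ranking step behind this listing can be sketched in a few lines. This is only an illustration, not the contents of utils/list_longest_fonts.py: it assumes the text spans have already been extracted into dictionaries with the hypothetical keys "text", "font", "size" and "page", and the helper name largest_spans is made up for the example.

```python
from operator import itemgetter


def largest_spans(spans, limit=25):
    """Return up to `limit` spans, largest font size first.

    `spans` is assumed to be a list of dicts with the hypothetical keys
    "text", "font", "size" and "page"; the real script may differ.
    """
    # Ignore whitespace-only spans so they do not crowd out real headers.
    candidates = [s for s in spans if s["text"].strip()]
    # sorted() is stable, so spans of equal size keep their document order.
    return sorted(candidates, key=itemgetter("size"), reverse=True)[:limit]


if __name__ == "__main__":
    sample = [
        {"text": "Chapter 1", "font": "Caslon", "size": 54.0, "page": 7},
        {"text": "Body text.", "font": "Caslon", "size": 11.0, "page": 7},
        {"text": "   ", "font": "Caslon", "size": 72.0, "page": 1},
        {"text": "Chapter 2", "font": "Caslon", "size": 54.0, "page": 31},
    ]
    for span in largest_spans(sample, limit=2):
        print(span["font"], span["size"], span["page"], span["text"])
```

Headers usually appear near the top of such a list because they are set larger than body text, which is why sorting by size is a reasonable first pass.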
Find Header by Context
If you know a specific string (e.g., the first sentence of a chapter) but can't find the header itself, this finds the element immediately preceding that string.
python utils/find_preceding.py <input.pdf> "known text string"
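The lookup itself is simple once the text elements are available in reading order. The sketch below assumes they are plain strings in a list; the function name and the sample data are hypothetical, and the real script may match text differently (for example, across element boundaries).

```python
def find_preceding(elements, needle):
    """Return the element just before the first one containing `needle`.

    `elements` is assumed to be a list of text strings in reading order.
    Returns None if `needle` is not found or sits in the very first element.
    """
    for index, text in enumerate(elements):
        if needle in text:
            return elements[index - 1] if index > 0 else None
    return None


if __name__ == "__main__":
    page_elements = [
        "I",
        "The Beginning",
        "It was a bright cold day in April.",
    ]
    print(find_preceding(page_elements, "bright cold day"))
```

Once the preceding element is known, its text can be passed to pdfxmeta to read off the font name and size.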
Debug Text Artifacts
If your bookmarks have weird characters (e.g., ??), use this to see the raw byte codes (looking for soft hyphens \xad, non-breaking spaces \xa0, etc.).
python utils/inspect_bytes.py <input.pdf> "Problematic String"
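To see what such an inspection reveals, the following sketch prints the code point, UTF-8 bytes, and Unicode name of every character in a string. It uses only the Python standard library and works on a string you already have; how utils/inspect_bytes.py pulls the string out of the PDF is not shown here, and describe_chars is a hypothetical name.

```python
import unicodedata


def describe_chars(text):
    """Return (repr, code point, UTF-8 bytes, Unicode name) per character."""
    rows = []
    for ch in text:
        rows.append((
            repr(ch),                           # escaped form, e.g. '\xad'
            f"U+{ord(ch):04X}",                 # code point in hex
            ch.encode("utf-8"),                 # bytes as sent through a pipe
            unicodedata.name(ch, "<unnamed>"),  # fallback for unnamed chars
        ))
    return rows


if __name__ == "__main__":
    # A title containing a soft hyphen and a non-breaking space.
    for row in describe_chars("co\xadop\xa0ok"):
        print(*row)
```

Invisible characters such as the soft hyphen show up with their own name and a two-byte UTF-8 encoding, which makes them easy to spot next to ordinary ASCII letters.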
Recipe Generation (pdfxmeta)
Once you have identified the visual style of your headers (e.g., "Caslon 54pt"), you can inspect specific text or automatically generate recipe entries using pdfxmeta.
Inspect Font Details
To get the exact font name and size of a specific string (e.g., "Chapter 1"):
pdfxmeta input.pdf "Chapter 1"
Output will show font.name, font.size, etc.
Auto-Generate Recipe Entry
To append a valid TOML filter directly to your recipe file (level 1 header):
pdfxmeta -a 1 input.pdf "Chapter 1" >> recipe.toml
The Pipeline
Run the full extraction and generation pipeline.
Middleware: modify_toc.py
We use a custom Python script to:
- Sanitize Text: Removes soft hyphens (\xad) and cleans encodings.
- Format Labels: Renames bookmarks to 001_Title_pgX.
- Fix Encoding: Forces UTF-8 handling to prevent pipe corruption.
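The three steps above can be sketched as follows. This is not the actual utils/modify_toc.py, and it rests on several assumptions: that each ToC line is an optionally indented quoted title followed by a page number, that the 001 prefix is a zero-padded running counter, and that X stands for the page number. All function names are hypothetical.

```python
import re
import sys

# Assumed line shape: optional indentation, a quoted title, a page number,
# and an optional trailing field. Adjust if your pdftocgen output differs.
TOC_LINE = re.compile(r'^(\s*)"(.*)" (\d+)(.*)$')


def sanitize_title(title):
    """Drop soft hyphens and turn non-breaking spaces into plain spaces."""
    cleaned = title.replace("\xad", "").replace("\xa0", " ")
    # Collapse runs of whitespace left behind by the replacements.
    return " ".join(cleaned.split())


def format_label(index, title, page):
    """Build a label in the 001_Title_pgX pattern (padding is assumed)."""
    return f"{index:03d}_{sanitize_title(title)}_pg{page}"


def rewrite_toc(lines):
    """Yield ToC lines with relabelled titles; pass other lines through."""
    index = 0
    for line in lines:
        stripped = line.rstrip("\n")
        match = TOC_LINE.match(stripped)
        if match is None:
            yield stripped
            continue
        index += 1
        indent, title, page, rest = match.groups()
        yield f'{indent}"{format_label(index, title, page)}" {page}{rest}'


def filter_stdin():
    """Run as a pipe filter, forcing UTF-8 on both ends of the pipe."""
    sys.stdin.reconfigure(encoding="utf-8")
    sys.stdout.reconfigure(encoding="utf-8")
    for out_line in rewrite_toc(sys.stdin):
        print(out_line)


if __name__ == "__main__":
    # Small demonstration on in-memory lines instead of a real pipe.
    demo = ['"Chap\xadter One" 12\n', '    "Part\xa0A" 14\n']
    for out_line in rewrite_toc(demo):
        print(out_line)
```

Reconfiguring the streams explicitly matters mostly on Windows, where the default console encoding is often not UTF-8; to use the sketch in a real pipe, call filter_stdin() instead of the demonstration loop.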
The Command
Git Bash is recommended to avoid PowerShell encoding issues.
pdftocgen -r recipe.toml input.pdf | python utils/modify_toc.py | pdftocio -o output.pdf input.pdf