PDF ToC Generation Quick Start

Optional: Run as App

streamlit run app.py

This will open a local web page where you can upload a PDF, analyze fonts, and generate bookmarks with one click.

Find Header Candidates

If you don't know the font name or size of your chapter headings, this script lists the 25 largest text elements in the PDF.

python utils/list_longest_fonts.py <input.pdf>

Output: Font Name, Size, Physical Page, Logical Page Label.
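The ranking step can be sketched in a few lines. This is not the code in utils/list_longest_fonts.py; it assumes the text spans have already been extracted from the PDF into records, and the `Span` type, its fields, and `largest_spans` are hypothetical names chosen to match the output columns above.

```python
from typing import NamedTuple

class Span(NamedTuple):
    # Hypothetical record for one extracted text element.
    font: str
    size: float
    page: int    # physical page number
    label: str   # logical page label, e.g. "iii" or "23"
    text: str

def largest_spans(spans, limit=25):
    """Return up to `limit` spans, largest font size first."""
    # sorted() is stable, so spans of equal size keep document order.
    return sorted(spans, key=lambda s: s.size, reverse=True)[:limit]

# Made-up sample data standing in for real extraction results.
spans = [
    Span("Caslon", 54.0, 9, "1", "Chapter 1"),
    Span("Minion", 10.5, 9, "1", "It was a dark night."),
    Span("Caslon", 54.0, 31, "23", "Chapter 2"),
    Span("Minion-Italic", 14.0, 3, "iii", "Preface"),
]
top = largest_spans(spans, limit=3)
for s in top:
    print(f"{s.font}\t{s.size}\tp.{s.page}\t{s.label}\t{s.text}")
```

Body text usually dominates a PDF by count but not by size, so sorting by size tends to push chapter headings to the top of the list.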

Find Header by Context

If you know a specific string (e.g., the first sentence of a chapter) but can't find the header itself, this finds the element immediately preceding that string.

python utils/find_preceding.py <input.pdf> "known text string"
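The lookup itself is simple once the text elements are available in reading order. The sketch below shows the idea on a plain list of strings; `find_preceding` here is an illustrative function, not the script's actual implementation, and the sample elements are invented.

```python
def find_preceding(elements, needle):
    """Return the element just before the first one containing `needle`.

    Returns None if the needle is not found or is in the first element.
    """
    for i, text in enumerate(elements):
        if needle in text:
            return elements[i - 1] if i > 0 else None
    return None

# Made-up elements in reading order.
elements = ["CHAPTER ONE", "The Arrival", "It was a dark night.", "Rain fell."]
header = find_preceding(elements, "It was a dark")
print(header)
```

Once the preceding element is known, its font name and size can be used as the header style in the recipe.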

Debug Text Artifacts

If your bookmarks contain unexpected characters (e.g., ??), use this to see the raw byte codes behind a string. Look for invisible characters such as soft hyphens (\xad) and non-breaking spaces (\xa0).

python utils/inspect_bytes.py <input.pdf> "Problematic String"
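To see what such a dump looks like, here is a minimal sketch using only the standard library. It works on a Python string rather than a PDF, and `inspect_chars` is a hypothetical helper, not the function used by utils/inspect_bytes.py.

```python
import unicodedata

def inspect_chars(text):
    """Return (escaped char, code point, UTF-8 bytes, Unicode name) per character."""
    rows = []
    for ch in text:
        rows.append((
            ascii(ch),                          # escaped form, e.g. '\xad'
            f"U+{ord(ch):04X}",                 # code point
            ch.encode("utf-8").hex(" "),        # bytes as they travel through a pipe
            unicodedata.name(ch, "<unnamed>"),  # official Unicode name
        ))
    return rows

# A string with a soft hyphen and a non-breaking space hidden inside.
rows = inspect_chars("co\xadop\xa0A")
for row in rows:
    print(*row, sep="\t")
```

Characters that render as nothing or as a plain space stand out immediately in the name column (SOFT HYPHEN, NO-BREAK SPACE).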

Recipe Generation (pdfxmeta)

Once you have identified the visual style of your headers (e.g., "Caslon 54pt"), you can inspect specific text or automatically generate recipe entries using pdfxmeta.

Inspect Font Details

To get the exact font name and size of a specific string (e.g., "Chapter 1"):

pdfxmeta input.pdf "Chapter 1"

Output will show font.name, font.size, etc.

Auto-Generate Recipe Entry

To append a valid TOML filter for a level 1 header directly to your recipe file:

pdfxmeta -a 1 input.pdf "Chapter 1" >> recipe.toml

The Pipeline

Run the full extraction and generation pipeline.

Middleware: `modify_toc.py`

We use a custom Python script that does three things:

Sanitize Text: removes soft hyphens (\xad) and cleans up encoding artifacts.
Format Labels: renames bookmarks to the pattern 001_Title_pgX.
Fix Encoding: forces UTF-8 handling so the text is not corrupted as it passes through the pipe.
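The three steps above can be sketched as a small stdin-to-stdout filter. This is not the contents of utils/modify_toc.py. It assumes each ToC line looks like an indented, quoted title followed by a page number, that X in 001_Title_pgX is that page number, and that non-breaking spaces should become ordinary spaces; `TOC_LINE`, `sanitize`, and `relabel` are hypothetical names.

```python
import re
import sys

# Assumed line shape: optional indentation, a quoted title, a page number,
# then anything else (e.g. a vertical position), which is left untouched.
TOC_LINE = re.compile(r'^(?P<indent>\s*)"(?P<title>.*)"\s+(?P<page>\d+)(?P<rest>.*)$')

def sanitize(title):
    """Drop soft hyphens, turn non-breaking spaces into spaces, collapse runs."""
    title = title.replace("\xad", "").replace("\xa0", " ")
    return " ".join(title.split())

def relabel(lines):
    """Yield lines with titles rewritten as 001_Title_pgX; pass others through."""
    count = 0
    for line in lines:
        line = line.rstrip("\n")
        match = TOC_LINE.match(line)
        if match is None:
            yield line
            continue
        count += 1
        title = sanitize(match["title"]).replace(" ", "_")
        label = f"{count:03d}_{title}_pg{match['page']}"
        yield f'{match["indent"]}"{label}" {match["page"]}{match["rest"]}'

def main():
    # Force UTF-8 on both ends of the pipe regardless of the console code page.
    sys.stdin.reconfigure(encoding="utf-8")
    sys.stdout.reconfigure(encoding="utf-8")
    for out in relabel(sys.stdin):
        print(out)

# Made-up sample input with a soft hyphen and a non-breaking space.
sample = ['"Chap\xadter One" 9', '    "The\xa0Arrival" 11 72.5']
result = list(relabel(sample))
```

Calling `main()` at the bottom of the file turns the sketch into a filter that can sit between two piped commands. Keeping `relabel` separate from the I/O makes the renaming logic easy to test on plain lists.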

The Command

On Windows, Git Bash is recommended; piping text between commands in PowerShell can cause encoding issues.

pdftocgen -r recipe.toml input.pdf | python utils/modify_toc.py | pdftocio -o output.pdf input.pdf