Spaces:
Sleeping
Sleeping
File size: 2,157 Bytes
046e3b8 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 | # PDF ToC Generation Quick Start
Optional: Run as App
```bash
streamlit run app.py
```
This will open a local web page where you can upload a PDF, analyze fonts, and generate bookmarks with one click.
### Find Header Candidates
If you don't know the font size/name of your chapters, this lists the top 25 largest text elements.
```bash
python utils/list_longest_fonts.py <input.pdf>
```
*Output: Font Name, Size, Physical Page, Logical Page Label.*
### Find Header by Context
If you know a specific string (e.g., the first sentence of a chapter) but can't find the header itself, this finds the element *immediately preceding* that string.
```bash
python utils/find_preceding.py <input.pdf> "known text string"
```
### Debug Text Artifacts
If your bookmarks have weird characters (e.g., `??`), use this to see the raw byte codes (looking for soft hyphens `\xad`, non-breaking spaces `\xa0`, etc.).
```bash
python utils/inspect_bytes.py <input.pdf> "Problematic String"
```
---
## Recipe Generation (pdfxmeta)
Once you have identified the visual style of your headers (e.g., "Caslon 54pt"), you can inspect specific text or automatically generate recipe entries using `pdfxmeta`.
### Inspect Font Details
To get the exact font name and size of a specific string (e.g., "Chapter 1"):
```bash
pdfxmeta input.pdf "Chapter 1"
```
*Output will show `font.name`, `font.size`, etc.*
### Auto-Generate Recipe Entry
To append a valid TOML filter directly to your recipe file (level 1 header):
```bash
pdfxmeta -a 1 input.pdf "Chapter 1" >> recipe.toml
```
---
## The Pipeline
Run the full extraction and generation pipeline.
### Middleware: `modify_toc.py`
We use a custom Python script to:
1. **Sanitize Text**: Removes soft hyphens (`\xad`) and cleans encodings.
2. **Format Labels**: Renames bookmarks to `001_Title_pgX`.
3. **Fix Encoding**: Forces UTF-8 handling to prevent pipe corruption.
### The Command
**Git Bash** is recommended to avoid PowerShell encoding issues.
```bash
pdftocgen -r recipe.toml input.pdf | python utils/modify_toc.py | pdftocio -o output.pdf input.pdf
```
|