# PDF ToC Generation Quick Start

### Optional: Run as App
```bash
streamlit run app.py
```
This opens a local web page where you can upload a PDF, analyze its fonts, and generate bookmarks with one click.

### Find Header Candidates
If you don't know the font name or size of your chapter headers, this lists the 25 largest text elements in the PDF.
```bash
python utils/list_longest_fonts.py <input.pdf>
```
*Output: Font Name, Size, Physical Page, Logical Page Label.*
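
The ranking idea can be sketched in a few lines of Python. This is not the code of `utils/list_longest_fonts.py`; the `Span` tuple, its fields, and `largest_spans` are hypothetical names, and the sample values are made up for illustration.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """One text element (assumed shape, mirroring the reported columns)."""
    font: str
    size: float
    page: int    # physical page
    label: str   # logical page label, e.g. "iii" or "1"
    text: str


def largest_spans(spans, limit=25):
    """Return up to `limit` spans, largest font size first."""
    return sorted(spans, key=lambda s: s.size, reverse=True)[:limit]


# Made-up sample data, not taken from a real PDF.
sample = [
    Span("Caslon-Regular", 11.0, 9, "1", "It was a bright cold day."),
    Span("Caslon-Bold", 54.0, 9, "1", "Chapter 1"),
    Span("Caslon-Italic", 14.0, 3, "iii", "Preface"),
]

for s in largest_spans(sample, limit=2):
    print(f"{s.font}\t{s.size}\t{s.page}\t{s.label}")
```

Body text usually dominates a PDF by count, so sorting by size and keeping only the top entries tends to surface titles and chapter headers first.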

### Find Header by Context
If you know a specific string (e.g., the first sentence of a chapter) but can't find the header itself, this finds the element *immediately preceding* that string.
```bash
python utils/find_preceding.py <input.pdf> "known text string"
```
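
As a rough illustration of the lookup, assuming the PDF text has already been extracted into a list of strings in reading order (the function name `find_preceding` and the sample list are hypothetical, not the script's actual internals):

```python
def find_preceding(elements, needle):
    """Return the element just before the first one containing `needle`.

    Returns None if `needle` is not found or if the match is the very
    first element, which has no predecessor.
    """
    for i, text in enumerate(elements):
        if needle in text:
            return elements[i - 1] if i > 0 else None
    return None


# Made-up sample: a header followed by body text.
elements = [
    "Chapter 1",
    "It was a bright cold day in April.",
    "The clocks were striking thirteen.",
]

print(find_preceding(elements, "bright cold day"))
```

The element returned this way is a candidate header; its font name and size are what you would then feed into a recipe.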

### Debug Text Artifacts
If your bookmarks contain stray characters (e.g., `??`), use this to print the raw byte codes of a string and spot soft hyphens (`\xad`), non-breaking spaces (`\xa0`), and similar invisible characters.
```bash
python utils/inspect_bytes.py <input.pdf> "Problematic String"
```
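
To see what such an inspection might surface, here is a minimal standalone sketch that works on a plain Python string rather than a PDF. `describe_chars` is a hypothetical helper, not part of `utils/inspect_bytes.py`.

```python
import unicodedata


def describe_chars(text):
    """List every non-ASCII character as (escaped form, UTF-8 bytes in hex, Unicode name)."""
    rows = []
    for ch in text:
        if ord(ch) > 127:
            rows.append((
                ascii(ch),                        # escaped form, e.g. '\xad'
                ch.encode("utf-8").hex(),         # bytes as they travel through a UTF-8 pipe
                unicodedata.name(ch, "UNKNOWN"),  # human-readable name
            ))
    return rows


# A title with a soft hyphen and a non-breaking space hidden inside it.
for row in describe_chars("Intro\xadduction\xa0One"):
    print(*row)
```

Both characters are invisible in most viewers, which is why they tend to show up only as `??` or odd spacing once the text reaches a bookmark.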

---

## Recipe Generation (pdfxmeta)
Once you have identified the visual style of your headers (e.g., "Caslon 54pt"), you can inspect specific text or automatically generate recipe entries using `pdfxmeta`.

### Inspect Font Details
To get the exact font name and size of a specific string (e.g., "Chapter 1"):
```bash
pdfxmeta input.pdf "Chapter 1"
```
*Output will show `font.name`, `font.size`, etc.*

### Auto-Generate Recipe Entry
To append a valid TOML filter directly to your recipe file (level 1 header):
```bash
pdfxmeta -a 1 input.pdf "Chapter 1" >> recipe.toml
```

---

## The Pipeline
Run the full extraction and generation pipeline.

### Middleware: `modify_toc.py`

We use a custom Python script to:

1.  **Sanitize Text**: Remove soft hyphens (`\xad`) and clean up encodings.
2.  **Format Labels**: Rename bookmarks to the pattern `001_Title_pgX`.
3.  **Fix Encoding**: Force UTF-8 handling to prevent corruption in the pipe.
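
The three steps above can be sketched roughly as follows. This is not the actual `modify_toc.py`: the helper names are hypothetical, the non-breaking-space replacement and the underscore separator are assumptions, and parsing of the `pdftocgen` output is left out entirely.

```python
import io
import re


def sanitize(title):
    """Drop soft hyphens; turning non-breaking spaces into spaces is an assumption."""
    return title.replace("\xad", "").replace("\xa0", " ").strip()


def format_label(index, title, page):
    """Build a label shaped like 001_Title_pgX (underscore separator is an assumption)."""
    slug = re.sub(r"\s+", "_", sanitize(title))
    return f"{index:03d}_{slug}_pg{page}"


def as_utf8(binary_stream):
    """Wrap a binary stream so text is always decoded or encoded as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8")


print(format_label(1, "Intro\xadduction", 5))  # 001_Introduction_pg5
```

In a filter script, wrapping `sys.stdin.buffer` and `sys.stdout.buffer` with something like `as_utf8` is one way to avoid depending on the console's default code page.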



### The Command

On Windows, **Git Bash** is recommended to avoid PowerShell encoding issues in the pipe.

```bash
pdftocgen -r recipe.toml input.pdf | python utils/modify_toc.py | pdftocio -o output.pdf input.pdf
```