# Redesign: Dual Extraction Path + Dual TTS Mode + Parallel Synthesis
**Status:** Design (not yet implemented)
**Date:** 2026-05-18
**Scope:** Three targeted improvements to OCR path, TTS model selection, and synthesis parallelism. No other changes.
---
## 1. Architecture Diagram
```
POST /upload  (file, format, mode)
      │
      ▼
upload.py: _run_pipeline(job_id, tmp_path, fmt, mode)
      │
      ├─── [A] ocr.py: extract_pages(pdf_path)
      │         │
      │         ├── fitz.open(pdf_path)
      │         │     └── page.get_text("text") ──► char_count >= 50?
      │         │             │ YES → use direct text (fast path)
      │         │             │ NO  → rasterize page → Tesseract (fallback)
      │         │
      │         └── normalize_for_tts() + sentence split (unchanged)
      │
      ├─── [B] tts.py: synthesise_parallel(sentences, mode)
      │         │
      │         │   mode="quality"                 mode="fast"
      │         │        │                              │
      │         │   Coqui Tacotron2-DDC            facebook/mms-tts-eng
      │         │   (22 kHz, chunk ≤150 chars)     (16 kHz, chunk ≤500 chars)
      │         │        │                              │
      │         │        └──────────────┬───────────────┘
      │         │                       │
      │         │               ProcessPoolExecutor
      │         │               (max_workers = min(cpu_count, n_sentences))
      │         │               ordered list[np.ndarray]
      │         │
      └─── audio_chain.py: process_and_export(segments, sample_rate, fmt)
                │
                pedalboard DSP chain (sample_rate from tts.get_sample_rate(mode))
                └── MP3 / WAV bytes
```
---
## 2. API Contract
### `POST /upload` (multipart/form-data)
| Field | Type | Default | Notes |
|---|---|---|---|
| `file` | file | required | PDF, validated via `%PDF` magic bytes |
| `format` | `"mp3" \| "wav"` | `"mp3"` | Export format |
| `mode` | `"fast" \| "quality"` | `"fast"` | TTS model selection |
Response (202):
```json
{ "job_id": "...", "status": "queued", "mode": "fast" }
```
The `mode` field is echoed back so callers can confirm which path was accepted.
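
The field rules in the table can be sketched as a small validation helper. This is an illustration only: `validate_upload` and `PDF_MAGIC` are hypothetical names, and the real router would rely on FastAPI form parsing and return an HTTP error rather than raise `ValueError`.

```python
from enum import Enum


class TtsMode(str, Enum):
    fast = "fast"
    quality = "quality"


PDF_MAGIC = b"%PDF"  # hypothetical constant name for the magic-byte check


def validate_upload(head: bytes, fmt: str = "mp3", mode: str = "fast") -> dict:
    """Validate the three form fields and build the echoed response fields.

    `head` is the first few bytes of the uploaded file. Raises ValueError on
    bad input; a real router would map that to a 4xx response.
    """
    if not head.startswith(PDF_MAGIC):
        raise ValueError("file is not a PDF (missing %PDF magic bytes)")
    if fmt not in ("mp3", "wav"):
        raise ValueError(f"unsupported format: {fmt!r}")
    parsed_mode = TtsMode(mode)  # raises ValueError for unknown modes
    return {"status": "queued", "mode": parsed_mode.value}
```

Omitting `mode` yields `"fast"`, matching the default in the table.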
---
## 3. TTS Mode Comparison
| Property | `quality` | `fast` |
|---|---|---|
| Model | `tts_models/en/ljspeech/tacotron2-DDC` (Coqui) | `facebook/mms-tts-eng` (HuggingFace VITS) |
| Library | `TTS==0.22.*` | `transformers>=4.41.0` |
| Sample rate | 22 050 Hz | 16 000 Hz |
| Relative speed (CPU) | ~1x real-time | ~10x real-time |
| Char limit per chunk | 150 (Tacotron2 fixed decoder steps) | 500 (VITS has no hard decoder step limit) |
| Model size on disk | ~400 MB (model + vocoder) | ~80 MB |
| Output naturalness | Higher; richer prosody | Adequate; intelligible |
| DSP chain (pedalboard) | Yes | Yes |
Both modes pass through the same `audio_chain.process_and_export`. DSP parameters
(EQ, compression, high-pass, gain) and export logic are shared and sample-rate-agnostic.
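
The per-mode numbers in the table could be kept in one lookup so that the sample rate and chunk limit never drift apart. The sketch below assumes hypothetical names (`ModeConfig`, `MODE_CONFIG`, `chunk_text`); `chunk_text` is a simple word-packing stand-in, not the project's existing `_chunk()`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModeConfig:
    sample_rate: int  # Hz
    max_chars: int    # per-chunk character limit


MODE_CONFIG = {
    "quality": ModeConfig(sample_rate=22050, max_chars=150),
    "fast": ModeConfig(sample_rate=16000, max_chars=500),
}


def chunk_text(text: str, max_chars: int) -> list[str]:
    """Greedily pack whitespace-separated words into chunks of <= max_chars.

    A single word longer than max_chars is emitted as its own oversized chunk.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
```

With these limits, a 400-character sentence is split into at least three chunks in `quality` mode but stays whole in `fast` mode.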
---
## 4. Data Flow: Two PDF Extraction Paths
```
For each page in fitz document:
    raw_text = page.get_text("text").strip()
    if len(raw_text) >= TEXT_LAYER_MIN_CHARS:   # fast path
        text = raw_text
    else:                                       # fallback
        pixmap = page.get_pixmap(dpi=200)
        image  = PIL.Image.frombytes(...)
        text   = pytesseract.image_to_string(image)
    sentences = split_and_normalize(text)
```
The fallback rasterizes through the fitz `Pixmap` directly, so neither extraction
path calls poppler/pdf2image. pdf2image stays in requirements for now; whether to
drop it is Open Question 5.
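
A minimal Python sketch of the per-page decision, assuming the page object is a PyMuPDF `Page`. The function names are hypothetical, and the `Image.frombytes` arguments are an assumption based on PyMuPDF's default RGB, no-alpha pixmap; they are not taken from the existing code.

```python
TEXT_LAYER_MIN_CHARS = 50


def has_text_layer(raw_text: str, min_chars: int = TEXT_LAYER_MIN_CHARS) -> bool:
    """True when the stripped text layer is long enough to skip OCR."""
    return len(raw_text.strip()) >= min_chars


def extract_page_text(page) -> str:
    """Return text for one fitz page, falling back to Tesseract OCR."""
    raw_text = page.get_text("text").strip()
    if has_text_layer(raw_text):
        return raw_text
    # Imported lazily so the fast path needs neither Pillow nor pytesseract.
    from PIL import Image
    import pytesseract

    pix = page.get_pixmap(dpi=200)  # assumed: default RGB colorspace, no alpha
    image = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
    return pytesseract.image_to_string(image)
```

Keeping `has_text_layer` separate makes the threshold testable without opening a PDF.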
---
## 5. Component Changes
### 5.1 `backend/app/models/schemas.py`: add `TtsMode`
```python
from enum import Enum


class TtsMode(str, Enum):
    fast = "fast"
    quality = "quality"
```
Add `mode: TtsMode = TtsMode.fast` to the `Job` model for observability and so the
pipeline can retrieve the original choice during background processing.
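
To illustrate why a `str`-based enum is convenient here, the sketch below uses a plain dataclass as a stand-in for the real `Job` model (whose other fields are not shown in this document). Because `TtsMode` subclasses `str`, the standard `json` module serialises it as its plain value.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum


class TtsMode(str, Enum):
    fast = "fast"
    quality = "quality"


@dataclass
class Job:
    # Stand-in for the real Job model; only the fields needed for the example.
    job_id: str
    status: str = "queued"
    mode: TtsMode = TtsMode.fast


job = Job(job_id="example")
# TtsMode subclasses str, so json.dumps emits the plain value "fast".
body = json.dumps(asdict(job))
```

The background pipeline can read `job.mode` later and compare it directly against `TtsMode.quality`.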
### 5.2 `backend/requirements.txt`
| Action | Package | Notes |
|--------|---------|-------|
| Add | `pymupdf>=1.24.0` | fitz bindings; PyPI name is `pymupdf` |
| Add | `transformers>=4.41.0` | HuggingFace VITS for MMS-TTS fast mode |
| Add | `scipy>=1.13.0` | transformers TTS output utilities |
| Keep | `TTS==0.22.*` | Coqui TTS required for `quality` mode |
| Keep | `torch==2.5.1` | both models share the same CPU-only torch install |
| Keep | `pdf2image==1.17.*` | OCR fallback path |
| Keep | `pytesseract==0.3.*` | OCR fallback path |
### 5.3 `backend/app/services/tts.py`: full rewrite
Two independent lazy singletons, one per model. Each loads only on first use for
that mode, so an operator running only `fast` jobs never pays the ~400 MB quality
model memory cost.
**Public interface:**
```python
def get_sample_rate(mode: TtsMode = TtsMode.fast) -> int:
    # returns 22050 for "quality", 16000 for "fast"
    ...

def synthesise(sentence: str, mode: TtsMode = TtsMode.fast) -> np.ndarray:
    # routes to _synthesise_quality or _synthesise_fast
    ...

def synthesise_parallel(
    sentences: list[str], mode: TtsMode = TtsMode.fast
) -> list[np.ndarray]:
    # ProcessPoolExecutor; preserves order
    ...
```
Internal details:
- `_quality_tts: TTS | None`: Coqui singleton, loaded via
  `TTS("tts_models/en/ljspeech/tacotron2-DDC", gpu=False)`.
- `_fast_model: VitsModel | None`, `_fast_tokenizer: AutoTokenizer | None`:
  MMS-TTS singletons loaded from `"facebook/mms-tts-eng"`.
- `synthesise` quality path: identical to current code. `_chunk()` at
  `_MAX_CHARS = 150`, `tts.tts(text=c)` per chunk, concatenate.
- `synthesise` fast path: tokenize → `VitsModel` forward pass (`model(**inputs)`
  under `torch.no_grad()`; `VitsModel` is not driven through `generate`) →
  `.waveform.squeeze().numpy()` as `float32`. Chunks only as a safety guard at
  `_MAX_CHARS_FAST = 500`; VITS has no hard decoder-step limit.
- `len(cleaned) < 3` skip guard applies to both paths.
- `synthesise_parallel`: filters empty sentences and returns `[]` if none remain
  (`max_workers=0` raises `ValueError`), then
  `ProcessPoolExecutor(max_workers=min(cpu_count, len(filtered)))` with
  `executor.map(partial(synthesise, mode=mode), filtered, chunksize=1)`.
  Pool is created fresh per call (see D9 below).
- For `quality` mode, additionally cap `max_workers` at 2, i.e.
  `min(cpu_count, len(filtered), 2)`, to prevent OOM from multiple workers each
  loading their own copy of the ~400 MB Coqui model.
### 5.4 `backend/app/services/ocr.py`: full rewrite
**Signature unchanged**: callers in `upload.py` are unaffected.
```python
TEXT_LAYER_MIN_CHARS: int = 50
def extract_pages(pdf_path: Path) -> list[list[str]]: ...
```
- Import `fitz` (pymupdf).
- For each page: `page.get_text("text")`. If `len(stripped) >= TEXT_LAYER_MIN_CHARS`,
  use it. Otherwise convert via fitz `Pixmap` at 200 DPI → PIL Image → Tesseract.
- Log extraction method per page at DEBUG level.

**Threshold rationale:** 50 chars filters out blank pages and the stray whitespace
that fitz returns for form fields, while staying low enough to avoid false OCR
triggers on sparse but valid text pages. It is a module-level constant so tests can
override it by patching a single attribute.
### 5.5 `backend/app/routers/upload.py`: targeted edits
1. Accept `mode: TtsMode = Form(TtsMode.fast)` alongside existing `format` field.
2. Pass `mode` to `storage.create_job` and `_run_pipeline`.
3. Inside the page loop, call `tts.synthesise_parallel(sentences, mode=mode)` instead
of iterating `tts.synthesise` per sentence.
4. Pass `mode` to `tts.get_sample_rate(mode)` when calling `audio_chain.process_and_export`.
5. Echo `mode` in the 202 response body.
Pause check granularity changes from per-sentence to per-page (one check before
`synthesise_parallel`). Per-page granularity is acceptable; intra-page pause would
require IPC into worker processes.
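
The per-page pause granularity might look like the loop below. How the existing pause flag is stored is not described in this document, so the `threading.Event` (set means "running", cleared means "paused") and the names `run_pages` and `synth_page` are assumptions made for illustration.

```python
import threading
from typing import Callable, Iterable


def run_pages(
    pages: Iterable[list[str]],
    synth_page: Callable[[list[str]], list],
    resume_event: threading.Event,
    poll_seconds: float = 0.5,
) -> list:
    """Synthesise page by page, blocking between pages while paused."""
    segments: list = []
    for sentences in pages:
        # One check per page: once a page is handed to the pool it runs to completion.
        while not resume_event.wait(timeout=poll_seconds):
            pass
        segments.extend(synth_page(sentences))
    return segments
```

A pause requested mid-page therefore takes effect only after that page's sentences have all been synthesised.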
### 5.6 `backend/app/storage.py`: minor
`create_job` accepts `mode: TtsMode` and stores it on the `Job` object.
### 5.7 `backend/Dockerfile`: pre-bake both models
```dockerfile
# Pre-bake Coqui Tacotron2-DDC (quality mode): existing line, unchanged
RUN python -c "from TTS.api import TTS; TTS('tts_models/en/ljspeech/tacotron2-DDC', gpu=False)"
# Pre-bake facebook/mms-tts-eng (fast mode): new
ENV TRANSFORMERS_CACHE=/opt/hf_cache
ENV HF_HOME=/opt/hf_cache
# Kept on one line: a RUN instruction ends at the first unescaped newline,
# so a multi-line quoted Python snippet would not parse.
RUN python -c "from transformers import VitsModel, AutoTokenizer; VitsModel.from_pretrained('facebook/mms-tts-eng'); AutoTokenizer.from_pretrained('facebook/mms-tts-eng')"
```
`TRANSFORMERS_CACHE` and `HF_HOME` are set explicitly so the cache is at a known
path regardless of which user runs the container, and the layer is deterministic.
`espeak-ng` apt package must remain: the MMS-TTS tokenizer requires it for
phonemization via the `phonemizer` transitive dependency.
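
A build-time or startup sanity check could confirm that the pre-baked weights ended up under the configured cache path. The helper below is a sketch with a hypothetical name; it relies on the Hugging Face cache convention of a `models--{org}--{name}` folder and checks both the cache root and its `hub/` subdirectory, since which one is used depends on which environment variable takes effect.

```python
from pathlib import Path


def cached_model_dirs(cache_root: Path, repo_id: str) -> list[Path]:
    """Return cache folders for repo_id under cache_root, if any exist.

    Hugging Face caches use a `models--{org}--{name}` folder, either directly
    under the cache directory or under its `hub/` subdirectory.
    """
    folder = "models--" + repo_id.replace("/", "--")
    candidates = [cache_root / folder, cache_root / "hub" / folder]
    return [p for p in candidates if p.is_dir()]
```

For example, `cached_model_dirs(Path("/opt/hf_cache"), "facebook/mms-tts-eng")` returning an empty list would suggest the runtime is about to attempt a download.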
### 5.8 `backend/app/main.py`: one-line edit
The lifespan warmup should call `tts.get_sample_rate(TtsMode.fast)` to eagerly load
the fast model (default) at startup. The quality model loads lazily on the first
`quality` request β€” acceptable given its larger footprint.
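
One way to express the eager warmup is a lifespan context manager that runs a blocking callable once before the app starts serving. The sketch is framework-agnostic and uses only the standard library; `make_lifespan` is a hypothetical helper, and the warmup callable would be something like `lambda: tts.get_sample_rate(TtsMode.fast)`.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import Callable


def make_lifespan(warmup: Callable[[], object]):
    """Build a lifespan context manager that runs warmup() once before serving."""

    @asynccontextmanager
    async def lifespan(app):
        # Run the blocking model load in a worker thread so the event loop stays free.
        await asyncio.to_thread(warmup)
        yield

    return lifespan
```

The returned function has the shape FastAPI expects for its `lifespan=` argument: an async context manager that takes the app.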
### 5.9 `backend/app/services/audio_chain.py`: no change
The pedalboard chain is sample-rate-agnostic: all filter frequencies and compressor
times are in Hz / ms, not in samples. Callers already supply `sample_rate`; they
will now source it from `tts.get_sample_rate(mode)`.
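
As a generic illustration of what sample-rate-agnostic means (not a description of pedalboard internals), a parameter given in ms or Hz maps to different sample counts or normalised frequencies depending on the rate supplied at call time. Both helper names are hypothetical.

```python
def ms_to_samples(duration_ms: float, sample_rate: int) -> int:
    """Convert a duration in milliseconds to a whole number of samples."""
    return round(duration_ms * sample_rate / 1000)


def normalised_frequency(cutoff_hz: float, sample_rate: int) -> float:
    """Cutoff as a fraction of the Nyquist frequency (sample_rate / 2)."""
    return cutoff_hz / (sample_rate / 2)
```

One caveat worth checking during implementation: at 16 kHz the Nyquist frequency is 8 kHz, so any filter in the chain tuned above that would need review for fast mode.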
---
## 6. File Change Summary
| File | Change type | Summary |
|------|-------------|---------|
| `backend/app/models/schemas.py` | Edit | Add `TtsMode` enum; add `mode` field to `Job` |
| `backend/requirements.txt` | Edit | Add pymupdf, transformers, scipy; keep TTS for quality mode |
| `backend/app/services/ocr.py` | Rewrite | fitz fast path + Tesseract fallback |
| `backend/app/services/tts.py` | Rewrite | Dual-mode lazy singletons; mode-aware `synthesise` + `synthesise_parallel` |
| `backend/app/routers/upload.py` | Edit | Accept `mode` form param; thread through pipeline; per-page parallel synthesis |
| `backend/app/storage.py` | Edit | Store `mode` on job |
| `backend/Dockerfile` | Edit | Pre-bake both models; add `TRANSFORMERS_CACHE` and `HF_HOME` env vars |
| `backend/app/main.py` | Edit | Warmup fast model at startup |
| `backend/app/services/audio_chain.py` | None | Sample-rate-agnostic already |
---
## 7. Decision Log
| # | Decision | Alternatives considered | Rationale |
|---|----------|------------------------|-----------|
| D1 | Expose both modes as selectable via `mode` form field | Single model, two endpoints, query param | Form field is consistent with existing `format` param; no URL change; clean enum validation via Pydantic |
| D2 | Default mode = `"fast"` | Default to `"quality"` | Perceived latency is the primary UX metric for a web service. 10x speed-up turns multi-minute waits into tens of seconds for typical PDFs. Users who need highest fidelity opt in explicitly |
| D3 | Both models in the same Docker image | Two images + API-gateway routing | Single container avoids a routing layer, halves deployment complexity, and keeps cold-start behaviour identical across modes. Combined image growth ~480 MB (80 MB MMS-TTS + 400 MB Coqui) is acceptable for self-hosted use |
| D4 | Independent lazy singletons per model | Shared loader with mode key | Lazy loading means quality model (~400 MB) is never in memory if operator only runs fast jobs, and vice versa. Separate singletons are simpler than a dynamic registry and easier to test in isolation |
| D5 | Use `transformers.VitsModel` directly for MMS-TTS | Route through Coqui TTS wrapper | Coqui 0.22 does not support `facebook/mms-tts-eng` without a custom model config YAML; that approach is fragile across Coqui releases. `transformers` is the canonical Meta-endorsed path |
| D6 | Keep `_chunk()` at 150 chars for quality; new 500-char limit for fast | Unified chunk size | Tacotron2 fixed `max_decoder_steps` constraint does not exist in VITS. Larger fast-mode limit reduces join-point glitches and preserves prosody across longer phrases |
| D7 | Run both modes through the same pedalboard DSP chain | Skip DSP for fast mode to reduce latency | Chain is sample-rate-agnostic and costs ~2–5 ms/segment. Skipping it for fast mode would make output audibly inconsistent between modes. The shared chain is an explicit quality differentiator |
| D8 | `TRANSFORMERS_CACHE=/opt/hf_cache` in Dockerfile | Default `~/.cache` | Build user may differ from runtime user; explicit path guarantees pre-baked weights are found at runtime and the layer is reproducible |
| D9 | Fresh ProcessPoolExecutor per `synthesise_parallel` call | Module-level persistent pool | FastAPI background tasks run in threads; a persistent pool created at module import adds lifecycle and signal-handling complexity inside Docker. ~50 ms pool startup cost is negligible vs inference time |
| D10 | Cap `max_workers=2` for quality mode | Let it match `cpu_count` | Each quality worker loads Coqui (~400 MB). On a 4-core/1 GB host, 4 workers = 1.6 GB model memory alone. Capping at 2 keeps peak model memory at ~800 MB |
| D11 | `page.get_text("text")` (fitz) as OCR fast path | `pdfplumber`, `pdfminer.six` | pymupdf wraps the C MuPDF library and is among the fastest PDF text extractors available from Python; it is already needed for the pixmap OCR fallback, so it adds a single dependency |
| D12 | Threshold = 50 chars for direct-text path | 0, 100, 200 | 0 accepts form-only pages as text; 100+ risks falling back to OCR on sparse-but-valid text pages; 50 is a pragmatic middle |
| D13 | fitz Pixmap at 200 DPI for OCR fallback | pdf2image (poppler) at 300 DPI | Removes poppler as a hard runtime dep for the fast path; 200 DPI is sufficient for Tesseract on standard fonts |
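
The arithmetic behind D10 can be written out as a short sketch. It uses the approximate on-disk sizes quoted in this document as a proxy for per-process memory (actual resident memory may differ), and both function names are hypothetical.

```python
QUALITY_MODEL_MB = 400  # approximate figure from this document
FAST_MODEL_MB = 80      # approximate figure from this document


def peak_model_memory_mb(mode: str, workers: int) -> int:
    """Model memory if every worker process holds its own copy of the model."""
    per_worker = QUALITY_MODEL_MB if mode == "quality" else FAST_MODEL_MB
    return per_worker * workers


def max_workers_for_budget(mode: str, budget_mb: int, cpu_count: int) -> int:
    """Largest worker count whose model copies fit in budget_mb (at least 1)."""
    per_worker = QUALITY_MODEL_MB if mode == "quality" else FAST_MODEL_MB
    return max(1, min(cpu_count, budget_mb // per_worker))
```

Four quality workers come to 1600 MB and two to 800 MB, which is where the cap of 2 on a 4-core, 1 GB host comes from.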
---
## 8. Assumptions
1. Docker build environment has outbound internet at build time. Runtime is fully offline per SPEC.
2. `facebook/mms-tts-eng` is available on HuggingFace Hub. Its weights are licensed CC-BY-NC 4.0 (non-commercial use only), which must be compatible with the intended deployment. Coqui Tacotron2-DDC is MPL-2.0 (unchanged).
3. `phonemizer` (pulled by transformers for MMS-TTS) requires `espeak-ng` at runtime; the existing apt install in Dockerfile covers this.
4. Deployment host has at least 1 CPU core. `synthesise_parallel` degrades gracefully to `max_workers=1` on a single-core host.
5. Memory budget: quality model ~400 MB + fast model ~80 MB = ~480 MB static footprint. Both models are warm simultaneously only if mixed-mode requests are served concurrently. For single-mode deployments, only one model is ever loaded.
6. `normalize_for_tts` in `services/normalize.py` needs no changes: it operates on raw strings before model input.
7. WAV output in fast mode will be 16 kHz PCM; in quality mode 22.05 kHz PCM. This is a documented, expected difference, not a bug.
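
Assumption 7 can be checked mechanically by reading the sample rate back from the exported WAV header. The sketch below uses only the standard-library `wave` module with mono 16-bit PCM; the real export goes through `audio_chain.process_and_export`, so the helper names here are hypothetical.

```python
import io
import wave


def write_wav(pcm16: bytes, sample_rate: int) -> bytes:
    """Wrap mono 16-bit PCM bytes in a WAV container at the given sample rate."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(pcm16)
    return buffer.getvalue()


def read_sample_rate(wav_bytes: bytes) -> int:
    """Read the sample rate back from a WAV header."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getframerate()
```

A regression test could assert 16000 for `fast` jobs and 22050 for `quality` jobs so that a mismatched rate is caught before it reaches playback.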
---
## 9. Risks and Mitigations
| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| MMS-TTS quality unacceptable for some sentence structures | Medium | Medium | Evaluate against fixture sentences before shipping; `quality` mode is always available as a fallback |
| Both models warm simultaneously under mixed-mode concurrent load exceeds memory limit | Low | High | Document memory requirements in ops runbook; add `QUALITY_MAX_WORKERS` env var cap if needed post-deployment |
| fitz text extraction returns garbled Unicode on PDFs with custom encoding | Medium | Low | `normalize_for_tts` strips most noise; add printable-char ratio guard (>30%) post-testing if needed |
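| ProcessPoolExecutor fork-safety with torch on Linux | Low | High | Torch CPU is usually fork-safe on Linux, but forking after the parent has already run inference (for example the startup warmup) can hang on inherited thread pools; if issues appear, switch the pool to the `spawn` start method by passing `mp_context=multiprocessing.get_context("spawn")` to `ProcessPoolExecutor` |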
| `espeak-ng` version mismatch between apt and phonemizer expectations | Low | High | Validate at build time; pin `espeak-ng` apt version if a mismatch is observed |
| 200 DPI OCR fallback misses fine print vs current 300 DPI pdf2image default | Medium | Low | Bump to 300 DPI in the fitz Pixmap call if OCR accuracy is reported degraded; one-integer change |
| Image size increase (~480 MB for both models) hits CI/CD artifact limits | Low | Medium | Models are in distinct `RUN` layers; layer caching means rebuilds only re-pull on dep changes |
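
The printable-character guard mentioned for garbled fitz output could be sketched as below. The table gives only the figure 30%, so treating it as a minimum acceptable ratio is an assumption, as are the function names; text below the threshold would be routed to the OCR fallback.

```python
def printable_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are printable."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isprintable() for c in chars) / len(chars)


def looks_garbled(text: str, min_ratio: float = 0.3) -> bool:
    """True when too few characters are printable to trust the text layer."""
    return printable_ratio(text) < min_ratio
```

This catches control characters and unassigned code points, but not text that is printable yet mapped to the wrong glyphs, so it complements `normalize_for_tts` rather than replacing it.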
---
## 10. Open Questions
1. Does `facebook/mms-tts-eng` produce acceptable audio quality for the target use case, or does it need evaluation against fixture sentences before committing to the implementation?
2. Should `TEXT_LAYER_MIN_CHARS = 50` be an env-var override (like `MAX_UPLOAD_MB`) or is a hardcoded constant sufficient?
3. Is there a soft memory limit on the Docker container in the target deployment? Determines whether `QUALITY_MAX_WORKERS` should be env-configurable.
4. Should the 202 response echo `mode`, or is it sufficient to retrieve it via `GET /jobs/{job_id}`?
5. Should `pdf2image` / `poppler-utils` be removed now that fitz Pixmap covers the OCR fallback, or deferred to a follow-up cleanup PR?
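
If question 2 is answered in favour of an env-var override, the lookup could be as small as the sketch below. The variable name `TEXT_LAYER_MIN_CHARS` mirrors the constant; the function name and the fall-back-on-invalid-input behaviour are assumptions.

```python
import os

DEFAULT_TEXT_LAYER_MIN_CHARS = 50


def text_layer_min_chars(environ=os.environ) -> int:
    """Read TEXT_LAYER_MIN_CHARS from the environment, falling back to 50."""
    raw = environ.get("TEXT_LAYER_MIN_CHARS", "")
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_TEXT_LAYER_MIN_CHARS
    return value if value >= 0 else DEFAULT_TEXT_LAYER_MIN_CHARS
```

Passing the mapping as a parameter keeps the function testable without touching the real process environment.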