
Redesign: Dual Extraction Path + Dual TTS Mode + Parallel Synthesis

Status: Design (not yet implemented)
Date: 2026-05-18
Scope: Three targeted improvements to OCR path, TTS model selection, and synthesis parallelism. No other changes.


1. Architecture Diagram

POST /upload  (file, format, mode)
     │
     ▼
upload.py: _run_pipeline(job_id, tmp_path, fmt, mode)
     │
     ├─── [A] ocr.py: extract_pages(pdf_path)
     │         │
     │         ├── fitz.open(pdf_path)
     │         │    └── page.get_text("text")  ──► char_count >= 50?
     │         │                                       │ YES → use direct text (fast path)
     │         │                                       │ NO  → rasterize page → Tesseract (fallback)
     │         │
     │         └── normalize_for_tts() + sentence split (unchanged)
     │
     ├─── [B] tts.py: synthesise_parallel(sentences, mode)
     │         │
     │         │   mode="quality"                    mode="fast"
     │         │        │                                 │
     │         │   Coqui Tacotron2-DDC           facebook/mms-tts-eng
     │         │   (22 kHz, chunk ≤150 chars)    (16 kHz, chunk ≤500 chars)
     │         │        │                                 │
     │         │        └─────────────┬───────────────────┘
     │         │                      │
     │         │              ProcessPoolExecutor
     │         │         (max_workers = min(cpu_count, n_sentences))
     │         │              ordered list[np.ndarray]
     │         │
     └─── audio_chain.py: process_and_export(segments, sample_rate, fmt)
               │
               pedalboard DSP chain (sample_rate from tts.get_sample_rate(mode))
               └── MP3 / WAV bytes

2. API Contract

POST /upload (multipart/form-data)

| Field | Type | Default | Notes |
|---|---|---|---|
| file | file | required | PDF, validated via %PDF magic bytes |
| format | "mp3" or "wav" | "mp3" | Export format |
| mode | "fast" or "quality" | "fast" | TTS model selection |

Response (202):

{ "job_id": "...", "status": "queued", "mode": "fast" }

The mode field is echoed back so callers can confirm which path was accepted.
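
The contract above can be sketched without the web framework. This is a minimal illustration, not the router code: parse_mode and accepted_body are hypothetical helper names, and in the real service the FastAPI form binding described in 5.5 would perform the enum validation.

```python
from enum import Enum
from typing import Optional


class TtsMode(str, Enum):
    fast = "fast"
    quality = "quality"


def parse_mode(raw: Optional[str]) -> TtsMode:
    """Coerce the raw form value to TtsMode, defaulting to fast when absent."""
    if raw is None or raw == "":
        return TtsMode.fast
    # Enum lookup by value raises ValueError for anything but "fast"/"quality".
    return TtsMode(raw)


def accepted_body(job_id: str, mode: TtsMode) -> dict:
    """Build the 202 payload, echoing the accepted mode back to the caller."""
    return {"job_id": job_id, "status": "queued", "mode": mode.value}
```

An unknown value such as "turbo" raises ValueError here; the HTTP layer would be expected to translate that into a validation error response.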


3. TTS Mode Comparison

| Property | quality | fast |
|---|---|---|
| Model | tts_models/en/ljspeech/tacotron2-DDC (Coqui) | facebook/mms-tts-eng (HuggingFace VITS) |
| Library | TTS==0.22.* | transformers>=4.41.0 |
| Sample rate | 22 050 Hz | 16 000 Hz |
| Relative speed (CPU) | ~1x real-time | ~10x real-time |
| Char limit per chunk | 150 (Tacotron2 fixed decoder steps) | 500 (VITS has no hard decoder step limit) |
| Model size on disk | ~400 MB (model + vocoder) | ~80 MB |
| Output naturalness | Higher; richer prosody | Adequate; intelligible |
| DSP chain (pedalboard) | Yes | Yes |

Both modes pass through the same audio_chain.process_and_export. DSP parameters (EQ, compression, high-pass, gain) and export logic are shared and sample-rate-agnostic.
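
To make the per-mode chunk limits concrete, here is a minimal word-packing chunker. It is a stand-in written for illustration, not the project's existing _chunk(); the MAX_CHARS mapping and the chunk function are assumed names, and only the 150 / 500 limits come from the table above.

```python
# Per-mode chunk limits taken from the comparison table.
MAX_CHARS = {"quality": 150, "fast": 500}


def chunk(text: str, max_chars: int) -> list:
    """Greedily pack whitespace-separated words into chunks of at most max_chars."""
    chunks = []
    current = ""
    for word in text.split():
        # A single word longer than the limit is hard-split into max_chars pieces.
        while len(word) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:max_chars])
            word = word[max_chars:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # The word does not fit: close the current chunk and start a new one.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
```

With this sketch, a 499-character sentence stays whole in fast mode but is split into several pieces in quality mode, which is where the join points mentioned in D6 come from.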


4. Data Flow: Two PDF Extraction Paths

For each page in fitz document:
  raw_text = page.get_text("text").strip()

  if len(raw_text) >= TEXT_LAYER_MIN_CHARS:   # fast path
      text = raw_text
  else:                                        # fallback
      pixmap = page.get_pixmap(dpi=200)
      image  = PIL.Image.frombytes(...)
      text   = pytesseract.image_to_string(image)

  sentences = split_and_normalize(text)

The fallback rasterizes the page with the fitz Pixmap directly, so neither path calls poppler/pdf2image. pdf2image stays in requirements for now; whether to remove it is Open Question 5.
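
The per-page decision above could look like the following sketch. The names ocr_page, extract_page_text and PageLike are hypothetical; the OCR step is injectable so the routing can be exercised without PyMuPDF, Pillow or Tesseract installed, and it assumes the default RGB, no-alpha pixmap that get_pixmap produces.

```python
from typing import Callable, Protocol, Tuple

TEXT_LAYER_MIN_CHARS = 50


class PageLike(Protocol):
    """The minimal part of a fitz page that the fast path relies on."""

    def get_text(self, option: str) -> str: ...


def ocr_page(page) -> str:
    """Fallback: rasterize a fitz page at 200 DPI and run Tesseract on it."""
    # Imported lazily so the fast path does not need Pillow or pytesseract.
    from PIL import Image
    import pytesseract

    pixmap = page.get_pixmap(dpi=200)  # assumed default: RGB without alpha
    image = Image.frombytes("RGB", (pixmap.width, pixmap.height), pixmap.samples)
    return pytesseract.image_to_string(image)


def extract_page_text(
    page: PageLike, ocr: Callable[[PageLike], str] = ocr_page
) -> Tuple[str, str]:
    """Return (text, method), where method is "text-layer" or "ocr"."""
    raw_text = page.get_text("text").strip()
    if len(raw_text) >= TEXT_LAYER_MIN_CHARS:
        return raw_text, "text-layer"
    return ocr(page).strip(), "ocr"
```

Returning the method alongside the text makes the per-page DEBUG logging in 5.4 a one-line addition at the call site.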


5. Component Changes

5.1 backend/app/models/schemas.py: add TtsMode

class TtsMode(str, Enum):
    fast = "fast"
    quality = "quality"

Add mode: TtsMode = TtsMode.fast to the Job model for observability and so the pipeline can retrieve the original choice during background processing.

5.2 backend/requirements.txt

| Action | Package | Notes |
|---|---|---|
| Add | pymupdf>=1.24.0 | fitz bindings; PyPI name is pymupdf |
| Add | transformers>=4.41.0 | HuggingFace VITS for MMS-TTS fast mode |
| Add | scipy>=1.13.0 | transformers TTS output utilities |
| Keep | TTS==0.22.* | Coqui TTS required for quality mode |
| Keep | torch==2.5.1 | both models share the same CPU-only torch install |
| Keep | pdf2image==1.17.* | OCR fallback path |
| Keep | pytesseract==0.3.* | OCR fallback path |

5.3 backend/app/services/tts.py: full rewrite

Two independent lazy singletons, one per model. Each loads only on first use for that mode, so an operator running only fast jobs never pays the ~400 MB quality model memory cost.
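
One way to express a load-on-first-use singleton is sketched below. LazySingleton is a hypothetical helper, not part of the current codebase; the design's module-level globals would behave the same way, and the lock only matters if several request threads can hit the first load at once.

```python
import threading
from typing import Any, Callable, Optional


class LazySingleton:
    """Load a value at most once, on first use, even under concurrent access."""

    def __init__(self, loader: Callable[[], Any]) -> None:
        self._loader = loader
        self._value: Optional[Any] = None
        self._lock = threading.Lock()

    def get(self) -> Any:
        if self._value is None:  # cheap check without taking the lock
            with self._lock:
                if self._value is None:  # re-check: another thread may have loaded it
                    self._value = self._loader()
        return self._value
```

Two instances, one per model, keep the modes independent: constructing them costs nothing, and the ~400 MB quality model is only loaded when its get() is first called. The sketch assumes the loader never returns None.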

Public interface:

def get_sample_rate(mode: TtsMode = TtsMode.fast) -> int:
    """Return 22050 for "quality", 16000 for "fast"."""
    ...

def synthesise(sentence: str, mode: TtsMode = TtsMode.fast) -> np.ndarray:
    """Route to _synthesise_quality or _synthesise_fast."""
    ...

def synthesise_parallel(
    sentences: list[str], mode: TtsMode = TtsMode.fast
) -> list[np.ndarray]:
    """Synthesise via ProcessPoolExecutor; preserves input order."""
    ...

Internal details:

  • _quality_tts: TTS | None is the Coqui singleton, loaded via TTS("tts_models/en/ljspeech/tacotron2-DDC", gpu=False).
  • _fast_model: VitsModel | None and _fast_tokenizer: AutoTokenizer | None are the MMS-TTS singletons, loaded from "facebook/mms-tts-eng".
  • synthesise quality path: identical to current code. _chunk() at _MAX_CHARS = 150, tts.tts(text=c) per chunk, concatenate.
  • synthesise fast path: tokenize → model(**inputs).waveform under torch.no_grad() (the forward call shown in the transformers VITS documentation, rather than generate()) → .squeeze().numpy() as float32. Chunks only as a safety guard at _MAX_CHARS_FAST = 500; VITS handles long inputs natively.
  • len(cleaned) < 3 skip guard applies to both paths.
  • synthesise_parallel: filters empty sentences and returns an empty list if none remain (ProcessPoolExecutor rejects max_workers=0). Otherwise it uses ProcessPoolExecutor(max_workers=min(cpu_count, len(filtered))) with executor.map(partial(synthesise, mode=mode), filtered, chunksize=1). Pool is created fresh per call (see D9 below).
  • For quality mode, cap the worker count at 2, i.e. max_workers=min(cpu_count, 2, len(filtered)), to prevent OOM from multiple workers each loading the ~400 MB Coqui model.
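
The last two bullets can be sketched as follows. This is an illustration under assumptions: synth stands in for tts.synthesise and must be a picklable top-level function when a process pool is used, worker_count and executor_factory are hypothetical names, and the factory is injectable only so the ordering logic can be tried with a thread pool.

```python
import os
from concurrent.futures import Executor, ProcessPoolExecutor
from functools import partial
from typing import Callable, List, Optional

QUALITY_WORKER_CAP = 2  # each quality worker loads the ~400 MB Coqui model


def worker_count(n_sentences: int, mode: str, cpu_count: Optional[int] = None) -> int:
    """Workers = min(cpu_count, n_sentences), additionally capped for quality mode."""
    cpus = cpu_count or os.cpu_count() or 1
    limit = min(cpus, QUALITY_WORKER_CAP) if mode == "quality" else cpus
    return max(1, min(limit, n_sentences))


def synthesise_parallel(
    sentences: List[str],
    mode: str,
    synth: Callable,
    executor_factory: Callable[..., Executor] = ProcessPoolExecutor,
) -> list:
    """Synthesise non-empty sentences in parallel, preserving input order."""
    filtered = [s for s in sentences if s.strip()]
    if not filtered:
        return []  # avoids creating a pool with max_workers=0
    workers = worker_count(len(filtered), mode)
    # A fresh pool per call; Executor.map yields results in input order.
    with executor_factory(max_workers=workers) as pool:
        return list(pool.map(partial(synth, mode=mode), filtered, chunksize=1))
```

Executor.map returns results in submission order regardless of which worker finishes first, so no explicit index bookkeeping is needed to keep the segments aligned with the sentences.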

5.4 backend/app/services/ocr.py: full rewrite

Signature unchanged, so callers in upload.py are unaffected.

TEXT_LAYER_MIN_CHARS: int = 50

def extract_pages(pdf_path: Path) -> list[list[str]]: ...
  • Import fitz (pymupdf).
  • For each page: page.get_text("text"). If len(stripped) >= TEXT_LAYER_MIN_CHARS, use it. Otherwise convert via fitz Pixmap at 200 DPI → PIL Image → Tesseract.
  • Log extraction method per page at DEBUG level.

Threshold rationale: 50 chars filters out blank pages and the stray whitespace fitz returns for form fields, while staying low enough that pages with a sparse but real text layer are not sent to OCR unnecessarily. It is a module-level constant so tests can override it in one place.

5.5 backend/app/routers/upload.py: targeted edits

  1. Accept mode: TtsMode = Form(TtsMode.fast) alongside existing format field.
  2. Pass mode to storage.create_job and _run_pipeline.
  3. Inside the page loop, call tts.synthesise_parallel(sentences, mode=mode) instead of iterating tts.synthesise per sentence.
  4. Pass mode to tts.get_sample_rate(mode) when calling audio_chain.process_and_export.
  5. Echo mode in the 202 response body.

Pause check granularity changes from per-sentence to per-page (one check before synthesise_parallel). Per-page granularity is acceptable; intra-page pause would require IPC into worker processes.
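
A per-page pause check might look like this sketch. How the existing pipeline signals a pause is not described here, so the threading.Event named resume (set while the job may run, cleared while paused) and the run_pages helper are assumptions made for illustration.

```python
import threading
from typing import Callable, Iterable, List


def run_pages(
    pages: Iterable[List[str]],
    synthesise_page: Callable[[List[str]], list],
    resume: threading.Event,
) -> list:
    """Synthesise page by page, checking the pause flag once before each page."""
    segments = []
    for sentences in pages:
        # Blocks here while the job is paused. Work already handed to the
        # worker pool for the previous page is never interrupted.
        resume.wait()
        segments.extend(synthesise_page(sentences))
    return segments
```

The trade-off is latency of the pause itself: a pause requested mid-page takes effect only after that page's parallel batch has finished.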

5.6 backend/app/storage.py: minor

create_job accepts mode: TtsMode and stores it on the Job object.

5.7 backend/Dockerfile: pre-bake both models

# Pre-bake Coqui Tacotron2-DDC (quality mode); existing line, unchanged
RUN python -c "from TTS.api import TTS; TTS('tts_models/en/ljspeech/tacotron2-DDC', gpu=False)"

# Pre-bake facebook/mms-tts-eng (fast mode); new
ENV TRANSFORMERS_CACHE=/opt/hf_cache
ENV HF_HOME=/opt/hf_cache
# Backslash continuations keep the Python one-liner inside a single RUN instruction
RUN python -c "from transformers import VitsModel, AutoTokenizer; \
VitsModel.from_pretrained('facebook/mms-tts-eng'); \
AutoTokenizer.from_pretrained('facebook/mms-tts-eng')"

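TRANSFORMERS_CACHE and HF_HOME are set explicitly so the cache is at a known path regardless of which user runs the container, and the layer is deterministic. Recent transformers releases deprecate TRANSFORMERS_CACHE in favour of HF_HOME, so HF_HOME is the variable to rely on long term.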

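The espeak-ng apt package must remain for now. The design assumes the MMS-TTS tokenizer needs it for phonemization through phonemizer, but this should be verified: VitsTokenizer only calls phonemizer when its phonemize option is enabled, and phonemizer is an optional dependency of transformers rather than a transitive one.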

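5.8 backend/app/main.py: one-line edit

The lifespan warmup should call tts.get_sample_rate(TtsMode.fast) so the fast (default) model loads eagerly at startup. This works only if get_sample_rate goes through the lazy loader; if it simply returns a constant, as the interface comment in 5.3 suggests, the warmup must trigger the loader directly. The quality model loads lazily on the first quality request, which is acceptable given its larger footprint.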

5.9 backend/app/services/audio_chain.py: no change

The pedalboard chain is sample-rate-agnostic: filter frequencies are specified in Hz and compressor times in ms, not in samples. Callers already supply sample_rate; they will now source it from tts.get_sample_rate(mode).


6. File Change Summary

| File | Change type | Summary |
|---|---|---|
| backend/app/models/schemas.py | Edit | Add TtsMode enum; add mode field to Job |
| backend/requirements.txt | Edit | Add pymupdf, transformers, scipy; keep TTS for quality mode |
| backend/app/services/ocr.py | Rewrite | fitz fast path + Tesseract fallback |
| backend/app/services/tts.py | Rewrite | Dual-mode lazy singletons; mode-aware synthesise + synthesise_parallel |
| backend/app/routers/upload.py | Edit | Accept mode form param; thread through pipeline; per-page parallel synthesis |
| backend/app/storage.py | Edit | Store mode on job |
| backend/Dockerfile | Edit | Pre-bake both models; add TRANSFORMERS_CACHE env vars |
| backend/app/main.py | Edit | Warmup fast model at startup |
| backend/app/services/audio_chain.py | None | Sample-rate-agnostic already |

7. Decision Log

| # | Decision | Alternatives considered | Rationale |
|---|---|---|---|
| D1 | Expose both modes as selectable via mode form field | Single model, two endpoints, query param | Form field is consistent with existing format param; no URL change; clean enum validation via Pydantic |
| D2 | Default mode = "fast" | Default to "quality" | Perceived latency is the primary UX metric for a web service. A ~10x speed-up turns multi-minute waits into tens of seconds for typical PDFs. Users who need highest fidelity opt in explicitly |
| D3 | Both models in the same Docker image | Two images + API-gateway routing | Single container avoids a routing layer, halves deployment complexity, and keeps cold-start behaviour identical across modes. Combined image growth ~480 MB (80 MB MMS-TTS + 400 MB Coqui) is acceptable for self-hosted use |
| D4 | Independent lazy singletons per model | Shared loader with mode key | Lazy loading means quality model (~400 MB) is never in memory if operator only runs fast jobs, and vice versa. Separate singletons are simpler than a dynamic registry and easier to test in isolation |
| D5 | Use transformers.VitsModel directly for MMS-TTS | Route through Coqui TTS wrapper | Coqui 0.22 does not support facebook/mms-tts-eng without a custom model config YAML; that approach is fragile across Coqui releases. transformers is the canonical Meta-endorsed path |
| D6 | Keep _chunk() at 150 chars for quality; new 500-char limit for fast | Unified chunk size | Tacotron2 fixed max_decoder_steps constraint does not exist in VITS. Larger fast-mode limit reduces join-point glitches and preserves prosody across longer phrases |
| D7 | Run both modes through the same pedalboard DSP chain | Skip DSP for fast mode to reduce latency | Chain is sample-rate-agnostic and costs ~2–5 ms/segment. Removing it for fast mode would make output audibly inconsistent between modes. It is an explicit quality differentiator |
| D8 | TRANSFORMERS_CACHE=/opt/hf_cache in Dockerfile | Default ~/.cache | Build user may differ from runtime user; explicit path guarantees pre-baked weights are found at runtime and the layer is reproducible |
| D9 | Fresh ProcessPoolExecutor per synthesise_parallel call | Module-level persistent pool | FastAPI background tasks run in threads; a persistent pool created at module import adds lifecycle and signal-handling complexity inside Docker. 50 ms pool startup cost is negligible vs inference time |
| D10 | Cap max_workers=2 for quality mode | Let it match cpu_count | Each quality worker loads Coqui (~400 MB). On a 4-core/1 GB host, 4 workers = 1.6 GB model memory alone. Capping at 2 keeps peak model memory at ~800 MB |
| D11 | page.get_text("text") from fitz as the direct-text fast path | pdfplumber, pdfminer.six | pymupdf (Python bindings to the MuPDF C library, not pure Python) is among the fastest PDF text extractors available from Python; already needed for the pixmap OCR fallback so it is a single dep |
| D12 | Threshold = 50 chars for direct-text path | 0, 100, 200 | A threshold of 0 accepts the stray text on form-only pages as a valid text layer; 100+ risks falling back to OCR on sparse but valid text pages; 50 is a pragmatic middle |
| D13 | fitz Pixmap at 200 DPI for OCR fallback | pdf2image (poppler) at 300 DPI | Removes poppler as a hard runtime dep for the OCR fallback; 200 DPI is sufficient for Tesseract on standard fonts |
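
The arithmetic behind D10 can be checked with a few lines. The 400 MB figure is the approximate per-worker footprint quoted above; quality_max_workers is a hypothetical helper, and reading the cap from a QUALITY_MAX_WORKERS environment variable is only the option floated in the risks section, not a decided behaviour.

```python
import os
from typing import Mapping

QUALITY_MODEL_MB = 400  # approximate per-worker Coqui footprint quoted in D10


def peak_model_memory_mb(workers: int, model_mb: int = QUALITY_MODEL_MB) -> int:
    """Each worker process holds its own copy of the model."""
    return workers * model_mb


def quality_max_workers(cpu_count: int, env: Mapping[str, str] = os.environ) -> int:
    """Cap quality-mode workers at QUALITY_MAX_WORKERS (assumed default: 2)."""
    cap = int(env.get("QUALITY_MAX_WORKERS", "2"))
    return max(1, min(cpu_count, cap))
```

On a 4-core host, peak_model_memory_mb(4) gives 1600 MB against 800 MB for the capped value of 2, which is the difference D10 relies on.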

8. Assumptions

  1. Docker build environment has outbound internet at build time. Runtime is fully offline per SPEC.
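  2. facebook/mms-tts-eng is available on HuggingFace Hub. Its model card lists the weights as CC-BY-NC 4.0 (non-commercial), not MIT, so the license must be checked against the intended use. Coqui Tacotron2-DDC is MPL-2.0 (unchanged).
  3. phonemizer, if the MMS-TTS tokenizer actually uses it (transformers treats it as an optional dependency), requires espeak-ng at runtime; the existing apt install in Dockerfile covers this.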
  4. Deployment host has at least 1 CPU core. synthesise_parallel degrades gracefully to max_workers=1 on a single-core host.
  5. Memory budget: quality model ~400 MB + fast model ~80 MB = ~480 MB static footprint. Both models are warm simultaneously only if mixed-mode requests are served concurrently. For single-mode deployments, only one model is ever loaded.
  6. normalize_for_tts in services/normalize.py needs no changes, because it operates on raw strings before model input.
  7. WAV output in fast mode will be 16 kHz PCM; in quality mode 22.05 kHz PCM. This is a documented, expected difference, not a bug.

9. Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| MMS-TTS quality unacceptable for some sentence structures | Medium | Medium | Evaluate against fixture sentences before shipping; quality mode is always available as a fallback |
| Both models warm simultaneously under mixed-mode concurrent load exceeds memory limit | Low | High | Document memory requirements in ops runbook; add QUALITY_MAX_WORKERS env var cap if needed post-deployment |
| fitz text extraction returns garbled Unicode on PDFs with custom encoding | Medium | Low | normalize_for_tts strips most noise; add printable-char ratio guard (>30%) post-testing if needed |
| ProcessPoolExecutor fork-safety with torch on Linux | Low | High | Torch CPU inference generally works after fork on Linux, though forking once the parent has started torch threads can deadlock; if issues appear, create the pool with a spawn context, e.g. ProcessPoolExecutor(mp_context=multiprocessing.get_context("spawn")) |
| espeak-ng version mismatch between apt and phonemizer expectations | Low | High | Validate at build time; pin espeak-ng apt version if a mismatch is observed |
| 200 DPI OCR fallback misses fine print vs current 300 DPI pdf2image default | Medium | Low | Bump to 300 DPI in the fitz Pixmap call if OCR accuracy is reported degraded; one-integer change |
| Image size increase (~480 MB for both models) hits CI/CD artifact limits | Low | Medium | Models are in distinct RUN layers; layer caching means rebuilds only re-pull on dep changes |
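
The printable-char ratio guard mentioned for garbled Unicode could be as small as the sketch below. The table only gives the ">30%" figure, so reading it as "keep the text layer when more than 30% of characters are printable" is an assumption, as are the function names; the replacement character U+FFFD is counted as noise because it marks a failed decode.

```python
def printable_ratio(text: str) -> float:
    """Fraction of characters that are printable (newlines and tabs count too)."""
    if not text:
        return 0.0
    printable = sum(
        1
        for ch in text
        if (ch.isprintable() or ch in "\n\t") and ch != "\ufffd"
    )
    return printable / len(text)


def looks_garbled(text: str, min_ratio: float = 0.30) -> bool:
    """Assumed rule: treat the text layer as unusable at or below min_ratio."""
    return printable_ratio(text) <= min_ratio
```

A page flagged by looks_garbled could then be routed to the Tesseract fallback instead of the direct-text path. Private-use code points, which custom font encodings often produce, are not printable under str.isprintable and therefore lower the ratio.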

10. Open Questions

  1. Does facebook/mms-tts-eng produce acceptable audio quality for the target use case, or does it need evaluation against fixture sentences before committing to the implementation?
  2. Should TEXT_LAYER_MIN_CHARS = 50 be an env-var override (like MAX_UPLOAD_MB) or is a hardcoded constant sufficient?
  3. Is there a soft memory limit on the Docker container in the target deployment? Determines whether QUALITY_MAX_WORKERS should be env-configurable.
  4. Should the 202 response echo mode, or is it sufficient to retrieve it via GET /jobs/{job_id}?
  5. Should pdf2image / poppler-utils be removed now that fitz Pixmap covers the OCR fallback, or deferred to a follow-up cleanup PR?