Redesign: Dual Extraction Path + Dual TTS Mode + Parallel Synthesis

Status: Design — not yet implemented
Date: 2026-05-18
Scope: Three targeted improvements to OCR path, TTS model selection, and synthesis parallelism. No other changes.

1. Architecture Diagram

POST /upload  (file, format, mode)
     │
     ▼
upload.py: _run_pipeline(job_id, tmp_path, fmt, mode)
     │
     ├─── [A] ocr.py: extract_pages(pdf_path)
     │         │
     │         ├── fitz.open(pdf_path)
     │         │    └── page.get_text("text")  ──► char_count >= 50?
     │         │                                       │ YES → use direct text (fast path)
     │         │                                       │ NO  → rasterize page → Tesseract (fallback)
     │         │
     │         └── normalize_for_tts() + sentence split (unchanged)
     │
     ├─── [B] tts.py: synthesise_parallel(sentences, mode)
     │         │
     │         │   mode="quality"                    mode="fast"
     │         │        │                                 │
     │         │   Coqui Tacotron2-DDC           facebook/mms-tts-eng
     │         │   (22 kHz, chunk ≤150 chars)    (16 kHz, chunk ≤500 chars)
     │         │        │                                 │
     │         │        └─────────────┬───────────────────┘
     │         │                      │
     │         │              ProcessPoolExecutor
     │         │         (max_workers = min(cpu_count, n_sentences))
     │         │              ordered list[np.ndarray]
     │         │
     └─── audio_chain.py: process_and_export(segments, sample_rate, fmt)
               │
               pedalboard DSP chain (sample_rate from tts.get_sample_rate(mode))
               └── MP3 / WAV bytes

2. API Contract

`POST /upload` (multipart/form-data)

Field	Type	Default	Notes
`file`	file	required	PDF, validated via `%PDF` magic bytes
`format`	`"mp3" \| "wav"`	`"mp3"`	Export format
`mode`	`"fast" \| "quality"`	`"fast"`	TTS model selection

Response (202):

{ "job_id": "...", "status": "queued", "mode": "fast" }

The mode field is echoed back so callers can confirm which path was accepted.
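The contract above can be sketched without the web framework. This is a minimal illustration, assuming the router checks the first bytes of the upload itself; `validate_upload` and `accepted_response` are hypothetical helper names, not functions from the codebase.

```python
from enum import Enum


class TtsMode(str, Enum):
    fast = "fast"
    quality = "quality"


PDF_MAGIC = b"%PDF"


def validate_upload(head: bytes, mode: str = "fast") -> TtsMode:
    """Reject non-PDF payloads and unknown modes before a job is queued."""
    if not head.startswith(PDF_MAGIC):
        raise ValueError("file does not start with the %PDF magic bytes")
    # Enum lookup by value raises ValueError for anything but fast/quality.
    return TtsMode(mode)


def accepted_response(job_id: str, mode: TtsMode) -> dict:
    """Build the 202 body, echoing the accepted mode back to the caller."""
    return {"job_id": job_id, "status": "queued", "mode": mode.value}
```

In the real router, FastAPI and Pydantic would perform the enum validation on the form field; the explicit lookup here only makes the behaviour visible.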

3. TTS Mode Comparison

Property	`quality`	`fast`
Model	`tts_models/en/ljspeech/tacotron2-DDC` (Coqui)	`facebook/mms-tts-eng` (HuggingFace VITS)
Library	`TTS==0.22.*`	`transformers>=4.41.0`
Sample rate	22 050 Hz	16 000 Hz
Relative speed (CPU)	~1x real-time	~10x real-time
Char limit per chunk	150 (Tacotron2 fixed decoder steps)	500 (VITS has no hard decoder step limit)
Model size on disk	~400 MB (model + vocoder)	~80 MB
Output naturalness	Higher; richer prosody	Adequate; intelligible
DSP chain (pedalboard)	Yes	Yes

Both modes pass through the same audio_chain.process_and_export. DSP parameters (EQ, compression, high-pass, gain) and export logic are shared and sample-rate-agnostic.

4. Data Flow — Two PDF Extraction Paths

For each page in fitz document:
  raw_text = page.get_text("text").strip()

  if len(raw_text) >= TEXT_LAYER_MIN_CHARS:   # fast path
      text = raw_text
  else:                                        # fallback
      pixmap = page.get_pixmap(dpi=200)
      image  = PIL.Image.frombytes(...)
      text   = pytesseract.image_to_string(image)

  sentences = split_and_normalize(text)

The fallback rasterizes through the fitz Pixmap directly, so neither extraction path needs poppler or pdf2image. pdf2image stays in requirements for now; whether to remove it is listed under Open Questions.
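The per-page decision can be isolated from fitz and Tesseract so it is testable on its own. The sketch below assumes the OCR step is passed in as a callable; `choose_page_text` and `ocr_fallback` are hypothetical names, and in the real module the callable would wrap the Pixmap, PIL and pytesseract calls shown above.

```python
from typing import Callable, Tuple

TEXT_LAYER_MIN_CHARS = 50  # threshold from this design


def choose_page_text(raw_text: str, ocr_fallback: Callable[[], str]) -> Tuple[str, str]:
    """Return (text, method) for one page.

    raw_text is what page.get_text("text") returned. ocr_fallback is only
    invoked when the text layer is too short, so pages with a usable text
    layer never pay the rasterization cost.
    """
    stripped = raw_text.strip()
    if len(stripped) >= TEXT_LAYER_MIN_CHARS:
        return stripped, "text-layer"
    return ocr_fallback().strip(), "ocr"
```

Returning the method alongside the text gives the caller something to log at DEBUG level per page.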

5. Component Changes

5.1 `backend/app/models/schemas.py` — add `TtsMode`

class TtsMode(str, Enum):
    fast = "fast"
    quality = "quality"

Add mode: TtsMode = TtsMode.fast to the Job model for observability and so the pipeline can retrieve the original choice during background processing.

5.2 `backend/requirements.txt`

Action	Package	Notes
Add	`pymupdf>=1.24.0`	fitz bindings; PyPI name is `pymupdf`
Add	`transformers>=4.41.0`	HuggingFace VITS for MMS-TTS fast mode
Add	`scipy>=1.13.0`	transformers TTS output utilities
Keep	`TTS==0.22.*`	Coqui TTS required for `quality` mode
Keep	`torch==2.5.1`	both models share the same CPU-only torch install
Keep	`pdf2image==1.17.*`	No longer used by the fitz Pixmap fallback; retained pending the cleanup decision under Open Questions
Keep	`pytesseract==0.3.*`	OCR fallback path

5.3 `backend/app/services/tts.py` — full rewrite

Two independent lazy singletons, one per model. Each loads only on first use for that mode, so an operator running only fast jobs never pays the ~400 MB quality model memory cost.
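A minimal sketch of the two-singleton pattern follows, with stub loaders standing in for the real Coqui and transformers calls so that it runs without either library. The lock, the `load_counts` instrumentation and the `_load_*_stub` helpers are illustrative assumptions.

```python
import threading

_load_lock = threading.Lock()
_quality_tts = None  # would hold the Coqui TTS object
_fast_model = None   # would hold the VitsModel
load_counts = {"quality": 0, "fast": 0}  # demo-only instrumentation


def _load_quality_stub():
    load_counts["quality"] += 1
    return object()  # stand-in for the Coqui model load


def _load_fast_stub():
    load_counts["fast"] += 1
    return object()  # stand-in for VitsModel.from_pretrained(...)


def get_quality_tts():
    global _quality_tts
    if _quality_tts is None:
        with _load_lock:
            if _quality_tts is None:  # re-check after acquiring the lock
                _quality_tts = _load_quality_stub()
    return _quality_tts


def get_fast_model():
    global _fast_model
    if _fast_model is None:
        with _load_lock:
            if _fast_model is None:  # re-check after acquiring the lock
                _fast_model = _load_fast_stub()
    return _fast_model
```

The lock matters only where several background-task threads might hit the first request for a mode at the same time; inside a pool worker process each singleton is loaded independently.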

Public interface:

def get_sample_rate(mode: TtsMode = TtsMode.fast) -> int:
    # returns 22050 for "quality", 16000 for "fast"

def synthesise(sentence: str, mode: TtsMode = TtsMode.fast) -> np.ndarray:
    # routes to _synthesise_quality or _synthesise_fast

def synthesise_parallel(
    sentences: list[str], mode: TtsMode = TtsMode.fast
) -> list[np.ndarray]:
    # ProcessPoolExecutor; preserves order

Internal details:

_quality_tts: TTS | None — Coqui singleton, loaded via TTS("tts_models/en/ljspeech/tacotron2-DDC", gpu=False).
_fast_model: VitsModel | None, _fast_tokenizer: AutoTokenizer | None — MMS-TTS singletons loaded from "facebook/mms-tts-eng".
synthesise quality path: identical to current code — _chunk() at _MAX_CHARS = 150, tts.tts(text=c) per chunk, concatenate.
synthesise fast path: tokenize → forward pass under torch.no_grad(), reading model(**inputs).waveform (VitsModel is invoked directly rather than through generate()) → .squeeze().numpy() as float32. Chunks only as a safety guard at _MAX_CHARS_FAST = 500; VITS handles long inputs natively.
len(cleaned) < 3 skip guard applies to both paths.
synthesise_parallel: filters empty sentences and returns [] early if none remain (ProcessPoolExecutor rejects max_workers=0), then ProcessPoolExecutor(max_workers=min(cpu_count, len(filtered))) with executor.map(partial(synthesise, mode=mode), filtered, chunksize=1), which yields results in input order. Pool is created fresh per call (see D9 below).
For quality mode, cap max_workers at min(cpu_count, 2) to prevent OOM from multiple workers each loading the ~400 MB Coqui model.
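The worker-count rules and the ordered fan-out can be sketched as follows. This is an illustration under stated assumptions: `plan_workers` and `synthesise_parallel_sketch` are hypothetical names, and the `synth` argument stands in for the real `synthesise`, which must be a module-level function so it can be pickled into worker processes.

```python
import os
from concurrent.futures import ProcessPoolExecutor
from functools import partial
from typing import Callable, List, Optional, Sequence

QUALITY_MAX_WORKERS = 2  # cap from this design, bounding per-worker model memory


def plan_workers(mode: str, n_sentences: int, cpu_count: Optional[int] = None) -> int:
    """Pick max_workers from the mode, the batch size and the core count."""
    cpus = cpu_count or os.cpu_count() or 1
    workers = min(cpus, n_sentences)
    if mode == "quality":
        workers = min(workers, QUALITY_MAX_WORKERS)
    return max(workers, 1)  # ProcessPoolExecutor rejects max_workers=0


def synthesise_parallel_sketch(
    sentences: Sequence[str], mode: str, synth: Callable
) -> List:
    """Fan sentences out to worker processes, keeping the input order."""
    filtered = [s for s in sentences if s.strip()]
    if not filtered:
        return []  # nothing to do, so no pool is created
    workers = plan_workers(mode, len(filtered))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # executor.map returns results in input order, whatever order
        # the workers finish in.
        return list(pool.map(partial(synth, mode=mode), filtered, chunksize=1))
```

With the ~400 MB quality model, the cap of 2 bounds peak model memory at roughly 800 MB regardless of core count.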

5.4 `backend/app/services/ocr.py` — full rewrite

Signature unchanged — callers in upload.py unaffected.

TEXT_LAYER_MIN_CHARS: int = 50

def extract_pages(pdf_path: Path) -> list[list[str]]: ...

Import fitz (pymupdf).
For each page: page.get_text("text"). If len(stripped) >= TEXT_LAYER_MIN_CHARS, use it. Otherwise convert via fitz Pixmap at 200 DPI → PIL Image → Tesseract.
Log extraction method per page at DEBUG level.

Threshold rationale: 50 chars filters out blank pages and the stray whitespace fitz returns for form fields, while staying low enough that sparse but valid text pages do not trigger OCR unnecessarily. It is a module-level constant so tests can override it with a single attribute patch instead of patching function internals.

5.5 `backend/app/routers/upload.py` — targeted edits

Accept mode: TtsMode = Form(TtsMode.fast) alongside existing format field.
Pass mode to storage.create_job and _run_pipeline.
Inside the page loop, call tts.synthesise_parallel(sentences, mode=mode) instead of iterating tts.synthesise per sentence.
Pass mode to tts.get_sample_rate(mode) when calling audio_chain.process_and_export.
Echo mode in the 202 response body.

Pause check granularity changes from per-sentence to per-page (one check before synthesise_parallel). Per-page granularity is acceptable; intra-page pause would require IPC into worker processes.
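The per-page granularity amounts to one blocking check ahead of each parallel batch. The sketch below assumes the pause mechanism can be expressed as a callable that blocks while the job is paused; `run_pages`, `wait_if_paused` and `synthesise_page` are hypothetical names, and the existing pause implementation may look different.

```python
from typing import Callable, Iterable, List, Sequence


def run_pages(
    pages: Iterable[Sequence[str]],
    synthesise_page: Callable[[Sequence[str]], List],
    wait_if_paused: Callable[[], None],
) -> List:
    """Synthesise page by page, checking for pause once per page."""
    segments: List = []
    for sentences in pages:
        wait_if_paused()  # single check before the parallel batch starts
        segments.extend(synthesise_page(sentences))
    return segments
```

Once a page's batch has been handed to the pool it runs to completion, which is the trade-off the paragraph above accepts.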

5.6 `backend/app/storage.py` — minor

create_job accepts mode: TtsMode and stores it on the Job object.

5.7 `backend/Dockerfile` — pre-bake both models

# Pre-bake Coqui Tacotron2-DDC (quality mode) — existing line, unchanged
RUN python -c "from TTS.api import TTS; TTS('tts_models/en/ljspeech/tacotron2-DDC', gpu=False)"

# Pre-bake facebook/mms-tts-eng (fast mode) — new
ENV TRANSFORMERS_CACHE=/opt/hf_cache
ENV HF_HOME=/opt/hf_cache
RUN python -c "from transformers import VitsModel, AutoTokenizer; \
VitsModel.from_pretrained('facebook/mms-tts-eng'); \
AutoTokenizer.from_pretrained('facebook/mms-tts-eng')"

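TRANSFORMERS_CACHE and HF_HOME are set explicitly so the cache is at a known path regardless of which user runs the container, and the layer is deterministic. Recent transformers releases deprecate TRANSFORMERS_CACHE in favour of HF_HOME, so HF_HOME is the variable to rely on long term.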

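espeak-ng apt package must remain until the build confirms otherwise. The Hugging Face VITS tokenizer only calls phonemizer when its phonemize option is enabled, and phonemizer is an optional package rather than something transformers installs, so whether fast mode needs espeak-ng should be checked against the facebook/mms-tts-eng tokenizer config rather than assumed.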

5.8 `backend/app/main.py` — one-line edit

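The lifespan warmup should call tts.get_sample_rate(TtsMode.fast) to eagerly load the fast model (the default) at startup. This only works if get_sample_rate goes through the lazy loader (for example by reading the rate from the loaded model's config) rather than returning a hardcoded constant. The quality model loads lazily on the first quality request, which is acceptable given its larger footprint.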

5.9 `backend/app/services/audio_chain.py` — no change

The pedalboard chain is sample-rate-agnostic: all filter frequencies and compressor times are in Hz / ms, not in samples. Callers already supply sample_rate; they will now source it from tts.get_sample_rate(mode).
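To illustrate why parameters given in milliseconds carry over between modes, the sketch below converts one time constant into sample counts at both rates. The `SAMPLE_RATES` table mirrors the values in this document; `ms_to_samples` is a hypothetical helper, not part of pedalboard.

```python
SAMPLE_RATES = {"quality": 22_050, "fast": 16_000}  # Hz, per this design


def get_sample_rate(mode: str = "fast") -> int:
    """Look up the output sample rate for a TTS mode."""
    return SAMPLE_RATES[mode]


def ms_to_samples(duration_ms: float, sample_rate: int) -> int:
    """Convert a duration in milliseconds to a sample count."""
    return round(duration_ms * sample_rate / 1000)


# The same 100 ms time constant covers a different number of samples per
# mode, which is why the chain needs the sample rate but no per-mode tuning.
quality_samples = ms_to_samples(100, get_sample_rate("quality"))  # 2205
fast_samples = ms_to_samples(100, get_sample_rate("fast"))        # 1600
```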

6. File Change Summary

File	Change type	Summary
`backend/app/models/schemas.py`	Edit	Add `TtsMode` enum; add `mode` field to `Job`
`backend/requirements.txt`	Edit	Add pymupdf, transformers, scipy; keep TTS for quality mode
`backend/app/services/ocr.py`	Rewrite	fitz fast path + Tesseract fallback
`backend/app/services/tts.py`	Rewrite	Dual-mode lazy singletons; mode-aware `synthesise` + `synthesise_parallel`
`backend/app/routers/upload.py`	Edit	Accept `mode` form param; thread through pipeline; per-page parallel synthesis
`backend/app/storage.py`	Edit	Store `mode` on job
`backend/Dockerfile`	Edit	Pre-bake both models; add `TRANSFORMERS_CACHE` env vars
`backend/app/main.py`	Edit	Warmup fast model at startup
`backend/app/services/audio_chain.py`	None	Sample-rate-agnostic already

7. Decision Log

#	Decision	Alternatives considered	Rationale
D1	Expose both modes as selectable via `mode` form field	Single model, two endpoints, query param	Form field is consistent with existing `format` param; no URL change; clean enum validation via Pydantic
D2	Default mode = `"fast"`	Default to `"quality"`	Perceived latency is the primary UX metric for a web service. 10x speed-up turns multi-minute waits into tens of seconds for typical PDFs. Users who need highest fidelity opt in explicitly
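D3	Both models in the same Docker image	Two images + API-gateway routing	Single container avoids a routing layer and a second deployment target, and keeps cold-start behaviour identical across modes. Combined model footprint is ~480 MB (80 MB MMS-TTS + 400 MB Coqui); because Coqui is already baked into the current image, the net growth is roughly the ~80 MB MMS-TTS weights plus the transformers package, which is acceptable for self-hosted use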
D4	Independent lazy singletons per model	Shared loader with mode key	Lazy loading means quality model (~400 MB) is never in memory if operator only runs fast jobs, and vice versa. Separate singletons are simpler than a dynamic registry and easier to test in isolation
D5	Use `transformers.VitsModel` directly for MMS-TTS	Route through Coqui TTS wrapper	Coqui 0.22 does not support `facebook/mms-tts-eng` without a custom model config YAML; that approach is fragile across Coqui releases. `transformers` is the canonical Meta-endorsed path
D6	Keep `_chunk()` at 150 chars for quality; new 500-char limit for fast	Unified chunk size	Tacotron2 fixed `max_decoder_steps` constraint does not exist in VITS. Larger fast-mode limit reduces join-point glitches and preserves prosody across longer phrases
D7	Run both modes through the same pedalboard DSP chain	Skip DSP for fast mode to reduce latency	Chain is sample-rate-agnostic and costs ~2–5 ms/segment. Skipping it for fast mode would make the two modes sound audibly inconsistent while saving almost no time; the DSP pass is treated as an explicit quality feature of both modes
D8	`TRANSFORMERS_CACHE=/opt/hf_cache` in Dockerfile	Default `~/.cache`	Build user may differ from runtime user; explicit path guarantees pre-baked weights are found at runtime and the layer is reproducible
D9	Fresh ProcessPoolExecutor per synthesise_parallel call (once per page)	Module-level persistent pool	FastAPI background tasks run in threads; a persistent pool created at module import adds lifecycle and signal-handling complexity inside Docker. The ~50 ms pool startup cost is negligible next to inference time, provided workers do not also have to reload model weights on every call
D10	Cap `max_workers=2` for quality mode	Let it match `cpu_count`	Each quality worker loads Coqui (~400 MB). On a 4-core/1 GB host, 4 workers = 1.6 GB model memory alone. Capping at 2 keeps peak model memory at ~800 MB
D11	`fitz.get_text("text")` as OCR fast path	`pdfplumber`, `pdfminer.six`	pymupdf wraps the C MuPDF library and is typically much faster than the pure-Python extractors; it is already needed for the pixmap OCR fallback, so it is a single dep
D12	Threshold = 50 chars for direct-text path	0, 100, 200	0 would accept every page, including blank and form-only ones, so OCR would never run; 100+ risks sending sparse-but-valid text pages to OCR; 50 is a pragmatic middle
D13	fitz Pixmap at 200 DPI for OCR fallback	pdf2image (poppler) at 300 DPI	Removes poppler as a hard runtime dep for the fast path; 200 DPI is sufficient for Tesseract on standard fonts

8. Assumptions

Docker build environment has outbound internet at build time. Runtime is fully offline per SPEC.
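facebook/mms-tts-eng is available on HuggingFace Hub. Its model card lists the weights under CC-BY-NC 4.0 (non-commercial), not MIT, so confirm that this fits the intended deployment. Coqui Tacotron2-DDC is MPL-2.0 (unchanged).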
If the MMS-TTS tokenizer turns out to phonemize, phonemizer must be installed explicitly (transformers does not pull it in) and it requires espeak-ng at runtime; the existing apt install in the Dockerfile covers espeak-ng.
Deployment host has at least 1 CPU core. synthesise_parallel degrades gracefully to max_workers=1 on a single-core host.
Memory budget: quality model ~400 MB + fast model ~80 MB = ~480 MB static footprint. Both models are warm simultaneously only if mixed-mode requests are served concurrently. For single-mode deployments, only one model is ever loaded.
normalize_for_tts in services/normalize.py needs no changes — it operates on raw strings before model input.
WAV output in fast mode will be 16 kHz PCM; in quality mode 22 kHz PCM. This is a documented, expected difference, not a bug.

9. Risks and Mitigations

Risk	Likelihood	Impact	Mitigation
MMS-TTS quality unacceptable for some sentence structures	Medium	Medium	Evaluate against fixture sentences before shipping; `quality` mode is always available as a fallback
Both models warm simultaneously under mixed-mode concurrent load exceeds memory limit	Low	High	Document memory requirements in ops runbook; add `QUALITY_MAX_WORKERS` env var cap if needed post-deployment
fitz text extraction returns garbled Unicode on PDFs with custom encoding	Medium	Low	`normalize_for_tts` strips most noise; add printable-char ratio guard (>30%) post-testing if needed
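ProcessPoolExecutor fork-safety with torch on Linux	Low	High	Forking after torch has started its worker threads in the parent can hang child processes, so test with a warmed-up parent. If issues appear, create the pool with `mp_context=multiprocessing.get_context("spawn")`; `mp.set_start_method` is meant to be called once in the main process and does not belong in a pool initializer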
`espeak-ng` version mismatch between apt and phonemizer expectations	Low	High	Validate at build time; pin `espeak-ng` apt version if a mismatch is observed
200 DPI OCR fallback misses fine print vs current 300 DPI pdf2image default	Medium	Low	Bump to 300 DPI in the fitz Pixmap call if OCR accuracy is reported degraded; one-integer change
Image size (~480 MB of model weights across both modes) hits CI/CD artifact limits	Low	Medium	Models are in distinct `RUN` layers, so layer caching means rebuilds re-pull weights only when an earlier layer (such as dependencies) changes
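The printable-char ratio guard mentioned for garbled fitz output could look like the sketch below. The 30% cut-off comes from the table; the function names and the choice to count newlines and tabs as acceptable are assumptions.

```python
def printable_ratio(text: str) -> float:
    """Fraction of characters that are printable, or a newline or tab."""
    if not text:
        return 0.0
    ok = sum(1 for ch in text if ch.isprintable() or ch in "\n\t")
    return ok / len(text)


def looks_garbled(text: str, min_ratio: float = 0.30) -> bool:
    """True when the text layer should be rejected in favour of OCR."""
    return printable_ratio(text) <= min_ratio
```

A page flagged by `looks_garbled` would be routed to the Tesseract fallback in the same way as a page below `TEXT_LAYER_MIN_CHARS`.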

10. Open Questions

Does facebook/mms-tts-eng produce acceptable audio quality for the target use case, or does it need evaluation against fixture sentences before committing to the implementation?
Should TEXT_LAYER_MIN_CHARS = 50 be an env-var override (like MAX_UPLOAD_MB) or is a hardcoded constant sufficient?
Is there a soft memory limit on the Docker container in the target deployment? Determines whether QUALITY_MAX_WORKERS should be env-configurable.
Should the 202 response echo mode, or is it sufficient to retrieve it via GET /jobs/{job_id}?
Should pdf2image / poppler-utils be removed now that fitz Pixmap covers the OCR fallback, or deferred to a follow-up cleanup PR?
