Redesign: Dual Extraction Path + Dual TTS Mode + Parallel Synthesis
Status: Design (not yet implemented)
Date: 2026-05-18
Scope: Three targeted improvements to OCR path, TTS model selection, and synthesis parallelism. No other changes.
1. Architecture Diagram

    POST /upload (file, format, mode)
            │
            ▼
    upload.py: _run_pipeline(job_id, tmp_path, fmt, mode)
            │
            ├─── [A] ocr.py: extract_pages(pdf_path)
            │         │
            │         ├── fitz.open(pdf_path)
            │         │     └── page.get_text("text") ──► char_count >= 50?
            │         │                                     YES → use direct text (fast path)
            │         │                                     NO  → rasterize page → Tesseract (fallback)
            │         │
            │         └── normalize_for_tts() + sentence split (unchanged)
            │
            ├─── [B] tts.py: synthesise_parallel(sentences, mode)
            │         │
            │         │   mode="quality"                    mode="fast"
            │         │        │                                 │
            │         │   Coqui Tacotron2-DDC               facebook/mms-tts-eng
            │         │   (22 kHz, chunk ≤150 chars)        (16 kHz, chunk ≤500 chars)
            │         │        │                                 │
            │         │        └────────────────┬────────────────┘
            │         │                         │
            │         │                ProcessPoolExecutor
            │         │                (max_workers = min(cpu_count, n_sentences))
            │         │                ordered list[np.ndarray]
            │         │
            └─── audio_chain.py: process_and_export(segments, sample_rate, fmt)
                      │
                      pedalboard DSP chain (sample_rate from tts.get_sample_rate(mode))
                      └── MP3 / WAV bytes

2. API Contract
POST /upload (multipart/form-data)
| Field | Type | Default | Notes |
|---|---|---|---|
| `file` | file | required | PDF, validated via `%PDF` magic bytes |
| `format` | `"mp3"` \| `"wav"` | `"mp3"` | Export format |
| `mode` | `"fast"` \| `"quality"` | `"fast"` | TTS model selection |
Response (202):
{ "job_id": "...", "status": "queued", "mode": "fast" }
The mode field is echoed back so callers can confirm which path was accepted.
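
The `%PDF` check from the table can be very small. The sketch below is illustrative only: the function name `looks_like_pdf` is hypothetical, and it assumes the caller has already read the first few bytes of the upload.

```python
PDF_MAGIC = b"%PDF"  # a PDF header starts with these four bytes


def looks_like_pdf(head: bytes) -> bool:
    """Return True if the leading bytes of an upload carry the PDF signature.

    `head` is assumed to be the first bytes read from the uploaded file;
    an empty or truncated upload simply fails the check.
    """
    return head.startswith(PDF_MAGIC)
```

A router could call this on the first chunk of the upload and reject the request before a job is created.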
3. TTS Mode Comparison
| Property | `quality` | `fast` |
|---|---|---|
| Model | `tts_models/en/ljspeech/tacotron2-DDC` (Coqui) | `facebook/mms-tts-eng` (HuggingFace VITS) |
| Library | `TTS==0.22.*` | `transformers>=4.41.0` |
| Sample rate | 22 050 Hz | 16 000 Hz |
| Relative speed (CPU) | ~1x real-time | ~10x real-time |
| Char limit per chunk | 150 (Tacotron2 fixed decoder steps) | 500 (VITS has no hard decoder step limit) |
| Model size on disk | ~400 MB (model + vocoder) | ~80 MB |
| Output naturalness | Higher; richer prosody | Adequate; intelligible |
| DSP chain (pedalboard) | Yes | Yes |
Both modes pass through the same audio_chain.process_and_export. DSP parameters
(EQ, compression, high-pass, gain) and export logic are shared and sample-rate-agnostic.
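
One way to keep the per-mode constants from the table in a single place is a small lookup keyed by mode. This is a sketch under assumptions: `ModeProfile` and `PROFILES` are hypothetical names, and only the model identifiers, sample rates, and chunk limits come from the table above.

```python
from dataclasses import dataclass
from enum import Enum


class TtsMode(str, Enum):
    fast = "fast"
    quality = "quality"


@dataclass(frozen=True)
class ModeProfile:
    """Hypothetical container for the per-mode constants in the comparison table."""
    model_id: str
    sample_rate: int  # Hz
    max_chars: int    # chunk limit per synthesis call


PROFILES = {
    TtsMode.quality: ModeProfile("tts_models/en/ljspeech/tacotron2-DDC", 22050, 150),
    TtsMode.fast: ModeProfile("facebook/mms-tts-eng", 16000, 500),
}


def get_sample_rate(mode: TtsMode = TtsMode.fast) -> int:
    """Return the output sample rate for a mode (16000 for fast, 22050 for quality)."""
    return PROFILES[mode].sample_rate
```

Because `TtsMode` subclasses `str`, `TtsMode("fast")` parses the raw form value, and an unknown value raises `ValueError`.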
4. Data Flow: Two PDF Extraction Paths

    For each page in fitz document:
        raw_text = page.get_text("text").strip()
        if len(raw_text) >= TEXT_LAYER_MIN_CHARS:   # fast path
            text = raw_text
        else:                                        # fallback
            pixmap = page.get_pixmap(dpi=200)
            image = PIL.Image.frombytes(...)
            text = pytesseract.image_to_string(image)
        sentences = split_and_normalize(text)

The fallback rasterizes with the fitz Pixmap directly, so no poppler/pdf2image call
is needed for it. pdf2image remains in requirements as a fallback dependency.
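
The per-page decision can be isolated from fitz and Tesseract so that it is testable without either. The following is a minimal sketch, not the planned implementation: `choose_page_text` is a hypothetical helper, and the OCR step is passed in as a callable so it only runs when the text layer is too short.

```python
from typing import Callable, Tuple

TEXT_LAYER_MIN_CHARS = 50  # threshold from this design


def choose_page_text(raw_text: str, ocr_fallback: Callable[[], str]) -> Tuple[str, str]:
    """Return (text, method) for one page.

    raw_text     -- what page.get_text("text") returned for the page
    ocr_fallback -- zero-argument callable that rasterizes and OCRs the page;
                    it is invoked only when the text layer is too short
    """
    stripped = raw_text.strip()
    if len(stripped) >= TEXT_LAYER_MIN_CHARS:
        return stripped, "text-layer"  # fast path: trust the embedded text
    return ocr_fallback(), "ocr"       # fallback: pay for rasterize + OCR
```

Returning the method name alongside the text makes the per-page DEBUG logging a one-liner at the call site.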
5. Component Changes
5.1 backend/app/models/schemas.py: add TtsMode

    from enum import Enum

    class TtsMode(str, Enum):
        fast = "fast"
        quality = "quality"

Add mode: TtsMode = TtsMode.fast to the Job model for observability and so the
pipeline can retrieve the original choice during background processing.
5.2 backend/requirements.txt
| Action | Package | Notes |
|---|---|---|
| Add | `pymupdf>=1.24.0` | fitz bindings; PyPI name is `pymupdf` |
| Add | `transformers>=4.41.0` | HuggingFace VITS for MMS-TTS fast mode |
| Add | `scipy>=1.13.0` | transformers TTS output utilities |
| Keep | `TTS==0.22.*` | Coqui TTS required for quality mode |
| Keep | `torch==2.5.1` | both models share the same CPU-only torch install |
| Keep | `pdf2image==1.17.*` | OCR fallback path |
| Keep | `pytesseract==0.3.*` | OCR fallback path |
5.3 backend/app/services/tts.py: full rewrite
Two independent lazy singletons, one per model. Each loads only on first use for
that mode, so an operator running only fast jobs never pays the ~400 MB quality
model memory cost.
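
The lazy-singleton idea can be sketched without either TTS library. Everything named here is an assumption for illustration: `LazySingleton`, `LOAD_LOG`, and the two loader functions are hypothetical, and the string return values stand in for the real model objects. A lock is included because background tasks may run in threads.

```python
import threading
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazySingleton(Generic[T]):
    """Hypothetical helper: builds a value on first get() and reuses it afterwards."""

    def __init__(self, loader: Callable[[], T]) -> None:
        self._loader = loader
        self._value: Optional[T] = None
        self._lock = threading.Lock()  # guards the first load across threads

    @property
    def loaded(self) -> bool:
        return self._value is not None

    def get(self) -> T:
        if self._value is None:
            with self._lock:
                if self._value is None:  # re-check after acquiring the lock
                    self._value = self._loader()
        return self._value


LOAD_LOG: list = []  # demo only: records which loaders actually ran


def _load_quality() -> str:
    LOAD_LOG.append("quality")
    return "quality-model"  # stand-in for the Coqui TTS object


def _load_fast() -> str:
    LOAD_LOG.append("fast")
    return "fast-model"  # stand-in for the VitsModel and tokenizer


quality_tts = LazySingleton(_load_quality)
fast_tts = LazySingleton(_load_fast)
```

Calling `fast_tts.get()` any number of times runs `_load_fast` once and never touches the quality loader, which is the property the design relies on for fast-only deployments.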
Public interface:

    def get_sample_rate(mode: TtsMode = TtsMode.fast) -> int:
        # returns 22050 for "quality", 16000 for "fast"
        ...

    def synthesise(sentence: str, mode: TtsMode = TtsMode.fast) -> np.ndarray:
        # routes to _synthesise_quality or _synthesise_fast
        ...

    def synthesise_parallel(
        sentences: list[str], mode: TtsMode = TtsMode.fast
    ) -> list[np.ndarray]:
        # ProcessPoolExecutor; preserves order
        ...

Internal details:
- `_quality_tts: TTS | None`: Coqui singleton, loaded via `TTS("tts_models/en/ljspeech/tacotron2-DDC", gpu=False)`.
- `_fast_model: VitsModel | None`, `_fast_tokenizer: AutoTokenizer | None`: MMS-TTS singletons loaded from `"facebook/mms-tts-eng"`.
- `synthesise` quality path: identical to current code. `_chunk()` at `_MAX_CHARS = 150`, `tts.tts(text=c)` per chunk, concatenate.
- `synthesise` fast path: tokenize → forward pass `model(**inputs)` under `torch.no_grad()` → `.waveform.squeeze().numpy()` as `float32`. (`VitsModel` returns the waveform from its forward call; it is not driven through `generate()`.) Chunks only as a safety guard at `_MAX_CHARS_FAST = 500`; VITS handles long inputs natively.
- `len(cleaned) < 3` skip guard applies to both paths.
- `synthesise_parallel`: filters empty sentences, then `ProcessPoolExecutor(max_workers=min(cpu_count, len(sentences)))` with `executor.map(partial(synthesise, mode=mode), filtered, chunksize=1)`. Pool is created fresh per call (see D9 below).
- For `quality` mode, cap `max_workers` at `min(cpu_count, 2)` to prevent OOM from multiple workers each loading the ~400 MB Coqui model.
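
The worker cap and the order-preserving map can be sketched as follows. This is an illustration under assumptions, not the planned module: `pick_max_workers` and the `executor_factory` parameter are hypothetical, `mode` is a plain string to keep the block self-contained, and the `synthesise` callable is injected so the sketch runs without any TTS model. With the default `ProcessPoolExecutor`, the callable must be picklable (a module-level function).

```python
import os
from concurrent.futures import Executor, ProcessPoolExecutor
from functools import partial
from typing import Callable, List, Optional, Sequence

QUALITY_MAX_WORKERS = 2  # cap for quality mode, per D10


def pick_max_workers(mode: str, n_sentences: int, cpu_count: Optional[int] = None) -> int:
    """min(cpu_count, n_sentences), further capped at 2 for quality mode; never below 1."""
    cpus = cpu_count or os.cpu_count() or 1
    workers = min(cpus, n_sentences)
    if mode == "quality":
        workers = min(workers, QUALITY_MAX_WORKERS)
    return max(1, workers)


def synthesise_parallel(
    sentences: Sequence[str],
    synthesise: Callable[..., object],
    mode: str = "fast",
    executor_factory: Callable[..., Executor] = ProcessPoolExecutor,
) -> List[object]:
    """Synthesise sentences in parallel and return results in input order."""
    filtered = [s for s in sentences if s.strip()]  # drop empty sentences
    if not filtered:
        return []
    workers = pick_max_workers(mode, len(filtered))
    # A fresh pool per call; Executor.map yields results in submission order.
    with executor_factory(max_workers=workers) as pool:
        return list(pool.map(partial(synthesise, mode=mode), filtered, chunksize=1))
```

Passing `ThreadPoolExecutor` as `executor_factory` is a convenient way to exercise the ordering and filtering logic in unit tests without spawning processes.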
5.4 backend/app/services/ocr.py: full rewrite
Signature unchanged; callers in upload.py are unaffected.

    TEXT_LAYER_MIN_CHARS: int = 50

    def extract_pages(pdf_path: Path) -> list[list[str]]: ...

- Import `fitz` (pymupdf).
- For each page: `page.get_text("text")`. If `len(stripped) >= TEXT_LAYER_MIN_CHARS`, use it. Otherwise convert via fitz `Pixmap` at 200 DPI → PIL Image → Tesseract.
- Log extraction method per page at DEBUG level.
Threshold rationale: 50 chars filters blank pages and fitz stray whitespace from form fields. Low enough to avoid false OCR triggers on dense pages. Module-level constant so tests can override without monkey-patching.
5.5 backend/app/routers/upload.py: targeted edits
- Accept `mode: TtsMode = Form(TtsMode.fast)` alongside existing `format` field.
- Pass `mode` to `storage.create_job` and `_run_pipeline`.
- Inside the page loop, call `tts.synthesise_parallel(sentences, mode=mode)` instead of iterating `tts.synthesise` per sentence.
- Pass `mode` to `tts.get_sample_rate(mode)` when calling `audio_chain.process_and_export`.
- Echo `mode` in the 202 response body.
Pause check granularity changes from per-sentence to per-page (one check before
synthesise_parallel). Per-page granularity is acceptable; intra-page pause would
require IPC into worker processes.
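
The per-page pause granularity amounts to one check at the top of the page loop. The sketch below uses injected callables so it runs standalone; `run_pages`, `is_paused`, and `wait_until_resumed` are hypothetical names, and how the real pipeline stores or signals the paused state is not specified in this design.

```python
from typing import Callable, Iterable, List, Sequence


def run_pages(
    pages: Iterable[Sequence[str]],
    synthesise_page: Callable[[Sequence[str]], List[object]],
    is_paused: Callable[[], bool],
    wait_until_resumed: Callable[[], None],
) -> List[object]:
    """Synthesise page by page, checking for pause once per page."""
    segments: List[object] = []
    for sentences in pages:
        # Single pause check per page, before the parallel synthesis call.
        if is_paused():
            wait_until_resumed()
        # All sentences of the page are synthesised together; no check in between.
        segments.extend(synthesise_page(sentences))
    return segments
```

A pause requested while a page is being synthesised therefore takes effect only when the next page starts.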
5.6 backend/app/storage.py: minor
create_job accepts mode: TtsMode and stores it on the Job object.
5.7 backend/Dockerfile: pre-bake both models

    # Pre-bake Coqui Tacotron2-DDC (quality mode); existing line, unchanged
    RUN python -c "from TTS.api import TTS; TTS('tts_models/en/ljspeech/tacotron2-DDC', gpu=False)"
    # Pre-bake facebook/mms-tts-eng (fast mode); new
    ENV TRANSFORMERS_CACHE=/opt/hf_cache
    ENV HF_HOME=/opt/hf_cache
    # Kept on one line: a Dockerfile instruction ends at an unescaped newline, even inside quotes
    RUN python -c "from transformers import VitsModel, AutoTokenizer; VitsModel.from_pretrained('facebook/mms-tts-eng'); AutoTokenizer.from_pretrained('facebook/mms-tts-eng')"

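TRANSFORMERS_CACHE and HF_HOME are set explicitly so the cache is at a known
path regardless of which user runs the container, and the layer is deterministic.
Recent transformers releases deprecate TRANSFORMERS_CACHE in favour of HF_HOME, so
HF_HOME is the variable to rely on long-term.
espeak-ng apt package must remain; per this design, the MMS-TTS tokenizer relies on it
for phonemization via the phonemizer dependency. Whether phonemization is active depends
on the checkpoint's tokenizer config, so this should be confirmed at build time.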
5.8 backend/app/main.py: one-line edit
The lifespan warmup should call tts.get_sample_rate(TtsMode.fast) to eagerly load
the fast model (default) at startup; this relies on get_sample_rate triggering the
lazy load for the requested mode rather than returning a bare constant. The quality
model loads lazily on the first quality request, which is acceptable given its larger
footprint.
5.9 backend/app/services/audio_chain.py: no change
The pedalboard chain is sample-rate-agnostic: all filter frequencies and compressor
times are in Hz / ms, not in samples. Callers already supply sample_rate; they
will now source it from tts.get_sample_rate(mode).
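
Why the correct `sample_rate` still has to be passed can be seen from how physical units translate into samples. This is a small arithmetic sketch, not pedalboard code: both helper names are hypothetical, and the 100 ms and 80 Hz figures in the usage note are illustrative values, not parameters taken from the existing chain.

```python
def ms_to_samples(duration_ms: float, sample_rate: int) -> int:
    """Number of samples covered by a duration given in milliseconds."""
    return round(duration_ms * sample_rate / 1000)


def normalized_cutoff(cutoff_hz: float, sample_rate: int) -> float:
    """Filter cutoff as a fraction of the Nyquist frequency (sample_rate / 2)."""
    nyquist = sample_rate / 2
    if not 0 < cutoff_hz < nyquist:
        raise ValueError("cutoff must lie between 0 Hz and the Nyquist frequency")
    return cutoff_hz / nyquist
```

For example, a 100 ms compressor time spans 2205 samples at 22 050 Hz but 1600 samples at 16 000 Hz, and an 80 Hz high-pass sits at a different fraction of Nyquist in each mode. Parameters expressed in Hz and ms stay valid across modes only because the chain is told the true sample rate.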
6. File Change Summary
| File | Change type | Summary |
|---|---|---|
| `backend/app/models/schemas.py` | Edit | Add `TtsMode` enum; add `mode` field to Job |
| `backend/requirements.txt` | Edit | Add pymupdf, transformers, scipy; keep TTS for quality mode |
| `backend/app/services/ocr.py` | Rewrite | fitz fast path + Tesseract fallback |
| `backend/app/services/tts.py` | Rewrite | Dual-mode lazy singletons; mode-aware `synthesise` + `synthesise_parallel` |
| `backend/app/routers/upload.py` | Edit | Accept `mode` form param; thread through pipeline; per-page parallel synthesis |
| `backend/app/storage.py` | Edit | Store mode on job |
| `backend/Dockerfile` | Edit | Pre-bake both models; add `TRANSFORMERS_CACHE` and `HF_HOME` env vars |
| `backend/app/main.py` | Edit | Warmup fast model at startup |
| `backend/app/services/audio_chain.py` | None | Sample-rate-agnostic already |
7. Decision Log
| # | Decision | Alternatives considered | Rationale |
|---|---|---|---|
| D1 | Expose both modes as selectable via `mode` form field | Single model, two endpoints, query param | Form field is consistent with existing `format` param; no URL change; clean enum validation via Pydantic |
| D2 | Default `mode = "fast"` | Default to `"quality"` | Perceived latency is the primary UX metric for a web service. The ~10x speed-up turns multi-minute waits into tens of seconds for typical PDFs. Users who need highest fidelity opt in explicitly |
| D3 | Both models in the same Docker image | Two images + API-gateway routing | Single container avoids a routing layer, halves deployment complexity, and keeps cold-start behaviour identical across modes. Combined image growth ~480 MB (80 MB MMS-TTS + 400 MB Coqui) is acceptable for self-hosted use |
| D4 | Independent lazy singletons per model | Shared loader with mode key | Lazy loading means quality model (~400 MB) is never in memory if operator only runs fast jobs, and vice versa. Separate singletons are simpler than a dynamic registry and easier to test in isolation |
| D5 | Use `transformers.VitsModel` directly for MMS-TTS | Route through Coqui TTS wrapper | Coqui 0.22 does not support `facebook/mms-tts-eng` without a custom model config YAML; that approach is fragile across Coqui releases. transformers is the canonical Meta-endorsed path |
| D6 | Keep `_chunk()` at 150 chars for quality; new 500-char limit for fast | Unified chunk size | Tacotron2 fixed `max_decoder_steps` constraint does not exist in VITS. Larger fast-mode limit reduces join-point glitches and preserves prosody across longer phrases |
| D7 | Run both modes through the same pedalboard DSP chain | Skip DSP for fast mode to reduce latency | Chain is sample-rate-agnostic and costs ~2-5 ms/segment. Removing it for fast mode would make the output audibly inconsistent between modes. It is an explicit quality differentiator |
| D8 | `TRANSFORMERS_CACHE=/opt/hf_cache` in Dockerfile | Default `~/.cache` | Build user may differ from runtime user; explicit path guarantees pre-baked weights are found at runtime and the layer is reproducible |
| D9 | Fresh `ProcessPoolExecutor` per job | Module-level persistent pool | FastAPI background tasks run in threads; a persistent pool created at module import adds lifecycle and signal-handling complexity inside Docker. 50 ms pool startup cost is negligible vs inference time |
| D10 | Cap `max_workers=2` for quality mode | Let it match `cpu_count` | Each quality worker loads Coqui (~400 MB). On a 4-core/1 GB host, 4 workers = 1.6 GB model memory alone. Capping at 2 keeps peak model memory at ~800 MB |
| D11 | fitz `get_text("text")` as OCR fast path | `pdfplumber`, `pdfminer.six` | pymupdf wraps the C library MuPDF and is among the fastest PDF text extractors available from Python; already needed for the pixmap OCR fallback so it is a single dep |
| D12 | Threshold = 50 chars for direct-text path | 0, 100, 200 | 0 takes the direct-text path even on form-only pages; 100+ risks falling back to OCR on sparse-but-valid text pages; 50 is a pragmatic middle |
| D13 | fitz Pixmap at 200 DPI for OCR fallback | pdf2image (poppler) at 300 DPI | Removes poppler as a hard runtime dep for the fast path; 200 DPI is sufficient for Tesseract on standard fonts |
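
The effect of the two chunk limits in D6 can be illustrated with a simple word-packing splitter. This is a sketch only: the existing `_chunk()` is not shown in this document, so `chunk_text` is a hypothetical stand-in that greedily packs whitespace-separated words and hard-splits any single word longer than the limit.

```python
from typing import List


def chunk_text(text: str, max_chars: int) -> List[str]:
    """Split text into chunks of at most max_chars, breaking at whitespace where possible."""
    chunks: List[str] = []
    current = ""
    for word in text.split():
        # A single over-long word cannot be packed; cut it into max_chars pieces.
        while len(word) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:max_chars])
            word = word[max_chars:]
        candidate = word if not current else f"{current} {word}"
        if len(candidate) <= max_chars:
            current = candidate      # word still fits in the open chunk
        else:
            chunks.append(current)   # close the chunk and start a new one
            current = word
    if current:
        chunks.append(current)
    return chunks
```

A 299-character sentence stays in one piece at the fast-mode limit of 500 but is split in two at the quality-mode limit of 150, which is where the extra join points in quality mode come from.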
8. Assumptions
- Docker build environment has outbound internet at build time. Runtime is fully offline per SPEC.
- `facebook/mms-tts-eng` is available on HuggingFace Hub. Its model card lists the weights under CC-BY-NC 4.0 (non-commercial), not MIT, so the license should be checked against the intended deployment. Coqui Tacotron2-DDC is MPL-2.0 (unchanged).
- `phonemizer` (pulled by transformers for MMS-TTS) requires `espeak-ng` at runtime; the existing apt install in Dockerfile covers this.
- Deployment host has at least 1 CPU core. `synthesise_parallel` degrades gracefully to `max_workers=1` on a single-core host.
- Memory budget: quality model ~400 MB + fast model ~80 MB = ~480 MB static footprint. Both models are warm simultaneously only if mixed-mode requests are served concurrently. For single-mode deployments, only one model is ever loaded.
- `normalize_for_tts` in `services/normalize.py` needs no changes; it operates on raw strings before model input.
- WAV output in fast mode will be 16 kHz PCM; in quality mode 22.05 kHz PCM. This is a documented, expected difference, not a bug.
9. Risks and Mitigations
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| MMS-TTS quality unacceptable for some sentence structures | Medium | Medium | Evaluate against fixture sentences before shipping; quality mode is always available as a fallback |
| Both models warm simultaneously under mixed-mode concurrent load exceeds memory limit | Low | High | Document memory requirements in ops runbook; add `QUALITY_MAX_WORKERS` env var cap if needed post-deployment |
| fitz text extraction returns garbled Unicode on PDFs with custom encoding | Medium | Low | `normalize_for_tts` strips most noise; add printable-char ratio guard (>30%) post-testing if needed |
| `ProcessPoolExecutor` fork-safety with torch on Linux | Low | High | CPU-only torch generally tolerates fork when models are loaded inside the workers; if hangs appear, create the pool with a spawn context, e.g. `ProcessPoolExecutor(mp_context=multiprocessing.get_context("spawn"))` (the start method cannot be changed from inside a pool initializer) |
| `espeak-ng` version mismatch between apt and phonemizer expectations | Low | High | Validate at build time; pin `espeak-ng` apt version if a mismatch is observed |
| 200 DPI OCR fallback misses fine print vs current 300 DPI pdf2image default | Medium | Low | Bump to 300 DPI in the fitz Pixmap call if OCR accuracy is reported degraded; one-integer change |
| Image size increase (~480 MB for both models) hits CI/CD artifact limits | Low | Medium | Models are in distinct RUN layers; layer caching means rebuilds only re-pull on dep changes |
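
The printable-character ratio guard mentioned in the mitigations could look like the sketch below. It is one possible reading of that guard, not a settled design: the helper names are hypothetical, "printable" is taken to mean `str.isprintable()` plus newline and tab, and the 0.30 threshold mirrors the ">30%" figure in the table.

```python
def printable_ratio(text: str) -> float:
    """Fraction of characters that are printable (newline and tab count as printable)."""
    if not text:
        return 0.0
    printable = sum(1 for ch in text if ch.isprintable() or ch in "\n\t")
    return printable / len(text)


def text_layer_is_usable(text: str, min_ratio: float = 0.30) -> bool:
    """True if the extracted text layer looks clean enough to skip OCR."""
    return printable_ratio(text) > min_ratio
```

A page failing this check would be routed to the Tesseract fallback even when its text layer is longer than the character threshold. Note that this heuristic only catches control and unassigned characters; text that is printable but wrongly mapped would still pass.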
10. Open Questions
- Does `facebook/mms-tts-eng` produce acceptable audio quality for the target use case, or does it need evaluation against fixture sentences before committing to the implementation?
- Should `TEXT_LAYER_MIN_CHARS = 50` be an env-var override (like `MAX_UPLOAD_MB`) or is a hardcoded constant sufficient?
- Is there a soft memory limit on the Docker container in the target deployment? Determines whether `QUALITY_MAX_WORKERS` should be env-configurable.
- Should the 202 response echo `mode`, or is it sufficient to retrieve it via `GET /jobs/{job_id}`?
- Should `pdf2image`/`poppler-utils` be removed now that fitz Pixmap covers the OCR fallback, or deferred to a follow-up cleanup PR?