| # Redesign: Dual Extraction Path + Dual TTS Mode + Parallel Synthesis | |
| **Status:** Design, not yet implemented | |
| **Date:** 2026-05-18 | |
| **Scope:** Three targeted improvements to OCR path, TTS model selection, and synthesis parallelism. No other changes. | |
| --- | |
| ## 1. Architecture Diagram | |
| ``` | |
| POST /upload (file, format, mode) | |
|     │ | |
|     ▼ | |
| upload.py: _run_pipeline(job_id, tmp_path, fmt, mode) | |
|     │ | |
|     ├─── [A] ocr.py: extract_pages(pdf_path) | |
|     │      │ | |
|     │      ├── fitz.open(pdf_path) | |
|     │      │     └── page.get_text("text") ──► char_count >= 50? | |
|     │      │           YES → use direct text (fast path) | |
|     │      │           NO  → rasterize page → Tesseract (fallback) | |
|     │      │ | |
|     │      └── normalize_for_tts() + sentence split (unchanged) | |
|     │ | |
|     ├─── [B] tts.py: synthesise_parallel(sentences, mode) | |
|     │      │ | |
|     │      │   mode="quality"                 mode="fast" | |
|     │      │       │                              │ | |
|     │      │   Coqui Tacotron2-DDC            facebook/mms-tts-eng | |
|     │      │   (22 kHz, chunk ≤150 chars)     (16 kHz, chunk ≤500 chars) | |
|     │      │       │                              │ | |
|     │      │       └──────────────┬───────────────┘ | |
|     │      │                      │ | |
|     │      │            ProcessPoolExecutor | |
|     │      │            (max_workers = min(cpu_count, n_sentences)) | |
|     │      │            ordered list[np.ndarray] | |
|     │      │ | |
|     └─── audio_chain.py: process_and_export(segments, sample_rate, fmt) | |
|            │ | |
|        pedalboard DSP chain (sample_rate from tts.get_sample_rate(mode)) | |
|            └── MP3 / WAV bytes | |
| ``` | |
| --- | |
| ## 2. API Contract | |
| ### `POST /upload` (multipart/form-data) | |
| | Field | Type | Default | Notes | | |
| |---|---|---|---| | |
| | `file` | file | required | PDF, validated via `%PDF` magic bytes | | |
| | `format` | `"mp3" \| "wav"` | `"mp3"` | Export format | | |
| | `mode` | `"fast" \| "quality"` | `"fast"` | TTS model selection | | |
| Response (202): | |
| ```json | |
| { "job_id": "...", "status": "queued", "mode": "fast" } | |
| ``` | |
| The `mode` field is echoed back so callers can confirm which path was accepted. | |
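The contract above can be sketched without any web framework. This is only an illustration of the validation rules in the table: `validate_upload`, `accepted_response`, and `ExportFormat` are hypothetical names, and the `uuid`-based `job_id` is an assumption, since the design does not specify the ID format. The real route would presumably rely on FastAPI `Form` fields for the enum checks.

```python
from enum import Enum
import uuid


class TtsMode(str, Enum):
    fast = "fast"
    quality = "quality"


class ExportFormat(str, Enum):  # hypothetical enum for the `format` field
    mp3 = "mp3"
    wav = "wav"


def validate_upload(
    data: bytes, fmt: str = "mp3", mode: str = "fast"
) -> tuple[ExportFormat, TtsMode]:
    """Apply the table's rules: %PDF magic bytes, then enum-checked format and mode."""
    if not data.startswith(b"%PDF"):
        raise ValueError("file is not a PDF (missing %PDF magic bytes)")
    # Enum lookup by value raises ValueError for anything outside the allowed set.
    return ExportFormat(fmt), TtsMode(mode)


def accepted_response(mode: TtsMode) -> dict[str, str]:
    """Build the 202 body, echoing the accepted mode back to the caller."""
    # Assumption: job IDs are random hex strings; the design leaves this open.
    return {"job_id": uuid.uuid4().hex, "status": "queued", "mode": mode.value}
```

Omitting `mode` yields `TtsMode.fast`, matching the default in the table; an unknown value such as `"turbo"` is rejected before any job is created.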
| --- | |
| ## 3. TTS Mode Comparison | |
| | Property | `quality` | `fast` | | |
| |---|---|---| | |
| | Model | `tts_models/en/ljspeech/tacotron2-DDC` (Coqui) | `facebook/mms-tts-eng` (HuggingFace VITS) | | |
| | Library | `TTS==0.22.*` | `transformers>=4.41.0` | | |
| | Sample rate | 22 050 Hz | 16 000 Hz | | |
| | Relative speed (CPU) | ~1x real-time | ~10x real-time | | |
| | Char limit per chunk | 150 (Tacotron2 fixed decoder steps) | 500 (VITS has no hard decoder step limit) | | |
| | Model size on disk | ~400 MB (model + vocoder) | ~80 MB | | |
| | Output naturalness | Higher; richer prosody | Adequate; intelligible | | |
| | DSP chain (pedalboard) | Yes | Yes | | |
| Both modes pass through the same `audio_chain.process_and_export`. DSP parameters | |
| (EQ, compression, high-pass, gain) and export logic are shared and sample-rate-agnostic. | |
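To make the per-mode limits concrete, the sketch below stores the sample rate and chunk limit from the table and applies a simple word-boundary chunker. The existing `_chunk()` helper is not shown in this design, so `chunk` here is a hypothetical stand-in, and `ModeParams` / `MODE_PARAMS` are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModeParams:
    sample_rate: int  # Hz
    max_chars: int    # per-chunk character limit


# Values copied from the comparison table.
MODE_PARAMS = {
    "quality": ModeParams(sample_rate=22_050, max_chars=150),
    "fast": ModeParams(sample_rate=16_000, max_chars=500),
}


def chunk(text: str, max_chars: int) -> list[str]:
    """Greedily pack whitespace-separated words into chunks of at most max_chars.

    A single word longer than max_chars is hard-split so the limit always holds.
    """
    chunks: list[str] = []
    current = ""
    for word in text.split():
        while len(word) > max_chars:  # pathological token: hard split
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:max_chars])
            word = word[max_chars:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
```

With these limits, a 499-character sentence stays in one piece in `fast` mode but is split into several chunks in `quality` mode, which is where the join points mentioned in the decision log come from.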
| --- | |
| ## 4. Data Flow: Two PDF Extraction Paths | |
| ``` | |
| For each page in fitz document: | |
|     raw_text = page.get_text("text").strip() | |
|     if len(raw_text) >= TEXT_LAYER_MIN_CHARS:  # fast path | |
|         text = raw_text | |
|     else:                                      # fallback | |
|         pixmap = page.get_pixmap(dpi=200) | |
|         image = PIL.Image.frombytes(...) | |
|         text = pytesseract.image_to_string(image) | |
|     sentences = split_and_normalize(text) | |
| ``` | |
| The fallback rasterizes with the fitz `Pixmap` directly, so neither path needs | |
| poppler/pdf2image. pdf2image stays in requirements for now; its removal is Open Question 5. | |
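A runnable version of the per-page decision might look like the following. It is a sketch under assumptions: `extract_page_text`, `ocr_with_tesseract`, and `PageLike` are hypothetical names, the OCR step is injected so the branch logic can be exercised without Tesseract installed, and the `Image.frombytes` arguments are the usual PyMuPDF-to-PIL conversion rather than something this design specifies.

```python
from typing import Callable, Protocol

TEXT_LAYER_MIN_CHARS = 50


class PageLike(Protocol):
    """The single fitz.Page method the decision logic relies on."""

    def get_text(self, option: str) -> str: ...


def ocr_with_tesseract(page) -> str:
    """Rasterize a fitz page at 200 DPI and OCR it (imports deferred: optional deps)."""
    from PIL import Image
    import pytesseract

    pixmap = page.get_pixmap(dpi=200)  # RGB without alpha by default
    image = Image.frombytes("RGB", (pixmap.width, pixmap.height), pixmap.samples)
    return pytesseract.image_to_string(image)


def extract_page_text(
    page: PageLike,
    ocr: Callable[[PageLike], str] = ocr_with_tesseract,
    min_chars: int = TEXT_LAYER_MIN_CHARS,
) -> tuple[str, str]:
    """Return (text, method), where method is "text-layer" or "ocr"."""
    raw_text = page.get_text("text").strip()
    if len(raw_text) >= min_chars:
        return raw_text, "text-layer"
    return ocr(page), "ocr"
```

Returning the method alongside the text gives the caller something to log per page, and passing `ocr=` as a parameter lets a test substitute a stub for Tesseract.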
| --- | |
| ## 5. Component Changes | |
| ### 5.1 `backend/app/models/schemas.py`: add `TtsMode` | |
| ```python | |
| from enum import Enum | |
| class TtsMode(str, Enum): | |
|     fast = "fast" | |
|     quality = "quality" | |
| ``` | |
| Add `mode: TtsMode = TtsMode.fast` to the `Job` model for observability and so the | |
| pipeline can retrieve the original choice during background processing. | |
| ### 5.2 `backend/requirements.txt` | |
| | Action | Package | Notes | | |
| |--------|---------|-------| | |
| | Add | `pymupdf>=1.24.0` | fitz bindings; PyPI name is `pymupdf` | | |
| | Add | `transformers>=4.41.0` | HuggingFace VITS for MMS-TTS fast mode | | |
| | Add | `scipy>=1.13.0` | transformers TTS output utilities | | |
| | Keep | `TTS==0.22.*` | Coqui TTS required for `quality` mode | | |
| | Keep | `torch==2.5.1` | both models share the same CPU-only torch install | | |
| | Keep | `pdf2image==1.17.*` | OCR fallback path | | |
| | Keep | `pytesseract==0.3.*` | OCR fallback path | | |
| ### 5.3 `backend/app/services/tts.py`: full rewrite | |
| Two independent lazy singletons, one per model. Each loads only on first use for | |
| that mode, so an operator running only `fast` jobs never pays the ~400 MB quality | |
| model memory cost. | |
| **Public interface:** | |
| ```python | |
| def get_sample_rate(mode: TtsMode = TtsMode.fast) -> int: | |
|     """Return 22050 for "quality", 16000 for "fast".""" | |
|     ... | |
| def synthesise(sentence: str, mode: TtsMode = TtsMode.fast) -> np.ndarray: | |
|     """Route to _synthesise_quality or _synthesise_fast.""" | |
|     ... | |
| def synthesise_parallel( | |
|     sentences: list[str], mode: TtsMode = TtsMode.fast | |
| ) -> list[np.ndarray]: | |
|     """Run synthesise in a ProcessPoolExecutor; preserve input order.""" | |
|     ... | |
| ``` | |
| Internal details: | |
| - `_quality_tts: TTS | None`: Coqui singleton, loaded via | |
|   `TTS("tts_models/en/ljspeech/tacotron2-DDC", gpu=False)`. | |
| - `_fast_model: VitsModel | None`, `_fast_tokenizer: AutoTokenizer | None`: | |
|   MMS-TTS singletons loaded from `"facebook/mms-tts-eng"`. | |
| - `synthesise` quality path: identical to current code, i.e. `_chunk()` at | |
|   `_MAX_CHARS = 150`, `tts.tts(text=c)` per chunk, concatenate. | |
| - `synthesise` fast path: tokenize → `_fast_model(**inputs).waveform` under `torch.no_grad()` → | |
|   `.squeeze().numpy()` as `float32`. Chunks only as a safety guard at | |
|   `_MAX_CHARS_FAST = 500`; VITS handles long inputs natively. | |
| - `len(cleaned) < 3` skip guard applies to both paths. | |
| - `synthesise_parallel`: filters empty sentences, then | |
| `ProcessPoolExecutor(max_workers=min(cpu_count, len(sentences)))` with | |
| `executor.map(partial(synthesise, mode=mode), filtered, chunksize=1)`. | |
| Pool is created fresh per call (see D9 below). | |
| - For `quality` mode, cap `max_workers` at `min(cpu_count, 2)` to prevent OOM | |
| from multiple workers each loading the ~400 MB Coqui model. | |
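The worker-count rule and the ordered parallel map can be sketched as follows. This is an illustration under assumptions: `worker_count`, `synthesise_stub`, and the `start_method` parameter are hypothetical, and the stub stands in for the real `synthesise` so the example runs without any TTS model. It also returns early when no sentences survive filtering, because `ProcessPoolExecutor` rejects `max_workers=0`.

```python
from __future__ import annotations

import multiprocessing as mp
import os
from concurrent.futures import ProcessPoolExecutor
from functools import partial

QUALITY_MAX_WORKERS = 2  # cap from the design: each quality worker loads ~400 MB


def worker_count(mode: str, n_sentences: int, cpu_count: int | None = None) -> int:
    """min(cpu_count, n_sentences), additionally capped at 2 for quality mode."""
    cpus = cpu_count or os.cpu_count() or 1
    workers = min(cpus, n_sentences)
    if mode == "quality":
        workers = min(workers, QUALITY_MAX_WORKERS)
    return max(workers, 1)  # a pool needs at least one worker


def synthesise_stub(sentence: str, mode: str = "fast") -> list[float]:
    """Stand-in for tts.synthesise: one silent 'sample' per character."""
    return [0.0] * len(sentence)


def synthesise_parallel(sentences, mode="fast", synth=synthesise_stub, start_method=None):
    """Map sentences to audio in worker processes; output order matches input order."""
    filtered = [s for s in sentences if s.strip()]  # drop empty sentences
    if not filtered:
        return []  # avoids creating a pool with zero workers
    ctx = mp.get_context(start_method) if start_method else None
    workers = worker_count(mode, len(filtered))
    with ProcessPoolExecutor(max_workers=workers, mp_context=ctx) as pool:
        # pool.map yields results in submission order, whichever worker finishes first.
        return list(pool.map(partial(synth, mode=mode), filtered, chunksize=1))
```

When a non-fork start method such as `"spawn"` is used, `synth` must be an importable module-level function and the call should run under an `if __name__ == "__main__":` guard in scripts.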
| ### 5.4 `backend/app/services/ocr.py`: full rewrite | |
| **Signature unchanged:** callers in `upload.py` are unaffected. | |
| ```python | |
| TEXT_LAYER_MIN_CHARS: int = 50 | |
| def extract_pages(pdf_path: Path) -> list[list[str]]: ... | |
| ``` | |
| - Import `fitz` (pymupdf). | |
| - For each page: `page.get_text("text")`. If `len(stripped) >= TEXT_LAYER_MIN_CHARS`, | |
|   use it. Otherwise convert via fitz `Pixmap` at 200 DPI → PIL Image → Tesseract. | |
| - Log extraction method per page at DEBUG level. | |
| **Threshold rationale:** 50 chars filters out blank pages and the stray whitespace fitz returns for | |
| form fields, yet is low enough that sparse pages with a genuine text layer are not sent to OCR. It is a module-level | |
| constant so tests can override it in one place. | |
| ### 5.5 `backend/app/routers/upload.py`: targeted edits | |
| 1. Accept `mode: TtsMode = Form(TtsMode.fast)` alongside existing `format` field. | |
| 2. Pass `mode` to `storage.create_job` and `_run_pipeline`. | |
| 3. Inside the page loop, call `tts.synthesise_parallel(sentences, mode=mode)` instead | |
| of iterating `tts.synthesise` per sentence. | |
| 4. Pass `mode` to `tts.get_sample_rate(mode)` when calling `audio_chain.process_and_export`. | |
| 5. Echo `mode` in the 202 response body. | |
| Pause check granularity changes from per-sentence to per-page (one check before | |
| `synthesise_parallel`). Per-page granularity is acceptable; intra-page pause would | |
| require IPC into worker processes. | |
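The per-page pause granularity could be sketched with two `threading.Event` flags. How the existing pipeline signals pause is not described here, so `run_pages`, `resume`, and `cancelled` are hypothetical; the point is only that the check sits before each page and never inside one.

```python
import threading
from typing import Callable, Iterable


def run_pages(
    pages: Iterable[list[str]],
    synthesise_page: Callable[[list[str]], list],
    resume: threading.Event,
    cancelled: threading.Event,
) -> list:
    """Synthesise page by page, checking the pause flag once per page.

    `resume` is set while the job may run and cleared while it is paused.
    """
    segments: list = []
    for sentences in pages:
        resume.wait()  # blocks here while paused; a page in progress is never interrupted
        if cancelled.is_set():
            break
        segments.extend(synthesise_page(sentences))
    return segments
```

In the real pipeline `synthesise_page` would wrap `tts.synthesise_parallel(sentences, mode=mode)`. A pause request therefore takes effect only once the current page finishes.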
| ### 5.6 `backend/app/storage.py`: minor | |
| `create_job` accepts `mode: TtsMode` and stores it on the `Job` object. | |
| ### 5.7 `backend/Dockerfile`: pre-bake both models | |
| ```dockerfile | |
| # Pre-bake Coqui Tacotron2-DDC (quality mode): existing line, unchanged | |
| RUN python -c "from TTS.api import TTS; TTS('tts_models/en/ljspeech/tacotron2-DDC', gpu=False)" | |
| # Pre-bake facebook/mms-tts-eng (fast mode): new | |
| ENV TRANSFORMERS_CACHE=/opt/hf_cache | |
| ENV HF_HOME=/opt/hf_cache | |
| # One-line form: a bare newline inside the quotes would end the RUN instruction | |
| RUN python -c "from transformers import VitsModel, AutoTokenizer; VitsModel.from_pretrained('facebook/mms-tts-eng'); AutoTokenizer.from_pretrained('facebook/mms-tts-eng')" | |
| ``` | |
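| `TRANSFORMERS_CACHE` and `HF_HOME` are set explicitly so the cache sits at a known | |
| path regardless of which user runs the container, which also keeps the layer contents predictable. Recent `transformers` releases deprecate `TRANSFORMERS_CACHE` in favour of `HF_HOME`. | |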
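| The `espeak-ng` apt package must remain for now. The design assumes the MMS-TTS tokenizer needs it for | |
| phonemization via `phonemizer`; verify this at build time, since `transformers` does not install `phonemizer` itself and MMS-TTS checkpoints may be configured without phonemization. | |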
| ### 5.8 `backend/app/main.py`: one-line edit | |
| The lifespan warmup should call `tts.get_sample_rate(TtsMode.fast)` to eagerly load | |
| the fast model (default) at startup. The quality model loads lazily on the first | |
| `quality` request, which is acceptable given its larger footprint. | |
| ### 5.9 `backend/app/services/audio_chain.py`: no change | |
| The pedalboard chain is sample-rate-agnostic: all filter frequencies and compressor | |
| times are in Hz / ms, not in samples. Callers already supply `sample_rate`; they | |
| will now source it from `tts.get_sample_rate(mode)`. | |
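A short arithmetic sketch shows why parameters given in ms and Hz carry across sample rates: the same 20 ms window is simply a different number of samples at each rate. The Nyquist check is an added caveat, not something this design states: a filter frequency is only meaningful below half the sample rate, so any band above 8 kHz could not act on 16 kHz fast-mode audio. The function names and the example frequencies are illustrative.

```python
def ms_to_samples(duration_ms: float, sample_rate: int) -> int:
    """Convert a time-domain parameter (e.g. compressor attack) to a sample count."""
    return round(duration_ms * sample_rate / 1000)


def nyquist_hz(sample_rate: int) -> float:
    """Highest frequency representable at this sample rate."""
    return sample_rate / 2


def band_is_representable(frequency_hz: float, sample_rate: int) -> bool:
    """True when a filter frequency lies strictly below Nyquist for this rate."""
    return 0 < frequency_hz < nyquist_hz(sample_rate)


# The same 20 ms setting at each mode's rate:
quality_window = ms_to_samples(20, 22_050)  # 441 samples
fast_window = ms_to_samples(20, 16_000)     # 320 samples
```

If the chain's EQ or high-pass settings all sit well below 8 kHz, both modes are covered; checking them with `band_is_representable` against 16 000 Hz confirms this.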
| --- | |
| ## 6. File Change Summary | |
| | File | Change type | Summary | | |
| |------|-------------|---------| | |
| | `backend/app/models/schemas.py` | Edit | Add `TtsMode` enum; add `mode` field to `Job` | | |
| | `backend/requirements.txt` | Edit | Add pymupdf, transformers, scipy; keep TTS for quality mode | | |
| | `backend/app/services/ocr.py` | Rewrite | fitz fast path + Tesseract fallback | | |
| | `backend/app/services/tts.py` | Rewrite | Dual-mode lazy singletons; mode-aware `synthesise` + `synthesise_parallel` | | |
| | `backend/app/routers/upload.py` | Edit | Accept `mode` form param; thread through pipeline; per-page parallel synthesis | | |
| | `backend/app/storage.py` | Edit | Store `mode` on job | | |
| | `backend/Dockerfile` | Edit | Pre-bake both models; add `TRANSFORMERS_CACHE` env vars | | |
| | `backend/app/main.py` | Edit | Warmup fast model at startup | | |
| | `backend/app/services/audio_chain.py` | None | Sample-rate-agnostic already | | |
| --- | |
| ## 7. Decision Log | |
| | # | Decision | Alternatives considered | Rationale | | |
| |---|----------|------------------------|-----------| | |
| | D1 | Expose both modes as selectable via `mode` form field | Single model, two endpoints, query param | Form field is consistent with existing `format` param; no URL change; clean enum validation via Pydantic | | |
| | D2 | Default mode = `"fast"` | Default to `"quality"` | Perceived latency is the primary UX metric for a web service. 10x speed-up turns multi-minute waits into tens of seconds for typical PDFs. Users who need highest fidelity opt in explicitly | | |
| | D3 | Both models in the same Docker image | Two images + API-gateway routing | A single container avoids a routing layer, keeps deployment to one artifact, and keeps cold-start behaviour identical across modes. The two models together account for ~480 MB of the image (80 MB MMS-TTS + 400 MB Coqui, the latter already pre-baked today), which is acceptable for self-hosted use | | |
| | D4 | Independent lazy singletons per model | Shared loader with mode key | Lazy loading means quality model (~400 MB) is never in memory if operator only runs fast jobs, and vice versa. Separate singletons are simpler than a dynamic registry and easier to test in isolation | | |
| | D5 | Use `transformers.VitsModel` directly for MMS-TTS | Route through Coqui TTS wrapper | Coqui 0.22 does not support `facebook/mms-tts-eng` without a custom model config YAML; that approach is fragile across Coqui releases. `transformers` is the canonical Meta-endorsed path | | |
| | D6 | Keep `_chunk()` at 150 chars for quality; new 500-char limit for fast | Unified chunk size | Tacotron2 fixed `max_decoder_steps` constraint does not exist in VITS. Larger fast-mode limit reduces join-point glitches and preserves prosody across longer phrases | | |
| | D7 | Run both modes through the same pedalboard DSP chain | Skip DSP for fast mode to reduce latency | The chain is sample-rate-agnostic and costs ~2-5 ms/segment. Removing it for fast mode would make the output audibly inconsistent between modes. The chain is also an explicit quality differentiator for the service | | |
| | D8 | `TRANSFORMERS_CACHE=/opt/hf_cache` in Dockerfile | Default `~/.cache` | Build user may differ from runtime user; explicit path guarantees pre-baked weights are found at runtime and the layer is reproducible | | |
| | D9 | Fresh ProcessPoolExecutor per `synthesise_parallel` call | Module-level persistent pool | FastAPI background tasks run in threads; a persistent pool created at module import adds lifecycle and signal-handling complexity inside Docker. The ~50 ms pool startup cost is negligible next to inference time, provided workers do not have to reload the model on every call | | |
| | D10 | Cap `max_workers=2` for quality mode | Let it match `cpu_count` | Each quality worker loads Coqui (~400 MB). On a 4-core/1 GB host, 4 workers = 1.6 GB model memory alone. Capping at 2 keeps peak model memory at ~800 MB | | |
| | D11 | `fitz.get_text("text")` as OCR fast path | `pdfplumber`, `pdfminer.six` | pymupdf (Python bindings to the C MuPDF library) is among the fastest PDF text extractors available from Python; it is already needed for the pixmap OCR fallback, so it is a single dependency | | |
| | D12 | Threshold = 50 chars for direct-text path | 0, 100, 200 | A threshold of 0 would take the direct-text path even on blank or form-only pages; 100+ risks sending sparse-but-valid text pages to OCR; 50 is a pragmatic middle | | |
| | D13 | fitz Pixmap at 200 DPI for OCR fallback | pdf2image (poppler) at 300 DPI | Removes poppler as a hard runtime dependency of the OCR fallback; 200 DPI is sufficient for Tesseract on standard fonts | | |
| --- | |
| ## 8. Assumptions | |
| 1. Docker build environment has outbound internet at build time. Runtime is fully offline per SPEC. | |
| 2. `facebook/mms-tts-eng` is available on HuggingFace Hub; its weights are released under CC-BY-NC 4.0 (non-commercial use only), which must be compatible with the deployment. Coqui Tacotron2-DDC is MPL-2.0 (unchanged). | |
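| 3. The MMS-TTS tokenizer is assumed to need `phonemizer`, which requires `espeak-ng` at runtime; the existing apt install in the Dockerfile covers this. `transformers` does not pull in `phonemizer` automatically, so it must be added to requirements if it turns out to be needed. | |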
| 4. Deployment host has at least 1 CPU core. `synthesise_parallel` degrades gracefully to `max_workers=1` on a single-core host. | |
| 5. Memory budget: quality model ~400 MB + fast model ~80 MB = ~480 MB static footprint per process. Because the singletons are never unloaded, both models stay resident once each mode has been requested at least once, and the fast model is also warmed at startup (section 5.8). A fast-only deployment never loads the quality model. | |
| 6. `normalize_for_tts` in `services/normalize.py` needs no changes: it operates on raw strings before model input. | |
| 7. WAV output in fast mode will be 16 kHz PCM; in quality mode 22 kHz PCM. This is a documented, expected difference, not a bug. | |
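The sample-rate difference in assumption 7 shows up directly in the WAV header. The standard-library sketch below writes mono 16-bit PCM at the rate for each mode and reads the duration back; `floats_to_wav`, `wav_duration_seconds`, and `SAMPLE_RATES` are hypothetical names, and the real export goes through `audio_chain.process_and_export` rather than this code.

```python
import io
import struct
import wave

SAMPLE_RATES = {"quality": 22_050, "fast": 16_000}


def floats_to_wav(samples: list[float], mode: str) -> bytes:
    """Encode mono float samples in [-1, 1] as 16-bit PCM WAV at the mode's rate."""
    pcm = struct.pack(
        f"<{len(samples)}h",
        *(int(max(-1.0, min(1.0, s)) * 32767) for s in samples),
    )
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # bytes per sample, i.e. 16-bit
        wav.setframerate(SAMPLE_RATES[mode])
        wav.writeframes(pcm)
    return buffer.getvalue()


def wav_duration_seconds(data: bytes) -> float:
    """Read the duration back from the WAV header."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return wav.getnframes() / wav.getframerate()
```

Labelling 16 000 fast-mode samples with the quality rate would shorten playback from 1.0 s to about 0.73 s and raise the pitch, which is why the rate has to come from `tts.get_sample_rate(mode)` rather than a constant.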
| --- | |
| ## 9. Risks and Mitigations | |
| | Risk | Likelihood | Impact | Mitigation | | |
| |------|-----------|--------|------------| | |
| | MMS-TTS quality unacceptable for some sentence structures | Medium | Medium | Evaluate against fixture sentences before shipping; `quality` mode is always available as a fallback | | |
| | Both models warm simultaneously under mixed-mode concurrent load exceeds memory limit | Low | High | Document memory requirements in ops runbook; add `QUALITY_MAX_WORKERS` env var cap if needed post-deployment | | |
| | fitz text extraction returns garbled Unicode on PDFs with custom encoding | Medium | Low | `normalize_for_tts` strips most noise; add printable-char ratio guard (>30%) post-testing if needed | | |
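| | ProcessPoolExecutor fork-safety with torch on Linux | Low | High | Forked torch CPU workers usually work on Linux, but forking after the parent has initialised torch thread pools can hang; if issues appear, switch the pool to `spawn` by passing `mp_context=multiprocessing.get_context("spawn")` to `ProcessPoolExecutor` | | |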
| | `espeak-ng` version mismatch between apt and phonemizer expectations | Low | High | Validate at build time; pin `espeak-ng` apt version if a mismatch is observed | | |
| | 200 DPI OCR fallback misses fine print vs current 300 DPI pdf2image default | Medium | Low | Bump to 300 DPI in the fitz Pixmap call if OCR accuracy is reported degraded; one-integer change | | |
| | Image size (~480 MB of model weights, of which ~80 MB is new) hits CI/CD artifact limits | Low | Medium | Models are in distinct `RUN` layers; layer caching means rebuilds only re-download them on dependency changes | | |
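The printable-character guard mentioned in the risk table could look like the sketch below. The function names are hypothetical, and both the 30% threshold and the decision to count U+FFFD (the replacement character) as noise are assumptions to be tuned against real PDFs.

```python
def printable_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are printable and not U+FFFD."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    # str.isprintable() is False for control and private-use code points, which
    # custom PDF encodings often produce; U+FFFD is printable, so exclude it explicitly.
    good = sum(1 for c in chars if c.isprintable() and c != "\ufffd")
    return good / len(chars)


def text_layer_is_usable(text: str, min_chars: int = 50, min_ratio: float = 0.30) -> bool:
    """Combine the length threshold with the printable-ratio guard."""
    stripped = text.strip()
    return len(stripped) >= min_chars and printable_ratio(stripped) > min_ratio
```

A page whose text layer fails this check would fall through to the Tesseract path, exactly as a page with too few characters does.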
| --- | |
| ## 10. Open Questions | |
| 1. Does `facebook/mms-tts-eng` produce acceptable audio quality for the target use case, or does it need evaluation against fixture sentences before committing to the implementation? | |
| 2. Should `TEXT_LAYER_MIN_CHARS = 50` be an env-var override (like `MAX_UPLOAD_MB`) or is a hardcoded constant sufficient? | |
| 3. Is there a soft memory limit on the Docker container in the target deployment? Determines whether `QUALITY_MAX_WORKERS` should be env-configurable. | |
| 4. Should the 202 response echo `mode`, or is it sufficient to retrieve it via `GET /jobs/{job_id}`? | |
| 5. Should `pdf2image` / `poppler-utils` be removed now that fitz Pixmap covers the OCR fallback, or deferred to a follow-up cleanup PR? | |