| # Subtrans — System Architecture V2 |
|
|
| This document details the updated end-to-end architecture and data flow of the **Subtrans** pipeline, reflecting the integration of robust Gemini adapters, strict LLM validation, and TDD-hardened length loop checks. |
|
|
|
|
| --- |
|
|
| ## High-Level Architecture Flowchart |
|
|
| Below is the complete data flow from raw video file input to the final self-corrected translated subtitles, mapped across the three translation backends and the final LLM validation pass: |
|
|
|  |
|
|
| ```mermaid |
| graph TD |
| %% Input |
| A[Input Video File] -->|FFmpeg Extraction| B(Mono WAV Audio @ 16kHz) |
| |
| %% Transcription |
| B -->|Local Offline| C[faster-whisper Engine] |
| C -->|Model Size: medium + Phonetic Bias| D[English Audio Transcription] |
| D -->|Precision Patching| DP[LLM Entity Corrector] |
| DP -->|Segments Parsing| E[English SRT File / Raw Lists] |
| |
| %% Translation Branching |
| E -->|Select Translation Engine| F{Translation Selector} |
| |
| %% Google Translate Path |
| F -->|deep-translator| G[DeepTranslatorAdapter] |
| G -->|Line-by-Line Request| H[Translated Subtitles Draft] |
| |
| %% Groq LLM Path |
| F -->|Groq Cloud LLM| I[GroqAdapter] |
| I -->|Contextual Batching: 10 Lines| J[Llama 3.3 70B Engine] |
| J -->|Idiomatic, Natural Translation| H |
| |
| %% Gemini LLM Path |
| F -->|Gemini API| K[GeminiAdapter] |
| K -->|Full Context Batching: Entire File| L[Gemini 2.5 Flash / 3.1 Pro] |
| L -->|Content Isolation & Glossary Prompting| H |
| |
| %% Validation & Correction Path (Automatic) |
| H -->|LLM Reviewer Pass| M[Validation Service] |
| M -->|30-Line Batches| N[Gemini 3.1 Pro / Llama 3.3 70B Quality Editor] |
| N -->|Conservative Rules Audit| O{Errors Found?} |
| |
| %% Validation Output |
| O -->|Yes| P[Classify & Auto-Correct] |
| P -->|Logs to JSONL Dataset| Q[Parse Corrected Line] |
| O -->|No| R[ALL_CORRECT — Keep original] |
| |
| %% Final Integration |
| Q --> S[Merge Corrections into SRT Generator] |
| R --> S |
| |
| S --> T[Final Target Language SRT File] |
| |
| %% Styles |
| classDef main fill:#e3f2fd,stroke:#1565c0,stroke-width:2px; |
| classDef process fill:#f1f8e9,stroke:#558b2f,stroke-width:1.5px; |
| classDef warning fill:#fff8e1,stroke:#f57f17,stroke-width:1.5px; |
| class A,T main; |
| class C,J,L,N process; |
| class M,P warning; |
| ``` |
|
|
| --- |
|
|
| ## Detailed Component Breakdown |
|
|
| ### 1. Audio Extraction & Transcription Stage |
| - **Extraction**: Utilizing Python FFmpeg, the system extracts the audio stream from the target video file and normalizes it to a single-channel, 16kHz WAV file (`pcm_s16le`). |
| - **Engine**: Transcribes audio locally and offline using the `faster-whisper` engine. |
| - **Model**: Configured to use the **`medium`** (769M parameters) model for maximum semantic precision. |
| - **Phonetic Bias**: Injects a custom `initial_prompt` into the Whisper decoder to bias it toward specific technical terms and brand names (e.g., "Naukri", "NotebookLM"). |
| - **Precision Patching**: A dedicated LLM pass (Gemini) that scans for low-confidence entities and corrects them before translation, ensuring name consistency. |
|
|
| ### 2. Security & Integrity: Content Isolation |
| - **Escrow Tags**: All transcript content sent to LLMs is wrapped in `<l>...</l>` isolation tags. |
| - **Instruction Proofing**: System prompts are hardened to treat all content within tags as inert data, preventing "Instruction Leakage" if the transcript mentions AI-related keywords. |
|
|
| ### 2. Translation Stage |
| Subtitles can be translated using three unique adapter pathways implementing the `Translator` interface: |
| - **`DeepTranslatorAdapter` (Google Translate)**: Processes subtitles line-by-line using free endpoints. This approach is highly literal and safe from semantic hallucinations, but lacks conversational flow and can be stylistically repetitive. |
| - **`GroqAdapter` (Llama 3.3 70B)**: Processes subtitles in conversational **batches of 10 lines** with contextual system prompts. Preserves conversational threads and flow. |
| - **`GeminiAdapter` (Gemini 2.5 Flash / 3.1 Pro)**: Now uses **Full-Context Batching**. It processes the entire subtitle file in a single request (optimized for Gemini's massive 1M+ token window). |
| - **Glossary Injection**: Dynamically injects project-specific translation rules and cultural mappings (idioms) into the system prompt. |
| - **Singleton Pattern**: Managed via a class-level singleton to ensure zero redundant resource overhead and clean session logging. |
|
|
| ### 3. LLM Reviewer & Validation Stage (Self-Correction Pass) |
| To eliminate severe semantic errors (meaning inversions, dropped sentences, severe mistranslations) introduced by LLM adapters, a self-correction validation engine runs after the translation draft is generated: |
| - **Batching**: English/Translated pairs are processed in **batches of 30 lines**. |
| - **Model Cascade**: Leverages `gemini-3.1-pro-preview` with native fallbacks to `2.5-pro` and `3-flash`, or natively falls back to `llama-3.3-70b-versatile` if Gemini is missing or exhausted. |
| - **Conservative System Rules**: The LLM adopts a "hands-off-by-default" strategy. It is forbidden from changing lines for formatting or style, ensuring zero false positives. |
| - **Reason Classification Dataset**: Catches, corrects, and logs fixes to `logs/translation_failures_{timestamp}.jsonl` for observability: |
| - `NEGATION_FAILURE` |
| - `SLANG_FAILURE` |
| - `PRONOUN_CONFUSION` |
| - `SPEAKER_CONFUSION` |
| - `MISSING_CONTEXT` |
| - `TOO_LITERAL` |
| - `CULTURAL_REFERENCE` |
| - `HALLUCINATION` |
| - `OMISSION` |
| - `OTHER` |
| - **Parser & Integrator**: Corrections are parsed out of `[LINE_NUMBER][CATEGORY]` tags, replaced back in the timeline, and logged to the terminal console with a categorized review summary. |
|
|
| --- |
|
|
| ## Technical Performance Stats |
| - **Transcription Speed**: Fast CPU/GPU processing via Whisper `medium`. |
| - **Gemini Throughput**: Batches of 30 lines successfully handled per API request. Zero translation truncation due to TDD-verified loop retries. |
| - **Validation Fallback Resiliency**: If rate limits hit, the validator seamlessly cascades down through models to preserve CI/CD test stability. |
|
|