
Subtrans — System Architecture V2

This document describes the updated end-to-end architecture and data flow of the Subtrans pipeline, including the Gemini adapters, strict LLM validation, and test-driven (TDD) length-check retry loops.


High-Level Architecture Flowchart

Below is the complete data flow from raw video file input to the final self-corrected translated subtitles, mapped across the three translation backends and the final LLM validation pass:

System Architecture Diagram

graph TD
    %% Input
    A[Input Video File] -->|FFmpeg Extraction| B(Mono WAV Audio @ 16kHz)
    
    %% Transcription
    B -->|Local Offline| C[faster-whisper Engine]
    C -->|Model Size: medium + Phonetic Bias| D[English Audio Transcription]
    D -->|Precision Patching| DP[LLM Entity Corrector]
    DP -->|Segments Parsing| E[English SRT File / Raw Lists]
    
    %% Translation Branching
    E -->|Select Translation Engine| F{Translation Selector}
    
    %% Google Translate Path
    F -->|deep-translator| G[DeepTranslatorAdapter]
    G -->|Line-by-Line Request| H[Translated Subtitles Draft]
    
    %% Groq LLM Path
    F -->|Groq Cloud LLM| I[GroqAdapter]
    I -->|Contextual Batching: 10 Lines| J[Llama 3.3 70B Engine]
    J -->|Idiomatic, Natural Translation| H
    
    %% Gemini LLM Path
    F -->|Gemini API| K[GeminiAdapter]
    K -->|Full Context Batching: Entire File| L[Gemini 2.5 Flash / 3.1 Pro]
    L -->|Content Isolation & Glossary Prompting| H
    
    %% Validation & Correction Path (Automatic)
    H -->|LLM Reviewer Pass| M[Validation Service]
    M -->|30-Line Batches| N[Gemini 3.1 Pro / Llama 3.3 70B Quality Editor]
    N -->|Conservative Rules Audit| O{Errors Found?}
    
    %% Validation Output
    O -->|Yes| P[Classify & Auto-Correct]
    P -->|Logs to JSONL Dataset| Q[Parse Corrected Line]
    O -->|No| R[ALL_CORRECT — Keep original]
    
    %% Final Integration
    Q --> S[Merge Corrections into SRT Generator]
    R --> S
    
    S --> T[Final Target Language SRT File]
    
    %% Styles
    classDef main fill:#e3f2fd,stroke:#1565c0,stroke-width:2px;
    classDef process fill:#f1f8e9,stroke:#558b2f,stroke-width:1.5px;
    classDef warning fill:#fff8e1,stroke:#f57f17,stroke-width:1.5px;
    class A,T main;
    class C,J,L,N process;
    class M,P warning;

Detailed Component Breakdown

1. Audio Extraction & Transcription Stage

  • Extraction: Using FFmpeg from Python, the system extracts the audio stream from the target video file and normalizes it to a single-channel, 16 kHz WAV file (pcm_s16le).
  • Engine: Transcribes audio locally and offline using the faster-whisper engine.
  • Model: Configured to use the medium model (769M parameters) for high transcription accuracy.
  • Phonetic Bias: Injects a custom initial_prompt into the Whisper decoder to bias it toward specific technical terms and brand names (e.g., "Naukri", "NotebookLM").
  • Precision Patching: A dedicated LLM pass (Gemini) scans the transcript for low-confidence entities and corrects them before translation, keeping names consistent.
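
The extraction and transcription steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it calls the ffmpeg binary through subprocess rather than a Python FFmpeg wrapper, and the function names, the "Glossary: ..." prompt wording, and the device/compute_type settings are assumptions.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(video_path: str, wav_path: str) -> list[str]:
    """Build an FFmpeg command that writes mono, 16 kHz, 16-bit PCM WAV."""
    return [
        "ffmpeg", "-y",          # overwrite the output file if it exists
        "-i", video_path,        # input video
        "-vn",                   # drop the video stream
        "-ac", "1",              # one audio channel (mono)
        "-ar", "16000",          # 16 kHz sample rate
        "-acodec", "pcm_s16le",  # signed 16-bit little-endian PCM
        wav_path,
    ]


def extract_audio(video_path: str, wav_path: str) -> Path:
    """Run FFmpeg and return the path of the extracted WAV file."""
    subprocess.run(build_ffmpeg_command(video_path, wav_path), check=True)
    return Path(wav_path)


def transcribe(wav_path: str, bias_terms: list[str]) -> list[tuple[float, float, str]]:
    """Transcribe locally with faster-whisper, biased toward the given terms."""
    # Imported lazily so the helpers above work without faster-whisper installed.
    from faster_whisper import WhisperModel

    # device and compute_type are illustrative choices, not project settings.
    model = WhisperModel("medium", device="cpu", compute_type="int8")
    # initial_prompt nudges the decoder toward these spellings (assumed wording).
    prompt = "Glossary: " + ", ".join(bias_terms)
    segments, _info = model.transcribe(wav_path, language="en", initial_prompt=prompt)
    return [(seg.start, seg.end, seg.text.strip()) for seg in segments]
```

A call such as transcribe("talk.wav", ["Naukri", "NotebookLM"]) would return (start, end, text) tuples that can then be written out as SRT entries.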

Security & Integrity: Content Isolation

  • Escrow Tags: All transcript content sent to LLMs is wrapped in <l>...</l> isolation tags.
  • Instruction Proofing: System prompts are hardened to treat all content within tags as inert data, preventing "Instruction Leakage" if the transcript mentions AI-related keywords.

2. Translation Stage

Subtitles can be translated through any of three adapters, each implementing the Translator interface:

  • DeepTranslatorAdapter (Google Translate): Processes subtitles line-by-line using free endpoints. This approach is highly literal and safe from semantic hallucinations, but lacks conversational flow and can be stylistically repetitive.
  • GroqAdapter (Llama 3.3 70B): Processes subtitles in conversational batches of 10 lines with contextual system prompts. Preserves conversational threads and flow.
  • GeminiAdapter (Gemini 2.5 Flash / 3.1 Pro): Now uses Full-Context Batching. It processes the entire subtitle file in a single request (optimized for Gemini's massive 1M+ token window).
  • Glossary Injection: Dynamically injects project-specific translation rules and cultural mappings (idioms) into the system prompt.
  • Singleton Pattern: Adapter instances are managed via a class-level singleton, avoiding redundant resource overhead and keeping session logging clean.
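
The adapter design above might look roughly like the sketch below. The method signature, the chunked and build_system_prompt helpers, and the BatchedTranslator class are assumptions made for illustration; only the Translator name, the 10-line batch size, and the idea of glossary injection come from this document.

```python
from abc import ABC, abstractmethod
from typing import Callable, Iterator


class Translator(ABC):
    """Common interface shared by all translation adapters (assumed signature)."""

    @abstractmethod
    def translate(self, lines: list[str], target_lang: str) -> list[str]:
        """Return one translated line per input line."""


def chunked(lines: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive batches of at most `size` lines."""
    for start in range(0, len(lines), size):
        yield lines[start:start + size]


def build_system_prompt(target_lang: str, glossary: dict[str, str]) -> str:
    """Inject project-specific glossary rules into the system prompt."""
    rules = "\n".join(f"- {source} => {target}" for source, target in glossary.items())
    return f"Translate each subtitle line into {target_lang}.\nGlossary:\n{rules}"


class BatchedTranslator(Translator):
    """Hypothetical adapter that sends lines to a backend in fixed-size batches."""

    def __init__(self, send_batch: Callable[[list[str], str], list[str]], batch_size: int = 10):
        self.send_batch = send_batch  # stands in for the real API call
        self.batch_size = batch_size

    def translate(self, lines: list[str], target_lang: str) -> list[str]:
        translated: list[str] = []
        for batch in chunked(lines, self.batch_size):
            translated.extend(self.send_batch(batch, target_lang))
        return translated
```

Under this sketch, a full-context adapter is the same idea with a batch size equal to the file length, so the whole subtitle file goes out in one request.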

3. LLM Reviewer & Validation Stage (Self-Correction Pass)

To eliminate severe semantic errors (meaning inversions, dropped sentences, severe mistranslations) introduced by LLM adapters, a self-correction validation engine runs after the translation draft is generated:

  • Batching: English/Translated pairs are processed in batches of 30 lines.
  • Model Cascade: Uses gemini-3.1-pro-preview first, with fallbacks to 2.5-pro and 3-flash, then falls back to llama-3.3-70b-versatile if Gemini is missing or exhausted.
  • Conservative System Rules: The LLM adopts a "hands-off-by-default" strategy. It is forbidden from changing lines for formatting or style, which keeps false positives to a minimum.
  • Reason Classification Dataset: Each fix is caught, corrected, and logged to logs/translation_failures_{timestamp}.jsonl for observability, tagged with one of these reason categories:
    • NEGATION_FAILURE
    • SLANG_FAILURE
    • PRONOUN_CONFUSION
    • SPEAKER_CONFUSION
    • MISSING_CONTEXT
    • TOO_LITERAL
    • CULTURAL_REFERENCE
    • HALLUCINATION
    • OMISSION
    • OTHER
  • Parser & Integrator: Corrections are parsed out of [LINE_NUMBER][CATEGORY] tags, merged back into the timeline, and printed to the terminal console as a categorized review summary.
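
A parser for the reviewer output could be sketched as below. The exact response layout (one "[LINE_NUMBER][CATEGORY] corrected text" entry per line, 1-based line numbers, a bare ALL_CORRECT reply when nothing changes) and the JSONL record fields are assumptions, as are the function names.

```python
import json
import re

CATEGORIES = {
    "NEGATION_FAILURE", "SLANG_FAILURE", "PRONOUN_CONFUSION", "SPEAKER_CONFUSION",
    "MISSING_CONTEXT", "TOO_LITERAL", "CULTURAL_REFERENCE", "HALLUCINATION",
    "OMISSION", "OTHER",
}

# Assumed layout: [12][NEGATION_FAILURE] corrected text
CORRECTION = re.compile(r"^\[(\d+)\]\[([A-Z_]+)\]\s*(.+)$")


def parse_corrections(response: str) -> dict[int, tuple[str, str]]:
    """Map line number -> (category, corrected text); empty if nothing changed."""
    if response.strip() == "ALL_CORRECT":
        return {}
    found: dict[int, tuple[str, str]] = {}
    for raw in response.splitlines():
        match = CORRECTION.match(raw.strip())
        if match is None:
            continue  # ignore anything that is not a well-formed correction
        number, category, text = match.groups()
        if category not in CATEGORIES:
            category = "OTHER"  # unknown labels fall back to OTHER
        found[int(number)] = (category, text.strip())
    return found


def apply_corrections(translated, corrections, log_path=None):
    """Return a corrected copy of `translated`, optionally logging fixes as JSONL."""
    merged = list(translated)
    records = []
    for number, (category, text) in corrections.items():
        index = number - 1  # assumes 1-based line numbers in the tags
        if 0 <= index < len(merged):
            records.append({"line": number, "category": category,
                            "before": merged[index], "after": text})
            merged[index] = text
    if log_path and records:
        with open(log_path, "a", encoding="utf-8") as handle:
            for record in records:
                handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return merged
```

Lines that the reviewer does not mention are left untouched, which matches the hands-off-by-default rule.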

Technical Performance Stats

  • Transcription Speed: Fast CPU/GPU processing via Whisper medium.
  • Gemini Throughput: Batches of 30 lines are handled per API request, with no truncated translations thanks to TDD-verified retry loops.
  • Validation Fallback Resiliency: If rate limits are hit, the validator cascades down through the fallback models to preserve CI/CD test stability.
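
The retry and fallback behaviour mentioned above can be illustrated with the sketch below. It assumes a length check means comparing the number of returned lines with the number sent; the RateLimited exception, the function names, and the default of three attempts are hypothetical.

```python
from typing import Callable


class RateLimited(Exception):
    """Hypothetical error raised when a model's rate limit or quota is hit."""


def call_with_cascade(models: list[str], call: Callable[[str], str]) -> tuple[str, str]:
    """Try each model in order and return (model, response) from the first success."""
    last_error = None
    for model in models:
        try:
            return model, call(model)
        except RateLimited as error:
            last_error = error  # move on to the next model in the cascade
    raise RuntimeError("all models in the cascade were exhausted") from last_error


def translate_with_length_check(
    translate: Callable[[list[str]], list[str]],
    lines: list[str],
    max_attempts: int = 3,
) -> list[str]:
    """Retry until the output has exactly one line per input line."""
    for _ in range(max_attempts):
        result = translate(lines)
        if len(result) == len(lines):
            return result
    raise ValueError(f"line count still mismatched after {max_attempts} attempts")
```

For the validator, models could be a list such as ["gemini-3.1-pro-preview", "llama-3.3-70b-versatile"], with call wrapping the actual API request.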