Spaces:

arjun-ms
/

Subtrans

Sleeping

App Files Files Community

Subtrans / ARCHITECTURE.md

arjun-ms

Initial commit: Subtrans Subtitle Pipeline

57bbccb 11 days ago

preview code

raw

history blame contribute delete

6.16 kB

	# Subtrans — System Architecture V2

	This document details the updated end-to-end architecture and data flow of the Subtrans pipeline, reflecting the integration of robust Gemini adapters, strict LLM validation, and TDD-hardened length loop checks.


	---

	## High-Level Architecture Flowchart

	Below is the complete data flow from raw video file input to the final self-corrected translated subtitles, mapped across the three translation backends and the final LLM validation pass:

	![System Architecture Diagram](architecture.png)

	```mermaid
	graph TD
	%% Input
	A[Input Video File] -->\|FFmpeg Extraction\| B(Mono WAV Audio @ 16kHz)

	%% Transcription
	B -->\|Local Offline\| C[faster-whisper Engine]
	C -->\|Model Size: medium + Phonetic Bias\| D[English Audio Transcription]
	D -->\|Precision Patching\| DP[LLM Entity Corrector]
	DP -->\|Segments Parsing\| E[English SRT File / Raw Lists]

	%% Translation Branching
	E -->\|Select Translation Engine\| F{Translation Selector}

	%% Google Translate Path
	F -->\|deep-translator\| G[DeepTranslatorAdapter]
	G -->\|Line-by-Line Request\| H[Translated Subtitles Draft]

	%% Groq LLM Path
	F -->\|Groq Cloud LLM\| I[GroqAdapter]
	I -->\|Contextual Batching: 10 Lines\| J[Llama 3.3 70B Engine]
	J -->\|Idiomatic, Natural Translation\| H

	%% Gemini LLM Path
	F -->\|Gemini API\| K[GeminiAdapter]
	K -->\|Full Context Batching: Entire File\| L[Gemini 2.5 Flash / 3.1 Pro]
	L -->\|Content Isolation & Glossary Prompting\| H

	%% Validation & Correction Path (Automatic)
	H -->\|LLM Reviewer Pass\| M[Validation Service]
	M -->\|30-Line Batches\| N[Gemini 3.1 Pro / Llama 3.3 70B Quality Editor]
	N -->\|Conservative Rules Audit\| O{Errors Found?}

	%% Validation Output
	O -->\|Yes\| P[Classify & Auto-Correct]
	P -->\|Logs to JSONL Dataset\| Q[Parse Corrected Line]
	O -->\|No\| R[ALL_CORRECT — Keep original]

	%% Final Integration
	Q --> S[Merge Corrections into SRT Generator]
	R --> S

	S --> T[Final Target Language SRT File]

	%% Styles
	classDef main fill:#e3f2fd,stroke:#1565c0,stroke-width:2px;
	classDef process fill:#f1f8e9,stroke:#558b2f,stroke-width:1.5px;
	classDef warning fill:#fff8e1,stroke:#f57f17,stroke-width:1.5px;
	class A,T main;
	class C,J,L,N process;
	class M,P warning;
	```

	---

	## Detailed Component Breakdown

	### 1. Audio Extraction & Transcription Stage
	- Extraction: Utilizing Python FFmpeg, the system extracts the audio stream from the target video file and normalizes it to a single-channel, 16kHz WAV file (`pcm_s16le`).
	- Engine: Transcribes audio locally and offline using the `faster-whisper` engine.
	- Model: Configured to use the `medium` (769M parameters) model for maximum semantic precision.
	- Phonetic Bias: Injects a custom `initial_prompt` into the Whisper decoder to bias it toward specific technical terms and brand names (e.g., "Naukri", "NotebookLM").
	- Precision Patching: A dedicated LLM pass (Gemini) that scans for low-confidence entities and corrects them before translation, ensuring name consistency.

	### 2. Security & Integrity: Content Isolation
	- Escrow Tags: All transcript content sent to LLMs is wrapped in `<l>...</l>` isolation tags.
	- Instruction Proofing: System prompts are hardened to treat all content within tags as inert data, preventing "Instruction Leakage" if the transcript mentions AI-related keywords.

	### 2. Translation Stage
	Subtitles can be translated using three unique adapter pathways implementing the `Translator` interface:
	- `DeepTranslatorAdapter` (Google Translate): Processes subtitles line-by-line using free endpoints. This approach is highly literal and safe from semantic hallucinations, but lacks conversational flow and can be stylistically repetitive.
	- `GroqAdapter` (Llama 3.3 70B): Processes subtitles in conversational batches of 10 lines with contextual system prompts. Preserves conversational threads and flow.
	- `GeminiAdapter` (Gemini 2.5 Flash / 3.1 Pro): Now uses Full-Context Batching. It processes the entire subtitle file in a single request (optimized for Gemini's massive 1M+ token window).
	- Glossary Injection: Dynamically injects project-specific translation rules and cultural mappings (idioms) into the system prompt.
	- Singleton Pattern: Managed via a class-level singleton to ensure zero redundant resource overhead and clean session logging.

	### 3. LLM Reviewer & Validation Stage (Self-Correction Pass)
	To eliminate severe semantic errors (meaning inversions, dropped sentences, severe mistranslations) introduced by LLM adapters, a self-correction validation engine runs after the translation draft is generated:
	- Batching: English/Translated pairs are processed in batches of 30 lines.
	- Model Cascade: Leverages `gemini-3.1-pro-preview` with native fallbacks to `2.5-pro` and `3-flash`, or natively falls back to `llama-3.3-70b-versatile` if Gemini is missing or exhausted.
	- Conservative System Rules: The LLM adopts a "hands-off-by-default" strategy. It is forbidden from changing lines for formatting or style, ensuring zero false positives.
	- Reason Classification Dataset: Catches, corrects, and logs fixes to `logs/translation_failures_{timestamp}.jsonl` for observability:
	- `NEGATION_FAILURE`
	- `SLANG_FAILURE`
	- `PRONOUN_CONFUSION`
	- `SPEAKER_CONFUSION`
	- `MISSING_CONTEXT`
	- `TOO_LITERAL`
	- `CULTURAL_REFERENCE`
	- `HALLUCINATION`
	- `OMISSION`
	- `OTHER`
	- Parser & Integrator: Corrections are parsed out of `[LINE_NUMBER][CATEGORY]` tags, replaced back in the timeline, and logged to the terminal console with a categorized review summary.

	---

	## Technical Performance Stats
	- Transcription Speed: Fast CPU/GPU processing via Whisper `medium`.
	- Gemini Throughput: Batches of 30 lines successfully handled per API request. Zero translation truncation due to TDD-verified loop retries.
	- Validation Fallback Resiliency: If rate limits hit, the validator seamlessly cascades down through models to preserve CI/CD test stability.