# Tasks: Glossary Bias & Idiom Handling Implementation
This document tracks the tasks required to implement Option C (context-aware glossary prompting) and Option A (Whisper decoder bias list) to fix subtitle errors in `ai-job-hunt.mp4`.
## 📋 Status Overview
- **Base Branch:** `feat/gemini-adapter-from-whisper-medium`
- **Target Branch:** `feat/glossary-idiom-handling-from-feat-gemini-adapter-from-whisper-medium`
- **TDD Test Suite:** Already drafted at [test_glossary_and_context.py](file:///e:/Work/AI%20translator%20antigravity/app/tests/test_glossary_and_context.py)
---
## 🛠️ Task List
### Phase 1: Git Branch Setup
- [x] Stash current working directory changes to keep them safe.
- [x] Check out the base branch `feat/gemini-adapter-from-whisper-medium`.
- [x] Create and check out the new feature branch `feat/glossary-idiom-handling-from-feat-gemini-adapter-from-whisper-medium`.
- [x] Unstash and apply the working directory changes onto the new branch.
### Phase 2: Whisper-Level Decoder Biasing (Option A)
- [x] Define the target words for Whisper phonetic bias:
  - `"Naukri"`, `"NotebookLM"`, `"Razorpay"`, `"LinkedIn"`, `"Bay Area"`, `"San Francisco"`, `"notebooklm.google.com"`
- [x] Update `app/services/transcribe.py` to accept and pass `initial_prompt` into `model.transcribe()` for both GPU and CPU execution paths.
- [x] Verify that Whisper transcribe tests in [test_glossary_and_context.py](file:///e:/Work/AI%20translator%20antigravity/app/tests/test_glossary_and_context.py) pass cleanly.
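
As an illustration of Option A, here is a minimal sketch of how the target words could be folded into a single `initial_prompt` string and forwarded to `model.transcribe()`. The helper names (`build_initial_prompt`, `transcribe_with_bias`) and the `"Glossary: ..."` wording are assumptions for this example, not the code in `app/services/transcribe.py`. The `model` argument is any object exposing a Whisper-style `transcribe()` method.

```python
from typing import Any, Iterable

# Target terms copied from the task list above.
BIAS_TERMS = [
    "Naukri", "NotebookLM", "Razorpay", "LinkedIn",
    "Bay Area", "San Francisco", "notebooklm.google.com",
]


def build_initial_prompt(terms: Iterable[str]) -> str:
    """Join unique, non-empty terms into one short prompt string.

    The "Glossary: ..." wording is an assumption; any short text that
    contains the correct spellings serves the same purpose.
    """
    seen: set[str] = set()
    unique: list[str] = []
    for term in terms:
        cleaned = term.strip()
        key = cleaned.lower()
        if cleaned and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    if not unique:
        return ""
    return "Glossary: " + ", ".join(unique) + "."


def transcribe_with_bias(model: Any, audio_path: str, terms: Iterable[str]) -> Any:
    """Hypothetical wrapper: pass initial_prompt only when there is one."""
    prompt = build_initial_prompt(terms)
    kwargs = {"initial_prompt": prompt} if prompt else {}
    return model.transcribe(audio_path, **kwargs)
```

Because the same wrapper is used regardless of device, the GPU and CPU paths would receive an identical prompt. Whisper conditions on only a limited number of prompt tokens, so keeping the term list short is the safer choice.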
### Phase 3: Translation-Level Glossary Prompting (Option C)
- [x] Define a structured glossary schema (source word/phrase $\rightarrow$ translation/rule).
- [x] Update `GeminiAdapter.translate_batch()` in `app/services/translators/gemini_adapter.py` to accept the optional `glossary` parameter.
- [x] Format and inject glossary directives into the native `system_instruction` configuration when instantiating `GenerativeModel`.
  - Brand names and URLs should be protected: *"Do NOT translate or transliterate."*
  - Slang and idioms should map to culturally correct expressions (e.g., *"nerve-wracking"* $\rightarrow$ *"ആവേശകരമായ"* in Malayalam).
- [x] Verify that the glossary injection tests in [test_glossary_and_context.py](file:///e:/Work/AI%20translator%20antigravity/app/tests/test_glossary_and_context.py) pass cleanly.
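
The glossary rules above can be sketched as a small formatter that turns a glossary dict into directive text suitable for `system_instruction`. The schema used here (source phrase mapped to a target rendering, with `None` meaning "keep verbatim") and the function name `format_glossary_directives` are assumptions for illustration. The actual schema in the repository may differ.

```python
from typing import Mapping, Optional

# Hypothetical schema: source phrase -> target rendering.
# A value of None marks a protected term (brand name or URL).
Glossary = Mapping[str, Optional[str]]


def format_glossary_directives(glossary: Glossary, target_language: str) -> str:
    """Render the glossary as plain-text directives for a system prompt."""
    protected = [src for src, tgt in glossary.items() if tgt is None]
    mapped = [(src, tgt) for src, tgt in glossary.items() if tgt is not None]

    lines: list[str] = []
    if protected:
        lines.append("Do NOT translate or transliterate these terms:")
        lines.extend(f'- "{term}"' for term in protected)
    if mapped:
        lines.append(f"Use these {target_language} renderings:")
        lines.extend(f'- "{src}" -> "{tgt}"' for src, tgt in mapped)
    return "\n".join(lines)


example = {
    "Naukri": None,
    "notebooklm.google.com": None,
    "nerve-wracking": "ആവേശകരമായ",
}
print(format_glossary_directives(example, "Malayalam"))
```

The resulting string would be appended to the base system instruction before the model is instantiated. Keeping protected terms and mapped idioms in separate sections makes each rule unambiguous to the model.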
### Phase 4: Full-Context Subtitle Translation (Prevention of Batch Edge Context Loss)
- [x] Modify `translate_srt()` in `app/services/srt_generator.py` to accept and forward the `glossary` dict.
- [x] Refactor `_translate_batched()` in `app/services/srt_generator.py` to send **ALL** subtitle lines in a single `translate_batch()` call rather than splitting into 30-line batches.
  - A typical 10-minute video has only ~300 subtitle lines (~6k tokens), which fits comfortably within Gemini 2.5 Flash's 1M+ token context window, so the LLM sees the complete conversation from start to finish and no context is lost at batch boundaries.
- [x] Verify that the full-context batch tests in [test_glossary_and_context.py](file:///e:/Work/AI%20translator%20antigravity/app/tests/test_glossary_and_context.py) pass cleanly.
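
A minimal sketch of the single-call approach is shown below. The `translate_batch(lines, glossary=...)` signature and the helper name `translate_all_lines` are assumptions for this example. The real `_translate_batched()` in `app/services/srt_generator.py` may be shaped differently. The length check is an added safeguard, since subtitle timing depends on getting exactly one translated line back per input line.

```python
from typing import Mapping, Optional, Protocol, Sequence


class BatchTranslator(Protocol):
    """Assumed adapter interface; mirrors the task description only loosely."""

    def translate_batch(
        self,
        lines: Sequence[str],
        glossary: Optional[Mapping[str, Optional[str]]] = None,
    ) -> list[str]: ...


def translate_all_lines(
    adapter: BatchTranslator,
    lines: Sequence[str],
    glossary: Optional[Mapping[str, Optional[str]]] = None,
) -> list[str]:
    """Send every subtitle line in ONE translate_batch() call."""
    if not lines:
        return []
    translated = adapter.translate_batch(list(lines), glossary=glossary)
    # Each input line must map to exactly one output line, otherwise
    # translated text would drift away from its timestamps.
    if len(translated) != len(lines):
        raise ValueError(
            f"expected {len(lines)} translated lines, got {len(translated)}"
        )
    return translated
```

With the figures above, ~6k tokens over ~300 lines works out to roughly 20 tokens per line, or about 0.6% of a 1M-token window, which leaves ample room for the glossary directives.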
### Phase 5: Verification & End-to-End Validation
- [x] Run the complete test suite: `python -m pytest app/tests/ -v`.
- [x] Run an end-to-end subtitle generation test on `ai-job-hunt.mp4` to verify that the generated Malayalam SRT preserves Naukri, NotebookLM, and San Francisco verbatim and renders idioms according to the glossary.
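
Part of the end-to-end check can be automated. The sketch below scans a generated SRT for protected terms that did not survive translation. The function names and the sample SRT are made up for illustration and are not part of the repository's test suite.

```python
import re
from typing import Iterable

# Matches SRT timing lines such as "00:00:01,000 --> 00:00:03,000".
_TIMING = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")


def subtitle_text(srt: str) -> str:
    """Return only the spoken-text lines of an SRT document.

    Simplification: a text line consisting solely of digits is treated
    as a cue index and skipped.
    """
    kept = []
    for line in srt.splitlines():
        stripped = line.strip()
        if not stripped or stripped.isdigit() or _TIMING.match(stripped):
            continue
        kept.append(stripped)
    return "\n".join(kept)


def missing_protected_terms(srt: str, terms: Iterable[str]) -> list[str]:
    """List the protected terms that do not appear verbatim in the SRT text."""
    text = subtitle_text(srt)
    return [term for term in terms if term not in text]


# Made-up sample with placeholder subtitle text.
sample = (
    "1\n00:00:01,000 --> 00:00:03,000\n... Naukri ...\n\n"
    "2\n00:00:03,500 --> 00:00:05,000\n... NotebookLM ...\n"
)
print(missing_protected_terms(sample, ["Naukri", "NotebookLM", "San Francisco"]))
```

An empty result means every listed term was preserved. This only makes sense for terms that are actually spoken in the video. Idiom quality still needs a human reviewer.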