Tasks: Glossary Bias & Idiom Handling Implementation
This document tracks the tasks required to implement Option C (Context-Aware Glossary Prompting) and Option A (Whisper Decoder Bias list) to solve subtitle errors in ai-job-hunt.mp4.
📋 Status Overview
- Base Branch: feat/gemini-adapter-from-whisper-medium
- Target Branch: feat/glossary-idiom-handling-from-feat-gemini-adapter-from-whisper-medium
- TDD Test Suite: Already drafted at test_glossary_and_context.py
🛠️ Task List
Phase 1: Git Branch Setup
- Stash current working directory changes to keep them safe.
- Checkout base branch feat/gemini-adapter-from-whisper-medium.
- Create and checkout the new feature branch: feat/glossary-idiom-handling-from-feat-gemini-adapter-from-whisper-medium
- Unstash/Apply the working directory changes onto the new branch.
Phase 2: Whisper-Level Decoder Biasing (Option A)
- Define the target words for Whisper phonetic bias:
"Naukri","NotebookLM","Razorpay","LinkedIn","Bay Area","San Francisco","notebooklm.google.com"
- Update app/services/transcribe.py to accept and pass initial_prompt into model.transcribe() for both GPU and CPU execution paths.
- Verify that Whisper transcribe tests in test_glossary_and_context.py pass cleanly.
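A minimal sketch of the Phase 2 change, assuming the transcription model exposes a transcribe() method that accepts an initial_prompt keyword, as the task above implies. The helper names build_initial_prompt and transcribe_with_bias and the "Glossary: ..." prompt wording are hypothetical and not taken from app/services/transcribe.py.

```python
from typing import Any, Optional, Sequence

# Target words copied from the Phase 2 task list.
BIAS_TERMS = [
    "Naukri", "NotebookLM", "Razorpay", "LinkedIn",
    "Bay Area", "San Francisco", "notebooklm.google.com",
]


def build_initial_prompt(terms: Sequence[str]) -> Optional[str]:
    """Join the bias terms into one prompt string; return None if there are none."""
    cleaned = [t.strip() for t in terms if t and t.strip()]
    if not cleaned:
        return None
    # The exact prompt wording is an assumption; only the terms matter for biasing.
    return "Glossary: " + ", ".join(cleaned) + "."


def transcribe_with_bias(model: Any, audio_path: str,
                         terms: Sequence[str], **kwargs: Any) -> Any:
    """Call model.transcribe() with the same initial_prompt on every device path.

    `model` is any object with a transcribe(audio, **kwargs) method
    (assumed interface). Routing GPU and CPU calls through this one
    helper keeps the two execution paths from drifting apart.
    """
    prompt = build_initial_prompt(terms)
    if prompt is not None:
        kwargs["initial_prompt"] = prompt
    return model.transcribe(audio_path, **kwargs)
```

Both the GPU and CPU branches could call transcribe_with_bias(model, path, BIAS_TERMS) so that the prompt is built in exactly one place.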
Phase 3: Translation-Level Glossary Prompting (Option C)
- Define a structured glossary schema (source word/phrase $\rightarrow$ translation/rule).
- Update GeminiAdapter.translate_batch() in app/services/translators/gemini_adapter.py to accept the optional glossary parameter.
- Format and inject glossary directives into the native system_instruction configuration when instantiating GenerativeModel.
  - Brand names and URLs should be protected: "Do NOT translate or transliterate."
  - Slang and idioms should map to culturally correct expressions (e.g. "nerve-wracking" $\rightarrow$ "ആവേശകരമായ" in Malayalam).
- Verify that the glossary injection tests in test_glossary_and_context.py pass cleanly.
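One possible shape for the glossary and its directive text is sketched below. It assumes the glossary is a plain dict mapping each source term to either a fixed translation or None, where None marks a brand name or URL to protect. That schema, the function name build_glossary_directives, and the wording of the second directive are illustrative assumptions; only the "Do NOT translate or transliterate" phrase comes from the task list.

```python
from typing import Dict, List, Optional

# Assumed schema: source term -> fixed translation, or None to keep it verbatim.
Glossary = Dict[str, Optional[str]]

EXAMPLE_GLOSSARY: Glossary = {
    "Naukri": None,                      # brand name: protect
    "notebooklm.google.com": None,       # URL: protect
    "nerve-wracking": "ആവേശകരമായ",        # idiom: fixed Malayalam rendering
}


def build_glossary_directives(glossary: Glossary) -> str:
    """Render the glossary as text to append to the system instruction."""
    protected: List[str] = []
    mapped: List[str] = []
    for source, translation in glossary.items():
        if translation is None:
            protected.append(f'- "{source}"')
        else:
            mapped.append(f'- "{source}" -> "{translation}"')

    sections: List[str] = []
    if protected:
        sections.append(
            "Do NOT translate or transliterate these terms:\n" + "\n".join(protected)
        )
    if mapped:
        sections.append(
            "Always translate these expressions as shown:\n" + "\n".join(mapped)
        )
    # An empty glossary yields an empty string, so the base prompt stays unchanged.
    return "\n\n".join(sections)
```

The returned string would be concatenated with the adapter's existing system instruction before the model is instantiated. Keeping the formatter a pure function makes the injection tests independent of any network call.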
Phase 4: Full-Context Subtitle Translation (Prevention of Batch Edge Context Loss)
- Modify translate_srt() in app/services/srt_generator.py to accept and forward the glossary dict.
- Refactor _translate_batched() in app/services/srt_generator.py to send ALL subtitle lines in a single translate_batch() call rather than splitting into 30-line batches.
  - Since a typical 10-minute video has only about 300 subtitle lines (roughly 6k tokens), this easily fits inside Gemini 2.5 Flash's 1M+ token limit. This guarantees the LLM sees the complete conversation context from start to finish.
- Verify that the full-context batch tests in test_glossary_and_context.py pass cleanly.
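The single-call refactor could look roughly like the sketch below. It assumes translate_batch(lines, glossary=...) returns one translated string per input line; the helper names, the 20-tokens-per-line estimate (derived from the 300 lines and 6k tokens figure above), and the guard threshold are assumptions for illustration.

```python
from typing import Any, Dict, List, Optional, Sequence

TOKENS_PER_LINE = 20          # rough estimate: 6k tokens / 300 lines
MAX_INPUT_TOKENS = 1_000_000  # context budget cited in the task list


def estimate_tokens(lines: Sequence[str]) -> int:
    """Very rough token estimate based on line count only."""
    return len(lines) * TOKENS_PER_LINE


def translate_all_lines(translator: Any, lines: Sequence[str],
                        glossary: Optional[Dict[str, Optional[str]]] = None) -> List[str]:
    """Translate every subtitle line in ONE translate_batch() call."""
    if not lines:
        return []
    if estimate_tokens(lines) > MAX_INPUT_TOKENS:
        # Fail loudly rather than silently truncating context.
        raise ValueError("subtitle file is too large for a single request")

    translated = translator.translate_batch(list(lines), glossary=glossary)

    # A count mismatch would misalign translations with SRT timestamps.
    if len(translated) != len(lines):
        raise ValueError(
            f"expected {len(lines)} translated lines, got {len(translated)}"
        )
    return list(translated)
```

The length check matters more once batching is removed, because one dropped or merged line in a 300-line response would shift every subsequent subtitle.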
Phase 5: Verification & End-to-End Validation
- Run the complete test suite: python -m pytest app/tests/ -v.
- Run an end-to-end subtitle generation test on ai-job-hunt.mp4 to verify that the generated Malayalam SRT preserves Naukri, NotebookLM, and San Francisco verbatim and renders idioms as specified in the glossary.
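Part of the end-to-end check can be automated with a simple string comparison, sketched below. The function name find_missing_terms is hypothetical, and the check only covers protected terms. Whether an idiom was rendered well still needs a human reader.

```python
from typing import List, Sequence


def find_missing_terms(source_srt: str, translated_srt: str,
                       terms: Sequence[str]) -> List[str]:
    """Return protected terms that appear in the source but not in the translation.

    Matching is case-sensitive on purpose: a brand name that survives
    only in altered casing or transliterated form counts as missing.
    """
    return [
        term for term in terms
        if term in source_srt and term not in translated_srt
    ]


# Example with inline text; real use would read the two .srt files.
source = "I applied on Naukri and LinkedIn from San Francisco."
translated = "ഞാൻ Naukri വഴിയും ലിങ്ക്ഡ്ഇൻ വഴിയും San Francisco യിൽ നിന്ന് അപേക്ഷിച്ചു."
missing = find_missing_terms(source, translated, ["Naukri", "LinkedIn", "San Francisco"])
```

Here LinkedIn was transliterated in the sample translation, so it is the only term reported as missing.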