Tasks: Glossary Bias & Idiom Handling Implementation

This document tracks the tasks required to implement Option C (Context-Aware Glossary Prompting) and Option A (Whisper decoder bias list) to fix subtitle errors in ai-job-hunt.mp4.

📋 Status Overview

  • Base Branch: feat/gemini-adapter-from-whisper-medium
  • Target Branch: feat/glossary-idiom-handling-from-feat-gemini-adapter-from-whisper-medium
  • TDD Test Suite: Already drafted at test_glossary_and_context.py

🛠️ Task list

Phase 1: Git Branch Setup

  • Stash current working directory changes to keep them safe.
  • Checkout base branch feat/gemini-adapter-from-whisper-medium.
  • Create and checkout the new feature branch: feat/glossary-idiom-handling-from-feat-gemini-adapter-from-whisper-medium
  • Unstash/Apply the working directory changes onto the new branch.

Phase 2: Whisper-Level Decoder Biasing (Option A)

  • Define the target words for Whisper phonetic bias:
    • "Naukri", "NotebookLM", "Razorpay", "LinkedIn", "Bay Area", "San Francisco", "notebooklm.google.com"
  • Update app/services/transcribe.py to accept and pass initial_prompt into model.transcribe() for both GPU and CPU execution paths.
  • Verify that Whisper transcribe tests in test_glossary_and_context.py pass cleanly.
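
The Phase 2 steps might look roughly like the sketch below. It assumes the loaded model exposes a `transcribe()` method that accepts an `initial_prompt` keyword, as the task above describes. The names `BIAS_TERMS`, `build_initial_prompt`, and `transcribe_with_bias` are hypothetical and not taken from app/services/transcribe.py.

```python
from typing import Any, Iterable, List

# Target words from the Phase 2 list above.
BIAS_TERMS: List[str] = [
    "Naukri", "NotebookLM", "Razorpay", "LinkedIn",
    "Bay Area", "San Francisco", "notebooklm.google.com",
]


def build_initial_prompt(terms: Iterable[str]) -> str:
    """Join de-duplicated, non-empty terms into one short prompt string."""
    seen = set()
    unique: List[str] = []
    for term in terms:
        cleaned = term.strip()
        key = cleaned.lower()
        if cleaned and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    if not unique:
        return ""
    return "Glossary: " + ", ".join(unique) + "."


def transcribe_with_bias(model: Any, audio_path: str,
                         terms: Iterable[str], **kwargs: Any) -> Any:
    """Hypothetical wrapper: forward initial_prompt to model.transcribe().

    The same call works for the GPU and CPU paths as long as both hand
    in a model object with a compatible transcribe() method (assumption).
    """
    prompt = build_initial_prompt(terms)
    if prompt:
        kwargs["initial_prompt"] = prompt
    return model.transcribe(audio_path, **kwargs)
```

Keeping the prompt construction in one helper means both execution paths bias the decoder with the same string. The term list should stay short, since the prompt shares the decoder's limited context.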

Phase 3: Translation-Level Glossary Prompting (Option C)

  • Define a structured glossary schema (source word/phrase → translation/rule).
  • Update GeminiAdapter.translate_batch() in app/services/translators/gemini_adapter.py to accept the optional glossary parameter.
  • Format and inject glossary directives into the native system_instruction configuration when instantiating GenerativeModel.
    • Brand names and URLs should be protected: "Do NOT translate or transliterate."
    • Slang and idioms should map to culturally appropriate expressions (e.g. "nerve-wracking" → "ആവേശകരമായ" in Malayalam).
  • Verify that the glossary injection tests in test_glossary_and_context.py pass cleanly.
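
One possible shape for the glossary schema and its directive formatting is sketched below. The flat `dict` layout, the `KEEP` sentinel, and the function name `format_glossary_directives` are assumptions for illustration. The actual schema and the wording of the directives are still to be defined by the tasks above.

```python
from typing import Dict, List, Optional

# Hypothetical sentinel: a rule of KEEP marks a protected term
# (brand name or URL) that must stay exactly as written.
KEEP = "KEEP"


def format_glossary_directives(glossary: Optional[Dict[str, str]]) -> str:
    """Turn a {source phrase: rule} glossary into system-instruction text."""
    if not glossary:
        return ""
    protected = [src for src, rule in glossary.items() if rule == KEEP]
    mapped = [(src, rule) for src, rule in glossary.items() if rule != KEEP]

    lines: List[str] = []
    if protected:
        lines.append("Do NOT translate or transliterate these terms:")
        lines.extend(f'- "{term}"' for term in protected)
    if mapped:
        lines.append("Translate these phrases exactly as specified:")
        lines.extend(f'- "{src}" -> "{dst}"' for src, dst in mapped)
    return "\n".join(lines)


example_glossary = {
    "Naukri": KEEP,
    "notebooklm.google.com": KEEP,
    "nerve-wracking": "ആവേശകരമായ",
}
print(format_glossary_directives(example_glossary))
```

The returned string would be appended to the adapter's base system instruction before the model is instantiated. An empty or missing glossary yields an empty string, so the parameter can stay optional.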

Phase 4: Full-Context Subtitle Translation (Prevention of Batch Edge Context Loss)

  • Modify translate_srt() in app/services/srt_generator.py to accept and forward the glossary dict.
  • Refactor _translate_batched() in app/services/srt_generator.py to send ALL subtitle lines in a single translate_batch() call rather than splitting into 30-line batches.
    • Since a typical 10-minute video has only about 300 subtitle lines (roughly 6k tokens, or about 20 tokens per line), the whole transcript fits comfortably inside Gemini 2.5 Flash's 1M+ token context window. The LLM therefore sees the complete conversation from start to finish, with no context lost at batch edges.
  • Verify that the full-context batch tests in test_glossary_and_context.py pass cleanly.
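
The single-call refactor could be sketched as follows. The `Translator` protocol and the function name `translate_all_at_once` are hypothetical. The real `translate_batch()` signature in gemini_adapter.py may take further arguments (for example a target language), which are omitted here.

```python
from typing import Dict, List, Optional, Protocol


class Translator(Protocol):
    """Assumed minimal interface of the translation adapter."""

    def translate_batch(self, lines: List[str],
                        glossary: Optional[Dict[str, str]] = None) -> List[str]:
        ...


def translate_all_at_once(translator: Translator, lines: List[str],
                          glossary: Optional[Dict[str, str]] = None) -> List[str]:
    """Send every subtitle line in one translate_batch() call."""
    if not lines:
        return []
    translated = translator.translate_batch(lines, glossary=glossary)
    # With a single large request, a dropped or merged line would shift
    # every later subtitle, so fail loudly instead of misaligning timings.
    if len(translated) != len(lines):
        raise ValueError(
            f"expected {len(lines)} translated lines, got {len(translated)}"
        )
    return translated
```

The length check matters more here than with 30-line batches, because one missing line in a single request would misalign every subtitle that follows it.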

Phase 5: Verification & End-to-End Validation

  • Run the complete test suite: python -m pytest app/tests/ -v.
  • Run an end-to-end subtitle generation test on ai-job-hunt.mp4 to verify that the generated Malayalam SRT preserves Naukri, NotebookLM, and San Francisco unchanged and renders idioms according to the glossary.
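
Part of the end-to-end check can be automated with a small helper like the one below. It is a minimal sketch: the name `missing_protected_terms` is hypothetical, and the check only reports terms that never appear in the output. It cannot confirm that every occurrence was preserved, so idiom quality still needs a manual review.

```python
from typing import Iterable, List


def missing_protected_terms(srt_text: str, terms: Iterable[str]) -> List[str]:
    """Return the protected terms that do not appear verbatim in the SRT."""
    # Case-sensitive on purpose: brand names should survive exactly as written.
    return [term for term in terms if term not in srt_text]


sample_srt = (
    "1\n"
    "00:00:01,000 --> 00:00:03,000\n"
    "ഞാൻ Naukri, NotebookLM ഉപയോഗിച്ചു.\n"
)
print(missing_protected_terms(sample_srt, ["Naukri", "NotebookLM", "San Francisco"]))
```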