Subtrans / tasks.md

arjun-ms

Initial commit: Subtrans Subtitle Pipeline

57bbccb 15 days ago


	# Tasks: Glossary Bias & Idiom Handling Implementation

	This document tracks the tasks required to implement Option C (Context-Aware Glossary Prompting) and Option A (Whisper Decoder Bias list) to solve subtitle errors in `ai-job-hunt.mp4`.

	## 📋 Status Overview
	- Base Branch: `feat/gemini-adapter-from-whisper-medium`
	- Target Branch: `feat/glossary-idiom-handling-from-feat-gemini-adapter-from-whisper-medium`
	- TDD Test Suite: Already drafted at [test_glossary_and_context.py](file:///e:/Work/AI%20translator%20antigravity/app/tests/test_glossary_and_context.py)

	---

	## 🛠️ Task List

	### Phase 1: Git Branch Setup
	- [x] Stash current working directory changes to keep them safe.
	- [x] Check out the base branch `feat/gemini-adapter-from-whisper-medium`.
	- [x] Create and check out the new feature branch `feat/glossary-idiom-handling-from-feat-gemini-adapter-from-whisper-medium`.
	- [x] Unstash/Apply the working directory changes onto the new branch.

	### Phase 2: Whisper-Level Decoder Biasing (Option A)
	- [x] Define the target words for Whisper phonetic bias:
	  - `"Naukri"`, `"NotebookLM"`, `"Razorpay"`, `"LinkedIn"`, `"Bay Area"`, `"San Francisco"`, `"notebooklm.google.com"`
	- [x] Update `app/services/transcribe.py` to accept and pass `initial_prompt` into `model.transcribe()` for both GPU and CPU execution paths.
	- [x] Verify that Whisper transcribe tests in [test_glossary_and_context.py](file:///e:/Work/AI%20translator%20antigravity/app/tests/test_glossary_and_context.py) pass cleanly.
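
The bias list above could be turned into an `initial_prompt` roughly as follows. This is a minimal sketch: `BIAS_TERMS` and `build_initial_prompt` are hypothetical names, the comma-joined format is an assumption, and the actual code in `app/services/transcribe.py` may differ.

```python
# Hypothetical helper; not taken from app/services/transcribe.py.
BIAS_TERMS = [
    "Naukri", "NotebookLM", "Razorpay", "LinkedIn",
    "Bay Area", "San Francisco", "notebooklm.google.com",
]


def build_initial_prompt(terms):
    """Join unique, non-empty terms into one comma-separated prompt string."""
    seen = set()
    unique = []
    for term in terms:
        cleaned = term.strip()
        # Deduplicate case-insensitively while keeping the first spelling seen.
        if cleaned and cleaned.lower() not in seen:
            seen.add(cleaned.lower())
            unique.append(cleaned)
    return ", ".join(unique)


prompt = build_initial_prompt(BIAS_TERMS)
# Assumed call shape, identical for the GPU and CPU paths:
# result = model.transcribe(audio_path, initial_prompt=prompt)
```

Keeping the prompt short is probably wise, since Whisper conditions on only a limited number of prompt tokens.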

	### Phase 3: Translation-Level Glossary Prompting (Option C)
	- [x] Define a structured glossary schema (source word/phrase $\rightarrow$ translation/rule).
	- [x] Update `GeminiAdapter.translate_batch()` in `app/services/translators/gemini_adapter.py` to accept the optional `glossary` parameter.
	- [x] Format and inject glossary directives into the native `system_instruction` configuration when instantiating `GenerativeModel`.
	  - Brand names and URLs should be protected: "Do NOT translate or transliterate."
	  - Slang and idioms should map to culturally appropriate expressions (e.g. "nerve-wracking" $\rightarrow$ "ആവേശകരമായ" in Malayalam).
	- [x] Verify that the glossary injection tests in [test_glossary_and_context.py](file:///e:/Work/AI%20translator%20antigravity/app/tests/test_glossary_and_context.py) pass cleanly.
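
One possible shape for the glossary and its rendering into prompt directives is sketched below. The schema (a dict where `None` means "protect verbatim" and a string means "use this rendering") and the function name `format_glossary_directives` are assumptions for illustration, not the project's actual schema.

```python
# Hypothetical schema: term -> None means "protect verbatim",
# term -> str means "use this target-language rendering".
def format_glossary_directives(glossary):
    """Render a glossary dict as plain-text directives for a system prompt."""
    protected = [term for term, target in glossary.items() if target is None]
    mapped = [(term, target) for term, target in glossary.items() if target is not None]

    lines = []
    if protected:
        lines.append("Do NOT translate or transliterate these terms:")
        lines.extend(f'- "{term}"' for term in protected)
    if mapped:
        lines.append("Translate these expressions exactly as specified:")
        lines.extend(f'- "{term}" -> "{target}"' for term, target in mapped)
    return "\n".join(lines)


glossary = {
    "Naukri": None,
    "notebooklm.google.com": None,
    "nerve-wracking": "ആവേശകരമായ",
}
directives = format_glossary_directives(glossary)
# Assumed wiring: include `directives` in the system_instruction text passed
# when GeminiAdapter constructs its GenerativeModel.
```

An empty glossary yields an empty string, so the adapter can skip injection when no glossary is supplied.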

	### Phase 4: Full-Context Subtitle Translation (Prevention of Batch Edge Context Loss)
	- [x] Modify `translate_srt()` in `app/services/srt_generator.py` to accept and forward the `glossary` dict.
	- [x] Refactor `_translate_batched()` in `app/services/srt_generator.py` to send ALL subtitle lines in a single `translate_batch()` call rather than splitting into 30-line batches.
	  - A typical 10-minute video has only ~300 subtitle lines (~6k tokens, at roughly 20 tokens per line), which fits comfortably within Gemini 2.5 Flash's ~1M-token context window, so the LLM sees the complete conversation from start to finish.
	- [x] Verify that the full-context batch tests in [test_glossary_and_context.py](file:///e:/Work/AI%20translator%20antigravity/app/tests/test_glossary_and_context.py) pass cleanly.
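
The single-call approach could look roughly like this. It is a sketch under assumptions: `translate_full_context` is a hypothetical name, and the `translate_batch(lines, glossary=...)` call shape is assumed rather than copied from `GeminiAdapter`. A stand-in translator is used so the example runs without an API key.

```python
def translate_full_context(lines, translate_batch, glossary=None):
    """Translate every subtitle line in one call so no context is lost at batch edges."""
    if not lines:
        return []
    translated = translate_batch(lines, glossary=glossary)
    # With one large request, a line-count mismatch is the main failure mode,
    # so check alignment before the output is written back into the SRT.
    if len(translated) != len(lines):
        raise ValueError(f"expected {len(lines)} lines, got {len(translated)}")
    return translated


# Stand-in translator used only to exercise the sketch.
calls = []


def fake_translate_batch(batch, glossary=None):
    calls.append(len(batch))
    return [f"ml:{text}" for text in batch]


subtitles = [f"line {i}" for i in range(300)]
result = translate_full_context(subtitles, fake_translate_batch, glossary={"Naukri": None})
```

Here `calls` ends up with a single entry of 300, confirming that all lines went out in one request rather than in 30-line batches.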

	### Phase 5: Verification & End-to-End Validation
	- [x] Run the complete test suite: `python -m pytest app/tests/ -v`.
	- [x] Run an end-to-end subtitle generation test on `ai-job-hunt.mp4` to verify that the generated Malayalam SRT preserves Naukri, NotebookLM, and San Francisco, and renders idioms according to the glossary.
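
The end-to-end check for protected terms could be automated along these lines. This is only a sketch: `missing_protected_terms` is a hypothetical helper, and the sample SRT below is made-up placeholder content, not output from the pipeline.

```python
def missing_protected_terms(srt_text, terms):
    """Return the protected terms that never appear in the SRT's subtitle text."""
    # Drop blank lines, cue numbers, and timestamp lines so only subtitle text is searched.
    text_lines = [
        line for line in srt_text.splitlines()
        if line.strip() and not line.strip().isdigit() and "-->" not in line
    ]
    body = "\n".join(text_lines)
    return [term for term in terms if term not in body]


# Placeholder SRT for illustration only.
sample_srt = """1
00:00:01,000 --> 00:00:03,000
Naukri <translated text>

2
00:00:03,500 --> 00:00:06,000
NotebookLM <translated text>
"""
missing = missing_protected_terms(sample_srt, ["Naukri", "NotebookLM", "San Francisco"])
```

For this placeholder input, `missing` contains only `"San Francisco"`. An empty list on the real SRT would indicate that every protected term survived translation.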