# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Commit Guidelines

- Never include Claude branding (e.g. `Co-Authored-By: Claude`) in commits or anywhere in the codebase.

## Project Overview

IndicTrans2 Translation Tool: a Gradio web application deployed on HuggingFace Spaces that translates between English and 22+ Indian languages using the IndicTrans2 1B model (ai4bharat). It supports text translation and document translation (PDF/DOCX) with formatting preservation.

## Architecture

### Single-File Application

Everything lives in `app.py` (~1022 lines). There is no build system, no test framework, and no subdirectories.

### Core Components in app.py

- **Authentication** (lines 24-50): Session-based auth using `USERNAME`/`PASSWORD` environment variables. Sessions are stored in an in-memory set.
- **IndicTrans2Translator class** (lines 52-437): Main translator. Loads two model pairs:
  - `en_indic_model` (ai4bharat/indictrans2-en-indic-1B): English to Indian languages
  - `indic_en_model` (ai4bharat/indictrans2-indic-en-1B): Indian languages to English
  - Indic-to-Indic translation chains through English as an intermediate step
- **Language mappings** (lines 439-491): `LANGUAGES` dict (display name → 2-letter code) and `LANGUAGE_SCRIPT_MAPPING` (2-letter code → IndicTrans2 language tag in FLORES-200 style, e.g. `hin_Deva`)
- **Document processing** (lines 493-680): PDF extraction via PyPDF2, DOCX extraction/creation via python-docx with formatting metadata preservation
- **Translation handlers** (lines 682-819): `translate_text_input()` and `translate_document()`, the Gradio-facing functions decorated with `@spaces.GPU`
- **Gradio UI** (lines 825-1017): Login panel, text translation tab, document translation tab, examples, logout

### Translation Pipeline

1. Text is split into sentences while preserving paragraph structure (`split_into_sentences`)
2. Sentences are batched (adaptive batch size: 1-4 based on sentence length)
3. IndicProcessor preprocesses batches → tokenizer encodes → model generates → tokenizer decodes → IndicProcessor postprocesses
4. Paragraph structure is reconstructed (`reconstruct_formatting`)

### Key Dependencies

- `IndicTransToolkit` (from GitHub: VarunGumma/IndicTransToolkit): preprocessing/postprocessing for IndicTrans2
- `transformers`: model loading and inference
- `gradio`: web UI
- `spaces`: HuggingFace Spaces GPU allocation (`@spaces.GPU` decorator)
- `torch`: GPU inference with bfloat16/float16 optimization

## Running Locally

```bash
pip install -r requirements.txt
python app.py
```

A CUDA GPU is required for reasonable performance; without one the app falls back to CPU with float32. Set the environment variables `USERNAME` and `PASSWORD` to define the authentication credentials.

## Deployment

Deployed as a HuggingFace Space. Configuration in `config.yaml` specifies the Gradio SDK, a T4-small GPU, and small storage. Git LFS is configured in `.gitattributes` for model weight files.

## Key Design Decisions

- **No test suite**: Manual testing only. Changes should be verified by running the app locally.
- **GPU optimization**: Uses `device_map="auto"` with accelerate, bfloat16 when supported, `torch.compile` when not using device_map, and `torch.inference_mode()` for generation.
- **Batch translation fallback**: If a batch fails, each sentence in it is retried individually (lines 387-428).
- **DOCX formatting preservation**: Extracts paragraph-level and run-level formatting from source documents and applies "dominant formatting" (majority vote on bold/italic/underline, most common font) to translated output.
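
The batch translation fallback listed under Key Design Decisions can be sketched roughly as follows. This is a minimal illustration, not the code in `app.py`: `translate_with_fallback` and the `translate_batch` callable are hypothetical names, and returning the untranslated sentence when a single-sentence retry also fails is an assumption about the desired behavior.

```python
from typing import Callable, List


def translate_with_fallback(
    sentences: List[str],
    translate_batch: Callable[[List[str]], List[str]],
) -> List[str]:
    """Translate a batch; if the batch call fails, retry sentence by sentence."""
    try:
        return translate_batch(sentences)
    except Exception:
        results: List[str] = []
        for sentence in sentences:
            try:
                # Retry with a batch of one so a single bad input
                # cannot sink the rest of the batch.
                results.extend(translate_batch([sentence]))
            except Exception:
                # Assumption: keep the source text so the output stays
                # aligned one-to-one with the input sentences.
                results.append(sentence)
        return results
```

Keeping the output the same length as the input matters here because the later paragraph reconstruction step relies on sentence positions.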
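
The "dominant formatting" vote used for DOCX output might look something like the sketch below. It assumes run-level formatting has already been extracted into plain dicts with `bold`, `italic`, `underline`, and `font` keys; that representation, the function name `dominant_formatting`, the strict-majority tie rule, and counting each run equally (rather than weighting by text length) are all assumptions for illustration and may differ from what `app.py` does.

```python
from collections import Counter
from typing import Dict, List


def dominant_formatting(runs: List[Dict[str, object]]) -> Dict[str, object]:
    """Collapse per-run formatting into one paragraph-level style."""
    if not runs:
        return {"bold": False, "italic": False, "underline": False, "font": None}

    def majority(flag: str) -> bool:
        # Strict majority: more than half of the runs must set the flag.
        return sum(1 for run in runs if run.get(flag)) * 2 > len(runs)

    # Most common font name among runs that specify one.
    fonts = Counter(run["font"] for run in runs if run.get("font"))
    font = fonts.most_common(1)[0][0] if fonts else None

    return {
        "bold": majority("bold"),
        "italic": majority("italic"),
        "underline": majority("underline"),
        "font": font,
    }
```

A single style per paragraph is a reasonable target because translated text no longer maps word-for-word onto the source runs, so per-run formatting cannot be carried over directly.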