# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Commit Guidelines
- Never include Claude branding (e.g. `Co-Authored-By: Claude`) in commits or anywhere in the codebase.
## Project Overview
IndicTrans2 Translation Tool: a Gradio web application deployed on HuggingFace Spaces that translates between English and 22+ Indian languages using the IndicTrans2 1B models (ai4bharat). It supports text translation and document translation (PDF/DOCX) with formatting preservation.
## Architecture
### Single-File Application
Everything lives in `app.py` (~1022 lines). There is no build system, no test framework, and no subdirectories.
### Core Components in app.py
- **Authentication** (lines 24-50): Session-based auth using the `USERNAME`/`PASSWORD` environment variables. Sessions are stored in an in-memory set.
- **IndicTrans2Translator class** (lines 52-437): Main translator. Loads two model pairs:
  - `en_indic_model` (ai4bharat/indictrans2-en-indic-1B): English to Indian languages
  - `indic_en_model` (ai4bharat/indictrans2-indic-en-1B): Indian languages to English
  - Indic-to-Indic translation chains through English as an intermediate step
- **Language mappings** (lines 439-491): `LANGUAGES` dict (display name → 2-letter code) and `LANGUAGE_SCRIPT_MAPPING` (2-letter code → IndicTrans2 language-script tag such as `hin_Deva`, i.e. an ISO 639-3 language code plus an ISO 15924 script code in the FLORES-200 style, not BCP-47)
- **Document processing** (lines 493-680): PDF extraction via PyPDF2; DOCX extraction and creation via python-docx, preserving formatting metadata
- **Translation handlers** (lines 682-819): `translate_text_input()` and `translate_document()`, the Gradio-facing functions decorated with `@spaces.GPU`
- **Gradio UI** (lines 825-1017): Login panel, text translation tab, document translation tab, examples, logout
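
The two-step language lookup and the English pivot described above can be sketched as follows. This is a minimal illustration rather than the code in `app.py`: the dictionaries are a three-language excerpt, the 2-letter codes are assumed to match what `LANGUAGES` uses, and `plan_hops` is a hypothetical helper name.

```python
# Illustrative excerpt only; the real dicts in app.py cover every supported language.
LANGUAGES = {"English": "en", "Hindi": "hi", "Tamil": "ta"}
LANGUAGE_SCRIPT_MAPPING = {"en": "eng_Latn", "hi": "hin_Deva", "ta": "tam_Taml"}


def plan_hops(src_name: str, tgt_name: str) -> list:
    """Return the (model, src_tag, tgt_tag) hops needed for one translation.

    Hypothetical helper: Indic-to-Indic requests are routed through English,
    so they take two hops; every other direction takes one.
    """
    src = LANGUAGE_SCRIPT_MAPPING[LANGUAGES[src_name]]
    tgt = LANGUAGE_SCRIPT_MAPPING[LANGUAGES[tgt_name]]
    eng = LANGUAGE_SCRIPT_MAPPING["en"]
    if src == tgt:
        return []  # same language on both sides, nothing to translate
    if src == eng:
        return [("en_indic_model", src, tgt)]
    if tgt == eng:
        return [("indic_en_model", src, tgt)]
    # Indic -> English -> Indic
    return [("indic_en_model", src, eng), ("en_indic_model", eng, tgt)]
```

Because the pivot runs both models in sequence, an Indic-to-Indic request costs roughly two translations and can compound errors from the first hop.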
### Translation Pipeline
1. Text is split into sentences, preserving paragraph structure (`split_into_sentences`)
2. Sentences are batched (adaptive batch size: 1-4, based on sentence length)
3. IndicProcessor preprocesses batches → tokenizer encodes → model generates → tokenizer decodes → IndicProcessor postprocesses
4. Paragraph structure is reconstructed (`reconstruct_formatting`)
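
Steps 1, 2, and 4 can be sketched in plain Python as below. This is a simplified stand-in, not the implementation in `app.py`: the function names are hypothetical, the sentence splitter is a naive regex, the character thresholds for batch size are assumptions, and `translate_batch` is a placeholder callable for the whole of step 3.

```python
import re


def split_paragraph_sentences(text: str) -> list:
    """Split text into paragraphs, then each paragraph into sentences."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    # Naive rule: break after ., !, ? or the Devanagari danda when whitespace follows.
    return [
        [s.strip() for s in re.split(r"(?<=[.!?।])\s+", p.strip()) if s.strip()]
        for p in paragraphs
    ]


def adaptive_batches(sentences: list) -> list:
    """Group sentences into batches of 1-4; long sentences get smaller batches.

    Assumed thresholds; the size is chosen from the first sentence of each batch.
    """
    batches, i = [], 0
    while i < len(sentences):
        n = len(sentences[i])
        size = 1 if n > 400 else 2 if n > 200 else 4
        batches.append(sentences[i : i + size])
        i += size
    return batches


def translate_preserving_paragraphs(text: str, translate_batch) -> str:
    """Translate batch by batch and rejoin the sentences into paragraphs."""
    out_paragraphs = []
    for sentences in split_paragraph_sentences(text):
        translated = []
        for batch in adaptive_batches(sentences):
            translated.extend(translate_batch(batch))
        out_paragraphs.append(" ".join(translated))
    return "\n\n".join(out_paragraphs)
```

Passing a dummy callable such as `lambda batch: [s.upper() for s in batch]` is a quick way to check that paragraph breaks survive the round trip without loading a model.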
### Key Dependencies
- `IndicTransToolkit` (from GitHub: VarunGumma/IndicTransToolkit): preprocessing/postprocessing for IndicTrans2
- `transformers`: model loading and inference
- `gradio`: web UI
- `spaces`: HuggingFace Spaces GPU allocation (`@spaces.GPU` decorator)
- `torch`: GPU inference with bfloat16/float16 optimization
## Running Locally
```bash
pip install -r requirements.txt
python app.py
```
A CUDA GPU is needed for reasonable performance; without one the app falls back to CPU with float32. Set the `USERNAME` and `PASSWORD` environment variables to the login credentials.
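
For orientation, environment-variable credentials combined with an in-memory session set might look roughly like the sketch below. It is an assumption-laden illustration, not the code in `app.py`: the function names, the token format, and the constant-time comparison are all choices made for this example.

```python
import hmac
import os
import secrets
from typing import Optional

# In-memory session store: every session is lost when the process restarts.
_active_sessions: set = set()


def login(username: str, password: str) -> Optional[str]:
    """Return a new session token if the credentials match the environment."""
    expected_user = os.environ.get("USERNAME", "")
    expected_pass = os.environ.get("PASSWORD", "")
    if not expected_user or not expected_pass:
        return None  # refuse to authenticate when credentials are not configured
    # compare_digest avoids leaking match position through timing.
    user_ok = hmac.compare_digest(username.encode(), expected_user.encode())
    pass_ok = hmac.compare_digest(password.encode(), expected_pass.encode())
    if user_ok and pass_ok:
        token = secrets.token_hex(16)
        _active_sessions.add(token)
        return token
    return None


def is_logged_in(token: Optional[str]) -> bool:
    """Check whether a token belongs to a live session."""
    return token is not None and token in _active_sessions


def logout(token: str) -> None:
    """Drop the session; unknown tokens are ignored."""
    _active_sessions.discard(token)
```

Note that some operating systems already define `USERNAME` for the logged-in user, so it is worth setting both variables explicitly before launching the app.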
## Deployment
Deployed as a HuggingFace Space. The configuration in `config.yaml` specifies the Gradio SDK, T4-small GPU hardware, and small storage. Git LFS is configured in `.gitattributes` for model weight files.
## Key Design Decisions
- **No test suite**: Manual testing only. Changes should be verified by running the app locally.
- **GPU optimization**: Uses `device_map="auto"` with accelerate, bfloat16 when supported, `torch.compile` when not using device_map, and `torch.inference_mode()` for generation.
- **Batch translation fallback**: If a batch fails, each sentence in it is retried individually (lines 387-428).
- **DOCX formatting preservation**: Extracts paragraph-level and run-level formatting from source documents and applies "dominant formatting" (majority vote on bold/italic/underline, most common font) to translated output.
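
The last two decisions can be sketched as small helpers. Both are hypothetical illustrations under stated assumptions, not the code in `app.py`: `translate_batch` is a placeholder callable, a sentence that still fails on its own is passed through untranslated, each run gets one vote regardless of its length, and ties resolve to "off".

```python
from collections import Counter


def translate_with_fallback(batch, translate_batch):
    """Translate a batch; if it raises, retry each sentence on its own."""
    try:
        return translate_batch(batch)
    except Exception:
        results = []
        for sentence in batch:
            try:
                results.extend(translate_batch([sentence]))
            except Exception:
                # Assumption: keep the source sentence rather than drop it.
                results.append(sentence)
        return results


def dominant_formatting(runs):
    """Pick one formatting for a paragraph by majority vote over its runs.

    Each run is a dict with boolean "bold", "italic", "underline" keys
    and a "font" name (or None).
    """
    if not runs:
        return {"bold": False, "italic": False, "underline": False, "font": None}
    result = {
        key: sum(bool(r.get(key)) for r in runs) * 2 > len(runs)
        for key in ("bold", "italic", "underline")
    }
    fonts = Counter(r["font"] for r in runs if r.get("font"))
    result["font"] = fonts.most_common(1)[0][0] if fonts else None
    return result
```

A single dominant format is needed because translated text rarely aligns word for word with the source runs, so per-run formatting cannot be mapped across directly.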