# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Commit Guidelines
- Never include Claude branding (e.g. `Co-Authored-By: Claude`) in commits or anywhere in the codebase.
## Project Overview
IndicTrans2 Translation Tool: a Gradio web application deployed on HuggingFace Spaces that translates between English and 22+ Indian languages using the IndicTrans2 1B model (ai4bharat). Supports text translation and document translation (PDF/DOCX) with formatting preservation.
## Architecture
### Single-File Application
Everything lives in `app.py` (~1022 lines). There is no build system, no test framework, and no subdirectories.
### Core Components in app.py
- **Authentication** (lines 24-50): Session-based auth using `USERNAME`/`PASSWORD` environment variables. Sessions stored in an in-memory set.
- **IndicTrans2Translator class** (lines 52-437): Main translator. Loads two model pairs:
  - `en_indic_model` (ai4bharat/indictrans2-en-indic-1B): English to Indian languages
  - `indic_en_model` (ai4bharat/indictrans2-indic-en-1B): Indian languages to English
  - Indic-to-Indic translation chains through English as an intermediate step
- **Language mappings** (lines 439-491): `LANGUAGES` dict (display name → 2-letter code) and `LANGUAGE_SCRIPT_MAPPING` (2-letter code → IndicTrans2 language tag in `language_Script` form, e.g. `hin_Deva`)
- **Document processing** (lines 493-680): PDF extraction via PyPDF2, DOCX extraction/creation via python-docx with formatting metadata preservation
- **Translation handlers** (lines 682-819): `translate_text_input()` and `translate_document()`, the Gradio-facing functions decorated with `@spaces.GPU`
- **Gradio UI** (lines 825-1017): Login panel, text translation tab, document translation tab, examples, logout
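
The routing between the two models, including the Indic-to-Indic chain through English, can be sketched roughly as follows. This is an illustration only: `plan_hops` is a hypothetical helper, not a function in `app.py`, and the three-entry mapping is an assumed excerpt of `LANGUAGE_SCRIPT_MAPPING`.

```python
from __future__ import annotations

# Assumed excerpt; the real LANGUAGE_SCRIPT_MAPPING in app.py covers 22+ languages.
LANGUAGE_SCRIPT_MAPPING = {
    "en": "eng_Latn",
    "hi": "hin_Deva",
    "ta": "tam_Taml",
}


def plan_hops(src: str, tgt: str) -> list[tuple[str, str, str]]:
    """Return the (model attribute, source tag, target tag) hops for a request.

    Hypothetical helper: it only decides which model handles each hop,
    it does not run any translation.
    """
    if src == tgt:
        return []  # nothing to translate
    src_tag = LANGUAGE_SCRIPT_MAPPING[src]
    tgt_tag = LANGUAGE_SCRIPT_MAPPING[tgt]
    eng_tag = LANGUAGE_SCRIPT_MAPPING["en"]
    if src == "en":
        return [("en_indic_model", src_tag, tgt_tag)]
    if tgt == "en":
        return [("indic_en_model", src_tag, tgt_tag)]
    # Indic-to-Indic: pivot through English, so two hops are needed.
    return [
        ("indic_en_model", src_tag, eng_tag),
        ("en_indic_model", eng_tag, tgt_tag),
    ]
```

For a Hindi-to-Tamil request, `plan_hops("hi", "ta")` yields two hops, which is why Indic-to-Indic requests cost roughly twice the inference time of a direct one.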
### Translation Pipeline
1. Text split into sentences preserving paragraph structure (`split_into_sentences`)
2. Sentences batched (adaptive batch size: 1-4 based on sentence length)
3. IndicProcessor preprocesses batches → tokenizer encodes → model generates → tokenizer decodes → IndicProcessor postprocesses
4. Paragraph structure reconstructed (`reconstruct_formatting`)
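
A minimal, model-free sketch of steps 1, 2, and 4 may help when reading the real code. Everything here is an assumption made for illustration: the helper names, the sentence-splitting regex, and the length thresholds do not come from `app.py`, and step 3 is replaced by a caller-supplied `translate_batch` function.

```python
import re
from typing import Callable, List


def split_paragraphs(text: str) -> List[List[str]]:
    """Split text into paragraphs (one per line), each a list of sentences."""
    paragraphs = []
    for para in text.split("\n"):
        # Naive split after ., !, ? or the Devanagari danda (assumed rule).
        parts = re.split(r"(?<=[.!?।])\s+", para)
        paragraphs.append([s.strip() for s in parts if s.strip()])
    return paragraphs  # an empty list marks a blank line


def batch_size_for(sentence: str) -> int:
    """Longer sentences get smaller batches; thresholds are assumptions."""
    if len(sentence) > 400:
        return 1
    if len(sentence) > 200:
        return 2
    return 4


def make_batches(sentences: List[str]) -> List[List[str]]:
    """Greedily group sentences so no batch exceeds its strictest size limit."""
    batches: List[List[str]] = []
    current: List[str] = []
    limit = 4
    for s in sentences:
        new_limit = min(limit, batch_size_for(s)) if current else batch_size_for(s)
        if current and len(current) + 1 > new_limit:
            batches.append(current)
            current, new_limit = [], batch_size_for(s)
        current.append(s)
        limit = new_limit
    if current:
        batches.append(current)
    return batches


def translate_text(text: str, translate_batch: Callable[[List[str]], List[str]]) -> str:
    """Run split -> batch -> translate -> reconstruct with a pluggable translator."""
    paragraphs = split_paragraphs(text)
    flat = [s for para in paragraphs for s in para]
    translated: List[str] = []
    for batch in make_batches(flat):
        translated.extend(translate_batch(batch))
    # Reassemble sentences into their original paragraphs.
    lines, pos = [], 0
    for para in paragraphs:
        lines.append(" ".join(translated[pos:pos + len(para)]))
        pos += len(para)
    return "\n".join(lines)
```

Passing a stand-in such as `lambda batch: [s.upper() for s in batch]` as `translate_batch` exercises the splitting and reconstruction logic without loading a model.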
### Key Dependencies
- `IndicTransToolkit` (from GitHub: VarunGumma/IndicTransToolkit): preprocessing/postprocessing for IndicTrans2
- `transformers`: model loading and inference
- `gradio`: web UI
- `spaces`: HuggingFace Spaces GPU allocation (`@spaces.GPU` decorator)
- `torch`: GPU inference with bfloat16/float16 optimization
## Running Locally
```bash
pip install -r requirements.txt
python app.py
```
Requires a CUDA GPU for reasonable performance (falls back to CPU with float32). Set the environment variables `USERNAME` and `PASSWORD` to define the authentication credentials.
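
As a rough picture of how those two variables feed the session-based check, consider the sketch below. It is not the code in `app.py`: the function names, the token format, and the choice to refuse logins when the variables are unset are all assumptions.

```python
from __future__ import annotations

import hmac
import os
import secrets

# In-memory session store; contents are lost whenever the process restarts.
_sessions: set[str] = set()


def login(username: str, password: str) -> str | None:
    """Return a new session token if the credentials match the environment."""
    expected_user = os.environ.get("USERNAME", "")
    expected_pass = os.environ.get("PASSWORD", "")
    if not expected_user or not expected_pass:
        return None  # assumed policy: no configured credentials, no login
    user_ok = hmac.compare_digest(username.encode(), expected_user.encode())
    pass_ok = hmac.compare_digest(password.encode(), expected_pass.encode())
    if not (user_ok and pass_ok):
        return None
    token = secrets.token_hex(16)
    _sessions.add(token)
    return token


def is_authenticated(token: str | None) -> bool:
    """Check whether a token belongs to a live session."""
    return token is not None and token in _sessions


def logout(token: str) -> None:
    """Drop the session; unknown tokens are ignored."""
    _sessions.discard(token)
```

Because the session set lives in process memory, every restart of the app logs all users out.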
## Deployment
Deployed as a HuggingFace Space. Configuration in `config.yaml` specifies the Gradio SDK, a T4-small GPU, and small storage. Git LFS is configured in `.gitattributes` for model weight files.
## Key Design Decisions
- **No test suite**: Manual testing only. Changes should be verified by running the app locally.
- **GPU optimization**: Uses `device_map="auto"` with accelerate, bfloat16 when supported, `torch.compile` when not using device_map, and `torch.inference_mode()` for generation.
- **Batch translation fallback**: If a batch fails, retries each sentence individually (lines 387-428).
- **DOCX formatting preservation**: Extracts paragraph-level and run-level formatting from source documents and applies "dominant formatting" (majority vote on bold/italic/underline, most common font) to translated output.
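
The "dominant formatting" vote can be illustrated without python-docx. In this sketch the `RunFormat` record and `dominant_format` function are hypothetical, each non-empty run counts as one vote, and a flag wins only with a strict majority; the real code may weight or break ties differently. python-docx reports bold, italic, and underline as tri-state values (`True`, `False`, or `None` for inherited), which this sketch simplifies to plain booleans.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RunFormat:
    """Hypothetical record of one run's text and formatting."""
    text: str
    bold: bool = False
    italic: bool = False
    underline: bool = False
    font: Optional[str] = None


def dominant_format(runs: List[RunFormat]) -> dict:
    """Pick one formatting profile for a paragraph by majority vote."""
    voters = [r for r in runs if r.text.strip()]  # ignore whitespace-only runs
    if not voters:
        return {"bold": False, "italic": False, "underline": False, "font": None}

    def majority(attr: str) -> bool:
        # Strict majority: more than half of the runs must set the flag.
        return sum(bool(getattr(r, attr)) for r in voters) * 2 > len(voters)

    fonts = Counter(r.font for r in voters if r.font)
    return {
        "bold": majority("bold"),
        "italic": majority("italic"),
        "underline": majority("underline"),
        "font": fonts.most_common(1)[0][0] if fonts else None,
    }
```

A single profile per paragraph is a natural fit here because translated text rarely aligns word-for-word with the source runs, so per-run formatting cannot be carried over directly.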