# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Commit Guidelines

- Never include Claude branding (e.g. `Co-Authored-By: Claude`) in commits or anywhere in the codebase.

## Project Overview

IndicTrans2 Translation Tool — a Gradio web application deployed on HuggingFace Spaces that translates between English and 22+ Indian languages using the IndicTrans2 1B model (ai4bharat). Supports text translation and document translation (PDF/DOCX) with formatting preservation.

## Architecture

### Single-File Application
Everything lives in `app.py` (~1022 lines). There is no build system, no test framework, and no subdirectories.

### Core Components in app.py

- **Authentication** (lines 24-50): Session-based auth using `USERNAME`/`PASSWORD` environment variables. Sessions stored in an in-memory set.
- **IndicTrans2Translator class** (lines 52-437): Main translator. Loads two models, one per translation direction:
  - `en_indic_model` (ai4bharat/indictrans2-en-indic-1B) — English to Indian languages
  - `indic_en_model` (ai4bharat/indictrans2-indic-en-1B) — Indian languages to English
  - Indic-to-Indic translation chains through English as an intermediate step
- **Language mappings** (lines 439-491): `LANGUAGES` dict (display name → 2-letter code) and `LANGUAGE_SCRIPT_MAPPING` (2-letter code → IndicTrans2 language tag in FLORES-200 style, i.e. language code plus script code, such as `hin_Deva`)
- **Document processing** (lines 493-680): PDF extraction via PyPDF2, DOCX extraction/creation via python-docx with formatting metadata preservation
- **Translation handlers** (lines 682-819): `translate_text_input()` and `translate_document()` — Gradio-facing functions decorated with `@spaces.GPU`
- **Gradio UI** (lines 825-1017): Login panel, text translation tab, document translation tab, examples, logout
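
To make the two-step lookup and the English pivot concrete, here is a minimal sketch. The dict contents are a three-language subset chosen for illustration, and `plan_translation` is a hypothetical helper that does not exist in `app.py`. It only shows which model each hop would use.

```python
# Illustrative subset only; the real dicts in app.py cover 22+ languages.
LANGUAGES = {"English": "en", "Hindi": "hi", "Tamil": "ta"}
LANGUAGE_SCRIPT_MAPPING = {"en": "eng_Latn", "hi": "hin_Deva", "ta": "tam_Taml"}


def plan_translation(src_name: str, tgt_name: str) -> list[tuple[str, str, str]]:
    """Return the (model attribute, source tag, target tag) hops for a language pair.

    Hypothetical helper: it mirrors the routing described above, not the
    actual control flow inside IndicTrans2Translator.
    """
    # Display name -> 2-letter code -> IndicTrans2 tag.
    src = LANGUAGE_SCRIPT_MAPPING[LANGUAGES[src_name]]
    tgt = LANGUAGE_SCRIPT_MAPPING[LANGUAGES[tgt_name]]
    eng = LANGUAGE_SCRIPT_MAPPING["en"]

    if src == tgt:
        return []  # nothing to translate
    if src == eng:
        return [("en_indic_model", src, tgt)]
    if tgt == eng:
        return [("indic_en_model", src, tgt)]
    # Indic-to-Indic: pivot through English using both models in sequence.
    return [("indic_en_model", src, eng), ("en_indic_model", eng, tgt)]
```

For example, `plan_translation("Hindi", "Tamil")` yields two hops, first through `indic_en_model` and then through `en_indic_model`.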

### Translation Pipeline
1. Text split into sentences preserving paragraph structure (`split_into_sentences`)
2. Sentences batched (adaptive batch size: 1-4 based on sentence length)
3. IndicProcessor preprocesses batches → tokenizer encodes → model generates → tokenizer decodes → IndicProcessor postprocesses
4. Paragraph structure reconstructed (`reconstruct_formatting`)
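
The four steps above can be sketched without loading any model. This is a simplified illustration, not the code in `app.py`: the function names `split_into_sentences` and `reconstruct_formatting` come from the app, but their signatures, the regex splitter, and the length thresholds in `adaptive_batch_size` are assumptions. Step 3 (IndicProcessor, tokenizer, model) is stood in for by a `translate_batch` callable.

```python
import re
from typing import Callable


def split_into_sentences(text: str) -> tuple[list[str], list[int]]:
    """Split text into sentences and record how many came from each line."""
    sentences: list[str] = []
    counts: list[int] = []
    for paragraph in text.split("\n"):
        # Naive splitter (assumption): break after ., !, ? or the danda "।".
        parts = [s.strip() for s in re.split(r"(?<=[.!?।])\s+", paragraph) if s.strip()]
        sentences.extend(parts)
        counts.append(len(parts))  # 0 marks a blank line
    return sentences, counts


def adaptive_batch_size(sentences: list[str]) -> int:
    """Pick a batch size from 1 to 4; thresholds are illustrative guesses."""
    longest = max((len(s) for s in sentences), default=0)
    if longest > 400:
        return 1
    if longest > 200:
        return 2
    return 4


def reconstruct_formatting(translated: list[str], counts: list[int]) -> str:
    """Reassemble translated sentences into the original line structure."""
    lines: list[str] = []
    i = 0
    for n in counts:
        lines.append(" ".join(translated[i:i + n]))
        i += n
    return "\n".join(lines)


def translate_text(text: str, translate_batch: Callable[[list[str]], list[str]]) -> str:
    """Run split -> batch -> translate -> reconstruct with a pluggable translator."""
    sentences, counts = split_into_sentences(text)
    size = adaptive_batch_size(sentences)
    out: list[str] = []
    for start in range(0, len(sentences), size):
        out.extend(translate_batch(sentences[start:start + size]))
    return reconstruct_formatting(out, counts)
```

Passing a stub such as `lambda batch: [s.upper() for s in batch]` as `translate_batch` is enough to check that blank lines and paragraph boundaries survive the round trip.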

### Key Dependencies
- `IndicTransToolkit` (from GitHub: VarunGumma/IndicTransToolkit) — preprocessing/postprocessing for IndicTrans2
- `transformers` — model loading and inference
- `gradio` — web UI
- `spaces` — HuggingFace Spaces GPU allocation (`@spaces.GPU` decorator)
- `torch` — GPU inference with bfloat16/float16 optimization

## Running Locally

```bash
pip install -r requirements.txt
python app.py
```

A CUDA GPU is needed for reasonable performance; without one, the app falls back to CPU inference in float32. Set the `USERNAME` and `PASSWORD` environment variables to define the login credentials.
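
As a rough picture of how those two variables feed the session-based login, here is a minimal sketch. The function names (`login`, `is_authenticated`, `logout`), the token format, and the constant-time comparison are assumptions for illustration. Only the environment variable names and the in-memory set come from the app itself.

```python
import hmac
import os
import secrets
from typing import Optional

# In-memory session store: all sessions are lost when the process restarts.
active_sessions: set[str] = set()


def login(username: str, password: str) -> Optional[str]:
    """Return a new session token if the credentials match the environment."""
    expected_user = os.environ.get("USERNAME", "")
    expected_pass = os.environ.get("PASSWORD", "")
    if not expected_user or not expected_pass:
        return None  # assumption: refuse logins when credentials are unset
    # compare_digest avoids leaking match position through timing.
    user_ok = hmac.compare_digest(username.encode(), expected_user.encode())
    pass_ok = hmac.compare_digest(password.encode(), expected_pass.encode())
    if user_ok and pass_ok:
        token = secrets.token_hex(16)
        active_sessions.add(token)
        return token
    return None


def is_authenticated(token: Optional[str]) -> bool:
    """Check whether a token belongs to a live session."""
    return token is not None and token in active_sessions


def logout(token: str) -> None:
    """Drop the session; unknown tokens are ignored."""
    active_sessions.discard(token)
```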

## Deployment

Deployed as a HuggingFace Space. `config.yaml` specifies the Gradio SDK, T4-small GPU hardware, and small storage. Git LFS is configured in `.gitattributes` for model weight files.

## Key Design Decisions

- **No test suite**: Manual testing only. Changes should be verified by running the app locally.
- **GPU optimization**: Loads models with `device_map="auto"` (via accelerate), uses bfloat16 when the GPU supports it, applies `torch.compile` only when `device_map` is not in use, and runs generation under `torch.inference_mode()`.
- **Batch translation fallback**: If a batch fails, retries each sentence individually (lines 387-428).
- **DOCX formatting preservation**: Extracts paragraph-level and run-level formatting from source documents and applies "dominant formatting" (majority vote on bold/italic/underline, most common font) to translated output.
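
The batch fallback and the "dominant formatting" vote can both be sketched in plain Python. Neither function below is copied from `app.py`. Both names are hypothetical, runs are modeled as plain dicts rather than python-docx objects, and two behaviors are assumptions: keeping the source sentence when an individual retry also fails, and counting one vote per run rather than weighting by run length.

```python
from collections import Counter
from typing import Callable, Optional


def translate_with_fallback(
    batch: list[str],
    translate_batch: Callable[[list[str]], list[str]],
) -> list[str]:
    """Try the whole batch first; on failure, retry one sentence at a time."""
    try:
        return translate_batch(batch)
    except Exception:
        results: list[str] = []
        for sentence in batch:
            try:
                results.extend(translate_batch([sentence]))
            except Exception:
                # Assumption: pass the untranslated sentence through.
                results.append(sentence)
        return results


def dominant_formatting(runs: list[dict]) -> dict:
    """Majority vote on bold/italic/underline plus the most common font."""
    result: dict = {"bold": False, "italic": False, "underline": False, "font": None}
    if not runs:
        return result
    for flag in ("bold", "italic", "underline"):
        votes = sum(1 for r in runs if r.get(flag))
        result[flag] = votes * 2 > len(runs)  # strict majority of runs
    fonts = Counter(r["font"] for r in runs if r.get("font"))
    font: Optional[str] = fonts.most_common(1)[0][0] if fonts else None
    result["font"] = font
    return result
```

Applying one formatting decision per paragraph is a natural fit here because run boundaries in the source text rarely line up with anything in the translated sentence.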