
IndicTrans2Translator / CLAUDE.md

Justin Black

Add no-Claude-branding rule to CLAUDE.md

5eed86e about 1 month ago


	# CLAUDE.md

	This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

	## Commit Guidelines

	- Never include Claude branding (e.g. `Co-Authored-By: Claude`) in commits or anywhere in the codebase.

	## Project Overview

	IndicTrans2 Translation Tool — a Gradio web application deployed on HuggingFace Spaces that translates between English and 22+ Indian languages using the IndicTrans2 1B model (ai4bharat). Supports text translation and document translation (PDF/DOCX) with formatting preservation.

	## Architecture

	### Single-File Application
	Everything lives in `app.py` (~1022 lines). There is no build system, no test framework, and no subdirectories.

	### Core Components in app.py

	- Authentication (lines 24-50): Session-based auth using `USERNAME`/`PASSWORD` environment variables. Sessions stored in an in-memory set.
	- IndicTrans2Translator class (lines 52-437): Main translator. Loads two model pairs:
	  - `en_indic_model` (ai4bharat/indictrans2-en-indic-1B): English to Indian languages
	  - `indic_en_model` (ai4bharat/indictrans2-indic-en-1B): Indian languages to English
	  - Indic-to-Indic translation chains through English as an intermediate step
	- Language mappings (lines 439-491): `LANGUAGES` dict (display name → 2-letter code) and `LANGUAGE_SCRIPT_MAPPING` (2-letter code → IndicTrans2 language tag in FLORES style, i.e. ISO 639-3 language code plus ISO 15924 script code, like `hin_Deva`)
	- Document processing (lines 493-680): PDF extraction via PyPDF2, DOCX extraction/creation via python-docx with formatting metadata preservation
	- Translation handlers (lines 682-819): `translate_text_input()` and `translate_document()` — Gradio-facing functions decorated with `@spaces.GPU`
	- Gradio UI (lines 825-1017): Login panel, text translation tab, document translation tab, examples, logout
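The two-step language lookup and the English pivot for Indic-to-Indic pairs can be sketched as follows. This is a minimal illustration, not code from `app.py`: the dict contents are a three-language excerpt chosen for the example, and `plan_hops` is a hypothetical helper name.

```python
# Illustrative excerpts only; the real dicts in app.py cover 22+ languages.
LANGUAGES = {"English": "en", "Hindi": "hi", "Tamil": "ta"}
LANGUAGE_SCRIPT_MAPPING = {"en": "eng_Latn", "hi": "hin_Deva", "ta": "tam_Taml"}


def plan_hops(src_name, tgt_name):
    """Return the (model attribute, source tag, target tag) hops for a request.

    Hypothetical helper: it only decides which model handles each hop.
    """
    src = LANGUAGE_SCRIPT_MAPPING[LANGUAGES[src_name]]
    tgt = LANGUAGE_SCRIPT_MAPPING[LANGUAGES[tgt_name]]
    eng = LANGUAGE_SCRIPT_MAPPING["en"]

    if src == tgt:
        return []  # nothing to translate
    if src == eng:
        return [("en_indic_model", src, tgt)]
    if tgt == eng:
        return [("indic_en_model", src, tgt)]
    # Indic-to-Indic: pivot through English using both models in sequence.
    return [("indic_en_model", src, eng), ("en_indic_model", eng, tgt)]
```

For example, `plan_hops("Hindi", "Tamil")` yields two hops, first Hindi to English and then English to Tamil, so errors from the first hop carry into the second.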

	### Translation Pipeline
	1. Text split into sentences preserving paragraph structure (`split_into_sentences`)
	2. Sentences batched (adaptive batch size: 1-4 based on sentence length)
	3. IndicProcessor preprocesses batches → tokenizer encodes → model generates → tokenizer decodes → IndicProcessor postprocesses
	4. Paragraph structure reconstructed (`reconstruct_formatting`)
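Steps 1, 2, and 4 above can be sketched in plain Python. The function names `split_into_sentences` and `reconstruct_formatting` come from this file, but the signatures, the sentence-boundary regex, and the length thresholds below are assumptions made for illustration. `adaptive_batch_size` and `make_batches` are hypothetical helpers, and the real implementation in `app.py` may differ.

```python
import re


def split_into_sentences(text):
    """Split text into sentences and record how many each line/paragraph had."""
    sentences, counts = [], []
    for para in text.split("\n"):
        # Assumed boundary rule: split after ., !, ? or the Devanagari danda.
        parts = [s for s in re.split(r"(?<=[.!?।])\s+", para.strip()) if s]
        sentences.extend(parts)
        counts.append(len(parts))  # 0 marks an empty line
    return sentences, counts


def adaptive_batch_size(sentence):
    """Pick a batch size from 1 to 4; the thresholds are illustrative guesses."""
    if len(sentence) > 400:
        return 1
    if len(sentence) > 200:
        return 2
    return 4


def make_batches(sentences):
    """Group sentences so no batch exceeds the limit of its longest member."""
    batches, current = [], []
    for s in sentences:
        limit = min(adaptive_batch_size(x) for x in current + [s])
        if current and len(current) + 1 > limit:
            batches.append(current)
            current = []
        current.append(s)
    if current:
        batches.append(current)
    return batches


def reconstruct_formatting(translated, counts):
    """Rejoin translated sentences into the original line/paragraph layout."""
    lines, i = [], 0
    for n in counts:
        lines.append(" ".join(translated[i:i + n]))
        i += n
    return "\n".join(lines)
```

Step 3 (IndicProcessor, tokenizer, model) would run once per batch returned by `make_batches`, and its flattened output would then be passed to `reconstruct_formatting` together with `counts`.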

	### Key Dependencies
	- `IndicTransToolkit` (from GitHub: VarunGumma/IndicTransToolkit) — preprocessing/postprocessing for IndicTrans2
	- `transformers` — model loading and inference
	- `gradio` — web UI
	- `spaces` — HuggingFace Spaces GPU allocation (`@spaces.GPU` decorator)
	- `torch` — GPU inference with bfloat16/float16 optimization
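The precision choice noted for `torch` can be illustrated without importing it. This is a minimal sketch, assuming the rule is "bfloat16 when the GPU supports it, otherwise float16, and float32 on CPU"; `choose_dtype_name` is a hypothetical helper, not a function from `app.py`.

```python
def choose_dtype_name(cuda_available, bf16_supported):
    """Return the name of the torch dtype to load the models with."""
    if not cuda_available:
        return "float32"  # CPU fallback
    return "bfloat16" if bf16_supported else "float16"
```

With `torch` installed, the name could be resolved with `getattr(torch, ...)`, using `torch.cuda.is_available()` and `torch.cuda.is_bf16_supported()` as the two inputs.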

	## Running Locally

	```bash
	pip install -r requirements.txt
	python app.py
	```

	A CUDA GPU is required for reasonable performance; without one, the app falls back to CPU with float32. Set the `USERNAME` and `PASSWORD` environment variables to define the login credentials.
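How the credentials and the in-memory session set might fit together is sketched below. The environment variable names come from this file; the function names, the token format, and the constant-time comparison are assumptions for illustration, not a description of the code in `app.py`.

```python
import hmac
import os
import secrets

# In-memory session store: every session is lost when the process restarts.
active_sessions = set()


def login(username, password):
    """Return a new session token if the credentials match, else None."""
    expected_user = os.environ.get("USERNAME", "")
    expected_pass = os.environ.get("PASSWORD", "")
    if not expected_user or not expected_pass:
        return None  # assumption: refuse logins when credentials are unset
    # compare_digest avoids leaking match position through timing.
    user_ok = hmac.compare_digest(username.encode(), expected_user.encode())
    pass_ok = hmac.compare_digest(password.encode(), expected_pass.encode())
    if not (user_ok and pass_ok):
        return None
    token = secrets.token_hex(16)
    active_sessions.add(token)
    return token


def is_authenticated(token):
    """Check whether a token belongs to a live session."""
    return token in active_sessions


def logout(token):
    """Drop the session; unknown tokens are ignored."""
    active_sessions.discard(token)
```

Because the set lives in process memory, a Space restart or a wake-up from sleep logs every user out.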

	## Deployment

	Deployed as a HuggingFace Space. Configuration in `config.yaml` specifies Gradio SDK, T4-small GPU, and small storage. Git LFS is configured in `.gitattributes` for model weight files.

	## Key Design Decisions

	- No test suite: Manual testing only. Changes should be verified by running the app locally.
	- GPU optimization: Uses `device_map="auto"` with accelerate, bfloat16 when supported, `torch.compile` when not using device_map, and `torch.inference_mode()` for generation.
	- Batch translation fallback: If a batch fails, retries each sentence individually (lines 387-428).
	- DOCX formatting preservation: Extracts paragraph-level and run-level formatting from source documents and applies "dominant formatting" (majority vote on bold/italic/underline, most common font) to translated output.
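The batch fallback and the "dominant formatting" vote can be sketched together. This is a minimal sketch under stated assumptions: `translate_batch` stands in for whatever callable runs the model, a sentence that still fails on retry is passed through untranslated, and the vote counts runs rather than characters. Both function names are hypothetical, and the behavior in `app.py` may differ on each of these points.

```python
from collections import Counter


def translate_with_fallback(sentences, translate_batch):
    """Translate a batch; on failure, retry each sentence on its own."""
    try:
        return translate_batch(sentences)
    except Exception:
        results = []
        for sentence in sentences:
            try:
                results.extend(translate_batch([sentence]))
            except Exception:
                # Assumption: keep the source text rather than drop it.
                results.append(sentence)
        return results


def dominant_formatting(runs):
    """Majority vote over run-level formatting.

    runs is a list of dicts with optional keys bold, italic, underline, font.
    A flag wins only with a strict majority of runs.
    """
    total = len(runs)
    fmt = {
        key: sum(bool(r.get(key)) for r in runs) * 2 > total
        for key in ("bold", "italic", "underline")
    }
    fonts = Counter(r["font"] for r in runs if r.get("font"))
    fmt["font"] = fonts.most_common(1)[0][0] if fonts else None
    return fmt
```

Voting per run treats a one-word bold run the same as a long plain one. Weighting by character count would be a reasonable alternative if short emphasized runs skew the result.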