CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Commit Guidelines
- Never include Claude branding (e.g. `Co-Authored-By: Claude`) in commits or anywhere in the codebase.
Project Overview
IndicTrans2 Translation Tool: a Gradio web application deployed on HuggingFace Spaces that translates between English and 22+ Indian languages using the IndicTrans2 1B model (ai4bharat). Supports text translation and document translation (PDF/DOCX) with formatting preservation.
Architecture
Single-File Application
Everything lives in app.py (~1022 lines). There is no build system, no test framework, and no subdirectories.
Core Components in app.py
- Authentication (lines 24-50): Session-based auth using `USERNAME`/`PASSWORD` environment variables. Sessions stored in an in-memory set.
- IndicTrans2Translator class (lines 52-437): Main translator. Loads two model pairs:
  - `en_indic_model` (ai4bharat/indictrans2-en-indic-1B): English to Indian languages
  - `indic_en_model` (ai4bharat/indictrans2-indic-en-1B): Indian languages to English
  - Indic-to-Indic translation chains through English as an intermediate step
- Language mappings (lines 439-491): `LANGUAGES` dict (display name → 2-letter code) and `LANGUAGE_SCRIPT_MAPPING` (2-letter code → IndicTrans2 language-script tag in FLORES-200 style, such as `hin_Deva`)
- Document processing (lines 493-680): PDF extraction via PyPDF2, DOCX extraction/creation via python-docx with formatting metadata preservation
- Translation handlers (lines 682-819): `translate_text_input()` and `translate_document()`, the Gradio-facing functions decorated with `@spaces.GPU`
- Gradio UI (lines 825-1017): Login panel, text translation tab, document translation tab, examples, logout
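The routing between the two models can be sketched as follows. This is a minimal illustration, not the code in app.py: `plan_translation` is a hypothetical helper, and the two dicts hold only a three-language subset chosen for the example.

```python
# Illustrative subset only; the real LANGUAGES and LANGUAGE_SCRIPT_MAPPING
# dicts in app.py cover many more languages and may differ in detail.
LANGUAGES = {"English": "en", "Hindi": "hi", "Tamil": "ta"}
LANGUAGE_SCRIPT_MAPPING = {"en": "eng_Latn", "hi": "hin_Deva", "ta": "tam_Taml"}


def plan_translation(src_name: str, tgt_name: str) -> list:
    """Return the (model_key, src_tag, tgt_tag) hops needed for a translation.

    Hypothetical helper: it only shows how display names map to tags and
    how an Indic-to-Indic request is chained through English.
    """
    src = LANGUAGE_SCRIPT_MAPPING[LANGUAGES[src_name]]
    tgt = LANGUAGE_SCRIPT_MAPPING[LANGUAGES[tgt_name]]
    eng = LANGUAGE_SCRIPT_MAPPING["en"]
    if src == tgt:
        return []  # same language on both sides: nothing to translate
    if src == eng:
        return [("en_indic_model", src, tgt)]
    if tgt == eng:
        return [("indic_en_model", src, tgt)]
    # Indic-to-Indic: pivot through English, using both models in sequence.
    return [("indic_en_model", src, eng), ("en_indic_model", eng, tgt)]
```

For example, `plan_translation("Hindi", "Tamil")` yields two hops, first through `indic_en_model` and then through `en_indic_model`.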
Translation Pipeline
- Text split into sentences preserving paragraph structure (`split_into_sentences`)
- Sentences batched (adaptive batch size: 1-4 based on sentence length)
- IndicProcessor preprocesses batches → tokenizer encodes → model generates → tokenizer decodes → IndicProcessor postprocesses
- Paragraph structure reconstructed (`reconstruct_formatting`)
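The steps above can be sketched without the model, as a rough outline only. The names `split_into_sentences` and `reconstruct_formatting` come from app.py, but the bodies here are assumptions. The same goes for the length thresholds, for `make_batches` and `translate_text`, and for the `translate_batch` callable standing in for the IndicProcessor, tokenizer and model steps.

```python
import re
from typing import Callable


def split_into_sentences(text: str) -> list:
    """Split text into paragraphs (one per line), then into sentences."""
    # Break after ., !, ? or the danda when followed by whitespace.
    # Blank lines become empty lists so they survive reconstruction.
    return [
        [s for s in re.split(r"(?<=[.!?।])\s+", p.strip()) if s]
        for p in text.split("\n")
    ]


def make_batches(sentences: list, long_chars: int = 200) -> list:
    """Group sentences into batches of 1 to 4, smaller for long sentences."""
    batches, i = [], 0
    while i < len(sentences):
        length = len(sentences[i])
        # Assumed thresholds; the real cut-offs in app.py may differ.
        if length > long_chars:
            size = 1
        elif length > long_chars // 2:
            size = 2
        else:
            size = 4
        batches.append(sentences[i:i + size])
        i += size
    return batches


def reconstruct_formatting(paragraphs: list) -> str:
    """Join sentences with spaces and paragraphs with newlines."""
    return "\n".join(" ".join(p) for p in paragraphs)


def translate_text(text: str, translate_batch: Callable) -> str:
    """Run split, batch, translate, reconstruct with a pluggable translator."""
    paragraphs = split_into_sentences(text)
    flat = [s for p in paragraphs for s in p]
    translated = []
    for batch in make_batches(flat):
        translated.extend(translate_batch(batch))
    # Hand translated sentences back to paragraphs in their original order.
    it = iter(translated)
    return reconstruct_formatting([[next(it) for _ in p] for p in paragraphs])
```

Passing a stub such as `lambda batch: [s.upper() for s in batch]` as `translate_batch` makes it possible to check paragraph handling without loading any model.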
Key Dependencies
- `IndicTransToolkit` (from GitHub: VarunGumma/IndicTransToolkit): preprocessing/postprocessing for IndicTrans2
- `transformers`: model loading and inference
- `gradio`: web UI
- `spaces`: HuggingFace Spaces GPU allocation (`@spaces.GPU` decorator)
- `torch`: GPU inference with bfloat16/float16 optimization
Running Locally
pip install -r requirements.txt
python app.py
Requires a CUDA GPU for reasonable performance (falls back to CPU with float32). Set the `USERNAME` and `PASSWORD` environment variables to the authentication credentials.
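A minimal sketch of how these two environment variables could back the in-memory session set. The function names `login`, `is_authenticated` and `logout`, the random-token scheme and the constant-time comparison are all assumptions for illustration; the actual authentication code in app.py may be organised differently.

```python
import hmac
import os
import secrets
from typing import Optional

# In-memory session store: every token is lost when the process restarts.
_sessions = set()


def login(username: str, password: str) -> Optional[str]:
    """Return a new session token on success, or None on failure."""
    expected_user = os.environ.get("USERNAME")
    expected_pass = os.environ.get("PASSWORD")
    if not expected_user or not expected_pass:
        return None  # refuse all logins when credentials are not configured
    # compare_digest avoids leaking information through comparison timing.
    user_ok = hmac.compare_digest(username.encode(), expected_user.encode())
    pass_ok = hmac.compare_digest(password.encode(), expected_pass.encode())
    if not (user_ok and pass_ok):
        return None
    token = secrets.token_hex(16)
    _sessions.add(token)
    return token


def is_authenticated(token: Optional[str]) -> bool:
    """Check whether a token belongs to a live session."""
    return token in _sessions


def logout(token: str) -> None:
    """Drop the session; unknown tokens are ignored."""
    _sessions.discard(token)
```

Because the set lives in process memory, restarting the app (or the Space going to sleep) logs every user out.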
Deployment
Deployed as a HuggingFace Space. Configuration in config.yaml specifies the Gradio SDK, a T4-small GPU, and small storage. Git LFS is configured in .gitattributes for model weight files.
Key Design Decisions
- No test suite: Manual testing only. Changes should be verified by running the app locally.
- GPU optimization: Uses `device_map="auto"` with accelerate, bfloat16 when supported, `torch.compile` when not using device_map, and `torch.inference_mode()` for generation.
- Batch translation fallback: If a batch fails, retries each sentence individually (lines 387-428).
- DOCX formatting preservation: Extracts paragraph-level and run-level formatting from source documents and applies "dominant formatting" (majority vote on bold/italic/underline, most common font) to translated output.
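The last two decisions can be illustrated with a short sketch. Both functions are hypothetical stand-ins rather than the code in app.py: the run dictionaries are an assumed simplification of python-docx run metadata, keeping the source sentence when a single-sentence retry also fails is an assumption, and the vote counts each run equally regardless of its length.

```python
from collections import Counter
from typing import Callable


def translate_with_fallback(batch: list, translate_batch: Callable) -> list:
    """Translate a batch; if it fails, retry each sentence on its own."""
    try:
        return translate_batch(batch)
    except Exception:
        results = []
        for sentence in batch:
            try:
                results.extend(translate_batch([sentence]))
            except Exception:
                # Assumed behaviour: keep the untranslated sentence so the
                # output stays aligned with the input.
                results.append(sentence)
        return results


def dominant_formatting(runs: list) -> dict:
    """Majority vote over run-level formatting.

    Each run is a dict with optional boolean 'bold', 'italic', 'underline'
    keys and an optional 'font' name.
    """
    result = {}
    for key in ("bold", "italic", "underline"):
        votes = sum(1 for r in runs if r.get(key))
        # A flag wins only with a strict majority of runs.
        result[key] = votes * 2 > len(runs)
    fonts = Counter(r["font"] for r in runs if r.get("font"))
    result["font"] = fonts.most_common(1)[0][0] if fonts else None
    return result
```

A single dominant style is applied because translated text rarely aligns word for word with the source runs, so per-run formatting cannot be carried over directly.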