Justin Black

Add no-Claude-branding rule to CLAUDE.md

5eed86e about 1 month ago

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Commit Guidelines

Never include Claude branding (e.g. Co-Authored-By: Claude) in commits or anywhere in the codebase.

Project Overview

IndicTrans2 Translation Tool — a Gradio web application deployed on HuggingFace Spaces that translates between English and 22+ Indian languages using the IndicTrans2 1B model (ai4bharat). Supports text translation and document translation (PDF/DOCX) with formatting preservation.

Architecture

Single-File Application

Everything lives in app.py (~1022 lines). There is no build system, no test framework, and no subdirectories.

Core Components in app.py

Authentication (lines 24-50): Session-based auth using USERNAME/PASSWORD environment variables. Sessions stored in an in-memory set.
IndicTrans2Translator class (lines 52-437): Main translator. Loads two models, one per translation direction, and chains them for Indic-to-Indic translation:
- en_indic_model (ai4bharat/indictrans2-en-indic-1B): English to Indian languages
- indic_en_model (ai4bharat/indictrans2-indic-en-1B): Indian languages to English
- Indic-to-Indic translation runs through both models, using English as the intermediate step
Language mappings (lines 439-491): LANGUAGES dict (display name → 2-letter code) and LANGUAGE_SCRIPT_MAPPING (2-letter code → IndicTrans2's FLORES-style language_Script tag, e.g. hin_Deva)
Document processing (lines 493-680): PDF extraction via PyPDF2, DOCX extraction/creation via python-docx with formatting metadata preservation
Translation handlers (lines 682-819): translate_text_input() and translate_document() — Gradio-facing functions decorated with @spaces.GPU
Gradio UI (lines 825-1017): Login panel, text translation tab, document translation tab, examples, logout
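
How the two mappings and the English pivot fit together can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in app.py: the three-entry dictionary excerpts, the `plan_hops` helper, and its return shape are hypothetical. Only the dictionary names, the model IDs, and the `hin_Deva` tag style come from the component list above.

```python
# Hypothetical excerpts; the real dictionaries in app.py cover 22+ languages.
LANGUAGES = {"English": "en", "Hindi": "hi", "Tamil": "ta"}
LANGUAGE_SCRIPT_MAPPING = {"en": "eng_Latn", "hi": "hin_Deva", "ta": "tam_Taml"}

EN_INDIC = "ai4bharat/indictrans2-en-indic-1B"
INDIC_EN = "ai4bharat/indictrans2-indic-en-1B"


def plan_hops(source_name, target_name):
    """Return the (model_id, src_tag, tgt_tag) hops for one translation.

    plan_hops is a hypothetical helper used only for illustration.
    """
    src = LANGUAGE_SCRIPT_MAPPING[LANGUAGES[source_name]]
    tgt = LANGUAGE_SCRIPT_MAPPING[LANGUAGES[target_name]]
    eng = LANGUAGE_SCRIPT_MAPPING["en"]

    if src == tgt:
        return []  # same language on both sides: nothing to translate
    if src == eng:
        return [(EN_INDIC, src, tgt)]
    if tgt == eng:
        return [(INDIC_EN, src, tgt)]
    # Indic-to-Indic: go through English, using both models in sequence.
    return [(INDIC_EN, src, eng), (EN_INDIC, eng, tgt)]
```

A Hindi-to-Tamil request therefore costs two model passes, which is worth keeping in mind when reasoning about latency for Indic-to-Indic pairs.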

Translation Pipeline

1. Text split into sentences preserving paragraph structure (split_into_sentences)
2. Sentences batched (adaptive batch size: 1-4 based on sentence length)
3. IndicProcessor preprocesses batches → tokenizer encodes → model generates → tokenizer decodes → IndicProcessor postprocesses
4. Paragraph structure reconstructed (reconstruct_formatting)
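
The adaptive batching step might look roughly like the sketch below. The character thresholds (100 and 300) and the function names are assumptions made for illustration. The only thing taken from the description above is that batch size ranges from 1 to 4 and shrinks as sentences get longer.

```python
def batch_limit(longest_chars, short_chars=100, long_chars=300):
    """Map the longest sentence in a batch to a maximum batch size (1-4).

    The thresholds are hypothetical; app.py may use different cut-offs.
    """
    if longest_chars > long_chars:
        return 1
    if longest_chars > short_chars:
        return 2
    return 4


def adaptive_batches(sentences):
    """Greedily group sentences, closing a batch before it exceeds its limit."""
    batches, current = [], []
    for sentence in sentences:
        candidate = current + [sentence]
        limit = batch_limit(max(len(s) for s in candidate))
        if current and len(candidate) > limit:
            # Adding this sentence would overfill the batch: flush and restart.
            batches.append(current)
            current = [sentence]
        else:
            current = candidate
    if current:
        batches.append(current)
    return batches
```

Sizing by the longest member rather than the first keeps one very long sentence from being padded alongside three short ones, which is the usual reason for shrinking batches on a memory-limited GPU.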

Key Dependencies

IndicTransToolkit (from GitHub: VarunGumma/IndicTransToolkit) — preprocessing/postprocessing for IndicTrans2
transformers — model loading and inference
gradio — web UI
spaces — HuggingFace Spaces GPU allocation (@spaces.GPU decorator)
torch — GPU inference with bfloat16/float16 optimization
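
As a rough illustration of the precision choice the torch entry refers to, the sketch below assumes the rule is bfloat16 where the GPU supports it, float16 on other CUDA GPUs, and float32 on CPU. The `choose_precision` helper is hypothetical and takes the hardware flags as arguments so that it runs without torch installed.

```python
def choose_precision(cuda_available, bf16_supported):
    """Return (device, dtype_name) for model loading.

    Hypothetical helper: the selection rule is an assumption based on this
    file's description, not a copy of the logic in app.py.
    """
    if not cuda_available:
        return "cpu", "float32"  # CPU fallback keeps full precision
    if bf16_supported:
        return "cuda", "bfloat16"
    return "cuda", "float16"
```

With torch installed, the flags would come from `torch.cuda.is_available()` and `torch.cuda.is_bf16_supported()`, and the returned name maps to a dtype via `getattr(torch, dtype_name)`.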

Running Locally

pip install -r requirements.txt
python app.py

A CUDA GPU is needed for reasonable performance; without one the app falls back to CPU with float32. Set the USERNAME and PASSWORD environment variables to the login credentials.
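
For orientation, session-based auth against those two environment variables with an in-memory set could be sketched as below. The function names, the token format, and the constant-time comparison are assumptions for illustration; app.py may implement the same idea differently.

```python
import hmac
import os
import secrets

# In-memory session store: all sessions are lost when the process restarts.
_sessions = set()


def login(username, password):
    """Return a new session token on success, or None on failure."""
    expected_user = os.environ.get("USERNAME", "")
    expected_pass = os.environ.get("PASSWORD", "")
    if not expected_user or not expected_pass:
        return None  # assumption: refuse logins when credentials are unset
    user_ok = hmac.compare_digest(username.encode(), expected_user.encode())
    pass_ok = hmac.compare_digest(password.encode(), expected_pass.encode())
    if not (user_ok and pass_ok):
        return None
    token = secrets.token_urlsafe(32)
    _sessions.add(token)
    return token


def is_authenticated(token):
    """Check whether a token belongs to a live session."""
    return token in _sessions


def logout(token):
    """Drop a session; unknown tokens are ignored."""
    _sessions.discard(token)
```

Because the set lives in process memory, a Space that goes to sleep and restarts will log every user out, and multiple workers would not share sessions.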

Deployment

Deployed as a HuggingFace Space. config.yaml specifies the Gradio SDK, a T4-small GPU, and small storage. Git LFS is configured in .gitattributes for model weight files.

Key Design Decisions

No test suite: Manual testing only. Changes should be verified by running the app locally.
GPU optimization: Uses device_map="auto" with accelerate, bfloat16 when supported, torch.compile when not using device_map, and torch.inference_mode() for generation.
Batch translation fallback: If a batch fails, retries each sentence individually (lines 387-428).
DOCX formatting preservation: Extracts paragraph-level and run-level formatting from source documents and applies "dominant formatting" (majority vote on bold/italic/underline, most common font) to translated output.
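
Two of the decisions above, the per-sentence fallback and the dominant-formatting vote, can be sketched in plain Python. Everything here is illustrative: the function names, the dict-based run representation, the choice to keep the source sentence when a retry also fails, and the one-vote-per-run rule are assumptions, not details confirmed by app.py.

```python
from collections import Counter


def translate_with_fallback(batch, translate_batch):
    """Translate a batch; if it fails, retry each sentence individually.

    translate_batch is any callable taking and returning a list of strings.
    """
    try:
        return translate_batch(batch)
    except Exception:
        results = []
        for sentence in batch:
            try:
                results.extend(translate_batch([sentence]))
            except Exception:
                # Assumption: keep the untranslated source text as a last resort.
                results.append(sentence)
        return results


def dominant_formatting(runs):
    """Majority vote on bold/italic/underline plus the most common font.

    Each run is a dict such as {"bold": True, "font": "Arial"}; every run
    counts once (assumption) and a tie resolves to False.
    """
    total = len(runs)
    result = {
        key: sum(bool(run.get(key)) for run in runs) * 2 > total
        for key in ("bold", "italic", "underline")
    }
    fonts = Counter(run["font"] for run in runs if run.get("font"))
    result["font"] = fonts.most_common(1)[0][0] if fonts else None
    return result
```

A paragraph-wide vote is a pragmatic choice because translated text rarely aligns word-for-word with the source, so run boundaries cannot be carried over directly.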