Justin Black
Add no-Claude-branding rule to CLAUDE.md
5eed86e

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Commit Guidelines

  • Never include Claude branding (e.g. Co-Authored-By: Claude) in commits or anywhere in the codebase.

Project Overview

IndicTrans2 Translation Tool: a Gradio web application deployed on HuggingFace Spaces that translates between English and 22+ Indian languages using the IndicTrans2 1B model (ai4bharat). Supports text translation and document translation (PDF/DOCX) with formatting preservation.

Architecture

Single-File Application

Everything lives in app.py (~1022 lines). There is no build system, no test framework, and no subdirectories.

Core Components in app.py

  • Authentication (lines 24-50): Session-based auth using USERNAME/PASSWORD environment variables. Sessions stored in an in-memory set.
  • IndicTrans2Translator class (lines 52-437): Main translator. Loads two model pairs:
    • en_indic_model (ai4bharat/indictrans2-en-indic-1B): English to Indian languages
    • indic_en_model (ai4bharat/indictrans2-indic-en-1B): Indian languages to English
    • Indic-to-Indic translation chains through English as an intermediate step
  • Language mappings (lines 439-491): LANGUAGES dict (display name → 2-letter code) and LANGUAGE_SCRIPT_MAPPING (2-letter code → IndicTrans2 language tag, a FLORES-style language_Script code such as hin_Deva)
  • Document processing (lines 493-680): PDF extraction via PyPDF2, DOCX extraction/creation via python-docx with formatting metadata preservation
  • Translation handlers (lines 682-819): translate_text_input() and translate_document(), the Gradio-facing functions decorated with @spaces.GPU
  • Gradio UI (lines 825-1017): Login panel, text translation tab, document translation tab, examples, logout
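
The language mappings and the English pivot described above can be sketched as a small routing function. This is an illustration only: `plan_translation` is a hypothetical helper that does not exist in app.py, and the two dicts below are a three-language subset whose exact contents in app.py may differ.

```python
# Illustrative subset only; the real dicts in app.py cover 22+ languages
# and may differ in detail.
LANGUAGES = {"English": "en", "Hindi": "hi", "Tamil": "ta"}
LANGUAGE_SCRIPT_MAPPING = {"en": "eng_Latn", "hi": "hin_Deva", "ta": "tam_Taml"}


def plan_translation(src_name: str, tgt_name: str) -> list[tuple[str, str, str]]:
    """Return the (model, src_tag, tgt_tag) hops needed for one translation.

    Hypothetical helper: it only names which model each hop would use.
    """
    src = LANGUAGE_SCRIPT_MAPPING[LANGUAGES[src_name]]
    tgt = LANGUAGE_SCRIPT_MAPPING[LANGUAGES[tgt_name]]
    eng = LANGUAGE_SCRIPT_MAPPING["en"]

    if src == tgt:
        return []  # nothing to translate
    if src == eng:
        return [("en_indic_model", src, tgt)]
    if tgt == eng:
        return [("indic_en_model", src, tgt)]
    # Indic-to-Indic: pivot through English, using both models in turn.
    return [("indic_en_model", src, eng), ("en_indic_model", eng, tgt)]
```

For example, `plan_translation("Hindi", "Tamil")` yields two hops, so an Indic-to-Indic request costs roughly two model passes.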

Translation Pipeline

  1. Text split into sentences preserving paragraph structure (split_into_sentences)
  2. Sentences batched (adaptive batch size: 1-4 based on sentence length)
  3. IndicProcessor preprocesses batches → tokenizer encodes → model generates → tokenizer decodes → IndicProcessor postprocesses
  4. Paragraph structure reconstructed (reconstruct_formatting)
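
The four steps above can be sketched without the model by passing in a stand-in for step 3. The function names `split_into_sentences` and `reconstruct_formatting` come from app.py, but their signatures, the sentence-boundary regex, and the length thresholds below are assumptions made for illustration. `batch_size_for`, `make_batches`, and `translate_text` are hypothetical names.

```python
import re


def split_into_sentences(text: str) -> list[list[str]]:
    """Step 1: one list of sentences per line, so blank lines survive."""
    paragraphs = []
    for para in text.split("\n"):
        # Naive boundary: whitespace after ., !, ? or the Devanagari danda.
        parts = re.split(r"(?<=[.!?\u0964])\s+", para.strip())
        paragraphs.append([s for s in parts if s])
    return paragraphs


def batch_size_for(sentence: str) -> int:
    """Step 2: longer sentences get smaller batches (thresholds assumed)."""
    if len(sentence) > 400:
        return 1
    if len(sentence) > 200:
        return 2
    return 4


def make_batches(sentences: list[str]) -> list[list[str]]:
    """Group sentences so no batch exceeds the limit of its longest member."""
    batches, current = [], []
    for sentence in sentences:
        limit = min(batch_size_for(s) for s in current + [sentence])
        if current and len(current) >= limit:
            batches.append(current)
            current = []
        current.append(sentence)
    if current:
        batches.append(current)
    return batches


def reconstruct_formatting(paragraphs: list[list[str]]) -> str:
    """Step 4: rejoin sentences with spaces and paragraphs with newlines."""
    return "\n".join(" ".join(p) for p in paragraphs)


def translate_text(text: str, translate_batch) -> str:
    """Run steps 1-4; translate_batch stands in for the model call (step 3)."""
    paragraphs = split_into_sentences(text)
    flat = [s for p in paragraphs for s in p]
    translated = []
    for batch in make_batches(flat):
        translated.extend(translate_batch(batch))
    it = iter(translated)
    rebuilt = [[next(it) for _ in p] for p in paragraphs]
    return reconstruct_formatting(rebuilt)
```

Passing a trivial callable such as `lambda batch: [s.upper() for s in batch]` as `translate_batch` is enough to check that paragraph breaks survive the round trip.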

Key Dependencies

  • IndicTransToolkit (from GitHub: VarunGumma/IndicTransToolkit): preprocessing/postprocessing for IndicTrans2
  • transformers β€” model loading and inference
  • gradio β€” web UI
  • spaces β€” HuggingFace Spaces GPU allocation (@spaces.GPU decorator)
  • torch β€” GPU inference with bfloat16/float16 optimization

Running Locally

pip install -r requirements.txt
python app.py

A CUDA GPU is needed for reasonable performance; without one the app falls back to CPU with float32. Set the USERNAME and PASSWORD environment variables to define the authentication credentials.
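
A minimal sketch of how credentials from those two environment variables could back a session check with an in-memory set. The function names, the token format, and the constant-time comparison are assumptions for illustration and are not taken from app.py.

```python
import hmac
import os
import secrets
from typing import Optional

# In-memory store, so every session is lost when the process restarts.
active_sessions: set = set()


def login(username: str, password: str) -> Optional[str]:
    """Return a new session token on success, or None on failure."""
    expected_user = os.environ.get("USERNAME", "")
    expected_pass = os.environ.get("PASSWORD", "")
    if not expected_user or not expected_pass:
        return None  # refuse all logins when credentials are not configured

    # compare_digest avoids leaking match position through timing.
    user_ok = hmac.compare_digest(username.encode(), expected_user.encode())
    pass_ok = hmac.compare_digest(password.encode(), expected_pass.encode())
    if not (user_ok and pass_ok):
        return None

    token = secrets.token_hex(16)
    active_sessions.add(token)
    return token


def is_authenticated(token: Optional[str]) -> bool:
    """True only for tokens issued by login() and not yet logged out."""
    return token is not None and token in active_sessions


def logout(token: str) -> None:
    """Drop the session; unknown tokens are ignored."""
    active_sessions.discard(token)
```

Because the set lives in process memory, a restart of the Space logs everyone out, which matches the in-memory session storage noted under Core Components.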

Deployment

Deployed as a HuggingFace Space. Configuration in config.yaml specifies the Gradio SDK, a T4-small GPU, and small storage. Git LFS is configured in .gitattributes for model weight files.

Key Design Decisions

  • No test suite: Manual testing only. Changes should be verified by running the app locally.
  • GPU optimization: Uses device_map="auto" with accelerate, bfloat16 when supported, torch.compile when not using device_map, and torch.inference_mode() for generation.
  • Batch translation fallback: If a batch fails, retries each sentence individually (lines 387-428).
  • DOCX formatting preservation: Extracts paragraph-level and run-level formatting from source documents and applies "dominant formatting" (majority vote on bold/italic/underline, most common font) to translated output.
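
Two of the decisions above, the per-sentence fallback and the "dominant formatting" vote, can be sketched without the model or python-docx. Both functions are hypothetical illustrations: the run dictionaries, the one-vote-per-run rule, and returning the source sentence when a single-sentence retry also fails are assumptions, not behavior confirmed from app.py.

```python
from collections import Counter


def translate_with_fallback(sentences, translate_batch):
    """Try the whole batch first, then retry sentence by sentence."""
    try:
        return translate_batch(sentences)
    except Exception:
        results = []
        for sentence in sentences:
            try:
                results.extend(translate_batch([sentence]))
            except Exception:
                # Assumption: keep the untranslated text rather than drop it.
                results.append(sentence)
        return results


def dominant_formatting(runs):
    """Pick one format for a paragraph by majority vote across its runs.

    Each run is a dict such as {"bold": True, "font": "Arial"}; every run
    counts as one vote regardless of its text length (an assumption).
    """
    if not runs:
        return {}
    majority = len(runs) / 2
    fonts = Counter(r["font"] for r in runs if r.get("font"))
    return {
        "bold": sum(bool(r.get("bold")) for r in runs) > majority,
        "italic": sum(bool(r.get("italic")) for r in runs) > majority,
        "underline": sum(bool(r.get("underline")) for r in runs) > majority,
        "font": fonts.most_common(1)[0][0] if fonts else None,
    }
```

A vote is needed because translated text cannot be aligned word by word to the source runs, so one format is applied to the whole translated paragraph.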