Justin Black committed
Commit d1f8e44 · 0 parent(s)

Initial commit: IndicTrans2 Translation Tool


Gradio-based translation app supporting 22+ Indian languages using the IndicTrans2 1B model. Includes text translation, document translation (PDF/DOCX) with formatting preservation, and session-based authentication. Configured for HuggingFace Spaces deployment.

Files changed (7)
  1. .gitattributes +35 -0
  2. .gitignore +12 -0
  3. CLAUDE.md +57 -0
  4. README.md +83 -0
  5. app.py +1023 -0
  6. config.yaml +12 -0
  7. requirements.txt +13 -0
.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,12 @@
+ __pycache__/
+ *.py[cod]
+ *.egg-info/
+ dist/
+ build/
+ .env
+ *.pth
+ *.bin
+ *.onnx
+ *.safetensors
+ .venv/
+ venv/
CLAUDE.md ADDED
@@ -0,0 +1,57 @@
+ # CLAUDE.md
+
+ This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+ ## Project Overview
+
+ IndicTrans2 Translation Tool — a Gradio web application deployed on HuggingFace Spaces that translates between English and 22+ Indian languages using the IndicTrans2 1B model (ai4bharat). Supports text translation and document translation (PDF/DOCX) with formatting preservation.
+
+ ## Architecture
+
+ ### Single-File Application
+ Everything lives in `app.py` (~1022 lines). There is no build system, no test framework, and no subdirectories.
+
+ ### Core Components in app.py
+
+ - **Authentication** (lines 24-50): Session-based auth using `USERNAME`/`PASSWORD` environment variables. Sessions stored in an in-memory set.
+ - **IndicTrans2Translator class** (lines 52-437): Main translator. Loads two model pairs:
+   - `en_indic_model` (ai4bharat/indictrans2-en-indic-1B) — English to Indian languages
+   - `indic_en_model` (ai4bharat/indictrans2-indic-en-1B) — Indian languages to English
+   - Indic-to-Indic translation chains through English as an intermediate step
+ - **Language mappings** (lines 439-491): `LANGUAGES` dict (display name → short language code) and `LANGUAGE_SCRIPT_MAPPING` (short code → IndicTrans2 language-script tag like `hin_Deva`)
+ - **Document processing** (lines 493-680): PDF extraction via PyPDF2, DOCX extraction/creation via python-docx with formatting metadata preservation
+ - **Translation handlers** (lines 682-819): `translate_text_input()` and `translate_document()` — Gradio-facing functions decorated with `@spaces.GPU`
+ - **Gradio UI** (lines 825-1017): Login panel, text translation tab, document translation tab, examples, logout
+
+ ### Translation Pipeline
+ 1. Text split into sentences preserving paragraph structure (`split_into_sentences`)
+ 2. Sentences batched (adaptive batch size: 1-4 based on sentence length)
+ 3. IndicProcessor preprocesses batches → tokenizer encodes → model generates → tokenizer decodes → IndicProcessor postprocesses
+ 4. Paragraph structure reconstructed (`reconstruct_formatting`)
+
+ ### Key Dependencies
+ - `IndicTransToolkit` (from GitHub: VarunGumma/IndicTransToolkit) — preprocessing/postprocessing for IndicTrans2
+ - `transformers` — model loading and inference
+ - `gradio` — web UI
+ - `spaces` — HuggingFace Spaces GPU allocation (`@spaces.GPU` decorator)
+ - `torch` — GPU inference with bfloat16/float16 optimization
+
+ ## Running Locally
+
+ ```bash
+ pip install -r requirements.txt
+ python app.py
+ ```
+
+ Requires a CUDA GPU for reasonable performance (falls back to CPU with float32). Set environment variables `USERNAME` and `PASSWORD` for authentication credentials.
+
+ ## Deployment
+
+ Deployed as a HuggingFace Space. Configuration in `config.yaml` specifies Gradio SDK, T4-small GPU, and small storage. Git LFS configured in `.gitattributes` for model weight files.
+
+ ## Key Design Decisions
+
+ - **No test suite**: Manual testing only. Changes should be verified by running the app locally.
+ - **GPU optimization**: Uses `device_map="auto"` with accelerate, bfloat16 when supported, `torch.compile` when not using device_map, and `torch.inference_mode()` for generation.
+ - **Batch translation fallback**: If a batch fails, retries each sentence individually (lines 387-428).
+ - **DOCX formatting preservation**: Extracts paragraph-level and run-level formatting from source documents and applies "dominant formatting" (majority vote on bold/italic/underline, most common font) to translated output.
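The four-stage translation pipeline described in CLAUDE.md (split → batch → translate → reconstruct) can be sketched as a minimal, self-contained round-trip. `translate_batch` here is a hypothetical stand-in for the IndicProcessor → tokenizer → `model.generate` step; this is an illustration, not the app's exact code.

```python
# Minimal sketch of the split -> batch -> translate -> reconstruct pipeline.
# translate_batch is a stand-in for the real model call.
import re

def split_into_sentences(text):
    """Split text into sentences, remembering paragraph boundaries."""
    sentences, markers = [], []
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    for p_idx, para in enumerate(paragraphs):
        parts = [s for s in re.split(r"(?<=[.!?])\s+", para.strip()) if s.strip()]
        for s_idx, sent in enumerate(parts):
            sentences.append(sent)
            # True when this sentence ends a paragraph (except the last one)
            markers.append(s_idx == len(parts) - 1 and p_idx < len(paragraphs) - 1)
    return sentences, markers

def reconstruct(translations, markers):
    """Re-join translated sentences, restoring paragraph breaks."""
    out = []
    for i, (t, para_end) in enumerate(zip(translations, markers)):
        out.append(t)
        if para_end:
            out.append("\n\n")
        elif i < len(translations) - 1:
            out.append(" ")
    return "".join(out)

def translate_pipeline(text, translate_batch, batch_size=4):
    sentences, markers = split_into_sentences(text)
    translated = []
    for i in range(0, len(sentences), batch_size):
        translated.extend(translate_batch(sentences[i:i + batch_size]))
    return reconstruct(translated, markers)

# An identity "translation" round-trips the paragraph structure unchanged.
print(translate_pipeline("One. Two.\n\nThree.", lambda batch: batch))
```

With an identity batch function the input comes back unchanged, which makes the structure-preservation property easy to verify without loading any model.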
README.md ADDED
@@ -0,0 +1,83 @@
+ ---
+ title: IndicTrans2Translator
+ emoji: 🔥
+ colorFrom: red
+ colorTo: red
+ sdk: gradio
+ sdk_version: 5.35.0
+ app_file: app.py
+ pinned: false
+ short_description: IndicTrans2 1B Model Translation
+ ---
+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+
+
+ # IndicTrans2 Translation Tool
+
+ A comprehensive translation application using the IndicTrans2 1B model for translating between English and Indian languages.
+
+ ## Features
+
+ - **Text Translation**: Direct text input and output
+ - **Document Translation**: Upload PDF/DOCX files and download translated documents
+ - **Multi-language Support**: 22+ Indian languages including Hindi, Bengali, Tamil, Telugu, and more
+ - **User-friendly Interface**: Clean, intuitive design with tabbed interface
+
+ ## Supported Languages
+
+ - English (en)
+ - Assamese (asm)
+ - Bengali (ben)
+ - Bodo (brx)
+ - Dogri (doi)
+ - Gujarati (guj)
+ - Hindi (hin)
+ - Kannada (kan)
+ - Kashmiri (kas)
+ - Konkani (gom)
+ - Maithili (mai)
+ - Malayalam (mal)
+ - Manipuri (mni)
+ - Marathi (mar)
+ - Nepali (nep)
+ - Oriya (ory)
+ - Punjabi (pan)
+ - Sanskrit (san)
+ - Santali (sat)
+ - Sindhi (snd)
+ - Tamil (tam)
+ - Telugu (tel)
+ - Urdu (urd)
+
+ ## Usage
+
+ ### Text Translation
+ 1. Select the "Text Translation" tab
+ 2. Enter or paste your text in the input box
+ 3. Choose source and target languages
+ 4. Click "Translate Text"
+ 5. View the translated text in the output box
+
+ ### Document Translation
+ 1. Select the "Document Translation" tab
+ 2. Upload a PDF or DOCX file
+ 3. Choose source and target languages
+ 4. Click "Translate Document"
+ 5. Download the translated document when ready
+
+ ## Technical Details
+
+ - **Models**: ai4bharat/indictrans2-en-indic-1B and ai4bharat/indictrans2-indic-en-1B
+ - **Framework**: Transformers, PyTorch
+ - **Interface**: Gradio
+ - **Supported File Types**: PDF, DOCX
+ - **Output Format**: Matches input format (text → text, document → document)
+ - **Translation Directions**: English ↔ Indic languages
+
+ ## Model Information
+
+ This application uses the IndicTrans2 1B model developed by AI4Bharat. The model is specifically designed for high-quality translation between English and Indian languages.
+
+ ## License
+
+ This project follows the licensing terms of the underlying IndicTrans2 model and its dependencies.
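The "English ↔ Indic" directions listed above, plus the Indic-to-Indic pivot described in CLAUDE.md, amount to a small routing rule. A hypothetical sketch with stand-in direction functions (`indic_to_en`/`en_to_indic` here are placeholders, not real API calls):

```python
# Hypothetical sketch of English-pivot routing for translation requests.
# indic_to_en / en_to_indic stand in for the two model pairs named above.
def route_translation(text, src, tgt, indic_to_en, en_to_indic):
    """Pick a direct model pair, or pivot through English."""
    if src == tgt:
        return text
    if src == "en":
        return en_to_indic(text, tgt)
    if tgt == "en":
        return indic_to_en(text, src)
    # Indic -> English -> Indic
    return en_to_indic(indic_to_en(text, src), tgt)

# Toy direction functions that just tag the text, to show the routing:
out = route_translation(
    "namaste", "hin", "tam",
    indic_to_en=lambda t, s: f"en({t})",
    en_to_indic=lambda t, g: f"{g}({t})",
)
print(out)  # tam(en(namaste))
```

The pivot trades some quality for coverage: only two model pairs are needed to serve all language-pair combinations.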
app.py ADDED
@@ -0,0 +1,1023 @@
+ # reverted to code v29
+
+ import gradio as gr
+ import torch
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+ import PyPDF2
+ import docx
+ from docx import Document
+ import io
+ import tempfile
+ import os
+ from typing import Optional, Tuple
+ import logging
+ import spaces
+ import time
+
+ # Set up logging
+ logging.basicConfig(level=logging.INFO)
+ logger = logging.getLogger(__name__)
+
+ # Import IndicProcessor
+ from IndicTransToolkit.processor import IndicProcessor
+
+ # Authentication credentials from environment variables
+ VALID_USERNAME = os.getenv("USERNAME", "admin")
+ VALID_PASSWORD = os.getenv("PASSWORD", "password123")
+
+ # Session management
+ authenticated_sessions = set()
+
+ def authenticate(username: str, password: str) -> tuple:
+     """Authenticate user credentials and return session info"""
+     if username == VALID_USERNAME and password == VALID_PASSWORD:
+         session_id = f"session_{int(time.time())}_{hash(username)}"
+         authenticated_sessions.add(session_id)
+         logger.info(f"Successful login for user: {username}")
+         return True, session_id
+     else:
+         logger.warning(f"Failed login attempt for user: {username}")
+         return False, None
+
+ def is_authenticated(session_id: str) -> bool:
+     """Check if session is authenticated"""
+     return session_id in authenticated_sessions
+
+ def logout_session(session_id: str):
+     """Remove session from authenticated sessions"""
+     if session_id in authenticated_sessions:
+         authenticated_sessions.remove(session_id)
+         logger.info(f"Session logged out: {session_id}")
+
+ class IndicTrans2Translator:
+     def __init__(self):
+         self.en_indic_model = None
+         self.en_indic_tokenizer = None
+         self.indic_en_model = None
+         self.indic_en_tokenizer = None
+         self.ip = None
+         self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+         self.load_models()
+
+     def load_models(self):
+         """Load the IndicTrans2 models and tokenizers optimized for HuggingFace Spaces GPU"""
+         try:
+             logger.info("Loading IndicTrans2 models with HF Spaces GPU optimizations...")
+
+             # Verify CUDA is available
+             if torch.cuda.is_available():
+                 logger.info(f"CUDA available: {torch.cuda.is_available()}")
+                 logger.info(f"CUDA device: {torch.cuda.get_device_name(torch.cuda.current_device())}")
+                 logger.info(f"CUDA device count: {torch.cuda.device_count()}")
+             else:
+                 logger.warning("CUDA not available, using CPU")
+
+             # Initialize IndicProcessor
+             self.ip = IndicProcessor(inference=True)
+             logger.info("IndicProcessor loaded successfully!")
+
+             # Check if accelerate is available for device_map
+             try:
+                 import accelerate
+                 use_device_map = True
+                 logger.info("Accelerate available, using device_map for optimal GPU utilization")
+             except ImportError:
+                 use_device_map = False
+                 logger.info("Accelerate not available, using manual device placement")
+
+             # Load English to Indic model with HF Spaces optimizations
+             logger.info("Loading English to Indic model...")
+             self.en_indic_tokenizer = AutoTokenizer.from_pretrained(
+                 "ai4bharat/indictrans2-en-indic-1B",
+                 trust_remote_code=True
+             )
+
+             # Use bfloat16 for better performance on modern GPUs (A10G, A100, etc.)
+             # Fall back to float16 if bfloat16 is not supported
+             if torch.cuda.is_available():
+                 try:
+                     # Check if GPU supports bfloat16
+                     torch_dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
+                     logger.info(f"Using {torch_dtype} precision for optimal GPU performance")
+                 except Exception:
+                     torch_dtype = torch.float16
+                     logger.info("Using float16 precision")
+             else:
+                 torch_dtype = torch.float32
+                 logger.info("Using float32 precision for CPU")
+
+             # Load model with or without device_map based on accelerate availability
+             if use_device_map and torch.cuda.is_available():
+                 self.en_indic_model = AutoModelForSeq2SeqLM.from_pretrained(
+                     "ai4bharat/indictrans2-en-indic-1B",
+                     trust_remote_code=True,
+                     torch_dtype=torch_dtype,
+                     low_cpu_mem_usage=True,
+                     device_map="auto"  # Automatically distribute model across available GPUs
+                 )
+             else:
+                 self.en_indic_model = AutoModelForSeq2SeqLM.from_pretrained(
+                     "ai4bharat/indictrans2-en-indic-1B",
+                     trust_remote_code=True,
+                     torch_dtype=torch_dtype,
+                     low_cpu_mem_usage=True
+                 )
+                 self.en_indic_model = self.en_indic_model.to(self.device)
+
+             self.en_indic_model.eval()
+
+             # Load Indic to English model
+             logger.info("Loading Indic to English model...")
+             self.indic_en_tokenizer = AutoTokenizer.from_pretrained(
+                 "ai4bharat/indictrans2-indic-en-1B",
+                 trust_remote_code=True
+             )
+
+             if use_device_map and torch.cuda.is_available():
+                 self.indic_en_model = AutoModelForSeq2SeqLM.from_pretrained(
+                     "ai4bharat/indictrans2-indic-en-1B",
+                     trust_remote_code=True,
+                     torch_dtype=torch_dtype,
+                     low_cpu_mem_usage=True,
+                     device_map="auto"
+                 )
+             else:
+                 self.indic_en_model = AutoModelForSeq2SeqLM.from_pretrained(
+                     "ai4bharat/indictrans2-indic-en-1B",
+                     trust_remote_code=True,
+                     torch_dtype=torch_dtype,
+                     low_cpu_mem_usage=True
+                 )
+                 self.indic_en_model = self.indic_en_model.to(self.device)
+
+             self.indic_en_model.eval()
+
+             # Optimize models for inference
+             if torch.cuda.is_available():
+                 # Enable cuDNN benchmark for consistent input sizes
+                 torch.backends.cudnn.benchmark = True
+
+                 # Compile models for faster inference (PyTorch 2.0+)
+                 try:
+                     if not use_device_map:  # Only compile if not using device_map (can conflict)
+                         self.en_indic_model = torch.compile(self.en_indic_model, mode="reduce-overhead")
+                         self.indic_en_model = torch.compile(self.indic_en_model, mode="reduce-overhead")
+                         logger.info("Models compiled with torch.compile for faster inference")
+                     else:
+                         logger.info("Skipping torch.compile (using device_map)")
+                 except Exception as e:
+                     logger.info(f"torch.compile not available or failed: {e}")
+
+             logger.info("Models loaded successfully with HF Spaces optimizations!")
+
+             # Log GPU memory usage
+             if torch.cuda.is_available():
+                 memory_allocated = torch.cuda.memory_allocated(0) / 1024**3  # GB
+                 memory_reserved = torch.cuda.memory_reserved(0) / 1024**3  # GB
+                 logger.info(f"GPU Memory - Allocated: {memory_allocated:.2f}GB, Reserved: {memory_reserved:.2f}GB")
+
+         except Exception as e:
+             logger.error(f"Error loading models: {str(e)}")
+             raise e
+
+     def split_into_sentences(self, text: str) -> tuple:
+         """Split text into sentences while preserving paragraph structure"""
+         import re
+
+         # Split by paragraphs first (double newlines or more)
+         paragraphs = re.split(r'\n\s*\n', text)
+
+         sentence_list = []
+         paragraph_markers = []
+
+         for para_idx, paragraph in enumerate(paragraphs):
+             if not paragraph.strip():
+                 continue
+
+             # Split paragraph into sentences using basic sentence endings
+             sentences = re.split(r'(?<=[.!?])\s+', paragraph.strip())
+
+             for sent_idx, sentence in enumerate(sentences):
+                 if sentence.strip():
+                     sentence_list.append(sentence.strip())
+                     # Mark if this is the last sentence in a paragraph
+                     is_para_end = (sent_idx == len(sentences) - 1)
+                     is_last_para = (para_idx == len(paragraphs) - 1)
+                     paragraph_markers.append({
+                         'is_paragraph_end': is_para_end and not is_last_para,
+                         'original_sentence': sentence.strip()
+                     })
+
+         return sentence_list, paragraph_markers
+
+     def reconstruct_formatting(self, translated_sentences: list, paragraph_markers: list) -> str:
+         """Reconstruct text with original paragraph formatting"""
+         if len(translated_sentences) != len(paragraph_markers):
+             # Fallback: join with single spaces if lengths don't match
+             return ' '.join(translated_sentences)
+
+         result = []
+         for i, (translation, marker) in enumerate(zip(translated_sentences, paragraph_markers)):
+             result.append(translation)
+
+             # Add paragraph break if this sentence ended a paragraph
+             if marker['is_paragraph_end']:
+                 result.append('\n\n')
+             # Add space between sentences within same paragraph
+             elif i < len(translated_sentences) - 1:
+                 result.append(' ')
+
+         return ''.join(result)
+
+     @spaces.GPU
+     def translate_text(self, text: str, source_lang: str, target_lang: str) -> str:
+         """Translate text from source language to target language while preserving formatting"""
+         try:
+             # Get proper language-script codes
+             source_lang_code = LANGUAGE_SCRIPT_MAPPING.get(source_lang)
+             target_lang_code = LANGUAGE_SCRIPT_MAPPING.get(target_lang)
+
+             if not source_lang_code or not target_lang_code:
+                 return f"Unsupported language: {source_lang} or {target_lang}"
+
+             # Check if source and target are the same
+             if source_lang == target_lang:
+                 return text  # Return original text if same language
+
+             # Debug logging
+             logger.info(f"Translating from {source_lang} ({source_lang_code}) to {target_lang} ({target_lang_code})")
+
+             # Check if input is single sentence or multiple paragraphs
+             if '\n' not in text and len(text.split('.')) <= 2:
+                 # Simple single sentence - translate directly
+                 input_sentences = [text.strip()]
+                 paragraph_markers = None
+             else:
+                 # Complex text - preserve formatting
+                 input_sentences, paragraph_markers = self.split_into_sentences(text)
+                 if not input_sentences:
+                     return "No valid text found to translate."
+
+             # Determine which models to use based on source and target languages
+             if source_lang == "en" and target_lang != "en":
+                 # English to Indic translation
+                 tokenizer = self.en_indic_tokenizer
+                 model = self.en_indic_model
+
+             elif source_lang != "en" and target_lang == "en":
+                 # Indic to English translation
+                 tokenizer = self.indic_en_tokenizer
+                 model = self.indic_en_model
+
+             elif source_lang != "en" and target_lang != "en":
+                 # Indic to Indic translation (via English as intermediate)
+                 logger.info(f"Performing Indic-to-Indic translation via English: {source_lang} -> English -> {target_lang}")
+
+                 # Step 1: Translate from source Indic language to English
+                 intermediate_text = self.translate_via_english(input_sentences, source_lang, "en", paragraph_markers)
+
+                 # Step 2: Translate from English to target Indic language
+                 if paragraph_markers:
+                     # Re-split the intermediate text to maintain structure
+                     intermediate_sentences, intermediate_markers = self.split_into_sentences(intermediate_text)
+                     final_text = self.translate_via_english(intermediate_sentences, "en", target_lang, intermediate_markers)
+                 else:
+                     final_text = self.translate_via_english([intermediate_text], "en", target_lang, None)
+
+                 return final_text
+
+             else:
+                 # This shouldn't happen, but just in case
+                 return "Translation configuration error."
+
+             # Direct translation (English <-> Indic)
+             return self.perform_direct_translation(input_sentences, source_lang_code, target_lang_code,
+                                                    tokenizer, model, paragraph_markers)
+
+         except Exception as e:
+             logger.error(f"Translation error: {str(e)}")
+             import traceback
+             traceback.print_exc()
+             return f"Error during translation: {str(e)}"
+
+     def translate_via_english(self, input_sentences: list, source_lang: str, target_lang: str, paragraph_markers: list) -> str:
+         """Helper method to translate via English intermediate step"""
+         source_lang_code = LANGUAGE_SCRIPT_MAPPING.get(source_lang)
+         target_lang_code = LANGUAGE_SCRIPT_MAPPING.get(target_lang)
+
+         if source_lang == "en":
+             # English to Indic
+             tokenizer = self.en_indic_tokenizer
+             model = self.en_indic_model
+         else:
+             # Indic to English
+             tokenizer = self.indic_en_tokenizer
+             model = self.indic_en_model
+
+         return self.perform_direct_translation(input_sentences, source_lang_code, target_lang_code,
+                                                tokenizer, model, paragraph_markers)
+
+     def perform_direct_translation(self, input_sentences: list, source_lang_code: str, target_lang_code: str,
+                                    tokenizer, model, paragraph_markers: list) -> str:
+         """Perform the actual translation using the specified model optimized for HF Spaces GPU"""
+         # Balanced batch size for optimal GPU utilization
+         batch_size = 4  # Optimal for most HF Spaces GPU configurations
+
+         # For very long sentences, reduce batch size (check the longer threshold
+         # first so the > 200 branch is reachable)
+         avg_sentence_length = sum(len(s.split()) for s in input_sentences) / len(input_sentences) if input_sentences else 0
+         if avg_sentence_length > 200:
+             batch_size = 1
+         elif avg_sentence_length > 100:
+             batch_size = 2
+
+         logger.info(f"Using batch size {batch_size} for average sentence length {avg_sentence_length:.1f} words")
+
+         all_translations = []
+
+         for i in range(0, len(input_sentences), batch_size):
+             batch_sentences = input_sentences[i:i + batch_size]
+
+             try:
+                 # Preprocess the batch using IndicProcessor
+                 batch = self.ip.preprocess_batch(
+                     batch_sentences,
+                     src_lang=source_lang_code,
+                     tgt_lang=target_lang_code
+                 )
+
+                 # Tokenize with optimal settings for GPU
+                 inputs = tokenizer(
+                     batch,
+                     truncation=True,
+                     padding="longest",
+                     max_length=256,  # Keep reasonable max length
+                     return_tensors="pt"
+                 ).to(self.device)
+
+                 # Generate translation with optimized parameters
+                 with torch.no_grad():
+                     # Use torch.inference_mode() for better performance
+                     with torch.inference_mode():
+                         outputs = model.generate(
+                             **inputs,
+                             do_sample=False,  # Greedy decoding is faster
+                             max_length=256,
+                             num_beams=1,  # Greedy search for speed
+                             use_cache=True,  # Enable cache for better speed
+                             pad_token_id=tokenizer.pad_token_id,
+                             eos_token_id=tokenizer.eos_token_id
+                         )
+
+                 # Decode the generated tokens
+                 generated_tokens = tokenizer.batch_decode(
+                     outputs,
+                     skip_special_tokens=True,
+                     clean_up_tokenization_spaces=True
+                 )
+
+                 # Postprocess the translations using IndicProcessor
+                 batch_translations = self.ip.postprocess_batch(generated_tokens, lang=target_lang_code)
+                 all_translations.extend(batch_translations)
+
+                 # Progress logging for large documents
+                 if len(input_sentences) > 20:
+                     progress = min(100, int(((i + batch_size) / len(input_sentences)) * 100))
+                     logger.info(f"Translation progress: {progress}% ({i + len(batch_sentences)}/{len(input_sentences)} sentences)")
+
+             except Exception as e:
+                 logger.error(f"Translation error in batch {i//batch_size + 1}: {str(e)}")
+
+                 # Fallback: try single sentences with more conservative settings
+                 for single_sentence in batch_sentences:
+                     try:
+                         single_batch = self.ip.preprocess_batch(
+                             [single_sentence],
+                             src_lang=source_lang_code,
+                             tgt_lang=target_lang_code
+                         )
+
+                         inputs = tokenizer(
+                             single_batch,
+                             truncation=True,
+                             padding=False,
+                             max_length=256,
+                             return_tensors="pt"
+                         ).to(self.device)
+
+                         with torch.no_grad():
+                             with torch.inference_mode():
+                                 outputs = model.generate(
+                                     **inputs,
+                                     do_sample=False,
+                                     max_length=256,
+                                     num_beams=1,
+                                     use_cache=True
+                                 )
+
+                         generated_tokens = tokenizer.batch_decode(
+                             outputs,
+                             skip_special_tokens=True,
+                             clean_up_tokenization_spaces=True
+                         )
+
+                         single_translations = self.ip.postprocess_batch(generated_tokens, lang=target_lang_code)
+                         all_translations.extend(single_translations)
+
+                     except Exception as single_e:
+                         logger.error(f"Failed to translate sentence: {str(single_e)}")
+                         all_translations.append(f"[Translation failed: {single_sentence[:50]}...]")
+
+         # Reconstruct formatting if we have paragraph structure
+         if paragraph_markers and len(all_translations) == len(paragraph_markers):
+             final_translation = self.reconstruct_formatting(all_translations, paragraph_markers)
+         else:
+             # Simple join if no paragraph structure or mismatch
+             final_translation = ' '.join(all_translations) if all_translations else "Translation failed"
+
+         return final_translation
+
+ # Language mappings with proper IndicTrans2 language codes
+ LANGUAGES = {
+     "English": "en",
+     "Assamese": "asm",
+     "Bengali": "ben",
+     "Bodo": "brx",
+     "Dogri": "doi",
+     "Gujarati": "guj",
+     "Hindi": "hin",
+     "Kannada": "kan",
+     "Kashmiri": "kas",
+     "Konkani": "gom",
+     "Maithili": "mai",
+     "Malayalam": "mal",
+     "Manipuri": "mni",
+     "Marathi": "mar",
+     "Nepali": "nep",
+     "Oriya": "ory",
+     "Punjabi": "pan",
+     "Sanskrit": "san",
+     "Santali": "sat",
+     "Sindhi": "snd",
+     "Tamil": "tam",
+     "Telugu": "tel",
+     "Urdu": "urd"
+ }
+
+ # Language-script mapping with proper IndicTrans2 codes
+ LANGUAGE_SCRIPT_MAPPING = {
+     "en": "eng_Latn",
+     "asm": "asm_Beng",
+     "ben": "ben_Beng",
+     "brx": "brx_Deva",
+     "doi": "doi_Deva",
+     "guj": "guj_Gujr",
+     "hin": "hin_Deva",
+     "kan": "kan_Knda",
+     "kas": "kas_Arab",
+     "gom": "gom_Deva",
+     "mai": "mai_Deva",
+     "mal": "mal_Mlym",
+     "mni": "mni_Beng",
+     "mar": "mar_Deva",
+     "nep": "nep_Deva",
+     "ory": "ory_Orya",
+     "pan": "pan_Guru",
+     "san": "san_Deva",
+     "sat": "sat_Olck",
+     "snd": "snd_Arab",
+     "tam": "tam_Taml",
+     "tel": "tel_Telu",
+     "urd": "urd_Arab"
+ }
+
493
+ def extract_text_from_pdf(file_path: str) -> str:
494
+ """Extract text from PDF file while preserving paragraph structure"""
495
+ try:
496
+ with open(file_path, 'rb') as file:
497
+ pdf_reader = PyPDF2.PdfReader(file)
498
+ paragraphs = []
499
+
500
+ for page in pdf_reader.pages:
501
+ page_text = page.extract_text()
502
+ if page_text.strip():
503
+ # Split by double newlines and clean up
504
+ page_paragraphs = [p.strip() for p in page_text.split('\n\n') if p.strip()]
505
+ paragraphs.extend(page_paragraphs)
506
+
507
+        # Join paragraphs with double newlines to preserve structure
+        return '\n\n'.join(paragraphs)
+    except Exception as e:
+        logger.error(f"Error extracting text from PDF: {str(e)}")
+        return f"Error reading PDF: {str(e)}"
+
+def extract_text_from_docx(file_path: str) -> Tuple[str, list]:
+    """Extract text from a DOCX file while preserving paragraph structure and formatting info."""
+    try:
+        doc = Document(file_path)
+        paragraphs = []
+        formatting_info = []
+
+        for para in doc.paragraphs:
+            text = para.text.strip()
+            if text:  # Only keep non-empty paragraphs
+                paragraphs.append(text)
+
+                # Store paragraph-level formatting information
+                para_format = {
+                    'alignment': para.alignment,
+                    'left_indent': para.paragraph_format.left_indent,
+                    'right_indent': para.paragraph_format.right_indent,
+                    'first_line_indent': para.paragraph_format.first_line_indent,
+                    'space_before': para.paragraph_format.space_before,
+                    'space_after': para.paragraph_format.space_after,
+                    'line_spacing': para.paragraph_format.line_spacing,
+                    'runs': []
+                }
+
+                # Store run-level formatting (font, size, bold, italic, etc.)
+                for run in para.runs:
+                    if run.text.strip():  # Only store formatting for non-empty runs
+                        run_format = {
+                            'text': run.text,
+                            'bold': run.bold,
+                            'italic': run.italic,
+                            'underline': run.underline,
+                            'font_name': run.font.name,
+                            'font_size': run.font.size,
+                            'font_color': None,
+                            'highlight_color': None
+                        }
+
+                        # Font color may be undefined on some runs
+                        try:
+                            if run.font.color and run.font.color.rgb:
+                                run_format['font_color'] = run.font.color.rgb
+                        except Exception:
+                            pass
+
+                        # Highlight color may be undefined on some runs
+                        try:
+                            if run.font.highlight_color:
+                                run_format['highlight_color'] = run.font.highlight_color
+                        except Exception:
+                            pass
+
+                        para_format['runs'].append(run_format)
+
+                formatting_info.append(para_format)
+
+        # Join paragraphs with double newlines to preserve structure
+        text = '\n\n'.join(paragraphs)
+        return text, formatting_info
+
+    except Exception as e:
+        logger.error(f"Error extracting text from DOCX: {str(e)}")
+        return f"Error reading DOCX: {str(e)}", []
+
+def create_formatted_docx(translated_paragraphs: list, formatting_info: list, filename: str) -> str:
+    """Create a DOCX file with translated text while preserving the original formatting."""
+    try:
+        doc = Document()
+
+        # Remove the default empty paragraph that python-docx creates
+        if doc.paragraphs:
+            p = doc.paragraphs[0]
+            p._element.getparent().remove(p._element)
+
+        for para_text, para_format in zip(translated_paragraphs, formatting_info):
+            if not para_text.strip():
+                continue
+
+            # Create a new paragraph
+            paragraph = doc.add_paragraph()
+
+            # Apply paragraph-level formatting
+            try:
+                if para_format.get('alignment') is not None:
+                    paragraph.alignment = para_format['alignment']
+                if para_format.get('left_indent') is not None:
+                    paragraph.paragraph_format.left_indent = para_format['left_indent']
+                if para_format.get('right_indent') is not None:
+                    paragraph.paragraph_format.right_indent = para_format['right_indent']
+                if para_format.get('first_line_indent') is not None:
+                    paragraph.paragraph_format.first_line_indent = para_format['first_line_indent']
+                if para_format.get('space_before') is not None:
+                    paragraph.paragraph_format.space_before = para_format['space_before']
+                if para_format.get('space_after') is not None:
+                    paragraph.paragraph_format.space_after = para_format['space_after']
+                if para_format.get('line_spacing') is not None:
+                    paragraph.paragraph_format.line_spacing = para_format['line_spacing']
+            except Exception as e:
+                logger.warning(f"Could not apply some paragraph formatting: {e}")
+
+            # Handle run-level formatting
+            runs_info = para_format.get('runs', [])
+
+            if runs_info:
+                # Determine the dominant formatting across the original runs
+                total_runs = len(runs_info)
+                bold_count = sum(1 for r in runs_info if r.get('bold'))
+                italic_count = sum(1 for r in runs_info if r.get('italic'))
+                underline_count = sum(1 for r in runs_info if r.get('underline'))
+
+                # Collect the font attributes seen across runs
+                font_names = [r.get('font_name') for r in runs_info if r.get('font_name')]
+                font_sizes = [r.get('font_size') for r in runs_info if r.get('font_size')]
+                font_colors = [r.get('font_color') for r in runs_info if r.get('font_color')]
+
+                # Add the translated text as a single run
+                run = paragraph.add_run(para_text)
+
+                # Apply the dominant formatting
+                try:
+                    if bold_count > total_runs / 2:
+                        run.bold = True
+                    if italic_count > total_runs / 2:
+                        run.italic = True
+                    if underline_count > total_runs / 2:
+                        run.underline = True
+
+                    # Apply the most common font settings
+                    if font_names:
+                        run.font.name = max(set(font_names), key=font_names.count)
+                    if font_sizes:
+                        run.font.size = max(set(font_sizes), key=font_sizes.count)
+                    if font_colors:
+                        run.font.color.rgb = max(set(font_colors), key=font_colors.count)
+                except Exception as e:
+                    logger.warning(f"Could not apply some formatting: {e}")
+
+            else:
+                # No run formatting info; just add the text
+                paragraph.add_run(para_text)
+
+        doc.save(filename)
+        return filename
+
+    except Exception as e:
+        logger.error(f"Error creating formatted DOCX: {str(e)}")
+        # Fall back to the plain version
+        return create_docx_with_text('\n\n'.join(translated_paragraphs), filename)
+
+def create_docx_with_text(text: str, filename: str) -> str:
+    """Create a DOCX file with the given text, preserving paragraph structure (fallback method)."""
+    try:
+        doc = Document()
+
+        # Split on double newlines to preserve paragraph structure
+        paragraphs = text.split('\n\n')
+
+        for para_text in paragraphs:
+            if para_text.strip():  # Only add non-empty paragraphs
+                # Replace single newlines within paragraphs with spaces
+                cleaned_text = para_text.replace('\n', ' ').strip()
+                doc.add_paragraph(cleaned_text)
+
+        doc.save(filename)
+        return filename
+    except Exception as e:
+        logger.error(f"Error creating DOCX: {str(e)}")
+        return None
+
+@spaces.GPU
+def translate_text_input(text: str, source_lang: str, target_lang: str, session_id: str = "") -> str:
+    """Handle text input translation."""
+    if not is_authenticated(session_id):
+        return "❌ Please log in to use this feature."
+
+    if not text.strip():
+        return "Please enter some text to translate."
+
+    source_code = LANGUAGES.get(source_lang)
+    target_code = LANGUAGES.get(target_lang)
+
+    if not source_code or not target_code:
+        return "Invalid language selection."
+
+    # Same-language requests are allowed; the translator simply returns the original text
+    return translator.translate_text(text, source_code, target_code)
+
+@spaces.GPU
+def translate_document(file, source_lang: str, target_lang: str, session_id: str = "") -> Tuple[Optional[str], str]:
+    """Handle document translation while preserving the original formatting."""
+    if not is_authenticated(session_id):
+        return None, "❌ Please log in to use this feature."
+
+    if file is None:
+        return None, "Please upload a document."
+
+    source_code = LANGUAGES.get(source_lang)
+    target_code = LANGUAGES.get(target_lang)
+
+    if not source_code or not target_code:
+        return None, "Invalid language selection."
+
+    # Start timing the translation
+    start_time = time.time()
+
+    try:
+        # Get the file extension
+        file_extension = os.path.splitext(file.name)[1].lower()
+        formatting_info = None
+
+        logger.info(f"Starting document translation: {source_lang} → {target_lang}")
+
+        # Extract text based on file type
+        if file_extension == '.pdf':
+            text = extract_text_from_pdf(file.name)
+        elif file_extension == '.docx':
+            text, formatting_info = extract_text_from_docx(file.name)
+        else:
+            return None, "Unsupported file format. Please upload PDF or DOCX files only."
+
+        if text.startswith("Error"):
+            return None, text
+
+        # Log document stats
+        word_count = len(text.split())
+        char_count = len(text)
+        logger.info(f"Document stats: {word_count} words, {char_count} characters")
+
+        # Translate the text
+        translate_start = time.time()
+        translated_text = translator.translate_text(text, source_code, target_code)
+        translate_duration = time.time() - translate_start
+        logger.info(f"Core translation took: {translate_duration:.2f} seconds")
+
+        # Create the output file
+        output_filename = f"translated_{os.path.splitext(os.path.basename(file.name))[0]}.docx"
+        output_path = os.path.join(tempfile.gettempdir(), output_filename)
+
+        # Create formatted output if formatting info is available
+        if formatting_info and file_extension == '.docx':
+            # Split the translated text back into paragraphs
+            translated_paragraphs = translated_text.split('\n\n')
+
+            # The paragraph counts must match for formatting to be re-applied
+            if len(translated_paragraphs) == len(formatting_info):
+                create_formatted_docx(translated_paragraphs, formatting_info, output_path)
+            else:
+                logger.warning(f"Paragraph count mismatch: {len(translated_paragraphs)} vs {len(formatting_info)}, using fallback")
+                create_docx_with_text(translated_text, output_path)
+        else:
+            # Fall back to plain formatting
+            create_docx_with_text(translated_text, output_path)
+
+        # Calculate the total time
+        total_duration = time.time() - start_time
+
+        # Format the time display
+        minutes = int(total_duration // 60)
+        seconds = int(total_duration % 60)
+        time_str = f"{minutes}m {seconds}s" if minutes > 0 else f"{seconds}s"
+
+        # Calculate translation speed (words per minute)
+        if word_count > 0 and total_duration > 0:
+            words_per_minute = int((word_count / total_duration) * 60)
+            speed_info = f" • Speed: {words_per_minute} words/min"
+        else:
+            speed_info = ""
+
+        # Determine the translation type for the status message
+        if source_code == target_code:
+            translation_type = "Document processed"
+        elif source_code == "en" or target_code == "en":
+            translation_type = "Direct translation"
+        else:
+            translation_type = "Indic-to-Indic translation (via English)"
+
+        status_message = (
+            f"✅ Translation completed successfully!\n"
+            f"⏱️ Time taken: {time_str}\n"
+            f"📄 Document: {word_count} words, {char_count} characters\n"
+            f"🔄 Type: {translation_type}{speed_info}\n"
+            f"📁 Original formatting preserved in output file."
+        )
+
+        logger.info(f"Document translation completed in {total_duration:.2f} seconds ({time_str})")
+
+        return output_path, status_message
+
+    except Exception as e:
+        total_duration = time.time() - start_time
+        minutes = int(total_duration // 60)
+        seconds = int(total_duration % 60)
+        time_str = f"{minutes}m {seconds}s" if minutes > 0 else f"{seconds}s"
+
+        logger.error(f"Document translation error after {time_str}: {str(e)}")
+        return None, f"❌ Error during document translation (after {time_str}): {str(e)}"
+
+# Initialize the translator
+print("Initializing IndicTrans2 Translator with IndicTransToolkit...")
+translator = IndicTrans2Translator()
+
+# Build the app with session-based authentication
+with gr.Blocks(title="IndicTrans2 Translator", theme=gr.themes.Soft()) as demo:
+    # Session state
+    session_state = gr.State("")
+
+    # Login interface (visible by default)
+    with gr.Column(visible=True) as login_column:
+        gr.Markdown("""
+        # 🔐 IndicTrans2 Translator - Authentication Required
+
+        Please enter your credentials to access the translation tool.
+        """)
+
+        with gr.Row():
+            with gr.Column(scale=1):
+                pass  # Empty column for centering
+
+            with gr.Column(scale=2):
+                with gr.Group():
+                    gr.Markdown("### Login")
+                    username_input = gr.Textbox(
+                        label="Username",
+                        placeholder="Enter username",
+                        type="text"
+                    )
+                    password_input = gr.Textbox(
+                        label="Password",
+                        placeholder="Enter password",
+                        type="password"
+                    )
+                    login_btn = gr.Button("Login", variant="primary", size="lg")
+                    login_status = gr.Markdown("")
+
+            with gr.Column(scale=1):
+                pass  # Empty column for centering
+
+        gr.Markdown("""
+        ---
+
+        **For Administrators:**
+        - Set the environment secrets `USERNAME` and `PASSWORD` to configure credentials
+        - Secrets are encrypted and secure in HuggingFace Spaces
+
+        **Features:**
+        - 🔒 Secure authentication system
+        - 🌍 Support for 22+ Indian languages
+        - 📄 Document translation with formatting preservation
+        - 🔥 High-quality translation using IndicTrans2 models
+        """)
+
+    # Main translator interface (hidden until login)
+    with gr.Column(visible=False) as main_column:
+        gr.Markdown("""
+        # IndicTrans2 Translation Tool
+
+        Translate text between English and Indian languages using the IndicTrans2 1B model with IndicTransToolkit for optimal quality.
+        """)
+
+        with gr.Tabs():
+            # Text Translation tab
+            with gr.TabItem("Text Translation"):
+                with gr.Row():
+                    with gr.Column():
+                        text_input = gr.Textbox(
+                            label="Input Text",
+                            placeholder="Enter text to translate...",
+                            lines=5
+                        )
+                        with gr.Row():
+                            source_lang_text = gr.Dropdown(
+                                choices=list(LANGUAGES.keys()),
+                                label="Source Language",
+                                value="English"
+                            )
+                            target_lang_text = gr.Dropdown(
+                                choices=list(LANGUAGES.keys()),
+                                label="Target Language",
+                                value="Hindi"
+                            )
+                        translate_text_btn = gr.Button("Translate Text", variant="primary")
+
+                    with gr.Column():
+                        text_output = gr.Textbox(
+                            label="Translated Text",
+                            lines=5,
+                            interactive=False
+                        )
+
+            # Document Translation tab
+            with gr.TabItem("Document Translation"):
+                with gr.Row():
+                    with gr.Column():
+                        file_input = gr.File(
+                            label="Upload Document",
+                            file_types=[".pdf", ".docx"],
+                            type="filepath"
+                        )
+                        with gr.Row():
+                            source_lang_doc = gr.Dropdown(
+                                choices=list(LANGUAGES.keys()),
+                                label="Source Language",
+                                value="English"
+                            )
+                            target_lang_doc = gr.Dropdown(
+                                choices=list(LANGUAGES.keys()),
+                                label="Target Language",
+                                value="Hindi"
+                            )
+                        translate_doc_btn = gr.Button("Translate Document", variant="primary")
+
+                    with gr.Column():
+                        doc_status = gr.Textbox(
+                            label="Status",
+                            interactive=False
+                        )
+                        doc_output = gr.File(
+                            label="Download Translated Document"
+                        )
+
+        # Examples
+        gr.Examples(
+            examples=[
+                ["Hello, how are you?", "English", "Hindi"],
+                ["This is a test sentence for translation.", "English", "Bengali"],
+                ["Machine learning is changing the world.", "English", "Tamil"],
+                ["नमस्ते, आप कैसे हैं?", "Hindi", "English"],
+                ["আমি ভালো আছি।", "Bengali", "Hindi"],
+                ["मला खूप आनंद झाला।", "Marathi", "Tamil"],
+                ["ನಾನು ಚೆನ್ನಾಗಿದ್ದೇನೆ।", "Kannada", "Telugu"]
+            ],
+            inputs=[text_input, source_lang_text, target_lang_text],
+            outputs=[text_output],
+            fn=lambda text, src, tgt: translate_text_input(text, src, tgt, ""),
+            cache_examples=False
+        )
+
+        # Logout button
+        with gr.Row():
+            logout_btn = gr.Button("🔓 Logout", variant="secondary", size="sm")
+
+    def handle_login(username, password):
+        success, session_id = authenticate(username, password)
+        if success:
+            return (
+                gr.Markdown("✅ **Login successful!** Welcome to the translator."),
+                gr.Column(visible=False),
+                gr.Column(visible=True),
+                session_id
+            )
+        else:
+            return (
+                gr.Markdown("❌ **Invalid credentials.** Please try again."),
+                gr.Column(visible=True),
+                gr.Column(visible=False),
+                ""
+            )
+
+    def handle_logout(session_id):
+        if session_id:
+            logout_session(session_id)
+        return (
+            gr.Column(visible=True),
+            gr.Column(visible=False),
+            "",
+            gr.Textbox(value=""),
+            gr.Textbox(value=""),
+            gr.Markdown("🔓 **Logged out successfully.** Please log in again.")
+        )
+
+    # Event handlers
+    login_btn.click(
+        fn=handle_login,
+        inputs=[username_input, password_input],
+        outputs=[login_status, login_column, main_column, session_state]
+    )
+
+    logout_btn.click(
+        fn=handle_logout,
+        inputs=[session_state],
+        outputs=[login_column, main_column, session_state, username_input, password_input, login_status]
+    )
+
+    translate_text_btn.click(
+        fn=translate_text_input,
+        inputs=[text_input, source_lang_text, target_lang_text, session_state],
+        outputs=[text_output]
+    )
+
+    translate_doc_btn.click(
+        fn=translate_document,
+        inputs=[file_input, source_lang_doc, target_lang_doc, session_state],
+        outputs=[doc_output, doc_status]
+    )
+
+print("IndicTrans2 Translator with Authentication initialized successfully!")
+
+# Launch the app
+if __name__ == "__main__":
+    demo.launch(share=True)
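
The formatting-preservation path in `translate_document` above hinges on one invariant: the text joined with `'\n\n'` before translation must come back with the same number of `'\n\n'`-separated paragraphs, otherwise the app falls back to the plain, unformatted writer. A minimal standalone sketch of that round-trip check (using a stand-in identity "translator" in place of `translator.translate_text`):

```python
# Round-trip invariant behind the formatting-preservation path:
# paragraphs are joined with '\n\n' before translation and split on
# '\n\n' afterwards; the counts must match to re-apply formatting.

def translate_stub(text: str) -> str:
    # Stand-in "translation" that preserves paragraph boundaries
    return text

paragraphs = ["First paragraph.", "Second paragraph.", "Third one."]
joined = '\n\n'.join(paragraphs)
translated = translate_stub(joined)
translated_paragraphs = translated.split('\n\n')

# Formatting can be re-applied only when the counts line up
can_reapply = len(translated_paragraphs) == len(paragraphs)
print(can_reapply)  # True
```

If the model ever inserts or drops a blank line, the count check catches it and the fallback writer still produces a valid document.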
config.yaml ADDED
@@ -0,0 +1,12 @@
+title: IndicTrans2 Translation Tool
+emoji: 🌍
+colorFrom: blue
+colorTo: green
+sdk: gradio
+sdk_version: 4.0.0
+app_file: app.py
+pinned: false
+license: mit
+short_description: Translate between English and Indian languages using IndicTrans2
+suggested_hardware: t4-small
+suggested_storage: small
requirements.txt ADDED
@@ -0,0 +1,13 @@
+--extra-index-url https://download.pytorch.org/whl/cu118
+torch
+transformers>=4.33.2
+accelerate
+gradio
+PyPDF2
+python-docx
+nltk
+sacremoses
+pandas
+regex
+mosestokenizer
+git+https://github.com/VarunGumma/IndicTransToolkit.git
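
A note on the styling heuristic in `create_formatted_docx` in app.py above: because the model returns one flat string per paragraph, run boundaries inside a paragraph cannot survive translation, so each paragraph is collapsed to a single run with majority-vote styling. The heuristic in isolation, on hypothetical run dicts using the same keys as the extractor:

```python
# Majority-vote styling as used by create_formatted_docx: the single
# output run is bold/italic only if more than half of the original
# runs were, and takes the most frequently seen font name.
runs_info = [
    {'bold': True, 'italic': False, 'font_name': 'Arial'},
    {'bold': True, 'italic': True, 'font_name': 'Arial'},
    {'bold': False, 'italic': False, 'font_name': 'Times New Roman'},
]

total_runs = len(runs_info)
bold = sum(1 for r in runs_info if r.get('bold')) > total_runs / 2
italic = sum(1 for r in runs_info if r.get('italic')) > total_runs / 2

font_names = [r.get('font_name') for r in runs_info if r.get('font_name')]
font = max(set(font_names), key=font_names.count)

print(bold, italic, font)  # True False Arial
```

The trade-off is deliberate: a paragraph that was mostly bold stays bold, but a single bold word inside a plain paragraph loses its emphasis.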