Ruhivig65 committed on
Commit
6cc0372
·
verified ·
1 Parent(s): bd06613

Upload 6 files

Files changed (6)
  1. Dockerfile +26 -0
  2. README.md +7 -5
  3. app.py +486 -0
  4. requirements.txt +7 -0
  5. task_manager.py +439 -0
  6. translator_engine.py +740 -0
Dockerfile ADDED
@@ -0,0 +1,26 @@
+ FROM python:3.10.14-slim
+
+ # Create non-root user (HuggingFace requirement)
+ RUN useradd -m -u 1000 appuser
+
+ WORKDIR /app
+
+ # Install dependencies first (Docker cache optimization)
+ COPY requirements.txt .
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ # Copy all application code
+ COPY . .
+
+ # Create necessary directories with proper permissions
+ RUN mkdir -p uploads outputs && \
+     chown -R appuser:appuser /app
+
+ # Switch to non-root user
+ USER appuser
+
+ # HuggingFace Spaces expects port 7860
+ EXPOSE 7860
+
+ # Launch command
+ CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
README.md CHANGED
@@ -1,10 +1,12 @@
  ---
- title: Novelt
- emoji: 🏆
- colorFrom: gray
- colorTo: red
+ title: Massive Text Translator EN-HI
+ emoji: 📚
+ colorFrom: blue
+ colorTo: green
  sdk: docker
+ app_port: 7860
  pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Massive English to Hindi Text Translator
+ Upload massive .txt files (1000+ chapters, 2-3 million words) and get translated .txt back.
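For reference, the upload, poll, and download flow this Space implements can be driven from a script as well as from the browser UI. A minimal client sketch, assuming the Space is reachable at http://localhost:7860 and using the routes defined in app.py below; the requests package and the my_novel.txt path are illustrative assumptions, not part of this commit:

import time
import requests  # assumed extra dependency, not listed in requirements.txt

BASE = "http://localhost:7860"  # assumed local deployment

# 1. Upload a .txt file; the server responds 202 with a task_id
with open("my_novel.txt", "rb") as fh:
    resp = requests.post(f"{BASE}/upload", files={"file": ("my_novel.txt", fh, "text/plain")})
resp.raise_for_status()
task_id = resp.json()["task_id"]

# 2. Poll progress every 2 seconds until the task reaches a terminal state
while True:
    status = requests.get(f"{BASE}/progress/{task_id}").json()
    print(status["status"], status["progress"]["percent_complete"], "%")
    if status["status"] in ("completed", "failed", "cancelled"):
        break
    time.sleep(2)

# 3. Download the translated file (streamed by the server)
if status["status"] == "completed":
    with requests.get(f"{BASE}/download/{task_id}", stream=True) as dl:
        dl.raise_for_status()
        with open("my_novel_hindi.txt", "wb") as out:
            for chunk in dl.iter_content(chunk_size=1 << 20):
                out.write(chunk)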
app.py ADDED
@@ -0,0 +1,486 @@
+ """
+ app.py
+ ======
+ FastAPI Web Application for Massive Text Translation (EN → HI)
+
+ HUGGINGFACE SPACES DEPLOYMENT:
+ -------------------------------
+ - This is the main entry point
+ - HuggingFace Spaces Docker SDK runs: uvicorn app:app --host 0.0.0.0 --port 7860
+ - The app serves both the API and the frontend HTML
+
+ ROUTES:
+ -------
+ GET  /               → Frontend UI (HTML page)
+ POST /upload         → Upload a .txt file and start translation
+ GET  /progress       → Get real-time progress of the current/latest task
+ GET  /progress/{id}  → Get progress of a specific task by ID
+ GET  /download/{id}  → Download the translated file
+ POST /cancel/{id}    → Cancel a running translation
+ GET  /health         → Health check endpoint
+
+ FLOW:
+ -----
+ 1. User opens / → sees the upload UI
+ 2. User uploads a .txt file → POST /upload
+ 3. Server saves the file, creates a task, starts a background thread
+ 4. Frontend polls GET /progress every 2 seconds
+ 5. When status="completed", the download button appears
+ 6. User clicks download → GET /download/{task_id}
+ 7. Server streams the translated .txt file to the browser
+ """
+
+ import os
+ import logging
+ import shutil
+ import uuid
+ from typing import Optional
+
+ from fastapi import FastAPI, File, UploadFile, HTTPException, Request
+ from fastapi.responses import (
+     HTMLResponse,
+     JSONResponse,
+     FileResponse,
+     StreamingResponse,
+ )
+ from fastapi.staticfiles import StaticFiles
+ from fastapi.templating import Jinja2Templates
+
+ from task_manager import (
+     get_task_manager,
+     TaskStatus,
+     UPLOAD_DIR,
+     OUTPUT_DIR,
+ )
+
+ # ============================================================================
+ # LOGGING
+ # ============================================================================
+ logging.basicConfig(
+     level=logging.INFO,
+     format="%(asctime)s [%(name)s] %(levelname)s: %(message)s",
+ )
+ logger = logging.getLogger("app")
+
+ # ============================================================================
+ # FASTAPI APP INITIALIZATION
+ # ============================================================================
+ app = FastAPI(
+     title="Massive Text Translator (EN → HI)",
+     description="Translate massive .txt files from English to Hindi using free Google Translate",
+     version="1.0.0",
+ )
+
+ # ============================================================================
+ # TEMPLATES SETUP
+ # ============================================================================
+ BASE_DIR = os.path.dirname(os.path.abspath(__file__))
+ TEMPLATES_DIR = os.path.join(BASE_DIR, "templates")
+ os.makedirs(TEMPLATES_DIR, exist_ok=True)
+
+ templates = Jinja2Templates(directory=TEMPLATES_DIR)
+
+
+ # ============================================================================
+ # STARTUP EVENT
+ # ============================================================================
+ @app.on_event("startup")
+ async def startup_event():
+     """
+     Called when the FastAPI server starts.
+     - Ensures directories exist
+     - Initializes the TaskManager
+     - Cleans up any leftover files from previous runs
+     """
+     os.makedirs(UPLOAD_DIR, exist_ok=True)
+     os.makedirs(OUTPUT_DIR, exist_ok=True)
+
+     # Initialize the global task manager
+     tm = get_task_manager()
+     logger.info("=" * 60)
+     logger.info("Massive Text Translator started successfully!")
+     logger.info(f"Upload dir: {UPLOAD_DIR}")
+     logger.info(f"Output dir: {OUTPUT_DIR}")
+     logger.info("=" * 60)
+
+
+ # ============================================================================
+ # ROUTE: HOME PAGE (Frontend UI)
+ # ============================================================================
+ @app.get("/", response_class=HTMLResponse)
+ async def home(request: Request):
+     """
+     Serve the main frontend page.
+     The HTML template handles:
+     - File upload form
+     - Real-time progress display
+     - Download button when complete
+     """
+     return templates.TemplateResponse("index.html", {"request": request})
+
+
+ # ============================================================================
+ # ROUTE: FILE UPLOAD
+ # ============================================================================
+ @app.post("/upload")
+ async def upload_file(file: UploadFile = File(...)):
+     """
+     Handle .txt file upload and start translation.
+
+     FLOW:
+     1. Validate the file (must be .txt)
+     2. Save it to the uploads/ directory
+     3. Create a translation task
+     4. Start translation in a background thread
+     5. Return task_id for progress polling
+
+     ERROR SCENARIOS:
+     - File is not .txt → 400 error
+     - Another translation running → 409 Conflict
+     - Disk full → 500 error (caught by the generic handler)
+     - File too large → handled by streaming save (won't OOM)
+     """
+     tm = get_task_manager()
+
+     # --- Validation ---
+     if not file.filename:
+         raise HTTPException(status_code=400, detail="No file provided.")
+
+     if not file.filename.lower().endswith(".txt"):
+         raise HTTPException(
+             status_code=400,
+             detail="Only .txt files are supported. Please upload a plain text file.",
+         )
+
+     # --- Check if busy ---
+     if tm.is_busy():
+         raise HTTPException(
+             status_code=409,
+             detail="A translation is already in progress. Please wait for it to "
+                    "complete or cancel it before uploading a new file.",
+         )
+
+     # --- Cleanup old tasks before accepting a new one ---
+     tm.cleanup_old_tasks()
+
+     # --- Save uploaded file to disk (streaming - no full file in RAM) ---
+     file_id = str(uuid.uuid4())[:8]
+     safe_filename = f"{file_id}_{file.filename}"
+     input_path = os.path.join(UPLOAD_DIR, safe_filename)
+
+     try:
+         # Stream the file to disk in chunks (handles files larger than RAM)
+         # IMPORTANT: We read in 1MB chunks, not the whole file at once
+         with open(input_path, "wb") as f:
+             while True:
+                 chunk = await file.read(1024 * 1024)  # 1MB chunks
+                 if not chunk:
+                     break
+                 f.write(chunk)
+
+         file_size = os.path.getsize(input_path)
+         logger.info(
+             f"File uploaded: {file.filename} → {input_path} "
+             f"({file_size / 1024 / 1024:.1f} MB)"
+         )
+
+     except Exception as e:
+         # Clean up the partial upload
+         if os.path.exists(input_path):
+             os.remove(input_path)
+         logger.exception(f"File upload failed: {e}")
+         raise HTTPException(
+             status_code=500,
+             detail=f"Failed to save uploaded file: {str(e)}",
+         )
+
+     # --- Create and start the translation task ---
+     try:
+         task = tm.create_task(
+             original_filename=file.filename,
+             input_path=input_path,
+         )
+         tm.start_task(task.task_id)
+
+         return JSONResponse(
+             status_code=202,  # 202 Accepted (processing started)
+             content={
+                 "message": "Translation started!",
+                 "task_id": task.task_id,
+                 "filename": file.filename,
+                 "file_size_mb": round(file_size / 1024 / 1024, 2),
+             },
+         )
+
+     except RuntimeError as e:
+         # Another task started between our check and create (race condition)
+         if os.path.exists(input_path):
+             os.remove(input_path)
+         raise HTTPException(status_code=409, detail=str(e))
+
+     except Exception as e:
+         if os.path.exists(input_path):
+             os.remove(input_path)
+         logger.exception(f"Task creation failed: {e}")
+         raise HTTPException(
+             status_code=500,
+             detail=f"Failed to start translation: {str(e)}",
+         )
+
+
+ # ============================================================================
+ # ROUTE: PROGRESS POLLING (Latest Task)
+ # ============================================================================
+ @app.get("/progress")
+ async def get_progress():
+     """
+     Get the progress of the latest/current translation task.
+
+     The frontend polls this endpoint every 2 seconds.
+
+     Returns:
+         {
+             "task_id": "a1b2c3d4",
+             "original_filename": "my_novel.txt",
+             "status": "translating",
+             "progress": {
+                 "total_paragraphs": 50000,
+                 "translated_paragraphs": 15000,
+                 "failed_paragraphs": 3,
+                 "percent_complete": 30.0,
+                 "elapsed_seconds": 1800.5,
+                 "speed_per_second": 8.33,
+                 "eta_seconds": 4200.0,
+                 ...
+             },
+             ...
+         }
+     """
+     tm = get_task_manager()
+     progress = tm.get_latest_task_progress()
+
+     if progress is None:
+         return JSONResponse(
+             content={
+                 "status": "idle",
+                 "message": "No translation tasks. Upload a .txt file to start.",
+             }
+         )
+
+     return JSONResponse(content=progress)
+
+
+ # ============================================================================
+ # ROUTE: PROGRESS POLLING (Specific Task)
+ # ============================================================================
+ @app.get("/progress/{task_id}")
+ async def get_task_progress(task_id: str):
+     """
+     Get progress of a specific task by ID.
+     Useful if you have the task_id from the upload response.
+     """
+     tm = get_task_manager()
+     progress = tm.get_task_progress(task_id)
+
+     if progress is None:
+         raise HTTPException(
+             status_code=404,
+             detail=f"Task not found: {task_id}",
+         )
+
+     return JSONResponse(content=progress)
+
+
+ # ============================================================================
+ # ROUTE: DOWNLOAD TRANSLATED FILE
+ # ============================================================================
+ @app.get("/download/{task_id}")
+ async def download_file(task_id: str):
+     """
+     Download the translated .txt file.
+
+     Only works for COMPLETED tasks.
+     Streams the file to the browser (doesn't load the entire file in RAM).
+
+     SECURITY:
+     - Only serves files from the outputs/ directory
+     - Validates that the task_id exists in our registry
+     - Prevents path traversal attacks
+     """
+     tm = get_task_manager()
+     task = tm.get_task(task_id)
+
+     if task is None:
+         raise HTTPException(status_code=404, detail=f"Task not found: {task_id}")
+
+     if task.status != TaskStatus.COMPLETED:
+         raise HTTPException(
+             status_code=400,
+             detail=f"Task is not completed yet. Current status: {task.status.value}",
+         )
+
+     output_path = tm.get_output_path(task_id)
+     if output_path is None or not os.path.exists(output_path):
+         raise HTTPException(
+             status_code=404,
+             detail="Translated file not found on disk. It may have been cleaned up.",
+         )
+
+     # Generate a user-friendly download filename
+     original_name = os.path.splitext(task.original_filename)[0]
+     download_filename = f"{original_name}_Hindi_Translated.txt"
+
+     logger.info(f"Serving download: {output_path} as '{download_filename}'")
+
+     # Use FileResponse - it streams the file, doesn't load it into RAM
+     return FileResponse(
+         path=output_path,
+         filename=download_filename,
+         media_type="text/plain; charset=utf-8",
+     )
+
+
+ # ============================================================================
+ # ROUTE: CANCEL TRANSLATION
+ # ============================================================================
+ @app.post("/cancel/{task_id}")
+ async def cancel_translation(task_id: str):
+     """
+     Cancel a running translation.
+
+     The translator will stop after completing its current paragraph.
+     Any already-translated content is preserved in the output file.
+     """
+     tm = get_task_manager()
+
+     success = tm.cancel_task(task_id)
+     if not success:
+         task = tm.get_task(task_id)
+         if task is None:
+             raise HTTPException(status_code=404, detail=f"Task not found: {task_id}")
+         raise HTTPException(
+             status_code=400,
+             detail=f"Cannot cancel task in '{task.status.value}' state.",
+         )
+
+     return JSONResponse(
+         content={
+             "message": "Translation cancelled. Partial output preserved on disk.",
+             "task_id": task_id,
+         }
+     )
+
+
+ # ============================================================================
+ # ROUTE: CANCEL LATEST TASK (convenience for simple UI)
+ # ============================================================================
+ @app.post("/cancel")
+ async def cancel_latest():
+     """Cancel the most recent/active translation task."""
+     tm = get_task_manager()
+     task = tm.get_latest_task()
+
+     if task is None:
+         raise HTTPException(status_code=404, detail="No active tasks to cancel.")
+
+     success = tm.cancel_task(task.task_id)
+     if not success:
+         raise HTTPException(
+             status_code=400,
+             detail=f"Cannot cancel task in '{task.status.value}' state.",
+         )
+
+     return JSONResponse(
+         content={
+             "message": "Translation cancelled.",
+             "task_id": task.task_id,
+         }
+     )
+
+
+ # ============================================================================
+ # ROUTE: HEALTH CHECK
+ # ============================================================================
+ @app.get("/health")
+ async def health_check():
+     """
+     Health check endpoint.
+     HuggingFace Spaces uses this to verify the app is running.
+     Also useful for monitoring.
+     """
+     tm = get_task_manager()
+
+     return JSONResponse(
+         content={
+             "status": "healthy",
+             "server": "Massive Text Translator EN→HI",
+             "is_busy": tm.is_busy(),
+             "upload_dir": UPLOAD_DIR,
+             "output_dir": OUTPUT_DIR,
+             "upload_dir_exists": os.path.exists(UPLOAD_DIR),
+             "output_dir_exists": os.path.exists(OUTPUT_DIR),
+         }
+     )
+
+
+ # ============================================================================
+ # ROUTE: GET ALL TASKS (debug/admin)
+ # ============================================================================
+ @app.get("/tasks")
+ async def list_tasks():
+     """
+     List all tasks (for debugging).
+     Shows all active, completed, and failed tasks.
+     """
+     tm = get_task_manager()
+     tasks = []
+
+     with tm._lock:
+         for task_id, task in tm._tasks.items():
+             tasks.append(task.to_dict())
+
+     # Sort by created_at descending (newest first)
+     tasks.sort(key=lambda t: t.get("created_at", 0), reverse=True)
+
+     return JSONResponse(content={"tasks": tasks, "total": len(tasks)})
+
+
+ # ============================================================================
+ # GLOBAL EXCEPTION HANDLER
+ # ============================================================================
+ @app.exception_handler(Exception)
+ async def global_exception_handler(request: Request, exc: Exception):
+     """
+     Catch-all exception handler.
+     Logs the error and returns a clean JSON response.
+     Prevents ugly stack traces from leaking to the user.
+     """
+     logger.exception(f"Unhandled exception on {request.url}: {exc}")
+     return JSONResponse(
+         status_code=500,
+         content={
+             "detail": "An internal server error occurred. Please try again.",
+             "error": str(exc),
+         },
+     )
+
+
+ # ============================================================================
+ # MAIN - Direct execution (for local testing)
+ # ============================================================================
+ if __name__ == "__main__":
+     import uvicorn
+
+     print("=" * 60)
+     print("Starting Massive Text Translator...")
+     print("Open http://localhost:7860 in your browser")
+     print("=" * 60)
+
+     uvicorn.run(
+         "app:app",
+         host="0.0.0.0",
+         port=7860,
+         reload=False,   # No reload in production
+         workers=1,      # Single worker - we manage threads ourselves
+         log_level="info",
+     )
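Because app exposes plain FastAPI routes, the contract above can be smoke-tested in-process without starting uvicorn. A minimal sketch, assuming it runs from the repository root with the pinned dependencies installed plus httpx (which Starlette's TestClient needs); neither assumption is part of this commit:

from fastapi.testclient import TestClient

from app import app  # the FastAPI instance defined above

client = TestClient(app)

# /health should answer 200 with the fields shown in health_check()
resp = client.get("/health")
assert resp.status_code == 200
assert resp.json()["status"] == "healthy"

# Non-.txt uploads are rejected with 400 before any task is created
resp = client.post("/upload", files={"file": ("book.pdf", b"%PDF-1.4", "application/pdf")})
assert resp.status_code == 400

# With no tasks created yet, /progress reports the idle state
assert client.get("/progress").json()["status"] == "idle"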
requirements.txt ADDED
@@ -0,0 +1,7 @@
+ fastapi==0.110.0
+ uvicorn==0.29.0
+ deep-translator==1.11.4
+ tenacity==8.2.3
+ python-multipart==0.0.9
+ jinja2==3.1.3
+ aiofiles==23.2.1
task_manager.py ADDED
@@ -0,0 +1,439 @@
+ """
+ task_manager.py
+ ===============
+ Background Task Manager for Massive File Translation
+
+ RESPONSIBILITIES:
+ -----------------
+ 1. Manages the lifecycle of translation tasks (create, run, track, cleanup)
+ 2. Runs translation in a background thread (non-blocking to FastAPI)
+ 3. Maintains a registry of all tasks with their progress
+ 4. Handles file path management (uploads, outputs)
+ 5. Supports single-task mode (one translation at a time on the HuggingFace free tier)
+
+ WHY SINGLE TASK MODE:
+ ---------------------
+ On HuggingFace Spaces (free tier):
+ - Shared IP = easier to get rate-limited by Google
+ - Limited CPU/RAM compared to a dedicated server
+ - Running 2+ massive translations simultaneously would guarantee an IP ban
+ - So we queue: one active translation, others wait
+
+ ARCHITECTURE:
+ -------------
+ FastAPI Request → TaskManager.create_task()
+         ↓
+ Background Thread starts
+         ↓
+ MassiveFileTranslator.translate_file()
+         ↓
+ Progress updated in real-time
+         ↓
+ Task marked "completed"
+         ↓
+ FastAPI serves download link
+ """
+
+ import os
+ import uuid
+ import time
+ import shutil
+ import threading
+ import logging
+ from typing import Optional, Dict
+ from dataclasses import dataclass, field
+ from enum import Enum
+
+ from translator_engine import (
+     MassiveFileTranslator,
+     TranslatorConfig,
+     TranslationProgress,
+ )
+
+ logger = logging.getLogger("task_manager")
+
+
+ # ============================================================================
+ # DIRECTORY SETUP
+ # ============================================================================
+ # Base directories - created at module load time
+ BASE_DIR = os.path.dirname(os.path.abspath(__file__))
+ UPLOAD_DIR = os.path.join(BASE_DIR, "uploads")
+ OUTPUT_DIR = os.path.join(BASE_DIR, "outputs")
+
+ # Ensure directories exist
+ os.makedirs(UPLOAD_DIR, exist_ok=True)
+ os.makedirs(OUTPUT_DIR, exist_ok=True)
+
+
+ # ============================================================================
+ # TASK STATUS ENUM
+ # ============================================================================
+ class TaskStatus(str, Enum):
+     QUEUED = "queued"
+     PREPARING = "preparing"
+     TRANSLATING = "translating"
+     COMPLETED = "completed"
+     FAILED = "failed"
+     CANCELLED = "cancelled"
+
+
+ # ============================================================================
+ # SINGLE TASK REPRESENTATION
+ # ============================================================================
+ @dataclass
+ class TranslationTask:
+     """
+     Represents one translation job.
+
+     Lifecycle:
+         QUEUED → PREPARING → TRANSLATING → COMPLETED
+                                          → FAILED
+                                          → CANCELLED
+     """
+     task_id: str
+     original_filename: str
+     input_path: str
+     output_path: str
+     status: TaskStatus = TaskStatus.QUEUED
+     created_at: float = field(default_factory=time.time)
+     completed_at: float = 0.0
+     error_message: str = ""
+     progress: TranslationProgress = field(default_factory=TranslationProgress)
+     translator: Optional[MassiveFileTranslator] = field(default=None, repr=False)
+     thread: Optional[threading.Thread] = field(default=None, repr=False)
+
+     def to_dict(self) -> dict:
+         """Serialize task info for API responses."""
+         return {
+             "task_id": self.task_id,
+             "original_filename": self.original_filename,
+             "status": self.status.value,
+             "created_at": self.created_at,
+             "completed_at": self.completed_at,
+             "error_message": self.error_message,
+             "progress": self.progress.to_dict(),
+             "output_filename": os.path.basename(self.output_path),
+         }
+
+
+ # ============================================================================
+ # TASK MANAGER - Singleton that manages all translation tasks
+ # ============================================================================
+ class TaskManager:
+     """
+     Central manager for all translation tasks.
+
+     THREAD SAFETY:
+     - Uses a lock for task registry modifications
+     - Each task runs in its own background thread
+     - Progress objects are internally thread-safe (they have their own locks)
+
+     SINGLE TASK ENFORCEMENT:
+     - Only one translation can be ACTIVE at a time
+     - If a new upload comes in while one is running, it returns an error
+     - This prevents Google IP bans from concurrent heavy usage
+
+     FILE CLEANUP:
+     - Old completed tasks' files are cleaned up after a configurable time
+     - Prevents disk space exhaustion on HuggingFace (limited storage)
+     """
+
+     def __init__(self, config: Optional[TranslatorConfig] = None):
+         self.config = config or TranslatorConfig()
+         self._tasks: Dict[str, TranslationTask] = {}
+         # RLock, not Lock: create_task() calls is_busy() while already
+         # holding this lock, which would deadlock with a plain Lock
+         self._lock = threading.RLock()
+         self._active_task_id: Optional[str] = None
+
+         # Auto-cleanup interval (seconds) - remove completed tasks after 1 hour
+         self.cleanup_after_seconds = 3600
+
+         logger.info("TaskManager initialized.")
+         logger.info(f"Upload directory: {UPLOAD_DIR}")
+         logger.info(f"Output directory: {OUTPUT_DIR}")
+
+     # -----------------------------------------------------------------------
+     # PUBLIC API
+     # -----------------------------------------------------------------------
+
+     def is_busy(self) -> bool:
+         """Check if a translation is currently running."""
+         with self._lock:
+             if self._active_task_id is None:
+                 return False
+             task = self._tasks.get(self._active_task_id)
+             if task is None:
+                 self._active_task_id = None
+                 return False
+             # If the active task is done/failed/cancelled, we're not busy
+             if task.status in (
+                 TaskStatus.COMPLETED,
+                 TaskStatus.FAILED,
+                 TaskStatus.CANCELLED,
+             ):
+                 self._active_task_id = None
+                 return False
+             return True
+
+     def create_task(self, original_filename: str, input_path: str) -> TranslationTask:
+         """
+         Create a new translation task.
+
+         Args:
+             original_filename: The user's original file name (for display)
+             input_path: Path to the uploaded file on disk
+
+         Returns:
+             TranslationTask object
+
+         Raises:
+             RuntimeError if another translation is already running
+         """
+         with self._lock:
+             # --- Single task enforcement ---
+             if self.is_busy():
+                 raise RuntimeError(
+                     "Another translation is currently in progress. "
+                     "Please wait for it to complete before uploading a new file."
+                 )
+
+             # Generate a unique task ID
+             task_id = str(uuid.uuid4())[:12]
+
+             # Create the output file path
+             # Original: "my_novel.txt" → Output: "my_novel_hindi_a1b2c3d4.txt"
+             name_without_ext = os.path.splitext(original_filename)[0]
+             output_filename = f"{name_without_ext}_hindi_{task_id}.txt"
+             output_path = os.path.join(OUTPUT_DIR, output_filename)
+
+             # Create the progress tracker
+             progress = TranslationProgress()
+             progress.file_name = original_filename
+
+             # Create the translator instance
+             translator = MassiveFileTranslator(
+                 config=self.config,
+                 progress=progress,
+             )
+
+             # Create the task
+             task = TranslationTask(
+                 task_id=task_id,
+                 original_filename=original_filename,
+                 input_path=input_path,
+                 output_path=output_path,
+                 progress=progress,
+                 translator=translator,
+             )
+
+             # Register the task
+             self._tasks[task_id] = task
+             self._active_task_id = task_id
+
+             logger.info(
+                 f"Task created: {task_id} | File: {original_filename} | "
+                 f"Input: {input_path} | Output: {output_path}"
+             )
+
+             return task
+
+     def start_task(self, task_id: str):
+         """
+         Start the translation task in a background thread.
+
+         The thread runs the translator and updates progress in real-time.
+         FastAPI can poll progress via get_task_progress().
+         """
+         with self._lock:
+             task = self._tasks.get(task_id)
+             if task is None:
+                 raise ValueError(f"Task not found: {task_id}")
+             if task.status != TaskStatus.QUEUED:
+                 raise RuntimeError(f"Task {task_id} is not in QUEUED state")
+
+             # Create and start the background thread
+             thread = threading.Thread(
+                 target=self._run_translation,
+                 args=(task,),
+                 name=f"translator-{task_id}",
+                 daemon=True,  # Thread dies if the main process exits
+             )
+             task.thread = thread
+             thread.start()
+
+             logger.info(f"Task {task_id} started in background thread.")
+
+     def get_task(self, task_id: str) -> Optional[TranslationTask]:
+         """Get a task by ID."""
+         with self._lock:
+             return self._tasks.get(task_id)
+
+     def get_task_progress(self, task_id: str) -> Optional[dict]:
+         """Get task progress as a dictionary (for API responses)."""
+         task = self.get_task(task_id)
+         if task is None:
+             return None
+         return task.to_dict()
+
+     def get_latest_task(self) -> Optional[TranslationTask]:
+         """Get the most recent task (for the simple single-task UI)."""
+         with self._lock:
+             if not self._tasks:
+                 return None
+             # Return the task with the latest created_at timestamp
+             return max(self._tasks.values(), key=lambda t: t.created_at)
+
+     def get_latest_task_progress(self) -> Optional[dict]:
+         """Get progress of the most recent task."""
+         task = self.get_latest_task()
+         if task is None:
+             return None
+         return task.to_dict()
+
+     def cancel_task(self, task_id: str) -> bool:
+         """
+         Cancel a running translation task.
+
+         Sets a cancel flag that the translator checks periodically.
+         The translator will stop after completing its current paragraph.
+         Already-translated content is preserved on disk.
+         """
+         task = self.get_task(task_id)
+         if task is None:
+             return False
+
+         if task.status not in (TaskStatus.QUEUED, TaskStatus.PREPARING, TaskStatus.TRANSLATING):
+             return False  # Can't cancel a completed/failed task
+
+         task.translator.cancel()
+         task.status = TaskStatus.CANCELLED
+         task.completed_at = time.time()
+
+         with self._lock:
+             if self._active_task_id == task_id:
+                 self._active_task_id = None
+
+         logger.info(f"Task {task_id} cancelled.")
+         return True
+
+     def cleanup_old_tasks(self):
+         """
+         Remove completed/failed tasks older than cleanup_after_seconds.
+         Also deletes their files from disk to free space.
+
+         Called periodically (e.g., before each new upload).
+         """
+         now = time.time()
+         to_remove = []
+
+         with self._lock:
+             for task_id, task in self._tasks.items():
+                 if task.status in (
+                     TaskStatus.COMPLETED,
+                     TaskStatus.FAILED,
+                     TaskStatus.CANCELLED,
+                 ):
+                     age = now - task.completed_at if task.completed_at > 0 else now - task.created_at
+                     if age > self.cleanup_after_seconds:
+                         to_remove.append(task_id)
+
+         for task_id in to_remove:
+             self._remove_task_files(task_id)
+             with self._lock:
+                 del self._tasks[task_id]
+             logger.info(f"Cleaned up old task: {task_id}")
+
+     def get_output_path(self, task_id: str) -> Optional[str]:
+         """Get the output file path for a completed task."""
+         task = self.get_task(task_id)
+         if task is None:
+             return None
+         if task.status != TaskStatus.COMPLETED:
+             return None
+         if not os.path.exists(task.output_path):
+             return None
+         return task.output_path
+
+     # -----------------------------------------------------------------------
+     # PRIVATE METHODS
+     # -----------------------------------------------------------------------
+
+     def _run_translation(self, task: TranslationTask):
+         """
+         The actual translation runner - executes in a background thread.
+
+         ERROR HANDLING:
+         - Any exception is caught and stored in the task
+         - Task status is set to FAILED
+         - The active_task_id is cleared so new tasks can be submitted
+         - Already-written content is preserved on disk
+         """
+         try:
+             task.status = TaskStatus.PREPARING
+             task.progress.set_status("preparing", "Starting translation...")
+
+             logger.info(
+                 f"Task {task.task_id}: Starting translation of "
+                 f"'{task.original_filename}'"
+             )
+
+             # Run the translation (this blocks until complete)
+             task.translator.translate_file(task.input_path, task.output_path)
+
+             # Check if it was cancelled during translation
+             if task.translator._cancel_flag.is_set():
+                 task.status = TaskStatus.CANCELLED
+                 task.completed_at = time.time()
+                 logger.info(f"Task {task.task_id}: Cancelled during translation.")
+             else:
+                 task.status = TaskStatus.COMPLETED
+                 task.completed_at = time.time()
+                 logger.info(
+                     f"Task {task.task_id}: Translation COMPLETED. "
+                     f"Output: {task.output_path}"
+                 )
+
+         except Exception as e:
+             task.status = TaskStatus.FAILED
+             task.error_message = str(e)
+             task.completed_at = time.time()
+             task.progress.set_status("failed", f"Error: {str(e)}")
+             logger.exception(f"Task {task.task_id}: FAILED with error: {e}")
+
+         finally:
+             # Clear the active task so new uploads are accepted
+             with self._lock:
+                 if self._active_task_id == task.task_id:
+                     self._active_task_id = None
+
+     def _remove_task_files(self, task_id: str):
+         """Safely delete task files from disk."""
+         task = self._tasks.get(task_id)
+         if task is None:
+             return
+
+         for path in [task.input_path, task.output_path]:
+             try:
+                 if path and os.path.exists(path):
+                     os.remove(path)
+                     logger.debug(f"Deleted file: {path}")
+             except OSError as e:
+                 logger.warning(f"Could not delete {path}: {e}")
+
+
+ # ============================================================================
+ # GLOBAL SINGLETON - Used by the FastAPI app
+ # ============================================================================
+ # Create a single TaskManager instance shared across the entire application.
+ # This is safe because TaskManager is thread-safe internally.
+
+ _global_task_manager: Optional[TaskManager] = None
+
+
+ def get_task_manager() -> TaskManager:
+     """Get or create the global TaskManager singleton."""
+     global _global_task_manager
+     if _global_task_manager is None:
+         config = TranslatorConfig()
+         _global_task_manager = TaskManager(config=config)
+     return _global_task_manager
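The same lifecycle can be exercised without the web layer, which helps when debugging the manager in isolation. A minimal driver sketch, assuming a small sample.txt exists in the working directory (an illustrative path, not part of this commit); note that starting a task sends real requests to Google Translate through translator_engine:

import time

from task_manager import get_task_manager, TaskStatus

tm = get_task_manager()  # global singleton, shared with the FastAPI app

# Lifecycle: QUEUED → PREPARING → TRANSLATING → COMPLETED / FAILED / CANCELLED
task = tm.create_task(original_filename="sample.txt", input_path="sample.txt")
tm.start_task(task.task_id)

while task.status not in (TaskStatus.COMPLETED, TaskStatus.FAILED, TaskStatus.CANCELLED):
    info = tm.get_task_progress(task.task_id)
    print(info["status"], info["progress"]["percent_complete"], "%")
    time.sleep(2)

print("Final state:", task.status.value, "| output:", tm.get_output_path(task.task_id))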
translator_engine.py ADDED
@@ -0,0 +1,740 @@
1
+ """
2
+ translator_engine.py
3
+ ====================
4
+ Core Translation Engine for Massive Text Files (English -> Hindi)
5
+
6
+ ARCHITECTURE DECISIONS:
7
+ -----------------------
8
+ 1. STREAMING READ/WRITE: We never hold the full translated file in RAM.
9
+ - Read source file paragraph-by-paragraph
10
+ - Translate each paragraph
11
+ - Immediately append translated text to output file on disk
12
+ - If process crashes at paragraph 25,000 β€” the first 24,999 are SAFE on disk
13
+
14
+ 2. CHUNKING STRATEGY:
15
+ - Split file by double-newline (\n\n) to get paragraphs
16
+ - Each paragraph is one translation unit
17
+ - Single newlines (\n) WITHIN a paragraph are preserved by translating
18
+ line-by-line inside each paragraph
19
+ - This guarantees ZERO text loss and exact structural preservation
20
+
21
+ 3. RATE LIMITING & IP BAN PREVENTION:
22
+ - Each translation call has a mandatory sleep (configurable, default 0.3s)
23
+ - ThreadPoolExecutor with LIMITED workers (default 2) β€” NOT aggressive
24
+ - On HTTP 429 (Too Many Requests): exponential backoff via tenacity
25
+ - Base wait: 10 seconds, multiplier: 2x, max wait: 320 seconds
26
+ - Max 10 retries per chunk before marking as FAILED
27
+ - On connection drop: same retry logic catches ConnectionError, Timeout
28
+
29
+ 4. CONCURRENCY MODEL:
30
+ - We use ThreadPoolExecutor but with STRICT controls:
31
+ a) Max 2 workers (Google free tier can't handle more)
32
+ b) Each worker has mandatory inter-request delay
33
+ c) A global rate limiter (threading.Semaphore) prevents burst
34
+ - This is NOT about max speed β€” it's about RELIABLE completion
35
+ of 50,000+ paragraphs without getting IP-banned
36
+
37
+ 5. ERROR HANDLING SCENARIOS (explicitly documented):
38
+ - HTTP 429 Too Many Requests β†’ exponential backoff, retry up to 10x
39
+ - ConnectionError / Timeout β†’ same retry logic, waits and retries
40
+ - InvalidURL / encoding error β†’ log error, write original text as fallback
41
+ - Process crash β†’ all previously written paragraphs are safe on disk
42
+ - Google temporary block β†’ backoff will wait up to ~320s per retry
43
+ """
44
+
45
+ import os
46
+ import re
47
+ import time
48
+ import threading
49
+ import logging
50
+ from typing import Optional, Callable
51
+ from concurrent.futures import ThreadPoolExecutor, as_completed
52
+ from dataclasses import dataclass, field
53
+
54
+ from deep_translator import GoogleTranslator
55
+ from tenacity import (
56
+ retry,
57
+ stop_after_attempt,
58
+ wait_exponential,
59
+ retry_if_exception_type,
60
+ before_sleep_log,
61
+ RetryError,
62
+ )
63
+
64
+ # ============================================================================
65
+ # LOGGING SETUP
66
+ # ============================================================================
67
+ logging.basicConfig(
68
+ level=logging.INFO,
69
+ format="%(asctime)s [%(threadName)s] %(levelname)s: %(message)s",
70
+ )
71
+ logger = logging.getLogger("translator_engine")
72
+
73
+
74
+ # ============================================================================
75
+ # CONFIGURATION β€” All tunables in one place
76
+ # ============================================================================
77
+ @dataclass
78
+ class TranslatorConfig:
79
+ """
80
+ All configuration for the translation engine.
81
+ Tuned for Google Free Translator on a 16GB RAM server.
82
+
83
+ IMPORTANT FOR HUGGINGFACE SPACES:
84
+ - HF Spaces have shared IPs, so we must be EXTRA conservative
85
+ - Lower workers, higher delays to avoid bans
86
+ """
87
+
88
+ # --- Translation Settings ---
89
+ source_lang: str = "en"
90
+ target_lang: str = "hi"
91
+
92
+ # --- Chunking ---
93
+ # Google Translator free tier has a ~5000 char limit per request.
94
+ # We set our max chunk size to 4500 to leave safety margin.
95
+ max_chunk_chars: int = 4500
96
+
97
+ # --- Rate Limiting ---
98
+ # Minimum seconds to wait BETWEEN translation requests (per thread)
99
+ min_request_delay: float = 0.5
100
+
101
+ # How many concurrent translation threads
102
+ # KEEP THIS LOW β€” Google will ban aggressive concurrent requests
103
+ max_workers: int = 2
104
+
105
+ # Global semaphore β€” max simultaneous in-flight requests across all threads
106
+ max_concurrent_requests: int = 2
107
+
108
+ # --- Retry / Backoff (for HTTP 429, Connection drops) ---
109
+ max_retries: int = 10 # Max retries per single chunk
110
+ backoff_base_wait: int = 10 # First retry waits 10 seconds
111
+ backoff_multiplier: int = 2 # Each subsequent retry doubles wait
112
+ backoff_max_wait: int = 320 # Never wait more than 320s per retry
113
+
114
+ # --- File I/O ---
115
+ # How many paragraphs to batch into one write operation
116
+ # (each paragraph is still translated individually, but we flush to disk
117
+ # every N paragraphs for I/O efficiency)
118
+ disk_flush_interval: int = 5
119
+
120
+
121
+ # ============================================================================
122
+ # PROGRESS TRACKER β€” Thread-safe progress monitoring
123
+ # ============================================================================
124
+ @dataclass
125
+ class TranslationProgress:
126
+ """
127
+ Thread-safe progress tracker.
128
+ The frontend polls this to show real-time progress.
129
+ """
130
+ total_paragraphs: int = 0
131
+ translated_paragraphs: int = 0
132
+ failed_paragraphs: int = 0
133
+ status: str = "idle" # idle, preparing, translating, completed, failed
134
+ error_message: str = ""
135
+ current_phase: str = ""
136
+ start_time: float = 0.0
137
+ file_name: str = ""
138
+ output_file: str = ""
139
+
140
+ _lock: threading.Lock = field(default_factory=threading.Lock, repr=False)
141
+
142
+ def increment_translated(self):
143
+ with self._lock:
144
+ self.translated_paragraphs += 1
145
+
146
+ def increment_failed(self):
147
+ with self._lock:
148
+ self.failed_paragraphs += 1
149
+
150
+ def set_status(self, status: str, phase: str = ""):
151
+ with self._lock:
152
+ self.status = status
153
+ if phase:
154
+ self.current_phase = phase
155
+
156
+ def to_dict(self) -> dict:
157
+ with self._lock:
158
+ elapsed = 0
159
+ speed = 0
160
+ eta_seconds = 0
161
+
162
+ if self.start_time > 0 and self.translated_paragraphs > 0:
163
+ elapsed = time.time() - self.start_time
164
+ speed = self.translated_paragraphs / elapsed # paragraphs per second
165
+ remaining = self.total_paragraphs - self.translated_paragraphs
166
+ eta_seconds = remaining / speed if speed > 0 else 0
167
+
168
+ return {
169
+ "total_paragraphs": self.total_paragraphs,
170
+ "translated_paragraphs": self.translated_paragraphs,
171
+ "failed_paragraphs": self.failed_paragraphs,
172
+ "status": self.status,
173
+ "error_message": self.error_message,
174
+ "current_phase": self.current_phase,
175
+ "file_name": self.file_name,
176
+ "output_file": self.output_file,
177
+ "elapsed_seconds": round(elapsed, 1),
178
+ "speed_per_second": round(speed, 2),
179
+ "eta_seconds": round(eta_seconds, 1),
180
+ "percent_complete": round(
181
+ (self.translated_paragraphs / self.total_paragraphs * 100)
182
+ if self.total_paragraphs > 0
183
+ else 0,
184
+ 1,
185
+ ),
186
+ }
187
+
188
+
189
+ # ============================================================================
190
+ # SINGLE CHUNK TRANSLATOR β€” With full retry & backoff logic
191
+ # ============================================================================
192
+ class ChunkTranslator:
193
+ """
194
+ Translates a single text chunk (paragraph or sub-paragraph).
195
+
196
+ ERROR HANDLING SCENARIOS:
197
+ -------------------------
198
+ SCENARIO 1 β€” HTTP 429 Too Many Requests (IP Rate Limit):
199
+ Google returns 429 when we're sending too many requests.
200
+ β†’ tenacity catches this, waits 10s (then 20s, 40s, 80s, 160s, 320s...)
201
+ β†’ Retries the EXACT same chunk up to 10 times
202
+ β†’ After 10 failures, returns original text as fallback (no data loss)
203
+
204
+ SCENARIO 2 β€” Connection Drop (ConnectionError, Timeout):
205
+ Network issues, Google server down, DNS failure, etc.
206
+ β†’ Same retry logic as above catches these exceptions
207
+ β†’ Exponential backoff gives the network time to recover
208
+ β†’ If persistent failure, returns original text
209
+
210
+ SCENARIO 3 β€” Encoding/Invalid Input:
211
+ Malformed characters, binary data mixed in text
212
+ β†’ Caught by generic Exception handler
213
+ β†’ Returns original text (preserving content even if untranslated)
214
+
215
+ SCENARIO 4 β€” Google Temporary IP Block (longer than rate limit):
216
+ Sometimes Google blocks an IP for minutes, not just seconds
217
+ β†’ Our max backoff is 320 seconds (~5 minutes)
218
+ β†’ With 10 retries and exponential backoff, total possible wait
219
+ is ~10 + 20 + 40 + 80 + 160 + 320 + 320 + 320 + 320 + 320 = ~31 minutes
220
+ β†’ This is usually enough for Google to unblock
221
+ """
222
+
223
+ def __init__(self, config: TranslatorConfig):
224
+ self.config = config
225
+ # Global semaphore: limits total in-flight requests across all threads
226
+ self.request_semaphore = threading.Semaphore(config.max_concurrent_requests)
227
+ # Per-thread rate limiter
228
+ self._last_request_time = threading.local()
229
+
230
+ def _enforce_rate_limit(self):
231
+ """
232
+ Ensures minimum delay between requests on the SAME thread.
233
+ This prevents any single thread from hammering Google.
234
+ """
235
+ now = time.time()
236
+ last = getattr(self._last_request_time, "value", 0)
237
+ elapsed = now - last
238
+ if elapsed < self.config.min_request_delay:
239
+ sleep_time = self.config.min_request_delay - elapsed
240
+ time.sleep(sleep_time)
241
+ self._last_request_time.value = time.time()
242
+
243
+ @retry(
244
+ # Retry on these specific exceptions (network + rate limit errors)
245
+ retry=retry_if_exception_type((
246
+ Exception, # deep-translator wraps errors in generic exceptions
247
+ )),
248
+ # Exponential backoff: 10s β†’ 20s β†’ 40s β†’ 80s β†’ 160s β†’ 320s (capped)
249
+ wait=wait_exponential(
250
+ multiplier=2,
251
+ min=10,
252
+ max=320,
253
+ ),
254
+ # Maximum 10 retry attempts per chunk
255
+ stop=stop_after_attempt(10),
256
+ # Log before each retry sleep (helps debugging)
257
+ before_sleep=before_sleep_log(logger, logging.WARNING),
258
+ # Re-raise the final exception if all retries exhausted
259
+ reraise=True,
260
+ )
261
+ def _translate_with_retry(self, text: str) -> str:
262
+ """
263
+ The actual translation call, wrapped with tenacity retry decorator.
264
+
265
+ This is the INNERMOST function β€” it talks to Google Translate.
266
+ If Google returns 429 or connection drops, tenacity handles retry.
267
+ """
268
+ # Acquire semaphore β€” blocks if too many concurrent requests
269
+ with self.request_semaphore:
270
+ # Enforce per-thread rate limit
271
+ self._enforce_rate_limit()
272
+
273
+ # Create a fresh translator instance per call
274
+ # (deep-translator is not guaranteed thread-safe with shared instances)
275
+ translator = GoogleTranslator(
276
+ source=self.config.source_lang,
277
+ target=self.config.target_lang,
278
+ )
279
+
280
+ result = translator.translate(text=text)
281
+
282
+ # Google sometimes returns None for empty strings
283
+ if result is None:
284
+ return text
285
+
286
+ return result
287
+
288
+ def translate_chunk(self, text: str) -> str:
289
+ """
290
+ Public API: Translate a text chunk with full error handling.
291
+
292
+ Returns translated text on success.
293
+ Returns ORIGINAL text on permanent failure (zero text loss guarantee).
294
+ """
295
+ # Don't waste API calls on empty/whitespace-only text
296
+ if not text or not text.strip():
297
+ return text
298
+
299
+ try:
300
+ return self._translate_with_retry(text)
301
+ except RetryError as e:
302
+ # All 10 retries exhausted β€” log and return original
303
+ logger.error(
304
+ f"PERMANENT FAILURE after {self.config.max_retries} retries. "
305
+ f"Chunk (first 100 chars): '{text[:100]}...'. "
306
+ f"Error: {e.last_attempt.exception()}"
307
+ )
308
+ # ZERO TEXT LOSS: Return original text rather than losing it
309
+ return text
310
+ except Exception as e:
311
+ # Unexpected error β€” still preserve the original text
312
+ logger.error(f"Unexpected translation error: {e}. Preserving original text.")
313
+ return text
314
+
315
+
316
+ # ============================================================================
317
+ # PARAGRAPH SPLITTER β€” Preserves exact structure
318
+ # ============================================================================
319
+ class TextSplitter:
320
+ """
321
+ Splits massive text files into translation-ready chunks while
322
+ preserving EXACT line break structure.
323
+
324
+ STRATEGY:
325
+ 1. Split file by double-newline (\n\n) β†’ paragraphs
326
+ 2. For each paragraph, if it's within char limit β†’ translate as-is
327
+ 3. If paragraph exceeds char limit β†’ split by single newline (\n) β†’ lines
328
+ 4. If a single line exceeds char limit β†’ split by sentences
329
+ 5. After translation, rejoin with EXACT same separators
330
+
331
+ This guarantees:
332
+ - Every \n is preserved
333
+ - Every \n\n is preserved
334
+ - No lines are skipped or merged
335
+ """
336
+
337
+ def __init__(self, config: TranslatorConfig):
338
+ self.config = config
339
+
340
+ def split_into_paragraphs(self, file_path: str):
341
+ """
342
+ Generator: Yields paragraphs from file WITHOUT loading entire file into RAM.
343
+
344
+ Uses streaming read β€” reads file line by line, accumulates paragraphs,
345
+ yields when a paragraph boundary (double newline) is found.
346
+
347
+ Memory usage: Only ONE paragraph in RAM at a time.
348
+ For a 3-million-word file, this is crucial.
349
+ """
350
+ current_paragraph_lines = []
351
+ consecutive_empty_lines = 0
352
+
353
+ with open(file_path, "r", encoding="utf-8", errors="replace") as f:
354
+ for line in f:
355
+ # Remove only the trailing newline for analysis
356
+ stripped = line.rstrip("\n")
357
+
358
+ if stripped == "":
359
+ consecutive_empty_lines += 1
360
+ if consecutive_empty_lines >= 1 and current_paragraph_lines:
361
+ # End of a paragraph β€” yield it
362
+ paragraph_text = "\n".join(current_paragraph_lines)
363
+ yield paragraph_text
364
+ current_paragraph_lines = []
365
+ # Yield empty line as separator (preserves exact blank lines)
366
+ yield ""
367
+ else:
368
+ consecutive_empty_lines = 0
369
+ current_paragraph_lines.append(stripped)
370
+
371
+ # Don't forget the last paragraph (file might not end with \n\n)
372
+ if current_paragraph_lines:
373
+ paragraph_text = "\n".join(current_paragraph_lines)
374
+ yield paragraph_text
375
+
376
+ def split_paragraph_into_chunks(self, paragraph: str) -> list[str]:
377
+ """
378
+ If a paragraph is too long for one API call, split it into
379
+ smaller chunks while preserving line boundaries.
380
+
381
+ Returns a list of chunk strings. When rejoined with \n, they
382
+ reconstruct the original paragraph exactly.
383
+ """
384
+ if len(paragraph) <= self.config.max_chunk_chars:
385
+ return [paragraph]
386
+
387
+ # Split by lines first
388
+ lines = paragraph.split("\n")
389
+ chunks = []
390
+ current_chunk_lines = []
391
+ current_chunk_len = 0
392
+
393
+ for line in lines:
394
+ line_len = len(line) + 1 # +1 for the \n separator
395
+
396
+ if current_chunk_len + line_len > self.config.max_chunk_chars:
397
+ if current_chunk_lines:
398
+ chunks.append("\n".join(current_chunk_lines))
399
+ current_chunk_lines = []
400
+ current_chunk_len = 0
401
+
402
+ # If a single line is STILL too long, split by sentences
403
+ if line_len > self.config.max_chunk_chars:
404
+ sentence_chunks = self._split_long_line(line)
405
+ chunks.extend(sentence_chunks)
406
+ continue
407
+
408
+ current_chunk_lines.append(line)
409
+ current_chunk_len += line_len
410
+
411
+ if current_chunk_lines:
412
+ chunks.append("\n".join(current_chunk_lines))
413
+
414
+ return chunks
415
+
416
+ def _split_long_line(self, line: str) -> list[str]:
417
+ """
418
+ Last resort: Split a very long single line by sentence boundaries.
419
+ Preserves all text β€” just breaks it into translatable pieces.
420
+ """
421
+ # Split on sentence endings while keeping the delimiter
422
+ sentences = re.split(r"(?<=[.!?])\s+", line)
423
+ chunks = []
424
+ current = ""
425
+
426
+ for sentence in sentences:
427
+ if len(current) + len(sentence) + 1 > self.config.max_chunk_chars:
428
+ if current:
429
+ chunks.append(current)
430
+ current = sentence
431
+ else:
432
+ current = f"{current} {sentence}" if current else sentence
433
+
434
+ if current:
435
+ chunks.append(current)
436
+
437
+ # If we still have chunks that are too long (no sentence boundaries),
438
+ # do a hard split at max_chunk_chars (last resort, very rare)
439
+ final_chunks = []
440
+ for chunk in chunks:
441
+ if len(chunk) > self.config.max_chunk_chars:
442
+ for i in range(0, len(chunk), self.config.max_chunk_chars):
443
+ final_chunks.append(chunk[i : i + self.config.max_chunk_chars])
444
+ else:
445
+ final_chunks.append(chunk)
446
+
447
+ return final_chunks
448
+
449
+ def count_paragraphs(self, file_path: str) -> int:
450
+ """
451
+ Quick scan to count total paragraphs for progress tracking.
452
+ Streams through file β€” doesn't load into RAM.
453
+ """
454
+ count = 0
455
+ for _ in self.split_into_paragraphs(file_path):
456
+ count += 1
457
+ return count
458
+
459
+
460
+ # ============================================================================
461
+ # MAIN TRANSLATION ORCHESTRATOR β€” Ties everything together
462
+ # ============================================================================
463
+ class MassiveFileTranslator:
464
+ """
465
+ The main orchestrator that:
466
+ 1. Reads the input file (streaming)
467
+ 2. Splits into paragraphs
468
+ 3. Translates each paragraph (with concurrency + rate limiting)
469
+ 4. Writes translated text to output file (streaming/append)
470
+ 5. Tracks progress in real-time
471
+
472
+ STREAMING WRITE GUARANTEE:
473
+ --------------------------
474
+ Every translated paragraph is IMMEDIATELY flushed to the output file.
475
+ If the process crashes at paragraph 25,000 out of 50,000:
476
+ - The output file contains paragraphs 1-24,999 fully translated
477
+ - No data is held only in RAM
478
+ - You can resume or at least salvage the partial translation
479
+ """
480
+
481
+ def __init__(
482
+ self,
483
+ config: Optional[TranslatorConfig] = None,
484
+ progress: Optional[TranslationProgress] = None,
485
+ ):
486
+ self.config = config or TranslatorConfig()
487
+ self.progress = progress or TranslationProgress()
488
+ self.chunk_translator = ChunkTranslator(self.config)
489
+ self.text_splitter = TextSplitter(self.config)
490
+ # Lock for sequential file writing (multiple threads, one output file)
491
+ self._write_lock = threading.Lock()
492
+ # Flag to support graceful cancellation
493
+ self._cancel_flag = threading.Event()
494
+
495
+ def cancel(self):
496
+ """Signal the translation to stop gracefully."""
497
+ self._cancel_flag.set()
498
+ self.progress.set_status("failed", "Cancelled by user")
499
+
+     def translate_file(self, input_path: str, output_path: str) -> str:
+         """
+         Main entry point: translate an entire file.
+
+         This method is designed to be called in a background thread.
+         It streams through the input, translates, and streams to the output.
+
+         Returns the output file path on completion.
+         """
+         try:
+             self.progress.set_status("preparing", "Counting paragraphs...")
+             self.progress.file_name = os.path.basename(input_path)
+             self.progress.output_file = os.path.basename(output_path)
+
+             # Phase 1: Count total paragraphs (quick streaming scan)
+             logger.info(f"Counting paragraphs in: {input_path}")
+             total = self.text_splitter.count_paragraphs(input_path)
+             self.progress.total_paragraphs = total
+             logger.info(f"Total paragraphs to translate: {total}")
+
+             if total == 0:
+                 self.progress.set_status("completed", "File is empty")
+                 # Create an empty output file
+                 open(output_path, "w", encoding="utf-8").close()
+                 return output_path
+
+             # Phase 2: Clear/create the output file ("w" mode truncates)
+             with open(output_path, "w", encoding="utf-8") as f:
+                 f.write("")  # no-op; opening in "w" already emptied the file
+
+             # Phase 3: Stream-translate with ordered sequential writing
+             self.progress.set_status("translating", "Translation in progress...")
+             self.progress.start_time = time.time()
+
+             self._translate_sequential_with_threads(input_path, output_path)
+
+             # Phase 4: Done
+             if not self._cancel_flag.is_set():
+                 self.progress.set_status("completed", "Translation finished!")
+                 logger.info(
+                     f"Translation complete. Output: {output_path}. "
+                     f"Translated: {self.progress.translated_paragraphs}, "
+                     f"Failed: {self.progress.failed_paragraphs}"
+                 )
+
+             return output_path
+
+         except Exception as e:
+             logger.exception(f"Fatal error during translation: {e}")
+             self.progress.set_status("failed", f"Fatal error: {str(e)}")
+             self.progress.error_message = str(e)
+             raise
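+
+     # Progress-polling sketch (editor's note; assumes TranslationProgress
+     # exposes the status/count fields used above):
+     #     while translator.progress.status == "translating":
+     #         done = translator.progress.translated_paragraphs
+     #         total = max(translator.progress.total_paragraphs, 1)
+     #         print(f"{100 * done / total:.1f}% complete")
+     #         time.sleep(2)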
+
+     def _translate_sequential_with_threads(
+         self, input_path: str, output_path: str
+     ):
+         """
+         ORDERED translation with a thread pool.
+
+         WHY ORDERING MATTERS:
+         We must write paragraphs to the output file in the EXACT same order
+         as the input file. Random ordering would scramble the book.
+
+         STRATEGY:
+         - We use a thread pool for concurrent translation
+         - But we process in BATCHES to maintain order
+         - Each batch = N paragraphs (N = max_workers * 3 for pipeline efficiency)
+         - Within a batch, paragraphs are translated concurrently
+         - After the entire batch is done, we write the results IN ORDER to disk
+         - Then we move on to the next batch
+
+         This gives us:
+         ✅ Concurrency (multiple paragraphs translated simultaneously)
+         ✅ Strict ordering (output matches input structure exactly)
+         ✅ Streaming writes (each batch is flushed to disk immediately)
+         """
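+         # Editor's note: with a hypothetical max_workers of 5, batch_size
+         # below is 15, i.e. up to 15 paragraphs are in flight before one
+         # ordered, flushed write to disk.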
+         batch_size = self.config.max_workers * 3  # Pipeline efficiency
+         paragraph_generator = self.text_splitter.split_into_paragraphs(input_path)
+
+         with ThreadPoolExecutor(
+             max_workers=self.config.max_workers,
+             thread_name_prefix="translator",
+         ) as executor:
+
+             batch = []
+             batch_indices = []
+             paragraph_index = 0
+
+             for paragraph in paragraph_generator:
+                 if self._cancel_flag.is_set():
+                     logger.warning("Translation cancelled.")
+                     return
+
+                 batch.append(paragraph)
+                 batch_indices.append(paragraph_index)
+                 paragraph_index += 1
+
+                 # Process the batch when it is full
+                 if len(batch) >= batch_size:
+                     self._process_batch(
+                         executor, batch, batch_indices, output_path
+                     )
+                     batch = []
+                     batch_indices = []
+
+             # Process remaining paragraphs in the last (partial) batch
+             if batch:
+                 self._process_batch(
+                     executor, batch, batch_indices, output_path
+                 )
+
+     def _process_batch(
+         self,
+         executor: ThreadPoolExecutor,
+         batch: list[str],
+         indices: list[int],
+         output_path: str,
+     ):
+         """
+         Process a batch of paragraphs:
+         1. Submit all of them to the thread pool for concurrent translation
+         2. Wait for ALL of them to complete
+         3. Write the results to disk IN ORDER
+         """
+         # Submit all paragraphs in this batch to the thread pool.
+         # Note the dict deliberately mixes key types: int -> paragraph for
+         # blanks that skip translation, Future -> int for submitted work.
+         future_to_index = {}
+         for i, paragraph in enumerate(batch):
+             if self._cancel_flag.is_set():
+                 return
+
+             # Empty paragraphs (blank lines) don't need translation
+             if not paragraph.strip():
+                 future_to_index[i] = paragraph  # Store directly
+             else:
+                 future = executor.submit(self._translate_single_paragraph, paragraph)
+                 future_to_index[future] = i
+
+         # Collect results in order
+         results = [""] * len(batch)
+
+         # First, fill in the empty paragraphs (no futures for these)
+         for key, value in list(future_to_index.items()):
+             if isinstance(key, int):
+                 results[key] = value
+                 del future_to_index[key]
+
+         # Wait for the translation futures
+         for future in as_completed(future_to_index):
+             idx = future_to_index[future]
+             try:
+                 translated_text = future.result()
+                 results[idx] = translated_text
+             except Exception as e:
+                 # This shouldn't happen (translate_chunk handles all errors),
+                 # but just in case, preserve the original text
+                 logger.error(f"Unexpected future error: {e}")
+                 results[idx] = batch[idx]  # Original text as fallback
+                 self.progress.increment_failed()
+
+         # Write the entire batch to disk IN ORDER
+         with self._write_lock:
+             with open(output_path, "a", encoding="utf-8") as f:
+                 for i, translated in enumerate(results):
+                     # getsize() sees only previously flushed batches, so the
+                     # very first paragraph of the file gets no separator
+                     if i > 0 or os.path.getsize(output_path) > 0:
+                         f.write("\n\n")  # Paragraph separator
+                     f.write(translated)
+                 # Flush to disk immediately - crash protection
+                 f.flush()
+                 os.fsync(f.fileno())
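+
+     # Ordering sketch (editor's note): as_completed() yields futures in
+     # completion order, but writing each result into results[idx] restores
+     # input order before anything touches the file. Minimal pattern:
+     #     futures = {executor.submit(work, x): i for i, x in enumerate(items)}
+     #     ordered = [None] * len(items)
+     #     for fut in as_completed(futures):
+     #         ordered[futures[fut]] = fut.result()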
+
+     def _translate_single_paragraph(self, paragraph: str) -> str:
+         """
+         Translate a single paragraph, handling the case where it might
+         be too long for a single API call.
+
+         Preserves the paragraph's internal line breaks exactly.
+         """
+         # Split the paragraph into API-friendly chunks if needed
+         chunks = self.text_splitter.split_paragraph_into_chunks(paragraph)
+
+         if len(chunks) == 1:
+             # The paragraph fits in one API call, but we still need to
+             # preserve internal line breaks.
+             # Strategy: translate line by line within the paragraph.
+             lines = paragraph.split("\n")
+             translated_lines = []
+             for line in lines:
+                 if line.strip():
+                     translated_line = self.chunk_translator.translate_chunk(line)
+                     translated_lines.append(translated_line)
+                 else:
+                     # Preserve empty lines within the paragraph
+                     translated_lines.append(line)
+
+             self.progress.increment_translated()
+             return "\n".join(translated_lines)
+         else:
+             # Large paragraph: translate each chunk line by line, then rejoin
+             translated_chunks = []
+             for chunk in chunks:
+                 lines = chunk.split("\n")
+                 translated_lines = []
+                 for line in lines:
+                     if line.strip():
+                         translated_line = self.chunk_translator.translate_chunk(line)
+                         translated_lines.append(translated_line)
+                     else:
+                         translated_lines.append(line)
+                 translated_chunks.append("\n".join(translated_lines))
+
+             self.progress.increment_translated()
+             return "\n".join(translated_chunks)
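+
+     # Line-preservation sketch (editor's note): a two-line paragraph such as
+     #     "He opened the door.\nShe was already gone."
+     # is translated one line at a time and rejoined with "\n", so the output
+     # keeps its single internal line break in the same position.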
+
+
+ # ============================================================================
+ # QUICK TEST - Run this file directly to test translation
+ # ============================================================================
+ if __name__ == "__main__":
+     print("=" * 60)
+     print("TRANSLATOR ENGINE - Quick Self-Test")
+     print("=" * 60)
+
+     config = TranslatorConfig()
+     progress = TranslationProgress()
+
+     # Test single-chunk translation
+     chunk_translator = ChunkTranslator(config)
+     test_text = "Hello, how are you? This is a test of the translation engine."
+
+     print(f"\nOriginal: {test_text}")
+     translated = chunk_translator.translate_chunk(test_text)
+     print(f"Translated: {translated}")
+
+     # Test the text splitter
+     splitter = TextSplitter(config)
+     long_text = "A" * 5000
+     chunks = splitter.split_paragraph_into_chunks(long_text)
+     print(f"\nLong text ({len(long_text)} chars) split into {len(chunks)} chunks")
+     print(f"Chunk sizes: {[len(c) for c in chunks]}")
+
+     print("\n✅ Self-test complete. Engine is functional.")
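+
+     # Optional end-to-end test (editor's sketch, left commented out because
+     # it would issue many real API calls; "sample_en.txt" is a hypothetical
+     # input file):
+     #     translator = MassiveFileTranslator(config, progress)
+     #     translator.translate_file("sample_en.txt", "sample_hi.txt")
+     #     print(progress.translated_paragraphs, "paragraphs translated")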