Spaces:

Arsive
/

lt_space

Sleeping

App Files Files Community

Arsive2 commited on Apr 15, 2025

Commit

afd2cc6

1 Parent(s): 8720cc4

Ctranslate performance upgrade

Browse files

Files changed (7) hide show

Dockerfile +34 -29
README.md +157 -1
api_server.py +81 -47
app/models/benchmark_script.py +364 -0
app/models/ct2_model_converter.py +282 -0
app/models/translation_model_ct2.py +277 -0
requirements.txt +3 -1

Dockerfile CHANGED Viewed

@@ -1,43 +1,48 @@
-FROM python:3.10-bullseye
-WORKDIR /app
-# Install system dependencies
-RUN apt-get update && apt-get install -y \
     build-essential \
-    libffi-dev \
     git \
     && rm -rf /var/lib/apt/lists/*
-# Set up directories with proper permissions
-RUN mkdir -p /app/.cache /app/nltk_data && \
-    chmod 777 /app/.cache /app/nltk_data
-# Set environment variables for cache directories
-ENV TRANSFORMERS_CACHE=/app/.cache
-ENV HF_HOME=/app/.cache
-ENV NLTK_DATA=/app/nltk_data
-# Copy requirements file
 COPY requirements.txt .
-# Install Python dependencies
-RUN pip install --no-cache-dir -r requirements.txt
-# Pre-download NLTK data before copying application code
 RUN python -c "import nltk; nltk.download('punkt', download_dir='/app/nltk_data')"
-# Copy application code
-COPY . .
-# Expose the port for the API
-EXPOSE 7860
-# Set environment variables
-ENV PYTHONUNBUFFERED=1
-ENV OMP_NUM_THREADS=4
-ENV MKL_NUM_THREADS=4
-ENV TORCH_CPU_NUM_THREADS=4
-# Run the API server
-CMD ["uvicorn", "api_server:app", "--host", "0.0.0.0", "--port", "7860", "--timeout-keep-alive", "900"]

+FROM python:3.10-slim
+LABEL maintainer="Arsive <arsive.ai@gmail.com>"
+LABEL description="Universal Translator API with CTranslate2 optimization"
+ENV PYTHONDONTWRITEBYTECODE=1 \
+    PYTHONUNBUFFERED=1 \
+    CT2_MODEL_CACHE=/app/.cache/ct2_models \
+    NLTK_DATA=/app/nltk_data
+RUN apt-get update && apt-get install -y --no-install-recommends \
     build-essential \
+    python3-dev \
     git \
+    curl \
+    wget \
+    unzip \
+    cmake \
+    pkg-config \
+    libpoppler-cpp-dev \
+    poppler-utils \
+    libsm6 \
+    libxext6 \
+    libxrender-dev \
+    libgl1-mesa-glx \
+    libglib2.0-0 \
     && rm -rf /var/lib/apt/lists/*
+WORKDIR /app
+RUN mkdir -p /app/app/models /app/uploads /app/.cache/ct2_models /app/nltk_data /app/translation_logs
 COPY requirements.txt .
+RUN pip install --upgrade pip && \
+    pip install torch==2.0.1 && \
+    pip install --no-cache-dir -r requirements.txt
 RUN python -c "import nltk; nltk.download('punkt', download_dir='/app/nltk_data')"
+COPY app/ /app/app/
+COPY *.py /app/
+COPY fix_permissions.sh /app/
+RUN chmod +x /app/fix_permissions.sh && \
+    /app/fix_permissions.sh
+EXPOSE 8000
+CMD ["gunicorn", "-b", "0.0.0.0:8000", "--timeout", "300", "--workers", "1", "--threads", "4", "app:app"]

README.md CHANGED Viewed

@@ -9,7 +9,163 @@ license: mit
 short_description: Language translation space
 ---
-# Universal Translator API
 This is a Hugging Face Spaces deployment of the Universal Translator API service, which provides translation capabilities using the MADLAD-400 3B model.

 short_description: Language translation space
 ---
+# Universal Translator with CTranslate2 Optimization
+This project implements a high-performance language translation service optimized with CTranslate2, supporting 450+ languages including special handling for Dravidian languages (Tamil, Telugu, Kannada, Malayalam).
+## 🚀 Performance Improvements
+CTranslate2 is a custom inference engine for Transformer models that provides significant speed and memory improvements:
+- **5-10x faster translation** compared to standard Transformers library
+- **Reduced memory usage** through model quantization
+- **Batch processing** for improved throughput
+- **Hardware optimization** for both CPU and GPU
+## 🔧 Key Features
+- Text, HTML, and document (PDF) translation
+- Special handling for Dravidian languages with language-specific tags
+- Optimized batch processing for improved performance
+- Docker support for easy deployment
+- GPU acceleration when available
+## 📋 Requirements
+- Python 3.8+
+- CTranslate2 3.20.0+
+- Transformers 4.28.0+
+- PyTorch 2.0.0+
+- Flask 2.2.3+
+- Other dependencies in requirements.txt
+## 💻 Installation
+### Using Docker
+```bash
+# Build the Docker image
+docker build -t universal-translator .
+# Run the container
+docker run -p 8000:8000 -v ./models:/app/.cache/ct2_models universal-translator
+```
+## 🔁 Converting Models
+The translation service automatically converts models as needed, but you can pre-convert them using the provided utility:
+```bash
+# Convert a specific model
+python ct2_model_converter.py --src en --tgt es --quantization int8
+# Convert all common language pairs
+python ct2_model_converter.py --all
+# List available language pairs and quantization options
+python ct2_model_converter.py --list
+```
+### Quantization Options
+- `int8`: 8-bit integer quantization (best for CPU)
+- `float16`: 16-bit floating point (best for GPU)
+- `int16`: 16-bit integer quantization
+- `float8`: 8-bit floating point (experimental)
+- `auto`: Automatic selection based on device
+## 📊 Benchmarking
+You can benchmark the performance improvements using the provided script:
+```bash
+# Run benchmarks for all language pairs
+python benchmark.py
+# Run benchmarks for specific language pairs
+python benchmark.py --lang-pairs en-es en-fr en-dra
+# Customize benchmark parameters
+python benchmark.py --runs 10 --warm-up 3 --output custom_results.json
+```
+## 🌐 API Usage
+### Text Translation
+```python
+import requests
+data = {
+    'text': 'Hello, how are you today?',
+    'source_lang': 'English',
+    'target_lang': 'Spanish'
+}
+response = requests.post('http://localhost:8000/translate', data=data)
+print(response.json()['translated_text'])
+```
+### HTML Translation
+```python
+import requests
+data = {
+    'html': '<div><p>Hello, world!</p></div>',
+    'source_lang': 'English',
+    'target_lang': 'French'
+}
+response = requests.post('http://localhost:8000/translate-html', data=data)
+print(response.json()['translated_html'])
+```
+### Document Translation
+```python
+import requests
+files = {
+    'file': open('document.pdf', 'rb')
+}
+data = {
+    'source_lang': 'English',
+    'target_lang': 'German',
+    'use_ocr': 'false'
+}
+response = requests.post('http://localhost:8000/process-document', files=files, data=data)
+print(response.json()['translated_text'])
+```
+## 🌍 Dravidian Language Support
+For translating to Dravidian languages (Tamil, Telugu, Kannada, Malayalam), the system automatically handles the required special tokens:
+```python
+# Tamil translation example
+data = {
+    'text': 'Hello, how are you?',
+    'source_lang': 'English',
+    'target_lang': 'Tamil'
+}
+response = requests.post('http://localhost:8000/translate', data=data)
+```
+The backend adds the special token `>>tam<<` for Tamil, `>>tel<<` for Telugu, `>>kan<<` for Kannada, or `>>mal<<` for Malayalam as required by the Helsinki NLP models.
+## 📝 License
+This project is licensed under the MIT License - see the LICENSE file for details.
+## 🙏 Acknowledgements
+- [Helsinki NLP](https://github.com/Helsinki-NLP) for providing the OPUS-MT models
+- [OpenNMT](https://github.com/OpenNMT/CTranslate2) for the CTranslate2 optimization library
+- [Hugging Face](https://huggingface.co/) for model hosting and Transformers library
 This is a Hugging Face Spaces deployment of the Universal Translator API service, which provides translation capabilities using the MADLAD-400 3B model.

api_server.py CHANGED Viewed

@@ -1,5 +1,6 @@
 import logging
 import os
 import torch
 import uvicorn
@@ -10,7 +11,7 @@ from pydantic import BaseModel
 from app.models.document_processor import DocumentProcessor
 from app.models.html_processor import HTMLProcessor
 from app.models.text_chunker import TextChunker
-from app.models.translation_model import TranslationModel
 logging.basicConfig(
     level=logging.INFO,
@@ -20,8 +21,8 @@ logger = logging.getLogger(__name__)
 app = FastAPI(
     title="Universal Translator API",
-    description="API for text, HTML, and document translation services",
-    version="1.0.0"
 )
 app.add_middleware(
@@ -33,11 +34,16 @@ app.add_middleware(
 )
 try:
-    model = TranslationModel()
     html_processor = HTMLProcessor()
     text_chunker = TextChunker(max_tokens=250, overlap_tokens=30)
     document_processor = DocumentProcessor()
     initialization_error = None
 except Exception as e:
     logger.error(f"Error initializing components: {str(e)}")
@@ -70,7 +76,11 @@ async def root():
             "message": "Service initialization failed",
             "error": initialization_error
         }
-    return {"status": "ok", "model": "OPUS-MT/NLLB-CPU-Optimized", "version": "1.0"}
 @app.get("/health")
 async def health_check():
@@ -81,14 +91,14 @@ async def health_check():
         "environment": {
             "python_version": os.environ.get('PYTHON_VERSION'),
             "cuda_available": torch.cuda.is_available(),
-            "device": str(model.device) if hasattr(model, 'device') else "Unknown",
-            "loaded_models": list(model.opus_mt_models.keys()) if hasattr(model, 'opus_mt_models') else []
         }
     }
 @app.post("/translate", response_model=TranslationResponse)
 async def translate_text(request: TranslationRequest):
-    """Translate text from source to target language"""
     if initialization_error:
         raise HTTPException(status_code=500, detail=f"Service not properly initialized: {initialization_error}")
@@ -98,20 +108,25 @@ async def translate_text(request: TranslationRequest):
         if request.special_token:
             logger.info(f"Using special language token: {request.special_token}")
-        chunks = text_chunker.create_chunks(request.text)
-        translated_chunks = []
-        for chunk in chunks:
-            translated_text = model.translate(
-                chunk.text,
                 request.source_lang_code,
                 request.target_lang_code
             )
-            translated_chunks.append(translated_text)
-        final_translation = text_chunker.combine_translations(
-            request.text, chunks, translated_chunks
-        )
         return {"translated_text": final_translation}
     except Exception as e:
@@ -120,7 +135,7 @@ async def translate_text(request: TranslationRequest):
 @app.post("/translate-html", response_model=HTMLTranslationResponse)
 async def translate_html(request: HTMLTranslationRequest):
-    """Translate HTML content while preserving structure"""
     if initialization_error:
         raise HTTPException(status_code=500, detail=f"Service not properly initialized: {initialization_error}")
@@ -128,9 +143,8 @@ async def translate_html(request: HTMLTranslationRequest):
         text_fragments, dom_data = html_processor.extract_text(request.html)
         if not text_fragments:
-            return {"translated_html": request.html}  # No text to translate
-        # Apply special token to each text fragment if needed
         if request.special_token:
             logger.info(f"Using special language token for HTML: {request.special_token}")
             text_fragments = html_processor.prepare_fragments_with_token(
@@ -138,25 +152,31 @@ async def translate_html(request: HTMLTranslationRequest):
                 request.special_token
             )
-        translated_fragments = []
-        batch_size = 10
-        for i in range(0, len(text_fragments), batch_size):
-            batch = text_fragments[i:i+batch_size]
-            for fragment in batch:
-                if not fragment.strip():
-                    translated_fragments.append(fragment)
-                    continue
-                translated_text = model.translate(
-                    fragment,
-                    request.source_lang_code,
-                    request.target_lang_code
-                )
-                translated_fragments.append(translated_text)
-        translated_html = html_processor.replace_text(dom_data, translated_fragments)
         return {"translated_html": translated_html}
     except Exception as e:
@@ -171,7 +191,7 @@ async def process_document(
     special_token: str = Form(""),
     use_ocr: bool = Form(False)
 ):
-    """Process and translate document (PDF or image)"""
     if initialization_error:
         raise HTTPException(status_code=500, detail=f"Service not properly initialized: {initialization_error}")
@@ -189,16 +209,30 @@ async def process_document(
                 status_code=400,
                 detail="No text could be extracted from the document"
             )
         if special_token:
             logger.info(f"Using special language token for document: {special_token}")
             extracted_text = f"{special_token}{extracted_text}"
-        translated_text = model.translate(
-            extracted_text,
-            source_lang_code,
-            target_lang_code
-        )
         return {
             "extracted_text": extracted_text,

 import logging
 import os
+import time
 import torch
 import uvicorn
 from app.models.document_processor import DocumentProcessor
 from app.models.html_processor import HTMLProcessor
 from app.models.text_chunker import TextChunker
+from app.models.translation_model_ct2 import TranslationModelCT2
 logging.basicConfig(
     level=logging.INFO,
 app = FastAPI(
     title="Universal Translator API",
+    description="API for text, HTML, and document translation services with CTranslate2 optimization",
+    version="2.0.0"
 )
 app.add_middleware(
 )
 try:
+    start_time = time.time()
+    model = TranslationModelCT2(model_cache_dir=os.getenv("CT2_MODEL_CACHE", ".cache/ct2_models"))
     html_processor = HTMLProcessor()
     text_chunker = TextChunker(max_tokens=250, overlap_tokens=30)
     document_processor = DocumentProcessor()
+    initialization_time = time.time() - start_time
+    logger.info(f"Initialized components in {initialization_time:.2f}s")
     initialization_error = None
 except Exception as e:
     logger.error(f"Error initializing components: {str(e)}")
             "message": "Service initialization failed",
             "error": initialization_error
         }
+    return {
+        "status": "ok",
+        "model": "CTranslate2 Optimized with MADLAD-400 3B model",
+        "version": "2.0"
+    }
 @app.get("/health")
 async def health_check():
         "environment": {
             "python_version": os.environ.get('PYTHON_VERSION'),
             "cuda_available": torch.cuda.is_available(),
+            "device": model.device if hasattr(model, 'device') else "Unknown",
+            "model_info": model.get_model_info() if hasattr(model, 'get_model_info') else {}
         }
     }
 @app.post("/translate", response_model=TranslationResponse)
 async def translate_text(request: TranslationRequest):
+    """Translate text from source to target language using CTranslate2"""
     if initialization_error:
         raise HTTPException(status_code=500, detail=f"Service not properly initialized: {initialization_error}")
         if request.special_token:
             logger.info(f"Using special language token: {request.special_token}")
+        if len(request.text) > 1000:
+            chunks = text_chunker.create_chunks(request.text)
+            chunk_texts = [chunk.text for chunk in chunks]
+            translated_chunks = model.translate_batch(
+                chunk_texts,
+                request.source_lang_code,
+                request.target_lang_code
+            )
+            final_translation = text_chunker.combine_translations(
+                request.text, chunks, translated_chunks
+            )
+        else:
+            final_translation = model.translate(
+                request.text,
                 request.source_lang_code,
                 request.target_lang_code
             )
         return {"translated_text": final_translation}
     except Exception as e:
 @app.post("/translate-html", response_model=HTMLTranslationResponse)
 async def translate_html(request: HTMLTranslationRequest):
+    """Translate HTML content while preserving structure using CTranslate2"""
     if initialization_error:
         raise HTTPException(status_code=500, detail=f"Service not properly initialized: {initialization_error}")
         text_fragments, dom_data = html_processor.extract_text(request.html)
         if not text_fragments:
+            return {"translated_html": request.html}
         if request.special_token:
             logger.info(f"Using special language token for HTML: {request.special_token}")
             text_fragments = html_processor.prepare_fragments_with_token(
                 request.special_token
             )
+        non_empty_fragments = []
+        empty_indices = []
+        for i, fragment in enumerate(text_fragments):
+            if fragment.strip():
+                non_empty_fragments.append(fragment)
+            else:
+                empty_indices.append(i)
+        translated_fragments = model.translate_batch(
+            non_empty_fragments,
+            request.source_lang_code,
+            request.target_lang_code
+        )
+        full_translated_fragments = []
+        non_empty_idx = 0
+        for i in range(len(text_fragments)):
+            if i in empty_indices:
+                full_translated_fragments.append("")
+            else:
+                full_translated_fragments.append(translated_fragments[non_empty_idx])
+                non_empty_idx += 1
+        translated_html = html_processor.replace_text(dom_data, full_translated_fragments)
         return {"translated_html": translated_html}
     except Exception as e:
     special_token: str = Form(""),
     use_ocr: bool = Form(False)
 ):
+    """Process and translate document (PDF or image) using CTranslate2"""
     if initialization_error:
         raise HTTPException(status_code=500, detail=f"Service not properly initialized: {initialization_error}")
                 status_code=400,
                 detail="No text could be extracted from the document"
             )
         if special_token:
             logger.info(f"Using special language token for document: {special_token}")
             extracted_text = f"{special_token}{extracted_text}"
+        if len(extracted_text) > 1000:
+            chunks = text_chunker.create_chunks(extracted_text)
+            chunk_texts = [chunk.text for chunk in chunks]
+            translated_chunks = model.translate_batch(
+                chunk_texts,
+                source_lang_code,
+                target_lang_code
+            )
+            translated_text = text_chunker.combine_translations(
+                extracted_text, chunks, translated_chunks
+            )
+        else:
+            translated_text = model.translate(
+                extracted_text,
+                source_lang_code,
+                target_lang_code
+            )
         return {
             "extracted_text": extracted_text,

app/models/benchmark_script.py ADDED Viewed

	@@ -0,0 +1,364 @@

+#!/usr/bin/env python
+"""
+Benchmark script to compare performance between standard Transformers
+and CTranslate2 optimized translation models.
+"""
+import argparse
+import json
+import logging
+import os
+import sys
+import time
+from typing import Dict, List, Tuple
+import numpy as np
+import torch
+import tqdm
+from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, MarianMTModel
+# Add project root to path for imports
+sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))
+# Configure logging
+logging.basicConfig(
+    level=logging.INFO,
+    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
+)
+logger = logging.getLogger(__name__)
+# Import our models
+try:
+    from app.models.translation_model import TranslationModel  # Standard model
+    from app.models.translation_model_ct2 import TranslationModelCT2  # CTranslate2 model
+except ImportError:
+    logger.error("Could not import translation models. Make sure you're running this script from the project root.")
+    sys.exit(1)
+# Test sentences for various languages
+TEST_SENTENCES = {
+    "en-es": [
+        "Hello, how are you today?",
+        "I would like to book a flight to Madrid for next week.",
+        "The quick brown fox jumps over the lazy dog.",
+        "Artificial intelligence is transforming the way we live and work.",
+        "Please contact our customer service if you have any questions."
+    ],
+    "en-fr": [
+        "Hello, how are you today?",
+        "I would like to book a flight to Paris for next week.",
+        "The quick brown fox jumps over the lazy dog.",
+        "Artificial intelligence is transforming the way we live and work.",
+        "Please contact our customer service if you have any questions."
+    ],
+    "en-de": [
+        "Hello, how are you today?",
+        "I would like to book a flight to Berlin for next week.",
+        "The quick brown fox jumps over the lazy dog.",
+        "Artificial intelligence is transforming the way we live and work.",
+        "Please contact our customer service if you have any questions."
+    ],
+    "en-dra": [
+        "Hello, how are you today?",
+        "I would like to book a flight to Chennai for next week.",
+        "The quick brown fox jumps over the lazy dog.",
+        "Artificial intelligence is transforming the way we live and work.",
+        "Please contact our customer service if you have any questions."
+    ]
+}
+def benchmark_standard_model(
+    src_lang: str,
+    tgt_lang: str,
+    sentences: List[str],
+    num_runs: int = 5,
+    warm_up: int = 2
+) -> Dict:
+    """Benchmark the standard Transformers model."""
+    logger.info(f"Benchmarking standard Transformers model for {src_lang}-{tgt_lang}")
+    # Initialize model
+    model = TranslationModel()
+    # Warm-up runs
+    logger.info(f"Performing {warm_up} warm-up runs...")
+    for _ in range(warm_up):
+        for sentence in sentences[:2]:  # Use only first 2 sentences for warm-up
+            model.translate(sentence, src_lang, tgt_lang)
+    # Actual benchmark
+    logger.info(f"Performing {num_runs} benchmark runs...")
+    times = []
+    translations = []
+    for run in range(num_runs):
+        run_times = []
+        run_translations = []
+        for sentence in tqdm.tqdm(sentences, desc=f"Run {run+1}/{num_runs}"):
+            start_time = time.time()
+            translation = model.translate(sentence, src_lang, tgt_lang)
+            elapsed_time = time.time() - start_time
+            run_times.append(elapsed_time)
+            run_translations.append(translation)
+        times.append(run_times)
+        # Only keep translations from the first run
+        if run == 0:
+            translations = run_translations
+    # Calculate statistics
+    all_times = np.array(times).flatten()
+    stats = {
+        "mean_time": float(np.mean(all_times)),
+        "median_time": float(np.median(all_times)),
+        "std_dev": float(np.std(all_times)),
+        "min_time": float(np.min(all_times)),
+        "max_time": float(np.max(all_times)),
+        "total_time": float(np.sum(all_times)),
+        "num_sentences": len(sentences) * num_runs,
+        "translations": translations
+    }
+    return stats
+def benchmark_ct2_model(
+    src_lang: str,
+    tgt_lang: str,
+    sentences: List[str],
+    num_runs: int = 5,
+    warm_up: int = 2
+) -> Dict:
+    """Benchmark the CTranslate2 optimized model."""
+    logger.info(f"Benchmarking CTranslate2 model for {src_lang}-{tgt_lang}")
+    # Initialize model
+    model = TranslationModelCT2()
+    # Warm-up runs
+    logger.info(f"Performing {warm_up} warm-up runs...")
+    for _ in range(warm_up):
+        for sentence in sentences[:2]:  # Use only first 2 sentences for warm-up
+            model.translate(sentence, src_lang, tgt_lang)
+    # Actual benchmark
+    logger.info(f"Performing {num_runs} benchmark runs...")
+    times = []
+    translations = []
+    for run in range(num_runs):
+        run_times = []
+        run_translations = []
+        for sentence in tqdm.tqdm(sentences, desc=f"Run {run+1}/{num_runs}"):
+            start_time = time.time()
+            translation = model.translate(sentence, src_lang, tgt_lang)
+            elapsed_time = time.time() - start_time
+            run_times.append(elapsed_time)
+            run_translations.append(translation)
+        times.append(run_times)
+        # Only keep translations from the first run
+        if run == 0:
+            translations = run_translations
+    # Calculate statistics
+    all_times = np.array(times).flatten()
+    stats = {
+        "mean_time": float(np.mean(all_times)),
+        "median_time": float(np.median(all_times)),
+        "std_dev": float(np.std(all_times)),
+        "min_time": float(np.min(all_times)),
+        "max_time": float(np.max(all_times)),
+        "total_time": float(np.sum(all_times)),
+        "num_sentences": len(sentences) * num_runs,
+        "translations": translations
+    }
+    return stats
+def benchmark_batch(
+    src_lang: str,
+    tgt_lang: str,
+    sentences: List[str],
+    num_runs: int = 5,
+    warm_up: int = 2
+) -> Dict:
+    """Benchmark batch translation with CTranslate2."""
+    logger.info(f"Benchmarking CTranslate2 batch translation for {src_lang}-{tgt_lang}")
+    # Initialize model
+    model = TranslationModelCT2()
+    # Warm-up runs
+    logger.info(f"Performing {warm_up} warm-up runs...")
+    for _ in range(warm_up):
+        model.translate_batch(sentences[:2], src_lang, tgt_lang)
+    # Actual benchmark
+    logger.info(f"Performing {num_runs} benchmark runs...")
+    times = []
+    translations = []
+    for run in range(num_runs):
+        start_time = time.time()
+        batch_translations = model.translate_batch(sentences, src_lang, tgt_lang)
+        elapsed_time = time.time() - start_time
+        times.append(elapsed_time)
+        # Only keep translations from the first run
+        if run == 0:
+            translations = batch_translations
+    # Calculate statistics
+    stats = {
+        "mean_time": float(np.mean(times)),
+        "median_time": float(np.median(times)),
+        "std_dev": float(np.std(times)),
+        "min_time": float(np.min(times)),
+        "max_time": float(np.max(times)),
+        "total_time": float(np.sum(times)),
+        "num_sentences": len(sentences),
+        "num_batches": num_runs,
+        "translations": translations
+    }
+    return stats
+def run_benchmarks(
+    lang_pairs: List[Tuple[str, str]],
+    num_runs: int = 5,
+    warm_up: int = 2,
+    output_file: str = "benchmark_results.json"
+) -> Dict:
+    """Run benchmarks for specified language pairs."""
+    device = "cuda" if torch.cuda.is_available() else "cpu"
+    logger.info(f"Running benchmarks on {device}")
+    results = {
+        "device": device,
+        "cuda_available": torch.cuda.is_available(),
+        "cuda_version": torch.version.cuda if torch.cuda.is_available() else None,
+        "num_runs": num_runs,
+        "warm_up_runs": warm_up,
+        "language_pairs": {}
+    }
+    for src_lang, tgt_lang in lang_pairs:
+        model_key = f"{src_lang}-{tgt_lang}"
+        if model_key not in TEST_SENTENCES:
+            logger.warning(f"No test sentences available for {model_key}, skipping...")
+            continue
+        logger.info(f"Benchmarking {model_key}...")
+        sentences = TEST_SENTENCES[model_key]
+        # Run standard model benchmark
+        standard_stats = benchmark_standard_model(
+            src_lang, tgt_lang, sentences, num_runs, warm_up
+        )
+        # Run CTranslate2 model benchmark
+        ct2_stats = benchmark_ct2_model(
+            src_lang, tgt_lang, sentences, num_runs, warm_up
+        )
+        # Run batch translation benchmark
+        batch_stats = benchmark_batch(
+            src_lang, tgt_lang, sentences, num_runs, warm_up
+        )
+        # Calculate speedup
+        speedup = standard_stats["mean_time"] / ct2_stats["mean_time"]
+        batch_speedup = standard_stats["mean_time"] * len(sentences) / batch_stats["mean_time"]
+        results["language_pairs"][model_key] = {
+            "standard_model": standard_stats,
+            "ct2_model": ct2_stats,
+            "batch_translation": batch_stats,
+            "speedup": float(speedup),
+            "batch_speedup": float(batch_speedup)
+        }
+        # Print summary
+        logger.info(f"\nResults for {model_key}:")
+        logger.info(f"  Standard model average time: {standard_stats['mean_time']:.4f}s")
+        logger.info(f"  CTranslate2 model average time: {ct2_stats['mean_time']:.4f}s")
+        logger.info(f"  Batch translation average time: {batch_stats['mean_time']:.4f}s (for {len(sentences)} sentences)")
+        logger.info(f"  Speedup: {speedup:.2f}x")
+        logger.info(f"  Batch speedup: {batch_speedup:.2f}x")
+    # Save results to file
+    with open(output_file, "w") as f:
+        json.dump(results, f, indent=2)
+    logger.info(f"Benchmark results saved to {output_file}")
+    return results
+def main():
+    """Main entry point for the benchmark script."""
+    parser = argparse.ArgumentParser(
+        description="Benchmark translation models performance"
+    )
+    parser.add_argument(
+        "--lang-pairs",
+        type=str,
+        nargs="+",
+        default=["en-es", "en-fr", "en-de", "en-dra"],
+        help="Language pairs to benchmark (e.g., 'en-es en-fr')"
+    )
+    parser.add_argument(
+        "--runs",
+        type=int,
+        default=5,
+        help="Number of benchmark runs"
+    )
+    parser.add_argument(
+        "--warm-up",
+        type=int,
+        default=2,
+        help="Number of warm-up runs"
+    )
+    parser.add_argument(
+        "--output",
+        type=str,
+        default="benchmark_results.json",
+        help="Output file for benchmark results"
+    )
+    args = parser.parse_args()
+    # Parse language pairs
+    lang_pairs = []
+    for pair in args.lang_pairs:
+        if "-" in pair:
+            src, tgt = pair.split("-")
+            lang_pairs.append((src, tgt))
+        else:
+            logger.warning(f"Invalid language pair format: {pair}, skipping...")
+    if not lang_pairs:
+        logger.error("No valid language pairs specified")
+        return 1
+    # Run benchmarks
+    run_benchmarks(
+        lang_pairs=lang_pairs,
+        num_runs=args.runs,
+        warm_up=args.warm_up,
+        output_file=args.output
+    )
+    return 0
+if __name__ == "__main__":
+    sys.exit(main())

app/models/ct2_model_converter.py ADDED Viewed

	@@ -0,0 +1,282 @@

+#!/usr/bin/env python
+"""
+Utility script to convert Helsinki NLP Opus MT models to CTranslate2 format.
+This script handles the special case of Dravidian languages.
+"""
+import argparse
+import logging
+import os
+import sys
+from typing import Dict, List, Optional, Set
+import torch
+# Configure logging
+logging.basicConfig(
+    level=logging.INFO,
+    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
+)
+logger = logging.getLogger(__name__)
+# Common language pairs
+COMMON_LANGUAGE_PAIRS = [
+    ("en", "es"),  # English to Spanish
+    ("en", "fr"),  # English to French
+    ("en", "de"),  # English to German
+    ("en", "ru"),  # English to Russian
+    ("en", "zh"),  # English to Chinese
+    ("en", "ar"),  # English to Arabic
+    ("en", "hi"),  # English to Hindi
+    ("en", "dra"),  # English to Dravidian languages
+    ("es", "en"),  # Spanish to English
+    ("fr", "en"),  # French to English
+    ("de", "en"),  # German to English
+    ("ru", "en"),  # Russian to English
+    ("zh", "en"),  # Chinese to English
+    ("ar", "en"),  # Arabic to English
+    ("hi", "en"),  # Hindi to English
+]
+# Supported quantization types
+QUANTIZATION_TYPES = {
+    "int8": "8-bit integer quantization (best for CPU)",
+    "int16": "16-bit integer quantization",
+    "float16": "16-bit floating point (best for modern GPUs)",
+    "float8": "8-bit floating point (experimental)",
+    "auto": "Automatic selection based on device",
+}
+def get_device() -> str:
+    """Get the best available device for model inference."""
+    if torch.cuda.is_available():
+        return "cuda"
+    else:
+        return "cpu"
+def get_auto_quantization(device: str) -> str:
+    """Get the appropriate quantization based on device."""
+    if device == "cuda":
+        return "float16"
+    else:
+        return "int8"
+def get_huggingface_model_name(src_lang: str, tgt_lang: str) -> str:
+    """Get the appropriate HuggingFace model name for the language pair."""
+    return f"Helsinki-NLP/opus-mt-{src_lang}-{tgt_lang}"
+def convert_model(
+    src_lang: str,
+    tgt_lang: str,
+    output_dir: str,
+    quantization: str = "auto",
+    force: bool = False
+) -> bool:
+    """
+    Convert a Helsinki NLP model to CTranslate2 format.
+    Args:
+        src_lang: Source language code
+        tgt_lang: Target language code
+        output_dir: Output directory path
+        quantization: Quantization type
+        force: Whether to force conversion if model exists
+    Returns:
+        bool: Success status
+    """
+    try:
+        # Determine output path
+        model_key = f"{src_lang}-{tgt_lang}"
+        model_dir = os.path.join(output_dir, f"ct2_{model_key}")
+        # Check if model already exists
+        if os.path.exists(model_dir) and os.path.isdir(model_dir) and not force:
+            logger.info(f"Model {model_key} already exists at {model_dir}. Use --force to overwrite.")
+            return True
+        # Get the HuggingFace model name
+        huggingface_model = get_huggingface_model_name(src_lang, tgt_lang)
+        logger.info(f"Converting model {huggingface_model} to CTranslate2 format")
+        # Handle auto quantization
+        device = get_device()
+        if quantization == "auto":
+            quantization = get_auto_quantization(device)
+        logger.info(f"Using {quantization} quantization for {device} device")
+        try:
+            # Import here to avoid dependency if not installed
+            from ctranslate2.converters import TransformersConverter
+            # Create converter
+            converter = TransformersConverter(huggingface_model)
+            # Convert model
+            converter.convert(
+                model_dir,
+                quantization=quantization,
+                force=True
+            )
+            logger.info(f"Successfully converted {huggingface_model} to CTranslate2 format at {model_dir}")
+            return True
+        except ImportError:
+            logger.warning("Could not import TransformersConverter, falling back to command line")
+            # Fallback to command line
+            import subprocess
+            cmd = [
+                "ct2-transformers-converter",
+                "--model", huggingface_model,
+                "--output_dir", model_dir,
+                "--quantization", quantization,
+                "--force"
+            ]
+            # Run the command
+            logger.info(f"Running command: {' '.join(cmd)}")
+            result = subprocess.run(cmd, capture_output=True, text=True)
+            if result.returncode == 0:
+                logger.info(f"Successfully converted model using shell command")
+                return True
+            else:
+                logger.error(f"Error in shell command: {result.stderr}")
+                return False
+    except Exception as e:
+        logger.error(f"Error converting model {src_lang}-{tgt_lang}: {str(e)}")
+        return False
+def convert_all_models(
+    output_dir: str,
+    quantization: str = "auto",
+    force: bool = False
+) -> Dict[str, bool]:
+    """
+    Convert all common language pair models to CTranslate2 format.
+    Args:
+        output_dir: Output directory path
+        quantization: Quantization type
+        force: Whether to force conversion if model exists
+    Returns:
+        Dict[str, bool]: Results by language pair
+    """
+    results = {}
+    for src_lang, tgt_lang in COMMON_LANGUAGE_PAIRS:
+        model_key = f"{src_lang}-{tgt_lang}"
+        logger.info(f"Processing model pair: {model_key}")
+        success = convert_model(
+            src_lang=src_lang,
+            tgt_lang=tgt_lang,
+            output_dir=output_dir,
+            quantization=quantization,
+            force=force
+        )
+        results[model_key] = success
+    # Print summary
+    logger.info("\n=== Conversion Summary ===")
+    success_count = sum(1 for success in results.values() if success)
+    logger.info(f"Successfully converted {success_count} of {len(results)} models")
+    for model_key, success in results.items():
+        status = "✓" if success else "✗"
+        logger.info(f"{status} {model_key}")
+    return results
+def main():
+    """Main entry point for the script."""
+    parser = argparse.ArgumentParser(
+        description="Convert Helsinki NLP Opus MT models to CTranslate2 format"
+    )
+    parser.add_argument(
+        "--src",
+        type=str,
+        help="Source language code (e.g., 'en')"
+    )
+    parser.add_argument(
+        "--tgt",
+        type=str,
+        help="Target language code (e.g., 'es', 'fr', 'dra')"
+    )
+    parser.add_argument(
+        "--output-dir",
+        type=str,
+        default=".cache/ct2_models",
+        help="Output directory for converted models"
+    )
+    parser.add_argument(
+        "--quantization",
+        type=str,
+        choices=list(QUANTIZATION_TYPES.keys()),
+        default="auto",
+        help="Quantization type to use"
+    )
+    parser.add_argument(
+        "--force",
+        action="store_true",
+        help="Force conversion even if model exists"
+    )
+    parser.add_argument(
+        "--all",
+        action="store_true",
+        help="Convert all common language pairs"
+    )
+    parser.add_argument(
+        "--list",
+        action="store_true",
+        help="List all common language pairs"
+    )
+    args = parser.parse_args()
+    # Make sure output directory exists
+    os.makedirs(args.output_dir, exist_ok=True)
+    # List common language pairs if requested
+    if args.list:
+        print("\nCommon language pairs:")
+        for src, tgt in COMMON_LANGUAGE_PAIRS:
+            print(f"  {src}-{tgt}")
+        print("\nQuantization types:")
+        for q_type, desc in QUANTIZATION_TYPES.items():
+            print(f"  {q_type}: {desc}")
+        return 0
+    # Convert all models if requested
+    if args.all:
+        results = convert_all_models(
+            output_dir=args.output_dir,
+            quantization=args.quantization,
+            force=args.force
+        )
+        return 0 if all(results.values()) else 1
+    # Otherwise, need source and target languages
+    if not args.src or not args.tgt:
+        parser.error("--src and --tgt are required unless --all or --list is specified")
+    # Convert single model
+    success = convert_model(
+        src_lang=args.src,
+        tgt_lang=args.tgt,
+        output_dir=args.output_dir,
+        quantization=args.quantization,
+        force=args.force
+    )
+    return 0 if success else 1
+if __name__ == "__main__":
+    sys.exit(main())

app/models/translation_model_ct2.py ADDED Viewed

	@@ -0,0 +1,277 @@

+import logging
+import os
+import re
+from typing import Dict, List, Optional, Tuple
+import ctranslate2
+import torch
+import transformers
+from transformers import AutoTokenizer
+logger = logging.getLogger(__name__)
+class TranslationModelCT2:
+    """
+    Optimized translation model using CTranslate2 for faster inference.
+    """
+    def __init__(self, model_cache_dir: str = ".cache/ct2_models"):
+        """
+        Initialize the CTranslate2 translation model manager.
+        Args:
+            model_cache_dir: Directory to cache converted models
+        """
+        self.model_cache_dir = model_cache_dir
+        self.device = self._get_device()
+        self.ct2_models = {}  # Cache for loaded CTranslate2 models
+        self.tokenizers = {}  # Cache for tokenizers
+        self.model_paths = {}  # Map for model paths
+        self.initialized = False
+        self.initialization_error = None
+        # Create cache directory
+        os.makedirs(model_cache_dir, exist_ok=True)
+        try:
+            # Log available device
+            logger.info(f"TranslationModelCT2 initialized with device: {self.device}")
+            self.initialized = True
+        except Exception as e:
+            self.initialization_error = str(e)
+            logger.error(f"Failed to initialize CTranslate2 translation model: {str(e)}")
+    def _get_device(self) -> str:
+        """Get the best available device for model inference."""
+        if torch.cuda.is_available():
+            logger.info("Using CUDA GPU for CTranslate2")
+            return "cuda"
+        else:
+            logger.info("Using CPU for CTranslate2")
+            return "cpu"
+    def _get_compute_type(self) -> str:
+        """Get the appropriate compute type based on device."""
+        if self.device == "cuda":
+            return "int8_float16"  # More efficient for GPU
+        else:
+            return "int8"  # More efficient for CPU
+    def _get_model_key(self, source_lang_code: str, target_lang_code: str) -> str:
+        """Create a unique key for the model cache."""
+        return f"{source_lang_code}-{target_lang_code}"
+    def _get_huggingface_model_name(self, source_lang_code: str, target_lang_code: str) -> str:
+        """Get the appropriate HuggingFace model name for the language pair."""
+        # Handle special case for Dravidian languages
+        if target_lang_code == "dra":
+            return "Helsinki-NLP/opus-mt-en-dra"
+        # Standard language pairs
+        return f"Helsinki-NLP/opus-mt-{source_lang_code}-{target_lang_code}"
+    def _get_ct2_model_path(self, source_lang_code: str, target_lang_code: str) -> str:
+        """Get the path for the CTranslate2 model."""
+        model_key = self._get_model_key(source_lang_code, target_lang_code)
+        return os.path.join(self.model_cache_dir, f"ct2_{model_key}")
+    def _convert_model_if_needed(self, source_lang_code: str, target_lang_code: str) -> str:
+        """Convert the model to CTranslate2 format if not already converted."""
+        model_key = self._get_model_key(source_lang_code, target_lang_code)
+        model_path = self._get_ct2_model_path(source_lang_code, target_lang_code)
+        # Check if model already exists
+        if os.path.exists(model_path) and os.path.isdir(model_path):
+            logger.info(f"Using existing CTranslate2 model for {model_key}")
+            return model_path
+        # Get the Hugging Face model name
+        huggingface_model = self._get_huggingface_model_name(source_lang_code, target_lang_code)
+        logger.info(f"Converting model {huggingface_model} to CTranslate2 format")
+        try:
+            # Import here to avoid dependency if ct2-transformers-converter not used
+            from ctranslate2.converters import TransformersConverter
+            # Create converter
+            converter = TransformersConverter(huggingface_model)
+            # Convert model
+            converter.convert(
+                model_path,
+                quantization=self._get_compute_type().split("_")[0],  # int8 or float16
+                force=True
+            )
+            logger.info(f"Successfully converted {huggingface_model} to CTranslate2 format at {model_path}")
+            return model_path
+        except Exception as e:
+            logger.error(f"Error converting model to CTranslate2 format: {str(e)}")
+            # Fallback - use shell command to convert
+            try:
+                logger.info(f"Attempting conversion using ct2-transformers-converter shell command")
+                import subprocess
+                cmd = [
+                    "ct2-transformers-converter",
+                    "--model", huggingface_model,
+                    "--output_dir", model_path,
+                    "--quantization", self._get_compute_type().split("_")[0],
+                    "--force"
+                ]
+                # Run the command
+                result = subprocess.run(cmd, capture_output=True, text=True)
+                if result.returncode == 0:
+                    logger.info(f"Successfully converted model using shell command")
+                    return model_path
+                else:
+                    logger.error(f"Error in shell command: {result.stderr}")
+                    raise ValueError(f"Failed to convert model: {result.stderr}")
+            except Exception as shell_error:
+                logger.error(f"Error with shell conversion: {str(shell_error)}")
+                raise ValueError(f"Could not convert model {huggingface_model} to CTranslate2 format")
+    def _load_model(self, source_lang_code: str, target_lang_code: str) -> Tuple[ctranslate2.Translator, transformers.PreTrainedTokenizer]:
+        """Load a CTranslate2 model and tokenizer for the language pair."""
+        model_key = self._get_model_key(source_lang_code, target_lang_code)
+        # Check if already loaded
+        if model_key in self.ct2_models and model_key in self.tokenizers:
+            return self.ct2_models[model_key], self.tokenizers[model_key]
+        try:
+            # Convert model if needed
+            model_path = self._convert_model_if_needed(source_lang_code, target_lang_code)
+            # Load the tokenizer
+            huggingface_model = self._get_huggingface_model_name(source_lang_code, target_lang_code)
+            tokenizer = AutoTokenizer.from_pretrained(huggingface_model)
+            # Load CTranslate2 model
+            inter_threads = 1  # Number of parallel translations
+            intra_threads = min(os.cpu_count() or 4, 4)  # Number of threads per translation
+            translator = ctranslate2.Translator(
+                model_path,
+                device=self.device,
+                compute_type=self._get_compute_type(),
+                inter_threads=inter_threads,
+                intra_threads=intra_threads
+            )
+            # Cache the model and tokenizer
+            self.ct2_models[model_key] = translator
+            self.tokenizers[model_key] = tokenizer
+            self.model_paths[model_key] = model_path
+            logger.info(f"Successfully loaded CTranslate2 model and tokenizer for {model_key}")
+            return translator, tokenizer
+        except Exception as e:
+            logger.error(f"Error loading CTranslate2 model: {str(e)}")
+            raise
+    def translate(self, text: str, source_lang_code: str, target_lang_code: str) -> str:
+        """
+        Translate text from source language to target language using CTranslate2.
+        Args:
+            text: Text to translate
+            source_lang_code: Source language code
+            target_lang_code: Target language code
+        Returns:
+            Translated text
+        """
+        if not text.strip():
+            return ""
+        try:
+            if not self.initialized:
+                raise ValueError(f"Translation model not properly initialized: {self.initialization_error}")
+            # Handle special tokens in text (for Dravidian languages)
+            # We don't need to modify the target_lang_code since the special token is already in the text
+            # Load the model and tokenizer
+            translator, tokenizer = self._load_model(source_lang_code, target_lang_code)
+            # Tokenize the input text
+            tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(text))
+            # Translate using CTranslate2
+            results = translator.translate_batch([tokens])
+            # The first result's first hypothesis
+            target_tokens = results[0].hypotheses[0]
+            # Convert tokens back to text
+            translated_text = tokenizer.decode(tokenizer.convert_tokens_to_ids(target_tokens))
+            # Clean up the output
+            return re.sub(r'\s+', ' ', translated_text).strip()
+        except Exception as e:
+            logger.error(f"CTranslate2 translation error: {str(e)}")
+            raise
+    def translate_batch(self, texts: List[str], source_lang_code: str, target_lang_code: str) -> List[str]:
+        """
+        Translate a batch of texts for improved performance.
+        Args:
+            texts: List of texts to translate
+            source_lang_code: Source language code
+            target_lang_code: Target language code
+        Returns:
+            List of translated texts
+        """
+        if not texts:
+            return []
+        try:
+            if not self.initialized:
+                raise ValueError(f"Translation model not properly initialized: {self.initialization_error}")
+            # Load the model and tokenizer
+            translator, tokenizer = self._load_model(source_lang_code, target_lang_code)
+            # Tokenize all input texts
+            tokens_batch = [
+                tokenizer.convert_ids_to_tokens(tokenizer.encode(text))
+                for text in texts
+            ]
+            # Translate the batch
+            results = translator.translate_batch(tokens_batch)
+            # Extract the translations
+            translated_texts = []
+            for result in results:
+                if result.hypotheses:
+                    target_tokens = result.hypotheses[0]
+                    translated_text = tokenizer.decode(tokenizer.convert_tokens_to_ids(target_tokens))
+                    translated_text = re.sub(r'\s+', ' ', translated_text).strip()
+                    translated_texts.append(translated_text)
+                else:
+                    translated_texts.append("")
+            return translated_texts
+        except Exception as e:
+            logger.error(f"CTranslate2 batch translation error: {str(e)}")
+            raise
+    def get_model_info(self) -> Dict:
+        """Get information about loaded models."""
+        return {
+            "device": self.device,
+            "compute_type": self._get_compute_type(),
+            "loaded_models": list(self.ct2_models.keys()),
+            "model_paths": self.model_paths
+        }

requirements.txt CHANGED Viewed

@@ -12,4 +12,6 @@ tqdm
 beautifulsoup4
 PyMuPDF
 protobuf
-torch

 beautifulsoup4
 PyMuPDF
 protobuf
+torch
+ctranslate2
+hf-hub-ctranslate2