πŸŽ™οΈ Enhanced Speech-to-Speech Translation Models

Complete Collection of Optimized AI Models for Real-Time Speech Translation

This repository contains pre-converted models and example audio files for building high-performance speech translation applications between English and French. Follow the steps carefully to implement the complete pipeline.

Before using these models, verify compatibility with your system requirements:

  • CUDA-compatible GPU (recommended)
  • Python 3.8+ environment
  • At least 8GB RAM (16GB recommended for larger models)

Step 1: Repository Overview

This repository provides everything needed for speech-to-speech translation:

  • 🎀 Speech Recognition: Multiple Whisper model sizes (39MB to 1.5GB)
  • πŸ”„ Translation: NLLB models optimized with CTranslate2 (600M & 1.3B)
  • πŸ”Š Speech Synthesis: Meta's MMS-TTS for natural voice generation
  • πŸ“ Example Files: Sample audio for testing and demonstration


Step 2: Understanding the Model Structure

Main Models Directory:

models/
β”œβ”€β”€ whisper/                           # Speech Recognition Models
β”œβ”€β”€ nllb-200-distilled-600M-ct2-int8/  # Translation Model (Fast)
β”œβ”€β”€ nllb-200-distilled-1.3B-ct2-int8/  # Translation Model (Accurate)
β”œβ”€β”€ models--facebook--mms-tts-eng/     # English Voice Synthesis
└── models--facebook--mms-tts-fra/     # French Voice Synthesis

Example Files Directory:

examples/
β”œβ”€β”€ input_audio/     # Test audio files (English & French samples)
└── output_audio/    # Expected translation results

Step 3: Model Specifications

Whisper Models (Speech Recognition):

  • tiny.pt (76MB) - Ultra-fast, good accuracy
  • base.pt (145MB) - Balanced speed/accuracy
  • small.pt (484MB) - High accuracy, moderate speed
  • tiny.en.pt (76MB) - English-only variant
  • base.en.pt (145MB) - English-only variant

NLLB Models (Translation):

  • 600M (623MB) - Fast translation, good quality
  • 1.3B (1.4GB) - Slower but excellent accuracy

MMS-TTS Models (Voice Synthesis):

  • English (145MB) - Natural English voice
  • French (145MB) - Natural French voice
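The sizes above let you estimate the download footprint of any model combination before you fetch it. The helper below is illustrative (not part of the pipeline's API), with sizes hardcoded from the lists in this step:

```python
# Approximate on-disk sizes in MB, taken from the model specifications above.
WHISPER_MB = {"tiny": 76, "base": 145, "small": 484}
NLLB_MB = {"600M": 623, "1.3B": 1400}
MMS_TTS_MB = 145  # per language (English, French)

def footprint_mb(whisper: str, nllb: str, tts_langs: int = 0) -> int:
    """Rough total download size for one Whisper + NLLB combination,
    optionally including MMS-TTS voices."""
    return WHISPER_MB[whisper] + NLLB_MB[nllb] + tts_langs * MMS_TTS_MB
```

For example, `footprint_mb("tiny", "600M")` gives the ~699MB quoted for the demo configuration in Step 7.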

Step 4: Installation Requirements

Install the required dependencies with pip. PyTorch needs the index URL that matches your hardware:

$ pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cpu    # CPU inference
$ pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu126  # GPU inference (CUDA 12.6)

Then install the remaining packages:

$ pip install faster-whisper==1.1.1 ctranslate2==4.4.0 transformers==4.52.3 numpy==2.2.6 scipy==1.15.3 gradio==5.31.0 requests==2.32.3

Or install all at once:

$ pip install -r requirements.txt
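After installing, you can sanity-check that everything is importable. This is a generic, stdlib-only sketch (the `required` list and helper name are illustrative, not part of the repository); note that import names differ from pip names, e.g. `faster_whisper` vs `faster-whisper`:

```python
import importlib.util

def missing_packages(names):
    """Return the subset of `names` that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Import names (not pip names) of the packages the pipeline needs.
required = ["torch", "faster_whisper", "ctranslate2", "transformers",
            "numpy", "scipy", "gradio", "requests"]
print(missing_packages(required))  # an empty list means you are ready to go
```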

Step 5: Basic Usage

Initialize the pipeline:

from enhanced_s2s_pipeline import EnhancedS2SPipeline

# Initialize with GPU support
pipeline = EnhancedS2SPipeline(device="cuda")

Simple translation example:

# Process audio file
for status, transcript, translation, output_audio in pipeline.process_speech_to_speech_realtime(
    audio_file="examples/input_audio/eng1.wav",
    source_lang="English",
    target_lang="French",
    whisper_model="small",
    nllb_model="600M"
):
    print(f"Status: {status}")
    print(f"Original: {transcript}")
    print(f"Translation: {translation}")

Step 6: Advanced Configuration

Configure model paths for your setup:

model_configs = {
    "nllb": {
        "600M": {
            "path": "./models/nllb-200-distilled-600M-ct2-int8",
            "size": "600M parameters",
            "speed": "Fast"
        },
        "1.3B": {
            "path": "./models/nllb-200-distilled-1.3B-ct2-int8",
            "size": "1.3B parameters", 
            "speed": "Medium"
        }
    }
}
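A small lookup over a config dict like the one above keeps model selection in one place. The helper below is hypothetical (its name and error handling are not part of the pipeline's API):

```python
# Mirrors the structure of the model_configs dict shown above.
model_configs = {
    "nllb": {
        "600M": {"path": "./models/nllb-200-distilled-600M-ct2-int8"},
        "1.3B": {"path": "./models/nllb-200-distilled-1.3B-ct2-int8"},
    }
}

def resolve_nllb_path(configs: dict, size: str) -> str:
    """Return the local path for the requested NLLB variant,
    listing the valid options on a bad key."""
    try:
        return configs["nllb"][size]["path"]
    except KeyError:
        raise ValueError(
            f"unknown NLLB size {size!r}; choose from {list(configs['nllb'])}"
        )
```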

Fine-tuned parameters for maximum quality:

# The pipeline is a generator, so iterate it (as in Step 5) to drive processing
for status, transcript, translation, output_audio in pipeline.process_speech_to_speech_realtime(
    audio_file="input.wav",
    source_lang="French",
    target_lang="English",
    whisper_model="small",          # Better accuracy
    nllb_model="1.3B",              # Highest quality
    whisper_beam_size=5,            # Thorough search
    nllb_beam_size=4,               # Quality translation
    length_penalty=1.2,             # Prefer complete sentences
    speaking_rate=1.0               # Natural speed
):
    result = (status, transcript, translation, output_audio)  # keep the latest update

Step 7: Model Performance Guide

Choose models based on your needs:

| Use Case         | Whisper | NLLB | Total Size | Speed     |
|------------------|---------|------|------------|-----------|
| Demo/Testing     | tiny    | 600M | ~699MB     | Very Fast |
| Production       | base    | 600M | ~768MB     | Fast      |
| High Quality     | small   | 1.3B | ~1.9GB     | Medium    |
| Maximum Accuracy | small   | 1.3B | ~1.9GB     | Slower    |
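These presets are easy to capture in code so callers never hardcode model names. The `PRESETS` dict and helper below are illustrative, not part of the pipeline:

```python
# Presets mirroring the performance table: use case -> (whisper model, nllb model).
PRESETS = {
    "demo": ("tiny", "600M"),
    "production": ("base", "600M"),
    "high_quality": ("small", "1.3B"),
    "max_accuracy": ("small", "1.3B"),
}

def pick_models(use_case: str, default: str = "production"):
    """Return the (whisper, nllb) pair for a use case, falling back to a default."""
    return PRESETS.get(use_case, PRESETS[default])
```

The chosen pair can then be passed straight to the `whisper_model` and `nllb_model` arguments shown in Step 5.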

Step 8: Testing Your Setup

Test with provided examples:

# Test English to French
$ python test_translation.py --input examples/input_audio/eng1.wav --source English --target French

# Test French to English  
$ python test_translation.py --input examples/input_audio/fr1.wav --source French --target English
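If you want to adapt or recreate `test_translation.py` (this sketch is not the script shipped in the repository, just a minimal argparse skeleton matching the flags used above):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Command-line interface mirroring the test commands above."""
    p = argparse.ArgumentParser(description="Speech-to-speech translation test")
    p.add_argument("--input", required=True, help="Path to an input audio file")
    p.add_argument("--source", required=True, choices=["English", "French"])
    p.add_argument("--target", required=True, choices=["English", "French"])
    return p

if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"Translating {args.input}: {args.source} -> {args.target}")
```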

Verify model loading:

# Check if models load correctly
pipeline = EnhancedS2SPipeline(device="cuda")
whisper_model = pipeline.get_whisper_model("tiny")
nllb_model = pipeline.get_nllb_model("600M")
print("βœ… All models loaded successfully!")

Step 9: Building Your Application

Create a Gradio interface:

import gradio as gr
from enhanced_s2s_pipeline import EnhancedS2SPipeline

pipeline = EnhancedS2SPipeline()

def translate_audio(audio_file, source_lang, target_lang):
    result = None
    # Consume the generator fully; the last yield holds the complete result
    for status, transcript, translation, output in pipeline.process_speech_to_speech_realtime(
        audio_file=audio_file,
        source_lang=source_lang,
        target_lang=target_lang
    ):
        result = (status, transcript, translation, output)
    return result

demo = gr.Interface(
    fn=translate_audio,
    inputs=[
        gr.Audio(type="filepath"),
        gr.Radio(["English", "French"]),
        gr.Radio(["English", "French"])
    ],
    outputs=[
        gr.Textbox(label="Status"),
        gr.Textbox(label="Original"),
        gr.Textbox(label="Translation"), 
        gr.Audio(label="Output")
    ]
)

demo.launch()

Step 10: Optimization Tips

For better performance:

  • Use GPU when available (2-4x faster)
  • Start with smaller models for testing
  • Use INT8 quantization for memory efficiency
  • Enable dynamic loading to save RAM
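The "dynamic loading" tip can be sketched as a tiny cache that builds each model on first use and evicts the oldest once a limit is reached. This class and its eviction policy are illustrative, not the pipeline's actual mechanism:

```python
class LazyModelCache:
    """Load each model on first use; keep at most `max_loaded` in memory."""

    def __init__(self, loaders, max_loaded=2):
        self._loaders = loaders   # name -> zero-argument factory function
        self._cache = {}          # name -> loaded model (insertion-ordered)
        self._max = max_loaded

    def get(self, name):
        if name not in self._cache:
            if len(self._cache) >= self._max:
                # Evict the oldest entry to free RAM before loading another.
                self._cache.pop(next(iter(self._cache)))
            self._cache[name] = self._loaders[name]()
        return self._cache[name]
```

With `max_loaded=2` you could, for example, keep one Whisper and one NLLB model resident while cycling through sizes during testing.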

Memory management:

# Load models only when needed
pipeline = EnhancedS2SPipeline(device="cuda")
# Models are loaded dynamically during first use

Batch processing for multiple files:

audio_files = ["file1.wav", "file2.wav", "file3.wav"]
for audio_file in audio_files:
    # The pipeline is a generator, so iterate each call to completion
    for status, transcript, translation, output in pipeline.process_speech_to_speech_realtime(
        audio_file=audio_file,
        source_lang="English",
        target_lang="French"
    ):
        pass  # the final iteration holds the completed result

Troubleshooting

Common issues and solutions:

❌ CUDA out of memory:

# Use smaller models or CPU
pipeline = EnhancedS2SPipeline(device="cpu")

❌ Model loading fails:

# Check file paths
$ ls -la models/nllb-200-distilled-600M-ct2-int8/

❌ Audio format issues:

# Convert audio to 16 kHz mono WAV
# (librosa.output.write_wav was removed in librosa 0.8; write with soundfile)
import librosa
import soundfile as sf
audio, sr = librosa.load("input.mp3", sr=16000)
sf.write("input.wav", audio, sr)

Language Support

Currently supported:

  • πŸ‡ΊπŸ‡Έ English ↔ πŸ‡«πŸ‡· French
  • High-quality bidirectional translation
  • Natural voice synthesis for both languages
  • Optimized models specifically for this language pair
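Under the hood, each component identifies these languages by its own code scheme: NLLB uses FLORES-200 codes, Whisper uses ISO 639-1, and MMS-TTS uses ISO 639-3 (matching the `mms-tts-eng`/`mms-tts-fra` model names). A mapping from the display names used in this README would be:

```python
# Per-model language codes for the two supported languages.
LANG_CODES = {
    "English": {"nllb": "eng_Latn", "whisper": "en", "mms_tts": "eng"},
    "French":  {"nllb": "fra_Latn", "whisper": "fr", "mms_tts": "fra"},
}

def nllb_code(display_name: str) -> str:
    """FLORES-200 code for a display-name language, e.g. 'French' -> 'fra_Latn'."""
    return LANG_CODES[display_name]["nllb"]
```

The pipeline's own API takes the display names ("English", "French") shown throughout this README; this table is only relevant if you call the underlying models directly.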

Citation

If you use these models in your work:

@misc{enhanced-speech-translation-models,
  title={Enhanced Speech-to-Speech Translation Models},
  author={pruthvi423},
  year={2025},
  url={https://huggingface.co/pruthvi423/speech-translation-models}
}

Contributing

Help improve this project:

  • Add support for new languages
  • Optimize model performance
  • Improve documentation
  • Report issues and bugs

License

This project is licensed under the Apache 2.0 License.

Acknowledgments

  • OpenAI for Whisper models
  • Meta for NLLB and MMS-TTS models
  • CTranslate2 team for optimization framework
  • Hugging Face for hosting and transformers library

Example interface

(Interface screenshots)

πŸš€ Ready to start translating speech? Clone this repository and follow the steps above!
