Enhanced Speech-to-Speech Translation Models
Complete Collection of Optimized AI Models for Real-Time Speech Translation
This repository contains pre-converted models and example audio files for building high-performance speech translation applications between English and French. Follow the steps carefully to implement the complete pipeline.
Before using these models, verify compatibility with your system requirements:
- CUDA-compatible GPU (recommended)
- Python 3.8+ environment
- At least 8GB RAM (16GB recommended for larger models)
Step 1: Repository Overview
This repository provides everything needed for speech-to-speech translation:
- Speech Recognition: Multiple Whisper model sizes (39MB to 1.5GB)
- Translation: NLLB models optimized with CTranslate2 (600M & 1.3B)
- Speech Synthesis: Meta's MMS-TTS for natural voice generation
- Example Files: Sample audio for testing and demonstration
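Conceptually, the three model stages chain together as speech → text → translated text → speech. A minimal sketch of that data flow (the function names here are illustrative stand-ins, not the repository's API):

```python
# Conceptual sketch of the three-stage pipeline; the callables are
# illustrative stand-ins for the Whisper, NLLB, and MMS-TTS stages.
def speech_to_speech(audio, transcribe, translate, synthesize):
    """Chain ASR -> MT -> TTS and return (translated_text, output_audio)."""
    text = transcribe(audio)             # Whisper: speech -> source text
    translated = translate(text)         # NLLB: source text -> target text
    return translated, synthesize(translated)  # MMS-TTS: text -> speech

# Toy stand-ins that make the data flow visible
result = speech_to_speech(
    "raw-audio-bytes",
    transcribe=lambda a: "hello world",
    translate=lambda t: "bonjour le monde",
    synthesize=lambda t: f"<audio:{t}>",
)
```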
Step 2: Understanding the Model Structure
Main Models Directory:
models/
├── whisper/                              # Speech Recognition Models
├── nllb-200-distilled-600M-ct2-int8/     # Translation Model (Fast)
├── nllb-200-distilled-1.3B-ct2-int8/     # Translation Model (Accurate)
├── models--facebook--mms-tts-eng/        # English Voice Synthesis
└── models--facebook--mms-tts-fra/        # French Voice Synthesis
Example Files Directory:
examples/
├── input_audio/      # Test audio files (English & French samples)
└── output_audio/     # Expected translation results
Step 3: Model Specifications
Whisper Models (Speech Recognition):
- tiny.pt (76MB) - Ultra-fast, good accuracy
- base.pt (145MB) - Balanced speed/accuracy
- small.pt (484MB) - High accuracy, moderate speed
- tiny.en.pt (76MB) - English-only variant
- base.en.pt (145MB) - English-only variant
NLLB Models (Translation):
- 600M (623MB) - Fast translation, good quality
- 1.3B (1.4GB) - Slower but excellent accuracy
MMS-TTS Models (Voice Synthesis):
- English (145MB) - Natural English voice
- French (145MB) - Natural French voice
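To confirm a download completed, you can compare a model directory's on-disk size against the figures above. A small helper for that (not part of the repository):

```python
import os

def dir_size_mb(path: str) -> float:
    """Total size of all files under `path`, in megabytes."""
    total = 0
    for root, _, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / 1e6
```

For example, `dir_size_mb("models/nllb-200-distilled-600M-ct2-int8")` should report roughly 623MB for a complete download.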
Step 4: Installation Requirements
Install PyTorch first, picking the build that matches your hardware:
$ pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cpu    # CPU inference
$ pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu126  # GPU inference (CUDA 12.6)
Then install the remaining dependencies:
$ pip install faster-whisper==1.1.1 ctranslate2==4.4.0 transformers==4.52.3 numpy==2.2.6 scipy==1.15.3 gradio==5.31.0 requests==2.32.3
Or install all at once:
$ pip install -r requirements.txt
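After installing, a quick sanity check that every required package is importable (a generic helper, not part of the repository):

```python
from importlib.util import find_spec

# Module names as imported, which can differ from the pip package name
# (pip installs faster-whisper, Python imports faster_whisper)
REQUIRED = ["torch", "faster_whisper", "ctranslate2",
            "transformers", "numpy", "scipy", "gradio", "requests"]

def missing_modules(names):
    """Return the subset of `names` that cannot be imported."""
    return [n for n in names if find_spec(n) is None]
```

Running `print(missing_modules(REQUIRED) or "all dependencies present")` reports anything still missing.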
Step 5: Basic Usage
Initialize the pipeline:
from enhanced_s2s_pipeline import EnhancedS2SPipeline
# Initialize with GPU support
pipeline = EnhancedS2SPipeline(device="cuda")
Simple translation example:
# Process audio file
for status, transcript, translation, output_audio in pipeline.process_speech_to_speech_realtime(
    audio_file="examples/input_audio/eng1.wav",
    source_lang="English",
    target_lang="French",
    whisper_model="small",
    nllb_model="600M"
):
    print(f"Status: {status}")
    print(f"Original: {transcript}")
    print(f"Translation: {translation}")
Step 6: Advanced Configuration
Configure model paths for your setup:
model_configs = {
    "nllb": {
        "600M": {
            "path": "./models/nllb-200-distilled-600M-ct2-int8",
            "size": "600M parameters",
            "speed": "Fast"
        },
        "1.3B": {
            "path": "./models/nllb-200-distilled-1.3B-ct2-int8",
            "size": "1.3B parameters",
            "speed": "Medium"
        }
    }
}
Fine-tuned parameters for maximum quality:
# The pipeline yields updates, so iterate and keep the final result
for result in pipeline.process_speech_to_speech_realtime(
    audio_file="input.wav",
    source_lang="French",
    target_lang="English",
    whisper_model="small",    # Better accuracy
    nllb_model="1.3B",        # Highest quality
    whisper_beam_size=5,      # Thorough search
    nllb_beam_size=4,         # Quality translation
    length_penalty=1.2,       # Prefer complete sentences
    speaking_rate=1.0         # Natural speed
):
    pass  # `result` holds the latest (status, transcript, translation, audio)
Step 7: Model Performance Guide
Choose models based on your needs:
| Use Case | Whisper | NLLB | Total Size | Speed |
|---|---|---|---|---|
| Demo/Testing | tiny | 600M | ~699MB | Very Fast |
| Production | base | 600M | ~768MB | Fast |
| High Quality | small | 1.3B | ~1.9GB | Medium |
| Maximum Accuracy | small | 1.3B | ~1.9GB | Slower |
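The table can be encoded as presets so application code doesn't hard-code model names. An illustrative helper (not part of the repository's API):

```python
# Use-case presets from the table above: (whisper_model, nllb_model)
MODEL_PRESETS = {
    "demo":         ("tiny",  "600M"),
    "production":   ("base",  "600M"),
    "high_quality": ("small", "1.3B"),
}

def pick_models(use_case):
    """Look up a preset, failing loudly on unknown use cases."""
    if use_case not in MODEL_PRESETS:
        raise ValueError(f"unknown use case: {use_case!r}")
    return MODEL_PRESETS[use_case]
```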
Step 8: Testing Your Setup
Test with provided examples:
# Test English to French
$ python test_translation.py --input examples/input_audio/eng1.wav --source English --target French
# Test French to English
$ python test_translation.py --input examples/input_audio/fr1.wav --source French --target English
Verify model loading:
# Check if models load correctly
pipeline = EnhancedS2SPipeline(device="cuda")
whisper_model = pipeline.get_whisper_model("tiny")
nllb_model = pipeline.get_nllb_model("600M")
print("All models loaded successfully!")
Step 9: Building Your Application
Create a Gradio interface:
import gradio as gr
from enhanced_s2s_pipeline import EnhancedS2SPipeline
pipeline = EnhancedS2SPipeline()
def translate_audio(audio_file, source_lang, target_lang):
    result = None
    for result in pipeline.process_speech_to_speech_realtime(
        audio_file=audio_file,
        source_lang=source_lang,
        target_lang=target_lang
    ):
        pass  # drain the generator so `result` is the final update
    return result
demo = gr.Interface(
    fn=translate_audio,
    inputs=[
        gr.Audio(type="filepath", label="Input audio"),
        gr.Radio(["English", "French"], label="Source language"),
        gr.Radio(["English", "French"], label="Target language")
    ],
    outputs=[
        gr.Textbox(label="Status"),
        gr.Textbox(label="Original"),
        gr.Textbox(label="Translation"),
        gr.Audio(label="Output")
    ]
)
demo.launch()
Step 10: Optimization Tips
For better performance:
- Use GPU when available (2-4x faster)
- Start with smaller models for testing
- Use INT8 quantization for memory efficiency
- Enable dynamic loading to save RAM
Memory management:
# Load models only when needed
pipeline = EnhancedS2SPipeline(device="cuda")
# Models are loaded dynamically during first use
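That dynamic-loading behaviour can be sketched as a small cache that builds each model on first request (illustrative only; the pipeline's internals may differ):

```python
class LazyModelCache:
    """Create each model on first use, then reuse the same instance."""

    def __init__(self, loaders):
        self._loaders = loaders  # name -> zero-argument factory
        self._models = {}

    def get(self, name):
        if name not in self._models:
            self._models[name] = self._loaders[name]()  # load on demand
        return self._models[name]

# Toy factories standing in for real model constructors
cache = LazyModelCache({
    "whisper-tiny": lambda: "whisper-tiny-instance",
    "nllb-600M": lambda: "nllb-600M-instance",
})
```

Because nothing is constructed until `get()` is called, unused models never occupy RAM.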
Batch processing for multiple files:
audio_files = ["file1.wav", "file2.wav", "file3.wav"]
for audio_file in audio_files:
    for result in pipeline.process_speech_to_speech_realtime(
        audio_file=audio_file,
        source_lang="English",
        target_lang="French"
    ):
        pass  # `result` ends as the final (status, transcript, translation, audio)
Troubleshooting
Common issues and solutions:
CUDA out of memory:
# Use smaller models or CPU
pipeline = EnhancedS2SPipeline(device="cpu")
Model loading fails:
# Check file paths
$ ls -la models/nllb-200-distilled-600M-ct2-int8/
Audio format issues:
# Convert audio to a supported format. librosa can decode MP3, but its old
# librosa.output.write_wav API was removed; write with soundfile instead
# (neither package is in requirements.txt, so install them separately).
import librosa
import soundfile as sf
audio, sr = librosa.load("input.mp3", sr=16000)
sf.write("input.wav", audio, sr)
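If you would rather stay within requirements.txt, scipy (which is listed there) can write the WAV itself; it cannot decode MP3, so this sketch assumes you already have the samples as a NumPy array:

```python
import numpy as np
from scipy.io import wavfile

# Write a 16 kHz mono WAV; scipy expects int16 (or float32) samples
sr = 16000
samples = np.zeros(sr, dtype=np.float32)  # one second of silence, in [-1, 1]
wavfile.write("input.wav", sr, (samples * 32767).astype(np.int16))
```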
Language Support
Currently supported:
- English ↔ French
- High-quality bidirectional translation
- Natural voice synthesis for both languages
- Optimized models specifically for this language pair
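Under the hood, NLLB-200 identifies languages by FLORES-200 codes rather than English names. A small mapping like the following converts the UI names to model codes (the two codes are standard; the helper itself is illustrative, not the repository's API):

```python
# FLORES-200 language codes used by NLLB-200
NLLB_LANG_CODES = {
    "English": "eng_Latn",
    "French": "fra_Latn",
}

def to_nllb_code(language: str) -> str:
    """Translate a UI language name to its NLLB-200 code."""
    return NLLB_LANG_CODES[language]
```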
Citation
If you use these models in your work:
@misc{enhanced-speech-translation-models,
  title={Enhanced Speech-to-Speech Translation Models},
  author={pruthvi423},
  year={2025},
  url={https://huggingface.co/pruthvi423/speech-translation-models}
}
Contributing
Help improve this project:
- Add support for new languages
- Optimize model performance
- Improve documentation
- Report issues and bugs
License
This project is licensed under the Apache 2.0 License.
Acknowledgments
- OpenAI for Whisper models
- Meta for NLLB and MMS-TTS models
- CTranslate2 team for optimization framework
- Hugging Face for hosting and transformers library
Example interface
Ready to start translating speech? Clone this repository and follow the steps above!

