Enhanced Speech-to-Speech Translation Models
Complete Collection of Optimized AI Models for Real-Time Speech Translation
This repository contains pre-converted models and example audio files for building high-performance speech translation applications between English and French. Follow the steps carefully to implement the complete pipeline.
Before using these models, verify compatibility with your system requirements:
- CUDA-compatible GPU (recommended)
- Python 3.8+ environment
- At least 8GB RAM (16GB recommended for larger models)
Step 1: Repository Overview
This repository provides everything needed for speech-to-speech translation:
- Speech Recognition: Multiple Whisper model sizes (39MB to 1.5GB)
- Translation: NLLB models optimized with CTranslate2 (600M & 1.3B)
- Speech Synthesis: Meta's MMS-TTS for natural voice generation
- Example Files: Sample audio for testing and demonstration
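Conceptually, the three model stages chain together as speech → text → translated text → speech. A minimal sketch of that data flow (the function names here are illustrative stand-ins, not the repository's API):

```python
# Conceptual sketch of the three-stage pipeline; the callables are
# illustrative stand-ins for the Whisper, NLLB, and MMS-TTS stages.
def speech_to_speech(audio, transcribe, translate, synthesize):
    """Chain ASR -> MT -> TTS and return (translated_text, output_audio)."""
    text = transcribe(audio)             # Whisper: speech -> source text
    translated = translate(text)         # NLLB: source text -> target text
    return translated, synthesize(translated)  # MMS-TTS: text -> speech

# Toy stand-ins that make the data flow visible
result = speech_to_speech(
    "raw-audio-bytes",
    transcribe=lambda a: "hello world",
    translate=lambda t: "bonjour le monde",
    synthesize=lambda t: f"<audio:{t}>",
)
```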
Step 2: Understanding the Model Structure
Main Models Directory:
models/
├── whisper/                              # Speech Recognition Models
├── nllb-200-distilled-600M-ct2-int8/     # Translation Model (Fast)
├── nllb-200-distilled-1.3B-ct2-int8/     # Translation Model (Accurate)
├── models--facebook--mms-tts-eng/        # English Voice Synthesis
└── models--facebook--mms-tts-fra/        # French Voice Synthesis
Example Files Directory:
examples/
├── input_audio/      # Test audio files (English & French samples)
└── output_audio/     # Expected translation results
Step 3: Model Specifications
Whisper Models (Speech Recognition):
- tiny.pt (76MB) - Ultra-fast, good accuracy
- base.pt (145MB) - Balanced speed/accuracy
- small.pt (484MB) - High accuracy, moderate speed
- tiny.en.pt (76MB) - English-only variant
- base.en.pt (145MB) - English-only variant
NLLB Models (Translation):
- 600M (623MB) - Fast translation, good quality
- 1.3B (1.4GB) - Slower but excellent accuracy
MMS-TTS Models (Voice Synthesis):
- English (145MB) - Natural English voice
- French (145MB) - Natural French voice
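To confirm a download completed, you can compare a model directory's on-disk size against the figures above. A small helper for that (not part of the repository):

```python
import os

def dir_size_mb(path: str) -> float:
    """Total size of all files under `path`, in megabytes."""
    total = 0
    for root, _, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / 1e6
```

For example, `dir_size_mb("models/nllb-200-distilled-600M-ct2-int8")` should report roughly 623MB for a complete download.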
Step 4: Installation Requirements
Install PyTorch first, picking the build that matches your hardware:
$ pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cpu    # CPU inference
$ pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu126  # GPU inference (CUDA 12.6)
Then install the remaining dependencies:
$ pip install faster-whisper==1.1.1 ctranslate2==4.4.0 transformers==4.52.3 numpy==2.2.6 scipy==1.15.3 gradio==5.31.0 requests==2.32.3
Or install all at once:
$ pip install -r requirements.txt
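After installing, a quick sanity check that every required package is importable (a generic helper, not part of the repository):

```python
from importlib.util import find_spec

# Module names as imported, which can differ from the pip package name
# (pip installs faster-whisper, Python imports faster_whisper)
REQUIRED = ["torch", "faster_whisper", "ctranslate2",
            "transformers", "numpy", "scipy", "gradio", "requests"]

def missing_modules(names):
    """Return the subset of `names` that cannot be imported."""
    return [n for n in names if find_spec(n) is None]
```

Running `print(missing_modules(REQUIRED) or "all dependencies present")` reports anything still missing.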
Step 5: Basic Usage
Initialize the pipeline:
from enhanced_s2s_pipeline import EnhancedS2SPipeline
# Initialize with GPU support
pipeline = EnhancedS2SPipeline(device="cuda")
Simple translation example:
# Process audio file
for status, transcript, translation, output_audio in pipeline.process_speech_to_speech_realtime(
    audio_file="examples/input_audio/eng1.wav",
    source_lang="English",
    target_lang="French",
    whisper_model="small",
    nllb_model="600M"
):
    print(f"Status: {status}")
    print(f"Original: {transcript}")
    print(f"Translation: {translation}")
Step 6: Advanced Configuration
Configure model paths for your setup:
model_configs = {
    "nllb": {
        "600M": {
            "path": "./models/nllb-200-distilled-600M-ct2-int8",
            "size": "600M parameters",
            "speed": "Fast"
        },
        "1.3B": {
            "path": "./models/nllb-200-distilled-1.3B-ct2-int8",
            "size": "1.3B parameters",
            "speed": "Medium"
        }
    }
}
Fine-tuned parameters for maximum quality:
# The pipeline yields updates, so iterate and keep the final result
for result in pipeline.process_speech_to_speech_realtime(
    audio_file="input.wav",
    source_lang="French",
    target_lang="English",
    whisper_model="small",    # Better accuracy
    nllb_model="1.3B",        # Highest quality
    whisper_beam_size=5,      # Thorough search
    nllb_beam_size=4,         # Quality translation
    length_penalty=1.2,       # Prefer complete sentences
    speaking_rate=1.0         # Natural speed
):
    pass  # `result` holds the latest (status, transcript, translation, audio)
Step 7: Model Performance Guide
Choose models based on your needs:
| Use Case | Whisper | NLLB | Total Size | Speed |
|---|---|---|---|---|
| Demo/Testing | tiny | 600M | ~699MB | Very Fast |
| Production | base | 600M | ~768MB | Fast |
| High Quality | small | 1.3B | ~1.9GB | Medium |
| Maximum Accuracy | small | 1.3B | ~1.9GB | Slower |
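The table can be encoded as presets so application code doesn't hard-code model names. An illustrative helper (not part of the repository's API):

```python
# Use-case presets from the table above: (whisper_model, nllb_model)
MODEL_PRESETS = {
    "demo":         ("tiny",  "600M"),
    "production":   ("base",  "600M"),
    "high_quality": ("small", "1.3B"),
}

def pick_models(use_case):
    """Look up a preset, failing loudly on unknown use cases."""
    if use_case not in MODEL_PRESETS:
        raise ValueError(f"unknown use case: {use_case!r}")
    return MODEL_PRESETS[use_case]
```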
Step 8: Testing Your Setup
Test with provided examples:
# Test English to French
$ python test_translation.py --input examples/input_audio/eng1.wav --source English --target French
# Test French to English
$ python test_translation.py --input examples/input_audio/fr1.wav --source French --target English
Verify model loading:
# Check if models load correctly
pipeline = EnhancedS2SPipeline(device="cuda")
whisper_model = pipeline.get_whisper_model("tiny")
nllb_model = pipeline.get_nllb_model("600M")
print("All models loaded successfully!")
Step 9: Building Your Application
Create a Gradio interface:
import gradio as gr
from enhanced_s2s_pipeline import EnhancedS2SPipeline
pipeline = EnhancedS2SPipeline()
def translate_audio(audio_file, source_lang, target_lang):
    result = None
    for result in pipeline.process_speech_to_speech_realtime(
        audio_file=audio_file,
        source_lang=source_lang,
        target_lang=target_lang
    ):
        pass  # drain the generator so `result` is the final update
    return result
demo = gr.Interface(
    fn=translate_audio,
    inputs=[
        gr.Audio(type="filepath", label="Input audio"),
        gr.Radio(["English", "French"], label="Source language"),
        gr.Radio(["English", "French"], label="Target language")
    ],
    outputs=[
        gr.Textbox(label="Status"),
        gr.Textbox(label="Original"),
        gr.Textbox(label="Translation"),
        gr.Audio(label="Output")
    ]
)
demo.launch()
Step 10: Optimization Tips
For better performance:
- Use GPU when available (2-4x faster)
- Start with smaller models for testing
- Use INT8 quantization for memory efficiency
- Enable dynamic loading to save RAM
Memory management:
# Load models only when needed
pipeline = EnhancedS2SPipeline(device="cuda")
# Models are loaded dynamically during first use
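That dynamic-loading behaviour can be sketched as a small cache that builds each model on first request (illustrative only; the pipeline's internals may differ):

```python
class LazyModelCache:
    """Create each model on first use, then reuse the same instance."""

    def __init__(self, loaders):
        self._loaders = loaders  # name -> zero-argument factory
        self._models = {}

    def get(self, name):
        if name not in self._models:
            self._models[name] = self._loaders[name]()  # load on demand
        return self._models[name]

# Toy factories standing in for real model constructors
cache = LazyModelCache({
    "whisper-tiny": lambda: "whisper-tiny-instance",
    "nllb-600M": lambda: "nllb-600M-instance",
})
```

Because nothing is constructed until `get()` is called, unused models never occupy RAM.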
Batch processing for multiple files:
audio_files = ["file1.wav", "file2.wav", "file3.wav"]
for audio_file in audio_files:
    for result in pipeline.process_speech_to_speech_realtime(
        audio_file=audio_file,
        source_lang="English",
        target_lang="French"
    ):
        pass  # `result` ends as the final (status, transcript, translation, audio)
Troubleshooting
Common issues and solutions:
CUDA out of memory:
# Use smaller models or CPU
pipeline = EnhancedS2SPipeline(device="cpu")
Model loading fails:
# Check file paths
$ ls -la models/nllb-200-distilled-600M-ct2-int8/
Audio format issues:
# Convert audio to a supported format. librosa can decode MP3, but its old
# librosa.output.write_wav API was removed; write with soundfile instead
# (neither package is in requirements.txt, so install them separately).
import librosa
import soundfile as sf
audio, sr = librosa.load("input.mp3", sr=16000)
sf.write("input.wav", audio, sr)
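If you would rather stay within requirements.txt, scipy (which is listed there) can write the WAV itself; it cannot decode MP3, so this sketch assumes you already have the samples as a NumPy array:

```python
import numpy as np
from scipy.io import wavfile

# Write a 16 kHz mono WAV; scipy expects int16 (or float32) samples
sr = 16000
samples = np.zeros(sr, dtype=np.float32)  # one second of silence, in [-1, 1]
wavfile.write("input.wav", sr, (samples * 32767).astype(np.int16))
```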
Language Support
Currently supported:
- English ↔ French
- High-quality bidirectional translation
- Natural voice synthesis for both languages
- Optimized models specifically for this language pair
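Under the hood, NLLB-200 identifies languages by FLORES-200 codes rather than English names. A small mapping like the following converts the UI names to model codes (the two codes are standard; the helper itself is illustrative, not the repository's API):

```python
# FLORES-200 language codes used by NLLB-200
NLLB_LANG_CODES = {
    "English": "eng_Latn",
    "French": "fra_Latn",
}

def to_nllb_code(language: str) -> str:
    """Translate a UI language name to its NLLB-200 code."""
    return NLLB_LANG_CODES[language]
```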
Citation
If you use these models in your work:
@misc{enhanced-speech-translation-models,
  title={Enhanced Speech-to-Speech Translation Models},
  author={pruthvi423},
  year={2025},
  url={https://huggingface.co/pruthvi423/speech-translation-models}
}
Contributing
Help improve this project:
- Add support for new languages
- Optimize model performance
- Improve documentation
- Report issues and bugs
License
This project is licensed under the Apache 2.0 License.
Acknowledgments
- OpenAI for Whisper models
- Meta for NLLB and MMS-TTS models
- CTranslate2 team for optimization framework
- Hugging Face for hosting and transformers library
Example interface
Ready to start translating speech? Clone this repository and follow the steps above!

