Speech-To-Text / README.md
aryan083's picture
Fix YAML metadata duplication in README.md
547723c
metadata
title: Indian Speech-to-Text Models
emoji: 🎀
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
license: mit
tags:
  - speech-to-text
  - indian-languages
  - hindi
  - tamil
  - bengali
  - whisper
  - wav2vec2
  - gradio
  - transformers
  - tensorflow
models:
  - distil-whisper/distil-large-v3
  - openai/whisper-large-v3
  - ai4bharat/indicwav2vec-hindi
datasets:
  - mozilla-foundation/common_voice_11_0

🎀 Complete Guide to Free Open-Source Speech-to-Text Models for Indian Languages

A comprehensive web application showcasing free, open-source speech-to-text models optimized for Indian languages. Built with TensorFlow, Gradio, and Transformers.

🌟 Features

  • 8 Free Models: Distil-Whisper, OpenAI Whisper, Wav2Vec2, SeamlessM4T, and more
  • 13 Indian Languages: Hindi, Tamil, Bengali, Telugu, Marathi, Gujarati, Kannada, Malayalam, Punjabi, Odia, Assamese, Urdu, English
  • Real-time Processing: Live audio recording and file upload
  • Batch Processing: Process multiple audio files simultaneously
  • Model Comparison: Performance metrics and recommendations
  • TensorFlow Integration: Optimized inference and deployment
  • Commercial License: All models are free for commercial use

πŸš€ Quick Start

Prerequisites

  • Python 3.8+
  • CUDA (optional, for GPU acceleration)

Installation

  1. Clone the repository
git clone <repository-url>
cd GENAI
  1. Install dependencies
pip install -r requirements.txt
  1. Set environment
# Windows
set APP_ENV=local

# Linux/Mac
export APP_ENV=local
  1. Run the application
python app.py
  1. Open your browser Navigate to http://127.0.0.1:7860

πŸ“ Project Structure

GENAI/
β”œβ”€β”€ app.py                          # Main application entry point
β”œβ”€β”€ requirements.txt                # Python dependencies
β”œβ”€β”€ configs/
β”‚   └── envs/
β”‚       β”œβ”€β”€ .env.local             # Local environment config
β”‚       └── .env.dev               # Development environment config
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ models/
β”‚   β”‚   └── speech_to_text.py      # Core STT model implementation
β”‚   β”œβ”€β”€ ui/
β”‚   β”‚   └── gradio_app.py          # Gradio web interface
β”‚   β”œβ”€β”€ utils/
β”‚   β”‚   β”œβ”€β”€ config.py              # Configuration management
β”‚   β”‚   └── audio_utils.py         # Audio preprocessing utilities
β”‚   └── tensorflow_integration.py   # TensorFlow optimization
└── models/                         # Model cache directory (auto-created)

🎯 Available Models

Model Type Size Languages Description
distil-whisper Whisper 769M 99 6x faster, 49% smaller, <1% WER difference
whisper-free Whisper 1550M 99 Best accuracy, supports 99 languages
whisper-small Whisper 244M 99 Balanced performance, good for CPU
wav2vec2-hindi Wav2Vec2 300M 1 Specialized for Hindi, AI4Bharat model
wav2vec2-improved Wav2Vec2 300M 1 Improved Hindi model, 54% WER
seamless SeamlessM4T 2.3B 101 Meta's unified model
speecht5 SpeechT5 200M 10+ Microsoft's unified speech model

🌐 Web Interface

Single Audio Transcription

  • Upload audio files or record live
  • Select from 8 different models
  • Choose from 13 Indian languages
  • Enable audio preprocessing for better results
  • View real-time processing statistics

Batch Processing

  • Upload multiple audio files
  • Process all files with selected model
  • Download results in markdown format
  • Progress tracking and error handling

Model Comparison

  • Performance metrics table
  • Model recommendations
  • Technical specifications
  • Use case guidelines

πŸ”§ Configuration

The application follows environment-specific configuration:

Local Development (.env.local)

APP_ENV=local
DEBUG=True
GRADIO_SERVER_NAME=127.0.0.1
GRADIO_SERVER_PORT=7860
DEFAULT_MODEL=distil-whisper
ENABLE_GPU=True

Production (.env.prod)

APP_ENV=prod
DEBUG=False
GRADIO_SERVER_NAME=0.0.0.0
GRADIO_SHARE=True
DEFAULT_MODEL=distil-whisper
ENABLE_GPU=True

πŸ’» Usage Examples

Python API Usage

from src.models.speech_to_text import FreeIndianSpeechToText

# Initialize model
asr = FreeIndianSpeechToText(model_type="distil-whisper")

# Transcribe audio
result = asr.transcribe("audio.wav", language_code="hi")
print(result["text"])

# Batch processing
results = asr.batch_transcribe(["file1.wav", "file2.wav"], "hi")

# Switch models
asr.switch_model("wav2vec2-hindi")

# Get model info
info = asr.get_model_info()

TensorFlow Integration

from src.tensorflow_integration import TensorFlowOptimizer

# Optimize model for deployment
optimizer = TensorFlowOptimizer()
optimizer.convert_to_tensorflow_lite("model_path", "optimized.tflite")

# Benchmark performance
metrics = optimizer.benchmark_model(model, input_shape)

🎡 Supported Audio Formats

  • Input: WAV, MP3, FLAC, M4A, OGG
  • Sample Rate: Automatically resampled to 16kHz
  • Channels: Mono (stereo converted automatically)
  • Duration: Up to 5 minutes per file
  • File Size: Maximum 100MB

πŸš€ Performance Optimization

GPU Acceleration

# Enable CUDA for faster processing
ENABLE_GPU=True

# Mixed precision for memory efficiency
torch_dtype=torch.float16

CPU Optimization

# Use smaller models for CPU-only deployment
model_type="whisper-small"
torch_dtype=torch.float32

Batch Processing

# Process multiple files efficiently
results = asr.batch_transcribe(audio_files, language_code)

πŸ“Š Model Recommendations

Best Overall Choice

Distil-Whisper Large-v3

  • 6x faster than original Whisper
  • 49% smaller model size
  • <1% WER difference
  • Excellent for production deployment

Best Accuracy

OpenAI Whisper Large-v3

  • State-of-the-art accuracy
  • Supports 99 languages
  • Best for complex audio scenarios

Hindi Specialized

Wav2Vec2 Hindi Models

  • Optimized specifically for Hindi
  • Lower computational requirements
  • Good for Hindi-only applications

CPU Deployment

Whisper Small

  • Balanced performance on CPU
  • 244M parameters
  • Good accuracy-speed tradeoff

πŸ”’ Commercial Usage

All models in this application are available under permissive licenses:

  • MIT License: Distil-Whisper, OpenAI Whisper, Wav2Vec2 models
  • Apache 2.0: TensorFlow components
  • CC BY-NC: SeamlessM4T (non-commercial)

βœ… Commercial use allowed for most models βœ… No API costs or usage limits βœ… Full model ownership and customization rights

πŸ› οΈ Development

Adding New Models

  1. Add model configuration
# In speech_to_text.py
"new-model": {
    "model_id": "huggingface/model-name",
    "type": "whisper",  # or wav2vec2, etc.
    "description": "Model description",
    "languages": 50,
    "size": "500M"
}
  1. Implement loading logic
def _load_new_model_type(self):
    # Model-specific loading code
    pass
  1. Add to Gradio interface
# In gradio_app.py
model_choices = [..., "new-model"]

Environment Setup

# Development environment
export APP_ENV=dev
python app.py

# Production environment  
export APP_ENV=prod
python app.py

Testing

# Run basic functionality test
python -c "
from src.models.speech_to_text import FreeIndianSpeechToText
asr = FreeIndianSpeechToText('distil-whisper')
print('βœ… Model loaded successfully')
"

πŸ› Troubleshooting

Common Issues

CUDA Out of Memory

# Use CPU or smaller model
ENABLE_GPU=False
# or
model_type="whisper-small"

Model Download Fails

# Check internet connection and try again
# Models are cached after first download

Audio Format Not Supported

# Convert audio using ffmpeg
ffmpeg -i input.mp4 -ar 16000 -ac 1 output.wav

Gradio Interface Not Loading

# Check if port is available
netstat -an | grep 7860

# Try different port
GRADIO_SERVER_PORT=7861

πŸ“ˆ Performance Benchmarks

Processing Speed (RTF - Real Time Factor)

  • Distil-Whisper: 0.17 RTF (6x faster than real-time)
  • Whisper Large: 1.0 RTF (real-time)
  • Wav2Vec2: 0.3 RTF (3x faster than real-time)

Memory Usage

  • Distil-Whisper: ~2GB GPU memory
  • Whisper Large: ~4GB GPU memory
  • Wav2Vec2: ~1GB GPU memory

Accuracy (Word Error Rate)

  • Distil-Whisper: 8.5% WER on Hindi
  • Whisper Large: 8.1% WER on Hindi
  • Wav2Vec2 Hindi: 12% WER on Hindi

🀝 Contributing

  1. Fork the repository
  2. Create feature branch (git checkout -b feature/amazing-feature)
  3. Commit changes (git commit -m 'Add amazing feature')
  4. Push to branch (git push origin feature/amazing-feature)
  5. Open Pull Request

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

Individual models may have different licenses:

  • Distil-Whisper: MIT
  • OpenAI Whisper: MIT
  • Wav2Vec2: MIT
  • SeamlessM4T: CC BY-NC 4.0

πŸ™ Acknowledgments

  • Hugging Face for the Transformers library and model hosting
  • OpenAI for the Whisper models
  • Meta for SeamlessM4T and Wav2Vec2
  • AI4Bharat for Indian language models
  • Gradio for the web interface framework

πŸ“ž Support


Made with ❀️ for the Indian AI community