metadata

title: Indian Speech-to-Text Models
emoji: 🎤
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
license: mit
tags:
  - speech-to-text
  - indian-languages
  - hindi
  - tamil
  - bengali
  - whisper
  - wav2vec2
  - gradio
  - transformers
  - tensorflow
models:
  - distil-whisper/distil-large-v3
  - openai/whisper-large-v3
  - ai4bharat/indicwav2vec-hindi
datasets:
  - mozilla-foundation/common_voice_11_0

🎤 Complete Guide to Free Open-Source Speech-to-Text Models for Indian Languages

A comprehensive web application showcasing free, open-source speech-to-text models optimized for Indian languages. Built with TensorFlow, Gradio, and Transformers.

🌟 Features

8 Free Models: Distil-Whisper, OpenAI Whisper, Wav2Vec2, SeamlessM4T, and more
13 Indian Languages: Hindi, Tamil, Bengali, Telugu, Marathi, Gujarati, Kannada, Malayalam, Punjabi, Odia, Assamese, Urdu, English
Real-time Processing: Live audio recording and file upload
Batch Processing: Process multiple audio files simultaneously
Model Comparison: Performance metrics and recommendations
TensorFlow Integration: Optimized inference and deployment
Commercial License: Most models are free for commercial use; SeamlessM4T is CC BY-NC (non-commercial)

🚀 Quick Start

Prerequisites

Python 3.8+
CUDA (optional, for GPU acceleration)

Installation

Clone the repository

git clone <repository-url>
cd GENAI

Install dependencies

pip install -r requirements.txt

Set environment

# Windows
set APP_ENV=local

# Linux/Mac
export APP_ENV=local

Run the application

python app.py

Open your browser and navigate to http://127.0.0.1:7860

📁 Project Structure

GENAI/
├── app.py                          # Main application entry point
├── requirements.txt                # Python dependencies
├── configs/
│   └── envs/
│       ├── .env.local             # Local environment config
│       └── .env.dev               # Development environment config
├── src/
│   ├── models/
│   │   └── speech_to_text.py      # Core STT model implementation
│   ├── ui/
│   │   └── gradio_app.py          # Gradio web interface
│   ├── utils/
│   │   ├── config.py              # Configuration management
│   │   └── audio_utils.py         # Audio preprocessing utilities
│   └── tensorflow_integration.py   # TensorFlow optimization
└── models/                         # Model cache directory (auto-created)

🎯 Available Models

| Model | Type | Size (parameters) | Languages | Description |
|---|---|---|---|---|
| distil-whisper | Whisper | 769M | 99 | 6x faster, 49% smaller, <1% WER difference |
| whisper-free | Whisper | 1550M | 99 | Best accuracy, supports 99 languages |
| whisper-small | Whisper | 244M | 99 | Balanced performance, good for CPU |
| wav2vec2-hindi | Wav2Vec2 | 300M | 1 | Specialized for Hindi, AI4Bharat model |
| wav2vec2-improved | Wav2Vec2 | 300M | 1 | Improved Hindi model, 54% WER |
| seamless | SeamlessM4T | 2.3B | 101 | Meta's unified model |
| speecht5 | SpeechT5 | 200M | 10+ | Microsoft's unified speech model |

🌐 Web Interface

Single Audio Transcription

Upload audio files or record live
Select from 8 different models
Choose from 13 Indian languages
Enable audio preprocessing for better results
View real-time processing statistics

Batch Processing

Upload multiple audio files
Process all files with selected model
Download results in markdown format
Progress tracking and error handling
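
The batch flow described above (process every file, track progress, survive individual failures, emit markdown) could look roughly like the sketch below. It is an illustration, not the application's actual code: `batch_to_markdown` is a hypothetical helper name, and `transcribe_fn` stands in for any callable that takes a path and returns a dict with a `"text"` key.

```python
from typing import Callable, Dict, Iterable, List


def batch_to_markdown(
    paths: Iterable[str],
    transcribe_fn: Callable[[str], Dict[str, str]],
) -> str:
    """Transcribe each file, keep going on errors, return a markdown report."""
    path_list = list(paths)
    lines: List[str] = ["# Batch transcription results", ""]
    for index, path in enumerate(path_list, start=1):
        # The "index/total" heading doubles as simple progress tracking.
        lines.append(f"## {index}/{len(path_list)}: {path}")
        try:
            lines.append(transcribe_fn(path)["text"])
        except Exception as exc:  # one bad file should not stop the batch
            lines.append(f"ERROR: {exc}")
        lines.append("")
    return "\n".join(lines)
```

Catching the exception per file means a corrupt upload shows up as an `ERROR:` entry in the report instead of aborting the remaining files.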

Model Comparison

Performance metrics table
Model recommendations
Technical specifications
Use case guidelines

🔧 Configuration

The application uses environment-specific configuration files:

Local Development (`.env.local`)

APP_ENV=local
DEBUG=True
GRADIO_SERVER_NAME=127.0.0.1
GRADIO_SERVER_PORT=7860
DEFAULT_MODEL=distil-whisper
ENABLE_GPU=True

Production (`.env.prod`)

APP_ENV=prod
DEBUG=False
GRADIO_SERVER_NAME=0.0.0.0
GRADIO_SHARE=True
DEFAULT_MODEL=distil-whisper
ENABLE_GPU=True
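
How `src/utils/config.py` actually resolves these files is not shown in this README. As a minimal sketch, assuming the file name is derived from `APP_ENV` and the files live under `configs/envs/`, the lookup and parsing could work as follows. The function names `env_file_for`, `parse_env_text`, and `load_settings` are hypothetical.

```python
import os
from pathlib import Path
from typing import Dict


def env_file_for(app_env: str, base_dir: str = "configs/envs") -> Path:
    """Map an APP_ENV value such as 'local' to configs/envs/.env.local."""
    return Path(base_dir) / f".env.{app_env}"


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    settings: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def load_settings() -> Dict[str, str]:
    """Read the file selected by APP_ENV (assumed default: 'local')."""
    path = env_file_for(os.environ.get("APP_ENV", "local"))
    return parse_env_text(path.read_text(encoding="utf-8"))
```

All values come back as strings, so a flag such as `DEBUG=True` still needs an explicit comparison (for example `settings["DEBUG"] == "True"`) before it is used as a boolean.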

💻 Usage Examples

Python API Usage

from src.models.speech_to_text import FreeIndianSpeechToText

# Initialize model
asr = FreeIndianSpeechToText(model_type="distil-whisper")

# Transcribe audio
result = asr.transcribe("audio.wav", language_code="hi")
print(result["text"])

# Batch processing
results = asr.batch_transcribe(["file1.wav", "file2.wav"], "hi")

# Switch models
asr.switch_model("wav2vec2-hindi")

# Get model info
info = asr.get_model_info()

TensorFlow Integration

from src.tensorflow_integration import TensorFlowOptimizer

# Optimize model for deployment
optimizer = TensorFlowOptimizer()
optimizer.convert_to_tensorflow_lite("model_path", "optimized.tflite")

# Benchmark performance
# `model` and `input_shape` are placeholders for a loaded model and its expected input shape
metrics = optimizer.benchmark_model(model, input_shape)

🎵 Supported Audio Formats

Input: WAV, MP3, FLAC, M4A, OGG
Sample Rate: Automatically resampled to 16kHz
Channels: Mono (stereo converted automatically)
Duration: Up to 5 minutes per file
File Size: Maximum 100MB
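
The limits listed above can be checked before any audio is decoded. The sketch below is one way to express them; `check_audio_limits` is a hypothetical helper, the caller is assumed to already know the file size and duration, and 100MB is interpreted as 100 × 1024 × 1024 bytes.

```python
from pathlib import Path
from typing import List

SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".flac", ".m4a", ".ogg"}
MAX_DURATION_SECONDS = 5 * 60
MAX_FILE_SIZE_BYTES = 100 * 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes


def check_audio_limits(
    filename: str, size_bytes: int, duration_seconds: float
) -> List[str]:
    """Return a list of violated limits; an empty list means the file is acceptable."""
    problems: List[str] = []
    extension = Path(filename).suffix.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {extension or 'none'}")
    if size_bytes > MAX_FILE_SIZE_BYTES:
        problems.append("file larger than 100MB")
    if duration_seconds > MAX_DURATION_SECONDS:
        problems.append("audio longer than 5 minutes")
    return problems
```

Returning every problem at once, rather than raising on the first, lets the interface show the user a complete list of what to fix.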

🚀 Performance Optimization

GPU Acceleration

# Enable CUDA for faster processing
ENABLE_GPU=True

# Mixed precision for memory efficiency
torch_dtype=torch.float16

CPU Optimization

# Use smaller models for CPU-only deployment
model_type="whisper-small"
torch_dtype=torch.float32

Batch Processing

# Process multiple files efficiently
results = asr.batch_transcribe(audio_files, language_code)

📊 Model Recommendations

Best Overall Choice

Distil-Whisper Large-v3

6x faster than original Whisper
49% smaller model size
<1% WER difference
Excellent for production deployment

Best Accuracy

OpenAI Whisper Large-v3

State-of-the-art accuracy
Supports 99 languages
Best for complex audio scenarios

Hindi Specialized

Wav2Vec2 Hindi Models

Optimized specifically for Hindi
Lower computational requirements
Good for Hindi-only applications

CPU Deployment

Whisper Small

Balanced performance on CPU
244M parameters
Good accuracy-speed tradeoff

🔒 Commercial Usage

The models in this application are available under the following licenses, most of them permissive:

MIT License: Distil-Whisper, OpenAI Whisper, Wav2Vec2 models
Apache 2.0: TensorFlow components
CC BY-NC: SeamlessM4T (non-commercial)

✅ Commercial use allowed for most models
✅ No API costs or usage limits
✅ Full model ownership and customization rights

🛠️ Development

Adding New Models

Add model configuration

# In speech_to_text.py
"new-model": {
    "model_id": "huggingface/model-name",
    "type": "whisper",  # or wav2vec2, etc.
    "description": "Model description",
    "languages": 50,
    "size": "500M"
}

Implement loading logic

def _load_new_model_type(self):
    # Model-specific loading code
    pass

Add to Gradio interface

# In gradio_app.py
model_choices = [..., "new-model"]
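
To show how the three steps fit together, here is a minimal sketch assuming the configuration lives in a dictionary and loaders are dispatched on the `"type"` field. The names `MODEL_CONFIGS`, `LOADERS`, `load_model`, and `model_choices()` are hypothetical and may not match the real structure of `speech_to_text.py` or `gradio_app.py`.

```python
from typing import Callable, Dict, List

# Hypothetical registry holding the configuration entry from step 1.
MODEL_CONFIGS: Dict[str, Dict[str, object]] = {
    "new-model": {
        "model_id": "huggingface/model-name",
        "type": "whisper",
        "description": "Model description",
        "languages": 50,
        "size": "500M",
    },
}


def _load_whisper(model_id: str) -> str:
    # Placeholder: a real loader would build and return the model object.
    return f"whisper:{model_id}"


# One loader per "type" value; register a new entry for each new type.
LOADERS: Dict[str, Callable[[str], str]] = {"whisper": _load_whisper}


def load_model(name: str) -> str:
    """Look up the config for `name` and dispatch on its "type" field."""
    config = MODEL_CONFIGS[name]
    loader = LOADERS.get(str(config["type"]))
    if loader is None:
        raise ValueError(f"no loader registered for type {config['type']!r}")
    return loader(str(config["model_id"]))


def model_choices() -> List[str]:
    """Names to offer in the UI dropdown, derived from the registry."""
    return sorted(MODEL_CONFIGS)
```

Deriving the dropdown choices from the registry keeps the interface in sync with the configuration, so a new model only has to be declared once.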

Environment Setup

# Development environment
export APP_ENV=dev
python app.py

# Production environment  
export APP_ENV=prod
python app.py

Testing

# Run basic functionality test
python -c "
from src.models.speech_to_text import FreeIndianSpeechToText
asr = FreeIndianSpeechToText('distil-whisper')
print('✅ Model loaded successfully')
"

🐛 Troubleshooting

Common Issues

CUDA Out of Memory

# Use CPU or smaller model
ENABLE_GPU=False
# or
model_type="whisper-small"

Model Download Fails

# Check internet connection and try again
# Models are cached after first download

Audio Format Not Supported

# Convert audio using ffmpeg
ffmpeg -i input.mp4 -ar 16000 -ac 1 output.wav

Gradio Interface Not Loading

# Check if port is available
netstat -an | grep 7860

# Try different port
GRADIO_SERVER_PORT=7861
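
On systems where `netstat` or `grep` is unavailable (for example a plain Windows shell), the same check can be done from Python with the standard library. This is a sketch; `is_port_free` and `first_free_port` are hypothetical helper names and are not part of the application.

```python
import socket


def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can be bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
    return True


def first_free_port(start: int = 7860, attempts: int = 10) -> int:
    """Scan upward from `start` and return the first port that can be bound."""
    for port in range(start, start + attempts):
        if is_port_free(port):
            return port
    raise RuntimeError(f"no free port in {start}..{start + attempts - 1}")
```

The result is only a snapshot: another process can claim the port between the check and the server start, so treat it as a diagnostic rather than a guarantee.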

📈 Performance Benchmarks

Processing Speed (RTF - Real Time Factor)

Distil-Whisper: 0.17 RTF (6x faster than real-time)
Whisper Large: 1.0 RTF (real-time)
Wav2Vec2: 0.3 RTF (3x faster than real-time)
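
For reference, RTF is the processing time divided by the audio duration, and the "x faster than real-time" figure is its reciprocal. A small sketch of that arithmetic (the function names are illustrative):

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent transcribing / length of the audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def speedup_over_real_time(rtf: float) -> float:
    """An RTF below 1.0 means faster than real-time; the speedup is 1 / RTF."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf
```

By this definition, an RTF of 0.17 corresponds to about 5.9x real-time (rounded to 6x above) and an RTF of 0.3 to about 3.3x (rounded to 3x above).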

Memory Usage

Distil-Whisper: ~2GB GPU memory
Whisper Large: ~4GB GPU memory
Wav2Vec2: ~1GB GPU memory

Accuracy (Word Error Rate)

Distil-Whisper: 8.5% WER on Hindi
Whisper Large: 8.1% WER on Hindi
Wav2Vec2 Hindi: 12% WER on Hindi
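
Word Error Rate is the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the model output, divided by the number of reference words. The sketch below shows the standard computation; it is not the script used to produce the figures above, and it does no text normalization (casing, punctuation), which real evaluations usually apply first.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of words in the reference."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.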

🤝 Contributing

Fork the repository
Create feature branch (git checkout -b feature/amazing-feature)
Commit changes (git commit -m 'Add amazing feature')
Push to branch (git push origin feature/amazing-feature)
Open Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

Individual models may have different licenses:

Distil-Whisper: MIT
OpenAI Whisper: MIT
Wav2Vec2: MIT
SeamlessM4T: CC BY-NC 4.0

🙏 Acknowledgments

Hugging Face for the Transformers library and model hosting
OpenAI for the Whisper models
Meta for SeamlessM4T and Wav2Vec2
AI4Bharat for Indian language models
Gradio for the web interface framework

📞 Support

Issues: GitHub Issues
Discussions: GitHub Discussions
Documentation: Wiki

Made with ❤️ for the Indian AI community
