title: IndicConformer STT API
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
IndicConformer Speech-to-Text API 🎙️
Fast and accurate Speech-to-Text API for 22 Indian languages powered by AI4Bharat's IndicConformer model.
Features
- 22 Indian Languages Supported: Hindi, Telugu, Bengali, Tamil, and 18 more
- Long Audio Support: Process up to 30 minutes of audio
- Parallel Processing: Fast transcription with chunked inference
- Multiple Formats: Supports WAV, MP3, FLAC, M4A
Quick Start
API Endpoints
- Base URL: Your Space URL
- Documentation: `/docs` (Interactive Swagger UI)
- Transcribe: `POST /transcribe`
- Health Check: `GET /health`
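A client may want to confirm the service is up before uploading audio, since a Space can take a while to start. The sketch below polls `GET /health` once using only the standard library. It assumes that HTTP 200 means "ready"; the response body is not inspected because its schema is not documented here. `health_url` and `is_healthy` are hypothetical helper names, and the base URL is a placeholder.

```python
import urllib.error
import urllib.request


def health_url(base_url: str) -> str:
    """Join the Space base URL and the /health path without doubling slashes."""
    return base_url.rstrip("/") + "/health"


def is_healthy(base_url: str, timeout: float = 10.0,
               opener=urllib.request.urlopen) -> bool:
    """Return True if GET /health answers with HTTP 200.

    Assumption: a 200 status means the service is ready. Network errors
    and HTTP error statuses are treated as "not healthy".
    """
    try:
        with opener(health_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


if __name__ == "__main__":
    # Placeholder URL; replace with your own Space URL.
    print(is_healthy("https://your-space-url.hf.space"))
```

The `opener` parameter exists only so the function can be exercised without a network connection.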
Example Usage
Using cURL
curl -X POST "https://your-space-url.hf.space/transcribe" \
  -F "file=@audio.wav" \
  -F "language=hi"
Using Python
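import requests
url = "https://your-space-url.hf.space/transcribe"
data = {"language": "hi"}
# Open the audio in binary mode; the context manager closes it after the upload
with open("audio.wav", "rb") as audio:
    response = requests.post(url, files={"file": audio}, data=data)
response.raise_for_status()  # raise on HTTP 4xx/5xx instead of parsing an error page
print(response.json())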
Using JavaScript
// audioFile is a File or Blob, for example from an <input type="file"> element
const formData = new FormData();
formData.append('file', audioFile);
formData.append('language', 'hi');
const response = await fetch('https://your-space-url.hf.space/transcribe', {
  method: 'POST',
  body: formData
});
const result = await response.json();
console.log(result.transcription);
🗣️ Supported Languages
| Code | Language | Code | Language |
|---|---|---|---|
| `hi` | Hindi | `te` | Telugu |
| `bn` | Bengali | `ta` | Tamil |
| `mr` | Marathi | `gu` | Gujarati |
| `kn` | Kannada | `ml` | Malayalam |
| `pa` | Punjabi | `or` | Odia |
| `as` | Assamese | `ur` | Urdu |
| `ne` | Nepali | `kok` | Konkani |
| `sd` | Sindhi | `doi` | Dogri |
| `brx` | Bodo | `mai` | Maithili |
| `mni` | Manipuri | `ks` | Kashmiri |
| `sa` | Sanskrit | `sat` | Santali |
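Validating the `language` field on the client avoids a round trip for a mistyped code. The sketch below copies the 22 codes from the table above into a dictionary. `normalize_language` is a hypothetical helper, and lower-casing the input is a client-side convenience: whether the server itself accepts upper-case codes is not stated here.

```python
# Codes and names copied from the Supported Languages table.
SUPPORTED_LANGUAGES = {
    "hi": "Hindi", "te": "Telugu", "bn": "Bengali", "ta": "Tamil",
    "mr": "Marathi", "gu": "Gujarati", "kn": "Kannada", "ml": "Malayalam",
    "pa": "Punjabi", "or": "Odia", "as": "Assamese", "ur": "Urdu",
    "ne": "Nepali", "kok": "Konkani", "sd": "Sindhi", "doi": "Dogri",
    "brx": "Bodo", "mai": "Maithili", "mni": "Manipuri", "ks": "Kashmiri",
    "sa": "Sanskrit", "sat": "Santali",
}


def normalize_language(code: str) -> str:
    """Strip and lower-case a language code; raise ValueError if it is not listed."""
    cleaned = code.strip().lower()
    if cleaned not in SUPPORTED_LANGUAGES:
        raise ValueError(
            f"Unsupported language code {code!r}; "
            f"expected one of {sorted(SUPPORTED_LANGUAGES)}"
        )
    return cleaned


if __name__ == "__main__":
    code = normalize_language(" HI ")
    print(code, SUPPORTED_LANGUAGES[code])
```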
Response Format
{
  "success": true,
  "transcription": "आपका टेक्स्ट यहाँ",
  "metadata": {
    "audio_duration": 45.2,
    "audio_duration_minutes": 0.75,
    "inference_time": 2.1543,
    "rtf": 0.0476,
    "language": "hi",
    "decoder": "rnnt",
    "num_chunks": 2
  }
}
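On the client side it can help to turn this JSON into a typed object and to fail early when `success` is false. The sketch below uses only the field names shown in the example response. The `Transcript` class and `parse_response` function are hypothetical names, and the shape of a failure response is not documented here, so the error path simply reports the raw payload.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    text: str
    language: str
    audio_seconds: float
    inference_seconds: float
    rtf: float
    num_chunks: int


def parse_response(payload: dict) -> Transcript:
    """Convert a /transcribe JSON payload into a Transcript.

    Assumption: a failed request carries a falsy "success" field; the rest
    of the error payload is passed through in the exception message.
    """
    if not payload.get("success"):
        raise RuntimeError(f"Transcription failed: {payload!r}")
    meta = payload["metadata"]
    return Transcript(
        text=payload["transcription"],
        language=meta["language"],
        audio_seconds=float(meta["audio_duration"]),
        inference_seconds=float(meta["inference_time"]),
        rtf=float(meta["rtf"]),
        num_chunks=int(meta["num_chunks"]),
    )
```

With the `requests` example earlier in this README, this would be called as `parse_response(response.json())`.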
⚡ Performance
- Real-Time Factor (RTF): ~0.05 on GPU, meaning inference takes about 5% of the audio duration (roughly 20x faster than real time)
- Max Audio Length: 30 minutes
- Chunk Processing: 30 s chunks with a 2 s overlap for optimal accuracy
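The figures above can be related with a little arithmetic. The sketch below computes RTF as inference time divided by audio duration, and lists chunk boundaries for 30 s windows that overlap by 2 s. The chunking function is an illustration under the assumption that each window starts 28 s after the previous one and the last window is truncated at the end of the audio; the Space's actual splitting code is not shown in this README and may differ.

```python
def real_time_factor(inference_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return inference_seconds / audio_seconds


def chunk_bounds(duration: float, chunk: float = 30.0, overlap: float = 2.0):
    """Return (start, end) pairs in seconds covering [0, duration].

    Assumed scheme: fixed-length windows, each starting `chunk - overlap`
    seconds after the previous one, with the final window cut at `duration`.
    """
    if overlap >= chunk:
        raise ValueError("overlap must be smaller than chunk")
    if duration <= 0:
        return []
    step = chunk - overlap
    bounds = []
    start = 0.0
    while True:
        end = min(start + chunk, duration)
        bounds.append((start, end))
        if end >= duration:
            break
        start += step
    return bounds


if __name__ == "__main__":
    print(real_time_factor(2.1543, 45.2))
    print(chunk_bounds(45.2))
```

Under these assumptions a 45.2 s file yields two chunks, which matches `num_chunks` in the example response.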
🛠️ Model Information
- Model: ai4bharat/indic-conformer-600m-multilingual
- Decoder: RNNT (Recurrent Neural Network Transducer)
- Architecture: Conformer (600M parameters)
Notes
- Audio files are automatically resampled to 16 kHz mono
- Longer audio files are split into chunks for parallel processing
- GPU acceleration is automatically used when available
- Maximum audio duration is 30 minutes per request
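Because the limit is enforced per request, checking a file's duration locally can save an upload that would be rejected. The sketch below reads a WAV header with Python's standard `wave` module and compares the duration with the 30-minute limit. It only handles uncompressed PCM WAV; MP3, FLAC, and M4A would need a third-party decoder. `inspect_wav` and `preflight` are hypothetical helper names, and the `server_will_convert` flag is informational only, since resampling happens on the server.

```python
import wave

MAX_SECONDS = 30 * 60     # documented per-request limit
TARGET_RATE_HZ = 16_000   # the service resamples audio to this rate, mono


def inspect_wav(path: str) -> dict:
    """Read duration, sample rate, and channel count from a PCM WAV header."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return {
            "seconds": wav.getnframes() / rate,
            "sample_rate": rate,
            "channels": wav.getnchannels(),
        }


def preflight(path: str) -> dict:
    """Raise ValueError if the file is longer than 30 minutes.

    Otherwise return its properties plus a flag that says whether the
    server will need to resample or downmix it.
    """
    info = inspect_wav(path)
    if info["seconds"] > MAX_SECONDS:
        raise ValueError(
            f"{path} is {info['seconds'] / 60:.1f} minutes long; "
            f"the limit is {MAX_SECONDS // 60} minutes"
        )
    info["server_will_convert"] = (
        info["sample_rate"] != TARGET_RATE_HZ or info["channels"] != 1
    )
    return info
```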
Credits
Built with:
License
MIT License