metadata
title: IndicConformer STT API
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit

IndicConformer Speech-to-Text API 🎙️

Fast and accurate Speech-to-Text API for 22 Indian languages powered by AI4Bharat's IndicConformer model.

🌟 Features

  • 22 Indian Languages Supported: Hindi, Telugu, Bengali, Tamil, and 18 more
  • Long Audio Support: Process up to 30 minutes of audio
  • Parallel Processing: Fast transcription with chunked inference
  • Multiple Formats: Supports WAV, MP3, FLAC, M4A

🚀 Quick Start

API Endpoints

  • Base URL: Your Space URL
  • Documentation: /docs (Interactive Swagger UI)
  • Transcribe: POST /transcribe
  • Health Check: GET /health
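Before uploading audio, it can help to confirm the Space is awake. The sketch below builds endpoint URLs from the base URL and calls GET /health. It assumes the health endpoint returns a JSON body (its fields are not documented here); `endpoint` and `check_health` are hypothetical helper names, and the base URL is the placeholder used throughout this README.

```python
import json
import urllib.request

# Placeholder base URL from the examples in this README; replace it with your Space URL.
BASE_URL = "https://your-space-url.hf.space"


def endpoint(base_url: str, path: str) -> str:
    """Join the base URL and an endpoint path without doubling slashes."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def check_health(base_url: str = BASE_URL, timeout: float = 10.0) -> dict:
    """Call GET /health and return the decoded body.

    Assumption: the endpoint answers with JSON; the exact fields are not documented here.
    """
    with urllib.request.urlopen(endpoint(base_url, "/health"), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Only prints the URLs; call check_health() yourself once BASE_URL points at a real Space.
    for path in ("/docs", "/transcribe", "/health"):
        print(endpoint(BASE_URL, path))
```

Keeping the URL joining in one helper avoids accidental double slashes when the Space URL is pasted with a trailing `/`.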

Example Usage

Using cURL

curl -X POST "https://your-space-url.hf.space/transcribe" \
  -F "file=@audio.wav" \
  -F "language=hi"

Using Python

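import requests

url = "https://your-space-url.hf.space/transcribe"

# Open the file in a context manager so the handle is closed after the upload
with open("audio.wav", "rb") as audio:
    response = requests.post(
        url,
        files={"file": audio},
        data={"language": "hi"},  # any code from the Supported Languages table
        timeout=600,  # long audio can take a while; adjust as needed
    )

response.raise_for_status()  # fail loudly on HTTP 4xx/5xx
print(response.json())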

Using JavaScript

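// audioFile is a File or Blob, e.g. taken from an <input type="file"> element.
// Run this inside an async function or an ES module (it uses await).
const formData = new FormData();
formData.append('file', audioFile);
formData.append('language', 'hi');

const response = await fetch('https://your-space-url.hf.space/transcribe', {
  method: 'POST',
  body: formData
});

if (!response.ok) {
  throw new Error(`Transcription request failed: HTTP ${response.status}`);
}

const result = await response.json();
console.log(result.transcription);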

🗣️ Supported Languages

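| Code | Language | Code | Language  |
|------|----------|------|-----------|
| hi   | Hindi    | te   | Telugu    |
| bn   | Bengali  | ta   | Tamil     |
| mr   | Marathi  | gu   | Gujarati  |
| kn   | Kannada  | ml   | Malayalam |
| pa   | Punjabi  | or   | Odia      |
| as   | Assamese | ur   | Urdu      |
| ne   | Nepali   | kok  | Konkani   |
| sd   | Sindhi   | doi  | Dogri     |
| brx  | Bodo     | mai  | Maithili  |
| mni  | Manipuri | ks   | Kashmiri  |
| sa   | Sanskrit | sat  | Santali   |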
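Validating the language code on the client avoids a wasted upload. The following is a minimal sketch that mirrors the table above; `SUPPORTED_LANGUAGES` and `normalize_language` are hypothetical names, and how the API itself reacts to an unknown code is not documented here.

```python
# Language codes and names copied from the Supported Languages table.
SUPPORTED_LANGUAGES = {
    "hi": "Hindi", "te": "Telugu", "bn": "Bengali", "ta": "Tamil",
    "mr": "Marathi", "gu": "Gujarati", "kn": "Kannada", "ml": "Malayalam",
    "pa": "Punjabi", "or": "Odia", "as": "Assamese", "ur": "Urdu",
    "ne": "Nepali", "kok": "Konkani", "sd": "Sindhi", "doi": "Dogri",
    "brx": "Bodo", "mai": "Maithili", "mni": "Manipuri", "ks": "Kashmiri",
    "sa": "Sanskrit", "sat": "Santali",
}


def normalize_language(code: str) -> str:
    """Return the lower-cased, trimmed code, or raise ValueError if it is not in the table."""
    normalized = code.strip().lower()
    if normalized not in SUPPORTED_LANGUAGES:
        known = ", ".join(sorted(SUPPORTED_LANGUAGES))
        raise ValueError(f"Unsupported language code {code!r}; expected one of: {known}")
    return normalized


if __name__ == "__main__":
    code = normalize_language(" HI ")
    print(code, SUPPORTED_LANGUAGES[code])
```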

📊 Response Format

{
  "success": true,
  "transcription": "आपका टेक्स्ट यहां",
  "metadata": {
    "audio_duration": 45.2,
    "audio_duration_minutes": 0.75,
    "inference_time": 2.1543,
    "rtf": 0.0476,
    "language": "hi",
    "decoder": "rnnt",
    "num_chunks": 2
  }
}
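A short sketch of reading this response on the client. It assumes a failed request reports `"success": false` (the error shape is not documented here) and that `rtf` follows the usual definition, inference time divided by audio duration, which is consistent with the sample values above. `read_transcription` and `real_time_factor` are hypothetical helper names.

```python
import json

# The sample response from this section, used as test input.
SAMPLE = """
{
  "success": true,
  "transcription": "आपका टेक्स्ट यहां",
  "metadata": {
    "audio_duration": 45.2,
    "audio_duration_minutes": 0.75,
    "inference_time": 2.1543,
    "rtf": 0.0476,
    "language": "hi",
    "decoder": "rnnt",
    "num_chunks": 2
  }
}
"""


def read_transcription(payload: dict) -> str:
    """Return the transcription text, raising if the request did not succeed.

    Assumption: failures are signalled with "success": false.
    """
    if not payload.get("success"):
        raise RuntimeError("Transcription request did not succeed")
    return payload["transcription"]


def real_time_factor(payload: dict) -> float:
    """Recompute RTF as inference_time / audio_duration (both in seconds)."""
    meta = payload["metadata"]
    return meta["inference_time"] / meta["audio_duration"]


if __name__ == "__main__":
    payload = json.loads(SAMPLE)
    print(read_transcription(payload))
    print(f"RTF: {real_time_factor(payload):.4f}")
```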

⚑ Performance

  • Real-Time Factor (RTF): ~0.05 (20x faster than real-time on GPU)
  • Max Audio Length: 30 minutes
  • Chunk Processing: 30s chunks with 2s overlap for optimal accuracy
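To illustrate the chunking figures above, here is a sketch of how 30 s windows with a 2 s overlap could be laid out over a recording. This is an assumption about the scheme (each chunk starts 28 s after the previous one, and the last chunk is cut at the end of the audio), not the service's actual implementation; `chunk_bounds` is a hypothetical name.

```python
def chunk_bounds(duration_s: float, chunk_s: float = 30.0, overlap_s: float = 2.0):
    """Return (start, end) times in seconds for overlapping chunks covering the audio.

    Assumed scheme: consecutive chunks start chunk_s - overlap_s apart,
    and the final chunk is truncated at duration_s.
    """
    if not 0 <= overlap_s < chunk_s:
        raise ValueError("overlap_s must be non-negative and smaller than chunk_s")
    if duration_s <= 0:
        return []

    step = chunk_s - overlap_s
    bounds = []
    start = 0.0
    while True:
        end = min(start + chunk_s, duration_s)
        bounds.append((start, end))
        if end >= duration_s:  # this chunk reaches the end of the audio
            break
        start += step
    return bounds


if __name__ == "__main__":
    for start, end in chunk_bounds(45.2):
        print(f"{start:6.1f}s -> {end:6.1f}s")
```

Under these assumptions a 45.2 s file yields two chunks, which matches the `num_chunks` value in the sample response.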

🛠️ Model Information

📝 Notes

  • Audio files are automatically resampled to 16 kHz mono
  • Longer audio files are split into chunks for parallel processing
  • GPU acceleration is automatically used when available
  • Maximum audio duration is 30 minutes per request
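Since requests are limited to 30 minutes of audio, a client can check the duration before uploading. The sketch below does this for WAV files only, using Python's standard `wave` module; other formats (MP3, FLAC, M4A) would need a different reader. `wav_duration_seconds` and `within_limit` are hypothetical helper names, and how the API responds to over-long audio is not documented here.

```python
import wave

# Limit stated in this README: 30 minutes per request.
MAX_DURATION_S = 30 * 60


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds (frame count / sample rate)."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def within_limit(path: str, max_seconds: float = MAX_DURATION_S) -> bool:
    """True if the WAV file is no longer than max_seconds."""
    return wav_duration_seconds(path) <= max_seconds


if __name__ == "__main__":
    import sys

    # Usage: python check_duration.py audio.wav
    if len(sys.argv) == 2:
        seconds = wav_duration_seconds(sys.argv[1])
        status = "ok" if seconds <= MAX_DURATION_S else "too long"
        print(f"{seconds:.1f}s ({status})")
```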

🀝 Credits

Built with:

📄 License

MIT License