equal-ai/whisper-transliterate:
- GITHUB LINK:
Key Features
- Hinglish-Native Transcription — Transcribes audio directly into Hinglish as it is naturally spoken, rather than forcing output into formal Hindi or English, which significantly reduces grammatical mismatches.
- Whisper-Based Architecture — Built on OpenAI's Whisper architecture, making it plug-and-play with the 🤗 Transformers library.
- Noise-Robust — Handles real-world background noise gracefully; stays silent instead of hallucinating transcriptions when no speech is detected.
- Hallucination-Resistant — Purpose-built to minimize phantom transcriptions, keeping outputs grounded and accurate.
- ~39% Performance Gain — Achieves an average of ~39% improvement over the base Whisper model across all benchmarking datasets.
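The gains above are measured in word error rate (WER): the word-level edit distance (substitutions + deletions + insertions) between a hypothesis transcript and a reference, divided by the number of reference words. As an illustration only (this is not the project's evaluation code), a minimal WER implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming Levenshtein distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[-1][-1] / len(ref)

# One substituted word out of three reference words -> WER of 1/3
print(wer("mujhe coffee chahiye", "mujhe coffee chahie"))
```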
Training
Data
- ~550 Hours of noisy, Indian-accented Hindi audio collected specifically for this task.
- No off-the-shelf datasets — existing Hinglish ASR datasets were insufficient, so a proprietary dataset was curated from scratch to match real-world conditions.
- Human-in-the-loop Labeling — Audio was first transcribed using a SOTA model, with human reviewers stepping in to catch and correct errors.
- Noise-First Philosophy — Data collection was deliberately biased toward noisy environments, reflecting how the model will actually be used across Indian homes, streets, and offices.
- Minimal Preprocessing — Audio was chunked to under 30 seconds with a maximum of 2 speakers per clip. Beyond that, the source audio was left untouched to preserve its natural acoustic character.
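The exact segmentation tooling isn't described beyond the constraints above; as a sketch of the chunking step, splitting a mono waveform into consecutive segments of at most 30 seconds (enforcing the 2-speaker limit would require a separate diarization pass, omitted here):

```python
import numpy as np

def chunk_audio(waveform: np.ndarray, sr: int, max_seconds: float = 30.0) -> list:
    """Split a mono waveform into consecutive chunks of at most max_seconds."""
    max_samples = int(max_seconds * sr)
    return [waveform[i:i + max_samples] for i in range(0, len(waveform), max_samples)]

# 75 seconds of audio at 16 kHz -> three chunks of 30 s, 30 s, and 15 s
audio = np.zeros(75 * 16000, dtype=np.float32)
chunks = chunk_audio(audio, sr=16000)
print([len(c) / 16000 for c in chunks])  # [30.0, 30.0, 15.0]
```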
Fine-tuning
- Custom Trainer — A purpose-built training loop with custom callbacks for granular observability throughout the fine-tuning process.
- Dynamic Layer Freezing — Activation patterns were profiled on a representative subset of training data to identify the most task-relevant layers. Only those layers were kept trainable, accelerating convergence while keeping compute costs low.
- DeepSpeed Integration — DeepSpeed was used for memory-efficient, high-throughput training across the full dataset.
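The card doesn't publish the activation-profiling procedure itself, but the freezing step it feeds can be sketched in plain PyTorch. The toy encoder-decoder below and the set of "task-relevant" layer names are hypothetical stand-ins:

```python
import torch.nn as nn

# Hypothetical stand-in for an encoder-decoder ASR model.
model = nn.ModuleDict({
    "encoder": nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8)),
    "decoder": nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8)),
})

# Layer-name prefixes judged task-relevant by (hypothetical) activation profiling.
trainable_prefixes = ("decoder.1",)

# Freeze everything except parameters under the selected prefixes.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith(trainable_prefixes)

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # ['decoder.1.weight', 'decoder.1.bias']
```

An optimizer built afterwards would receive only the surviving parameters, e.g. `filter(lambda p: p.requires_grad, model.parameters())`.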
Performance Overview
Quantitative Performance Overview
Note:
- The WER scores below are computed on Hinglish text generated by our model and by the original Whisper model.
| Dataset | Whisper Large V3 | equal-ai/whisper-transliterate |
|---|---|---|
| Common-Voice | 61.9432 | 32.4314 |
| FLEURS | 50.8425 | 28.6806 |
| Indic-Voices | 82.5621 | 60.8224 |
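The ~39% figure quoted under Key Features can be reproduced from this table as the mean relative WER reduction across the three datasets:

```python
# Reported WERs from the table above: (Whisper Large V3, whisper-transliterate)
results = {
    "Common-Voice": (61.9432, 32.4314),
    "FLEURS": (50.8425, 28.6806),
    "Indic-Voices": (82.5621, 60.8224),
}

# Relative WER reduction per dataset
gains = {name: (base - ours) / base for name, (base, ours) in results.items()}
for name, gain in gains.items():
    print(f"{name}: {gain:.1%}")

avg_gain = sum(gains.values()) / len(gains)
print(f"Average relative improvement: {avg_gain:.1%}")  # ~39.2%
```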
Usage:
Using Transformers
- To run the model, first install the Transformers library:

```shell
pip install -U transformers
```
- The model can be used with the `pipeline` class to transcribe audio of arbitrary length:
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Set device (GPU if available, otherwise CPU) and precision
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Specify the pre-trained model ID
model_id = "equal-ai/whisper-transliterate"

# Load the speech-to-text model with specified configurations
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,  # Use appropriate precision (float16 for GPU, float32 for CPU)
    low_cpu_mem_usage=True,   # Optimize memory usage during loading
    use_safetensors=True,     # Use safetensors format for better security
)
model.to(device)  # Move model to the selected device

# Load the processor for audio preprocessing and tokenization
processor = AutoProcessor.from_pretrained(model_id)

# Create the speech recognition pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    generate_kwargs={
        "task": "transcribe",  # Set task to transcription
        "language": "en",      # Specify English language
    },
)

# Process an audio file and print the transcription
sample = "sample.wav"  # Input audio file path
result = pipe(sample)  # Run inference
print(result["text"])  # Print transcribed text
```
Model tree for equal-ai/whisper-transliterate
- Base model: openai/whisper-large-v3

Evaluation results (self-reported)
- WER on google/fleurs test set: 28.681
- WER on mozilla-foundation/common_voice_20_0 test set: 32.431
- WER on Indic-Voices test set: 60.822