equal-ai/whisper-transliterate

An improved version of the whisper-medium model that transcribes Hindi-English (Hinglish) speech as it is naturally spoken.

Key Features

  1. Hinglish-Native Transcription — Transcribes audio directly into Hinglish as it is naturally spoken, rather than forcing output into formal Hindi or English, which significantly reduces grammatical mismatches.
  2. Whisper-Based Architecture — Built on OpenAI's Whisper architecture, making it plug-and-play with the 🤗 Transformers library.
  3. Noise-Robust — Handles real-world background noise gracefully; stays silent instead of hallucinating transcriptions when no speech is detected.
  4. Hallucination-Resistant — Purpose-built to minimize phantom transcriptions, keeping outputs grounded and accurate.
  5. ~39% Performance Gain — Achieves an average relative WER improvement of ~39% over the base Whisper model across all three benchmark datasets.

Training

Data

  • ~550 Hours of noisy, Indian-accented Hindi audio collected specifically for this task.
  • No off-the-shelf datasets — existing Hinglish ASR datasets were insufficient, so a proprietary dataset was curated from scratch to match real-world conditions.
  • Human-in-the-loop Labeling — Audio was first transcribed using a SOTA model, with human reviewers stepping in to catch and correct errors.
  • Noise-First Philosophy — Data collection was deliberately biased toward noisy environments, reflecting how the model will actually be used across Indian homes, streets, and offices.
  • Minimal Preprocessing — Audio was chunked to under 30 seconds with a maximum of 2 speakers per clip. Beyond that, the source audio was left untouched to preserve its natural acoustic character.
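The chunking step above can be sketched as follows. This is a minimal illustration (`chunk_audio` is a hypothetical helper, not part of the released code), and real preprocessing would also enforce the two-speaker limit:

```python
def chunk_audio(waveform, sample_rate, max_seconds=30.0):
    """Split a 1-D waveform (a sequence of samples) into chunks of at most max_seconds."""
    max_samples = int(max_seconds * sample_rate)
    # Slice the waveform into consecutive, non-overlapping windows
    return [waveform[i:i + max_samples] for i in range(0, len(waveform), max_samples)]

# Example: a 70-second clip at 100 Hz becomes chunks of 30 s, 30 s, and 10 s
chunks = chunk_audio([0.0] * 7000, sample_rate=100)
print([len(c) for c in chunks])  # [3000, 3000, 1000]
```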

Fine-tuning

  • Custom Trainer — A purpose-built training loop with custom callbacks for granular observability throughout the fine-tuning process.
  • Dynamic Layer Freezing — Activation patterns were profiled on a representative subset of training data to identify the most task-relevant layers. Only those layers were kept trainable, accelerating convergence while keeping compute costs low.
  • DeepSpeed Integration — DeepSpeed was used for memory-efficient, high-throughput training across the full dataset.
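The dynamic layer freezing idea can be illustrated with a toy sketch: profile the mean activation magnitude per layer on a representative subset, then keep only the top-k layers trainable. This is a hypothetical illustration of the selection step, not the actual training code; layer names and profiling values are invented for the example:

```python
def select_trainable_layers(activation_profiles, k=2):
    """Given {layer_name: [activation magnitudes...]}, return the k most active layers."""
    mean_act = {
        name: sum(abs(a) for a in acts) / len(acts)
        for name, acts in activation_profiles.items()
    }
    # Rank layers by mean absolute activation and keep the top k
    ranked = sorted(mean_act, key=mean_act.get, reverse=True)
    return set(ranked[:k])

# Toy profiles gathered from a representative subset of training data
profiles = {
    "encoder.layers.0": [0.1, 0.2],   # weakly activated
    "encoder.layers.11": [1.4, 1.6],  # strongly activated -> task-relevant
    "decoder.layers.5": [0.9, 1.1],
}
trainable = select_trainable_layers(profiles, k=2)
# In a real run, each parameter would then get requires_grad = (its layer in trainable)
```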

Performance Overview

Note:

  • The WER scores below (lower is better) compare Hinglish text generated by our model against the original Whisper Large V3 model.

  Dataset         Whisper Large V3    equal-ai/whisper-transliterate
  Common-Voice    61.9432             32.4314
  FLEURS          50.8425             28.6806
  Indic-Voices    82.5621             60.8224
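The relative WER reduction per dataset, and the ~39% average gain cited in Key Features, can be checked from the table with a few lines of Python:

```python
# (base WER, our WER) pairs taken from the table above
scores = {
    "Common-Voice": (61.9432, 32.4314),
    "FLEURS": (50.8425, 28.6806),
    "Indic-Voices": (82.5621, 60.8224),
}

# Relative WER reduction per dataset, in percent
improvements = {name: 100 * (1 - ours / base) for name, (base, ours) in scores.items()}
avg = sum(improvements.values()) / len(improvements)
print(round(avg, 1))  # ~39% average improvement
```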

Usage

Using Transformers

  • To run the model, first install the Transformers library

pip install -U transformers

  • The model can be used with the pipeline class to transcribe audio files of arbitrary length:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Set device (GPU if available, otherwise CPU) and precision
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Specify the pre-trained model ID
model_id = "equal-ai/whisper-transliterate"

# Load the speech-to-text model with specified configurations
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, 
    torch_dtype=torch_dtype,        # Use appropriate precision (float16 for GPU, float32 for CPU)
    low_cpu_mem_usage=True,         # Optimize memory usage during loading
    use_safetensors=True            # Use safetensors format for better security
)
model.to(device)                    # Move model to specified device

# Load the processor for audio preprocessing and tokenization
processor = AutoProcessor.from_pretrained(model_id)

# Create speech recognition pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    generate_kwargs={
        "task": "transcribe",       # Set task to transcription
        "language": "en"            # Specify English language
    }
)

# Process audio file and print transcription
sample = "sample.wav"               # Input audio file path
result = pipe(sample)               # Run inference
print(result["text"])               # Print transcribed text
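To sanity-check transcripts against references (as in the WER table above), a minimal word error rate implementation can be used. This is a bare-bones sketch assuming whitespace tokenization and no text normalization; production evaluation would normally use a library such as jiwer:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[j] holds the edit distance between the processed prefix of ref and hyp[:j]
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev = d[0]          # distance for (ref[:i-1], hyp[:j-1])
        d[0] = i
        for j, h in enumerate(hyp, 1):
            cur = d[j]
            # deletion, insertion, or substitution/match
            d[j] = min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
            prev = cur
    return d[len(hyp)] / len(ref)

print(wer("mera naam raj hai", "mera nam raj hai"))  # 1 substitution / 4 words = 0.25
```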