Whisper-Hindi2Hinglish-Apex:

Table of Contents:

  • Key Features
  • Training
  • Performance Overview
  • Usage
  • Miscellaneous

Key Features:

  1. Hinglish as a language: Native ability to transcribe Hindi speech into Hinglish (Hindi written in the Latin script), making the output easier to read and process.
  2. Faster Inference: The model achieves SOTA performance while being up to 8x faster.
  3. Performance Increase: ~42% average performance increase over the pretrained model across benchmarking datasets.
  4. Ranked #1 on the Speech-To-Text Arena, beating competitors on the transcription preference metric.

Training:

Data:

  • Duration: A total of ~700 hours of noisy, Indian-accented Hindi audio.
  • Collection: Curated from a mix of open-source and proprietary datasets.
  • Labelling: Automated labelling with a SOTA model, followed by manual human correction.

Finetuning:

  • Novel Finetuning Techniques: Advanced fine-tuning techniques were used to specifically target and improve performance on noisy Indian-accented audio.
    • Custom Dynamic Layer Freezing: Identifying the most active layers during inference and performing targeted training on those layers (see the sketch after this list).
    • Dataset Augmentations: Devised data augmentation techniques to increase model robustness.
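
The exact recipe behind Custom Dynamic Layer Freezing is not published. The sketch below only illustrates the general idea using Hugging Face Transformers: collect per-layer activation statistics on a small calibration set via forward hooks, then keep only the most active encoder layers trainable. The base checkpoint, the top_k value, the dummy calibration features, and the make_hook helper are illustrative assumptions, not the released training code.

import torch
from transformers import AutoModelForSpeechSeq2Seq

# "openai/whisper-large-v3" is used here only as a stand-in base checkpoint
model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v3")

# 1. Record mean activation magnitude per encoder layer with forward hooks
layer_activity = {}

def make_hook(layer_idx):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        layer_activity[layer_idx] = layer_activity.get(layer_idx, 0.0) + hidden.abs().mean().item()
    return hook

hooks = [
    layer.register_forward_hook(make_hook(i))
    for i, layer in enumerate(model.model.encoder.layers)
]

# Dummy features stand in for a real calibration set of noisy Hindi audio
dummy_features = torch.randn(1, model.config.num_mel_bins, 3000)
with torch.no_grad():
    model.model.encoder(dummy_features)

for h in hooks:
    h.remove()

# 2. Keep only the top-k most active encoder layers trainable; freeze the rest
top_k = 8  # hypothetical value
active_layers = set(sorted(layer_activity, key=layer_activity.get, reverse=True)[:top_k])

for i, layer in enumerate(model.model.encoder.layers):
    for param in layer.parameters():
        param.requires_grad = i in active_layers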

Performance Overview

Qualitative Performance Overview

Each row shows both models' transcriptions of the same audio sample:

Whisper Large V3             | Whisper-Hindi2Hinglish-Apex
maynata pura, canta maynata  | Mehnat to poora karte hain.
Where did they come from?    | Haan vahi dekh aapko bataen na.
A Pantral Logan.             | Aap pandrah log hain.
Thank you, Sanchez.          | Thik hai saal ki.
Rangers, I can tell you.     | Nahin, just thank you, thank you.
Uh-huh. They can't.          | Haan haan, dekhe hain.

Quantitative Performance Overview

Note:

  • To accurately measure Hinglish transcription performance, the original Hindi ground truth for each dataset was first transliterated to Hinglish. The WER scores below were calculated against this transliterated reference text (a sketch of this scoring setup follows the table).
  • To check our model's real-world performance against other SOTA models, please head to our Speech-To-Text Arena space.
WER (lower is better):

Dataset      | Whisper Large V3 | Whisper-Hindi2Hinglish-Prime | Whisper-Hindi2Hinglish-Apex
Common-Voice | 61.9432          | 32.4314                      | 35.9586
FLEURS       | 50.8425          | 28.6806                      | 29.78694
Indic-Voices | 82.5621          | 60.8224                      | 47.6380
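
A minimal sketch of the scoring setup described in the note above, assuming the jiwer and indic-transliteration packages (pip install jiwer indic-transliteration). The transliteration scheme and text normalisation used for the published numbers are not specified; ITRANS and simple lowercasing are shown purely as placeholders.

# Sketch: transliterate a Devanagari reference to the Latin script, then score WER.
# The actual evaluation likely uses a romanization closer to conversational Hinglish
# plus additional normalisation, so the number printed here is only illustrative.
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate
import jiwer

reference_hi = "मेहनत तो पूरा करते हैं"          # Devanagari ground truth
reference_hinglish = transliterate(reference_hi, sanscript.DEVANAGARI, sanscript.ITRANS)

hypothesis = "Mehnat to poora karte hain."        # model output

wer = jiwer.wer(reference_hinglish.lower(), hypothesis.lower())
print(f"WER: {wer:.4f}")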

Usage:

Using Transformers

  • To run the model, first install the Transformers library

pip install -U transformers

  • The model can be used with the pipeline class to transcribe audio of arbitrary length:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Set device (GPU if available, otherwise CPU) and precision
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Specify the pre-trained model ID
model_id = "Oriserve/Whisper-Hindi2Hinglish-Apex"

# Load the speech-to-text model with specified configurations
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,        # Use appropriate precision (float16 for GPU, float32 for CPU)
    low_cpu_mem_usage=True,         # Optimize memory usage during loading
    use_safetensors=True            # Use safetensors format for better security
)
model.to(device)                    # Move model to specified device

# Load the processor for audio preprocessing and tokenization
processor = AutoProcessor.from_pretrained(model_id)

# Create speech recognition pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    generate_kwargs={
        "task": "transcribe",       # Set task to transcription
        "language": "en"            # Specify English language
    }
)

# Process audio file and print transcription
sample = "sample.wav"               # Input audio file path
result = pipe(sample)               # Run inference
print(result["text"])               # Print transcribed text
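
For long recordings, the pipeline also supports chunked inference. chunk_length_s and batch_size are standard Transformers pipeline options rather than anything specific to this model, and the file path and values shown are illustrative.

# Optional: chunked inference for long audio (standard pipeline options; values illustrative)
result = pipe("long_audio.wav", chunk_length_s=30, batch_size=8)
print(result["text"])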

Using Flash Attention 2

Flash Attention 2 can be used to speed up transcription. If your GPU supports Flash Attention, first install it:

pip install flash-attn --no-build-isolation

  • Once installed, you can load the model using the code below:
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, attn_implementation="flash_attention_2")
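
If the same script needs to run on machines without Flash Attention, one option is to pick the attention implementation at runtime. is_flash_attn_2_available() and the "sdpa" fallback are standard Transformers features, not something specific to this model.

from transformers.utils import is_flash_attn_2_available

# Fall back to PyTorch SDPA attention when flash-attn is not installed or unsupported
attn_implementation = "flash_attention_2" if is_flash_attn_2_available() else "sdpa"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    attn_implementation=attn_implementation,
)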

Using the OpenAI Whisper module

  • First, install the openai-whisper library

pip install -U openai-whisper tqdm

  • Convert the Hugging Face checkpoint to an OpenAI Whisper-format PyTorch checkpoint
import torch
from transformers import AutoModelForSpeechSeq2Seq
import re
from tqdm import tqdm
from collections import OrderedDict
import json

# Load parameter name mapping from HF to OpenAI format
with open('convert_hf2openai.json', 'r') as f:
    reverse_translation = json.load(f)

reverse_translation = OrderedDict(reverse_translation)

def save_model(model, save_path):
    def reverse_translate(current_param):
        # Convert parameter names using regex patterns
        for pattern, repl in reverse_translation.items():
            if re.match(pattern, current_param):
                return re.sub(pattern, repl, current_param)

    # Extract model dimensions from config
    config = model.config
    model_dims = {
        "n_mels": config.num_mel_bins,           # Number of mel spectrogram bins
        "n_vocab": config.vocab_size,            # Vocabulary size
        "n_audio_ctx": config.max_source_positions,    # Max audio context length
        "n_audio_state": config.d_model,         # Audio encoder state dimension
        "n_audio_head": config.encoder_attention_heads,  # Audio encoder attention heads
        "n_audio_layer": config.encoder_layers,   # Number of audio encoder layers
        "n_text_ctx": config.max_target_positions,     # Max text context length
        "n_text_state": config.d_model,          # Text decoder state dimension
        "n_text_head": config.decoder_attention_heads,  # Text decoder attention heads
        "n_text_layer": config.decoder_layers,    # Number of text decoder layers
    }

    # Convert model state dict to Whisper format
    original_model_state_dict = model.state_dict()
    new_state_dict = {}

    for key, value in tqdm(original_model_state_dict.items()):
        key = key.replace("model.", "")          # Remove 'model.' prefix
        new_key = reverse_translate(key)         # Convert parameter names
        if new_key is not None:
            new_state_dict[new_key] = value

    # Create final model dictionary
    pytorch_model = {"dims": model_dims, "model_state_dict": new_state_dict}

    # Save converted model
    torch.save(pytorch_model, save_path)

# Load Hugging Face model
model_id = "Oriserve/Whisper-Hindi2Hinglish-Apex"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,        # Optimize memory usage
    use_safetensors=True           # Use safetensors format
)

# Convert and save model
model_save_path = "Whisper-Hindi2Hinglish-Apex.pt"
save_model(model, model_save_path)

  • Transcribe audio with the converted model:
import whisper
# Load converted model with Whisper and transcribe
model = whisper.load_model("Whisper-Hindi2Hinglish-Apex.pt")
result = model.transcribe("sample.wav")
print(result["text"])
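
The converted checkpoint accepts the usual openai-whisper decoding options; the call below uses standard library arguments with illustrative values, mirroring the language setting used in the pipeline example above.

# Standard openai-whisper decoding options also apply to the converted checkpoint
result = model.transcribe("sample.wav", task="transcribe", language="en", fp16=False)
print(result["text"])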

Miscellaneous

This model is part of a family of transformers-based ASR models trained by Oriserve. To compare this model against other models from the same family or other SOTA models, please head to our Speech-To-Text Arena. To learn more about our other models, or for any other queries regarding AI voice agents, reach out to us at ai-team@oriserve.com.
