Whisper Hinglish (Preview)
A Whisper-large-v3 model specialised for Hinglish (Hindi-English code-switched) speech, with strong pure-Hindi and English transcription. It is the best-performing open-weights model in our internal evaluation across code-switch, Hindi, and English benchmarks.
Research preview. Numbers and weights may change. Evaluated on internal benchmarks; see disclaimers below.
Try it live
Available on Trelis Router (https://router.trelis.com/models):
- UI: log in and upload an audio clip to transcribe right in the browser.
- API: `POST https://router.trelis.com/api/v1/transcribe` for programmatic access (requires an API key).
Evaluation
Corpus WER (%, lower is better) under a script-safe indic-hindi normaliser (NFC + Indic
normalisation, keeps Devanagari matras/nuktas, strips punctuation; not the Whisper default, which strips
matras and inflates Devanagari WER). Compared against two leading commercial APIs: Sarvam (Saaras-v3)
and ElevenLabs Scribe-v2.
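To make the scoring protocol concrete, here is a minimal sketch of a script-safe normaliser and a corpus-level WER. It is not the normaliser used for the numbers in this card: the names `normalise`, `edit_distance`, and `corpus_wer` are hypothetical, the Indic-specific normalisation step is omitted, and lowercasing Latin text and replacing punctuation with spaces are assumptions. It only illustrates the two properties described above: combining marks (matras, nuktas) survive normalisation, and errors are pooled over the whole corpus before dividing by the total reference word count.

```python
import unicodedata


def normalise(text: str) -> str:
    """NFC-normalise and strip punctuation, keeping Devanagari combining marks."""
    # Assumption: Latin text is lowercased so casing does not count as an error.
    text = unicodedata.normalize("NFC", text).lower()
    kept = []
    for ch in text:
        # Unicode categories starting with "P" are punctuation (includes the danda).
        # Matras, nukta, virama and anusvara are category "M*" and are left alone.
        if unicodedata.category(ch).startswith("P"):
            kept.append(" ")
        else:
            kept.append(ch)
    # Collapse runs of whitespace left behind by removed punctuation.
    return " ".join("".join(kept).split())


def edit_distance(ref: list, hyp: list) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def corpus_wer(refs: list, hyps: list) -> float:
    """Corpus WER in percent: total word errors / total reference words."""
    errors = words = 0
    for ref, hyp in zip(refs, hyps):
        r, h = normalise(ref).split(), normalise(hyp).split()
        errors += edit_distance(r, h)
        words += len(r)
    return 100.0 * errors / words


if __name__ == "__main__":
    refs = ["मैं office जा रहा हूँ।", "Hello, world!"]
    hyps = ["मैं ऑफिस जा रहा हूँ", "hello world"]
    # One substitution (office vs ऑफिस) over seven reference words, about 14.29.
    print(f"{corpus_wer(refs, hyps):.2f}")
```

Pooling errors over the corpus, rather than averaging per-utterance WER, keeps short clips from dominating the score.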
Hinglish: code-switched (Hindi + English in one utterance, each in their native script)
| Benchmark | whisper-hinglish-preview | Sarvam | Scribe-v2 | whisper-large-v3 | Vaani |
|---|---|---|---|---|---|
| CoSHE-500 (conversational CS) | 13.67 | **11.47** ᶜᵐ | 12.43 | 29.74 | 73.96 |
| cs-fleurs (read CS) | 10.19 | 16.47 ᶜᵐ | **7.57** | 33.92 | 34.12 |
| hiacc-adult (accented CS) | **12.73** | 14.44 ᶜᵐ | 16.98 | 28.53 | 60.09 |
| hiacc-child (accented CS) | **10.69** | 14.11 ᶜᵐ | 18.36 | 27.91 | 32.17 |
Hindi (pure Devanagari)
| Benchmark | whisper-hinglish-preview | Sarvam | Scribe-v2 | whisper-large-v3 | Vaani |
|---|---|---|---|---|---|
| Common Voice Hindi (cv-hi) | 12.86 | **12.40** | 13.44 | 30.82 | 14.48 |
| FLEURS-hi | 12.57 | **10.07** | 11.33 | 27.50 | 11.58 |
English
| Benchmark | whisper-hinglish-preview | Sarvam | Scribe-v2 | whisper-large-v3 | Vaani |
|---|---|---|---|---|---|
| FLEURS-en | 6.93 | 5.14 | **4.01** | 4.81 | 101.66 |
Bold = best on that row.
How to use
Like any Whisper model, specify the language when you transcribe.
```python
import soundfile as sf
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

repo = "Trelis/whisper-hinglish-preview"
proc = WhisperProcessor.from_pretrained(repo)
model = WhisperForConditionalGeneration.from_pretrained(repo, torch_dtype=torch.bfloat16).to("cuda").eval()

# The clip is expected to be 16 kHz mono.
audio, sr = sf.read("clip.wav")
feat = proc.feature_extractor(audio, sampling_rate=16000, return_tensors="pt").input_features.to("cuda", torch.bfloat16)

# Hindi audio -> force <|hi|>; English audio -> force <|en|>
ids = proc.tokenizer.convert_tokens_to_ids
prompt = [ids("<|startoftranscript|>"), ids("<|hi|>"), ids("<|transcribe|>"), ids("<|notimestamps|>")]

out = model.generate(
    input_features=feat,
    decoder_input_ids=torch.tensor([prompt]).to("cuda"),
    max_new_tokens=440,
)
print(proc.tokenizer.decode(out[0], skip_special_tokens=True))
```
Code-switched audio. The model uses a dedicated `<|mixedcode|>` marker for utterances that mix Devanagari and Latin script. Insert it right after the language token, and choose the language token by the dominant script of the utterance:

```python
# Reuses `proc` and `ids` from the snippet above.
# The marker is tokenized rather than looked up as a single id, so it may expand to several ids.
mc = proc.tokenizer("<|mixedcode|>", add_special_tokens=False).input_ids
prompt = [ids("<|startoftranscript|>"), ids("<|hi|>"), *mc, ids("<|transcribe|>"), ids("<|notimestamps|>")]
```
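The "dominant script" rule can be sketched as a small heuristic. This is an illustration only, not part of the model or its training pipeline: `script_counts` and `prompt_tokens` are hypothetical helpers, counting raw characters (matras included) is an assumption, and ties or empty text default to `<|hi|>` arbitrarily. Since the transcript is not known before decoding, a helper like this is only useful on labelled data or on a first-pass transcript.

```python
def script_counts(text: str) -> tuple:
    """Count Devanagari-block characters and ASCII Latin letters in `text`."""
    deva = sum(1 for ch in text if "\u0900" <= ch <= "\u097f")
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    return deva, latin


def prompt_tokens(text: str) -> list:
    """Return the decoder prompt as token strings for a transcript-like `text`."""
    deva, latin = script_counts(text)
    # Language token follows the dominant script (assumption: ties go to Hindi).
    lang = "<|hi|>" if deva >= latin else "<|en|>"
    tokens = ["<|startoftranscript|>", lang]
    # Add the code-switch marker only when both scripts are present.
    if deva and latin:
        tokens.append("<|mixedcode|>")
    tokens += ["<|transcribe|>", "<|notimestamps|>"]
    return tokens


if __name__ == "__main__":
    print(prompt_tokens("मैं office जा रहा हूँ"))
    print(prompt_tokens("hello world"))
```

The returned strings still need converting to ids, with `<|mixedcode|>` passed through the tokenizer as in the snippet above rather than looked up as a single id.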
Disclaimers
- Commercial-API WERs on pure Hindi benchmarks here are pessimistic. Sarvam and Scribe keep English loanwords in Latin script and numbers as digits, whereas our references render everything in Devanagari. A translit-blind WER then charges a substitution per loanword/number against them. The comparison is apples-to-apples on our Devanagari-reference protocol, not a claim about their raw quality.
- ᶜᵐ Sarvam evaluated in its code-mixed mode.
- Specify the language (`<|hi|>`/`<|en|>`) as shown above (standard Whisper usage) to get the reported quality.
Attributions
- Architecture base: `openai/whisper-large-v3`.
- Starting checkpoint: Whisper-Vaani. Our Hindi/Hinglish training started from `ARTPARK-IISc/whisper-large-v3-vaani-hindi`, a Vaani-fine-tuned Whisper-large-v3 from the Vaani project (ARTPARK @ IISc). We gratefully credit the Whisper-Vaani model and the Vaani team.
- Evaluation benchmark: CoSHE-500 is derived from `soketlabs/CoSHE-Eval` (Soket Labs, CC-BY-NC-4.0).