Caspi-1.7B
Hebrew ASR, done properly.
Caspi-1.7B is a Hebrew automatic speech recognition model built by fine-tuning Qwen/Qwen3-ASR-1.7B for real Hebrew speech.
Caspi exists for one reason: the base multilingual models are strong, but Hebrew deserves a model that is actually tuned for Hebrew — its vocabulary, phonetic edge cases, spelling patterns, and real-world audio conditions.
This model is aimed at single-pass Hebrew ASR with strong quality across conversational, crowd-sourced, and broadcast-style speech.
Despite major advances in speech recognition, Hebrew ASR has seen relatively little dedicated model development, with most systems relying on multilingual Whisper variants.
Caspi aims to push Hebrew ASR forward by training directly on Hebrew speech data and optimizing for real-world Hebrew transcription.
What Caspi is for
- Hebrew transcription
- Single-pass ASR inference
- Offline and batch transcription
- Research, benchmarking, and production experimentation
- A stronger Hebrew-focused alternative to the multilingual base model
Why Caspi
Hebrew ASR is deceptively hard.
Short function words, phonetically similar terms, compressed voice-note audio, domain-specific names, and inconsistent orthography can all wreck transcription quality. Caspi was trained specifically to push performance where generic multilingual checkpoints tend to slip.
Compared to the base model, Caspi is intended to provide:
- better Hebrew recognition quality
- stronger handling of Hebrew vocabulary and orthographic patterns
- improved robustness on real Hebrew speech datasets
- a more serious baseline for Hebrew ASR evaluation and deployment
This is not a general multilingual release.
Caspi is a Hebrew-specialized checkpoint.
Base model
- Base checkpoint: Qwen/Qwen3-ASR-1.7B
- Model family: Qwen3-ASR
- Base paper: Qwen3-ASR Technical Report
Caspi inherits the architecture and inference ecosystem of Qwen3-ASR, while adapting the model specifically for Hebrew ASR.
| Model | Supported Languages | Supported Dialects | Inference Mode | Audio Types |
|---|---|---|---|---|
| Caspi-1.7B | Hebrew (he), Chinese (zh), English (en), Cantonese (yue), Arabic (ar), German (de), French (fr), Spanish (es), Portuguese (pt), Indonesian (id), Italian (it), Korean (ko), Russian (ru), Thai (th), Vietnamese (vi), Japanese (ja), Turkish (tr), Hindi (hi), Malay (ms), Dutch (nl), Swedish (sv), Danish (da), Finnish (fi), Polish (pl), Czech (cs), Filipino (fil), Persian (fa), Greek (el), Hungarian (hu), Macedonian (mk), Romanian (ro) | Anhui, Dongbei, Fujian, Gansu, Guizhou, Hebei, Henan, Hubei, Hunan, Jiangxi, Ningxia, Shandong, Shaanxi, Shanxi, Sichuan, Tianjin, Yunnan, Zhejiang, Cantonese (Hong Kong accent), Cantonese (Guangdong accent), Wu language, Minnan language. | Offline / Streaming | Speech, Singing Voice, Songs with BGM |
Training data
Caspi was fine-tuned on Hebrew speech-transcription data including:
- ivrit-ai/crowd-transcribe-v5
- ivrit-ai/crowd-recital-whisper-training
These datasets were used to adapt the multilingual base model toward stronger Hebrew recognition.
Notes on the data
Training focused on Hebrew audio + transcript pairs. As with any ASR system, performance is highly sensitive to:
- transcript consistency
- segmentation quality
- domain mismatch
- noise and compression
- spelling normalization
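As an illustration of the spelling-normalization point, transcript cleanup of the kind commonly applied before training or scoring Hebrew ASR can be sketched as below. This is a hedged example: the exact normalization used for Caspi's training data is not documented here, and the `normalize_hebrew` helper is illustrative.

```python
import re
import unicodedata

def normalize_hebrew(text: str) -> str:
    """Illustrative transcript normalization: strip niqqud/cantillation
    marks, drop punctuation, collapse whitespace. The normalization
    actually used to prepare Caspi's training data may differ."""
    # Remove Hebrew diacritics (niqqud and cantillation, U+0591-U+05C7)
    text = re.sub(r"[\u0591-\u05C7]", "", text)
    # Drop punctuation characters (Unicode category P*)
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    # Collapse runs of whitespace
    return re.sub(r"\s+", " ", text).strip()

print(normalize_hebrew("שָׁלוֹם, עוֹלָם!"))  # -> שלום עולם
```

Applying one such function to both references and hypotheses keeps WER comparisons consistent across models.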
Some of the hardest Hebrew ASR failure modes remain:
- short function words
- phonetically similar forms
- noisy and low-bitrate audio
- proper nouns, abbreviations, and domain-heavy vocabulary
Intended use
Caspi is intended for:
- Hebrew ASR research
- transcription of Hebrew recordings
- experimentation with Hebrew speech systems
- benchmarking Hebrew ASR models
- downstream speech products and prototypes
Example use cases
- transcribing spoken Hebrew audio
- transcribing interviews and conversations
- transcribing voice notes
- evaluating Hebrew ASR quality across domains
- building Hebrew-first speech pipelines
Evaluation
Caspi was evaluated on Hebrew ASR benchmarks and internal evaluation sets.
Current evaluation sets
- eval-d1
- eval-whatsapp
- hebrew-speech-kan
Results
WER: Word Error Rate, lower is better
| Dataset | Caspi WER | Ivrit v3 WER |
|---|---|---|
| eval-d1 | 4.2% | 5.1% |
| eval-whatsapp | 6.0% | 7.2% |
| hebrew-speech-kan | 7.1% | 6.4% |
| Matti Caspi Songs | 2.4% | 3.7% |
| average | 4.93% | 5.6% |
Takeaway
Caspi improves over the compared Hebrew Whisper baseline on eval-d1, eval-whatsapp, and on the overall average, while remaining competitive on KAN-style broadcast speech.
That makes it a strong Hebrew ASR checkpoint for real-world use, especially on conversational and less curated audio.
Evaluation notes
- If you publish benchmark claims, specify whether decoding used greedy or beam search
- Keep normalization policy consistent across models
- Comparisons are only meaningful if decoding and preprocessing conditions are matched fairly
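To keep such comparisons reproducible, WER should be computed identically for every model. A minimal word-level edit-distance implementation (a sketch, not the scoring script used for the table above) looks like:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: Levenshtein distance over whitespace tokens,
    divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution out of four reference words -> 25% WER
print(wer("מה שלומך היום חבר", "מה שלומך אתמול חבר"))  # 0.25
```

In practice both strings should first pass through the same text normalization, since punctuation and diacritic handling alone can shift Hebrew WER by a large margin.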
Inference
Caspi uses the same overall inference ecosystem as the base Qwen3-ASR model.
Depending on your setup, you can use:
- the qwen-asr package
- Transformers-based inference
- vLLM-based inference
- optional forced alignment via Qwen/Qwen3-ForcedAligner-0.6B
Because Caspi is a fine-tuned derivative of Qwen3-ASR-1.7B, usage is similar to the base model — just replace the model name with OzLabs/Caspi-1.7B.
Quick example
```python
import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "OzLabs/Caspi-1.7B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    max_inference_batch_size=32,
    max_new_tokens=256,
)

results = model.transcribe(
    audio="path/to/hebrew_audio.wav",
    language="Hebrew",
)

print(results[0].language)
print(results[0].text)
```
Python package usage
Transformers backend
```python
import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "OzLabs/Caspi-1.7B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    max_inference_batch_size=32,
    max_new_tokens=256,
)

# language=None lets the model detect the spoken language
results = model.transcribe(
    audio="audio path / url",
    language=None,
)

print(results[0].language)
print(results[0].text)
```
With timestamps
```python
import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "OzLabs/Caspi-1.7B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    max_inference_batch_size=32,
    max_new_tokens=256,
    forced_aligner="Qwen/Qwen3-ForcedAligner-0.6B",
    forced_aligner_kwargs=dict(
        dtype=torch.bfloat16,
        device_map="cuda:0",
    ),
)

results = model.transcribe(
    audio=[
        "path/to/audio",
        "audio.url",
    ],
    language=["Hebrew", "English"],
    return_time_stamps=True,
)

for r in results:
    print(r.language, r.text, r.time_stamps[0])
```
vLLM backend
```python
import torch
from qwen_asr import Qwen3ASRModel

if __name__ == "__main__":
    model = Qwen3ASRModel.LLM(
        model="OzLabs/Caspi-1.7B",
        gpu_memory_utilization=0.7,
        max_inference_batch_size=128,
        max_new_tokens=4096,
        forced_aligner="Qwen/Qwen3-ForcedAligner-0.6B",
        forced_aligner_kwargs=dict(
            dtype=torch.bfloat16,
            device_map="cuda:0",
        ),
    )

    results = model.transcribe(
        audio=[
            "path/to/audio",
            "path/to/audio",
        ],
        language=["Hebrew", "English"],
        return_time_stamps=True,
    )

    for r in results:
        print(r.language, r.text, r.time_stamps[0])
```
Serve with vLLM
```bash
qwen-asr-serve OzLabs/Caspi-1.7B --gpu-memory-utilization 0.8 --host 0.0.0.0 --port 8000
```
Request example
```python
import requests
from qwen_asr import parse_asr_output

url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json"}
data = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio_url",
                    "audio_url": {"url": "path/to/audio"},
                }
            ],
        }
    ]
}

response = requests.post(url, headers=headers, json=data, timeout=300)
response.raise_for_status()
content = response.json()["choices"][0]["message"]["content"]
print(content)

language, text = parse_asr_output(content)
print(language)
print(text)
```
Streaming inference
Caspi supports streaming inference through the vLLM backend.
Streaming is useful when you want lower-latency transcription, but note:
- no batch inference in streaming mode
- no timestamps in streaming mode
See the upstream Qwen3-ASR examples for the streaming backend implementation.
Streaming demo
```bash
qwen-asr-demo-streaming \
    --asr-model-path OzLabs/Caspi-1.7B \
    --host 0.0.0.0 \
    --port 8000 \
    --gpu-memory-utilization 0.9
```
Forced aligner
For timestamp prediction and alignment, Caspi can be used together with:
Qwen/Qwen3-ForcedAligner-0.6B
Example
```python
import torch
from qwen_asr import Qwen3ForcedAligner

model = Qwen3ForcedAligner.from_pretrained(
    "Qwen/Qwen3-ForcedAligner-0.6B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
)

results = model.align(
    audio="path/to/audio",
    text="איך זה שכוכב אחד מעז",
    language="Hebrew",
)

print(results[0])
print(results[0][0].text, results[0][0].start_time, results[0][0].end_time)
```
Offline inference with vLLM
```python
from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset

llm = LLM(model="OzLabs/Caspi-1.7B")

audio_asset = AudioAsset("winning_call")
conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "audio_url",
                "audio_url": {"url": audio_asset.url},
            }
        ],
    }
]

sampling_params = SamplingParams(temperature=0.01, max_tokens=256)
outputs = llm.chat(conversation, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```
Recommended usage notes
For best results:
- use reasonably clean audio when possible
- segment long audio into shorter utterances
- keep sample rate aligned with the base model’s preprocessing expectations
- use beam search if latency allows
- apply consistent Hebrew text normalization during evaluation
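For the segmentation point above, one simple approach is fixed-length windowing with a small overlap before passing segments to transcription. A sketch under stated assumptions: the input is a mono NumPy waveform at 16 kHz, and the `chunk_audio` helper plus its chunk and overlap sizes are illustrative choices, not Caspi-specific requirements.

```python
import numpy as np

def chunk_audio(waveform: np.ndarray, sr: int = 16_000,
                chunk_s: float = 30.0, overlap_s: float = 1.0):
    """Split a mono waveform into ~chunk_s second windows that overlap
    by overlap_s seconds, so words at a boundary appear in both chunks."""
    chunk = int(chunk_s * sr)
    step = chunk - int(overlap_s * sr)
    return [waveform[start:start + chunk]
            for start in range(0, max(len(waveform) - 1, 1), step)]

# 65 s of audio at 16 kHz -> two full 30 s chunks plus a 7 s tail
chunks = chunk_audio(np.zeros(65 * 16_000))
print([len(c) / 16_000 for c in chunks])  # [30.0, 30.0, 7.0]
```

The resulting list can then be passed as the `audio` argument of a batch `transcribe` call; overlapping text at chunk boundaries needs to be deduplicated when stitching transcripts back together.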
Limitations
Caspi is strong, but Hebrew ASR is still hard.
Common failure modes include:
- short phonetically similar words such as על / אל, אם / עם, לא / לו
- noisy or low-bitrate speech
- overlapping speakers
- accented or highly informal speech
- domain-specific names, abbreviations, and slang
- code-switching between Hebrew and other languages
Performance will vary depending on:
- recording quality
- segmentation quality
- speaker style
- domain match between train and test data
Ethical considerations
ASR systems can mis-transcribe people’s speech, especially under:
- noisy conditions
- accented speech
- overlapping speakers
- low-quality microphones
- compressed audio pipelines
For sensitive, high-stakes, or public-facing use cases, transcripts should be reviewed by a human.
Acknowledgements
Caspi is built on top of Qwen3-ASR-1.7B from the Qwen team.
We also thank the creators and contributors of the Hebrew datasets used for fine-tuning, especially the Ivrit.AI community datasets.
Citation
If you use Caspi in research or applications, please cite both the original Qwen3-ASR work and this checkpoint.
Base model
@article{Qwen3-ASR,
title={Qwen3-ASR Technical Report},
author={Xian Shi and Xiong Wang and Zhifang Guo and Yongqi Wang and Pei Zhang and Xinyu Zhang and Zishan Guo and Hongkun Hao and Yu Xi and Baosong Yang and Jin Xu and Jingren Zhou and Junyang Lin},
journal={arXiv preprint arXiv:2601.21337},
year={2026}
}
Caspi
@misc{caspi_hebrew_asr,
title={Caspi-1.7B: Hebrew ASR fine-tuned from Qwen3-ASR-1.7B},
author={Oz Labs},
year={2026},
howpublished={Hugging Face model card}
}