
Caspi-1.7B

Hebrew ASR, done properly.

Caspi-1.7B is a Hebrew automatic speech recognition model built by fine-tuning Qwen/Qwen3-ASR-1.7B for real Hebrew speech.

Caspi exists for one reason: the base multilingual models are strong, but Hebrew deserves a model that is actually tuned for Hebrew — its vocabulary, phonetic edge cases, spelling patterns, and real-world audio conditions.

This model is aimed at single-pass Hebrew ASR with strong quality across conversational, crowd-sourced, and broadcast-style speech.

Despite major advances in speech recognition, Hebrew ASR has seen relatively little dedicated model development, with most systems relying on multilingual Whisper variants.

Caspi aims to push Hebrew ASR forward by training directly on Hebrew speech data and optimizing for real-world Hebrew transcription.

What Caspi is for

  • Hebrew transcription
  • Single-pass ASR inference
  • Offline and batch transcription
  • Research, benchmarking, and production experimentation
  • A stronger Hebrew-focused alternative to the multilingual base model

Why Caspi

Hebrew ASR is deceptively hard.

Short function words, phonetically similar terms, compressed voice-note audio, domain-specific names, and inconsistent orthography can all wreck transcription quality. Caspi was trained specifically to push performance where generic multilingual checkpoints tend to slip.

Compared to the base model, Caspi is intended to provide:

  • better Hebrew recognition quality
  • stronger handling of Hebrew vocabulary and orthographic patterns
  • improved robustness on real Hebrew speech datasets
  • a more serious baseline for Hebrew ASR evaluation and deployment

This is not a general multilingual release.
Caspi is a Hebrew-specialized checkpoint.


Base model

  • Base checkpoint: Qwen/Qwen3-ASR-1.7B
  • Model family: Qwen3-ASR
  • Base paper: Qwen3-ASR Technical Report

Caspi inherits the architecture and inference ecosystem of Qwen3-ASR, while adapting the model specifically for Hebrew ASR.


Model: Caspi-1.7B

  • Supported languages: Hebrew (he), Chinese (zh), English (en), Cantonese (yue), Arabic (ar), German (de), French (fr), Spanish (es), Portuguese (pt), Indonesian (id), Italian (it), Korean (ko), Russian (ru), Thai (th), Vietnamese (vi), Japanese (ja), Turkish (tr), Hindi (hi), Malay (ms), Dutch (nl), Swedish (sv), Danish (da), Finnish (fi), Polish (pl), Czech (cs), Filipino (fil), Persian (fa), Greek (el), Hungarian (hu), Macedonian (mk), Romanian (ro)
  • Supported dialects (inherited from the base model): Anhui, Dongbei, Fujian, Gansu, Guizhou, Hebei, Henan, Hubei, Hunan, Jiangxi, Ningxia, Shandong, Shaanxi, Shanxi, Sichuan, Tianjin, Yunnan, Zhejiang, Cantonese (Hong Kong accent), Cantonese (Guangdong accent), Wu language, Minnan language
  • Inference mode: Offline / Streaming
  • Audio types: Speech, Singing Voice, Songs with BGM

Training data

Caspi was fine-tuned on Hebrew speech-transcription data including:

  • ivrit-ai/crowd-transcribe-v5
  • ivrit-ai/crowd-recital-whisper-training

These datasets were used to adapt the multilingual base model toward stronger Hebrew recognition.

Notes on the data

Training focused on Hebrew audio + transcript pairs. As with any ASR system, performance is highly sensitive to:

  • transcript consistency
  • segmentation quality
  • domain mismatch
  • noise and compression
  • spelling normalization
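As a concrete illustration of the spelling-normalization point, here is a minimal sketch of Hebrew transcript normalization: stripping niqqud and cantillation marks and collapsing whitespace so reference and hypothesis are compared on equal footing. This is an illustration only, not Caspi's actual training-time normalization.

```python
import re

def normalize_hebrew(text: str) -> str:
    """Hypothetical normalization sketch for Hebrew ASR text."""
    # Strip niqqud and cantillation (Hebrew combining marks U+0591..U+05C7)
    text = re.sub(r"[\u0591-\u05C7]", "", text)
    # Map geresh / gershayim to ASCII apostrophe / double quote
    text = text.replace("\u05F3", "'").replace("\u05F4", '"')
    # Collapse runs of whitespace
    return re.sub(r"\s+", " ", text).strip()
```

Applying the same normalization to both references and hypotheses keeps scores comparable across systems.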

Some of the hardest Hebrew ASR failure modes remain:

  • short function words
  • phonetically similar forms
  • noisy and low-bitrate audio
  • proper nouns, abbreviations, and domain-heavy vocabulary

Intended use

Caspi is intended for:

  • Hebrew ASR research
  • transcription of Hebrew recordings
  • experimentation with Hebrew speech systems
  • benchmarking Hebrew ASR models
  • downstream speech products and prototypes

Example use cases

  • transcribing spoken Hebrew audio
  • transcribing interviews and conversations
  • transcribing voice notes
  • evaluating Hebrew ASR quality across domains
  • building Hebrew-first speech pipelines

Evaluation

Caspi was evaluated on Hebrew ASR benchmarks and internal evaluation sets.

Current evaluation sets

  • eval-d1
  • eval-whatsapp
  • hebrew-speech-kan
  • Matti Caspi Songs

Results

WER: Word Error Rate, lower is better

Dataset             Caspi WER   Ivrit v3 WER
eval-d1             4.2%        5.1%
eval-whatsapp       6.0%        7.2%
hebrew-speech-kan   7.1%        6.4%
Matti Caspi Songs   2.4%        3.7%
Average             4.93%       5.6%

Takeaway

Caspi improves over the compared Hebrew Whisper baseline on eval-d1, eval-whatsapp, and on the overall average, while remaining competitive on KAN-style broadcast speech.

That makes it a strong Hebrew ASR checkpoint for real-world use, especially on conversational and less curated audio.

Evaluation notes

  • If you publish benchmark claims, specify whether decoding used greedy or beam search
  • Keep normalization policy consistent across models
  • Comparisons are only meaningful if decoding and preprocessing conditions are matched fairly
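To keep comparisons fair, the same scorer and normalization should run on every model's output. A minimal, dependency-free WER sketch is shown below; the evaluation above may have used a different implementation (e.g. a library such as jiwer), so treat this as an illustration of the metric, not the exact scoring code.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance.
    Apply the same text normalization to both sides before calling."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, wer("a b c", "a x c") is one substitution over three reference words, i.e. about 0.33.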

Inference

Caspi uses the same overall inference ecosystem as the base Qwen3-ASR model.

Depending on your setup, you can use:

  • the qwen-asr package
  • Transformers-based inference
  • vLLM-based inference
  • optional forced alignment via Qwen/Qwen3-ForcedAligner-0.6B

Because Caspi is a fine-tuned derivative of Qwen3-ASR-1.7B, usage is similar to the base model — just replace the model name with OzLabs/Caspi-1.7B.

Quick example

import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "OzLabs/Caspi-1.7B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    max_inference_batch_size=32,
    max_new_tokens=256,
)

results = model.transcribe(
    audio="path/to/hebrew_audio.wav",
    language="Hebrew",
)

print(results[0].language)
print(results[0].text)

Python package usage

Transformers backend

import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "OzLabs/Caspi-1.7B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    max_inference_batch_size=32,
    max_new_tokens=256,
)

results = model.transcribe(
    audio="audio path / url",
    language=None,
)

print(results[0].language)
print(results[0].text)

With timestamps

import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "OzLabs/Caspi-1.7B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    max_inference_batch_size=32,
    max_new_tokens=256,
    forced_aligner="Qwen/Qwen3-ForcedAligner-0.6B",
    forced_aligner_kwargs=dict(
        dtype=torch.bfloat16,
        device_map="cuda:0",
    ),
)

results = model.transcribe(
    audio=[
      "path/to/audio",
      "audio.url"
    ],
    language=["Hebrew", "English"],
    return_time_stamps=True,
)

for r in results:
    print(r.language, r.text, r.time_stamps[0])

vLLM backend

import torch
from qwen_asr import Qwen3ASRModel

if __name__ == '__main__':
    model = Qwen3ASRModel.LLM(
        model="OzLabs/Caspi-1.7B",
        gpu_memory_utilization=0.7,
        max_inference_batch_size=128,
        max_new_tokens=4096,
        forced_aligner="Qwen/Qwen3-ForcedAligner-0.6B",
        forced_aligner_kwargs=dict(
            dtype=torch.bfloat16,
            device_map="cuda:0",
        ),
    )

    results = model.transcribe(
        audio=[
            "path/to/audio",
            "path/to/audio",
        ],
        language=["Hebrew", "English"],
        return_time_stamps=True,
    )

    for r in results:
        print(r.language, r.text, r.time_stamps[0])

Serve with vLLM

qwen-asr-serve OzLabs/Caspi-1.7B --gpu-memory-utilization 0.8 --host 0.0.0.0 --port 8000

Request example

import requests

url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json"}

data = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio_url",
                    "audio_url": {
                        "url": "path/to/audio"
                    },
                }
            ],
        }
    ]
}

response = requests.post(url, headers=headers, json=data, timeout=300)
response.raise_for_status()
content = response.json()["choices"][0]["message"]["content"]
print(content)

from qwen_asr import parse_asr_output
language, text = parse_asr_output(content)
print(language)
print(text)

Streaming inference

Caspi supports streaming inference through the vLLM backend.

Streaming is useful when you want lower-latency transcription, but note:

  • no batch inference in streaming mode
  • no timestamps in streaming mode

See the upstream Qwen3-ASR examples for the streaming backend implementation.

Streaming demo

qwen-asr-demo-streaming \
  --asr-model-path OzLabs/Caspi-1.7B \
  --host 0.0.0.0 \
  --port 8000 \
  --gpu-memory-utilization 0.9

Forced aligner

For timestamp prediction and alignment, Caspi can be used together with:

  • Qwen/Qwen3-ForcedAligner-0.6B

Example

import torch
from qwen_asr import Qwen3ForcedAligner

model = Qwen3ForcedAligner.from_pretrained(
    "Qwen/Qwen3-ForcedAligner-0.6B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
)

results = model.align(
    audio="path/to/audio",
    text="איך זה שכוכב אחד מעז",
    language="Hebrew",
)

print(results[0])
print(results[0][0].text, results[0][0].start_time, results[0][0].end_time)

Offline inference with vLLM

from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset

llm = LLM(
    model="OzLabs/Caspi-1.7B"
)

audio_asset = AudioAsset("winning_call")

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "audio_url",
                "audio_url": {"url": audio_asset.url}
            }
        ]
    }
]

sampling_params = SamplingParams(temperature=0.01, max_tokens=256)
outputs = llm.chat(conversation, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)

Recommended usage notes

For best results:

  • use reasonably clean audio when possible
  • segment long audio into shorter utterances
  • keep sample rate aligned with the base model’s preprocessing expectations
  • use beam search if latency allows
  • apply consistent Hebrew text normalization during evaluation
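One simple way to segment long audio is fixed-length windows with a small overlap, so words at a boundary appear intact in at least one chunk. The sketch below only computes sample offsets; the 30-second window, 1-second overlap, and 16 kHz rate are illustrative assumptions, not requirements of the model.

```python
def chunk_boundaries(num_samples: int, sample_rate: int = 16000,
                     window_s: float = 30.0, overlap_s: float = 1.0):
    """Hypothetical helper: split a recording into overlapping fixed
    windows, returned as (start, end) sample offsets."""
    window = int(window_s * sample_rate)
    step = int((window_s - overlap_s) * sample_rate)
    spans = []
    start = 0
    while start < num_samples:
        spans.append((start, min(start + window, num_samples)))
        if start + window >= num_samples:
            break
        start += step
    return spans
```

Each (start, end) span can then be sliced from the waveform and passed to model.transcribe as a batch; merging overlapping transcripts is left to the caller.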

Limitations

Caspi is strong, but Hebrew ASR is still hard.

Common failure modes include:

  • short phonetically similar words such as על / אל, אם / עם, לא / לו
  • noisy or low-bitrate speech
  • overlapping speakers
  • accented or highly informal speech
  • domain-specific names, abbreviations, and slang
  • code-switching between Hebrew and other languages

Performance will vary depending on:

  • recording quality
  • segmentation quality
  • speaker style
  • domain match between train and test data

Ethical considerations

ASR systems can mis-transcribe people’s speech, especially under:

  • noisy conditions
  • accented speech
  • overlapping speakers
  • low-quality microphones
  • compressed audio pipelines

For sensitive, high-stakes, or public-facing use cases, transcripts should be reviewed by a human.


Acknowledgements

Caspi is built on top of Qwen3-ASR-1.7B from the Qwen team.

We also thank the creators and contributors of the Hebrew datasets used for fine-tuning, especially the Ivrit.AI community datasets.


Citation

If you use Caspi in research or applications, please cite both the original Qwen3-ASR work and this checkpoint.

Base model

@article{Qwen3-ASR,
  title={Qwen3-ASR Technical Report},
  author={Xian Shi and Xiong Wang and Zhifang Guo and Yongqi Wang and Pei Zhang and Xinyu Zhang and Zishan Guo and Hongkun Hao and Yu Xi and Baosong Yang and Jin Xu and Jingren Zhou and Junyang Lin},
  journal={arXiv preprint arXiv:2601.21337},
  year={2026}
}

Caspi

@misc{caspi_hebrew_asr,
  title={Caspi-1.7B: Hebrew ASR fine-tuned from Qwen3-ASR-1.7B},
  author={Oz Labs},
  year={2026},
  howpublished={Hugging Face model card}
}