
Caspi-1.7B

Hebrew ASR, done properly.

Caspi-1.7B is a Hebrew automatic speech recognition model built by fine-tuning Qwen/Qwen3-ASR-1.7B for real Hebrew speech.

Caspi exists for one reason: the base multilingual models are strong, but Hebrew deserves a model that is actually tuned for Hebrew — its vocabulary, phonetic edge cases, spelling patterns, and real-world audio conditions.

This model is aimed at single-pass Hebrew ASR with strong quality across conversational, crowd-sourced, and broadcast-style speech.

Despite major advances in speech recognition, Hebrew ASR has seen relatively little dedicated model development, with most systems relying on multilingual Whisper variants.

Caspi aims to push Hebrew ASR forward by training directly on Hebrew speech data and optimizing for real-world Hebrew transcription.

What Caspi is for

  • Hebrew transcription
  • Single-pass ASR inference
  • Offline and batch transcription
  • Research, benchmarking, and production experimentation
  • A stronger Hebrew-focused alternative to the multilingual base model

Why Caspi

Hebrew ASR is deceptively hard.

Short function words, phonetically similar terms, compressed voice-note audio, domain-specific names, and inconsistent orthography can all wreck transcription quality. Caspi was trained specifically to push performance where generic multilingual checkpoints tend to slip.

Compared to the base model, Caspi is intended to provide:

  • better Hebrew recognition quality
  • stronger handling of Hebrew vocabulary and orthographic patterns
  • improved robustness on real Hebrew speech datasets
  • a more serious baseline for Hebrew ASR evaluation and deployment

This is not a general multilingual release.
Caspi is a Hebrew-specialized checkpoint.


Base model

  • Base checkpoint: Qwen/Qwen3-ASR-1.7B
  • Model family: Qwen3-ASR
  • Base paper: Qwen3-ASR Technical Report

Caspi inherits the architecture and inference ecosystem of Qwen3-ASR, while adapting the model specifically for Hebrew ASR.


Model: Caspi-1.7B

  • Supported languages: Hebrew (he), Chinese (zh), English (en), Cantonese (yue), Arabic (ar), German (de), French (fr), Spanish (es), Portuguese (pt), Indonesian (id), Italian (it), Korean (ko), Russian (ru), Thai (th), Vietnamese (vi), Japanese (ja), Turkish (tr), Hindi (hi), Malay (ms), Dutch (nl), Swedish (sv), Danish (da), Finnish (fi), Polish (pl), Czech (cs), Filipino (fil), Persian (fa), Greek (el), Hungarian (hu), Macedonian (mk), Romanian (ro)
  • Supported dialects (inherited from the base model): Anhui, Dongbei, Fujian, Gansu, Guizhou, Hebei, Henan, Hubei, Hunan, Jiangxi, Ningxia, Shandong, Shaanxi, Shanxi, Sichuan, Tianjin, Yunnan, Zhejiang, Cantonese (Hong Kong accent), Cantonese (Guangdong accent), Wu language, Minnan language
  • Inference mode: Offline / Streaming
  • Audio types: Speech, Singing Voice, Songs with BGM

Training data

Caspi was fine-tuned on Hebrew speech-transcription data including:

  • ivrit-ai/crowd-transcribe-v5
  • ivrit-ai/crowd-recital-whisper-training

These datasets were used to adapt the multilingual base model toward stronger Hebrew recognition.

Notes on the data

Training focused on Hebrew audio + transcript pairs. As with any ASR system, performance is highly sensitive to:

  • transcript consistency
  • segmentation quality
  • domain mismatch
  • noise and compression
  • spelling normalization
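As a concrete illustration of the spelling-normalization point, here is a minimal sketch of Hebrew transcript normalization: stripping niqqud and cantillation marks and collapsing whitespace so reference and hypothesis are compared on equal footing. This is an illustration only, not Caspi's actual training-time normalization.

```python
import re

def normalize_hebrew(text: str) -> str:
    """Hypothetical normalization sketch for Hebrew ASR text."""
    # Strip niqqud and cantillation (Hebrew combining marks U+0591..U+05C7)
    text = re.sub(r"[\u0591-\u05C7]", "", text)
    # Map geresh / gershayim to ASCII apostrophe / double quote
    text = text.replace("\u05F3", "'").replace("\u05F4", '"')
    # Collapse runs of whitespace
    return re.sub(r"\s+", " ", text).strip()
```

Applying the same normalization to both references and hypotheses keeps scores comparable across systems.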

Some of the hardest Hebrew ASR failure modes remain:

  • short function words
  • phonetically similar forms
  • noisy and low-bitrate audio
  • proper nouns, abbreviations, and domain-heavy vocabulary

Intended use

Caspi is intended for:

  • Hebrew ASR research
  • transcription of Hebrew recordings
  • experimentation with Hebrew speech systems
  • benchmarking Hebrew ASR models
  • downstream speech products and prototypes

Example use cases

  • transcribing spoken Hebrew audio
  • transcribing interviews and conversations
  • transcribing voice notes
  • evaluating Hebrew ASR quality across domains
  • building Hebrew-first speech pipelines

Evaluation

Caspi was evaluated on Hebrew ASR benchmarks and internal evaluation sets.

Current evaluation sets

  • eval-d1
  • eval-whatsapp
  • hebrew-speech-kan
  • Matti Caspi Songs

Results

WER: Word Error Rate, lower is better

Dataset             Caspi WER   Ivrit v3 WER
eval-d1             4.2%        5.1%
eval-whatsapp       6.0%        7.2%
hebrew-speech-kan   7.1%        6.4%
Matti Caspi Songs   2.4%        3.7%
Average             4.93%       5.6%

Takeaway

Caspi improves over the compared Hebrew Whisper baseline on eval-d1, eval-whatsapp, and on the overall average, while remaining competitive on KAN-style broadcast speech.

That makes it a strong Hebrew ASR checkpoint for real-world use, especially on conversational and less curated audio.

Evaluation notes

  • If you publish benchmark claims, specify whether decoding used greedy or beam search
  • Keep normalization policy consistent across models
  • Comparisons are only meaningful if decoding and preprocessing conditions are matched fairly
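To keep comparisons fair, the same scorer and normalization should run on every model's output. A minimal, dependency-free WER sketch is shown below; the evaluation above may have used a different implementation (e.g. a library such as jiwer), so treat this as an illustration of the metric, not the exact scoring code.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance.
    Apply the same text normalization to both sides before calling."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, wer("a b c", "a x c") is one substitution over three reference words, i.e. about 0.33.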

Inference

Caspi uses the same overall inference ecosystem as the base Qwen3-ASR model.

Depending on your setup, you can use:

  • the qwen-asr package
  • Transformers-based inference
  • vLLM-based inference
  • optional forced alignment via Qwen/Qwen3-ForcedAligner-0.6B

Because Caspi is a fine-tuned derivative of Qwen3-ASR-1.7B, usage is similar to the base model — just replace the model name with OzLabs/Caspi-1.7B.

Quick example

import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "OzLabs/Caspi-1.7B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    max_inference_batch_size=32,
    max_new_tokens=256,
)

results = model.transcribe(
    audio="path/to/hebrew_audio.wav",
    language="Hebrew",
)

print(results[0].language)
print(results[0].text)

Python package usage

Transformers backend

import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "OzLabs/Caspi-1.7B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    max_inference_batch_size=32,
    max_new_tokens=256,
)

results = model.transcribe(
    audio="audio path / url",
    language=None,
)

print(results[0].language)
print(results[0].text)

With timestamps

import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "OzLabs/Caspi-1.7B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    max_inference_batch_size=32,
    max_new_tokens=256,
    forced_aligner="Qwen/Qwen3-ForcedAligner-0.6B",
    forced_aligner_kwargs=dict(
        dtype=torch.bfloat16,
        device_map="cuda:0",
    ),
)

results = model.transcribe(
    audio=[
      "path/to/audio",
      "audio.url"
    ],
    language=["Hebrew", "English"],
    return_time_stamps=True,
)

for r in results:
    print(r.language, r.text, r.time_stamps[0])

vLLM backend

import torch
from qwen_asr import Qwen3ASRModel

if __name__ == '__main__':
    model = Qwen3ASRModel.LLM(
        model="OzLabs/Caspi-1.7B",
        gpu_memory_utilization=0.7,
        max_inference_batch_size=128,
        max_new_tokens=4096,
        forced_aligner="Qwen/Qwen3-ForcedAligner-0.6B",
        forced_aligner_kwargs=dict(
            dtype=torch.bfloat16,
            device_map="cuda:0",
        ),
    )

    results = model.transcribe(
        audio=[
            "path/to/audio",
            "path/to/audio",
        ],
        language=["Hebrew", "English"],
        return_time_stamps=True,
    )

    for r in results:
        print(r.language, r.text, r.time_stamps[0])

Serve with vLLM

qwen-asr-serve OzLabs/Caspi-1.7B --gpu-memory-utilization 0.8 --host 0.0.0.0 --port 8000

Request example

import requests

url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json"}

data = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio_url",
                    "audio_url": {
                        "url": "path/to/audio"
                    },
                }
            ],
        }
    ]
}

response = requests.post(url, headers=headers, json=data, timeout=300)
response.raise_for_status()
content = response.json()["choices"][0]["message"]["content"]
print(content)

from qwen_asr import parse_asr_output
language, text = parse_asr_output(content)
print(language)
print(text)

Streaming inference

Caspi supports streaming inference through the vLLM backend.

Streaming is useful when you want lower-latency transcription, but note:

  • no batch inference in streaming mode
  • no timestamps in streaming mode

See the upstream Qwen3-ASR examples for the streaming backend implementation.

Streaming demo

qwen-asr-demo-streaming \
  --asr-model-path OzLabs/Caspi-1.7B \
  --host 0.0.0.0 \
  --port 8000 \
  --gpu-memory-utilization 0.9

Forced aligner

For timestamp prediction and alignment, Caspi can be used together with:

  • Qwen/Qwen3-ForcedAligner-0.6B

Example

import torch
from qwen_asr import Qwen3ForcedAligner

model = Qwen3ForcedAligner.from_pretrained(
    "Qwen/Qwen3-ForcedAligner-0.6B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
)

results = model.align(
    audio="path/to/audio",
    text="איך זה שכוכב אחד מעז",
    language="Hebrew",
)

print(results[0])
print(results[0][0].text, results[0][0].start_time, results[0][0].end_time)

Offline inference with vLLM

from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset

llm = LLM(
    model="OzLabs/Caspi-1.7B"
)

audio_asset = AudioAsset("winning_call")

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "audio_url",
                "audio_url": {"url": audio_asset.url}
            }
        ]
    }
]

sampling_params = SamplingParams(temperature=0.01, max_tokens=256)
outputs = llm.chat(conversation, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)

Recommended usage notes

For best results:

  • use reasonably clean audio when possible
  • segment long audio into shorter utterances
  • keep sample rate aligned with the base model’s preprocessing expectations
  • use beam search if latency allows
  • apply consistent Hebrew text normalization during evaluation
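One simple way to segment long audio is fixed-length windows with a small overlap, so words at a boundary appear intact in at least one chunk. The sketch below only computes sample offsets; the 30-second window, 1-second overlap, and 16 kHz rate are illustrative assumptions, not requirements of the model.

```python
def chunk_boundaries(num_samples: int, sample_rate: int = 16000,
                     window_s: float = 30.0, overlap_s: float = 1.0):
    """Hypothetical helper: split a recording into overlapping fixed
    windows, returned as (start, end) sample offsets."""
    window = int(window_s * sample_rate)
    step = int((window_s - overlap_s) * sample_rate)
    spans = []
    start = 0
    while start < num_samples:
        spans.append((start, min(start + window, num_samples)))
        if start + window >= num_samples:
            break
        start += step
    return spans
```

Each (start, end) span can then be sliced from the waveform and passed to model.transcribe as a batch; merging overlapping transcripts is left to the caller.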

Limitations

Caspi is strong, but Hebrew ASR is still hard.

Common failure modes include:

  • short phonetically similar words such as על / אל, אם / עם, לא / לו
  • noisy or low-bitrate speech
  • overlapping speakers
  • accented or highly informal speech
  • domain-specific names, abbreviations, and slang
  • code-switching between Hebrew and other languages

Performance will vary depending on:

  • recording quality
  • segmentation quality
  • speaker style
  • domain match between train and test data

Ethical considerations

ASR systems can mis-transcribe people’s speech, especially under:

  • noisy conditions
  • accented speech
  • overlapping speakers
  • low-quality microphones
  • compressed audio pipelines

For sensitive, high-stakes, or public-facing use cases, transcripts should be reviewed by a human.


Acknowledgements

Caspi is built on top of Qwen3-ASR-1.7B from the Qwen team.

We also thank the creators and contributors of the Hebrew datasets used for fine-tuning, especially the Ivrit.AI community datasets.


Citation

If you use Caspi in research or applications, please cite both the original Qwen3-ASR work and this checkpoint.

Base model

@article{Qwen3-ASR,
  title={Qwen3-ASR Technical Report},
  author={Xian Shi and Xiong Wang and Zhifang Guo and Yongqi Wang and Pei Zhang and Xinyu Zhang and Zishan Guo and Hongkun Hao and Yu Xi and Baosong Yang and Jin Xu and Jingren Zhou and Junyang Lin},
  journal={arXiv preprint arXiv:2601.21337},
  year={2026}
}

Caspi

@misc{caspi_hebrew_asr,
  title={Caspi-1.7B: Hebrew ASR fine-tuned from Qwen3-ASR-1.7B},
  author={Oz Labs},
  year={2026},
  howpublished={Hugging Face model card}
}