---
license: apache-2.0
library_name: transformers
pipeline_tag: automatic-speech-recognition
base_model: Qwen/Qwen3-ASR-1.7B
language:
- da
tags:
- audio
- speech
- automatic-speech-recognition
- danish
- qwen3-asr
- trust-remote-code
- custom-code
---

# Capacit-ai/saga

`Capacit-ai/saga` is a state-of-the-art Danish automatic speech recognition model based on `Qwen/Qwen3-ASR-1.7B`. Unlike competing models, it is optimized for fast inference through aggressive input downsampling and variable chunk sizing, which lets Saga achieve state-of-the-art performance while being significantly more efficient.

The model was trained on an NVIDIA B200 using the [CoRal dataset](https://huggingface.co/CoRal-project/datasets) family, courtesy of the [Danish Innovation Fund](https://innovationsfonden.dk/da) and the [Alexandra Institute](https://alexandra.dk).

This repository is intended for Danish transcription only. The underlying Qwen3-ASR base model is multilingual, but this finetuned checkpoint is Danish-focused and has unlearned most of its multilingual capabilities.
## Model Summary

- Base model: `Qwen/Qwen3-ASR-1.7B`
- Task: automatic speech recognition
- Primary language: Danish
- Input audio: 16 kHz mono waveform

## Quickstart

Install the packages:

```bash
pip install -U transformers soundfile torch qwen-asr
```

Then load the model with `transformers`:

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "capacit-ai/saga"
DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"
DTYPE = torch.bfloat16 if DEVICE.startswith("cuda") else torch.float32

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=DTYPE,
)
model.to(DEVICE)
model.eval()

audio = processor.load_audio("audio.wav")
text = model.transcribe(audio, processor)
print(text)
```

## Long-Form Audio

The base Qwen3-ASR architecture supports long inputs, but the most stable long-form decoding in this project came from accumulated-audio continuation decoding rather than a single naive generate call. The `model.transcribe()` method already implements this strategy: it walks through the audio in `step_seconds` chunks, re-feeding the accumulated waveform together with previously decoded text so the model keeps prior context. The `step_seconds`, `rollback_tokens`, and `max_new_tokens` parameters can be tuned for your use case.
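To make the stepping logic concrete, here is a minimal, model-free sketch of how the accumulated-audio windows could be produced. The helper name `accumulated_windows` and the exact boundary handling are illustrative assumptions, not the shipped implementation:

```python
def accumulated_windows(total_seconds: float, step_seconds: float = 15.0):
    """Yield growing (0, end) windows, stepping forward by step_seconds.

    Each window spans ALL audio from the start up to `end`, mirroring the
    accumulated-audio continuation strategy: at every step the model re-reads
    the full waveform so far together with the text it has already decoded,
    so prior context is never lost.
    (Illustrative sketch only, not the shipped implementation.)
    """
    end = 0.0
    while end < total_seconds:
        end = min(end + step_seconds, total_seconds)
        yield (0.0, end)


# A 40-second file with the default 15 s step yields three growing windows:
print(list(accumulated_windows(40.0)))  # [(0.0, 15.0), (0.0, 30.0), (0.0, 40.0)]
```

Each yielded window corresponds to one `generate` call; `rollback_tokens` then trims the tail of the previously decoded text before it is re-fed, so the overlap region is re-decoded cleanly.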
The `processor.load_audio` and `model.transcribe` methods accept the following parameters:

```python
# Load and resample any audio file to a mono float32 waveform
audio = processor.load_audio(
    path="audio.wav",
    target_sr=16_000,     # target sample rate (default: 16 000)
)

# Transcribe with accumulated-audio continuation decoding
text = model.transcribe(
    audio,
    processor,
    language="Danish",    # language tag in the prompt (default: "Danish")
    target_sr=16_000,     # must match load_audio target_sr (default: 16 000)
    step_seconds=15.0,    # seconds of new audio per continuation step (default: 15.0)
    rollback_tokens=8,    # token rollback for prefix overlap (default: 8)
    max_new_tokens=2048,  # generation budget per step (default: 2048)
)
```

## 🚀 Fast Inference with vLLM 🚀

```bash
pip install -U "qwen-asr[vllm]"
```

```bash
MAX_JOBS=4 pip install -U flash-attn --no-build-isolation
```

```python
import librosa
from qwen_asr import Qwen3ASRModel


def transcribe_single_file(audio_path, model_id="capacit-ai/saga"):
    model = Qwen3ASRModel.LLM(model=model_id, gpu_memory_utilization=0.92)
    audio, _ = librosa.load(audio_path, sr=16000)
    output = model.transcribe(audio=[(audio, 16000)], language=["Danish"])
    return output[0].text


if __name__ == "__main__":
    print(transcribe_single_file("audio.wav"))
```

## Evaluation

All of the finetuned models have been trained on CoRal data, as it is the most comprehensive and highest-quality open-source Danish ASR dataset family, so we evaluated them on CoRal. All Qwen-based models were evaluated with one shared script, and all Whisper-based models were evaluated with another shared script.

Upcoming: more unseen datasets and performance metrics on the way!
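For reference, WER and CER are both ratios of Levenshtein edit distance to reference length, computed over words and characters respectively. This is a minimal, self-contained sketch of the metrics, not the actual evaluation script used for the tables:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match)
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits / reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# One substituted word out of five reference words:
print(wer("det er en god dag", "det er en go dag"))  # 0.2
```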
| Dataset | Model | Samples | CER | WER |
| --- | --- | --- | --- | --- |
| CoRal read_aloud (test) | capacit-ai/saga | 8000 | 6.7% | 15.6% |
| CoRal read_aloud (test) | Qwen/Qwen3-ASR-1.7B | 8000 | 15.0% | 33.6% |
| CoRal read_aloud (test) | pluttodk/milo-asr | 8000 | 7.6% | 16.8% |
| CoRal read_aloud (test) | openai/whisper-large-v3 | 8000 | 10.3% | 25.2% |
| CoRal read_aloud (test) | CoRal-project/roest-v3-whisper-1.5b | 8000 | 4.7% | 11.6% |
| CoRal read_aloud (test) | syvai/hviske-v3-conversation | 8000 | 7.7% | 18.2% |

![plot](./cer_by_model.png)
![plot](./wer_by_model.png)

| Model | RTFx |
| --- | --- |
| capacit-ai/saga | 470 |
| Qwen/Qwen3-ASR-1.7B | 585 |
| openai/whisper-large-v3 | 50 |

![plot](./rtfx_by_model.png)

- RTFx figures were measured with vLLM and FlashAttention enabled for the Qwen backends; we successfully ran pluttodk/milo-asr with a vLLM backend and saw no significant drop in WER or CER.
- All evaluation metrics were produced on a single RTX 5090 instance.

## Acknowledgements

Credit to the talented Qwen team for making efficient, accurate models and open-sourcing them. And credit to the [Danish Innovation Fund](https://innovationsfonden.dk/da), the [Alexandra Institute](https://alexandra.dk), and their partners for the CoRal datasets.

- Datasets: [`CoRal-project`](https://huggingface.co/CoRal-project/datasets)
- Base model: [`Qwen/Qwen3-ASR-1.7B`](https://huggingface.co/Qwen/Qwen3-ASR-1.7B)
- Original project documentation: [Qwen3-ASR](https://github.com/QwenLM/Qwen3-ASR)

## Creator

This model was finetuned, and the model card authored, by [`Andreas Eefsen`](https://www.linkedin.com/in/andreas-e-444780221/), [`Capacit A/S Copenhagen`](https://www.linkedin.com/company/capacit-as).