---
license: apache-2.0
library_name: transformers
pipeline_tag: automatic-speech-recognition
base_model: Qwen/Qwen3-ASR-1.7B
language:
- da
tags:
- audio
- speech
- automatic-speech-recognition
- danish
- qwen3-asr
- trust-remote-code
- custom-code
---

# Capacit-ai/saga

`Capacit-ai/saga` is a state-of-the-art Danish automatic speech recognition model based on `Qwen/Qwen3-ASR-1.7B`.

The model is optimized for fast inference through aggressive input downsampling and variable chunk sizing. Unlike competing models, this enables Saga to achieve state-of-the-art accuracy while being significantly more efficient.

The model was trained on an NVIDIA B200 using the [`CoRal dataset`](https://huggingface.co/CoRal-project/datasets) family, courtesy of the [`Danish Innovation Fund`](https://innovationsfonden.dk/da) and the [`Alexandra Institute`](https://alexandra.dk).

This repository is intended for Danish transcription only. The underlying Qwen3-ASR base model is multilingual, but this finetuned checkpoint is Danish-focused and has unlearned most of the base model's multilingual capabilities.

## Model Summary

- Base model: `Qwen/Qwen3-ASR-1.7B`
- Task: automatic speech recognition
- Primary language: Danish
- Input audio: 16 kHz mono waveform
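
Most real-world audio is stereo and sampled at 44.1 or 48 kHz, so it must be downmixed and resampled to 16 kHz mono first. In practice you would use `librosa`, `torchaudio`, or `soundfile` for this; the pure-Python helpers below are only a hypothetical sketch of what that preprocessing involves.

```python
# Hypothetical sketch of preparing audio for the model's expected input
# (16 kHz mono). Real pipelines should use librosa/torchaudio/soundfile;
# this version just illustrates downmix + linear-interpolation resampling.

TARGET_SR = 16_000

def to_mono(channels):
    """Average an arbitrary number of channels into one mono track."""
    return [sum(frame) / len(frame) for frame in zip(*channels)]

def resample_linear(samples, orig_sr, target_sr=TARGET_SR):
    """Naive linear-interpolation resampler (fine for a sketch)."""
    if orig_sr == target_sr:
        return list(samples)
    n_out = int(len(samples) * target_sr / orig_sr)
    out = []
    for i in range(n_out):
        pos = i * orig_sr / target_sr   # fractional index in the source
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out
```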

## Quickstart

Install the packages:

```bash
pip install -U transformers soundfile torch qwen-asr
```

Then load the model with `transformers`:

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "capacit-ai/saga"
DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"
DTYPE = torch.bfloat16 if DEVICE.startswith("cuda") else torch.float32

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype=DTYPE,
)
model.to(DEVICE)
model.eval()

audio = processor.load_audio("audio.wav")
text = model.transcribe(audio, processor)
print(text)
```

## Long-Form Audio

The base Qwen3-ASR architecture supports long inputs, but the most stable long-form decoding in this project came from accumulated-audio continuation decoding rather than a single naive generate call. The `model.transcribe()` method already implements this strategy: it walks through the audio in `step_seconds` chunks, re-feeding the accumulated waveform together with the previously decoded text so the model keeps prior context. The `step_seconds`, `rollback_tokens`, and `max_new_tokens` parameters can be tuned for your use case.

The `processor.load_audio` and `model.transcribe` methods accept the following parameters:

```python
# Load and resample any audio file to a mono float32 waveform
audio = processor.load_audio(
    path="audio.wav",
    target_sr=16_000,     # target sample rate (default: 16 000)
)

# Transcribe with accumulated-audio continuation decoding
text = model.transcribe(
    audio,
    processor,
    language="Danish",    # language tag in the prompt (default: "Danish")
    target_sr=16_000,     # must match load_audio target_sr (default: 16 000)
    step_seconds=15.0,    # seconds of new audio per continuation step (default: 15.0)
    rollback_tokens=8,    # token rollback for prefix overlap (default: 8)
    max_new_tokens=2048,  # generation budget per step (default: 2048)
)
```
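
The continuation strategy itself fits in a few lines. The sketch below is only a conceptual illustration of the loop described above, not the actual implementation; `decode_step` is a hypothetical stand-in for one real `generate` call.

```python
# Conceptual sketch of accumulated-audio continuation decoding.
# `decode_step` is a hypothetical stand-in for one model.generate() call;
# here it just emits one "token" per second of audio not covered yet.

SR = 16_000  # samples per second

def decode_step(accumulated_audio, prefix_tokens):
    # Hypothetical decoder: continues from the given token prefix.
    seen_seconds = len(accumulated_audio) // SR
    return [f"tok{i}" for i in range(len(prefix_tokens), seen_seconds)]

def continuation_decode(waveform, step_seconds=15.0, rollback_tokens=8):
    """Walk through `waveform` in `step_seconds` chunks, re-feeding the
    accumulated audio plus previously decoded tokens each step, minus a
    small rollback so the model can re-commit near chunk boundaries."""
    step = int(step_seconds * SR)
    tokens = []
    for end in range(step, len(waveform) + step, step):
        accumulated = waveform[: min(end, len(waveform))]
        prefix = tokens[: max(0, len(tokens) - rollback_tokens)]
        tokens = prefix + decode_step(accumulated, prefix)
    return tokens
```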

## 🚀 Fast Inference with vLLM 🚀

Install the vLLM backend:

```bash
pip install -U qwen-asr[vllm]
```

Optionally install FlashAttention for faster attention kernels:

```bash
MAX_JOBS=4 pip install -U flash-attn --no-build-isolation
```

```python
import librosa
from qwen_asr import Qwen3ASRModel

def transcribe_single_file(audio_path, model_id="capacit-ai/saga"):
    model = Qwen3ASRModel.LLM(model=model_id, gpu_memory_utilization=0.92)
    audio, _ = librosa.load(audio_path, sr=16000)
    output = model.transcribe(audio=[(audio, 16000)], language=["Danish"])
    return output[0].text

if __name__ == "__main__":
    print(transcribe_single_file("audio.wav"))
```

## Evaluation

All of the finetuned models have been trained on CoRal data, as it is the most comprehensive and high-quality open-source Danish ASR dataset family, so we evaluated them on CoRal.
All Qwen-based models were evaluated with one shared script, and all Whisper-based models with another shared script.

Upcoming: More unseen datasets and performance metrics on the way!

| Dataset | Model | Samples | CER | WER |
| --- | --- | --- | --- | --- |
| CoRal read_aloud (test) | capacit-ai/saga | 8000 | 6.7% | 15.6% |
| CoRal read_aloud (test) | Qwen/Qwen3-ASR-1.7B | 8000 | 15.0% | 33.6% |
| CoRal read_aloud (test) | pluttodk/milo-asr | 8000 | 7.6% | 16.8% |
| CoRal read_aloud (test) | openai/whisper-large-v3 | 8000 | 10.3% | 25.2% |
| CoRal read_aloud (test) | CoRal-project/roest-v3-whisper-1.5b | 8000 | 4.7% | 11.6% |
| CoRal read_aloud (test) | syvai/hviske-v3-conversation | 8000 | 7.7% | 18.2% |
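
For reference, WER and CER are standard Levenshtein edit-distance metrics over words and characters respectively. A minimal sketch of the computation (the actual evaluation scripts may apply additional text normalization):

```python
# Sketch of the WER/CER computation: Levenshtein edit distance divided
# by reference length. Illustration only, not the exact evaluation script.

def edit_distance(ref, hyp):
    """Dynamic-programming Levenshtein distance over two sequences."""
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[len(hyp)]

def wer(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    return edit_distance(reference, hypothesis) / len(reference)
```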

| Model | RTFx |
| --- | --- |
| capacit-ai/saga | 470 |
| Qwen/Qwen3-ASR-1.7B | 585 |
| openai/whisper-large-v3 | 50 |

- RTFx figures are measured with vLLM and FlashAttention enabled for the Qwen backends; we successfully ran pluttodk/milo-asr with a vLLM backend and saw no significant drop in WER or CER.
- All evaluation metrics were collected on a single RTX 5090 instance.
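
RTFx here is the inverse real-time factor: seconds of audio transcribed per second of wall-clock time, so higher is better (RTFx 470 means an hour of audio in under eight seconds). A sketch of how such a figure is measured, where `transcribe_fn` is a hypothetical stand-in for the actual model call:

```python
import time

# Sketch of an RTFx measurement: audio duration divided by wall-clock
# transcription time. `transcribe_fn` is a hypothetical stand-in here.

def measure_rtfx(transcribe_fn, audio_seconds):
    start = time.perf_counter()
    transcribe_fn()  # run the (hypothetical) transcription once
    elapsed = time.perf_counter() - start
    return audio_seconds / elapsed
```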

## Acknowledgements

Credit to the talented Qwen team for making efficient, accurate models and open-sourcing them.

And credit to the [`Danish Innovation Fund`](https://innovationsfonden.dk/da), the [`Alexandra Institute`](https://alexandra.dk), and partners for the CoRal datasets.

- Datasets: [`CoRal-project`](https://huggingface.co/CoRal-project/datasets)
- Base model: [`Qwen/Qwen3-ASR-1.7B`](https://huggingface.co/Qwen/Qwen3-ASR-1.7B)
- Original project documentation: [Qwen3-ASR](https://github.com/QwenLM/Qwen3-ASR)

## Creator

This model was finetuned, and this model card authored, by [`Andreas Eefsen`](https://www.linkedin.com/in/andreas-e-444780221/), [`Capacit A/S Copenhagen`](https://www.linkedin.com/company/capacit-as).