---
license: apache-2.0
library_name: transformers
pipeline_tag: automatic-speech-recognition
base_model: Qwen/Qwen3-ASR-1.7B
language:
- da
tags:
- audio
- speech
- automatic-speech-recognition
- danish
- qwen3-asr
- trust-remote-code
- custom-code
---

# Capacit-ai/saga

`Capacit-ai/saga` is a state-of-the-art Danish automatic speech recognition model based on `Qwen/Qwen3-ASR-1.7B`. Unlike competing models, it is optimized for fast inference through aggressive input downsampling and variable chunk sizing, which lets Saga achieve state-of-the-art performance while being significantly more efficient.

The model was trained on an NVIDIA B200 using the [CoRal dataset](https://huggingface.co/CoRal-project/datasets) family, courtesy of the [Danish Innovation Fund](https://innovationsfonden.dk/da) and the [Alexandra Institute](https://alexandra.dk).

This repository is intended for Danish transcription only. The underlying Qwen3-ASR base model is multilingual, but this finetuned checkpoint is Danish-focused and has unlearned most of its multilingual capabilities.
## Model Summary

- Base model: `Qwen/Qwen3-ASR-1.7B`
- Task: automatic speech recognition
- Primary language: Danish
- Input audio: 16 kHz mono waveform

## Quickstart

Install the packages:

```bash
pip install -U transformers soundfile torch qwen-asr
```

Then load the model with `transformers`:

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "capacit-ai/saga"
DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"
DTYPE = torch.bfloat16 if DEVICE.startswith("cuda") else torch.float32

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=DTYPE,
)
model.to(DEVICE)
model.eval()

audio = processor.load_audio("audio.wav")
text = model.transcribe(audio, processor)
print(text)
```

## Long-Form Audio

The base Qwen3-ASR architecture supports long inputs, but the most stable long-form decoding in this project came from accumulated-audio continuation decoding rather than a single naive generate call. The `model.transcribe()` method already implements this strategy: it walks through the audio in `step_seconds` chunks, re-feeding the accumulated waveform together with previously decoded text so the model keeps prior context. The `step_seconds`, `rollback_tokens`, and `max_new_tokens` parameters can be tuned for your use case.
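To make the stepping logic concrete, here is a minimal, model-free sketch of how the accumulated-audio windows could be produced. The helper name `accumulated_windows` and the exact boundary handling are illustrative assumptions, not the shipped implementation:

```python
def accumulated_windows(total_seconds: float, step_seconds: float = 15.0):
    """Yield growing (0, end) windows, stepping forward by step_seconds.

    Each window spans ALL audio from the start up to `end`, mirroring the
    accumulated-audio continuation strategy: at every step the model re-reads
    the full waveform so far together with the text it has already decoded,
    so prior context is never lost.
    (Illustrative sketch only, not the shipped implementation.)
    """
    end = 0.0
    while end < total_seconds:
        end = min(end + step_seconds, total_seconds)
        yield (0.0, end)


# A 40-second file with the default 15 s step yields three growing windows:
print(list(accumulated_windows(40.0)))  # [(0.0, 15.0), (0.0, 30.0), (0.0, 40.0)]
```

Each yielded window corresponds to one `generate` call; `rollback_tokens` then trims the tail of the previously decoded text before it is re-fed, so the overlap region is re-decoded cleanly.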
The `processor.load_audio` and `model.transcribe` methods accept the following parameters:

```python
# Load and resample any audio file to a mono float32 waveform
audio = processor.load_audio(
    path="audio.wav",
    target_sr=16_000,     # target sample rate (default: 16 000)
)

# Transcribe with accumulated-audio continuation decoding
text = model.transcribe(
    audio,
    processor,
    language="Danish",    # language tag in the prompt (default: "Danish")
    target_sr=16_000,     # must match load_audio target_sr (default: 16 000)
    step_seconds=15.0,    # seconds of new audio per continuation step (default: 15.0)
    rollback_tokens=8,    # token rollback for prefix overlap (default: 8)
    max_new_tokens=2048,  # generation budget per step (default: 2048)
)
```

## 🚀 Fast Inference with vLLM 🚀

```bash
pip install -U "qwen-asr[vllm]"
```

```bash
MAX_JOBS=4 pip install -U flash-attn --no-build-isolation
```

```python
import librosa
from qwen_asr import Qwen3ASRModel


def transcribe_single_file(audio_path, model_id="capacit-ai/saga"):
    model = Qwen3ASRModel.LLM(model=model_id, gpu_memory_utilization=0.92)
    audio, _ = librosa.load(audio_path, sr=16000)
    output = model.transcribe(audio=[(audio, 16000)], language=["Danish"])
    return output[0].text


if __name__ == "__main__":
    print(transcribe_single_file("audio.wav"))
```

## Evaluation

All of the finetuned models have been trained on CoRal data, as it is the most comprehensive and highest-quality open-source Danish ASR dataset family, so we evaluated them on CoRal. All Qwen-based models were evaluated with one shared script, and all Whisper-based models were evaluated with another shared script.

Upcoming: more unseen datasets and performance metrics on the way!
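For reference, WER and CER are both ratios of Levenshtein edit distance to reference length, computed over words and characters respectively. This is a minimal, self-contained sketch of the metrics, not the actual evaluation script used for the tables:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match)
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits / reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# One substituted word out of five reference words:
print(wer("det er en god dag", "det er en go dag"))  # 0.2
```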
| Dataset | Model | Samples | CER | WER |
| --- | --- | --- | --- | --- |
| CoRal read_aloud (test) | capacit-ai/saga | 8000 | 6.7% | 15.6% |
| CoRal read_aloud (test) | Qwen/Qwen3-ASR-1.7B | 8000 | 15.0% | 33.6% |
| CoRal read_aloud (test) | pluttodk/milo-asr | 8000 | 7.6% | 16.8% |
| CoRal read_aloud (test) | openai/whisper-large-v3 | 8000 | 10.3% | 25.2% |
| CoRal read_aloud (test) | CoRal-project/roest-v3-whisper-1.5b | 8000 | 4.7% | 11.6% |
| CoRal read_aloud (test) | syvai/hviske-v3-conversation | 8000 | 7.7% | 18.2% |

![plot](./cer_by_model.png)
![plot](./wer_by_model.png)

| Model | RTFx |
| --- | --- |
| capacit-ai/saga | 470 |
| Qwen/Qwen3-ASR-1.7B | 585 |
| openai/whisper-large-v3 | 50 |

![plot](./rtfx_by_model.png)

- RTFx figures were measured with vLLM and FlashAttention enabled for the Qwen backends; we successfully ran pluttodk/milo-asr with a vLLM backend and saw no significant drop in WER or CER.
- All evaluation metrics were produced on a single RTX 5090 instance.

## Acknowledgements

Credit to the talented Qwen team for making efficient, accurate models and open-sourcing them. And credit to the [Danish Innovation Fund](https://innovationsfonden.dk/da), the [Alexandra Institute](https://alexandra.dk), and their partners for the CoRal datasets.

- Datasets: [`CoRal-project`](https://huggingface.co/CoRal-project/datasets)
- Base model: [`Qwen/Qwen3-ASR-1.7B`](https://huggingface.co/Qwen/Qwen3-ASR-1.7B)
- Original project documentation: [Qwen3-ASR](https://github.com/QwenLM/Qwen3-ASR)

## Creator

This model was finetuned, and the model card authored, by [`Andreas Eefsen`](https://www.linkedin.com/in/andreas-e-444780221/), [`Capacit A/S Copenhagen`](https://www.linkedin.com/company/capacit-as).