---
license: apache-2.0
library_name: transformers
pipeline_tag: automatic-speech-recognition
base_model: Qwen/Qwen3-ASR-1.7B
language:
- da
tags:
- audio
- speech
- automatic-speech-recognition
- danish
- qwen3-asr
- trust-remote-code
- custom-code
---

# Capacit-ai/saga

`Capacit-ai/saga` is a state-of-the-art Danish automatic speech recognition model based on `Qwen/Qwen3-ASR-1.7B`.

The model is optimized for fast inference through aggressive input downsampling and variable chunk sizing. Unlike competing models, this enables Saga to achieve state-of-the-art accuracy while being significantly more efficient.

The model was trained on an NVIDIA B200 using the [`CoRal dataset`](https://huggingface.co/CoRal-project/datasets) family, courtesy of the [`Danish Innovation Fund`](https://innovationsfonden.dk/da) and the [`Alexandra Institute`](https://alexandra.dk).

This repository is intended for Danish transcription only. The underlying Qwen3-ASR base model is multilingual, but this finetuned checkpoint is Danish-focused and has unlearned most of the base model's multilingual capabilities.

## Model Summary

- Base model: `Qwen/Qwen3-ASR-1.7B`
- Task: automatic speech recognition
- Primary language: Danish
- Input audio: 16 kHz mono waveform
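
Most real-world audio is stereo and sampled at 44.1 or 48 kHz, so it must be downmixed and resampled to 16 kHz mono first. In practice you would use `librosa`, `torchaudio`, or `soundfile` for this; the pure-Python helpers below are only a hypothetical sketch of what that preprocessing involves.

```python
# Hypothetical sketch of preparing audio for the model's expected input
# (16 kHz mono). Real pipelines should use librosa/torchaudio/soundfile;
# this version just illustrates downmix + linear-interpolation resampling.

TARGET_SR = 16_000

def to_mono(channels):
    """Average an arbitrary number of channels into one mono track."""
    return [sum(frame) / len(frame) for frame in zip(*channels)]

def resample_linear(samples, orig_sr, target_sr=TARGET_SR):
    """Naive linear-interpolation resampler (fine for a sketch)."""
    if orig_sr == target_sr:
        return list(samples)
    n_out = int(len(samples) * target_sr / orig_sr)
    out = []
    for i in range(n_out):
        pos = i * orig_sr / target_sr   # fractional index in the source
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out
```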

## Quickstart

Install the packages:

```bash
pip install -U transformers soundfile torch qwen-asr
```

Then load the model with `transformers`:

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "capacit-ai/saga"
DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"
DTYPE = torch.bfloat16 if DEVICE.startswith("cuda") else torch.float32

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype=DTYPE,
)
model.to(DEVICE)
model.eval()

audio = processor.load_audio("audio.wav")
text = model.transcribe(audio, processor)
print(text)
```

## Long-Form Audio

The base Qwen3-ASR architecture supports long inputs, but the most stable long-form decoding in this project came from accumulated-audio continuation decoding rather than a single naive generate call. The `model.transcribe()` method already implements this strategy: it walks through the audio in `step_seconds` chunks, re-feeding the accumulated waveform together with the previously decoded text so the model keeps prior context. The `step_seconds`, `rollback_tokens`, and `max_new_tokens` parameters can be tuned for your use case.

The `processor.load_audio` and `model.transcribe` methods accept the following parameters:

```python
# Load and resample any audio file to a mono float32 waveform
audio = processor.load_audio(
    path="audio.wav",
    target_sr=16_000,     # target sample rate (default: 16 000)
)

# Transcribe with accumulated-audio continuation decoding
text = model.transcribe(
    audio,
    processor,
    language="Danish",    # language tag in the prompt (default: "Danish")
    target_sr=16_000,     # must match load_audio target_sr (default: 16 000)
    step_seconds=15.0,    # seconds of new audio per continuation step (default: 15.0)
    rollback_tokens=8,    # token rollback for prefix overlap (default: 8)
    max_new_tokens=2048,  # generation budget per step (default: 2048)
)
```
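
The continuation strategy itself fits in a few lines. The sketch below is only a conceptual illustration of the loop described above, not the actual implementation; `decode_step` is a hypothetical stand-in for one real `generate` call.

```python
# Conceptual sketch of accumulated-audio continuation decoding.
# `decode_step` is a hypothetical stand-in for one model.generate() call;
# here it just emits one "token" per second of audio not covered yet.

SR = 16_000  # samples per second

def decode_step(accumulated_audio, prefix_tokens):
    # Hypothetical decoder: continues from the given token prefix.
    seen_seconds = len(accumulated_audio) // SR
    return [f"tok{i}" for i in range(len(prefix_tokens), seen_seconds)]

def continuation_decode(waveform, step_seconds=15.0, rollback_tokens=8):
    """Walk through `waveform` in `step_seconds` chunks, re-feeding the
    accumulated audio plus previously decoded tokens each step, minus a
    small rollback so the model can re-commit near chunk boundaries."""
    step = int(step_seconds * SR)
    tokens = []
    for end in range(step, len(waveform) + step, step):
        accumulated = waveform[: min(end, len(waveform))]
        prefix = tokens[: max(0, len(tokens) - rollback_tokens)]
        tokens = prefix + decode_step(accumulated, prefix)
    return tokens
```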

## 🚀 Fast Inference with vLLM 🚀

Install the vLLM backend:

```bash
pip install -U qwen-asr[vllm]
```

Optionally install FlashAttention for faster attention kernels:

```bash
MAX_JOBS=4 pip install -U flash-attn --no-build-isolation
```

```python
import librosa
from qwen_asr import Qwen3ASRModel

def transcribe_single_file(audio_path, model_id="capacit-ai/saga"):
    model = Qwen3ASRModel.LLM(model=model_id, gpu_memory_utilization=0.92)
    audio, _ = librosa.load(audio_path, sr=16000)
    output = model.transcribe(audio=[(audio, 16000)], language=["Danish"])
    return output[0].text

if __name__ == "__main__":
    print(transcribe_single_file("audio.wav"))
```

## Evaluation

All of the finetuned models have been trained on CoRal data, as it is the most comprehensive and high-quality open-source Danish ASR dataset family, so we evaluated them on CoRal.
All Qwen-based models were evaluated with one shared script, and all Whisper-based models with another shared script.

Upcoming: More unseen datasets and performance metrics on the way!

| Dataset | Model | Samples | CER | WER |
| --- | --- | --- | --- | --- |
| CoRal read_aloud (test) | capacit-ai/saga | 8000 | 6.7% | 15.6% |
| CoRal read_aloud (test) | Qwen/Qwen3-ASR-1.7B | 8000 | 15.0% | 33.6% |
| CoRal read_aloud (test) | pluttodk/milo-asr | 8000 | 7.6% | 16.8% |
| CoRal read_aloud (test) | openai/whisper-large-v3 | 8000 | 10.3% | 25.2% |
| CoRal read_aloud (test) | CoRal-project/roest-v3-whisper-1.5b | 8000 | 4.7% | 11.6% |
| CoRal read_aloud (test) | syvai/hviske-v3-conversation | 8000 | 7.7% | 18.2% |
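
For reference, WER and CER are standard Levenshtein edit-distance metrics over words and characters respectively. A minimal sketch of the computation (the actual evaluation scripts may apply additional text normalization):

```python
# Sketch of the WER/CER computation: Levenshtein edit distance divided
# by reference length. Illustration only, not the exact evaluation script.

def edit_distance(ref, hyp):
    """Dynamic-programming Levenshtein distance over two sequences."""
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[len(hyp)]

def wer(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    return edit_distance(reference, hypothesis) / len(reference)
```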

| Model | RTFx |
| --- | --- |
| capacit-ai/saga | 470 |
| Qwen/Qwen3-ASR-1.7B | 585 |
| openai/whisper-large-v3 | 50 |

- RTFx figures are measured with vLLM and FlashAttention enabled for the Qwen backends; we successfully ran pluttodk/milo-asr with a vLLM backend and saw no significant drop in WER or CER.
- All evaluation metrics were collected on a single RTX 5090 instance.
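
RTFx here is the inverse real-time factor: seconds of audio transcribed per second of wall-clock time, so higher is better (RTFx 470 means an hour of audio in under eight seconds). A sketch of how such a figure is measured, where `transcribe_fn` is a hypothetical stand-in for the actual model call:

```python
import time

# Sketch of an RTFx measurement: audio duration divided by wall-clock
# transcription time. `transcribe_fn` is a hypothetical stand-in here.

def measure_rtfx(transcribe_fn, audio_seconds):
    start = time.perf_counter()
    transcribe_fn()  # run the (hypothetical) transcription once
    elapsed = time.perf_counter() - start
    return audio_seconds / elapsed
```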

## Acknowledgements

Credit to the talented Qwen team for making efficient, accurate models and open-sourcing them.

And credit to the [`Danish Innovation Fund`](https://innovationsfonden.dk/da), the [`Alexandra Institute`](https://alexandra.dk), and partners for the CoRal datasets.

- Datasets: [`CoRal-project`](https://huggingface.co/CoRal-project/datasets)
- Base model: [`Qwen/Qwen3-ASR-1.7B`](https://huggingface.co/Qwen/Qwen3-ASR-1.7B)
- Original project documentation: [Qwen3-ASR](https://github.com/QwenLM/Qwen3-ASR)

## Creator

This model was finetuned, and this model card authored, by [`Andreas Eefsen`](https://www.linkedin.com/in/andreas-e-444780221/), [`Capacit A/S Copenhagen`](https://www.linkedin.com/company/capacit-as).