---
tags:
- speech-to-text
- peft
- lora
- danish
- fine-tuned
- voxtral
- whisper
language:
- da
metrics:
- wer
- cer
base_model:
- mistralai/Voxtral-Small-24B-2507
datasets:
- CoRal-project/coral
model-index:
- name: danstral-v1
results:
- task:
type: automatic-speech-recognition
name: Automatic Speech Recognition
dataset:
name: CoRal read-aloud
type: alexandrainst/coral
split: test
args: read_aloud
metrics:
- type: cer
        value: 4.2
name: CER
- type: wer
        value: 9.7
name: WER
---
# Voxtral-Small-24B LoRA Fine-tuned on CoRaL
**Danstral** is a state-of-the-art 24B-parameter model for Danish automatic speech recognition (ASR). It combines the decoder and audio adapter of [**Voxtral-Small-24B-2507**](https://huggingface.co/mistralai/Voxtral-Small-24B-2507) with the audio encoder from [**roest-whisper-large-v1**](https://huggingface.co/CoRal-project/roest-whisper-large-v1). The decoder and audio adapter were fine-tuned with LoRA for 2 epochs (40 hours) on the Danish [CoRaL dataset](https://huggingface.co/datasets/CoRal-project/coral), using three NVIDIA L40 GPUs. While it achieves state-of-the-art performance on CoRaL, it is a massive model and likely overkill compared to Whisper-based alternatives.
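For reference, a LoRA fine-tune of this kind can be set up with PEFT roughly as follows. This is a minimal sketch: the rank, alpha, and target modules are illustrative assumptions, not the exact configuration used (see the training script linked under "How to Use"):

```python
# Minimal sketch of a LoRA setup with PEFT. The hyperparameters below are
# illustrative assumptions, not the exact configuration used for Danstral.
import torch
from peft import LoraConfig, get_peft_model
from transformers import VoxtralForConditionalGeneration

model = VoxtralForConditionalGeneration.from_pretrained(
    "mistralai/Voxtral-Small-24B-2507", torch_dtype=torch.bfloat16, device_map="auto"
)

lora_config = LoraConfig(
    r=16,                    # assumed rank
    lora_alpha=32,           # assumed scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # the adapter is a tiny fraction of the 24B base
```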
---
## Evaluation Results
| Model | Number of parameters | [CoRaL](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRaL](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) WER |
|:---|---:|---:|---:|
| [hinge/danstral-v1](https://huggingface.co/hinge/danstral-v1) | 24B | **4.2% ± 0.2%** | **9.7% ± 0.3%** |
| [Alvenir/coral-1-whisper-large](https://huggingface.co/Alvenir/coral-1-whisper-large) | 1.540B | 4.3% ± 0.2% | 10.4% ± 0.3% |
| [nvidia/parakeet-rnnt-110m-da-dk](https://huggingface.co/nvidia/parakeet-rnnt-110m-da-dk) | 0.110B | - | 10.7% |
| [alexandrainst/roest-315m](https://huggingface.co/alexandrainst/roest-315m) | 0.315B | 6.6% ± 0.2% | 17.0% ± 0.4% |
| [syvai/hviske-v2](https://huggingface.co/syvai/hviske-v2) | 1.540B | 4.7% ± 0.07% | 11.8% ± 0.3% |
| [openai/whisper-large-v3](https://hf.co/openai/whisper-large-v3) | 1.540B | 11.4% ± 0.3% | 28.3% ± 0.6% |
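For context, the WER and CER figures above can be computed with the Hugging Face `evaluate` library. A sketch; the text normalization applied for the official CoRaL scores may differ:

```python
# Sketch: computing WER/CER with the `evaluate` library (backed by `jiwer`).
# The normalization used for the official CoRaL scores may differ.
import evaluate

wer = evaluate.load("wer")
cer = evaluate.load("cer")

predictions = ["det er en test"]  # model transcriptions
references = ["det er en test"]   # ground-truth transcripts

print("WER:", wer.compute(predictions=predictions, references=references))
print("CER:", cer.compute(predictions=predictions, references=references))
```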
---
## Limitations
- Danstral-v1 is huge: roughly 16x the size of **coral-1-whisper-large**, for only modest performance improvements. The LoRA adapter itself, however, is only 25 million parameters (see the sketch below this list for how to verify this).
- Danstral-v1 is a fine-tuned version of **voxtral-small-24b**, whose language-model decoder is derived from **mistral-small-24b**. Mistral does not disclose its training datasets, but Danish Wikipedia articles were likely included. Since the CoRaL test split also contains read-aloud samples from Danish Wikipedia, there is a risk of data leakage, which could inflate the test scores.
- The model was fine-tuned solely on the CoRaL v1 dataset, so performance may deteriorate for other data sources.
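To see how small the adapter is relative to the base model, the parameter counts can be compared directly. A sketch, assuming `model` is the `PeftModel` constructed in the usage example under "How to Use":

```python
# Sketch: counting LoRA adapter parameters against the full 24B model.
# Assumes `model` is the PeftModel built in the "How to Use" section;
# LoRA weights carry "lora_" in their parameter names.
adapter = sum(p.numel() for n, p in model.named_parameters() if "lora_" in n)
total = sum(p.numel() for p in model.parameters())
print(f"Adapter: {adapter / 1e6:.0f}M of {total / 1e9:.1f}B total parameters")
```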
---
## Future Work and Ideas
- **Further optimization.** The state-of-the-art performance was achieved with a 25M-parameter LoRA adapter after only a few experiments; further gains are likely available from tweaking the LoRA configuration or from a full-parameter fine-tune.
- **Knowledge distillation.** Danstral-v1 can serve as a teacher for training smaller models, e.g. by pseudo-labeling unlabeled Danish audio; see the sketch below.
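A hedged sketch of the pseudo-labeling step for distillation. `transcribe` stands in for the inference loop from the "How to Use" section, and `unlabeled_audio_paths` is a hypothetical list of Danish audio files; neither exists in this repository:

```python
# Sketch: generating pseudo-labels with Danstral-v1 as the teacher.
# `transcribe` and `unlabeled_audio_paths` are hypothetical placeholders.
pseudo_labels = []
for path in unlabeled_audio_paths:
    text = transcribe(path)  # teacher transcription, as in "How to Use"
    pseudo_labels.append({"audio": path, "text": text})
# A smaller student (e.g. a Whisper variant) can then be fine-tuned on
# these pairs with a standard seq2seq training loop.
```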
---
## How to Use
See [https://github.com/ChristianHinge/danstral](https://github.com/ChristianHinge/danstral) for the training script.
```python
import torch
from datasets import Audio, load_dataset
from peft import PeftModel
from transformers import AutoProcessor, VoxtralForConditionalGeneration, WhisperForConditionalGeneration

repo_id = "mistralai/Voxtral-Small-24B-2507"

# Load the base Voxtral model and its processor
processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",
)

# Swap in the Danish audio encoder from roest-whisper-large-v1
whisper_model = WhisperForConditionalGeneration.from_pretrained(
    "CoRal-project/roest-whisper-large-v1",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
model.audio_tower.load_state_dict(whisper_model.model.encoder.state_dict())

# Load the LoRA adapters on top of the patched base model
model = PeftModel.from_pretrained(model, "hinge/danstral-v1")

# Transcribe a few test samples from CoRaL
coral = load_dataset("CoRal-project/coral", "read_aloud")
coral = coral.cast_column("audio", Audio(sampling_rate=16000))

for i in range(10):
    sample = coral["test"][i]
    inputs = processor.apply_transcription_request(
        language="da", audio=sample["audio"]["array"], format=["WAV"], model_id=repo_id
    )
    inputs = inputs.to("cuda:0", dtype=torch.bfloat16)

    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    # Decode only the newly generated tokens (strip the prompt)
    decoded = processor.batch_decode(
        outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
    )
    print(f"Ground Truth: {sample['text']}")
    print(f"Prediction: {decoded[0]}")
    print("-" * 40)
```
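If you plan to run many transcriptions, the LoRA weights can be merged into the base model to remove the adapter overhead. This uses the standard PEFT API:

```python
# Merge the LoRA weights into the base weights for faster inference.
# Afterwards, `model` behaves like a plain VoxtralForConditionalGeneration.
model = model.merge_and_unload()
```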
## Shoutouts
- Viktor Stenby Johansson and Rasmus Asgaard for the ASR hackathon and ideation
- The CoRal project and Alexandra Institute for curating Danish datasets and leading the effort in Danish NLP