---
tags:
- speech-to-text
- peft
- lora
- danish
- fine-tuned
- voxtral
- whisper
language:
- da
metrics:
- wer
- cer
base_model:
- mistralai/Voxtral-Small-24B-2507
datasets:
- CoRal-project/coral
model-index:
- name: danstral-v1
results:
- task:
type: automatic-speech-recognition
name: Automatic Speech Recognition
dataset:
name: CoRal read-aloud
type: alexandrainst/coral
split: test
args: read_aloud
metrics:
- type: cer
        value: 4.2
name: CER
- type: wer
        value: 9.7
name: WER
---
# Voxtral-Small-24B LoRA Fine-tuned on CoRaL
**Danstral** is a state-of-the-art 24B-parameter model for Danish automatic speech recognition (ASR). It combines the decoder and audio adapter of [**Voxtral-Small-24B-2507**](https://huggingface.co/mistralai/Voxtral-Small-24B-2507) with the audio encoder from [**roest-whisper-large-v1**](https://huggingface.co/CoRal-project/roest-whisper-large-v1). The decoder and audio adapter were fine-tuned with LoRA for 2 epochs (40 hours) on the Danish [CoRaL dataset](https://huggingface.co/datasets/CoRal-project/coral), using three NVIDIA L40 GPUs. While it achieves state-of-the-art performance on CoRaL, it is a massive model and likely overkill compared to Whisper-based alternatives.
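For reference, a LoRA fine-tune of this kind can be set up with PEFT roughly as follows. This is a minimal sketch: the rank, alpha, and target modules are illustrative assumptions, not the exact configuration used (see the training script linked under "How to Use"):

```python
# Minimal sketch of a LoRA setup with PEFT. The hyperparameters below are
# illustrative assumptions, not the exact configuration used for Danstral.
import torch
from peft import LoraConfig, get_peft_model
from transformers import VoxtralForConditionalGeneration

model = VoxtralForConditionalGeneration.from_pretrained(
    "mistralai/Voxtral-Small-24B-2507", torch_dtype=torch.bfloat16, device_map="auto"
)

lora_config = LoraConfig(
    r=16,                    # assumed rank
    lora_alpha=32,           # assumed scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # the adapter is a tiny fraction of the 24B base
```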
---
## Evaluation Results
| Model | Number of parameters | [CoRaL](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRaL](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) WER |
|:---|---:|---:|---:|
| [hinge/danstral-v1](https://huggingface.co/hinge/danstral-v1) | 24B | **4.2% ± 0.2%** | **9.7% ± 0.3%** |
| [Alvenir/coral-1-whisper-large](https://huggingface.co/Alvenir/coral-1-whisper-large) | 1.540B | 4.3% ± 0.2% | 10.4% ± 0.3% |
| [nvidia/parakeet-rnnt-110m-da-dk](https://huggingface.co/nvidia/parakeet-rnnt-110m-da-dk) | 0.110B | - | 10.7% |
| [alexandrainst/roest-315m](https://huggingface.co/alexandrainst/roest-315m) | 0.315B | 6.6% ± 0.2% | 17.0% ± 0.4% |
| [syvai/hviske-v2](https://huggingface.co/syvai/hviske-v2) | 1.540B | 4.7% ± 0.07% | 11.8% ± 0.3% |
| [openai/whisper-large-v3](https://hf.co/openai/whisper-large-v3) | 1.540B | 11.4% ± 0.3% | 28.3% ± 0.6% |
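For context, the WER and CER figures above can be computed with the Hugging Face `evaluate` library. A sketch; the text normalization applied for the official CoRaL scores may differ:

```python
# Sketch: computing WER/CER with the `evaluate` library (backed by `jiwer`).
# The normalization used for the official CoRaL scores may differ.
import evaluate

wer = evaluate.load("wer")
cer = evaluate.load("cer")

predictions = ["det er en test"]  # model transcriptions
references = ["det er en test"]   # ground-truth transcripts

print("WER:", wer.compute(predictions=predictions, references=references))
print("CER:", cer.compute(predictions=predictions, references=references))
```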
---
## Limitations
- Danstral-v1 is huge: roughly 16x the size of **coral-1-whisper-large**, for only modest performance improvements. The LoRA adapter itself, however, is only 25 million parameters (see the sketch below this list for how to verify this).
- Danstral-v1 is a fine-tuned version of **voxtral-small-24b**, whose language-model decoder is derived from **mistral-small-24b**. Mistral does not disclose its training datasets, but Danish Wikipedia articles were likely included. Since the CoRaL test split also contains read-aloud samples from Danish Wikipedia, there is a risk of data leakage, which could inflate the test scores.
- The model was fine-tuned solely on the CoRaL v1 dataset, so performance may deteriorate for other data sources.
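To see how small the adapter is relative to the base model, the parameter counts can be compared directly. A sketch, assuming `model` is the `PeftModel` constructed in the usage example under "How to Use":

```python
# Sketch: counting LoRA adapter parameters against the full 24B model.
# Assumes `model` is the PeftModel built in the "How to Use" section;
# LoRA weights carry "lora_" in their parameter names.
adapter = sum(p.numel() for n, p in model.named_parameters() if "lora_" in n)
total = sum(p.numel() for p in model.parameters())
print(f"Adapter: {adapter / 1e6:.0f}M of {total / 1e9:.1f}B total parameters")
```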
---
## Future Work and Ideas
- **Further optimization.** The state-of-the-art performance was achieved with a 25M-parameter LoRA adapter after only a few experiments; further gains are likely available from tweaking the LoRA configuration or from a full-parameter fine-tune.
- **Knowledge distillation.** Danstral-v1 can serve as a teacher for training smaller models, e.g. by pseudo-labeling unlabeled Danish audio; see the sketch below.
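A hedged sketch of the pseudo-labeling step for distillation. `transcribe` stands in for the inference loop from the "How to Use" section, and `unlabeled_audio_paths` is a hypothetical list of Danish audio files; neither exists in this repository:

```python
# Sketch: generating pseudo-labels with Danstral-v1 as the teacher.
# `transcribe` and `unlabeled_audio_paths` are hypothetical placeholders.
pseudo_labels = []
for path in unlabeled_audio_paths:
    text = transcribe(path)  # teacher transcription, as in "How to Use"
    pseudo_labels.append({"audio": path, "text": text})
# A smaller student (e.g. a Whisper variant) can then be fine-tuned on
# these pairs with a standard seq2seq training loop.
```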
---
## How to Use
See [https://github.com/ChristianHinge/danstral](https://github.com/ChristianHinge/danstral) for the training script.
```python
import torch
from datasets import Audio, load_dataset
from peft import PeftModel
from transformers import AutoProcessor, VoxtralForConditionalGeneration, WhisperForConditionalGeneration

repo_id = "mistralai/Voxtral-Small-24B-2507"

# Load the base Voxtral model and its processor
processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",
)

# Swap in the Danish audio encoder from roest-whisper-large-v1
whisper_model = WhisperForConditionalGeneration.from_pretrained(
    "CoRal-project/roest-whisper-large-v1",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
model.audio_tower.load_state_dict(whisper_model.model.encoder.state_dict())

# Load the LoRA adapters on top of the patched base model
model = PeftModel.from_pretrained(model, "hinge/danstral-v1")

# Transcribe a few test samples from CoRaL
coral = load_dataset("CoRal-project/coral", "read_aloud")
coral = coral.cast_column("audio", Audio(sampling_rate=16000))

for i in range(10):
    sample = coral["test"][i]
    inputs = processor.apply_transcription_request(
        language="da", audio=sample["audio"]["array"], format=["WAV"], model_id=repo_id
    )
    inputs = inputs.to("cuda:0", dtype=torch.bfloat16)

    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    # Decode only the newly generated tokens (strip the prompt)
    decoded = processor.batch_decode(
        outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
    )
    print(f"Ground Truth: {sample['text']}")
    print(f"Prediction: {decoded[0]}")
    print("-" * 40)
```
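If you plan to run many transcriptions, the LoRA weights can be merged into the base model to remove the adapter overhead. This uses the standard PEFT API:

```python
# Merge the LoRA weights into the base weights for faster inference.
# Afterwards, `model` behaves like a plain VoxtralForConditionalGeneration.
model = model.merge_and_unload()
```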
## Shoutouts
- Viktor Stenby Johansson and Rasmus Asgaard for the ASR hackathon and ideation
- The CoRal project and Alexandra Institute for curating Danish datasets and leading the effort in Danish NLP