---
language:
- en
- zh
license: other
library_name: transformers
base_model: Qwen/Qwen2-Audio-7B-Instruct
tags:
- audio
- speech-recognition
- code-switching
- dpo
- qwen2-audio
datasets:
- custom
metrics:
- mer
pipeline_tag: automatic-speech-recognition
---

# Qwen2-Audio-7B-DPO-CodeSwitch

A fine-tuned version of [Qwen/Qwen2-Audio-7B-Instruct](https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct) trained with Direct Preference Optimization (DPO) on code-switching speech transcription data.

## Evaluation Results

All scores are MER (Mixed Error Rate); lower is better.

| Benchmark | Baseline | This Model | Relative Improvement |
|-----------|----------|------------|----------------------|
| **SEAME-SGE** | 0.9511 | **0.8552** | **-10.1%** |
| **SEAME-MAN** | 0.7289 | **0.5830** | **-20.0%** |
| **EMILIA** | 0.4470 | **0.4208** | **-5.9%** |
| **CS-Dialogue** | 0.3891 | **0.3140** | **-19.3%** |

### Benchmark Descriptions

- **SEAME-SGE**: SEAME dev set (Singapore English focused), 3,222 samples ([AudioLLMs/seame_dev_sge](https://huggingface.co/datasets/AudioLLMs/seame_dev_sge))
- **SEAME-MAN**: SEAME dev set (Mandarin focused), 2,610 samples ([AudioLLMs/seame_dev_man](https://huggingface.co/datasets/AudioLLMs/seame_dev_man))
- **EMILIA**: synthetic code-switching evaluation set, 1,000 samples
- **CS-Dialogue**: code-switching dialogue evaluation set, 359 samples

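MER scores code-switched hypotheses by treating each Mandarin character and each English word as one token before computing edit distance. A minimal sketch of the metric (the exact tokenization and normalization used by these benchmarks is an assumption):

```python
import re

def mixed_tokens(text: str) -> list[str]:
    """Split into single CJK characters and non-CJK words (assumed tokenization)."""
    return re.findall(r"[\u4e00-\u9fff]|[^\u4e00-\u9fff\s]+", text)

def mer(reference: str, hypothesis: str) -> float:
    """Mixed Error Rate: token-level Levenshtein distance / reference length."""
    ref, hyp = mixed_tokens(reference), mixed_tokens(hypothesis)
    # Standard dynamic-programming edit distance, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)

print(mer("我 like 吃 apple", "我 like 吃 apples"))  # → 0.25 (one substitution over four tokens)
```

Because substitutions, insertions, and deletions all count against the reference length, MER can exceed 1.0 on badly degraded hypotheses, as in the SEAME-SGE baseline score near 0.95.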
## Training Configuration

### Model Architecture

| Parameter | Value |
|-----------|-------|
| Base Model | [Qwen/Qwen2-Audio-7B-Instruct](https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct) |
| Training Type | Full fine-tuning |
| Total Parameters | ~7B |

### Training Hyperparameters

| Parameter | Value |
|-----------|-------|
| Training Method | DPO (Direct Preference Optimization) |
| DPO Beta | 0.5 |
| Learning Rate | 1e-6 |
| LR Scheduler | Cosine |
| Warmup Ratio | 0.1 |
| Batch Size (per device) | 1 |
| Gradient Accumulation Steps | 8 |
| Precision | BF16 |
| Max Sequence Length | 2048 |
| Weight Decay | 0.01 |
| Max Gradient Norm | 1.0 |
| FSDP | Full shard |
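With β = 0.5 as in the table above, DPO increases the log-probability margin between preferred and rejected transcriptions relative to a frozen reference model. A minimal sketch of the per-pair loss (illustrative only, not the actual training code):

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.5) -> float:
    """DPO loss for one preference pair: -log sigmoid(beta * margin)."""
    # Margin: how much more the policy prefers the chosen transcription
    # over the rejected one, compared with the reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Positive margin -> loss below log(2) ~ 0.693; zero margin -> exactly log(2).
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

The beta of 0.5 scales the margin inside the sigmoid: larger beta penalizes deviations from the reference model more sharply, while a smaller beta lets the policy drift further from it.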

## Usage

```python
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor
import torch
import librosa

# Load model and processor
model = Qwen2AudioForConditionalGeneration.from_pretrained(
    "myaccountfor/Qwen2-Audio-7B-DPO-CodeSwitch",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    "myaccountfor/Qwen2-Audio-7B-DPO-CodeSwitch",
    trust_remote_code=True,
)
model.eval()

# Load audio at the 16 kHz sampling rate the model expects
audio, sr = librosa.load("path/to/audio.wav", sr=16000)

# Build the chat-format prompt
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "path/to/audio.wav"},
        {"type": "text", "text": "Please transcribe this audio."},
    ]}
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

inputs = processor(text=text, audios=[audio], sampling_rate=sr, return_tensors="pt", padding=True)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Generate, then decode only the newly generated tokens (drop the prompt)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256)
outputs = outputs[:, inputs["input_ids"].shape[1]:]
transcription = processor.batch_decode(outputs, skip_special_tokens=True)[0]

print(transcription)
```

## Files

```
├── README.md                     # This file
├── config.json                   # Model configuration
├── model weights                 # Model weights
├── tokenizer files               # Tokenizer assets
└── eval_results/
    ├── baseline_seame_sge.json   # Baseline results on SEAME-SGE
    ├── baseline_seame_man.json   # Baseline results on SEAME-MAN
    ├── baseline_emilia.json      # Baseline results on EMILIA
    ├── baseline_cs_dialogue.json # Baseline results on CS-Dialogue
    ├── trained_seame_sge.json    # This model's results on SEAME-SGE
    ├── trained_seame_man.json    # This model's results on SEAME-MAN
    ├── trained_emilia.json       # This model's results on EMILIA
    └── trained_cs_dialogue.json  # This model's results on CS-Dialogue
```

## License

This model inherits the license of the base [Qwen2-Audio-7B-Instruct](https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct) model.