---
language:
- en
- zh
license: other
library_name: transformers
base_model: Qwen/Qwen2-Audio-7B-Instruct
tags:
- audio
- speech-recognition
- code-switching
- dpo
- qwen2-audio
datasets:
- custom
metrics:
- mer
pipeline_tag: automatic-speech-recognition
---
# Qwen2-Audio-7B-DPO-CodeSwitch
A version of [Qwen/Qwen2-Audio-7B-Instruct](https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct) fine-tuned with Direct Preference Optimization (DPO) on code-switching speech transcription data.
## Evaluation Results
All scores are Mixed Error Rate (MER); lower is better. The last column is the relative change in MER versus the baseline.

| Benchmark | Baseline | This Model | Relative Change |
|-----------|----------|------------|-----------------|
| **SEAME-SGE** | 0.9511 | **0.8552** | **-10.1%** |
| **SEAME-MAN** | 0.7289 | **0.5830** | **-20.0%** |
| **EMILIA** | 0.4470 | **0.4208** | **-5.9%** |
| **CS-Dialogue** | 0.3891 | **0.3140** | **-19.3%** |
### Benchmark Descriptions
- **SEAME-SGE**: SEAME dev set (Singapore English focused) - 3,222 samples ([AudioLLMs/seame_dev_sge](https://huggingface.co/datasets/AudioLLMs/seame_dev_sge))
- **SEAME-MAN**: SEAME dev set (Mandarin focused) - 2,610 samples ([AudioLLMs/seame_dev_man](https://huggingface.co/datasets/AudioLLMs/seame_dev_man))
- **EMILIA**: Synthetic code-switching evaluation set (1,000 samples)
- **CS-Dialogue**: Code-switching dialogue evaluation set (359 samples)
## Training Configuration
### Model Architecture
| Parameter | Value |
|-----------|-------|
| Base Model | [Qwen/Qwen2-Audio-7B-Instruct](https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct) |
| Training Type | Full Fine-Tuning |
| Total Parameters | ~7B |
### Training Hyperparameters
| Parameter | Value |
|-----------|-------|
| Training Method | DPO (Direct Preference Optimization) |
| DPO Beta | 0.5 |
| Learning Rate | 1e-6 |
| LR Scheduler | Cosine |
| Warmup Ratio | 0.1 |
| Batch Size (per device) | 1 |
| Gradient Accumulation Steps | 8 |
| Precision | BF16 |
| Max Sequence Length | 2048 |
| Weight Decay | 0.01 |
| Max Gradient Norm | 1.0 |
| FSDP | Full Shard |
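For reference, a minimal sketch of the DPO objective the table configures (beta = 0.5). Names and scalar shapes are illustrative, not the actual training code; each argument is the summed log-probability of the chosen or rejected transcription under the policy or the frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.5):
    """DPO loss for one preference pair: -log sigmoid(beta * margin)."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) written stably as softplus(-margin)
    return math.log1p(math.exp(-margin)) if margin > -30 else -margin
```

With equal rewards the loss is log 2; the larger the margin by which the policy prefers the chosen transcription relative to the reference model, the smaller the loss.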
## Usage
```python
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor
import torch
import librosa

# Load model and processor
model = Qwen2AudioForConditionalGeneration.from_pretrained(
    "myaccountfor/Qwen2-Audio-7B-DPO-CodeSwitch",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    "myaccountfor/Qwen2-Audio-7B-DPO-CodeSwitch",
    trust_remote_code=True,
)
model.eval()

# Load audio at the sampling rate the feature extractor expects (16 kHz)
audio, sr = librosa.load(
    "path/to/audio.wav", sr=processor.feature_extractor.sampling_rate
)

# Build the chat prompt
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "path/to/audio.wav"},
        {"type": "text", "text": "Please transcribe this audio."},
    ]}
]
text = processor.apply_chat_template(
    conversation, add_generation_prompt=True, tokenize=False
)
inputs = processor(text=text, audios=[audio], return_tensors="pt", padding=True)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Generate, then strip the prompt tokens so only the new text is decoded
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256)
outputs = outputs[:, inputs["input_ids"].shape[1]:]
transcription = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(transcription)
```
## Files
```
β”œβ”€β”€ README.md # This file
β”œβ”€β”€ config.json # Model configuration
β”œβ”€β”€ model weights # Model weights
β”œβ”€β”€ tokenizer files # Tokenizer assets
└── eval_results/
β”œβ”€β”€ baseline_seame_sge.json # Baseline results on SEAME-SGE
β”œβ”€β”€ baseline_seame_man.json # Baseline results on SEAME-MAN
β”œβ”€β”€ baseline_emilia.json # Baseline results on EMILIA
β”œβ”€β”€ baseline_cs_dialogue.json # Baseline results on CS-Dialogue
β”œβ”€β”€ trained_seame_sge.json # This model's results on SEAME-SGE
β”œβ”€β”€ trained_seame_man.json # This model's results on SEAME-MAN
β”œβ”€β”€ trained_emilia.json # This model's results on EMILIA
└── trained_cs_dialogue.json # This model's results on CS-Dialogue
```
## License
This model inherits the license of the base [Qwen2-Audio-7B-Instruct](https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct) model.