---
language:
- en
- zh
license: other
library_name: transformers
base_model: Qwen/Qwen2-Audio-7B-Instruct
tags:
- audio
- speech-recognition
- code-switching
- dpo
- qwen2-audio
datasets:
- custom
metrics:
- mer
pipeline_tag: automatic-speech-recognition
---

# Qwen2-Audio-7B-DPO-CodeSwitch

A fine-tuned version of [Qwen/Qwen2-Audio-7B-Instruct](https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct) trained with Direct Preference Optimization (DPO) on code-switching speech transcription data.

## Evaluation Results

All scores are MER (Mixed Error Rate); lower is better.

| Benchmark | Baseline | This Model | Relative Improvement |
|-----------|----------|------------|----------------------|
| **SEAME-SGE** | 0.9511 | **0.8552** | **-10.1%** |
| **SEAME-MAN** | 0.7289 | **0.5830** | **-20.0%** |
| **EMILIA** | 0.4470 | **0.4208** | **-5.9%** |
| **CS-Dialogue** | 0.3891 | **0.3140** | **-19.3%** |

### Benchmark Descriptions

- **SEAME-SGE**: SEAME dev set (Singapore English focused), 3,222 samples ([AudioLLMs/seame_dev_sge](https://huggingface.co/datasets/AudioLLMs/seame_dev_sge))
- **SEAME-MAN**: SEAME dev set (Mandarin focused), 2,610 samples ([AudioLLMs/seame_dev_man](https://huggingface.co/datasets/AudioLLMs/seame_dev_man))
- **EMILIA**: synthetic code-switching evaluation set, 1,000 samples
- **CS-Dialogue**: code-switching dialogue evaluation set, 359 samples

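MER scores code-switched hypotheses by treating each Mandarin character and each English word as one token before computing edit distance. A minimal sketch of the metric (the exact tokenization and normalization used by these benchmarks is an assumption):

```python
import re

def mixed_tokens(text: str) -> list[str]:
    """Split into single CJK characters and non-CJK words (assumed tokenization)."""
    return re.findall(r"[\u4e00-\u9fff]|[^\u4e00-\u9fff\s]+", text)

def mer(reference: str, hypothesis: str) -> float:
    """Mixed Error Rate: token-level Levenshtein distance / reference length."""
    ref, hyp = mixed_tokens(reference), mixed_tokens(hypothesis)
    # Standard dynamic-programming edit distance, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)

print(mer("我 like 吃 apple", "我 like 吃 apples"))  # → 0.25 (one substitution over four tokens)
```

Because substitutions, insertions, and deletions all count against the reference length, MER can exceed 1.0 on badly degraded hypotheses, as in the SEAME-SGE baseline score near 0.95.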
## Training Configuration

### Model Architecture

| Parameter | Value |
|-----------|-------|
| Base Model | [Qwen/Qwen2-Audio-7B-Instruct](https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct) |
| Training Type | Full fine-tuning |
| Total Parameters | ~7B |

### Training Hyperparameters

| Parameter | Value |
|-----------|-------|
| Training Method | DPO (Direct Preference Optimization) |
| DPO Beta | 0.5 |
| Learning Rate | 1e-6 |
| LR Scheduler | Cosine |
| Warmup Ratio | 0.1 |
| Batch Size (per device) | 1 |
| Gradient Accumulation Steps | 8 |
| Precision | BF16 |
| Max Sequence Length | 2048 |
| Weight Decay | 0.01 |
| Max Gradient Norm | 1.0 |
| FSDP | Full shard |
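With β = 0.5 as in the table above, DPO increases the log-probability margin between preferred and rejected transcriptions relative to a frozen reference model. A minimal sketch of the per-pair loss (illustrative only, not the actual training code):

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.5) -> float:
    """DPO loss for one preference pair: -log sigmoid(beta * margin)."""
    # Margin: how much more the policy prefers the chosen transcription
    # over the rejected one, compared with the reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Positive margin -> loss below log(2) ~ 0.693; zero margin -> exactly log(2).
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

The beta of 0.5 scales the margin inside the sigmoid: larger beta penalizes deviations from the reference model more sharply, while a smaller beta lets the policy drift further from it.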

## Usage

```python
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor
import torch
import librosa

# Load model and processor
model = Qwen2AudioForConditionalGeneration.from_pretrained(
    "myaccountfor/Qwen2-Audio-7B-DPO-CodeSwitch",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    "myaccountfor/Qwen2-Audio-7B-DPO-CodeSwitch",
    trust_remote_code=True,
)
model.eval()

# Load audio at the 16 kHz sampling rate the model expects
audio, sr = librosa.load("path/to/audio.wav", sr=16000)

# Build the chat-format prompt
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "path/to/audio.wav"},
        {"type": "text", "text": "Please transcribe this audio."},
    ]}
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

inputs = processor(text=text, audios=[audio], sampling_rate=sr, return_tensors="pt", padding=True)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Generate, then decode only the newly generated tokens (drop the prompt)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256)
outputs = outputs[:, inputs["input_ids"].shape[1]:]
transcription = processor.batch_decode(outputs, skip_special_tokens=True)[0]

print(transcription)
```

## Files

```
├── README.md                     # This file
├── config.json                   # Model configuration
├── model weights                 # Model weights
├── tokenizer files               # Tokenizer assets
└── eval_results/
    ├── baseline_seame_sge.json   # Baseline results on SEAME-SGE
    ├── baseline_seame_man.json   # Baseline results on SEAME-MAN
    ├── baseline_emilia.json      # Baseline results on EMILIA
    ├── baseline_cs_dialogue.json # Baseline results on CS-Dialogue
    ├── trained_seame_sge.json    # This model's results on SEAME-SGE
    ├── trained_seame_man.json    # This model's results on SEAME-MAN
    ├── trained_emilia.json       # This model's results on EMILIA
    └── trained_cs_dialogue.json  # This model's results on CS-Dialogue
```

## License

This model inherits the license of the base [Qwen2-Audio-7B-Instruct](https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct) model.