---
language:
- en
- zh
license: other
library_name: transformers
base_model: Qwen/Qwen2-Audio-7B-Instruct
tags:
- audio
- speech-recognition
- code-switching
- dpo
- qwen2-audio
datasets:
- custom
metrics:
- mer
pipeline_tag: automatic-speech-recognition
---

# Qwen2-Audio-7B-DPO-CodeSwitch

A fine-tuned version of [Qwen/Qwen2-Audio-7B-Instruct](https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct) trained with DPO (Direct Preference Optimization) on code-switching speech transcription data.

## Evaluation Results (Mixed Error Rate, lower is better)

MER scores Mandarin at the character level and English at the word level, the usual convention for Mandarin–English code-switching ASR. The Improvement column is the relative MER reduction over the baseline.

| Benchmark | Baseline | This Model | Improvement |
|-----------|----------|------------|-------------|
| **SEAME-SGE** | 0.9511 | **0.8552** | **-10.1%** |
| **SEAME-MAN** | 0.7289 | **0.5830** | **-20.0%** |
| **EMILIA** | 0.4470 | **0.4208** | **-5.9%** |
| **CS-Dialogue** | 0.3891 | **0.3140** | **-19.3%** |

### Benchmark Descriptions

- **SEAME-SGE**: SEAME dev set (Singapore English focused), 3,222 samples ([AudioLLMs/seame_dev_sge](https://huggingface.co/datasets/AudioLLMs/seame_dev_sge))
- **SEAME-MAN**: SEAME dev set (Mandarin focused), 2,610 samples ([AudioLLMs/seame_dev_man](https://huggingface.co/datasets/AudioLLMs/seame_dev_man))
- **EMILIA**: Synthetic code-switching evaluation set, 1,000 samples
- **CS-Dialogue**: Code-switching dialogue evaluation set, 359 samples

## Training Configuration

### Model Architecture

| Parameter | Value |
|-----------|-------|
| Base Model | [Qwen/Qwen2-Audio-7B-Instruct](https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct) |
| Training Type | Full Fine-Tuning |
| Total Parameters | ~7B |

### Training Hyperparameters

| Parameter | Value |
|-----------|-------|
| Training Method | DPO (Direct Preference Optimization) |
| DPO Beta | 0.5 |
| Learning Rate | 1e-6 |
| LR Scheduler | Cosine |
| Warmup Ratio | 0.1 |
| Batch Size (per device) | 1 |
| Gradient Accumulation Steps | 8 |
| Precision | BF16 |
| Max Sequence Length | 2048 |
| Weight Decay | 0.01 |
| Max Gradient Norm | 1.0 |
| FSDP | Full Shard |

## Usage

```python
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor
import torch
import librosa

# Load model and processor
model = Qwen2AudioForConditionalGeneration.from_pretrained(
    "myaccountfor/Qwen2-Audio-7B-DPO-CodeSwitch",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(
    "myaccountfor/Qwen2-Audio-7B-DPO-CodeSwitch",
    trust_remote_code=True
)
model.eval()

# Load audio at the 16 kHz rate the feature extractor expects
audio, sr = librosa.load("path/to/audio.wav", sr=16000)

# Build the chat-formatted prompt
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "path/to/audio.wav"},
        {"type": "text", "text": "Please transcribe this audio."}
    ]}
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
inputs = processor(text=text, audios=[audio], sampling_rate=sr, return_tensors="pt", padding=True)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Generate, then strip the prompt tokens before decoding so only
# the transcription is returned
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256)
outputs = outputs[:, inputs["input_ids"].shape[1]:]
transcription = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(transcription)
```
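### Scoring transcriptions (sketch)

The scoring script behind the `eval_results/` files listed below is not included in this repository. As a rough sketch only: MER for Mandarin–English code-switching is conventionally computed by tokenizing Mandarin into single characters and English into words, then taking the Levenshtein distance against the reference. The tokenizer regex, function names, and example strings here are illustrative assumptions, not the official implementation.

```python
import re

def tokenize_mixed(text: str) -> list[str]:
    # Mandarin as single characters, English as lowercased words,
    # digits as whole tokens: the usual MER token convention.
    return re.findall(r"[\u4e00-\u9fff]|[a-z']+|\d+", text.lower())

def edit_distance(ref: list[str], hyp: list[str]) -> int:
    # Standard Levenshtein distance over token sequences (one-row DP).
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,        # deletion
                dp[j - 1] + 1,    # insertion
                prev + (r != h),  # substitution (free if tokens match)
            )
    return dp[-1]

def mer(reference: str, hypothesis: str) -> float:
    ref, hyp = tokenize_mixed(reference), tokenize_mixed(hypothesis)
    return edit_distance(ref, hyp) / max(len(ref), 1)

# ≈ 0.222: 1 substitution (went→go) + 1 deletion (the) over 9 reference tokens
print(mer("我 昨天 went to the 图书馆", "我昨天 go to 图书馆"))
```

Scoring Mandarin character by character avoids penalizing word-segmentation differences, which is why MER rather than plain WER is reported for mixed-language speech.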
## Files

```
├── README.md                      # This file
├── config.json                    # Model configuration
├── model weights                  # Model weights
├── tokenizer files                # Tokenizer assets
└── eval_results/
    ├── baseline_seame_sge.json    # Baseline results on SEAME-SGE
    ├── baseline_seame_man.json    # Baseline results on SEAME-MAN
    ├── baseline_emilia.json       # Baseline results on EMILIA
    ├── baseline_cs_dialogue.json  # Baseline results on CS-Dialogue
    ├── trained_seame_sge.json     # This model's results on SEAME-SGE
    ├── trained_seame_man.json     # This model's results on SEAME-MAN
    ├── trained_emilia.json        # This model's results on EMILIA
    └── trained_cs_dialogue.json   # This model's results on CS-Dialogue
```

## License

This model inherits the license of the base [Qwen2-Audio-7B-Instruct](https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct) model.
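## Appendix: DPO Configuration Sketch

The training script is not distributed with this repository. Purely as an illustration, the hyperparameter table above maps onto a TRL-style `DPOConfig` roughly as follows; the output directory is hypothetical, and TRL's `DPOTrainer` does not handle audio-conditioned inputs out of the box, so the actual pipeline necessarily used custom data handling.

```python
# Illustrative mapping of the hyperparameter table onto TRL's DPOConfig.
# Config-only sketch: the real audio-conditioned DPO training code is
# custom and not shipped here.
from trl import DPOConfig

config = DPOConfig(
    output_dir="qwen2-audio-dpo-codeswitch",  # hypothetical path
    beta=0.5,                        # DPO Beta
    learning_rate=1e-6,              # Learning Rate
    lr_scheduler_type="cosine",      # LR Scheduler
    warmup_ratio=0.1,                # Warmup Ratio
    per_device_train_batch_size=1,   # Batch Size (per device)
    gradient_accumulation_steps=8,   # Gradient Accumulation Steps
    bf16=True,                       # Precision
    max_length=2048,                 # Max Sequence Length
    weight_decay=0.01,               # Weight Decay
    max_grad_norm=1.0,               # Max Gradient Norm
    fsdp="full_shard",               # FSDP
)
```

With gradient accumulation of 8 at a per-device batch size of 1, the effective batch size is 8 × the number of devices.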