@@ -3,14 +3,13 @@ language:
   - en
   - zh
 license: other
-library_name: peft
 base_model: Qwen/Qwen2-Audio-7B-Instruct
 tags:
   - audio
   - speech-recognition
   - code-switching
   - dpo
-  - lora
   - qwen2-audio
 datasets:
   - custom
@@ -21,157 +20,112 @@ pipeline_tag: automatic-speech-recognition
 # Qwen2-Audio-7B-DPO-CodeSwitch
-A LoRA adapter for [Qwen/Qwen2-Audio-7B-Instruct](https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct) fine-tuned with DPO (Direct Preference Optimization) on code-switching speech transcription data.
 ## Evaluation Results (MER - Mixed Error Rate, lower is better)
 | Benchmark | Baseline | This Model | Improvement |
 |-----------|----------|------------|-------------|
-| **SEAME** | 0.5677 | **0.5301** | **-6.6%** |
 | **EMILIA** | 0.4470 | **0.4208** | **-5.9%** |
 | **CS-Dialogue** | 0.3891 | **0.3140** | **-19.3%** |
 ### Benchmark Descriptions
-- **SEAME**: English-Mandarin code-switching conversational speech from Singapore/Malaysia (9,764 samples)
 - **EMILIA**: Synthetic code-switching evaluation set (1,000 samples)
 - **CS-Dialogue**: Code-switching dialogue evaluation set (359 samples)
-## Examples
-Below are examples showing improvements from baseline to DPO-trained model:
-### Example 1: Code-Switching Preserved (Lifestyle)
-| | Transcription |
-|---|---|
-| **Ground Truth** | 我们 都 应该 pursue a healthy lifestyle |
-| **Baseline** | 我们都应该追求健康的生活方式 *(fully translated to Chinese)* |
-| **This Model** | 我们都应该 pursue a healthy lifestyle |
-| **MER** | 1.00 → **0.00** |
-### Example 2: Mixed Language Preserved (Christmas)
-| | Transcription |
-|---|---|
-| **Ground Truth** | every christmas 我 就 应该 是 没有 人 跟我 庆祝 了 [啦] |
-| **Baseline** | every christmas i would - should be no one to tell me *(Chinese translated to English)* |
-| **This Model** | every christmas 我就应该是没有人跟我庆祝了啦 |
-| **MER** | 0.88 → **0.00** |
-### Example 3: Technical Terms Preserved
-| | Transcription |
-|---|---|
-| **Ground Truth** | (呃) 每个 lecture different lecturer 那个 notes 不 不怎么 好的 [啦] |
-| **Baseline** | 呃那个老师不同风格的老师 *(lost technical terms)* |
-| **This Model** | 呃 每个 lecture different lecturer 那个 notes 不不怎么好的啦 |
-| **MER** | 0.75 → **0.00** |
-### Example 4: Complex Code-Switching Preserved
-| | Transcription |
-|---|---|
-| **Ground Truth** | [哦] 还有 什么 好吃 的 吗 还是 你 只是 去 那些 very expensive places like dempsey to eat |
-| **Baseline** | Oh, what else? Oh, yeah, there's always that expensive place like... to eat *(lost Chinese content)* |
-| **This Model** | 哦 还有什么好吃的吗 还是你只是去那些 very expensive places like dancy to eat |
-| **MER** | 0.83 → **0.04** |
-### Example 5: Professional Terms Preserved
-| | Transcription |
-|---|---|
-| **Ground Truth** | [哦] 因为 是个 professional degree |
-| **Baseline** | 哦因为他有个专业的学位 *(translated to Chinese)* |
-| **This Model** | 哦 因为 是个 professional degree |
-| **MER** | 1.00 → **0.00** |
 ## Training Configuration
 ### Model Architecture
 | Parameter | Value |
 |-----------|-------|
 | Base Model | [Qwen/Qwen2-Audio-7B-Instruct](https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct) |
-| Adapter Type | LoRA (Low-Rank Adaptation) |
-| LoRA Rank (r) | 256 |
-| LoRA Alpha | 128 |
-| LoRA Dropout | 0.05 |
-| LoRA Target Modules | All attention (q_proj, k_proj, v_proj, o_proj) + MLP (up_proj, down_proj, gate_proj) |
-| Trainable Parameters | ~1.28B (adapter only) |
 ### Training Hyperparameters
 | Parameter | Value |
 |-----------|-------|
 | Training Method | DPO (Direct Preference Optimization) |
-| DPO Beta (β) | 0.3 |
-| DPO Loss | Sigmoid |
-| Learning Rate | 3e-5 |
 | LR Scheduler | Cosine |
 | Warmup Ratio | 0.1 |
 | Batch Size (per device) | 1 |
-| Gradient Accumulation Steps | 4 |
-| Global Batch Size | 32 (8 GPUs × 1 × 4) |
 | Precision | BF16 |
-| Max Sequence Length | 8192 |
 | Weight Decay | 0.01 |
 | Max Gradient Norm | 1.0 |
 ## Usage
 ```python
 from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor
-from peft import PeftModel
 import torch
 import librosa
-# Load base model
-base_model = Qwen2AudioForConditionalGeneration.from_pretrained(
-    "Qwen/Qwen2-Audio-7B-Instruct",
     torch_dtype=torch.bfloat16,
     device_map="auto",
     trust_remote_code=True
 )
 processor = AutoProcessor.from_pretrained(
-    "Qwen/Qwen2-Audio-7B-Instruct",
     trust_remote_code=True
 )
-# Load LoRA adapter
-model = PeftModel.from_pretrained(base_model, "myaccountfor/Qwen2-Audio-7B-DPO-CodeSwitch")
 model.eval()
-# Inference example
 conversation = [
     {"role": "user", "content": [
         {"type": "audio", "audio_url": "path/to/audio.wav"},
-        {"type": "text", "text": "Please transcribe this speech."}
     ]}
 ]
 text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
-audios = [librosa.load("path/to/audio.wav", sr=processor.feature_extractor.sampling_rate)[0]]
 inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
 inputs = {k: v.to(model.device) for k, v in inputs.items()}
 with torch.no_grad():
-    generated_ids = model.generate(**inputs, max_new_tokens=256)
-transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
 ```
 ## Files
 ```
 ├── README.md                      # This file
-├── adapter_config.json            # LoRA configuration
-├── adapter_model.safetensors      # LoRA adapter weights (~1.28 GB)
 ├── tokenizer files                # Tokenizer assets
 └── eval_results/
-    ├── baseline_seame.json        # Baseline model results on SEAME
-    ├── baseline_emilia.json       # Baseline model results on EMILIA
-    ├── baseline_cs_dialogue.json  # Baseline model results on CS-Dialogue
-    ├── trained_seame.json         # This model's results on SEAME
     ├── trained_emilia.json        # This model's results on EMILIA
     └── trained_cs_dialogue.json   # This model's results on CS-Dialogue
 ```
 ## License
-This adapter inherits the license of the base [Qwen2-Audio-7B-Instruct](https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct) model.

   - en
   - zh
 license: other
+library_name: transformers
 base_model: Qwen/Qwen2-Audio-7B-Instruct
 tags:
   - audio
   - speech-recognition
   - code-switching
   - dpo
   - qwen2-audio
 datasets:
   - custom
 # Qwen2-Audio-7B-DPO-CodeSwitch
+A fine-tuned version of [Qwen/Qwen2-Audio-7B-Instruct](https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct) trained with DPO (Direct Preference Optimization) on code-switching speech transcription data.
 ## Evaluation Results (MER - Mixed Error Rate, lower is better)
 | Benchmark | Baseline | This Model | Improvement |
 |-----------|----------|------------|-------------|
+| **SEAME-SGE** | 0.9511 | **0.8552** | **-10.1%** |
+| **SEAME-MAN** | 0.7289 | **0.5830** | **-20.0%** |
 | **EMILIA** | 0.4470 | **0.4208** | **-5.9%** |
 | **CS-Dialogue** | 0.3891 | **0.3140** | **-19.3%** |
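The Improvement column appears to be the relative change in MER against the baseline, that is, (trained - baseline) / baseline. A minimal sketch that recomputes the percentages from the table values; the helper name `relative_change` is illustrative and not part of this repository:

```python
def relative_change(baseline: float, trained: float) -> float:
    """Relative change in percent; a negative value means the error rate dropped."""
    return (trained - baseline) / baseline * 100.0


# (baseline MER, trained MER) pairs copied from the results table.
results = {
    "SEAME-SGE": (0.9511, 0.8552),
    "SEAME-MAN": (0.7289, 0.5830),
    "EMILIA": (0.4470, 0.4208),
    "CS-Dialogue": (0.3891, 0.3140),
}

for name, (baseline, trained) in results.items():
    # One decimal place, with an explicit sign, to match the table's formatting.
    print(f"{name}: {relative_change(baseline, trained):+.1f}%")
```

Rounded to one decimal place, these agree with the percentages reported in the table.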
 ### Benchmark Descriptions
+- **SEAME-SGE**: SEAME dev set (Singapore English focused) - 3,222 samples ([AudioLLMs/seame_dev_sge](https://huggingface.co/datasets/AudioLLMs/seame_dev_sge))
+- **SEAME-MAN**: SEAME dev set (Mandarin focused) - 2,610 samples ([AudioLLMs/seame_dev_man](https://huggingface.co/datasets/AudioLLMs/seame_dev_man))
 - **EMILIA**: Synthetic code-switching evaluation set (1,000 samples)
 - **CS-Dialogue**: Code-switching dialogue evaluation set (359 samples)
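This card does not spell out how MER is scored. A common convention for Mandarin-English code-switching, assumed here, is to treat each Mandarin character and each English word as one token and divide the token-level edit distance by the number of reference tokens. The sketch below follows that assumption; the function names, the tokenization regex, and the choice to ignore brackets and punctuation are illustrative, not the evaluation script used for the numbers above:

```python
import re


def mixed_tokens(text: str) -> list[str]:
    """Split text into single CJK characters and lowercase Latin words (assumed convention)."""
    return re.findall(r"[\u4e00-\u9fff]|[a-z0-9']+", text.lower())


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (free when tokens match)
            ))
        prev = curr
    return prev[-1]


def mer(reference: str, hypothesis: str) -> float:
    """Mixed error rate: token edit distance divided by reference token count."""
    ref = mixed_tokens(reference)
    if not ref:
        raise ValueError("reference has no scorable tokens")
    return edit_distance(ref, mixed_tokens(hypothesis)) / len(ref)


reference = "我们 都 应该 pursue a healthy lifestyle"
print(mer(reference, "我们都应该 pursue a healthy lifestyle"))  # code-switch preserved
print(mer(reference, "我们都应该追求健康的生活方式"))            # English span translated away
```

Under this convention, whitespace between Mandarin characters does not affect the score, while translating the English span into Mandarin counts every replaced or extra token as an error.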
 ## Training Configuration
 ### Model Architecture
 | Parameter | Value |
 |-----------|-------|
 | Base Model | [Qwen/Qwen2-Audio-7B-Instruct](https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct) |
+| Training Type | Full Fine-Tuning |
+| Total Parameters | ~7B |
 ### Training Hyperparameters
 | Parameter | Value |
 |-----------|-------|
 | Training Method | DPO (Direct Preference Optimization) |
+| DPO Beta | 0.5 |
+| Learning Rate | 1e-6 |
 | LR Scheduler | Cosine |
 | Warmup Ratio | 0.1 |
 | Batch Size (per device) | 1 |
+| Gradient Accumulation Steps | 8 |
 | Precision | BF16 |
+| Max Sequence Length | 2048 |
 | Weight Decay | 0.01 |
 | Max Gradient Norm | 1.0 |
+| FSDP | Full Shard |
 ## Usage
 ```python
 from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor
 import torch
 import librosa
+# Load model
+model = Qwen2AudioForConditionalGeneration.from_pretrained(
+    "myaccountfor/Qwen2-Audio-7B-DPO-CodeSwitch",
     torch_dtype=torch.bfloat16,
     device_map="auto",
     trust_remote_code=True
 )
 processor = AutoProcessor.from_pretrained(
+    "myaccountfor/Qwen2-Audio-7B-DPO-CodeSwitch",
     trust_remote_code=True
 )
 model.eval()
+# Load audio
+audio, sr = librosa.load("path/to/audio.wav", sr=16000)
+# Process inputs
 conversation = [
     {"role": "user", "content": [
         {"type": "audio", "audio_url": "path/to/audio.wav"},
+        {"type": "text", "text": "Please transcribe this audio."}
     ]}
 ]
 text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
+audios = [audio]
 inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
 inputs = {k: v.to(model.device) for k, v in inputs.items()}
+# Generate
 with torch.no_grad():
+    outputs = model.generate(**inputs, max_new_tokens=256)
+    transcription = processor.batch_decode(outputs, skip_special_tokens=True)[0]
+print(transcription)
 ```
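For decoder-only generation in `transformers`, `generate` returns the prompt token IDs followed by the newly generated ones, so decoding the full output yields the chat prompt text as well as the transcription. With tensors, the usual remedy is to slice with `outputs[:, inputs["input_ids"].shape[1]:]` before calling `batch_decode`. The sketch below shows the same slicing on plain Python lists; the token IDs are made up and `strip_prompt` is a hypothetical helper, not a library function:

```python
def strip_prompt(sequences, prompt_len):
    """Drop the first prompt_len token IDs from every generated sequence."""
    return [seq[prompt_len:] for seq in sequences]


# Made-up token IDs: three prompt tokens followed by two generated tokens.
prompt_ids = [101, 102, 103]
generated = [prompt_ids + [7, 8]]

new_tokens = strip_prompt(generated, len(prompt_ids))
print(new_tokens)
```

Only the sliced IDs would then be passed to the decoder, so the printed transcription contains the model's answer without the instruction text.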
 ## Files
 ```
 ├── README.md                      # This file
+├── config.json                    # Model configuration
+├── model weights                  # Model weights
 ├── tokenizer files                # Tokenizer assets
 └── eval_results/
+    ├── baseline_seame_sge.json    # Baseline results on SEAME-SGE
+    ├── baseline_seame_man.json    # Baseline results on SEAME-MAN
+    ├── baseline_emilia.json       # Baseline results on EMILIA
+    ├── baseline_cs_dialogue.json  # Baseline results on CS-Dialogue
+    ├── trained_seame_sge.json     # This model's results on SEAME-SGE
+    ├── trained_seame_man.json     # This model's results on SEAME-MAN
     ├── trained_emilia.json        # This model's results on EMILIA
     └── trained_cs_dialogue.json   # This model's results on CS-Dialogue
 ```
 ## License
+This model inherits the license of the base [Qwen2-Audio-7B-Instruct](https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct) model.