myaccountfor committed on
Commit 20dbc2b · verified · 1 Parent(s): 733abfc

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +36 -82
README.md CHANGED
@@ -3,14 +3,13 @@ language:
  - en
  - zh
  license: other
- library_name: peft
  base_model: Qwen/Qwen2-Audio-7B-Instruct
  tags:
  - audio
  - speech-recognition
  - code-switching
  - dpo
- - lora
  - qwen2-audio
  datasets:
  - custom
@@ -21,157 +20,112 @@ pipeline_tag: automatic-speech-recognition

  # Qwen2-Audio-7B-DPO-CodeSwitch

- A LoRA adapter for [Qwen/Qwen2-Audio-7B-Instruct](https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct) fine-tuned with DPO (Direct Preference Optimization) on code-switching speech transcription data.

  ## Evaluation Results (MER - Mixed Error Rate, lower is better)

  | Benchmark | Baseline | This Model | Improvement |
  |-----------|----------|------------|-------------|
- | **SEAME** | 0.5677 | **0.5301** | **-6.6%** |
  | **EMILIA** | 0.4470 | **0.4208** | **-5.9%** |
  | **CS-Dialogue** | 0.3891 | **0.3140** | **-19.3%** |

  ### Benchmark Descriptions
- - **SEAME**: English-Mandarin code-switching conversational speech from Singapore/Malaysia (9,764 samples)
  - **EMILIA**: Synthetic code-switching evaluation set (1,000 samples)
  - **CS-Dialogue**: Code-switching dialogue evaluation set (359 samples)

- ## Examples
-
- Below are examples showing improvements from baseline to DPO-trained model:
-
- ### Example 1: Code-Switching Preserved (Lifestyle)
- | | Transcription |
- |---|---|
- | **Ground Truth** | 我们 都 应该 pursue a healthy lifestyle |
- | **Baseline** | 我们都应该追求健康的生活方式 *(fully translated to Chinese)* |
- | **This Model** | 我们都应该 pursue a healthy lifestyle |
- | **MER** | 1.00 → **0.00** |
-
- ### Example 2: Mixed Language Preserved (Christmas)
- | | Transcription |
- |---|---|
- | **Ground Truth** | every christmas 我 就 应该 是 没有 人 跟我 庆祝 了 [啦] |
- | **Baseline** | every christmas i would - should be no one to tell me *(Chinese translated to English)* |
- | **This Model** | every christmas 我就应该是没有人跟我庆祝了啦 |
- | **MER** | 0.88 → **0.00** |
-
- ### Example 3: Technical Terms Preserved
- | | Transcription |
- |---|---|
- | **Ground Truth** | (呃) 每个 lecture different lecturer 那个 notes 不 不怎么 好的 [啦] |
- | **Baseline** | 呃那个老师不同风格的老师 *(lost technical terms)* |
- | **This Model** | 呃 每个 lecture different lecturer 那个 notes 不不怎么好的啦 |
- | **MER** | 0.75 → **0.00** |
-
- ### Example 4: Complex Code-Switching Preserved
- | | Transcription |
- |---|---|
- | **Ground Truth** | [哦] 还有 什么 好吃 的 吗 还是 你 只是 去 那些 very expensive places like dempsey to eat |
- | **Baseline** | Oh, what else? Oh, yeah, there's always that expensive place like... to eat *(lost Chinese content)* |
- | **This Model** | 哦 还有什么好吃的吗 还是你只是去那些 very expensive places like dancy to eat |
- | **MER** | 0.83 → **0.04** |
-
- ### Example 5: Professional Terms Preserved
- | | Transcription |
- |---|---|
- | **Ground Truth** | [哦] 因为 是个 professional degree |
- | **Baseline** | 哦因为他有个专业的学位 *(translated to Chinese)* |
- | **This Model** | 哦 因为 是个 professional degree |
- | **MER** | 1.00 → **0.00** |
-
  ## Training Configuration

  ### Model Architecture
  | Parameter | Value |
  |-----------|-------|
  | Base Model | [Qwen/Qwen2-Audio-7B-Instruct](https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct) |
- | Adapter Type | LoRA (Low-Rank Adaptation) |
- | LoRA Rank (r) | 256 |
- | LoRA Alpha | 128 |
- | LoRA Dropout | 0.05 |
- | LoRA Target Modules | All attention (q_proj, k_proj, v_proj, o_proj) + MLP (up_proj, down_proj, gate_proj) |
- | Trainable Parameters | ~1.28B (adapter only) |

  ### Training Hyperparameters
  | Parameter | Value |
  |-----------|-------|
  | Training Method | DPO (Direct Preference Optimization) |
- | DPO Beta (β) | 0.3 |
- | DPO Loss | Sigmoid |
- | Learning Rate | 3e-5 |
  | LR Scheduler | Cosine |
  | Warmup Ratio | 0.1 |
  | Batch Size (per device) | 1 |
- | Gradient Accumulation Steps | 4 |
- | Global Batch Size | 32 (8 GPUs × 1 × 4) |
  | Precision | BF16 |
- | Max Sequence Length | 8192 |
  | Weight Decay | 0.01 |
  | Max Gradient Norm | 1.0 |
-

  ## Usage

  ```python
  from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor
- from peft import PeftModel
  import torch
  import librosa

- # Load base model
- base_model = Qwen2AudioForConditionalGeneration.from_pretrained(
-     "Qwen/Qwen2-Audio-7B-Instruct",
      torch_dtype=torch.bfloat16,
      device_map="auto",
      trust_remote_code=True
  )
  processor = AutoProcessor.from_pretrained(
-     "Qwen/Qwen2-Audio-7B-Instruct",
      trust_remote_code=True
  )

- # Load LoRA adapter
- model = PeftModel.from_pretrained(base_model, "myaccountfor/Qwen2-Audio-7B-DPO-CodeSwitch")
  model.eval()

- # Inference example
  conversation = [
      {"role": "user", "content": [
          {"type": "audio", "audio_url": "path/to/audio.wav"},
-         {"type": "text", "text": "Please transcribe this speech."}
      ]}
  ]

  text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
- audios = [librosa.load("path/to/audio.wav", sr=processor.feature_extractor.sampling_rate)[0]]

  inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
  inputs = {k: v.to(model.device) for k, v in inputs.items()}

  with torch.no_grad():
-     generated_ids = model.generate(**inputs, max_new_tokens=256)

- transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
  ```

  ## Files

  ```
  ├── README.md # This file
- ├── adapter_config.json # LoRA configuration
- ├── adapter_model.safetensors # LoRA adapter weights (~1.28 GB)
  ├── tokenizer files # Tokenizer assets
  └── eval_results/
-     ├── baseline_seame.json # Baseline model results on SEAME
-     ├── baseline_emilia.json # Baseline model results on EMILIA
-     ├── baseline_cs_dialogue.json # Baseline model results on CS-Dialogue
-     ├── trained_seame.json # This model's results on SEAME
      ├── trained_emilia.json # This model's results on EMILIA
      └── trained_cs_dialogue.json # This model's results on CS-Dialogue
  ```

-
  ## License

- This adapter inherits the license of the base [Qwen2-Audio-7B-Instruct](https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct) model.
 
  - en
  - zh
  license: other
+ library_name: transformers
  base_model: Qwen/Qwen2-Audio-7B-Instruct
  tags:
  - audio
  - speech-recognition
  - code-switching
  - dpo
  - qwen2-audio
  datasets:
  - custom
 
  # Qwen2-Audio-7B-DPO-CodeSwitch

+ A fine-tuned version of [Qwen/Qwen2-Audio-7B-Instruct](https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct) trained with DPO (Direct Preference Optimization) on code-switching speech transcription data.

  ## Evaluation Results (MER - Mixed Error Rate, lower is better)

  | Benchmark | Baseline | This Model | Improvement |
  |-----------|----------|------------|-------------|
+ | **SEAME-SGE** | 0.9511 | **0.8552** | **-10.1%** |
+ | **SEAME-MAN** | 0.7289 | **0.5830** | **-20.0%** |
  | **EMILIA** | 0.4470 | **0.4208** | **-5.9%** |
  | **CS-Dialogue** | 0.3891 | **0.3140** | **-19.3%** |
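
The Improvement column appears to be the relative change in MER with respect to the baseline. The sketch below recomputes it from the table values as a sanity check; the function name `relative_change` is an illustrative choice, not something taken from the repository.

```python
def relative_change(baseline: float, model: float) -> float:
    """Relative change of `model` versus `baseline`, in percent (negative = lower MER)."""
    return (model - baseline) / baseline * 100.0


# MER values copied from the table above: (baseline, this model).
results = {
    "SEAME-SGE": (0.9511, 0.8552),
    "SEAME-MAN": (0.7289, 0.5830),
    "EMILIA": (0.4470, 0.4208),
    "CS-Dialogue": (0.3891, 0.3140),
}

for name, (baseline, model) in results.items():
    print(f"{name}: {relative_change(baseline, model):+.1f}%")
```

Rounded to one decimal place, the printed values agree with the Improvement column.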

  ### Benchmark Descriptions
+ - **SEAME-SGE**: SEAME dev set (Singapore English focused) - 3,222 samples ([AudioLLMs/seame_dev_sge](https://huggingface.co/datasets/AudioLLMs/seame_dev_sge))
+ - **SEAME-MAN**: SEAME dev set (Mandarin focused) - 2,610 samples ([AudioLLMs/seame_dev_man](https://huggingface.co/datasets/AudioLLMs/seame_dev_man))
  - **EMILIA**: Synthetic code-switching evaluation set (1,000 samples)
  - **CS-Dialogue**: Code-switching dialogue evaluation set (359 samples)
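
The README does not say how MER is computed on these benchmarks. A common convention for Mandarin-English code-switching is to count each Chinese character and each English word as one token, then divide the token-level edit distance by the reference length. The sketch below assumes that convention; `tokenize_mixed`, `edit_distance`, and `mixed_error_rate` are hypothetical names, the regular expression covers only the basic CJK Unified Ideographs range, and the sample pair mirrors Example 1 from the previous revision of the README.

```python
import re

# Assumption: one token per CJK character, one token per run of Latin letters/digits.
_TOKEN_RE = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9']+")


def tokenize_mixed(text: str) -> list[str]:
    """Split mixed text into Chinese characters and lower-cased English words."""
    return [tok.lower() for tok in _TOKEN_RE.findall(text)]


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Token-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def mixed_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the number of reference tokens."""
    ref = tokenize_mixed(reference)
    hyp = tokenize_mixed(hypothesis)
    if not ref:
        raise ValueError("reference has no tokens")
    return edit_distance(ref, hyp) / len(ref)


reference = "我们 都 应该 pursue a healthy lifestyle"
print(mixed_error_rate(reference, "我们都应该 pursue a healthy lifestyle"))  # 0.0
```

Under this tokenization, whitespace differences between reference and hypothesis do not count as errors, which is why the spaced and unspaced Chinese spans compare as equal.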

  ## Training Configuration

  ### Model Architecture
  | Parameter | Value |
  |-----------|-------|
  | Base Model | [Qwen/Qwen2-Audio-7B-Instruct](https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct) |
+ | Training Type | Full Fine-Tuning |
+ | Total Parameters | ~7B |

  ### Training Hyperparameters
  | Parameter | Value |
  |-----------|-------|
  | Training Method | DPO (Direct Preference Optimization) |
+ | DPO Beta | 0.5 |
+ | Learning Rate | 1e-6 |
  | LR Scheduler | Cosine |
  | Warmup Ratio | 0.1 |
  | Batch Size (per device) | 1 |
+ | Gradient Accumulation Steps | 8 |
  | Precision | BF16 |
+ | Max Sequence Length | 2048 |
  | Weight Decay | 0.01 |
  | Max Gradient Norm | 1.0 |
+ | FSDP | Full Shard |
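
For orientation, the standard DPO objective for one preference pair is `-log(sigmoid(beta * (chosen_logratio - rejected_logratio)))`, where each log-ratio is the policy log-probability minus the reference log-probability of that response. The sketch below evaluates it in plain Python. It assumes the sigmoid loss variant (this revision of the table names only "DPO"), takes `beta = 0.5` from the table, and uses illustrative function and argument names that are not taken from the training code.

```python
import math


def dpo_sigmoid_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.5,  # "DPO Beta" from the table above
) -> float:
    """DPO loss for one (chosen, rejected) pair given summed sequence log-probs."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) in a numerically stable form: softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# When the policy equals the reference, both log-ratios are zero and the loss is log(2).
print(dpo_sigmoid_loss(-12.0, -15.0, -12.0, -15.0))
```

The loss falls below log(2) once the policy raises the chosen response relative to the rejected one more than the reference does; a larger `beta` makes the same log-ratio gap count for more.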

  ## Usage

  ```python
  from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor
  import torch
  import librosa

+ # Load model
+ model = Qwen2AudioForConditionalGeneration.from_pretrained(
+     "myaccountfor/Qwen2-Audio-7B-DPO-CodeSwitch",
      torch_dtype=torch.bfloat16,
      device_map="auto",
      trust_remote_code=True
  )
  processor = AutoProcessor.from_pretrained(
+     "myaccountfor/Qwen2-Audio-7B-DPO-CodeSwitch",
      trust_remote_code=True
  )

  model.eval()

+ # Load audio
+ audio, sr = librosa.load("path/to/audio.wav", sr=16000)
+
+ # Process inputs
  conversation = [
      {"role": "user", "content": [
          {"type": "audio", "audio_url": "path/to/audio.wav"},
+         {"type": "text", "text": "Please transcribe this audio."}
      ]}
  ]

  text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
+ audios = [audio]

  inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
  inputs = {k: v.to(model.device) for k, v in inputs.items()}

+ # Generate
  with torch.no_grad():
+     outputs = model.generate(**inputs, max_new_tokens=256)
+     transcription = processor.batch_decode(outputs, skip_special_tokens=True)[0]

+ print(transcription)
  ```
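
One thing to watch in the snippet above: for decoder-only models, `generate` typically returns the prompt tokens followed by the newly generated tokens, so decoding `outputs` directly may echo the chat prompt before the transcription. A minimal sketch of trimming the prompt first follows; `strip_prompt` is a hypothetical helper, shown on plain lists of token ids so that it runs without loading the model.

```python
from typing import Sequence


def strip_prompt(generated: Sequence[Sequence[int]], prompt_len: int) -> list:
    """Drop the first `prompt_len` token ids from every generated sequence."""
    return [row[prompt_len:] for row in generated]


# Toy ids standing in for model output: 3 prompt tokens followed by 2 new tokens.
prompt_len = 3
generated = [[101, 7, 8, 42, 43]]
print(strip_prompt(generated, prompt_len))  # [[42, 43]]
```

With the tensors from the usage example, the equivalent slice would be `outputs[:, inputs["input_ids"].shape[1]:]`, applied before `processor.batch_decode(...)`.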

  ## Files

  ```
  ├── README.md # This file
+ ├── config.json # Model configuration
+ ├── model weights # Model weights
  ├── tokenizer files # Tokenizer assets
  └── eval_results/
+     ├── baseline_seame_sge.json # Baseline results on SEAME-SGE
+     ├── baseline_seame_man.json # Baseline results on SEAME-MAN
+     ├── baseline_emilia.json # Baseline results on EMILIA
+     ├── baseline_cs_dialogue.json # Baseline results on CS-Dialogue
+     ├── trained_seame_sge.json # This model's results on SEAME-SGE
+     ├── trained_seame_man.json # This model's results on SEAME-MAN
      ├── trained_emilia.json # This model's results on EMILIA
      └── trained_cs_dialogue.json # This model's results on CS-Dialogue
  ```

  ## License

+ This model inherits the license of the base [Qwen2-Audio-7B-Instruct](https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct) model.