---
language:
- en
- zh
license: other
library_name: transformers
base_model: MERaLiON/MERaLiON-2-3B
tags:
- audio
- speech-recognition
- code-switching
- dpo
- meralion
datasets:
- custom
metrics:
- mer
pipeline_tag: automatic-speech-recognition
---

# MERaLiON-2-3B-DPO-CodeSwitch

A fine-tuned version of [MERaLiON/MERaLiON-2-3B](https://huggingface.co/MERaLiON/MERaLiON-2-3B) trained with DPO (Direct Preference Optimization) on code-switching speech transcription data.

## Evaluation Results (MER - Mixed Error Rate, lower is better)

| Benchmark | Baseline | This Model | Improvement |
|-----------|----------|------------|-------------|
| **SEAME-SGE** | 0.3238 | **0.3175** | **-2.0%** |
| **SEAME-MAN** | 0.2579 | **0.2561** | **-0.7%** |
| **EMILIA** | 0.3201 | **0.3041** | **-5.0%** |
| **CS-Dialogue** | 0.2541 | **0.2258** | **-11.1%** |

### Benchmark Descriptions

- **SEAME-SGE**: SEAME dev set (Singapore English focused) - 3,222 samples ([AudioLLMs/seame_dev_sge](https://huggingface.co/datasets/AudioLLMs/seame_dev_sge))
- **SEAME-MAN**: SEAME dev set (Mandarin focused) - 2,610 samples ([AudioLLMs/seame_dev_man](https://huggingface.co/datasets/AudioLLMs/seame_dev_man))
- **EMILIA**: Synthetic code-switching evaluation set (1,000 samples)
- **CS-Dialogue**: Code-switching dialogue evaluation set (359 samples)
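
All scores above are MER. Below is a minimal sketch of how a mixed error rate can be computed, assuming the common convention for Mandarin-English code-switching of scoring Chinese at the character level and English at the word level; the exact tokenization behind the numbers above may differ in detail:

```python
# Minimal sketch of Mixed Error Rate (MER) scoring, assuming each
# Chinese character and each English word counts as one token.
import unicodedata

def mixed_tokenize(text: str) -> list[str]:
    """Split text into CJK characters and non-CJK words."""
    tokens: list[str] = []
    for chunk in text.split():
        word = ""
        for ch in chunk:
            if unicodedata.name(ch, "").startswith("CJK"):
                if word:
                    tokens.append(word)
                    word = ""
                tokens.append(ch)  # one token per Chinese character
            else:
                word += ch
        if word:
            tokens.append(word)
    return tokens

def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance over token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def mer(reference: str, hypothesis: str) -> float:
    ref, hyp = mixed_tokenize(reference), mixed_tokenize(hypothesis)
    return edit_distance(ref, hyp) / max(len(ref), 1)

# One substitution (temasek -> tamasek) over six reference tokens:
print(mer("我 住 temasek poly 那边", "我住 tamasek poly 那边"))  # ~0.17, cf. Example 3 below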

## Examples

Below are examples showing improvements from the baseline to the DPO-trained model:

### Example 1: Hallucination Fixed

| | Transcription |
|---|---|
| **Ground Truth** | 你们 是 一首 歌 也是 教 一个 session [啊] [哦] [嗯] |
| **Baseline** | 你们是一首歌也是教一个 session (oh) 我们也是 session 那个 sessional practice 的... *(hallucinated extra content)* |
| **This Model** | 你们是一首歌也是教一个 session (啊) (哦) |
| **MER** | 2.20 → **0.07** |

### Example 2: Code-Switching Preserved (Maid)

| | Transcription |
|---|---|
| **Ground Truth** | [啊] 然后 因为 我们 家里 有 一个 maid 的 [吗] 我 妈妈 有请 一个 maid [mah] 那个 是 打扫 屋子 的 东西 这样 之类 [吗] that is why 可以 [咯] 因为 |
| **Baseline** | (ah) 然后因为我们家里有一个 maid 的 (mah) 妈妈就请一个 maid 的 (mah) (mah) (mah)... *(repeated filler words)* |
| **This Model** | (啊) 然后因为我们家里有一个 maid 的 (mah) 我妈妈就请一个 maid (mah) 那个是打扫屋子的东西这样子 (leh) (mah) that's why 可以 (loh) 因为 |
| **MER** | 1.02 → **0.17** |

### Example 3: English Location Preserved (Temasek Poly)

| | Transcription |
|---|---|
| **Ground Truth** | 我 住 temasek poly 那边 |
| **Baseline** | 我住达马士科波利那边 *(transliterated to Chinese)* |
| **This Model** | 我住 tamasek poly 那边 |
| **MER** | 1.00 → **0.17** |

### Example 4: Code-Switching Preserved (Exam)

| | Transcription |
|---|---|
| **Ground Truth** | 考 得 很 考 得 like shit |
| **Baseline** | 课程很课程很 like shit *(wrong Chinese characters)* |
| **This Model** | 考得很 考得 like shit |
| **MER** | 0.71 → **0.00** |

### Example 5: Mixed Language Preserved (Youth)

| | Transcription |
|---|---|
| **Ground Truth** | not really youth [lah] 还是 youth 了 三十岁 |
| **Baseline** | not really you (lah) 还是 you (lah) 三十岁 (oh) *(lost "youth")* |
| **This Model** | not really youth (lah) 还是 youth 了三十岁 |
| **MER** | 0.36 → **0.00** |

## Training Configuration

### Model Architecture

| Parameter | Value |
|-----------|-------|
| Base Model | [MERaLiON/MERaLiON-2-3B](https://huggingface.co/MERaLiON/MERaLiON-2-3B) |
| Training Type | Full Fine-Tuning |
| Total Parameters | ~3.47B |
| Trainable Parameters | ~3.47B |
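
These counts can be double-checked once the model is loaded (see the Usage section below for the load call); a quick sketch:

```python
# Quick check of the parameter counts above, with `model` loaded as in
# the Usage section below.
total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"total: {total / 1e9:.2f}B, trainable: {trainable / 1e9:.2f}B")
```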

### Training Hyperparameters

| Parameter | Value |
|-----------|-------|
| Training Method | DPO (Direct Preference Optimization) |
| DPO Beta | 0.5 |
| Learning Rate | 1e-6 |
| LR Scheduler | Cosine |
| Warmup Ratio | 0.1 |
| Batch Size (per device) | 1 |
| Gradient Accumulation Steps | 8 |
| Global Batch Size | 256 (32 GPUs x 1 x 8) |
| Precision | BF16 |
| Max Sequence Length | 2048 |
| Weight Decay | 0.01 |
| Max Gradient Norm | 1.0 |
| Training Steps | 300 |
| FSDP | Full Shard |
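
For reference, here is a minimal sketch of the standard DPO objective with the beta from the table, assuming sequence-level log-probabilities of each chosen/rejected transcription pair under the policy and a frozen reference model (the actual training script is not included in this repository):

```python
# Minimal sketch of the DPO loss with beta = 0.5 as configured above.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.5) -> torch.Tensor:
    # Implicit reward of each transcription: its policy-vs-reference log ratio.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the preferred transcription's reward above the rejected one's.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```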

## Usage

The snippet below assumes the custom `meralion2_model` package (the MERaLiON-2 modeling code) is importable on the Python path:

```python
from meralion2_model.modeling_meralion2 import MERaLiON2ForConditionalGeneration
from meralion2_model.processing_meralion2 import MERaLiON2Processor
import torch
import librosa

# Load model and processor
model = MERaLiON2ForConditionalGeneration.from_pretrained(
    "myaccountfor/MERaLiON-2-3B-DPO-CodeSwitch",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
processor = MERaLiON2Processor.from_pretrained(
    "myaccountfor/MERaLiON-2-3B-DPO-CodeSwitch",
    trust_remote_code=True
)

model.eval()

# Load audio at the 16 kHz sampling rate the model expects
audio, sr = librosa.load("path/to/audio.wav", sr=16000)

# Build the instruction prompt; <SpeechHere> marks where the audio goes
prompt = "Please transcribe this speech."
input_text = f"Instruction: {prompt} \nFollow the text instruction based on the following audio: <SpeechHere>"

# Wrap the instruction in the Gemma-style chat template
chat_prompt = "<bos><start_of_turn>user\n" + input_text + "<end_of_turn>\n<start_of_turn>model\n"

# Process text and audio together
inputs = processor(text=[chat_prompt], audios=[audio])
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Generate with greedy decoding
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,
        use_cache=True,
    )

# Keep only the tokens generated after the prompt
generated_ids = outputs[:, inputs['input_ids'].size(1):]
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(transcription)
```

## Files

```
├── README.md                     # This file
├── config.json                   # Model configuration
├── pytorch_model.bin             # Model weights (~8.1 GB)
├── tokenizer files               # Tokenizer assets
└── eval_results/
    ├── baseline_seame_sge.json   # Baseline results on SEAME-SGE
    ├── baseline_seame_man.json   # Baseline results on SEAME-MAN
    ├── baseline_emilia.json      # Baseline results on EMILIA
    ├── baseline_cs_dialogue.json # Baseline results on CS-Dialogue
    ├── trained_seame_sge.json    # This model's results on SEAME-SGE
    ├── trained_seame_man.json    # This model's results on SEAME-MAN
    ├── trained_emilia.json       # This model's results on EMILIA
    └── trained_cs_dialogue.json  # This model's results on CS-Dialogue
```

## License

This model inherits the license of the base [MERaLiON-2-3B](https://huggingface.co/MERaLiON/MERaLiON-2-3B) model.