# MERaLiON-2-3B-DPO-CodeSwitch

A fine-tuned version of MERaLiON/MERaLiON-2-3B, trained with DPO (Direct Preference Optimization) on code-switching speech transcription data.
## Evaluation Results (MER - Mixed Error Rate, lower is better)

| Benchmark   | Baseline | This Model | Improvement |
|-------------|----------|------------|-------------|
| SEAME       | 0.3372   | 0.2530     | -25.0%      |
| EMILIA      | 0.3201   | 0.3041     | -5.0%       |
| CS-Dialogue | 0.2541   | 0.2258     | -11.1%      |
## Benchmark Descriptions
- SEAME: English-Mandarin code-switching conversational speech from Singapore/Malaysia (9,764 samples)
- EMILIA: Synthetic code-switching evaluation set (1,000 samples)
- CS-Dialogue: Code-switching dialogue evaluation set (359 samples)
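The exact tokenization behind these MER scores is not specified in this card, but a common convention for English-Mandarin code-switching is to score Mandarin at the character level and English at the word level before computing edit distance. A minimal sketch under that assumption (function names are illustrative):

```python
# Hedged sketch of a Mixed Error Rate (MER); assumes Mandarin is split into
# single characters and English into whitespace-delimited words. This is an
# illustration, not the benchmark's official scoring script.
import re

def mixed_tokens(text: str) -> list[str]:
    """Split a code-switched string into CJK characters and non-CJK words."""
    tokens = []
    for chunk in text.split():
        # A single CJK character is one token; a run of non-CJK characters
        # (an English word) is one token.
        tokens.extend(re.findall(r"[\u4e00-\u9fff]|[^\u4e00-\u9fff]+", chunk))
    return tokens

def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Standard Levenshtein distance over token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]

def mer(ref: str, hyp: str) -> float:
    """Edit distance over mixed tokens, normalized by reference length."""
    r, h = mixed_tokens(ref), mixed_tokens(hyp)
    return edit_distance(r, h) / max(len(r), 1)

# One character-level substitution out of six mixed tokens -> 1/6.
print(round(mer("我 住 temasek poly 那边", "我住 tamasek poly 那边"), 2))  # 0.17
```

Under this toy definition, a single substituted English word costs the same as a single wrong Mandarin character, which is the usual motivation for mixing the two unit types.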
## Examples
Below are examples showing improvements from baseline to DPO-trained model:
### Example 1: Hallucination Fixed

|              | Transcription |
|--------------|---------------|
| Ground Truth | 你们 是 一首 歌 也是 教 一个 session [啊] [哦] [嗯] |
| Baseline     | 你们是一首歌也是教一个 session (oh) 我们也是 session 那个 sessional practice 的... (hallucinated extra content) |
| This Model   | 你们是一首歌也是教一个 session (啊) (哦) |
| MER          | 2.20 → 0.07 |
### Example 2: Code-Switching Preserved (Maid)

|              | Transcription |
|--------------|---------------|
| Ground Truth | [啊] 然后 因为 我们 家里 有 一个 maid 的 [吗] 我 妈妈 有请 一个 maid [mah] 那个 是 打扫 屋子 的 东西 这样 之类 [吗] that is why 可以 [咯] 因为 |
| Baseline     | (ah) 然后因为我们家里有一个 maid 的 (mah) 妈妈就请一个 maid 的 (mah) (mah) (mah)... (repeated filler words) |
| This Model   | (啊) 然后因为我们家里有一个 maid 的 (mah) 我妈妈就请一个 maid (mah) 那个是打扫屋子的东西这样子 (leh) (mah) that's why 可以 (loh) 因为 |
| MER          | 1.02 → 0.17 |
### Example 3: English Location Preserved (Temasek Poly)

|              | Transcription |
|--------------|---------------|
| Ground Truth | 我 住 temasek poly 那边 |
| Baseline     | 我住达马士科波利那边 (transliterated to Chinese) |
| This Model   | 我住 tamasek poly 那边 |
| MER          | 1.00 → 0.17 |
### Example 4: Code-Switching Preserved (Exam)

|              | Transcription |
|--------------|---------------|
| Ground Truth | 考 得 很 考 得 like shit |
| Baseline     | 课程很课程很 like shit (wrong Chinese characters) |
| This Model   | 考得很 考得 like shit |
| MER          | 0.71 → 0.00 |
### Example 5: Mixed Language Preserved (Youth)

|              | Transcription |
|--------------|---------------|
| Ground Truth | not really youth [lah] 还是 youth 了 三十岁 |
| Baseline     | not really you (lah) 还是 you (lah) 三十岁 (oh) (lost "youth") |
| This Model   | not really youth (lah) 还是 youth 了三十岁 |
| MER          | 0.36 → 0.00 |
## Training Configuration

### Model Architecture

| Parameter            | Value |
|----------------------|-------|
| Base Model           | MERaLiON/MERaLiON-2-3B |
| Training Type        | Full Fine-Tuning |
| Total Parameters     | ~3.47B |
| Trainable Parameters | ~3.47B |
### Training Hyperparameters

| Parameter                   | Value |
|-----------------------------|-------|
| Training Method             | DPO (Direct Preference Optimization) |
| DPO Beta                    | 0.5 |
| Learning Rate               | 1e-6 |
| LR Scheduler                | Cosine |
| Warmup Ratio                | 0.1 |
| Batch Size (per device)     | 1 |
| Gradient Accumulation Steps | 8 |
| Global Batch Size           | 256 (32 GPUs x 1 x 8) |
| Precision                   | BF16 |
| Max Sequence Length         | 2048 |
| Weight Decay                | 0.01 |
| Max Gradient Norm           | 1.0 |
| Training Steps              | 300 |
| FSDP                        | Full Shard |
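DPO trains on preference pairs (e.g. a clean transcription preferred over a hallucinated one), pushing the policy to widen the chosen-vs-rejected log-probability gap relative to the frozen reference model. A minimal per-pair sketch of the loss with the beta from the table above; the function and argument names are illustrative, not from the training code:

```python
# Hedged sketch of the per-pair DPO objective (beta = 0.5, as in the table).
# Each argument is assumed to be a summed sequence log-probability.
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.5) -> float:
    """-log sigmoid(beta * ((pi_c - pi_r) - (ref_c - ref_r)))."""
    margin = (pi_chosen - pi_rejected) - (ref_chosen - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# The policy prefers the chosen transcription by a wider margin (4 nats)
# than the reference model does (2 nats), so the loss falls below
# log(2) ~= 0.693, the value at zero margin.
print(round(dpo_loss(-5.0, -9.0, -6.0, -8.0), 4))  # 0.3133
```

In practice this is computed over batches of tensors rather than scalars, but the per-pair arithmetic is the same.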
## Usage

```python
import torch
import librosa

from meralion2_model.modeling_meralion2 import MERaLiON2ForConditionalGeneration
from meralion2_model.processing_meralion2 import MERaLiON2Processor

# Load the model in bfloat16 and place it on the available device(s).
model = MERaLiON2ForConditionalGeneration.from_pretrained(
    "myaccountfor/MERaLiON-2-3B-DPO-CodeSwitch",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = MERaLiON2Processor.from_pretrained(
    "myaccountfor/MERaLiON-2-3B-DPO-CodeSwitch",
    trust_remote_code=True,
)
model.eval()

# The model expects 16 kHz mono audio.
audio, sr = librosa.load("path/to/audio.wav", sr=16000)

# Build the chat prompt; <SpeechHere> marks where the processor splices in
# the audio features.
prompt = "Please transcribe this speech."
input_text = (
    f"Instruction: {prompt} \n"
    "Follow the text instruction based on the following audio: <SpeechHere>"
)
chat_prompt = (
    "<bos><start_of_turn>user\n" + input_text
    + "<end_of_turn>\n<start_of_turn>model\n"
)

inputs = processor(text=[chat_prompt], audios=[audio])
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,
        use_cache=True,
    )

# Strip the prompt tokens and decode only the newly generated text.
generated_ids = outputs[:, inputs["input_ids"].size(1):]
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(transcription)
```
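When transcribing many files, the prompt construction above can be factored into a small helper. The template string below is copied verbatim from the usage snippet (assumed correct for this checkpoint; it is not taken from separate official documentation):

```python
# Hypothetical helper: build the chat prompt used by this model for any
# instruction. The template is the Gemma-style turn format from the usage
# snippet above.
def build_chat_prompt(instruction: str) -> str:
    """Wrap an instruction in the user/model turn template, with the
    <SpeechHere> placeholder the processor replaces with audio features."""
    input_text = (
        f"Instruction: {instruction} \n"
        "Follow the text instruction based on the following audio: <SpeechHere>"
    )
    return (
        "<bos><start_of_turn>user\n" + input_text
        + "<end_of_turn>\n<start_of_turn>model\n"
    )

chat_prompt = build_chat_prompt("Please transcribe this speech.")
```

The returned string can be passed to the processor exactly as in the snippet above, e.g. `processor(text=[chat_prompt], audios=[audio])`.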
## Files

```
├── README.md                      # This file
├── config.json                    # Model configuration
├── pytorch_model.bin              # Model weights (~8.1 GB)
├── tokenizer files                # Tokenizer assets
└── eval_results/
    ├── baseline_seame.json        # Baseline results on SEAME
    ├── baseline_emilia.json       # Baseline results on EMILIA
    ├── baseline_cs_dialogue.json  # Baseline results on CS-Dialogue
    ├── trained_seame.json         # This model's results on SEAME
    ├── trained_emilia.json        # This model's results on EMILIA
    └── trained_cs_dialogue.json   # This model's results on CS-Dialogue
```
## License

This model inherits the license of the base MERaLiON-2-3B model.