# MERaLiON-2-3B-DPO-CodeSwitch

A fine-tuned version of MERaLiON/MERaLiON-2-3B, trained with DPO (Direct Preference Optimization) on code-switching speech transcription data.
## Evaluation Results (MER - Mixed Error Rate, lower is better)

| Benchmark   | Baseline | This Model | Improvement |
|-------------|----------|------------|-------------|
| SEAME       | 0.3372   | 0.2530     | -25.0%      |
| EMILIA      | 0.3201   | 0.3041     | -5.0%       |
| CS-Dialogue | 0.2541   | 0.2258     | -11.1%      |
## Benchmark Descriptions
- SEAME: English-Mandarin code-switching conversational speech from Singapore/Malaysia (9,764 samples)
- EMILIA: Synthetic code-switching evaluation set (1,000 samples)
- CS-Dialogue: Code-switching dialogue evaluation set (359 samples)
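The exact tokenization behind these MER scores is not specified in this card, but a common convention for English-Mandarin code-switching is to score Mandarin at the character level and English at the word level before computing edit distance. A minimal sketch under that assumption (function names are illustrative):

```python
# Hedged sketch of a Mixed Error Rate (MER); assumes Mandarin is split into
# single characters and English into whitespace-delimited words. This is an
# illustration, not the benchmark's official scoring script.
import re

def mixed_tokens(text: str) -> list[str]:
    """Split a code-switched string into CJK characters and non-CJK words."""
    tokens = []
    for chunk in text.split():
        # A single CJK character is one token; a run of non-CJK characters
        # (an English word) is one token.
        tokens.extend(re.findall(r"[\u4e00-\u9fff]|[^\u4e00-\u9fff]+", chunk))
    return tokens

def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Standard Levenshtein distance over token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]

def mer(ref: str, hyp: str) -> float:
    """Edit distance over mixed tokens, normalized by reference length."""
    r, h = mixed_tokens(ref), mixed_tokens(hyp)
    return edit_distance(r, h) / max(len(r), 1)

# One character-level substitution out of six mixed tokens -> 1/6.
print(round(mer("我 住 temasek poly 那边", "我住 tamasek poly 那边"), 2))  # 0.17
```

Under this toy definition, a single substituted English word costs the same as a single wrong Mandarin character, which is the usual motivation for mixing the two unit types.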
## Examples
Below are examples showing improvements from baseline to DPO-trained model:
### Example 1: Hallucination Fixed

|              | Transcription |
|--------------|---------------|
| Ground Truth | 你们 是 一首 歌 也是 教 一个 session [啊] [哦] [嗯] |
| Baseline     | 你们是一首歌也是教一个 session (oh) 我们也是 session 那个 sessional practice 的... (hallucinated extra content) |
| This Model   | 你们是一首歌也是教一个 session (啊) (哦) |
| MER          | 2.20 → 0.07 |
### Example 2: Code-Switching Preserved (Maid)

|              | Transcription |
|--------------|---------------|
| Ground Truth | [啊] 然后 因为 我们 家里 有 一个 maid 的 [吗] 我 妈妈 有请 一个 maid [mah] 那个 是 打扫 屋子 的 东西 这样 之类 [吗] that is why 可以 [咯] 因为 |
| Baseline     | (ah) 然后因为我们家里有一个 maid 的 (mah) 妈妈就请一个 maid 的 (mah) (mah) (mah)... (repeated filler words) |
| This Model   | (啊) 然后因为我们家里有一个 maid 的 (mah) 我妈妈就请一个 maid (mah) 那个是打扫屋子的东西这样子 (leh) (mah) that's why 可以 (loh) 因为 |
| MER          | 1.02 → 0.17 |
### Example 3: English Location Preserved (Temasek Poly)

|              | Transcription |
|--------------|---------------|
| Ground Truth | 我 住 temasek poly 那边 |
| Baseline     | 我住达马士科波利那边 (transliterated to Chinese) |
| This Model   | 我住 tamasek poly 那边 |
| MER          | 1.00 → 0.17 |
### Example 4: Code-Switching Preserved (Exam)

|              | Transcription |
|--------------|---------------|
| Ground Truth | 考 得 很 考 得 like shit |
| Baseline     | 课程很课程很 like shit (wrong Chinese characters) |
| This Model   | 考得很 考得 like shit |
| MER          | 0.71 → 0.00 |
### Example 5: Mixed Language Preserved (Youth)

|              | Transcription |
|--------------|---------------|
| Ground Truth | not really youth [lah] 还是 youth 了 三十岁 |
| Baseline     | not really you (lah) 还是 you (lah) 三十岁 (oh) (lost "youth") |
| This Model   | not really youth (lah) 还是 youth 了三十岁 |
| MER          | 0.36 → 0.00 |
## Training Configuration

### Model Architecture

| Parameter            | Value |
|----------------------|-------|
| Base Model           | MERaLiON/MERaLiON-2-3B |
| Training Type        | Full Fine-Tuning |
| Total Parameters     | ~3.47B |
| Trainable Parameters | ~3.47B |
### Training Hyperparameters

| Parameter                   | Value |
|-----------------------------|-------|
| Training Method             | DPO (Direct Preference Optimization) |
| DPO Beta                    | 0.5 |
| Learning Rate               | 1e-6 |
| LR Scheduler                | Cosine |
| Warmup Ratio                | 0.1 |
| Batch Size (per device)     | 1 |
| Gradient Accumulation Steps | 8 |
| Global Batch Size           | 256 (32 GPUs x 1 x 8) |
| Precision                   | BF16 |
| Max Sequence Length         | 2048 |
| Weight Decay                | 0.01 |
| Max Gradient Norm           | 1.0 |
| Training Steps              | 300 |
| FSDP                        | Full Shard |
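DPO trains on preference pairs (e.g. a clean transcription preferred over a hallucinated one), pushing the policy to widen the chosen-vs-rejected log-probability gap relative to the frozen reference model. A minimal per-pair sketch of the loss with the beta from the table above; the function and argument names are illustrative, not from the training code:

```python
# Hedged sketch of the per-pair DPO objective (beta = 0.5, as in the table).
# Each argument is assumed to be a summed sequence log-probability.
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.5) -> float:
    """-log sigmoid(beta * ((pi_c - pi_r) - (ref_c - ref_r)))."""
    margin = (pi_chosen - pi_rejected) - (ref_chosen - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# The policy prefers the chosen transcription by a wider margin (4 nats)
# than the reference model does (2 nats), so the loss falls below
# log(2) ~= 0.693, the value at zero margin.
print(round(dpo_loss(-5.0, -9.0, -6.0, -8.0), 4))  # 0.3133
```

In practice this is computed over batches of tensors rather than scalars, but the per-pair arithmetic is the same.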
## Usage

```python
import torch
import librosa

from meralion2_model.modeling_meralion2 import MERaLiON2ForConditionalGeneration
from meralion2_model.processing_meralion2 import MERaLiON2Processor

# Load the model in bfloat16 and place it on the available device(s).
model = MERaLiON2ForConditionalGeneration.from_pretrained(
    "myaccountfor/MERaLiON-2-3B-DPO-CodeSwitch",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = MERaLiON2Processor.from_pretrained(
    "myaccountfor/MERaLiON-2-3B-DPO-CodeSwitch",
    trust_remote_code=True,
)
model.eval()

# The model expects 16 kHz mono audio.
audio, sr = librosa.load("path/to/audio.wav", sr=16000)

# Build the chat prompt; <SpeechHere> marks where the processor splices in
# the audio features.
prompt = "Please transcribe this speech."
input_text = (
    f"Instruction: {prompt} \n"
    "Follow the text instruction based on the following audio: <SpeechHere>"
)
chat_prompt = (
    "<bos><start_of_turn>user\n" + input_text
    + "<end_of_turn>\n<start_of_turn>model\n"
)

inputs = processor(text=[chat_prompt], audios=[audio])
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,
        use_cache=True,
    )

# Strip the prompt tokens and decode only the newly generated text.
generated_ids = outputs[:, inputs["input_ids"].size(1):]
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(transcription)
```
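When transcribing many files, the prompt construction above can be factored into a small helper. The template string below is copied verbatim from the usage snippet (assumed correct for this checkpoint; it is not taken from separate official documentation):

```python
# Hypothetical helper: build the chat prompt used by this model for any
# instruction. The template is the Gemma-style turn format from the usage
# snippet above.
def build_chat_prompt(instruction: str) -> str:
    """Wrap an instruction in the user/model turn template, with the
    <SpeechHere> placeholder the processor replaces with audio features."""
    input_text = (
        f"Instruction: {instruction} \n"
        "Follow the text instruction based on the following audio: <SpeechHere>"
    )
    return (
        "<bos><start_of_turn>user\n" + input_text
        + "<end_of_turn>\n<start_of_turn>model\n"
    )

chat_prompt = build_chat_prompt("Please transcribe this speech.")
```

The returned string can be passed to the processor exactly as in the snippet above, e.g. `processor(text=[chat_prompt], audios=[audio])`.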
## Files

```
├── README.md                      # This file
├── config.json                    # Model configuration
├── pytorch_model.bin              # Model weights (~8.1 GB)
├── tokenizer files                # Tokenizer assets
└── eval_results/
    ├── baseline_seame.json        # Baseline results on SEAME
    ├── baseline_emilia.json       # Baseline results on EMILIA
    ├── baseline_cs_dialogue.json  # Baseline results on CS-Dialogue
    ├── trained_seame.json         # This model's results on SEAME
    ├── trained_emilia.json        # This model's results on EMILIA
    └── trained_cs_dialogue.json   # This model's results on CS-Dialogue
```
## License

This model inherits the license of the base MERaLiON-2-3B model.