---
language:
- en
- zh
license: other
library_name: transformers
base_model: MERaLiON/MERaLiON-2-3B
tags:
- audio
- speech-recognition
- code-switching
- dpo
- meralion
datasets:
- custom
metrics:
- mer
pipeline_tag: automatic-speech-recognition
---

# MERaLiON-2-3B-DPO-CodeSwitch

A fine-tuned version of [MERaLiON/MERaLiON-2-3B](https://huggingface.co/MERaLiON/MERaLiON-2-3B) trained with DPO (Direct Preference Optimization) on code-switching speech transcription data.

## Evaluation Results (MER - Mixed Error Rate, lower is better)

| Benchmark | Baseline | This Model | Improvement |
|-----------|----------|------------|-------------|
| **SEAME-SGE** | 0.3238 | **0.3175** | **-2.0%** |
| **SEAME-MAN** | 0.2579 | **0.2561** | **-0.7%** |
| **EMILIA** | 0.3201 | **0.3041** | **-5.0%** |
| **CS-Dialogue** | 0.2541 | **0.2258** | **-11.1%** |

### Benchmark Descriptions

- **SEAME-SGE**: SEAME dev set (Singapore English focused) - 3,222 samples ([AudioLLMs/seame_dev_sge](https://huggingface.co/datasets/AudioLLMs/seame_dev_sge))
- **SEAME-MAN**: SEAME dev set (Mandarin focused) - 2,610 samples ([AudioLLMs/seame_dev_man](https://huggingface.co/datasets/AudioLLMs/seame_dev_man))
- **EMILIA**: Synthetic code-switching evaluation set (1,000 samples)
- **CS-Dialogue**: Code-switching dialogue evaluation set (359 samples)
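
All scores above are MER. Below is a minimal sketch of how a mixed error rate can be computed, assuming the common convention for Mandarin-English code-switching of scoring Chinese at the character level and English at the word level; the exact tokenization behind the numbers above may differ in detail:

```python
# Minimal sketch of Mixed Error Rate (MER) scoring, assuming each
# Chinese character and each English word counts as one token.
import unicodedata

def mixed_tokenize(text: str) -> list[str]:
    """Split text into CJK characters and non-CJK words."""
    tokens: list[str] = []
    for chunk in text.split():
        word = ""
        for ch in chunk:
            if unicodedata.name(ch, "").startswith("CJK"):
                if word:
                    tokens.append(word)
                    word = ""
                tokens.append(ch)  # one token per Chinese character
            else:
                word += ch
        if word:
            tokens.append(word)
    return tokens

def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance over token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def mer(reference: str, hypothesis: str) -> float:
    ref, hyp = mixed_tokenize(reference), mixed_tokenize(hypothesis)
    return edit_distance(ref, hyp) / max(len(ref), 1)

# One substitution (temasek -> tamasek) over six reference tokens:
print(mer("我 住 temasek poly 那边", "我住 tamasek poly 那边"))  # ~0.17, cf. Example 3 below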

## Examples

Below are examples showing improvements from the baseline to the DPO-trained model:

### Example 1: Hallucination Fixed

| | Transcription |
|---|---|
| **Ground Truth** | 你们 是 一首 歌 也是 教 一个 session [啊] [哦] [嗯] |
| **Baseline** | 你们是一首歌也是教一个 session (oh) 我们也是 session 那个 sessional practice 的... *(hallucinated extra content)* |
| **This Model** | 你们是一首歌也是教一个 session (啊) (哦) |
| **MER** | 2.20 → **0.07** |

### Example 2: Code-Switching Preserved (Maid)

| | Transcription |
|---|---|
| **Ground Truth** | [啊] 然后 因为 我们 家里 有 一个 maid 的 [吗] 我 妈妈 有请 一个 maid [mah] 那个 是 打扫 屋子 的 东西 这样 之类 [吗] that is why 可以 [咯] 因为 |
| **Baseline** | (ah) 然后因为我们家里有一个 maid 的 (mah) 妈妈就请一个 maid 的 (mah) (mah) (mah)... *(repeated filler words)* |
| **This Model** | (啊) 然后因为我们家里有一个 maid 的 (mah) 我妈妈就请一个 maid (mah) 那个是打扫屋子的东西这样子 (leh) (mah) that's why 可以 (loh) 因为 |
| **MER** | 1.02 → **0.17** |

### Example 3: English Location Preserved (Temasek Poly)

| | Transcription |
|---|---|
| **Ground Truth** | 我 住 temasek poly 那边 |
| **Baseline** | 我住达马士科波利那边 *(transliterated to Chinese)* |
| **This Model** | 我住 tamasek poly 那边 |
| **MER** | 1.00 → **0.17** |

### Example 4: Code-Switching Preserved (Exam)

| | Transcription |
|---|---|
| **Ground Truth** | 考 得 很 考 得 like shit |
| **Baseline** | 课程很课程很 like shit *(wrong Chinese characters)* |
| **This Model** | 考得很 考得 like shit |
| **MER** | 0.71 → **0.00** |

### Example 5: Mixed Language Preserved (Youth)

| | Transcription |
|---|---|
| **Ground Truth** | not really youth [lah] 还是 youth 了 三十岁 |
| **Baseline** | not really you (lah) 还是 you (lah) 三十岁 (oh) *(lost "youth")* |
| **This Model** | not really youth (lah) 还是 youth 了三十岁 |
| **MER** | 0.36 → **0.00** |

## Training Configuration

### Model Architecture

| Parameter | Value |
|-----------|-------|
| Base Model | [MERaLiON/MERaLiON-2-3B](https://huggingface.co/MERaLiON/MERaLiON-2-3B) |
| Training Type | Full Fine-Tuning |
| Total Parameters | ~3.47B |
| Trainable Parameters | ~3.47B |
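
These counts can be double-checked once the model is loaded (see the Usage section below for the load call); a quick sketch:

```python
# Quick check of the parameter counts above, with `model` loaded as in
# the Usage section below.
total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"total: {total / 1e9:.2f}B, trainable: {trainable / 1e9:.2f}B")
```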

### Training Hyperparameters

| Parameter | Value |
|-----------|-------|
| Training Method | DPO (Direct Preference Optimization) |
| DPO Beta | 0.5 |
| Learning Rate | 1e-6 |
| LR Scheduler | Cosine |
| Warmup Ratio | 0.1 |
| Batch Size (per device) | 1 |
| Gradient Accumulation Steps | 8 |
| Global Batch Size | 256 (32 GPUs x 1 x 8) |
| Precision | BF16 |
| Max Sequence Length | 2048 |
| Weight Decay | 0.01 |
| Max Gradient Norm | 1.0 |
| Training Steps | 300 |
| FSDP | Full Shard |
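
For reference, here is a minimal sketch of the standard DPO objective with the beta from the table, assuming sequence-level log-probabilities of each chosen/rejected transcription pair under the policy and a frozen reference model (the actual training script is not included in this repository):

```python
# Minimal sketch of the DPO loss with beta = 0.5 as configured above.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.5) -> torch.Tensor:
    # Implicit reward of each transcription: its policy-vs-reference log ratio.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the preferred transcription's reward above the rejected one's.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```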

## Usage

The snippet below assumes the custom `meralion2_model` package (the MERaLiON-2 modeling code) is importable on the Python path:

```python
from meralion2_model.modeling_meralion2 import MERaLiON2ForConditionalGeneration
from meralion2_model.processing_meralion2 import MERaLiON2Processor
import torch
import librosa

# Load model and processor
model = MERaLiON2ForConditionalGeneration.from_pretrained(
    "myaccountfor/MERaLiON-2-3B-DPO-CodeSwitch",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
processor = MERaLiON2Processor.from_pretrained(
    "myaccountfor/MERaLiON-2-3B-DPO-CodeSwitch",
    trust_remote_code=True
)

model.eval()

# Load audio at the 16 kHz sampling rate the model expects
audio, sr = librosa.load("path/to/audio.wav", sr=16000)

# Build the instruction prompt; <SpeechHere> marks where the audio goes
prompt = "Please transcribe this speech."
input_text = f"Instruction: {prompt} \nFollow the text instruction based on the following audio: <SpeechHere>"

# Wrap the instruction in the Gemma-style chat template
chat_prompt = "<bos><start_of_turn>user\n" + input_text + "<end_of_turn>\n<start_of_turn>model\n"

# Process text and audio together
inputs = processor(text=[chat_prompt], audios=[audio])
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Generate with greedy decoding
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,
        use_cache=True,
    )

# Keep only the tokens generated after the prompt
generated_ids = outputs[:, inputs['input_ids'].size(1):]
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(transcription)
```

## Files

```
├── README.md                     # This file
├── config.json                   # Model configuration
├── pytorch_model.bin             # Model weights (~8.1 GB)
├── tokenizer files               # Tokenizer assets
└── eval_results/
    ├── baseline_seame_sge.json   # Baseline results on SEAME-SGE
    ├── baseline_seame_man.json   # Baseline results on SEAME-MAN
    ├── baseline_emilia.json      # Baseline results on EMILIA
    ├── baseline_cs_dialogue.json # Baseline results on CS-Dialogue
    ├── trained_seame_sge.json    # This model's results on SEAME-SGE
    ├── trained_seame_man.json    # This model's results on SEAME-MAN
    ├── trained_emilia.json       # This model's results on EMILIA
    └── trained_cs_dialogue.json  # This model's results on CS-Dialogue
```

## License

This model inherits the license of the base [MERaLiON-2-3B](https://huggingface.co/MERaLiON/MERaLiON-2-3B) model.