# SDiaReward-7B
SDiaReward-7B is a reward model for spoken dialogue, built on
Qwen2.5-Omni-7B with a pooling layer and a linear scoring head
(QwenOmniThinkerReward). Given a multi-turn conversation containing interleaved
speech and text, it produces a scalar reward that reflects two qualities:
- Modality-awareness: prosody, emotion, and acoustic naturalness (real human speech vs. synthetic TTS).
- Colloquialness: spontaneous spoken style vs. scripted written style.

- Paper: Modeling and Benchmarking Spoken Dialogue Rewards with Modality and Colloquialness (ACL 2026 Main Conference), arXiv:2603.14889
- Code: https://github.com/MM-Speech/SDiaReward
- Training data (gated): SDiaReward dataset
- Smaller variant: SDiaReward-3B
## Evaluation
On the ESDR-Bench validation set:
| Model | Accuracy | Eval loss | Margin |
|---|---|---|---|
| SDiaReward-7B | 0.971 | 0.358 | 1.167 |
| SDiaReward-3B | 0.917 | 0.419 | 0.869 |
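
The card reports these numbers without defining them. Under the usual pairwise reward-model conventions (an assumption, not confirmed by the card), accuracy is the fraction of preference pairs in which the preferred conversation receives the higher reward, and margin is the mean reward gap between the preferred and dispreferred conversation. A minimal sketch with made-up scores; `preference_metrics` is a hypothetical helper, not part of the official code:

```python
from typing import Sequence, Tuple


def preference_metrics(chosen: Sequence[float], rejected: Sequence[float]) -> Tuple[float, float]:
    """Return (pairwise accuracy, mean margin) over preference pairs.

    chosen[i] and rejected[i] are the scalar rewards assigned to the preferred
    and dispreferred conversation of pair i (hypothetical inputs).
    """
    if len(chosen) != len(rejected) or not chosen:
        raise ValueError("need equal-length, non-empty score lists")
    diffs = [c - r for c, r in zip(chosen, rejected)]
    # A pair counts as correct only if the preferred side scores strictly
    # higher; treating ties as errors is an assumption of this sketch.
    accuracy = sum(1 for d in diffs if d > 0) / len(diffs)
    margin = sum(diffs) / len(diffs)
    return accuracy, margin


# Made-up rewards for three preference pairs (not real model outputs).
acc, margin = preference_metrics([1.5, 0.2, 2.0], [0.1, 0.9, 0.5])
```

Here two of the three pairs are ranked correctly, so `acc` is 2/3, and `margin` is the average of the gaps 1.4, -0.7, and 1.5.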
## Usage
This checkpoint uses a custom reward architecture (QwenOmniThinkerReward: a
Qwen2.5-Omni backbone with a pooling layer and a scalar reward head). The custom
modeling code is not bundled in this repository, so the weights cannot be
loaded with a plain AutoModel/pipeline call. To load the model and score
conversations, follow the loading and inference instructions in the official code
repository:
https://github.com/MM-Speech/SDiaReward (checkpoint id: `MYJOKERML/SDiaReward-7B`)
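
The real QwenOmniThinkerReward code lives in the official repository and is not reproduced here. As a rough illustration of the general shape of a pooled scalar reward head, the sketch below applies masked mean pooling to a sequence of final hidden states and projects the pooled vector to one number with a linear layer. Everything in it is an assumption: the helper names are hypothetical, the weights are toy values, and it uses plain mean pooling only (the exact "mean + center" pooling listed under Training is not modeled).

```python
from typing import List, Sequence


def masked_mean_pool(hidden: Sequence[Sequence[float]], mask: Sequence[int]) -> List[float]:
    """Average the hidden vectors at positions where mask == 1 (hypothetical helper)."""
    kept = [h for h, m in zip(hidden, mask) if m]
    if not kept:
        raise ValueError("mask selects no positions")
    dim = len(kept[0])
    return [sum(vec[j] for vec in kept) / len(kept) for j in range(dim)]


def scalar_reward(pooled: Sequence[float], weight: Sequence[float], bias: float) -> float:
    """Linear scoring head: dot(pooled, weight) + bias gives one reward value."""
    if len(pooled) != len(weight):
        raise ValueError("weight must match the hidden size")
    return sum(p * w for p, w in zip(pooled, weight)) + bias


# Toy input: 3 positions with hidden size 2; the last position is padding.
hidden = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
pooled = masked_mean_pool(hidden, mask)              # mean of the first two rows
reward = scalar_reward(pooled, [0.5, -1.0], 0.25)    # 2*0.5 + 3*(-1) + 0.25
```

The mask keeps padding out of the average, so the pooled vector is [2.0, 3.0] and the reward is -1.75. In the actual model the hidden states come from the Qwen2.5-Omni backbone over the interleaved speech and text conversation.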
## Training
Trained with TRL's reward trainer on the SDiaReward preference dataset (11,630 episode-level preference pairs, ~200 hours of paired speech). Backbone: Qwen2.5-Omni-7B; pooling: mean + center; 2 epochs.
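
For reference, TRL's reward trainer in its default form optimizes a pairwise Bradley-Terry objective: for each preference pair it minimizes -log sigmoid(r_chosen - r_rejected). The plain-Python sketch below computes that loss averaged over a batch of scalar rewards. It is illustrative only: `bradley_terry_loss` is a hypothetical helper, and it omits TRL's optional per-pair margin term and reward-centering regularizer, as well as any other settings the authors may have used.

```python
import math
from typing import Sequence


def bradley_terry_loss(chosen: Sequence[float], rejected: Sequence[float]) -> float:
    """Mean of -log(sigmoid(r_chosen - r_rejected)) over preference pairs."""
    if len(chosen) != len(rejected) or not chosen:
        raise ValueError("need equal-length, non-empty reward lists")
    total = 0.0
    for c, r in zip(chosen, rejected):
        d = c - r
        # -log(sigmoid(d)) == log(1 + exp(-d)); this form avoids overflow
        # when d is a large negative number.
        total += max(-d, 0.0) + math.log1p(math.exp(-abs(d)))
    return total / len(chosen)
```

A pair scored as a tie costs log 2 (about 0.693); the loss shrinks toward zero as the preferred conversation's reward pulls ahead and grows roughly linearly when the ranking is wrong.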
## License & intended use
Released under Apache-2.0 for research use. The reward signal is intended for evaluating and improving spoken-dialogue systems; it is not a safety classifier.
## Citation
@article{lu2026modeling,
title={Modeling and benchmarking spoken dialogue rewards with modality and colloquialness},
author={Lu, Jingyu and Wang, Yuhan and Zhuo, Fan and Cheng, Xize and Pan, Changhao and Pu, Xueyi and Chen, Yifu and Wen, Chenyuhao and Liang, Tianle and Zhao, Zhou},
journal={arXiv preprint arXiv:2603.14889},
year={2026}
}