---
license: cc-by-nc-4.0
language:
- zh
- en
library_name: transformers
tags:
- speech
- audio
- speech-evaluation
- expressive-speech
- mandarin
- chain-of-thought
- ceaeval
pipeline_tag: audio-text-to-text
---
# CEAEval-Model (CEAEval-M)
**CEAEval-M** is the speech-LLM *judge* released together with our ACL paper
*"Evaluating the Expressive Appropriateness of Speech in Rich Contexts"*.
Given a Mandarin speech segment together with an *ideal expressive plan*
inferred from its surrounding narrative context, CEAEval-M produces:
```
a step-by-step comparison of ideal vs. actual expression,
with … spans pointing to audio-grounded cues
(emotion / rhythm / intonation / recording condition /
paralinguistic events)

X.X  # overall expressive appropriateness ∈ [0.0, 5.0]
```
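Downstream consumers usually need only the final scalar on the last line. A minimal, hypothetical parser for the judge's free-text output (the helper name and regex are illustrative and not part of the released pipeline):

```python
import re

def parse_judge_score(output_text: str) -> float:
    """Extract the trailing X.X score and clamp it to [0.0, 5.0].

    Illustrative only; the released batch driver may parse differently.
    """
    matches = re.findall(r"\d+(?:\.\d+)?", output_text)
    if not matches:
        raise ValueError("no numeric score found in judge output")
    # Take the last number, which by the output format is the overall score.
    score = float(matches[-1])
    return max(0.0, min(5.0, score))

print(parse_judge_score("step-by-step comparison ...\n4.5"))  # 4.5
```

Clamping guards against the occasional out-of-range generation without changing in-range scores.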
This is the *judge* half of the planner–judge decoupled pipeline defined
in the paper. It is designed to work with a frozen text-only planner
([Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B)) that first summarizes
long narrative context into a four-tuple
`{emotion, rhythm, intonation, recording_condition}` via multi-context
voting.
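Multi-context voting can be sketched as a per-field majority vote over the four-tuples the planner emits for different context windows. This is a hedged illustration; the paper's exact voting rule (tie-breaking, window selection) may differ, and the helper name is hypothetical:

```python
from collections import Counter

FIELDS = ("emotion", "rhythm", "intonation", "recording_condition")

def vote_plan(candidate_plans):
    """Per-field majority vote over planner four-tuples.

    candidate_plans: list of dicts, one per context window, each with
    the four keys in FIELDS. Illustrative only.
    """
    return {
        field: Counter(p[field] for p in candidate_plans).most_common(1)[0][0]
        for field in FIELDS
    }

plans = [
    {"emotion": "sad", "rhythm": "slow", "intonation": "falling",
     "recording_condition": "studio"},
    {"emotion": "sad", "rhythm": "slow", "intonation": "flat",
     "recording_condition": "studio"},
    {"emotion": "calm", "rhythm": "slow", "intonation": "falling",
     "recording_condition": "studio"},
]
print(vote_plan(plans))
```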
## What's released here
- Model weights in `safetensors` (4 shards) plus `config.json`,
`generation_config.json`, tokenizer, preprocessor, and chat template.
- **Six extra special tokens** the judge uses during training and
  inference, already merged into the tokenizer and embedding matrix.
- A patched `modeling_*` path that implements the **adaptive audio
attention bias** mechanism described in Sec. 3.3.4 and Appendix F of
the paper (region-wise bias over system-prompt / audio / CoT regions).
- `test_datas/` with **anonymised** sanity samples (audio + JSON) so
you can verify the pipeline end-to-end without touching the main
dataset.
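The shape of the region-wise bias can be illustrated with a small pure-Python sketch: each token carries a region label, and a lookup over (query-region, key-region) pairs yields an additive term for the attention logits. The learned/adaptive bias values and the exact region partition of Sec. 3.3.4 are not reproduced here; all names below are hypothetical:

```python
def region_bias_matrix(region_ids, bias_table):
    """Build an additive attention-bias matrix from per-token region labels.

    region_ids: per-token region label, e.g. "sys", "audio", "cot".
    bias_table: maps (query_region, key_region) -> scalar bias added to
    the attention logits; unlisted pairs default to 0.0.
    Illustrative only -- the released modeling code applies this inside
    the attention forward pass, with learned values.
    """
    n = len(region_ids)
    return [
        [bias_table.get((region_ids[q], region_ids[k]), 0.0)
         for k in range(n)]
        for q in range(n)
    ]

regions = ["sys", "audio", "audio", "cot"]
# e.g. encourage CoT tokens to attend to the audio region
table = {("cot", "audio"): 1.0}
bias = region_bias_matrix(regions, table)
```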
The full inference pipeline (planner + judge, audio pre-processing,
batch driver, sanity examples) lives in the code repository — see
[Related resources](#related-resources).
## Intended use and limitations
- Intended as a **research benchmark and diagnostic tool** for
expressive-speech generation / selection, not as a standalone
decision-making system. Expressive appropriateness is inherently
subjective; predictions should be interpreted with appropriate human
oversight.
- Trained and evaluated on **Mandarin audiobook speech**. Applying the
model to other languages, styles, or domains (short commands,
non-narrative dialogue, etc.) may produce unreliable scores.
## Related resources
This model is one of three companion releases for the paper. **Please
use them together:**
| Resource | Link |
| --- | --- |
| 📄 Paper | *Evaluating the Expressive Appropriateness of Speech in Rich Contexts* (ACL) |
| 💻 Code | |
| 🤖 Model (this repo) | |
| 📚 Dataset (CEAEval-D) | |
| 🌐 Project page / demo | |
## License
Released under **CC BY-NC 4.0** — non-commercial academic research use
only. The released weights do not contain or expose raw audio,
transcripts, or any personally identifiable information.