---
license: cc-by-nc-4.0
language:
- zh
- en
library_name: transformers
tags:
- speech
- audio
- speech-evaluation
- expressive-speech
- mandarin
- chain-of-thought
- ceaeval
pipeline_tag: audio-text-to-text
---

# CEAEval-Model (CEAEval-M)

**CEAEval-M** is the speech-LLM *judge* released together with our ACL paper *"Evaluating the Expressive Appropriateness of Speech in Rich Contexts"*. Given a Mandarin speech segment together with an *ideal expressive plan* inferred from its surrounding narrative context, CEAEval-M produces

```
step-by-step comparison of ideal vs. actual expression, with spans pointing to audio-grounded cues (emotion / rhythm / intonation / recording condition / paralinguistic events)
X.X  # overall expressive appropriateness ∈ [0.0, 5.0]
```

This is the *judge* half of the planner–judge decoupled pipeline defined in the paper. It is designed to work with a frozen text-only planner ([Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B)) that first summarizes long narrative context into a four-tuple `{emotion, rhythm, intonation, recording_condition}` via multi-context voting.

## What's released here

- Model weights in `safetensors` (4 shards) plus `config.json`, `generation_config.json`, tokenizer, preprocessor, and chat template.
- **Six extra special tokens** the judge uses during training and inference (``, ``, ``, ``, ``, ``) — already merged into the tokenizer and embedding matrix.
- A patched `modeling_*` path that implements the **adaptive audio attention bias** mechanism described in Sec. 3.3.4 and Appendix F of the paper (region-wise bias over system-prompt / audio / CoT regions).
- `test_datas/` with **anonymised** sanity samples (audio + JSON) so you can verify the pipeline end-to-end without touching the main dataset.

The full inference pipeline (planner + judge, audio pre-processing, batch driver, sanity examples) lives in the code repository — see [Related resources](#related-resources).
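To make the two interfaces above concrete, here is a minimal, hypothetical sketch of the glue code around the pipeline: a majority vote over per-context planner outputs to obtain the four-tuple, and a parser that pulls the final `X.X` score out of the judge's CoT + score text. The function names and the exact voting rule are illustrative assumptions, not the released driver; consult the code repository for the actual implementation.

```python
import re
from collections import Counter

PLAN_FIELDS = ("emotion", "rhythm", "intonation", "recording_condition")

def vote_plan(plans):
    """Hypothetical multi-context voting: majority-vote each field of the
    planner's four-tuple across several context windows."""
    return {
        field: Counter(p[field] for p in plans).most_common(1)[0][0]
        for field in PLAN_FIELDS
    }

def parse_judge_score(text):
    """Extract the final numeric score from the judge's output (the CoT
    followed by an `X.X` line), clamped to the documented [0.0, 5.0] range.
    Returns None if no score line is found."""
    for line in reversed(text.strip().splitlines()):
        m = re.match(r"\s*(\d+(?:\.\d+)?)", line)
        if m:
            return max(0.0, min(5.0, float(m.group(1))))
    return None
```

For example, `parse_judge_score("...step-by-step comparison...\n4.5  # overall")` returns `4.5`, and a judge output with no trailing score line yields `None`.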
## Intended use and limitations

- Intended as a **research benchmark and diagnostic tool** for expressive-speech generation / selection, not as a standalone decision-making system. Expressive appropriateness is inherently subjective; predictions should be interpreted with appropriate human oversight.
- Trained and evaluated on **Mandarin audiobook speech**. Applying the model to other languages, styles, or domains (short commands, non-narrative dialogue, etc.) may produce unreliable scores.

## Related resources

This model is one of three companion releases for the paper. **Please use them together:**

| Resource | Link |
| --- | --- |
| 📄 Paper | *Evaluating the Expressive Appropriateness of Speech in Rich Contexts* (ACL) |
| 💻 Code | |
| 🤖 Model (this repo) | |
| 📚 Dataset (CEAEval-D) | |
| 🌐 Project page / demo | |

## License

Released under **CC BY-NC 4.0** — non-commercial academic research use only. The released weights do not contain or expose raw audio, transcripts, or any personally identifiable information.