---
license: cc-by-nc-4.0
language:
- zh
- en
library_name: transformers
tags:
- speech
- audio
- speech-evaluation
- expressive-speech
- mandarin
- chain-of-thought
- ceaeval
pipeline_tag: audio-text-to-text
---
# CEAEval-Model (CEAEval-M)
**CEAEval-M** is the speech-LLM *judge* released together with our ACL paper
*"Evaluating the Expressive Appropriateness of Speech in Rich Contexts"*.
Given a Mandarin speech segment together with an *ideal expressive plan*
inferred from its surrounding narrative context, CEAEval-M produces output of the form:
```
<think> step-by-step comparison of ideal vs. actual expression,
with <focus_audio>…</focus_audio> spans pointing to
audio-grounded cues (emotion / rhythm / intonation /
recording condition / paralinguistic events) </think>
<score>X.X</score> # overall expressive appropriateness ∈ [0.0, 5.0]
```
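The output above can be consumed programmatically. Here is a minimal sketch of parsing a judge response into its chain-of-thought rationale and numeric score; the `parse_judgement` helper and the sample `raw` string are illustrative, not part of the released code:

```python
import re

# Hypothetical raw output from CEAEval-M, following the format shown above.
raw = (
    "<think> The narration calls for a tense, hurried delivery; the audio "
    "<focus_audio>speeds up and rises in pitch</focus_audio>, matching the "
    "ideal rhythm and intonation. </think>\n"
    "<score>4.5</score>"
)

def parse_judgement(text):
    """Split a CEAEval-M response into its CoT rationale and numeric score."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    score = re.search(r"<score>([\d.]+)</score>", text)
    if score is None:
        return None  # malformed output: no score emitted
    return {
        "rationale": think.group(1).strip() if think else "",
        "score": float(score.group(1)),  # expressive appropriateness in [0.0, 5.0]
    }

result = parse_judgement(raw)
print(result["score"])  # 4.5
```

Keeping the `<focus_audio>…</focus_audio>` spans inside the rationale preserves the audio-grounded cues for downstream inspection.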
This is the *judge* half of the planner–judge decoupled pipeline defined
in the paper. It is designed to work with a frozen text-only planner
([Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B)) that first summarizes
long narrative context into a four-tuple
`{emotion, rhythm, intonation, recording_condition}` via multi-context
voting.
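The aggregation step can be pictured as independent majority votes over the planner's per-context outputs. The sketch below is an assumption about how such voting could be implemented (the four field names come from the tuple above; the exact prompts and voting logic live in the code repository):

```python
from collections import Counter

def vote_plan(plans):
    """Aggregate per-context planner outputs into one four-tuple by taking
    a majority vote on each field independently."""
    fields = ("emotion", "rhythm", "intonation", "recording_condition")
    return {f: Counter(p[f] for p in plans).most_common(1)[0][0] for f in fields}

# Three hypothetical planner outputs from different context windows.
plans = [
    {"emotion": "tense", "rhythm": "fast", "intonation": "rising", "recording_condition": "clean"},
    {"emotion": "tense", "rhythm": "fast", "intonation": "flat",   "recording_condition": "clean"},
    {"emotion": "calm",  "rhythm": "fast", "intonation": "rising", "recording_condition": "clean"},
]
print(vote_plan(plans))
# {'emotion': 'tense', 'rhythm': 'fast', 'intonation': 'rising', 'recording_condition': 'clean'}
```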
## What's released here
- Model weights in `safetensors` (4 shards) plus `config.json`,
`generation_config.json`, tokenizer, preprocessor, and chat template.
- **Six extra special tokens** the judge uses during training and
inference (`<think>`, `</think>`, `<score>`, `</score>`,
`<focus_audio>`, `</focus_audio>`) — already merged into the tokenizer
and embedding matrix.
- A patched `modeling_*` path that implements the **adaptive audio
attention bias** mechanism described in Sec. 3.3.4 and Appendix F of
the paper (region-wise bias over system-prompt / audio / CoT regions).
- `test_datas/` with **anonymised** sanity samples (audio + JSON) so
you can verify the pipeline end-to-end without touching the main
dataset.
The full inference pipeline (planner + judge, audio pre-processing,
batch driver, sanity examples) lives in the code repository — see
[Related resources](#related-resources).
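The Hub page's auto-generated Transformers snippet loads the weights via a `GatedAttenQwen2_5omnithinker` class. Since that class is not part of mainline `transformers` (it is the patched modeling path mentioned above), a loading sketch would look roughly like this; whether `trust_remote_code=True` suffices or the companion code repository must be installed first is an assumption to verify against the code repo:

```python
from transformers import AutoModel, AutoTokenizer

repo = "TianRW/CEAEval-Model"
tokenizer = AutoTokenizer.from_pretrained(repo)
# The architecture is the patched GatedAttenQwen2_5omnithinker class with the
# adaptive audio attention bias; trust_remote_code lets transformers resolve
# it from the repo if shipped there, otherwise install the code repo first.
model = AutoModel.from_pretrained(repo, trust_remote_code=True)
```

The six extra special tokens listed above are already part of the saved tokenizer and embedding matrix, so no manual `add_special_tokens` / `resize_token_embeddings` step should be needed.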
## Intended use and limitations
- Intended as a **research benchmark and diagnostic tool** for
expressive-speech generation / selection, not as a standalone
decision-making system. Expressive appropriateness is inherently
subjective; predictions should be interpreted with appropriate human
oversight.
- Trained and evaluated on **Mandarin audiobook speech**. Applying the
model to other languages, styles, or domains (short commands,
non-narrative dialogue, etc.) may produce unreliable scores.
## Related resources
This model is one of three companion releases for the paper. **Please
use them together:**
| Resource | Link |
| --- | --- |
| 📄 Paper | *Evaluating the Expressive Appropriateness of Speech in Rich Contexts* (ACL) |
| 💻 Code | <https://github.com/wangtianrui/CEAEval> |
| 🤖 Model (this repo) | <https://huggingface.co/TianRW/CEAEval-Model> |
| 📚 Dataset (CEAEval-D) | <https://huggingface.co/datasets/TianRW/CEAEval-Data> |
| 🌐 Project page / demo | <https://wangtianrui.github.io/ceaeval/> |
## License
Released under **CC BY-NC 4.0** — non-commercial academic research use
only. The released weights do not contain or expose raw audio,
transcripts, or any personally identifiable information.