---
license: cc-by-nc-4.0
language:
- zh
- en
library_name: transformers
tags:
- speech
- audio
- speech-evaluation
- expressive-speech
- mandarin
- chain-of-thought
- ceaeval
pipeline_tag: audio-text-to-text
---

# CEAEval-Model (CEAEval-M)

**CEAEval-M** is the speech-LLM *judge* released together with our ACL paper
*"Evaluating the Expressive Appropriateness of Speech in Rich Contexts"*.

Given a Mandarin speech segment together with an *ideal expressive plan*
inferred from its surrounding narrative context, CEAEval-M produces:

```
<think> step-by-step comparison of ideal vs. actual expression,
        with <focus_audio>…</focus_audio> spans pointing to
        audio-grounded cues (emotion / rhythm / intonation /
        recording condition / paralinguistic events) </think>
<score>X.X</score>   # overall expressive appropriateness ∈ [0.0, 5.0]
```
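
In practice you will usually want to pull the numeric score (and optionally the reasoning
trace) out of the generated string. Below is a minimal parsing sketch, assuming the
generation follows the `<think>…</think><score>…</score>` template above; the function
name is illustrative and not part of the released code.

```python
import re

def parse_judge_output(generated: str):
    """Extract the CoT trace and the appropriateness score from a CEAEval-M generation.

    Assumes the model followed the <think>...</think><score>...</score> template;
    returns (reasoning, score) with the score clamped to the documented [0.0, 5.0] range.
    """
    think = re.search(r"<think>(.*?)</think>", generated, flags=re.DOTALL)
    score = re.search(r"<score>\s*([0-9]+(?:\.[0-9]+)?)\s*</score>", generated)
    reasoning = think.group(1).strip() if think else ""
    value = min(max(float(score.group(1)), 0.0), 5.0) if score else None
    return reasoning, value

# Example:
# reasoning, score = parse_judge_output(decoded_text)
# print(score)  # e.g. 3.5
```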

This model is the *judge* half of the decoupled planner–judge pipeline defined
in the paper. It is designed to work with a frozen text-only planner
([Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B)) that first summarizes
long narrative context into a four-tuple
`{emotion, rhythm, intonation, recording_condition}` via multi-context
voting, as sketched below.
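
To illustrate the multi-context voting step: the planner is queried once per context
window and the per-field majority label becomes the ideal plan. This is only a hedged
sketch of the idea; the actual prompts, field vocabularies, and window construction live
in the code repository, and all names below are illustrative assumptions.

```python
from collections import Counter

FIELDS = ("emotion", "rhythm", "intonation", "recording_condition")

def vote_plan(candidate_plans: list[dict]) -> dict:
    """Majority-vote each field of the expressive plan across several
    context windows (one planner call per window); ties resolve in favour
    of the label seen first."""
    voted = {}
    for field in FIELDS:
        labels = [plan[field] for plan in candidate_plans if field in plan]
        voted[field] = Counter(labels).most_common(1)[0][0]
    return voted

# Example with hypothetical planner outputs from three context windows:
plans = [
    {"emotion": "anxious", "rhythm": "fast", "intonation": "rising", "recording_condition": "clean"},
    {"emotion": "anxious", "rhythm": "fast", "intonation": "flat",   "recording_condition": "clean"},
    {"emotion": "calm",    "rhythm": "fast", "intonation": "rising", "recording_condition": "clean"},
]
print(vote_plan(plans))
# {'emotion': 'anxious', 'rhythm': 'fast', 'intonation': 'rising', 'recording_condition': 'clean'}
```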

## What's released here

- Model weights in `safetensors` (4 shards) plus `config.json`,
  `generation_config.json`, tokenizer, preprocessor, and chat template.
- **Six extra special tokens** the judge uses during training and
  inference (`<think>`, `</think>`, `<score>`, `</score>`,
  `<focus_audio>`, `</focus_audio>`) — already merged into the tokenizer
  and embedding matrix (the loading sketch further below shows how to check them).
- A patched `modeling_*` path that implements the **adaptive audio
  attention bias** mechanism described in Sec. 3.3.4 and Appendix F of
  the paper (region-wise bias over system-prompt / audio / CoT regions);
  a simplified illustration follows the list.
- `test_datas/` with **anonymised** sanity samples (audio + JSON) so
  you can verify the pipeline end-to-end without touching the main
  dataset.
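
The exact formulation of the adaptive audio attention bias is given in the paper and in
the patched modeling files. The toy sketch below only illustrates the general idea of a
region-wise additive bias (one learnable scalar per region, broadcast onto the pre-softmax
attention scores); all names, shapes, and the choice of per-region scalars are assumptions
made for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class RegionWiseAttentionBias(nn.Module):
    """Toy illustration of a region-wise additive attention bias.

    Each key position carries a region id (0 = system prompt, 1 = audio tokens,
    2 = chain-of-thought text); a learnable scalar per region is added to the
    pre-softmax attention scores for keys in that region. The released model's
    actual mechanism is defined in Sec. 3.3.4 / Appendix F and the patched
    modeling files.
    """

    def __init__(self, num_regions: int = 3):
        super().__init__()
        self.bias = nn.Parameter(torch.zeros(num_regions))

    def forward(self, attn_scores: torch.Tensor, region_ids: torch.Tensor) -> torch.Tensor:
        # attn_scores: (batch, heads, query_len, key_len)
        # region_ids:  (batch, key_len) integer region tag per key position
        per_key_bias = self.bias[region_ids]           # (batch, key_len)
        return attn_scores + per_key_bias[:, None, None, :]

# Example: 1 sequence, 2 heads, 4 queries, 6 keys split into prompt / audio / CoT regions
scores = torch.randn(1, 2, 4, 6)
regions = torch.tensor([[0, 0, 1, 1, 1, 2]])
biased = RegionWiseAttentionBias()(scores, regions)
print(biased.shape)  # torch.Size([1, 2, 4, 6])
```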

The full inference pipeline (planner + judge, audio pre-processing,
batch driver, sanity examples) lives in the code repository — see
[Related resources](#related-resources).
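
For a quick local check that the weights, tokenizer, and extra special tokens load
correctly, a minimal sketch is shown below. It assumes the repository's custom code path
(hence `trust_remote_code=True`); the auto classes and the `processor.tokenizer` attribute
are assumptions, so treat the repository's own examples as authoritative, and use the
batch driver there for actual audio-conditioned scoring.

```python
from transformers import AutoProcessor, AutoModelForCausalLM

REPO = "TianRW/CEAEval-Model"

# The patched modeling path ships with the repo, so remote code must be trusted.
processor = AutoProcessor.from_pretrained(REPO, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(REPO, trust_remote_code=True, torch_dtype="auto")

# Sanity-check that the six judge-specific special tokens are in the vocabulary.
tokenizer = processor.tokenizer
special = ["<think>", "</think>", "<score>", "</score>", "<focus_audio>", "</focus_audio>"]
print(dict(zip(special, tokenizer.convert_tokens_to_ids(special))))
```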

## Intended use and limitations

- Intended as a **research benchmark and diagnostic tool** for
  expressive-speech generation / selection, not as a standalone
  decision-making system. Expressive appropriateness is inherently
  subjective; predictions should be interpreted with appropriate human
  oversight.
- Trained and evaluated on **Mandarin audiobook speech**. Applying the
  model to other languages, styles, or domains (short commands,
  non-narrative dialogue, etc.) may produce unreliable scores.

## Related resources

This model is one of three companion releases for the paper. **Please
use them together:**

| Resource | Link |
| --- | --- |
| 📄 Paper | *Evaluating the Expressive Appropriateness of Speech in Rich Contexts* (ACL) |
| 💻 Code | <https://github.com/wangtianrui/CEAEval> |
| 🤖 Model (this repo) | <https://huggingface.co/TianRW/CEAEval-Model> |
| 📚 Dataset (CEAEval-D) | <https://huggingface.co/datasets/TianRW/CEAEval-Data> |
| 🌐 Project page / demo | <https://wangtianrui.github.io/ceaeval/> |

## License

Released under **CC BY-NC 4.0** — non-commercial academic research use
only. The released weights do not contain or expose raw audio,
transcripts, or any personally identifiable information.