---
license: cc-by-nc-4.0
language:
- zh
- en
library_name: transformers
tags:
- speech
- audio
- speech-evaluation
- expressive-speech
- mandarin
- chain-of-thought
- ceaeval
pipeline_tag: audio-text-to-text
---

# CEAEval-Model (CEAEval-M)

**CEAEval-M** is the speech-LLM *judge* released together with our ACL paper *"Evaluating the Expressive Appropriateness of Speech in Rich Contexts"*. Given a Mandarin speech segment together with an *ideal expressive plan* inferred from its surrounding narrative context, CEAEval-M produces

```
step-by-step comparison of ideal vs. actual expression, with spans pointing to audio-grounded cues (emotion / rhythm / intonation / recording condition / paralinguistic events)
X.X  # overall expressive appropriateness ∈ [0.0, 5.0]
```

This is the *judge* half of the planner–judge decoupled pipeline defined in the paper. It is designed to work with a frozen text-only planner ([Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B)) that first summarizes long narrative context into a four-tuple `{emotion, rhythm, intonation, recording_condition}` via multi-context voting.

## What's released here

- Model weights in `safetensors` (4 shards) plus `config.json`, `generation_config.json`, tokenizer, preprocessor, and chat template.
- **Six extra special tokens** the judge uses during training and inference (``, ``, ``, ``, ``, ``) — already merged into the tokenizer and embedding matrix.
- A patched `modeling_*` path that implements the **adaptive audio attention bias** mechanism described in Sec. 3.3.4 and Appendix F of the paper (region-wise bias over system-prompt / audio / CoT regions).
- `test_datas/` with **anonymised** sanity samples (audio + JSON) so you can verify the pipeline end-to-end without touching the main dataset.

The full inference pipeline (planner + judge, audio pre-processing, batch driver, sanity examples) lives in the code repository — see [Related resources](#related-resources).
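To make the two interfaces above concrete, here is a minimal, hypothetical sketch of the glue code around the pipeline: a majority vote over per-context planner outputs to obtain the four-tuple, and a parser that pulls the final `X.X` score out of the judge's CoT + score text. The function names and the exact voting rule are illustrative assumptions, not the released driver; consult the code repository for the actual implementation.

```python
import re
from collections import Counter

PLAN_FIELDS = ("emotion", "rhythm", "intonation", "recording_condition")

def vote_plan(plans):
    """Hypothetical multi-context voting: majority-vote each field of the
    planner's four-tuple across several context windows."""
    return {
        field: Counter(p[field] for p in plans).most_common(1)[0][0]
        for field in PLAN_FIELDS
    }

def parse_judge_score(text):
    """Extract the final numeric score from the judge's output (the CoT
    followed by an `X.X` line), clamped to the documented [0.0, 5.0] range.
    Returns None if no score line is found."""
    for line in reversed(text.strip().splitlines()):
        m = re.match(r"\s*(\d+(?:\.\d+)?)", line)
        if m:
            return max(0.0, min(5.0, float(m.group(1))))
    return None
```

For example, `parse_judge_score("...step-by-step comparison...\n4.5  # overall")` returns `4.5`, and a judge output with no trailing score line yields `None`.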
## Intended use and limitations

- Intended as a **research benchmark and diagnostic tool** for expressive-speech generation / selection, not as a standalone decision-making system. Expressive appropriateness is inherently subjective; predictions should be interpreted with appropriate human oversight.
- Trained and evaluated on **Mandarin audiobook speech**. Applying the model to other languages, styles, or domains (short commands, non-narrative dialogue, etc.) may produce unreliable scores.

## Related resources

This model is one of three companion releases for the paper. **Please use them together:**

| Resource | Link |
| --- | --- |
| 📄 Paper | *Evaluating the Expressive Appropriateness of Speech in Rich Contexts* (ACL) |
| 💻 Code | |
| 🤖 Model (this repo) | |
| 📚 Dataset (CEAEval-D) | |
| 🌐 Project page / demo | |

## License

Released under **CC BY-NC 4.0** — non-commercial academic research use only. The released weights do not contain or expose raw audio, transcripts, or any personally identifiable information.