---
license: cc-by-nc-4.0
language:
  - zh
  - en
library_name: transformers
tags:
  - speech
  - audio
  - speech-evaluation
  - expressive-speech
  - mandarin
  - chain-of-thought
  - ceaeval
pipeline_tag: audio-text-to-text
---

# CEAEval-Model (CEAEval-M)

CEAEval-M is the speech-LLM judge released together with our ACL paper "Evaluating the Expressive Appropriateness of Speech in Rich Contexts".

Given a Mandarin speech segment together with an ideal expressive plan inferred from its surrounding narrative context, CEAEval-M produces:

```
<think> step-by-step comparison of ideal vs. actual expression,
        with <focus_audio>…</focus_audio> spans pointing to
        audio-grounded cues (emotion / rhythm / intonation /
        recording condition / paralinguistic events) </think>
<score>X.X</score>     # overall expressive appropriateness ∈ [0.0, 5.0]
```
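
The tagged output is easy to post-process with plain string handling. The parser below is an illustrative sketch, not part of the release: `parse_judge_output` is a hypothetical helper, and it assumes the judge emits well-formed tag pairs.

```python
import re

def parse_judge_output(text: str) -> dict:
    """Pull the CoT, audio-grounded spans, and final score out of raw judge output."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    focus = re.findall(r"<focus_audio>(.*?)</focus_audio>", text, re.DOTALL)
    score = re.search(r"<score>(\d+(?:\.\d+)?)</score>", text)
    return {
        "reasoning": think.group(1).strip() if think else None,
        "focus_audio_spans": [s.strip() for s in focus],
        "score": float(score.group(1)) if score else None,
    }

out = parse_judge_output(
    "<think>The plan calls for a calm tone, but <focus_audio>the rising "
    "intonation on the final clause</focus_audio> sounds agitated.</think>"
    "<score>3.5</score>"
)
print(out["score"])  # 3.5
```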

This is the judge half of the planner–judge decoupled pipeline defined in the paper. It is designed to work with a frozen text-only planner (Qwen3-8B) that first summarizes long narrative context into a four-tuple {emotion, rhythm, intonation, recording_condition} via multi-context voting.
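
For intuition only, a per-dimension majority vote over plans produced from several context windows might look like the sketch below; `vote_plan`, the dimension values, and the tie-breaking rule are all invented here, and the planner's actual voting procedure is defined in the paper.

```python
from collections import Counter

DIMENSIONS = ("emotion", "rhythm", "intonation", "recording_condition")

def vote_plan(candidate_plans: list[dict]) -> dict:
    """Majority-vote each dimension of the four-tuple across planner runs;
    on a tie, Counter keeps the value that was seen first."""
    return {
        dim: Counter(p[dim] for p in candidate_plans).most_common(1)[0][0]
        for dim in DIMENSIONS
    }

plans = [
    {"emotion": "anxious", "rhythm": "fast", "intonation": "rising", "recording_condition": "clean"},
    {"emotion": "anxious", "rhythm": "fast", "intonation": "flat",   "recording_condition": "clean"},
    {"emotion": "calm",    "rhythm": "fast", "intonation": "rising", "recording_condition": "clean"},
]
print(vote_plan(plans))
# {'emotion': 'anxious', 'rhythm': 'fast', 'intonation': 'rising', 'recording_condition': 'clean'}
```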

## What's released here

- Model weights in safetensors (4 shards) plus `config.json`, `generation_config.json`, tokenizer, preprocessor, and chat template (see the loading sketch after this list).
- Six extra special tokens the judge uses during training and inference (`<think>`, `</think>`, `<score>`, `</score>`, `<focus_audio>`, `</focus_audio>`), already merged into the tokenizer and embedding matrix.
- A patched `modeling_*` path that implements the adaptive audio attention bias mechanism described in Sec. 3.3.4 and Appendix F of the paper (region-wise bias over the system-prompt, audio, and CoT regions); a toy illustration of the idea appears below.
- `test_datas/` with anonymised sanity samples (audio + JSON) so you can verify the pipeline end-to-end without touching the main dataset.
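
A minimal loading sketch, under two assumptions not confirmed by this card: that the patched modeling files are picked up via `trust_remote_code=True`, and that `AutoModelForCausalLM` is the right entry point for this audio-text-to-text model (the code repository's batch driver is the authoritative reference).

```python
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

repo = "TianRW/CEAEval-Model"
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True, torch_dtype="auto"
)

# The six judge tokens are merged into the tokenizer, so each should
# resolve to a valid (non-UNK) ID:
for tok in ("<think>", "</think>", "<score>", "</score>",
            "<focus_audio>", "</focus_audio>"):
    print(tok, tokenizer.convert_tokens_to_ids(tok))
```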

The full inference pipeline (planner + judge, audio pre-processing, batch driver, sanity examples) lives in the code repository — see Related resources.
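
The actual bias mechanism is adaptive and lives in the patched modeling files (Sec. 3.3.4 and Appendix F); the toy sketch below only illustrates the general shape of a region-wise additive attention bias, with every name and the fixed gain invented for illustration.

```python
import torch

def region_bias(seq_len: int, audio_span: tuple[int, int],
                cot_start: int, audio_gain: float = 1.0) -> torch.Tensor:
    """Toy additive attention bias of shape (seq_len, seq_len), added to
    attention logits: queries in the CoT region get a positive bias toward
    keys in the audio region, nudging reasoning tokens to stay audio-grounded."""
    bias = torch.zeros(seq_len, seq_len)
    a0, a1 = audio_span
    bias[cot_start:, a0:a1] += audio_gain
    return bias

# Example: 12-token sequence, audio occupies positions 4..8, CoT starts at 8.
print(region_bias(12, (4, 8), 8))
```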

## Intended use and limitations

- Intended as a research benchmark and diagnostic tool for expressive-speech generation / selection, not as a standalone decision-making system. Expressive appropriateness is inherently subjective; predictions should be interpreted with appropriate human oversight.
- Trained and evaluated on Mandarin audiobook speech. Applying the model to other languages, styles, or domains (short commands, non-narrative dialogue, etc.) may produce unreliable scores.

## Related resources

This model is one of three companion releases for the paper. Please use them together:

| Resource | Link |
| --- | --- |
| 📄 Paper | *Evaluating the Expressive Appropriateness of Speech in Rich Contexts* (ACL) |
| 💻 Code | https://github.com/wangtianrui/CEAEval |
| 🤖 Model (this repo) | https://huggingface.co/TianRW/CEAEval-Model |
| 📚 Dataset (CEAEval-D) | https://huggingface.co/datasets/TianRW/CEAEval-Data |
| 🌐 Project page / demo | https://wangtianrui.github.io/ceaeval/ |

## License

Released under CC BY-NC 4.0 — non-commercial academic research use only. The released weights do not contain or expose raw audio, transcripts, or any personally identifiable information.