---
license: cc-by-nc-4.0
language:
- zh
- en
library_name: transformers
tags:
- speech
- audio
- speech-evaluation
- expressive-speech
- mandarin
- chain-of-thought
- ceaeval
pipeline_tag: audio-text-to-text
---

# CEAEval-Model (CEAEval-M)

**CEAEval-M** is the speech-LLM *judge* released together with our ACL paper
*"Evaluating the Expressive Appropriateness of Speech in Rich Contexts"*.

Given a Mandarin speech segment together with an *ideal expressive plan*
inferred from its surrounding narrative context, CEAEval-M produces:

```
<think> step-by-step comparison of ideal vs. actual expression,
        with <focus_audio>…</focus_audio> spans pointing to
        audio-grounded cues (emotion / rhythm / intonation /
        recording condition / paralinguistic events) </think>
<score>X.X</score>   # overall expressive appropriateness ∈ [0.0, 5.0]
```
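
In practice you will usually want to pull the numeric score (and optionally the reasoning
trace) out of the generated string. Below is a minimal parsing sketch, assuming the
generation follows the `<think>…</think><score>…</score>` template above; the function
name is illustrative and not part of the released code.

```python
import re

def parse_judge_output(generated: str):
    """Extract the CoT trace and the appropriateness score from a CEAEval-M generation.

    Assumes the model followed the <think>...</think><score>...</score> template;
    returns (reasoning, score) with the score clamped to the documented [0.0, 5.0] range.
    """
    think = re.search(r"<think>(.*?)</think>", generated, flags=re.DOTALL)
    score = re.search(r"<score>\s*([0-9]+(?:\.[0-9]+)?)\s*</score>", generated)
    reasoning = think.group(1).strip() if think else ""
    value = min(max(float(score.group(1)), 0.0), 5.0) if score else None
    return reasoning, value

# Example:
# reasoning, score = parse_judge_output(decoded_text)
# print(score)  # e.g. 3.5
```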

This model is the *judge* half of the decoupled planner–judge pipeline defined
in the paper. It is designed to work with a frozen text-only planner
([Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B)) that first summarizes
long narrative context into a four-tuple
`{emotion, rhythm, intonation, recording_condition}` via multi-context
voting, as sketched below.
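
To illustrate the multi-context voting step: the planner is queried once per context
window and the per-field majority label becomes the ideal plan. This is only a hedged
sketch of the idea; the actual prompts, field vocabularies, and window construction live
in the code repository, and all names below are illustrative assumptions.

```python
from collections import Counter

FIELDS = ("emotion", "rhythm", "intonation", "recording_condition")

def vote_plan(candidate_plans: list[dict]) -> dict:
    """Majority-vote each field of the expressive plan across several
    context windows (one planner call per window); ties resolve in favour
    of the label seen first."""
    voted = {}
    for field in FIELDS:
        labels = [plan[field] for plan in candidate_plans if field in plan]
        voted[field] = Counter(labels).most_common(1)[0][0]
    return voted

# Example with hypothetical planner outputs from three context windows:
plans = [
    {"emotion": "anxious", "rhythm": "fast", "intonation": "rising", "recording_condition": "clean"},
    {"emotion": "anxious", "rhythm": "fast", "intonation": "flat",   "recording_condition": "clean"},
    {"emotion": "calm",    "rhythm": "fast", "intonation": "rising", "recording_condition": "clean"},
]
print(vote_plan(plans))
# {'emotion': 'anxious', 'rhythm': 'fast', 'intonation': 'rising', 'recording_condition': 'clean'}
```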

## What's released here

- Model weights in `safetensors` (4 shards) plus `config.json`,
  `generation_config.json`, tokenizer, preprocessor, and chat template.
- **Six extra special tokens** the judge uses during training and
  inference (`<think>`, `</think>`, `<score>`, `</score>`,
  `<focus_audio>`, `</focus_audio>`) — already merged into the tokenizer
  and embedding matrix (the loading sketch further below shows how to check them).
- A patched `modeling_*` path that implements the **adaptive audio
  attention bias** mechanism described in Sec. 3.3.4 and Appendix F of
  the paper (region-wise bias over system-prompt / audio / CoT regions);
  a simplified illustration follows the list.
- `test_datas/` with **anonymised** sanity samples (audio + JSON) so
  you can verify the pipeline end-to-end without touching the main
  dataset.
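
The exact formulation of the adaptive audio attention bias is given in the paper and in
the patched modeling files. The toy sketch below only illustrates the general idea of a
region-wise additive bias (one learnable scalar per region, broadcast onto the pre-softmax
attention scores); all names, shapes, and the choice of per-region scalars are assumptions
made for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class RegionWiseAttentionBias(nn.Module):
    """Toy illustration of a region-wise additive attention bias.

    Each key position carries a region id (0 = system prompt, 1 = audio tokens,
    2 = chain-of-thought text); a learnable scalar per region is added to the
    pre-softmax attention scores for keys in that region. The released model's
    actual mechanism is defined in Sec. 3.3.4 / Appendix F and the patched
    modeling files.
    """

    def __init__(self, num_regions: int = 3):
        super().__init__()
        self.bias = nn.Parameter(torch.zeros(num_regions))

    def forward(self, attn_scores: torch.Tensor, region_ids: torch.Tensor) -> torch.Tensor:
        # attn_scores: (batch, heads, query_len, key_len)
        # region_ids:  (batch, key_len) integer region tag per key position
        per_key_bias = self.bias[region_ids]           # (batch, key_len)
        return attn_scores + per_key_bias[:, None, None, :]

# Example: 1 sequence, 2 heads, 4 queries, 6 keys split into prompt / audio / CoT regions
scores = torch.randn(1, 2, 4, 6)
regions = torch.tensor([[0, 0, 1, 1, 1, 2]])
biased = RegionWiseAttentionBias()(scores, regions)
print(biased.shape)  # torch.Size([1, 2, 4, 6])
```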

The full inference pipeline (planner + judge, audio pre-processing,
batch driver, sanity examples) lives in the code repository — see
[Related resources](#related-resources).
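
For a quick local check that the weights, tokenizer, and extra special tokens load
correctly, a minimal sketch is shown below. It assumes the repository's custom code path
(hence `trust_remote_code=True`); the auto classes and the `processor.tokenizer` attribute
are assumptions, so treat the repository's own examples as authoritative, and use the
batch driver there for actual audio-conditioned scoring.

```python
from transformers import AutoProcessor, AutoModelForCausalLM

REPO = "TianRW/CEAEval-Model"

# The patched modeling path ships with the repo, so remote code must be trusted.
processor = AutoProcessor.from_pretrained(REPO, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(REPO, trust_remote_code=True, torch_dtype="auto")

# Sanity-check that the six judge-specific special tokens are in the vocabulary.
tokenizer = processor.tokenizer
special = ["<think>", "</think>", "<score>", "</score>", "<focus_audio>", "</focus_audio>"]
print(dict(zip(special, tokenizer.convert_tokens_to_ids(special))))
```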

## Intended use and limitations

- Intended as a **research benchmark and diagnostic tool** for
  expressive-speech generation / selection, not as a standalone
  decision-making system. Expressive appropriateness is inherently
  subjective; predictions should be interpreted with appropriate human
  oversight.
- Trained and evaluated on **Mandarin audiobook speech**. Applying the
  model to other languages, styles, or domains (short commands,
  non-narrative dialogue, etc.) may produce unreliable scores.

## Related resources

This model is one of three companion releases for the paper. **Please
use them together:**

| Resource | Link |
| --- | --- |
| 📄 Paper | *Evaluating the Expressive Appropriateness of Speech in Rich Contexts* (ACL) |
| 💻 Code | <https://github.com/wangtianrui/CEAEval> |
| 🤖 Model (this repo) | <https://huggingface.co/TianRW/CEAEval-Model> |
| 📚 Dataset (CEAEval-D) | <https://huggingface.co/datasets/TianRW/CEAEval-Data> |
| 🌐 Project page / demo | <https://wangtianrui.github.io/ceaeval/> |

## License

Released under **CC BY-NC 4.0** — non-commercial academic research use
only. The released weights do not contain or expose raw audio,
transcripts, or any personally identifiable information.