Papers
arxiv:2605.29257

ChildVox: A Speech, Audio, and Large Audio-Language Model Benchmark in Understanding and Characterizing Sound across Childhood

Published on May 28
· Submitted by
Tiantian Feng
on May 29
Authors:
,
,
,
,
,
,
,
,
,
,
,

Abstract

ChildVox presents a comprehensive benchmark for analyzing children's acoustic communication across developmental stages using diverse audio and speech models.

AI-generated summary

We present ChildVox, a novel benchmark for characterizing the diverse acoustic signals through which children communicate. Specifically, ChildVox follows the full developmental trajectory from birth through school age, covering physiological sounds, non-linguistic vocalizations, canonical syllables, and spoken language. ChildVox integrates more than 20 sub-tasks across 17 child-centered audio and speech datasets, enabling systematic cross-corpus and cross-domain comparison. We evaluate a representative range of audio and speech foundation models, including self-supervised, ASR-oriented, and large audio-language models, on tasks including physiological sound classification, vocalization and canonical syllables modeling, and speech quality assessment and recognition. Benchmark results show that ChildVox provides a suite of high-performance models in recognizing a wide range of acoustic signals from children, supporting downstream applications such as characterizing children's language levels and tracking speech production with age.

Community

Paper author Paper submitter

We present ChildVox, a novel benchmark for characterizing the diverse acoustic signals through which children communicate. Specifically, ChildVox follows the full developmental trajectory from birth through school age, covering physiological sounds, non-linguistic vocalizations, canonical syllables, and spoken language. ChildVox integrates more than 20 sub-tasks across more than 15 child-centered audio and speech datasets, enabling systematic cross-corpus and cross-domain comparison.

We evaluate a representative range of audio and speech foundation models, including self-supervised (e.g., SSAST), ASR-oriented (e.g., Whisper), and large audio-language models (e.g., Qwen2Audio), on tasks ranging from physiological sound classification, through vocalization and canonical-syllable modeling, to speech quality assessment and recognition. Benchmark results show that ChildVox provides a suite of high-performance models in recognizing a wide range of acoustic signals from children, supporting downstream applications such as characterizing children's language levels and tracking speech production with age.

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2605.29257 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2605.29257 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2605.29257 in a Space README.md to link it from this page.

Collections including this paper 1