arxiv:2605.29257

ChildVox: A Speech, Audio, and Large Audio-Language Model Benchmark in Understanding and Characterizing Sound across Childhood

Published on May 28

· Submitted by

Tiantian Feng on May 29

University of Southern California

Upvote

Authors:

Tiantian Feng ,

Abstract

ChildVox presents a comprehensive benchmark for analyzing children's acoustic communication across developmental stages using diverse audio and speech models.

AI-generated summary

We present ChildVox, a novel benchmark for characterizing the diverse acoustic signals through which children communicate. Specifically, ChildVox follows the full developmental trajectory from birth through school age, covering physiological sounds, non-linguistic vocalizations, canonical syllables, and spoken language. ChildVox integrates more than 20 sub-tasks across 17 child-centered audio and speech datasets, enabling systematic cross-corpus and cross-domain comparison. We evaluate a representative range of audio and speech foundation models, including self-supervised, ASR-oriented, and large audio-language models, on tasks including physiological sound classification, vocalization and canonical syllables modeling, and speech quality assessment and recognition. Benchmark results show that ChildVox provides a suite of high-performance models in recognizing a wide range of acoustic signals from children, supporting downstream applications such as characterizing children's language levels and tracking speech production with age.

View arXiv page View PDF Project page Add to collection

Community

tiantiaf

Paper author Paper submitter about 9 hours ago

We evaluate a representative range of audio and speech foundation models, including self-supervised (e.g., SSAST), ASR-oriented (e.g., Whisper), and large audio-language models (e.g., Qwen2Audio), on tasks ranging from physiological sound classification, through vocalization and canonical-syllable modeling, to speech quality assessment and recognition. Benchmark results show that ChildVox provides a suite of high-performance models in recognizing a wide range of acoustic signals from children, supporting downstream applications such as characterizing children's language levels and tracking speech production with age.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2605.29257 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2605.29257 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2605.29257 in a Space README.md to link it from this page.