TidyVoice: A Curated Multilingual Dataset for Speaker Verification Derived from Common Voice
Paper: arXiv 2601.16358
This model is the official baseline system for the TidyLang2026 Language Recognition Challenge.
Challenge website: https://tidylang2026.github.io/
The model is trained for language recognition and serves as the reference system for evaluation and benchmarking.
This model extracts fixed-dimensional language embeddings.
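As an illustration of working with fixed-dimensional embeddings, the sketch below compares two embedding vectors with cosine similarity. It is a minimal example and not the official scoring method: the function name `cosine_similarity` and the toy 4-dimensional vectors are hypothetical stand-ins, and nothing here calls the model or reflects its actual output dimension. For challenge metrics, use the official evaluation scripts.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (hypothetical values,
# not produced by the model).
emb_a = [0.9, 0.1, 0.0, 0.2]
emb_b = [0.8, 0.2, 0.1, 0.1]  # points in nearly the same direction as emb_a
emb_c = [0.0, 0.1, 0.9, 0.3]  # points in a very different direction

# A higher score means the two vectors are more closely aligned.
print(cosine_similarity(emb_a, emb_b) > cosine_similarity(emb_a, emb_c))  # prints True
```

Because cosine similarity ignores vector length, it only makes sense for vectors of the same dimension, which is what a fixed-dimensional embedding guarantees.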
The official evaluation pipeline is available here:
https://github.com/areffarhadi/TidyLang2026-baseline
Please use the provided evaluation scripts to reproduce challenge metrics.
If you use this model, please cite:
@misc{farhadi2026tidy,
title={TidyVoice: A Curated Multilingual Dataset for Speaker Verification Derived from Common Voice},
author={Aref Farhadipour and Jan Marquenie and Srikanth Madikeri and Eleanor Chodroff},
year={2026},
journal={ICASSP2026},
url={https://arxiv.org/abs/2601.16358},
}