---
language:
- en
tags:
- speaker-profiling
- speaker-language-model
- speech
- zero-shot
- descriptive-profiling
- colmbo
- custom_code
license: apache-2.0
datasets:
- cmu-mlsp/TEARS
arxiv: 2506.09375
---

<div align="center">

# 🕵 CoLMbo

### **Speaker Language Model for Descriptive Profiling**
*Who is this speaker? CoLMbo will tell you.*


📖 [Paper](https://arxiv.org/abs/2506.09375) · 💻 [GitHub](https://github.com/massabaali7/CoLMbo) · 📦 [TEARS Dataset](https://huggingface.co/datasets/cmu-mlsp/TEARS) · 📧 [Contact](mailto:mbaali@cs.cmu.edu)

</div>

---

## What is CoLMbo?

Traditional speaker recognition answers one question: *"Is this speaker A or B?"*

**CoLMbo** asks something richer: *"What is this speaker **like**?"*

Given a few seconds of audio and a natural language prompt, CoLMbo generates free-form descriptions of the speaker, covering their **gender, age, dialect, height, education**, and more, directly from voice alone, with no labels or metadata required.

> *"The speaker is a male. He is likely between 26 and 35 years old. He speaks with a New England dialect. He has a Bachelor's Degree."*

CoLMbo integrates a **speaker encoder** with **prompt-conditioned GPT-2 decoding**, enabling zero-shot generalization across diverse speaker populations and datasets.

---

## Table of Contents

- [Quick Start](#quick-start)
- [Example Prompts](#example-prompts)
- [Dataset: TEARS](#dataset-tears)
- [Use Cases](#use-cases)
- [Citation](#citation)
- [Authors](#authors)

---

## Quick Start

### Installation

```bash
pip install transformers torch torchaudio
```

### Load & Run

```python
from transformers import AutoModel
import torchaudio

# Load CoLMbo (the model ships custom code, hence trust_remote_code=True)
model = AutoModel.from_pretrained("cmu-mlsp/CoLMbo", trust_remote_code=True)
model.eval()

# Option A: from a waveform tensor
waveform, sr = torchaudio.load("speaker.wav")
print(model.describe(waveform, "Please describe the speaker."))

# Option B: directly from a file path
print(model.describe_file("speaker.wav", "What is the speaker's dialect?"))
```

> ⚠️ Audio should be **mono, 16 kHz**. The model will automatically resample if needed.

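
To check ahead of time whether a file already matches the recommended format, the WAV header can be inspected with Python's standard `wave` module. This is a minimal sketch rather than part of the CoLMbo API: `inspect_wav` and `TARGET_SR` are hypothetical names, and the function only handles uncompressed PCM WAV files.

```python
import wave

TARGET_SR = 16_000  # recommended sample rate from the note above


def inspect_wav(path: str) -> dict:
    """Read a PCM WAV header and report whether it is mono at 16 kHz."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        n_frames = wav.getnframes()
    return {
        "channels": channels,
        "sample_rate": sample_rate,
        "duration_s": n_frames / sample_rate,
        "is_mono_16k": channels == 1 and sample_rate == TARGET_SR,
    }
```

For example, `inspect_wav("speaker.wav")["is_mono_16k"]` indicates whether any conversion is needed. If you prefer to convert explicitly rather than rely on automatic resampling, one option is to average the channels of the loaded tensor and call `torchaudio.functional.resample(waveform, sr, 16000)`.
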

---

## Example Prompts

CoLMbo accepts any natural language question about the speaker:

```python
prompts = [
    "What is the speaker's gender?",
    "What is the speaker's age?",
    "What is the speaker's dialect?",
    "What is the speaker's race?",
    "What is the speaker's height?",
    "What is the speaker's education level?",
    "Please describe the speaker.",
]

# Ask each question about the same recording and print the answer
for prompt in prompts:
    print(f"Q: {prompt}")
    print(f"A: {model.describe_file('speaker.wav', prompt)}\n")
```


**Example output:**
```
Q: What is the speaker's gender?
A: The speaker's gender is male.

Q: What is the speaker's age?
A: The speaker is between 26 and 35 years old.

Q: What is the speaker's dialect?
A: The speaker's dialect is from the New England region.

Q: Please describe the speaker.
A: The speaker is a male. He is likely between 26 and 35 years old.
He speaks with a New England dialect. He has a Bachelor's Degree.
```

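
If downstream code needs structured values rather than sentences, the answers can be post-processed with ordinary pattern matching. The sketch below assumes the responses follow the phrasing of the example output above, which is not guaranteed for free-form generation; `parse_age_range` and `parse_gender` are hypothetical helpers, not part of the CoLMbo API.

```python
import re
from typing import Optional, Tuple


def parse_age_range(answer: str) -> Optional[Tuple[int, int]]:
    """Extract (low, high) from text such as 'between 26 and 35 years old'."""
    match = re.search(r"between (\d+) and (\d+) years old", answer)
    if match is None:
        return None  # phrasing did not match the assumed template
    return int(match.group(1)), int(match.group(2))


def parse_gender(answer: str) -> Optional[str]:
    """Return 'male' or 'female' if either word appears in the answer."""
    match = re.search(r"\b(female|male)\b", answer.lower())
    return match.group(1) if match else None


print(parse_age_range("The speaker is between 26 and 35 years old."))  # (26, 35)
print(parse_gender("The speaker's gender is male."))  # male
```

Returning `None` on a mismatch keeps unexpected phrasings visible to the caller instead of silently producing a wrong value.
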
## Dataset: TEARS

CoLMbo is trained and evaluated on **TEARS**, a large-scale speaker captioning corpus with rich per-speaker annotations.

📦 **[cmu-mlsp/TEARS on Hugging Face](https://huggingface.co/datasets/cmu-mlsp/TEARS)**

| | Split | Utterances | |
| |:---|:---:| |
| | Train | 71,100 | |
| | Test | 44,900 | |
| | **Total** | **116,000** | |

Each example associates an audio file with a set of `(prompt, response)` pairs covering:

| | Attribute | Example Response | |
| |:---|:---| |
| | Gender | *"The speaker's gender is male."* | |
| | Age | *"The speaker is between 26 and 35 years old."* | |
| | Dialect | *"The speaker's dialect is from the New England region."* | |
| | Height | *"The speaker is between 5'8 and 5'11."* | |
| | Education | *"The speaker has a Bachelor's Degree."* | |
| | Description | *"The speaker is a male. He is likely between 26 and 35..."* | |
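
One way to picture such an example in code is a small record holding the audio path and its `(prompt, response)` pairs. This is an illustrative sketch only: the class and field names (`SpeakerExample`, `audio_path`, `qa_pairs`) and the file name are hypothetical, and the actual column layout of TEARS on Hugging Face may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SpeakerExample:
    """Hypothetical in-memory view of one captioned utterance."""
    audio_path: str
    qa_pairs: List[Tuple[str, str]] = field(default_factory=list)

    def as_lookup(self) -> Dict[str, str]:
        """Map each prompt to its response for quick access."""
        return dict(self.qa_pairs)


example = SpeakerExample(
    audio_path="speaker.wav",  # placeholder file name
    qa_pairs=[
        ("What is the speaker's gender?", "The speaker's gender is male."),
        ("What is the speaker's age?", "The speaker is between 26 and 35 years old."),
    ],
)
print(example.as_lookup()["What is the speaker's age?"])
```
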

**Audio sources:**
- [EARS](https://github.com/facebookresearch/ears_dataset): expressive, studio-quality single-speaker recordings
- [TIMIT (LDC93S1)](https://catalog.ldc.upenn.edu/LDC93S1): phonetically balanced American English speech

---

## Use Cases

| | | Use Case | Description | |
| |:---:|:---|:---| |
| | 🔍 | **Speaker Profiling** | Predict age, gender, dialect, education from voice | |
| | 🧩 | **Explainable Speaker Verification** | Human-readable justifications alongside verification decisions | |
| | 🎙️ | **Speaker-Aware Captioning** | Enrich ASR transcripts with speaker metadata | |
| | 🔬 | **Zero-Shot Attribute Prediction** | Query any speaker attribute without task-specific heads | |
| | 🕵️ | **Forensic Audio Analysis** | Generate structured speaker descriptions for investigative use | |
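
As an illustration of the speaker-aware captioning row, the sketch below attaches a generated description to an existing transcript. It assumes a callable with the same `(path, prompt)` argument order as `model.describe_file` in the Quick Start; `caption_with_speaker` and the stub used for demonstration are hypothetical, and the transcript is expected to come from a separate ASR system.

```python
from typing import Callable, Dict


def caption_with_speaker(
    transcript: str,
    audio_path: str,
    describe_file: Callable[[str, str], str],
    prompt: str = "Please describe the speaker.",
) -> Dict[str, str]:
    """Pair an ASR transcript with a free-form speaker description."""
    return {
        "transcript": transcript,
        "speaker": describe_file(audio_path, prompt),
    }


# Stand-in for model.describe_file so the sketch runs without the model.
def fake_describe_file(path: str, prompt: str) -> str:
    return "The speaker is a male."


result = caption_with_speaker("hello world", "speaker.wav", fake_describe_file)
print(result)
```

With the real model loaded, `model.describe_file` can be passed in place of the stub.
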

---

## Citation

If you find CoLMbo useful in your research, please cite:

```bibtex
@misc{CoLMbo,
  title         = {CoLMbo: Speaker Language Model for Descriptive Profiling},
  author        = {Massa Baali and Shuo Han and Syed Abdul Hannan and Purusottam Samal and
                   Karanveer Singh and Soham Deshmukh and Rita Singh and Bhiksha Raj},
  year          = {2025},
  eprint        = {2506.09375},
  archivePrefix = {arXiv},
  url           = {https://arxiv.org/abs/2506.09375},
  primaryClass  = {cs.CL}
}
```

---

## Authors

<div align="center">

Massa Baali · Shuo Han · Syed Abdul Hannan · Purusottam Samal · Karanveer Singh · Soham Deshmukh · Rita Singh · Bhiksha Raj

*Carnegie Mellon University, Language Technologies Institute*
*Machine Learning for Signal Processing Group*

📧 [mbaali@cs.cmu.edu](mailto:mbaali@cs.cmu.edu)

</div>