Update README.md
README.md (CHANGED)
@@ -52,9 +52,6 @@ CoLMbo integrates a **speaker encoder** with **prompt-conditioned GPT-2 decoding
 
 - [Quick Start](#quick-start)
 - [Example Prompts](#example-prompts)
-- [Performances on Benchmarks](#performances-on-benchmarks)
-- [Architecture](#architecture)
-- [Training Details](#training-details)
 - [Dataset: TEARS](#dataset-tears)
 - [Use Cases](#use-cases)
 - [Citation](#citation)
@@ -128,79 +125,6 @@ A: The speaker is a male. He is likely between 26 and 35 years old.
 He speaks with a New England dialect. He has a Bachelor's Degree.
 ```
 
----
-
-## Performances on Benchmarks
-
-### Zero-Shot Speaker Attribute Prediction (Macro-F1 % ↑)
-
-| Model | Gender | Age | Dialect | Education |
-|:---|:---:|:---:|:---:|:---:|
-| Majority Baseline | 50.0 | 14.3 | 12.5 | 16.7 |
-| LLaVA-Audio | 71.2 | 22.1 | 31.4 | 18.9 |
-| Qwen-Audio | 74.5 | 24.8 | 34.7 | 21.3 |
-| **CoLMbo (ECAPA)** | **88.6** | **41.2** | **52.3** | **34.7** |
-| **🔵 CoLMbo (PDAF)** | **91.3** | **44.7** | **55.8** | **37.1** |
-
-### Zero-Shot Speaker Description Quality (TEARS Test Set)
-
-| Model | BLEU-4 ↑ | ROUGE-L ↑ | BERTScore ↑ |
-|:---|:---:|:---:|:---:|
-| GPT-4o Audio | 12.4 | 38.7 | 0.841 |
-| Qwen-Audio | 10.9 | 35.2 | 0.829 |
-| **CoLMbo (ECAPA)** | **18.3** | **47.6** | **0.873** |
-| **🔵 CoLMbo (PDAF)** | **19.7** | **49.1** | **0.881** |
-
----
-
-## Architecture
-
-```
-Raw Audio
-    │
-    ▼
-Mel Spectrogram
-    │
-    ▼
-ECAPA-TDNN Encoder → speaker identity
-    │
-    ▼
-sid_mapper → projects to GPT-2 token space
-    │
-    ▼
-[Speaker Prefix | Prompt] → concatenated embeddings
-    │
-    ▼
-GPT-2 LM → generates description
-    │
-    ▼
-"The speaker is a male..."
-```
-
-| Component | Role | Details |
-|:---|:---|:---|
-| **Mel Spectrogram** | Audio frontend | 80-dim log-mel, 16 kHz |
-| **ECAPA-TDNN** | Speaker encoder | 192-dim utterance embedding, 1024 channels |
-| **sid_mapper** | Projection | Speaker emb → prefix tokens in GPT-2 space |
-| **GPT-2** | Language decoder | Prompt-conditioned text generation |
-
----
-
-## Training Details
-
-| Setting | Value |
-|:---|:---|
-| Training Data | TEARS (71K utterances; TIMIT + EARS) |
-| Speaker Encoders | ECAPA-TDNN / PDAF |
-| Language Model | GPT-2 (fine-tuned end-to-end) |
-| Mapper Type | MLP |
-| SID Prefix Length | 40 tokens |
-| Training Objective | Cross-entropy over response tokens |
-| Prompt Format | Instruction-following: question → answer |
-| Evaluation | TEARS test split (44.9K utterances) |
-
----
-
 ## Dataset: TEARS
 
 CoLMbo is trained and evaluated on **TEARS**, a large-scale speaker captioning corpus with rich per-speaker annotations.
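The removed Architecture section describes prefix conditioning: the 192-dim ECAPA-TDNN speaker embedding is projected by `sid_mapper` (an MLP, per the removed Training Details table) into a 40-token prefix in GPT-2's embedding space, then concatenated with the embedded prompt before decoding. A minimal NumPy sketch of that data flow — the weights, hidden width, prompt length, and the GPT-2 base embedding size of 768 are illustrative assumptions, not values from this repository:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions (sketch only): ECAPA-TDNN emits a 192-dim utterance
# embedding; GPT-2 base uses 768-dim token embeddings; the removed
# Training Details table lists a 40-token SID prefix.
SPK_DIM, GPT2_DIM, PREFIX_LEN = 192, 768, 40
HIDDEN = 256  # hypothetical mapper hidden width

# Random stand-ins for trained sid_mapper weights.
W1 = rng.standard_normal((SPK_DIM, HIDDEN)) * 0.02
W2 = rng.standard_normal((HIDDEN, PREFIX_LEN * GPT2_DIM)) * 0.02

def sid_mapper(spk_emb: np.ndarray) -> np.ndarray:
    """Project a speaker embedding to a sequence of GPT-2-space prefix vectors."""
    h = np.maximum(spk_emb @ W1, 0.0)          # one ReLU hidden layer
    return (h @ W2).reshape(PREFIX_LEN, GPT2_DIM)

spk_emb = rng.standard_normal(SPK_DIM)             # ECAPA-TDNN output (stand-in)
prompt_embs = rng.standard_normal((12, GPT2_DIM))  # embedded prompt tokens (stand-in)

# [Speaker Prefix | Prompt]: the concatenation fed to the GPT-2 decoder.
prefix = sid_mapper(spk_emb)
decoder_input = np.concatenate([prefix, prompt_embs], axis=0)
print(decoder_input.shape)  # (52, 768): 40 prefix vectors + 12 prompt tokens
```

In the real model the concatenated sequence would be passed as input embeddings to the fine-tuned GPT-2, which then generates the description autoregressively.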
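The "Cross-entropy over response tokens" objective in the removed Training Details table implies a loss mask: prefix and prompt positions contribute nothing, and only the answer tokens are supervised. A toy NumPy sketch under that assumption (sequence length, mask layout, and variable names are illustrative):

```python
import numpy as np

def masked_xent(logits: np.ndarray, targets: np.ndarray, loss_mask: np.ndarray) -> float:
    """Cross-entropy averaged only over positions where loss_mask == 1."""
    # Numerically stable log-softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    return float((nll * loss_mask).sum() / loss_mask.sum())

VOCAB = 50257                      # GPT-2 vocabulary size
rng = np.random.default_rng(1)
logits = rng.standard_normal((8, VOCAB))      # toy decoder outputs
targets = rng.integers(0, VOCAB, size=8)      # toy target token ids
mask = np.array([0, 0, 0, 0, 0, 1, 1, 1])     # supervise only the response tokens

loss = masked_xent(logits, targets, mask)
print(round(loss, 3))
```

Because masked positions are zeroed before averaging, changing the logits at a prompt position leaves the loss unchanged, which is exactly the behavior "over response tokens" requires.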