Update README.md
README.md (CHANGED)
@@ -52,9 +52,6 @@ CoLMbo integrates a **speaker encoder** with **prompt-conditioned GPT-2 decoding
 
 - [Quick Start](#quick-start)
 - [Example Prompts](#example-prompts)
-- [Performances on Benchmarks](#performances-on-benchmarks)
-- [Architecture](#architecture)
-- [Training Details](#training-details)
 - [Dataset: TEARS](#dataset-tears)
 - [Use Cases](#use-cases)
 - [Citation](#citation)
@@ -128,79 +125,6 @@ A: The speaker is a male. He is likely between 26 and 35 years old.
 He speaks with a New England dialect. He has a Bachelor's Degree.
 ```
 
----
-
-## Performances on Benchmarks
-
-### Zero-Shot Speaker Attribute Prediction (Macro-F1 % ↑)
-
-| Model | Gender | Age | Dialect | Education |
-|:---|:---:|:---:|:---:|:---:|
-| Majority Baseline | 50.0 | 14.3 | 12.5 | 16.7 |
-| LLaVA-Audio | 71.2 | 22.1 | 31.4 | 18.9 |
-| Qwen-Audio | 74.5 | 24.8 | 34.7 | 21.3 |
-| **CoLMbo (ECAPA)** | **88.6** | **41.2** | **52.3** | **34.7** |
-| **🔵 CoLMbo (PDAF)** | **91.3** | **44.7** | **55.8** | **37.1** |
-
-### Zero-Shot Speaker Description Quality (TEARS Test Set)
-
-| Model | BLEU-4 ↑ | ROUGE-L ↑ | BERTScore ↑ |
-|:---|:---:|:---:|:---:|
-| GPT-4o Audio | 12.4 | 38.7 | 0.841 |
-| Qwen-Audio | 10.9 | 35.2 | 0.829 |
-| **CoLMbo (ECAPA)** | **18.3** | **47.6** | **0.873** |
-| **🔵 CoLMbo (PDAF)** | **19.7** | **49.1** | **0.881** |
-
----
-
-## Architecture
-
-```
-Raw Audio
-    │
-    ▼
-Mel Spectrogram
-    │
-    ▼
-ECAPA-TDNN Encoder → speaker identity
-    │
-    ▼
-sid_mapper → projects to GPT-2 token space
-    │
-    ▼
-[Speaker Prefix | Prompt] → concatenated embeddings
-    │
-    ▼
-GPT-2 LM → generates description
-    │
-    ▼
-"The speaker is a male..."
-```
-
-| Component | Role | Details |
-|:---|:---|:---|
-| **Mel Spectrogram** | Audio frontend | 80-dim log-mel, 16 kHz |
-| **ECAPA-TDNN** | Speaker encoder | 192-dim utterance embedding, 1024 channels |
-| **sid_mapper** | Projection | Speaker emb → prefix tokens in GPT-2 space |
-| **GPT-2** | Language decoder | Prompt-conditioned text generation |
-
----
-
-## Training Details
-
-| Setting | Value |
-|:---|:---|
-| Training Data | TEARS (71K utterances; TIMIT + EARS) |
-| Speaker Encoders | ECAPA-TDNN / PDAF |
-| Language Model | GPT-2 (fine-tuned end-to-end) |
-| Mapper Type | MLP |
-| SID Prefix Length | 40 tokens |
-| Training Objective | Cross-entropy over response tokens |
-| Prompt Format | Instruction-following: question → answer |
-| Evaluation | TEARS test split (44.9K utterances) |
-
----
-
 ## Dataset: TEARS
 
 CoLMbo is trained and evaluated on **TEARS**, a large-scale speaker captioning corpus with rich per-speaker annotations.
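The removed Architecture section describes prefix conditioning: the 192-dim ECAPA-TDNN speaker embedding is projected by `sid_mapper` (an MLP, per the removed Training Details table) into a 40-token prefix in GPT-2's embedding space, then concatenated with the embedded prompt before decoding. A minimal NumPy sketch of that data flow — the weights, hidden width, prompt length, and the GPT-2 base embedding size of 768 are illustrative assumptions, not values from this repository:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions (sketch only): ECAPA-TDNN emits a 192-dim utterance
# embedding; GPT-2 base uses 768-dim token embeddings; the removed
# Training Details table lists a 40-token SID prefix.
SPK_DIM, GPT2_DIM, PREFIX_LEN = 192, 768, 40
HIDDEN = 256  # hypothetical mapper hidden width

# Random stand-ins for trained sid_mapper weights.
W1 = rng.standard_normal((SPK_DIM, HIDDEN)) * 0.02
W2 = rng.standard_normal((HIDDEN, PREFIX_LEN * GPT2_DIM)) * 0.02

def sid_mapper(spk_emb: np.ndarray) -> np.ndarray:
    """Project a speaker embedding to a sequence of GPT-2-space prefix vectors."""
    h = np.maximum(spk_emb @ W1, 0.0)          # one ReLU hidden layer
    return (h @ W2).reshape(PREFIX_LEN, GPT2_DIM)

spk_emb = rng.standard_normal(SPK_DIM)             # ECAPA-TDNN output (stand-in)
prompt_embs = rng.standard_normal((12, GPT2_DIM))  # embedded prompt tokens (stand-in)

# [Speaker Prefix | Prompt]: the concatenation fed to the GPT-2 decoder.
prefix = sid_mapper(spk_emb)
decoder_input = np.concatenate([prefix, prompt_embs], axis=0)
print(decoder_input.shape)  # (52, 768): 40 prefix vectors + 12 prompt tokens
```

In the real model the concatenated sequence would be passed as input embeddings to the fine-tuned GPT-2, which then generates the description autoregressively.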
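The "Cross-entropy over response tokens" objective in the removed Training Details table implies a loss mask: prefix and prompt positions contribute nothing, and only the answer tokens are supervised. A toy NumPy sketch under that assumption (sequence length, mask layout, and variable names are illustrative):

```python
import numpy as np

def masked_xent(logits: np.ndarray, targets: np.ndarray, loss_mask: np.ndarray) -> float:
    """Cross-entropy averaged only over positions where loss_mask == 1."""
    # Numerically stable log-softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    return float((nll * loss_mask).sum() / loss_mask.sum())

VOCAB = 50257                      # GPT-2 vocabulary size
rng = np.random.default_rng(1)
logits = rng.standard_normal((8, VOCAB))      # toy decoder outputs
targets = rng.integers(0, VOCAB, size=8)      # toy target token ids
mask = np.array([0, 0, 0, 0, 0, 1, 1, 1])     # supervise only the response tokens

loss = masked_xent(logits, targets, mask)
print(round(loss, 3))
```

Because masked positions are zeroed before averaging, changing the logits at a prompt position leaves the loss unchanged, which is exactly the behavior "over response tokens" requires.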