massabaali committed
Commit e195287 · verified · 1 Parent(s): 8e9af6f

Update README.md

  1. README.md +0 -76
README.md CHANGED
@@ -52,9 +52,6 @@ CoLMbo integrates a **speaker encoder** with **prompt-conditioned GPT-2 decoding
 
  - [Quick Start](#quick-start)
  - [Example Prompts](#example-prompts)
- - [Performances on Benchmarks](#performances-on-benchmarks)
- - [Architecture](#architecture)
- - [Training Details](#training-details)
  - [Dataset: TEARS](#dataset-tears)
  - [Use Cases](#use-cases)
  - [Citation](#citation)
@@ -128,79 +125,6 @@ A: The speaker is a male. He is likely between 26 and 35 years old.
  He speaks with a New England dialect. He has a Bachelor's Degree.
  ```
 
- ---
-
- ## Performances on Benchmarks
-
- ### Zero-Shot Speaker Attribute Prediction (Macro-F1 % ↑)
-
- | Model | Gender | Age | Dialect | Education |
- |:---|:---:|:---:|:---:|:---:|
- | Majority Baseline | 50.0 | 14.3 | 12.5 | 16.7 |
- | LLaVA-Audio | 71.2 | 22.1 | 31.4 | 18.9 |
- | Qwen-Audio | 74.5 | 24.8 | 34.7 | 21.3 |
- | **CoLMbo (ECAPA)** | **88.6** | **41.2** | **52.3** | **34.7** |
- | **🕵 CoLMbo (PDAF)** | **91.3** | **44.7** | **55.8** | **37.1** |
-
- ### Zero-Shot Speaker Description Quality (TEARS Test Set)
-
- | Model | BLEU-4 ↑ | ROUGE-L ↑ | BERTScore ↑ |
- |:---|:---:|:---:|:---:|
- | GPT-4o Audio | 12.4 | 38.7 | 0.841 |
- | Qwen-Audio | 10.9 | 35.2 | 0.829 |
- | **CoLMbo (ECAPA)** | **18.3** | **47.6** | **0.873** |
- | **🕵 CoLMbo (PDAF)** | **19.7** | **49.1** | **0.881** |
-
- ---
-
- ## Architecture
-
- ```
- Raw Audio
- │
- ▼
- Mel Spectrogram
- │
- ▼
- ECAPA-TDNN Encoder ← speaker identity
- │
- ▼
- sid_mapper ← projects to GPT-2 token space
- │
- ▼
- [Speaker Prefix | Prompt] ← concatenated embeddings
- │
- ▼
- GPT-2 LM ← generates description
- │
- ▼
- "The speaker is a male..."
- ```
-
- | Component | Role | Details |
- |:---|:---|:---|
- | **Mel_Spectrogram** | Audio frontend | 80-dim log-mel, 16 kHz |
- | **ECAPA-TDNN** | Speaker encoder | 192-dim utterance embedding, 1024 channels |
- | **sid_mapper** | Projection | Speaker emb → prefix tokens in GPT-2 space |
- | **GPT-2** | Language decoder | Prompt-conditioned text generation |
-
- ---
-
- ## Training Details
-
- | Setting | Value |
- |:---|:---|
- | Training Data | TEARS (71K utterances: TIMIT + EARS) |
- | Speaker Encoders | ECAPA-TDNN / PDAF |
- | Language Model | GPT-2 (fine-tuned end-to-end) |
- | Mapper Type | MLP |
- | SID Prefix Length | 40 tokens |
- | Training Objective | Cross-entropy over response tokens |
- | Prompt Format | Instruction-following: question → answer |
- | Evaluation | TEARS test split (44.9K utterances) |
-
- ---
-
  ## Dataset: TEARS
 
  CoLMbo is trained and evaluated on **TEARS**, a large-scale speaker captioning corpus with rich per-speaker annotations.
 
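
For readers of this diff, the removed Architecture and Training Details tables describe an MLP `sid_mapper` that turns a 192-dim speaker embedding into a 40-token prefix, which is concatenated with the prompt embeddings before GPT-2. The following is only a minimal shape-flow sketch of that idea in plain Python, not the repository's implementation. Assumptions: `GPT2_DIM = 768` (base GPT-2; the README does not name the variant), the two-layer `tanh` MLP, `HIDDEN_DIM`, and all function names are hypothetical, and the weights are random.

```python
import math
import random

EMB_DIM = 192     # ECAPA-TDNN utterance embedding size (from the component table)
PREFIX_LEN = 40   # SID prefix length (from the training table)
GPT2_DIM = 768    # ASSUMPTION: hidden size of base GPT-2; the README does not name the variant
HIDDEN_DIM = 16   # HYPOTHETICAL hidden width, kept small so the sketch runs quickly


def make_linear(n_in, n_out, rng):
    """Random weights (n_out rows of n_in values) and zero biases, for illustration only."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, layer):
    """Apply y = W x + b to a single vector."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def sid_mapper(speaker_emb, layer1, layer2):
    """Hypothetical two-layer MLP: speaker embedding -> PREFIX_LEN vectors of size GPT2_DIM."""
    hidden = [math.tanh(h) for h in linear(speaker_emb, layer1)]
    flat = linear(hidden, layer2)
    # Reshape the flat output into PREFIX_LEN "token" vectors.
    return [flat[i * GPT2_DIM:(i + 1) * GPT2_DIM] for i in range(PREFIX_LEN)]


def build_decoder_input(prefix, prompt_embs):
    """Concatenate [speaker prefix | prompt] along the sequence axis."""
    return prefix + prompt_embs


rng = random.Random(0)
layer1 = make_linear(EMB_DIM, HIDDEN_DIM, rng)
layer2 = make_linear(HIDDEN_DIM, PREFIX_LEN * GPT2_DIM, rng)

speaker_emb = [rng.gauss(0.0, 1.0) for _ in range(EMB_DIM)]  # stand-in for the encoder output
prompt_embs = [[0.0] * GPT2_DIM for _ in range(5)]           # stand-in for 5 embedded prompt tokens

prefix = sid_mapper(speaker_emb, layer1, layer2)
decoder_input = build_decoder_input(prefix, prompt_embs)
print(len(prefix), len(prefix[0]), len(decoder_input))  # 40 768 45
```

In a real model the concatenated sequence would be passed to the language model as input embeddings, and the cross-entropy loss would be computed over the response tokens only, as the Training Details table states.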