HEEP (High Entropy Exponential Pruning) is an entropy-based data curation methodology that prioritizes information density over data quantity. It identifies high-information training samples while progressively filtering redundant data, enabling efficient model training with significantly reduced computational resources.

### Mathematical Foundation

#### Sample Score (Equation 7)

The information score for each sample combines multiple entropy dimensions:

```
S(x) = α₁·H_acoustic(x) + α₂·H_phonetic(x) + α₃·H_linguistic(x) + α₄·H_contextual(x) + α₅·MI(x, D)
```

Where:

- `H_linguistic(x)`: Vocabulary and syntax entropy measuring linguistic richness
- `H_contextual(x)`: Domain and discourse entropy
- `MI(x, D)`: Mutual information contribution relative to the dataset
- `α₁...α₅`: Configurable weights (default: 0.25, 0.20, 0.25, 0.15, 0.15)
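
For illustration, here is a minimal Python sketch of Equation 7, assuming the per-dimension entropies have already been estimated upstream; the `SampleEntropies` container and function name are ours, not part of any published HEEP API.

```python
# Minimal sketch of Equation 7 (illustrative names; the entropy estimators
# themselves are assumed to run upstream and are not shown here).
from dataclasses import dataclass

@dataclass
class SampleEntropies:
    acoustic: float      # H_acoustic(x)
    phonetic: float      # H_phonetic(x)
    linguistic: float    # H_linguistic(x)
    contextual: float    # H_contextual(x)
    mutual_info: float   # MI(x, D), estimated against the full dataset

# Default weights α₁...α₅ from this README.
DEFAULT_WEIGHTS = (0.25, 0.20, 0.25, 0.15, 0.15)

def sample_score(h: SampleEntropies, weights=DEFAULT_WEIGHTS) -> float:
    """Weighted sum of entropy dimensions plus the MI contribution."""
    a1, a2, a3, a4, a5 = weights
    return (a1 * h.acoustic + a2 * h.phonetic + a3 * h.linguistic
            + a4 * h.contextual + a5 * h.mutual_info)
```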

#### Selection Criterion (Equation 8)

Samples are selected when their score exceeds the threshold τ:

```
D' = {x ∈ D : S(x) > τ}
```
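
In code, this selection step is a plain filter over precomputed scores; a small sketch (sample ids and values are illustrative):

```python
# Sketch of Equation 8: keep only samples whose score exceeds the threshold.
def select(scores: dict, tau: float) -> dict:
    """`scores` maps each sample id to its precomputed S(x)."""
    return {x: s for x, s in scores.items() if s > tau}

# select({"utt1": 0.9, "utt2": 0.4}, tau=0.5) -> {"utt1": 0.9}
```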

#### Progressive Filtering (Equation 9)

The threshold increases exponentially across training rounds, progressively selecting only the highest-information samples:

```
τ_{k+1} = τ_k × growth_factor
```
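
As a worked example with purely illustrative values, starting from τ₀ = 0.50 with a growth factor of 1.2 gives τ_k = 0.50 × 1.2^k, i.e. 0.50 → 0.60 → 0.72 → 0.864 → 1.037 over the first four rounds, so each round demands strictly more informative samples than the last.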

#### Error-Aware Adaptation

After each training round, sample scores are adjusted based on model errors:

```
S'(x) = S(x) + λ_err·ErrorRelevance(x, errors_k) + λ_cross·CrossLingualOverlap(x)
```
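
A self-contained sketch of this adjustment follows; the token-overlap heuristic stands in for `ErrorRelevance`, which this README does not define, and the cross-lingual term is omitted for brevity.

```python
# Illustrative error-aware score adjustment. The token-overlap heuristic is
# an assumption for this sketch, not HEEP's actual ErrorRelevance definition.
def error_relevance(transcript: str, error_tokens: set) -> float:
    """Fraction of a sample's tokens the model recently misrecognized."""
    tokens = transcript.split()
    if not tokens:
        return 0.0
    return sum(t in error_tokens for t in tokens) / len(tokens)

def adjusted_score(s: float, transcript: str, error_tokens: set,
                   lam_err: float = 0.1) -> float:
    """S'(x) = S(x) + λ_err · ErrorRelevance(x, errors_k)."""
    return s + lam_err * error_relevance(transcript, error_tokens)
```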

### Algorithm Overview

```
Algorithm: HEEP Data Curation with Error-Aware Adaptation

Input: Dataset D, initial threshold τ₀, growth factor g
Output: Curated dataset D*

1. Initialize scorer with entropy estimators
2. Fit scorer to D (compute normalization stats, fit MI estimator)
3. D* ← D
4. k ← 0
5. While |D*| > min_samples AND k < max_rounds:
   a. For each x in D*:
      Compute S(x) = Σᵢ αᵢ·Hᵢ(x) + α₅·MI(x, D)
   b. If error_patterns available:
      Adjust S'(x) = S(x) + λ_err·ErrorRelevance(x)
   c. D* ← {x ∈ D* : S'(x) > τ_k}
   d. If train_callback: Train model on D*
   e. If eval_callback: Analyze errors, update error_patterns
   f. τ_{k+1} ← τ_k × g
   g. k ← k + 1
6. Return D*
```
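
Rendered as runnable Python, the loop is compact. In the sketch below, the `Scorer` protocol, the callback signatures, and all default values are our assumptions, not the project's published interface.

```python
# One possible rendering of the HEEP curation loop under stated assumptions.
from typing import Callable, Optional, Protocol

class Scorer(Protocol):
    def fit(self, dataset: list) -> None: ...                 # step 2
    def score(self, x, error_patterns) -> float: ...          # steps 5a-5b

def heep_curate(
    dataset: list,
    scorer: Scorer,
    tau0: float = 0.5,          # initial threshold τ₀ (illustrative default)
    growth: float = 1.2,        # growth factor g (illustrative default)
    min_samples: int = 1000,
    max_rounds: int = 10,
    train_fn: Optional[Callable] = None,   # optional per-round training
    eval_fn: Optional[Callable] = None,    # optional error analysis
) -> list:
    scorer.fit(dataset)         # normalization stats, MI estimator
    selected, tau, errors = list(dataset), tau0, None
    for _ in range(max_rounds):
        if len(selected) <= min_samples:
            break
        # Score each sample (adjusted by error patterns when available)
        # and keep only those above the current threshold.
        selected = [x for x in selected if scorer.score(x, errors) > tau]
        if train_fn is not None:
            train_fn(selected)             # step 5d
        if eval_fn is not None:
            errors = eval_fn(selected)     # step 5e
        tau *= growth                      # step 5f: exponential schedule
    return selected
```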

### Key Benefits

*RTFx (Real-Time Factor) indicates inference speed relative to audio duration. Higher values mean faster processing.*

## Model Details

- **Architecture**: Transformer-based encoder-decoder optimized for multilingual transcription