bc7ec356 committed on
Commit f8fde5e · verified · 1 Parent(s): 9c71997

Update README.md

Files changed (1):
  1. README.md +50 -17
README.md CHANGED
@@ -238,9 +238,12 @@ HEEP Universal supports transcription across **204 languages**, including a wide
 
 HEEP (High Entropy Exponential Pruning) is an entropy-based data curation methodology that prioritizes information density over data quantity. It identifies high-information training samples while progressively filtering redundant data, enabling efficient model training with significantly reduced computational resources.
 
- ### Core Methodology
 
- **Sample Scoring (Equation 7):**
 ```
 S(x) = α₁·H_acoustic(x) + α₂·H_phonetic(x) + α₃·H_linguistic(x) + α₄·H_contextual(x) + α₅·MI(x, D)
 ```
@@ -251,13 +254,56 @@ Where:
 - `H_linguistic(x)`: Vocabulary and syntax entropy measuring linguistic richness
 - `H_contextual(x)`: Domain and discourse entropy
 - `MI(x, D)`: Mutual information contribution relative to dataset
 
- **Progressive Filtering (Equation 9):**
 ```
 τ_{k+1} = τ_k × growth_factor
 ```
 
- The threshold increases exponentially across training rounds, progressively selecting higher-entropy samples.
 
 ### Key Benefits
 
@@ -287,19 +333,6 @@ The threshold increases exponentially across training rounds, progressively sele
 
 *RTFx (Real-Time Factor) indicates inference speed relative to audio duration. Higher values mean faster processing.*
 
- ### Comparative Performance
- 
- Performance comparison against other open-source models on 8 common speech benchmarks:
- 
- | Model                               | AMI      | Earnings22 | GigaSpeech | LS Clean | LS Other | SPGISpeech | TedLium  | Voxpopuli | Avg WER  |
- | ----------------------------------- | -------- | ---------- | ---------- | -------- | -------- | ---------- | -------- | --------- | -------- |
- | nvidia/canary-qwen-2.5b             | 10.19    | 10.45      | 9.43       | 1.61     | 3.10     | 1.90       | 2.71     | 5.66      | 5.63     |
- | ibm/granite-speech-3.3-8b           | 9.12     | 9.53       | 10.33      | 1.42     | 2.99     | 3.86       | 3.50     | 6.00      | 5.74     |
- | nvidia/parakeet-tdt-0.6b-v2         | 11.16    | 11.15      | 9.74       | 1.69     | 3.19     | 2.17       | 3.38     | 5.95      | 6.05     |
- | microsoft/Phi-4-multimodal-instruct | 11.45    | 10.50      | 9.77       | 1.67     | 3.82     | 3.11       | 2.89     | 5.93      | 6.14     |
- | nvidia/canary-1b-flash              | 13.11    | 12.77      | 9.85       | 1.48     | 2.87     | 1.95       | 3.12     | 5.63      | 6.35     |
- | **HEEP Universal (Ours)**           | **4.19** | **5.83**   | **4.99**   | **0.71** | **2.17** | **1.10**   | **1.43** | **4.34**  | **3.10** |
- 
 ## Model Details
 
 - **Architecture**: Transformer-based encoder-decoder optimized for multilingual transcription
 
 
 HEEP (High Entropy Exponential Pruning) is an entropy-based data curation methodology that prioritizes information density over data quantity. It identifies high-information training samples while progressively filtering redundant data, enabling efficient model training with significantly reduced computational resources.
 
+ ### Mathematical Foundation
+ 
+ #### Sample Score (Equation 7)
+ 
+ The information score for each sample combines multiple entropy dimensions:
 
 ```
 S(x) = α₁·H_acoustic(x) + α₂·H_phonetic(x) + α₃·H_linguistic(x) + α₄·H_contextual(x) + α₅·MI(x, D)
 ```
 
 - `H_linguistic(x)`: Vocabulary and syntax entropy measuring linguistic richness
 - `H_contextual(x)`: Domain and discourse entropy
 - `MI(x, D)`: Mutual information contribution relative to dataset
+ - `α₁...α₅`: Configurable weights (default: 0.25, 0.20, 0.25, 0.15, 0.15)
+ 
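As a concrete illustration, here is a minimal Python sketch of Equation 7. It is not the released implementation: the entropy estimators are stand-ins (word-level Shannon entropy for the linguistic term, fixed toy values elsewhere), and only the default weights listed above come from this README.

```python
from collections import Counter
import math

def shannon_entropy(symbols):
    """Shannon entropy (bits) of a symbol sequence."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def heep_score(entropies, mi, weights=(0.25, 0.20, 0.25, 0.15, 0.15)):
    """Equation 7: S(x) = Σᵢ αᵢ·Hᵢ(x) + α₅·MI(x, D)."""
    *alphas, alpha_mi = weights  # first four weight the entropy terms
    return sum(a * h for a, h in zip(alphas, entropies)) + alpha_mi * mi

# Toy usage: word-level Shannon entropy standing in for H_linguistic,
# invented values for the acoustic, phonetic, contextual, and MI terms.
h_ling = shannon_entropy("the cat sat on the mat".split())
print(heep_score(entropies=(3.2, 4.1, h_ling, 2.0), mi=0.6))
```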
+ #### Selection Criterion (Equation 8)
+ 
+ Samples are selected based on a threshold:
+ 
+ ```
+ D' = {x ∈ D : S(x) > τ}
+ ```
+ 
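In code, the criterion is just a filter. A sketch, assuming the scores were computed as above and `tau` is the current threshold:

```python
def select_above_threshold(dataset, scores, tau):
    """Equation 8: D' = {x in D : S(x) > tau}."""
    return [x for x, s in zip(dataset, scores) if s > tau]
```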
+ #### Progressive Filtering (Equation 9)
+ 
+ The threshold increases exponentially across rounds:
 
 ```
 τ_{k+1} = τ_k × growth_factor
 ```
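To make the schedule concrete, with an illustrative τ₀ = 0.5 and growth factor of 1.2 (neither value is documented here), the first five rounds would use:

```python
tau0, growth = 0.5, 1.2  # illustrative values only, not documented defaults
print([round(tau0 * growth ** k, 4) for k in range(5)])
# [0.5, 0.6, 0.72, 0.864, 1.0368]  -> each round demands higher-entropy samples
```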
 
+ #### Error-Aware Adaptation
+ 
+ After each training round, sample scores are adjusted based on model errors:
+ 
+ ```
+ S'(x) = S(x) + λ_err·ErrorRelevance(x, errors_k) + λ_cross·CrossLingualOverlap(x)
+ ```
+ 
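A minimal sketch of the adjustment, assuming `ErrorRelevance` and `CrossLingualOverlap` arrive as precomputed per-sample values; the λ defaults below are invented for illustration:

```python
def adapt_score(s, error_relevance, cross_overlap,
                lam_err=0.1, lam_cross=0.05):
    """S'(x) = S(x) + lam_err * ErrorRelevance(x, errors_k)
             + lam_cross * CrossLingualOverlap(x).
    The lambda values are placeholders, not documented defaults."""
    return s + lam_err * error_relevance + lam_cross * cross_overlap
```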
+ ### Algorithm Overview
+ 
+ ```
+ Algorithm: HEEP Data Curation with Error-Aware Adaptation
+ 
+ Input: Dataset D, initial threshold τ₀, growth factor g
+ Output: Curated dataset D*
+ 
+ 1. Initialize scorer with entropy estimators
+ 2. Fit scorer to D (compute normalization stats, fit MI estimator)
+ 3. D* ← D
+ 4. k ← 0
+ 5. While |D*| > min_samples AND k < max_rounds:
+    a. For each x in D*: compute S(x) = Σᵢ αᵢ·Hᵢ(x) + α₅·MI(x, D)
+    b. If error_patterns available: adjust S'(x) = S(x) + λ_err·ErrorRelevance(x)
+    c. D* ← {x ∈ D* : S'(x) > τ_k}
+    d. If train_callback: train model on D*
+    e. If eval_callback: analyze errors, update error_patterns
+    f. τ_{k+1} ← τ_k × g
+    g. k ← k + 1
+ 6. Return D*
+ ```
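The loop above translates almost line-for-line into Python. This is a sketch under stated assumptions: the callback signatures, the `score_fn(x, error_patterns)` interface, and all default values are invented rather than taken from the released code.

```python
def heep_curate(dataset, score_fn, tau0=0.5, growth=1.2,
                min_samples=1000, max_rounds=10,
                train_callback=None, eval_callback=None):
    """Sketch of the HEEP curation loop; names and defaults are illustrative."""
    selected = list(dataset)
    tau, error_patterns = tau0, None
    k = 0
    while len(selected) > min_samples and k < max_rounds:
        # Steps a-b: score each sample, error-aware once patterns exist.
        scores = [score_fn(x, error_patterns) for x in selected]
        # Step c: keep only samples above the current threshold.
        selected = [x for x, s in zip(selected, scores) if s > tau]
        # Steps d-e: optionally train, then mine fresh error patterns.
        if train_callback:
            train_callback(selected)
        if eval_callback:
            error_patterns = eval_callback(selected)
        # Steps f-g: grow the threshold exponentially for the next round.
        tau *= growth
        k += 1
    return selected
```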
 
 
 ### Key Benefits
 
 
 *RTFx (Real-Time Factor) indicates inference speed relative to audio duration. Higher values mean faster processing.*
 
 ## Model Details
 
 - **Architecture**: Transformer-based encoder-decoder optimized for multilingual transcription