HEEP (High Entropy Exponential Pruning) is an entropy-based data curation methodology that prioritizes information density over data quantity. It identifies high-information training samples while progressively filtering redundant data, enabling efficient model training with significantly reduced computational resources.

### Mathematical Foundation

#### Sample Score (Equation 7)

The information score for each sample combines multiple entropy dimensions:

```
S(x) = α₁·H_acoustic(x) + α₂·H_phonetic(x) + α₃·H_linguistic(x) + α₄·H_contextual(x) + α₅·MI(x, D)
```

Where:

- `H_linguistic(x)`: Vocabulary and syntax entropy measuring linguistic richness
- `H_contextual(x)`: Domain and discourse entropy
- `MI(x, D)`: Mutual information contribution relative to the dataset
- `α₁...α₅`: Configurable weights (default: 0.25, 0.20, 0.25, 0.15, 0.15)
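
For illustration, here is a minimal Python sketch of Equation 7, assuming the per-dimension entropies have already been estimated upstream; the `SampleEntropies` container and function name are ours, not part of any published HEEP API.

```python
# Minimal sketch of Equation 7 (illustrative names; the entropy estimators
# themselves are assumed to run upstream and are not shown here).
from dataclasses import dataclass

@dataclass
class SampleEntropies:
    acoustic: float      # H_acoustic(x)
    phonetic: float      # H_phonetic(x)
    linguistic: float    # H_linguistic(x)
    contextual: float    # H_contextual(x)
    mutual_info: float   # MI(x, D), estimated against the full dataset

# Default weights α₁...α₅ from this README.
DEFAULT_WEIGHTS = (0.25, 0.20, 0.25, 0.15, 0.15)

def sample_score(h: SampleEntropies, weights=DEFAULT_WEIGHTS) -> float:
    """Weighted sum of entropy dimensions plus the MI contribution."""
    a1, a2, a3, a4, a5 = weights
    return (a1 * h.acoustic + a2 * h.phonetic + a3 * h.linguistic
            + a4 * h.contextual + a5 * h.mutual_info)
```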

#### Selection Criterion (Equation 8)

Samples are selected when their score exceeds the threshold τ:

```
D' = {x ∈ D : S(x) > τ}
```
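
In code, this selection step is a plain filter over precomputed scores; a small sketch (sample ids and values are illustrative):

```python
# Sketch of Equation 8: keep only samples whose score exceeds the threshold.
def select(scores: dict, tau: float) -> dict:
    """`scores` maps each sample id to its precomputed S(x)."""
    return {x: s for x, s in scores.items() if s > tau}

# select({"utt1": 0.9, "utt2": 0.4}, tau=0.5) -> {"utt1": 0.9}
```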

#### Progressive Filtering (Equation 9)

The threshold increases exponentially across training rounds, progressively selecting only the highest-information samples:

```
τ_{k+1} = τ_k × growth_factor
```
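
As a worked example with purely illustrative values, starting from τ₀ = 0.50 with a growth factor of 1.2 gives τ_k = 0.50 × 1.2^k, i.e. 0.50 → 0.60 → 0.72 → 0.864 → 1.037 over the first four rounds, so each round demands strictly more informative samples than the last.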

#### Error-Aware Adaptation

After each training round, sample scores are adjusted based on model errors:

```
S'(x) = S(x) + λ_err·ErrorRelevance(x, errors_k) + λ_cross·CrossLingualOverlap(x)
```
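
A self-contained sketch of this adjustment follows; the token-overlap heuristic stands in for `ErrorRelevance`, which this README does not define, and the cross-lingual term is omitted for brevity.

```python
# Illustrative error-aware score adjustment. The token-overlap heuristic is
# an assumption for this sketch, not HEEP's actual ErrorRelevance definition.
def error_relevance(transcript: str, error_tokens: set) -> float:
    """Fraction of a sample's tokens the model recently misrecognized."""
    tokens = transcript.split()
    if not tokens:
        return 0.0
    return sum(t in error_tokens for t in tokens) / len(tokens)

def adjusted_score(s: float, transcript: str, error_tokens: set,
                   lam_err: float = 0.1) -> float:
    """S'(x) = S(x) + λ_err · ErrorRelevance(x, errors_k)."""
    return s + lam_err * error_relevance(transcript, error_tokens)
```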

### Algorithm Overview

```
Algorithm: HEEP Data Curation with Error-Aware Adaptation

Input: Dataset D, initial threshold τ₀, growth factor g
Output: Curated dataset D*

1. Initialize scorer with entropy estimators
2. Fit scorer to D (compute normalization stats, fit MI estimator)
3. D* ← D
4. k ← 0
5. While |D*| > min_samples AND k < max_rounds:
   a. For each x in D*:
      Compute S(x) = Σᵢ αᵢ·Hᵢ(x) + α₅·MI(x, D)
   b. If error_patterns available:
      Adjust S'(x) = S(x) + λ_err·ErrorRelevance(x)
   c. D* ← {x ∈ D* : S'(x) > τ_k}
   d. If train_callback: Train model on D*
   e. If eval_callback: Analyze errors, update error_patterns
   f. τ_{k+1} ← τ_k × g
   g. k ← k + 1
6. Return D*
```
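
Rendered as runnable Python, the loop is compact. In the sketch below, the `Scorer` protocol, the callback signatures, and all default values are our assumptions, not the project's published interface.

```python
# One possible rendering of the HEEP curation loop under stated assumptions.
from typing import Callable, Optional, Protocol

class Scorer(Protocol):
    def fit(self, dataset: list) -> None: ...                 # step 2
    def score(self, x, error_patterns) -> float: ...          # steps 5a-5b

def heep_curate(
    dataset: list,
    scorer: Scorer,
    tau0: float = 0.5,          # initial threshold τ₀ (illustrative default)
    growth: float = 1.2,        # growth factor g (illustrative default)
    min_samples: int = 1000,
    max_rounds: int = 10,
    train_fn: Optional[Callable] = None,   # optional per-round training
    eval_fn: Optional[Callable] = None,    # optional error analysis
) -> list:
    scorer.fit(dataset)         # normalization stats, MI estimator
    selected, tau, errors = list(dataset), tau0, None
    for _ in range(max_rounds):
        if len(selected) <= min_samples:
            break
        # Score each sample (adjusted by error patterns when available)
        # and keep only those above the current threshold.
        selected = [x for x in selected if scorer.score(x, errors) > tau]
        if train_fn is not None:
            train_fn(selected)             # step 5d
        if eval_fn is not None:
            errors = eval_fn(selected)     # step 5e
        tau *= growth                      # step 5f: exponential schedule
    return selected
```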

### Key Benefits

*RTFx (Real-Time Factor) indicates inference speed relative to audio duration. Higher values mean faster processing.*

## Model Details

- **Architecture**: Transformer-based encoder-decoder optimized for multilingual transcription