HEEP (High Entropy Exponential Pruning) is an entropy-based data curation method.

### Mathematical Foundation

#### Sample Score (Equation 1)

The information score for each sample combines multiple entropy dimensions:

```
S(x) = α₁·H_acoustic(x) + α₂·H_phonetic(x) + α₃·H_linguistic(x) + α₄·H_contextual(x) + β·MI(x, D)
```

Where:
- `H_linguistic(x)`: Vocabulary and syntax entropy measuring linguistic richness
- `H_contextual(x)`: Domain and discourse entropy
- `MI(x, D)`: Mutual information contribution relative to dataset
- `α₁...α₄, β`: Configurable weights (default: 0.25, 0.20, 0.25, 0.15, 0.15)

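As a concrete sketch, Equation 1 is a plain weighted sum; the snippet below assumes the per-sample entropy terms are already computed, and the `EntropyScores` container and function names are illustrative, not part of HEEP's API:

```python
from dataclasses import dataclass

# Default weights from the list above: α₁..α₄ for the entropy terms, β for MI.
ALPHAS = (0.25, 0.20, 0.25, 0.15)
BETA = 0.15

@dataclass
class EntropyScores:
    """Per-sample entropy terms (illustrative container, not HEEP's API)."""
    acoustic: float
    phonetic: float
    linguistic: float
    contextual: float
    mi: float  # MI(x, D): mutual-information contribution vs. the dataset

def sample_score(e: EntropyScores, alphas=ALPHAS, beta=BETA) -> float:
    """Equation 1: S(x) = Σᵢ αᵢ·Hᵢ(x) + β·MI(x, D)."""
    entropies = (e.acoustic, e.phonetic, e.linguistic, e.contextual)
    return sum(a * h for a, h in zip(alphas, entropies)) + beta * e.mi
```

Since the default weights sum to 1, a sample whose every term equals 1.0 scores exactly 1.0.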
#### Mutual Information (Equation 2)

The mutual information between acoustic features and transcription:

```
I(x, y) = Σ_{j,ℓ} p(f_j, y_ℓ) log [p(f_j, y_ℓ) / (p(f_j)·p(y_ℓ))]
```

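Over discretized features, Equation 2 reduces to a double sum over a joint probability table. A minimal sketch, assuming the acoustic features `f_j` and transcription tokens `y_ℓ` have already been quantized into a joint histogram:

```python
import math

def mutual_information(joint):
    """Equation 2 for a discrete joint distribution.

    joint[j][l] holds p(f_j, y_l); the marginals p(f_j) and p(y_l)
    are recovered by summing rows and columns respectively.
    """
    p_f = [sum(row) for row in joint]           # p(f_j)
    p_y = [sum(col) for col in zip(*joint)]     # p(y_l)
    mi = 0.0
    for j, row in enumerate(joint):
        for l, p in enumerate(row):
            if p > 0.0:                         # 0·log 0 is taken as 0
                mi += p * math.log(p / (p_f[j] * p_y[l]))
    return mi
```

Independent features and labels give I = 0, while a perfectly correlated 2×2 table gives log 2 nats.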
#### Selection Criterion

Samples are selected based on a threshold:

```
D' = {x ∈ D : S(x) > τ}
```

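The selection step itself is a one-line filter. In this sketch, `scores` maps each sample id to its score S(x); that mapping shape is an assumption for illustration:

```python
def select(scores, tau):
    """Selection criterion: D' = {x ∈ D : S(x) > τ}."""
    return {x for x, s in scores.items() if s > tau}
```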
#### Progressive Filtering (Equation 8)

The threshold increases exponentially across rounds:

```
τ_{k+1} = τ_k · growth_factor
```

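Under Equation 8 the thresholds form a geometric sequence, so pruning grows more aggressive in later rounds. A small helper (the name is illustrative) makes the schedule easy to inspect:

```python
def threshold_schedule(tau0, growth_factor, rounds):
    """Progressive filtering (Equation 8): τ_{k+1} = τ_k · growth_factor."""
    taus = [tau0]
    for _ in range(rounds):
        taus.append(taus[-1] * growth_factor)
    return taus
```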
#### Error-Aware Adaptation

```
Output: Curated dataset D*
…
4. k ← 0
5. While |D*| > min_samples AND k < max_rounds:
   a. For each x in D*:
      Compute S(x) = Σᵢ αᵢ·Hᵢ(x) + β·MI(x, D)
   b. If error_patterns available:
      Adjust S'(x) = S(x) + λ_err·ErrorRelevance(x) + λ_cross·CrossLingualOverlap(x)
   c. D* ← {x ∈ D* : S'(x) > τₖ}
   d. If train_callback: Train model on D*
   e. If eval_callback: Analyze errors, update error_patterns
   f. τₖ₊₁ ← τₖ · g
   g. k ← k + 1
6. Return D*
```
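The loop above can be sketched end to end in Python. This is a simplified reading, not the reference implementation: entropy terms are collapsed into precomputed scores, the training/eval callbacks and the cross-lingual term are omitted, and the names (`heep_curate`, `error_relevance`, `lambda_err`) are illustrative:

```python
def heep_curate(scores, tau0, growth_factor, min_samples, max_rounds,
                error_relevance=None, lambda_err=0.1):
    """Rounds of threshold pruning with optional error-aware adjustment.

    scores: sample id -> S(x); error_relevance: sample id -> ErrorRelevance(x).
    """
    def adjusted(x):
        s = scores[x]                                      # S(x), Equation 1
        if error_relevance is not None:
            s += lambda_err * error_relevance.get(x, 0.0)  # S'(x), step b
        return s

    selected = set(scores)      # D* starts as the full dataset
    tau, k = tau0, 0
    while len(selected) > min_samples and k < max_rounds:
        selected = {x for x in selected if adjusted(x) > tau}  # step c
        tau *= growth_factor                                   # step f (Eq. 8)
        k += 1                                                 # step g
    return selected
```

With three samples scored 1.0, 0.5, and 0.2, a starting threshold of 0.3, and a growth factor of 2, the first round drops the lowest-scoring sample and the second drops the middle one, leaving only the highest scorer.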