bc7ec356 committed on
Commit 7e8b688 · verified · 1 Parent(s): f8fde5e

Update README.md

Files changed (1): README.md +18 -10
README.md CHANGED
@@ -240,12 +240,12 @@ HEEP (High Entropy Exponential Pruning) is an entropy-based data curation method
 
 ### Mathematical Foundation
 
-#### Sample Score (Equation 7)
+#### Sample Score (Equation 1)
 
 The information score for each sample combines multiple entropy dimensions:
 
 ```
-S(x) = α₁·H_acoustic(x) + α₂·H_phonetic(x) + α₃·H_linguistic(x) + α₄·H_contextual(x) + α₅·MI(x, D)
+S(x) = α₁·H_acoustic(x) + α₂·H_phonetic(x) + α₃·H_linguistic(x) + α₄·H_contextual(x) + β·MI(x, D)
 ```
 
 Where:
@@ -254,9 +254,17 @@ Where:
 - `H_linguistic(x)`: Vocabulary and syntax entropy measuring linguistic richness
 - `H_contextual(x)`: Domain and discourse entropy
 - `MI(x, D)`: Mutual information contribution relative to dataset
-- `α₁...α₅`: Configurable weights (default: 0.25, 0.20, 0.25, 0.15, 0.15)
+- `α₁...α₄, β`: Configurable weights (default: 0.25, 0.20, 0.25, 0.15, 0.15)
 
-#### Selection Criterion (Equation 8)
+#### Mutual Information (Equation 2)
+
+The mutual information between acoustic features and transcription:
+
+```
+I(x, y) = Σ_{j,ℓ} p(f_j, y_ℓ) log [p(f_j, y_ℓ) / (p(f_j)·p(y_ℓ))]
+```
+
+#### Selection Criterion
 
 Samples are selected based on a threshold:
 
@@ -264,12 +272,12 @@ Samples are selected based on a threshold:
 D' = {x ∈ D : S(x) > τ}
 ```
 
-#### Progressive Filtering (Equation 9)
+#### Progressive Filtering (Equation 8)
 
 The threshold increases exponentially across rounds:
 
 ```
-τ_{k+1} = τ_k × growth_factor
+τ_{k+1} = τ_k · growth_factor
 ```
 
 #### Error-Aware Adaptation
@@ -294,13 +302,13 @@ Output: Curated dataset D*
 4. k ← 0
 5. While |D*| > min_samples AND k < max_rounds:
    a. For each x in D*:
-      Compute S(x) = Σᵢ αᵢ·Hᵢ(x) + MI(x, D)
+      Compute S(x) = Σᵢ αᵢ·Hᵢ(x) + β·MI(x, D)
    b. If error_patterns available:
-      Adjust S'(x) = S(x) + λ_err·ErrorRelevance(x)
-   c. D* ← {x ∈ D* : S'(x) > τ_k}
+      Adjust S'(x) = S(x) + λ_err·ErrorRelevance(x) + λ_cross·CrossLingualOverlap(x)
+   c. D* ← {x ∈ D* : S'(x) > τ_k}
    d. If train_callback: Train model on D*
    e. If eval_callback: Analyze errors, update error_patterns
-   f. τ_{k+1} ← τ_k × g
+   f. τ_{k+1} ← τ_k · g
    g. k ← k + 1
 6. Return D*
 ```
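For reference, the weighted sum in the updated sample score (Equation 1) is a one-liner. This is an illustrative sketch, not the repository's implementation; the function and constant names are invented here, while the default weights mirror the README (α₁..α₄ = 0.25, 0.20, 0.25, 0.15 and β = 0.15):

```python
# Hypothetical sketch of Equation 1:
# S(x) = α₁·H_acoustic + α₂·H_phonetic + α₃·H_linguistic + α₄·H_contextual + β·MI
DEFAULT_ALPHAS = (0.25, 0.20, 0.25, 0.15)  # α₁..α₄, README defaults
DEFAULT_BETA = 0.15                        # β, weight on the MI term

def sample_score(entropies, mi, alphas=DEFAULT_ALPHAS, beta=DEFAULT_BETA):
    """entropies = (H_acoustic, H_phonetic, H_linguistic, H_contextual) for one sample."""
    return sum(a * h for a, h in zip(alphas, entropies)) + beta * mi
```

Note that the weights sum to 1.0, so a sample whose entropy terms and MI are all equal to some value v scores exactly v.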
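The new mutual-information term (Equation 2) is the standard discrete MI over a joint distribution of feature bins and transcription tokens. A minimal sketch, assuming the joint probabilities p(f_j, y_ℓ) are already estimated (the function name and input layout are assumptions, not from the repository):

```python
import math

# Hypothetical sketch of Equation 2: I(x, y) = Σ p(f_j, y_l) log[p(f_j, y_l) / (p(f_j)·p(y_l))]
def mutual_information(joint):
    """joint[j][l] = p(f_j, y_l); returns I(x, y) in nats."""
    pf = [sum(row) for row in joint]        # marginal p(f_j)
    py = [sum(col) for col in zip(*joint)]  # marginal p(y_l)
    total = 0.0
    for j, row in enumerate(joint):
        for l, p in enumerate(row):
            if p > 0.0:                     # 0·log(0) terms contribute nothing
                total += p * math.log(p / (pf[j] * py[l]))
    return total
```

An independent joint (all entries 0.25 on a 2×2 table) gives 0, and a perfectly correlated diagonal joint gives log 2, as expected.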
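The selection criterion and progressive filtering steps combine into a short loop: keep samples with S(x) > τ_k, then grow the threshold by the factor g each round. The sketch below is illustrative only; it omits the train/eval callbacks and the error-aware score adjustment from the full pseudocode, and all names are invented here:

```python
# Hypothetical sketch of progressive filtering (Equation 8 plus the selection step).
def heep_prune(scores, tau0, growth_factor, min_samples, max_rounds):
    """scores: mapping sample id -> S(x); returns the surviving subset."""
    kept = dict(scores)
    tau = tau0
    for _ in range(max_rounds):
        if len(kept) <= min_samples:                       # While |D*| > min_samples
            break
        kept = {x: s for x, s in kept.items() if s > tau}  # D* ← {x : S(x) > τ_k}
        tau *= growth_factor                               # τ_{k+1} ← τ_k · g
    return kept
```

After k rounds the threshold has grown to τ₀·gᵏ, which is what makes the pruning exponential.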