HEEP (High Entropy Exponential Pruning) is an entropy-based data curation method.

### Mathematical Foundation

#### Sample Score (Equation 1)

The information score for each sample combines multiple entropy dimensions:

```
S(x) = α₁·H_acoustic(x) + α₂·H_phonetic(x) + α₃·H_linguistic(x) + α₄·H_contextual(x) + β·MI(x, D)
```

Where:
- `H_linguistic(x)`: Vocabulary and syntax entropy measuring linguistic richness
- `H_contextual(x)`: Domain and discourse entropy
- `MI(x, D)`: Mutual information contribution relative to dataset
- `α₁...α₄, β`: Configurable weights (default: 0.25, 0.20, 0.25, 0.15, 0.15)

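As a concrete sketch, Equation 1 is a plain weighted sum; the snippet below assumes the per-sample entropy terms are already computed, and the `EntropyScores` container and function names are illustrative, not part of HEEP's API:

```python
from dataclasses import dataclass

# Default weights from the list above: α₁..α₄ for the entropy terms, β for MI.
ALPHAS = (0.25, 0.20, 0.25, 0.15)
BETA = 0.15

@dataclass
class EntropyScores:
    """Per-sample entropy terms (illustrative container, not HEEP's API)."""
    acoustic: float
    phonetic: float
    linguistic: float
    contextual: float
    mi: float  # MI(x, D): mutual-information contribution vs. the dataset

def sample_score(e: EntropyScores, alphas=ALPHAS, beta=BETA) -> float:
    """Equation 1: S(x) = Σᵢ αᵢ·Hᵢ(x) + β·MI(x, D)."""
    entropies = (e.acoustic, e.phonetic, e.linguistic, e.contextual)
    return sum(a * h for a, h in zip(alphas, entropies)) + beta * e.mi
```

Since the default weights sum to 1, a sample whose every term equals 1.0 scores exactly 1.0.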
#### Mutual Information (Equation 2)

The mutual information between acoustic features and transcription:

```
I(x, y) = Σ_{j,ℓ} p(f_j, y_ℓ) log [p(f_j, y_ℓ) / (p(f_j)·p(y_ℓ))]
```

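Over discretized features, Equation 2 reduces to a double sum over a joint probability table. A minimal sketch, assuming the acoustic features `f_j` and transcription tokens `y_ℓ` have already been quantized into a joint histogram:

```python
import math

def mutual_information(joint):
    """Equation 2 for a discrete joint distribution.

    joint[j][l] holds p(f_j, y_l); the marginals p(f_j) and p(y_l)
    are recovered by summing rows and columns respectively.
    """
    p_f = [sum(row) for row in joint]           # p(f_j)
    p_y = [sum(col) for col in zip(*joint)]     # p(y_l)
    mi = 0.0
    for j, row in enumerate(joint):
        for l, p in enumerate(row):
            if p > 0.0:                         # 0·log 0 is taken as 0
                mi += p * math.log(p / (p_f[j] * p_y[l]))
    return mi
```

Independent features and labels give I = 0, while a perfectly correlated 2×2 table gives log 2 nats.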
#### Selection Criterion

Samples are selected based on a threshold:

```
D' = {x ∈ D : S(x) > τ}
```

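The selection step itself is a one-line filter. In this sketch, `scores` maps each sample id to its score S(x); that mapping shape is an assumption for illustration:

```python
def select(scores, tau):
    """Selection criterion: D' = {x ∈ D : S(x) > τ}."""
    return {x for x, s in scores.items() if s > tau}
```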
#### Progressive Filtering (Equation 8)

The threshold increases exponentially across rounds:

```
τ_{k+1} = τ_k · growth_factor
```

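Under Equation 8 the thresholds form a geometric sequence, so pruning grows more aggressive in later rounds. A small helper (the name is illustrative) makes the schedule easy to inspect:

```python
def threshold_schedule(tau0, growth_factor, rounds):
    """Progressive filtering (Equation 8): τ_{k+1} = τ_k · growth_factor."""
    taus = [tau0]
    for _ in range(rounds):
        taus.append(taus[-1] * growth_factor)
    return taus
```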
#### Error-Aware Adaptation

```
Output: Curated dataset D*
…
4. k ← 0
5. While |D*| > min_samples AND k < max_rounds:
   a. For each x in D*:
      Compute S(x) = Σᵢ αᵢ·Hᵢ(x) + β·MI(x, D)
   b. If error_patterns available:
      Adjust S'(x) = S(x) + λ_err·ErrorRelevance(x) + λ_cross·CrossLingualOverlap(x)
   c. D* ← {x ∈ D* : S'(x) > τₖ}
   d. If train_callback: Train model on D*
   e. If eval_callback: Analyze errors, update error_patterns
   f. τₖ₊₁ ← τₖ · g
   g. k ← k + 1
6. Return D*
```
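The loop above can be sketched end to end in Python. This is a simplified reading, not the reference implementation: entropy terms are collapsed into precomputed scores, the training/eval callbacks and the cross-lingual term are omitted, and the names (`heep_curate`, `error_relevance`, `lambda_err`) are illustrative:

```python
def heep_curate(scores, tau0, growth_factor, min_samples, max_rounds,
                error_relevance=None, lambda_err=0.1):
    """Rounds of threshold pruning with optional error-aware adjustment.

    scores: sample id -> S(x); error_relevance: sample id -> ErrorRelevance(x).
    """
    def adjusted(x):
        s = scores[x]                                      # S(x), Equation 1
        if error_relevance is not None:
            s += lambda_err * error_relevance.get(x, 0.0)  # S'(x), step b
        return s

    selected = set(scores)      # D* starts as the full dataset
    tau, k = tau0, 0
    while len(selected) > min_samples and k < max_rounds:
        selected = {x for x in selected if adjusted(x) > tau}  # step c
        tau *= growth_factor                                   # step f (Eq. 8)
        k += 1                                                 # step g
    return selected
```

With three samples scored 1.0, 0.5, and 0.2, a starting threshold of 0.3, and a growth factor of 2, the first round drops the lowest-scoring sample and the second drops the middle one, leaving only the highest scorer.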