---
language:
- aa
- ab
- ae
- af
- ak
- am
- an
- ar
- as
- av
- ay
- az
- ba
- be
- bg
- bh
- bi
- bm
- bn
- bo
- br
- bs
- ca
- ce
- ch
- co
- cr
- cs
- cu
- cv
- cy
- da
- de
- dv
- dz
- ee
- el
- en
- eo
- es
- et
- eu
- fa
- ff
- fi
- fj
- fo
- fr
- fy
- ga
- gd
- gl
- gn
- gu
- gv
- ha
- he
- hi
- ho
- hr
- ht
- hu
- hy
- hz
- ia
- id
- ie
- ig
- ii
- ik
- io
- is
- it
- iu
- ja
- jv
- ka
- kg
- ki
- kj
- kk
- kl
- km
- kn
- ko
- kr
- ks
- ku
- kv
- kw
- ky
- la
- lb
- lg
- li
- ln
- lo
- lt
- lu
- lv
- mg
- mh
- mi
- mk
- ml
- mn
- mr
- ms
- mt
- my
- na
- nb
- nd
- ne
- ng
- nl
- nn
- no
- nr
- nv
- ny
- oc
- oj
- om
- or
- os
- pa
- pi
- pl
- ps
- pt
- qu
- rm
- rn
- ro
- ru
- rw
- sa
- sc
- sd
- se
- sg
- si
- sk
- sl
- sm
- sn
- so
- sq
- sr
- ss
- st
- su
- sv
- sw
- ta
- te
- tg
- th
- ti
- tk
- tl
- tn
- to
- tr
- ts
- tt
- tw
- ty
- ug
- uk
- ur
- uz
- ve
- vi
- vo
- wa
- wo
- xh
- yi
- yo
- za
- zh
- zu
- fil
- cmn
- yue
- ars
- ary
- arz
- prs
- pes
- bho
- mai
- hif
- tzm
- kab
- ber
- srd
- ast
- lad
- lmo
- nap
- ckb
library_name: transformers
tags:
- speech
- audio
- automatic-speech-recognition
- asr
- multi-lingual
- transformers
- heep
- heep-universal
- entropy-based-curation
metrics:
- wer
pipeline_tag: automatic-speech-recognition
---

# HEEP Universal

**High Entropy Exponential Pruning for State-of-the-Art Multilingual ASR**

HEEP Universal is a state-of-the-art automatic speech recognition (ASR) model that demonstrates how strategic, entropy-based data curation can outperform brute-force data scaling. It reaches a composite word error rate (WER) of **3.10%** on English benchmarks while training only on carefully selected high-information samples, which challenges the "more data is better" paradigm.

## Model Overview

HEEP Universal supports transcription across **204 languages**, including a wide range of Indic and global languages, with consistent performance across domains such as meetings, earnings calls, broadcast media, and educational content. The model is optimized for high-precision, verbatim transcription: it captures spoken content word for word.

**Core Insight**: Strategic selection of high-entropy samples leads to better ASR models than training on larger but redundant datasets.

## HEEP Methodology

HEEP (High Entropy Exponential Pruning) is an entropy-based data curation methodology that prioritizes information density over data quantity. It identifies high-information training samples while progressively filtering redundant data, enabling efficient model training with significantly reduced computational resources.

### Mathematical Foundation

#### Sample Score (Equation 1)

The information score for each sample combines multiple entropy dimensions:

```
S(x) = α₁·H_acoustic(x) + α₂·H_phonetic(x) + α₃·H_linguistic(x) + α₄·H_contextual(x) + β·MI(x, D)
```

Where:
- `H_acoustic(x)`: Spectral/MFCC entropy measuring acoustic diversity
- `H_phonetic(x)`: Phoneme distribution entropy capturing phonetic complexity
- `H_linguistic(x)`: Vocabulary and syntax entropy measuring linguistic richness
- `H_contextual(x)`: Domain and discourse entropy
- `MI(x, D)`: Mutual information contribution relative to dataset
- `α₁...α₄, β`: Configurable weights (default: 0.25, 0.20, 0.25, 0.15, 0.15)

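As a minimal sketch of Equation 1, the weighted sum can be written as follows. It assumes the four entropy terms and the mutual information term have already been computed and normalized to comparable ranges; the estimators themselves are not shown, and the names `sample_score`, `DEFAULT_ALPHAS`, and the example values are illustrative rather than part of the HEEP framework.

```python
from typing import Mapping

# Default weights quoted above: alpha_1..alpha_4 for the entropy terms, beta for MI.
DEFAULT_ALPHAS = {
    "acoustic": 0.25,
    "phonetic": 0.20,
    "linguistic": 0.25,
    "contextual": 0.15,
}
DEFAULT_BETA = 0.15


def sample_score(
    entropies: Mapping[str, float],
    mutual_info: float,
    alphas: Mapping[str, float] = DEFAULT_ALPHAS,
    beta: float = DEFAULT_BETA,
) -> float:
    """Return S(x) = sum_i alpha_i * H_i(x) + beta * MI(x, D)."""
    weighted_entropy = sum(alphas[name] * entropies[name] for name in alphas)
    return weighted_entropy + beta * mutual_info


# Hypothetical, already-normalized values for a single sample.
example = {"acoustic": 0.8, "phonetic": 0.6, "linguistic": 0.4, "contextual": 0.5}
score = sample_score(example, mutual_info=0.3)
print(f"S(x) = {score:.3f}")
```

The default weights sum to 1, so when every term lies in [0, 1] the score does too.
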
#### Mutual Information (Equation 2)

The mutual information between acoustic features and transcription:

```
I(x, y) = Σ_{j,ℓ} p(f_j, y_ℓ) log [p(f_j, y_ℓ) / (p(f_j)·p(y_ℓ))]
```

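The sum in Equation 2 can be evaluated directly once a joint distribution is available. The sketch below assumes the acoustic features have been discretized into bins `f_j` and that `joint[j][l]` holds the estimated joint probability `p(f_j, y_l)`; how that table is estimated is not specified here, and the function name is illustrative.

```python
import math
from typing import Sequence


def mutual_information(joint: Sequence[Sequence[float]]) -> float:
    """Return sum_{j,l} p(f_j, y_l) * log(p(f_j, y_l) / (p(f_j) * p(y_l))) in nats."""
    p_f = [sum(row) for row in joint]        # marginal p(f_j) over feature bins
    p_y = [sum(col) for col in zip(*joint)]  # marginal p(y_l) over transcription tokens
    total = 0.0
    for j, row in enumerate(joint):
        for l, p in enumerate(row):
            if p > 0.0:                      # cells with zero probability contribute nothing
                total += p * math.log(p / (p_f[j] * p_y[l]))
    return total


# Two toy joint tables: independent variables and perfectly coupled variables.
independent = [[0.25, 0.25], [0.25, 0.25]]
coupled = [[0.5, 0.0], [0.0, 0.5]]
print(mutual_information(independent), mutual_information(coupled))
```

Independent features and tokens give zero mutual information, while the perfectly coupled table gives `log 2` nats, the entropy of either variable.
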
#### Selection Criterion

A sample is kept only if its score exceeds a threshold `τ`:

```
D' = {x ∈ D : S(x) > τ}
```

#### Progressive Filtering (Equation 8)

The threshold is multiplied by a constant growth factor after each round, so for `growth_factor > 1` it increases exponentially across rounds:

```
τ_{k+1} = τ_k · growth_factor
```

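Unrolling the recurrence gives `τ_k = τ_0 · growth_factor^k`. The following sketch generates that schedule and applies the selection criterion for a single round; the starting threshold, growth factor, and scores are hypothetical values chosen only for illustration.

```python
def threshold_schedule(tau0: float, growth_factor: float, rounds: int) -> list[float]:
    """Return [tau_0, ..., tau_{rounds-1}] using tau_{k+1} = tau_k * growth_factor."""
    thresholds = []
    tau = tau0
    for _ in range(rounds):
        thresholds.append(tau)
        tau *= growth_factor
    return thresholds


def select(scores: dict[str, float], tau: float) -> dict[str, float]:
    """Keep the samples with S(x) > tau, i.e. D' = {x in D : S(x) > tau}."""
    return {sample_id: s for sample_id, s in scores.items() if s > tau}


# Hypothetical settings: start at 0.2 and grow by 50% per round.
schedule = threshold_schedule(tau0=0.2, growth_factor=1.5, rounds=4)
kept = select({"utt_a": 0.10, "utt_b": 0.50, "utt_c": 0.35}, tau=schedule[1])
print(schedule, kept)
```
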
#### Error-Aware Adaptation

After each training round, sample scores are adjusted based on the model's errors in that round, with `λ_err` and `λ_cross` weighting the two adjustment terms:

```
S'(x) = S(x) + λ_err·ErrorRelevance(x, errors_k) + λ_cross·CrossLingualOverlap(x)
```

### Algorithm Overview

```
Algorithm: HEEP Data Curation with Error-Aware Adaptation

Input: Dataset D, initial threshold τ₀, growth factor g
Output: Curated dataset D*

1. Initialize scorer with entropy estimators
2. Fit scorer to D (compute normalization stats, fit MI estimator)
3. D* ← D
4. k ← 0
5. While |D*| > min_samples AND k < max_rounds:
   a. For each x in D*:
        Compute S(x) = Σᵢ αᵢ·Hᵢ(x) + β·MI(x, D)
   b. If error_patterns available:
        Adjust S'(x) = S(x) + λ_err·ErrorRelevance(x) + λ_cross·CrossLingualOverlap(x)
   c. D* ← {x ∈ D* : S'(x) > τₖ}
   d. If train_callback: Train model on D*
   e. If eval_callback: Analyze errors, update error_patterns
   f. τₖ₊₁ ← τₖ · g
   g. k ← k + 1
6. Return D*
```

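The control flow of the algorithm can be sketched in a few lines of Python. This is a simplified illustration, not the HEEP implementation: scoring is delegated to a caller-supplied `score_fn`, the error-aware adjustment is an optional `adjust_fn` hook, and the training and evaluation callbacks (steps 5d and 5e) are left out. All function and parameter names are assumptions made for the example.

```python
from typing import Callable, Optional


def heep_curate(
    sample_ids: list[str],
    score_fn: Callable[[str], float],
    tau0: float,
    growth_factor: float,
    min_samples: int,
    max_rounds: int,
    adjust_fn: Optional[Callable[[str, float], float]] = None,
) -> list[str]:
    """Repeatedly keep samples whose (optionally adjusted) score exceeds tau_k."""
    curated = list(sample_ids)            # step 3: D* <- D
    tau = tau0
    k = 0                                 # step 4
    while len(curated) > min_samples and k < max_rounds:  # step 5
        kept = []
        for x in curated:
            s = score_fn(x)               # step 5a: S(x)
            if adjust_fn is not None:     # step 5b: S'(x), e.g. error-aware terms
                s = adjust_fn(x, s)
            if s > tau:                   # step 5c: threshold test
                kept.append(x)
        curated = kept
        tau *= growth_factor              # step 5f: raise the threshold
        k += 1                            # step 5g
    return curated                        # step 6


# Hypothetical precomputed scores for four utterances.
scores = {"a": 0.9, "b": 0.5, "c": 0.25, "d": 0.1}
result = heep_curate(list(scores), lambda x: scores[x], tau0=0.2,
                     growth_factor=2.0, min_samples=1, max_rounds=3)
print(result)
```

As in the pseudocode, the size check happens before each round, so a single round can reduce the set below `min_samples`; a stricter variant would check the size before committing each round's selection.
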
### Key Benefits

- Training on **10-20% of data** while matching or exceeding full-dataset performance
- Efficient multilingual model development with cross-lingual transfer
- Error-aware adaptive sample selection across training rounds
- Significant reduction in computational resources and training time

## Performance Benchmarks

### OpenASR Leaderboard Results

| Dataset                | WER (%) | RTFx   |
| ---------------------- | ------- | ------ |
| AMI Test               | 4.19    | 70.22  |
| Earnings22 Test        | 5.83    | 101.52 |
| GigaSpeech Test        | 4.99    | 131.09 |
| LibriSpeech Test Clean | 0.71    | 158.74 |
| LibriSpeech Test Other | 2.17    | 142.40 |
| SPGISpeech Test        | 1.10    | 170.85 |
| TedLium Test           | 1.43    | 153.34 |
| VoxPopuli Test         | 4.34    | 179.28 |

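For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, and insertions) between the reference and the hypothesis, divided by the number of reference words. The sketch below is a plain implementation of that definition. It applies no text normalization, whereas benchmark evaluations typically normalize casing and punctuation first, so it illustrates the metric rather than reproducing the table.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"WER = {wer:.2%}")
```
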
### Composite Results

- **Overall WER**: 3.10%
- **Average RTFx**: 146.23

*RTFx (inverse real-time factor) is the ratio of audio duration to processing time. Higher values mean faster processing.*

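To make the RTFx figure concrete, the short sketch below converts between RTFx and wall-clock processing time. The function names are illustrative, and measured throughput depends on hardware, batch size, and audio length, so the printed number is an interpretation of the reported average rather than a guarantee.

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio processed per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def processing_time(audio_seconds: float, rtfx_value: float) -> float:
    """Wall-clock seconds needed to transcribe the audio at a given RTFx."""
    if rtfx_value <= 0:
        raise ValueError("rtfx_value must be positive")
    return audio_seconds / rtfx_value


# One hour of audio at the reported average RTFx of 146.23.
seconds_needed = processing_time(3600.0, 146.23)
print(f"{seconds_needed:.1f} s")
```

At that rate an hour of audio takes roughly 25 seconds to transcribe.
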
## Model Details

- **Architecture**: Transformer-based encoder-decoder optimized for multilingual transcription
- **Languages**: 204 languages supported
- **Format**: Transformers compatible (safetensors)
- **Sampling Rate**: 16 kHz
- **Precision**: FP16/FP32 supported
- **Optimization**: Real-time inference capable with GPU acceleration

## Key Features

- **Exceptional Accuracy**: Achieves 3.10% WER across diverse English test sets
- **Real-Time Performance**: Average RTFx of 146.23 enables real-time applications
- **Verbatim Transcription**: Optimized for accurate, word-for-word transcription
- **Multi-Domain Excellence**: Superior performance across conversational, broadcast, and read speech
- **Multilingual Support**: 204 languages with cross-lingual transfer learning
- **HEEP-Curated Training**: Strategic entropy-based data selection for maximum information density

## Usage

```python
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
import torch

# Use the GPU with half precision when available, otherwise fall back to CPU/FP32.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the model weights from the Hub in safetensors format.
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "bc7ec356/heep-universal",
    torch_dtype=torch_dtype,
    use_safetensors=True,
)
model.to(device)

# The processor bundles the tokenizer and the audio feature extractor.
processor = AutoProcessor.from_pretrained("bc7ec356/heep-universal")

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

# Transcribe a local audio file and print the recognized text.
result = pipe("audio.wav")
print(result["text"])
```

## Use Cases

HEEP Universal excels in various speech recognition scenarios:

- **Meeting Transcription**: High accuracy on conversational speech (AMI: 4.19% WER)
- **Financial Communications**: Specialized performance on earnings calls (Earnings22: 5.83% WER)
- **Broadcast Media**: Excellent results on news, podcasts, and media content
- **Educational Content**: Optimized for lectures and presentations
- **Customer Support**: Accurate transcription of support calls
- **Legal Documentation**: Professional-grade accuracy for legal proceedings
- **Medical Transcription**: High-quality transcription for medical consultations

## Performance Optimization Tips

- **GPU Acceleration**: Use `device="cuda"` for significantly faster inference
- **Precision**: Set `torch_dtype=torch.float16` for optimal speed on modern GPUs
- **Language Specification**: Specify the language code when it is known to improve accuracy and speed
- **Beam Size**: Use a beam width of 5 (`num_beams=5` in Transformers generation arguments) for best accuracy; reduce it for faster inference
- **Batch Processing**: Process multiple files with a single model instance for efficiency

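The tips on language, beam width, and batching can be combined in one helper. In the Transformers speech recognition pipeline, decoding options are forwarded through the `generate_kwargs` argument, the beam width is set with `num_beams`, and several files can be passed as a list together with `batch_size`. The sketch below takes an already constructed pipeline object `pipe` as an argument. It assumes the checkpoint accepts a Whisper-style `language` generation argument, which this card does not confirm, and the helper names are illustrative.

```python
from typing import Any, Callable, Optional, Sequence


def build_generate_kwargs(language: Optional[str] = None, num_beams: int = 5) -> dict[str, Any]:
    """Collect decoding options to forward to the model's generate() call."""
    kwargs: dict[str, Any] = {"num_beams": num_beams}
    if language is not None:
        # Assumption: the checkpoint accepts a Whisper-style `language` argument.
        kwargs["language"] = language
    return kwargs


def transcribe_files(
    pipe: Callable[..., Any],
    paths: Sequence[str],
    language: Optional[str] = None,
    num_beams: int = 5,
    batch_size: int = 4,
) -> list[str]:
    """Transcribe several files with one pipeline instance and return the texts."""
    results = pipe(
        list(paths),
        batch_size=batch_size,
        generate_kwargs=build_generate_kwargs(language, num_beams),
    )
    return [item["text"] for item in results]
```

With the `pipe` object from the Usage section, a call such as `transcribe_files(pipe, ["a.wav", "b.wav"], language="en")` would return one transcript per file. Lowering `num_beams` trades some accuracy for speed.
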
## Acknowledgments

HEEP Universal was developed using the HEEP framework for entropy-based data curation. We thank the open-source community for providing foundational tools that make this work possible.

## Citation

If you use this model in your research, please cite:

```bibtex
@article{anonymous2026heep,
  title={HEEP: High Entropy Exponential Pruning for State-of-the-Art ASR Through Strategic Data Curation},
  author={Anonymous},
  journal={Under Review},
  year={2026}
}
```