---
language:
# ISO 639-1 (official)
- aa
- ab
- ae
- af
- ak
- am
- an
- ar
- as
- av
- ay
- az
- ba
- be
- bg
- bh
- bi
- bm
- bn
- bo
- br
- bs
- ca
- ce
- ch
- co
- cr
- cs
- cu
- cv
- cy
- da
- de
- dv
- dz
- ee
- el
- en
- eo
- es
- et
- eu
- fa
- ff
- fi
- fj
- fo
- fr
- fy
- ga
- gd
- gl
- gn
- gu
- gv
- ha
- he
- hi
- ho
- hr
- ht
- hu
- hy
- hz
- ia
- id
- ie
- ig
- ii
- ik
- io
- is
- it
- iu
- ja
- jv
- ka
- kg
- ki
- kj
- kk
- kl
- km
- kn
- ko
- kr
- ks
- ku
- kv
- kw
- ky
- la
- lb
- lg
- li
- ln
- lo
- lt
- lu
- lv
- mg
- mh
- mi
- mk
- ml
- mn
- mr
- ms
- mt
- my
- na
- nb
- nd
- ne
- ng
- nl
- nn
- no
- nr
- nv
- ny
- oc
- oj
- om
- or
- os
- pa
- pi
- pl
- ps
- pt
- qu
- rm
- rn
- ro
- ru
- rw
- sa
- sc
- sd
- se
- sg
- si
- sk
- sl
- sm
- sn
- so
- sq
- sr
- ss
- st
- su
- sv
- sw
- ta
- te
- tg
- th
- ti
- tk
- tl
- tn
- to
- tr
- ts
- tt
- tw
- ty
- ug
- uk
- ur
- uz
- ve
- vi
- vo
- wa
- wo
- xh
- yi
- yo
- za
- zh
- zu
- fil # Filipino
- cmn # Mandarin Chinese
- yue # Cantonese
- ars # Najdi Arabic
- ary # Moroccan Arabic
- arz # Egyptian Arabic
- prs # Dari
- pes # Iranian Persian
- bho # Bhojpuri
- mai # Maithili
- hif # Fiji Hindi
- tzm # Central Atlas Tamazight
- kab # Kabyle
- ber # Berber (macro)
- srd # Sardinian
- ast # Asturian
- lad # Ladino
- lmo # Lombard
- nap # Neapolitan
- ckb # Central Kurdish (Sorani)
library_name: transformers
tags:
- speech
- audio
- automatic-speech-recognition
- asr
- multi-lingual
- transformers
- heep
- heep-universal
- entropy-based-curation
metrics:
- wer
pipeline_tag: automatic-speech-recognition
---
# HEEP Universal
**High Entropy Exponential Pruning for State-of-the-Art Multilingual ASR**
HEEP Universal is a state-of-the-art automatic speech recognition model that demonstrates how strategic entropy-based data curation outperforms brute-force data scaling. With a composite word error rate (WER) of **3.10%** on English benchmarks, it challenges the "more data is better" paradigm by training on carefully selected high-information samples.
## Model Overview
HEEP Universal supports transcription across **204 languages**, including a wide range of Indic and global languages, with consistent performance across domains such as meetings, earnings calls, broadcast media, and educational content. The model is optimized for high-precision, verbatim transcription that captures spoken content word for word.
**Core Insight**: Strategic selection of high-entropy samples leads to better ASR models than training on larger but redundant datasets.
## HEEP Methodology
HEEP (High Entropy Exponential Pruning) is an entropy-based data curation methodology that prioritizes information density over data quantity. It identifies high-information training samples while progressively filtering redundant data, enabling efficient model training with significantly reduced computational resources.
### Mathematical Foundation
#### Sample Score (Equation 1)
The information score for each sample combines multiple entropy dimensions:
```
S(x) = α₁·H_acoustic(x) + α₂·H_phonetic(x) + α₃·H_linguistic(x) + α₄·H_contextual(x) + β·MI(x, D)
```
Where:
- `H_acoustic(x)`: Spectral/MFCC entropy measuring acoustic diversity
- `H_phonetic(x)`: Phoneme distribution entropy capturing phonetic complexity
- `H_linguistic(x)`: Vocabulary and syntax entropy measuring linguistic richness
- `H_contextual(x)`: Domain and discourse entropy
- `MI(x, D)`: Mutual information contribution of sample `x` relative to the dataset `D`
- `α₁...α₄, β`: Configurable weights (default: 0.25, 0.20, 0.25, 0.15, 0.15)
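As a minimal sketch of Equation 1, assume each entropy term and the MI term have already been computed and normalized to a comparable scale. The `EntropyTerms` container and the `sample_score` function are illustrative names, not part of any released HEEP code; only the default weights come from the list above.

```python
from dataclasses import dataclass

# Default weights quoted above: (α₁, α₂, α₃, α₄) and β.
DEFAULT_ALPHAS = (0.25, 0.20, 0.25, 0.15)
DEFAULT_BETA = 0.15


@dataclass(frozen=True)
class EntropyTerms:
    """Precomputed per-sample terms (hypothetical container)."""
    acoustic: float
    phonetic: float
    linguistic: float
    contextual: float
    mutual_info: float


def sample_score(terms, alphas=DEFAULT_ALPHAS, beta=DEFAULT_BETA):
    """Weighted sum S(x) from Equation 1."""
    entropies = (terms.acoustic, terms.phonetic, terms.linguistic, terms.contextual)
    weighted = sum(a * h for a, h in zip(alphas, entropies))
    return weighted + beta * terms.mutual_info
```

Because the default weights sum to 1, a sample whose terms all lie in [0, 1] also receives a score in [0, 1], which makes a single threshold τ easier to interpret.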
#### Mutual Information (Equation 2)
The mutual information between acoustic features and transcription:
```
I(x, y) = Σ_{j,ℓ} p(f_j, y_ℓ) log [p(f_j, y_ℓ) / (p(f_j)·p(y_ℓ))]
```
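Equation 2 can be evaluated directly once the acoustic features and transcription units have been discretized into a joint probability table. The sketch below assumes such a table is given; how the probabilities are estimated is not specified in this card, and `mutual_information` is an illustrative name.

```python
import math


def mutual_information(joint):
    """Return Σ p(f_j, y_l) · log[p(f_j, y_l) / (p(f_j) · p(y_l))] in nats.

    `joint[j][l]` holds p(f_j, y_l); all entries are assumed to sum to 1.
    """
    p_f = [sum(row) for row in joint]        # marginal over features
    p_y = [sum(col) for col in zip(*joint)]  # marginal over transcript units
    total = 0.0
    for j, row in enumerate(joint):
        for l, p in enumerate(row):
            if p > 0.0:  # a zero-probability cell contributes nothing
                total += p * math.log(p / (p_f[j] * p_y[l]))
    return total
```

Two sanity checks follow from the definition: an independent table such as `[[0.25, 0.25], [0.25, 0.25]]` gives zero, and a perfectly correlated table such as `[[0.5, 0.0], [0.0, 0.5]]` gives `log 2`.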
#### Selection Criterion
Samples are selected based on a threshold:
```
D' = {x ∈ D : S(x) > τ}
```
#### Progressive Filtering (Equation 8)
The threshold grows geometrically across rounds, so with a growth factor above 1 the threshold in round k is τ₀ · growth_factor^k:
```
τ_{k+1} = τ_k · growth_factor
```
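The selection criterion and the threshold update can be sketched together. The function names are illustrative, and scores are assumed to be precomputed and keyed by sample ID.

```python
def threshold_schedule(tau0, growth_factor, rounds):
    """Thresholds τ_0 .. τ_{rounds-1}, using τ_k = τ_0 · growth_factor**k."""
    return [tau0 * growth_factor ** k for k in range(rounds)]


def select(scores, tau):
    """Keep the IDs of samples whose score strictly exceeds τ."""
    return {sample_id for sample_id, s in scores.items() if s > tau}
```

For example, `threshold_schedule(0.5, 2.0, 3)` yields `[0.5, 1.0, 2.0]`. The closed form follows from applying the update rule k times to τ₀.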
#### Error-Aware Adaptation
After each training round, sample scores are adjusted based on model errors:
```
S'(x) = S(x) + λ_err·ErrorRelevance(x, errors_k) + λ_cross·CrossLingualOverlap(x)
```
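The card does not define `ErrorRelevance` or `CrossLingualOverlap`, nor default values for `λ_err` and `λ_cross`. The sketch below therefore uses a toy stand-in for error relevance (the share of a sample's tokens that the model recently misrecognized) and hypothetical λ defaults, purely to show how the adjustment combines the terms.

```python
def error_relevance(sample_tokens, error_tokens):
    """Toy placeholder: fraction of the sample's tokens found in the error set."""
    if not sample_tokens:
        return 0.0
    hits = sum(1 for tok in sample_tokens if tok in error_tokens)
    return hits / len(sample_tokens)


def adjusted_score(base_score, err_rel, cross_overlap,
                   lambda_err=0.1, lambda_cross=0.05):
    """S'(x) = S(x) + λ_err·ErrorRelevance + λ_cross·CrossLingualOverlap.

    The λ defaults are assumptions for illustration only.
    """
    return base_score + lambda_err * err_rel + lambda_cross * cross_overlap
```

Since both added terms are non-negative here, the adjustment can only raise a sample's score, so samples related to recent errors are more likely to survive the next, stricter threshold.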
### Algorithm Overview
```
Algorithm: HEEP Data Curation with Error-Aware Adaptation
Input: Dataset D, initial threshold τ₀, growth factor g
Output: Curated dataset D*

1. Initialize scorer with entropy estimators
2. Fit scorer to D (compute normalization stats, fit MI estimator)
3. D* ← D
4. k ← 0
5. While |D*| > min_samples AND k < max_rounds:
   a. For each x in D*:
        Compute S(x) = Σᵢ αᵢ·Hᵢ(x) + β·MI(x, D)
   b. If error_patterns available:
        Adjust S'(x) = S(x) + λ_err·ErrorRelevance(x) + λ_cross·CrossLingualOverlap(x)
      Else:
        S'(x) ← S(x)
   c. D* ← {x ∈ D* : S'(x) > τₖ}
   d. If train_callback: Train model on D*
   e. If eval_callback: Analyze errors, update error_patterns
   f. τₖ₊₁ ← τₖ · g
   g. k ← k + 1
6. Return D*
```
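The loop above can be sketched in a few lines of Python. This is a simplified illustration rather than the HEEP implementation: base scores are assumed to be precomputed once, the optional `adjust` callable stands in for the error-aware step, and a single `on_round` callback stands in for the training and evaluation callbacks. All names are hypothetical.

```python
def heep_curate(base_scores, tau0, growth_factor,
                min_samples=1, max_rounds=10,
                adjust=None, on_round=None):
    """Iteratively prune samples whose (adjusted) score is not above τ_k.

    base_scores: dict mapping sample ID -> S(x)
    adjust:      optional callable (sample_id, score) -> adjusted score S'(x)
    on_round:    optional callable (k, kept_ids), e.g. to train and evaluate
    """
    kept = set(base_scores)
    tau = tau0
    k = 0
    while len(kept) > min_samples and k < max_rounds:
        # Steps a-b: score each remaining sample, adjusting when possible.
        scored = {
            x: adjust(x, base_scores[x]) if adjust else base_scores[x]
            for x in kept
        }
        # Step c: keep only samples strictly above the current threshold.
        kept = {x for x, s in scored.items() if s > tau}
        # Steps d-e: hand the surviving set to the caller.
        if on_round:
            on_round(k, kept)
        # Steps f-g: raise the threshold and advance the round counter.
        tau *= growth_factor
        k += 1
    return kept
```

With scores `{"a": 0.2, "b": 0.5, "c": 0.9, "d": 1.5}`, `tau0=0.3`, and `growth_factor=2.0`, the kept set shrinks to `{"b", "c", "d"}`, then `{"c", "d"}`, then `{"d"}`. As in the pseudocode, the size check runs before each round's filtering, so a round can leave fewer than `min_samples` samples.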
### Key Benefits
- Training on **10-20% of data** while matching or exceeding full-dataset performance
- Efficient multilingual model development with cross-lingual transfer
- Error-aware adaptive sample selection across training rounds
- Significant reduction in computational resources and training time
## Performance Benchmarks
### OpenASR Leaderboard Results
| Dataset | WER (%) | RTFx |
| ---------------------- | ------- | ------ |
| AMI Test | 4.19 | 70.22 |
| Earnings22 Test | 5.83 | 101.52 |
| GigaSpeech Test | 4.99 | 131.09 |
| LibriSpeech Test Clean | 0.71 | 158.74 |
| LibriSpeech Test Other | 2.17 | 142.40 |
| SPGISpeech Test | 1.10 | 170.85 |
| TedLium Test | 1.43 | 153.34 |
| VoxPopuli Test | 4.34 | 179.28 |
### Composite Results
- **Overall WER**: 3.10%
- **Average RTFx**: 146.23
*RTFx (inverse real-time factor) is the ratio of audio duration to processing time. Higher values mean faster processing.*
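The Overall WER figure is consistent with an unweighted mean of the eight per-dataset WERs in the table. That averaging rule is inferred from the numbers, not stated in this card. The short sketch below reproduces it and also shows how an RTFx value is obtained from two timings; `rtfx` is an illustrative helper.

```python
from statistics import mean

# Per-dataset WER (%) copied from the table above.
WER = {
    "AMI": 4.19,
    "Earnings22": 5.83,
    "GigaSpeech": 4.99,
    "LibriSpeech Clean": 0.71,
    "LibriSpeech Other": 2.17,
    "SPGISpeech": 1.10,
    "TedLium": 1.43,
    "VoxPopuli": 4.34,
}

# Unweighted mean across datasets (about 3.095, reported as 3.10).
overall_wer = mean(WER.values())


def rtfx(audio_seconds, processing_seconds):
    """Seconds of audio transcribed per second of compute."""
    return audio_seconds / processing_seconds
```

For instance, transcribing one hour of audio in 36 seconds corresponds to `rtfx(3600, 36) == 100.0`.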
## Model Details
- **Architecture**: Transformer-based encoder-decoder optimized for multilingual transcription
- **Languages**: 204 languages supported
- **Format**: Transformers compatible (safetensors)
- **Sampling Rate**: 16 kHz
- **Precision**: FP16/FP32 supported
- **Optimization**: Real-time inference capable with GPU acceleration
## Key Features
- **Exceptional Accuracy**: Achieves 3.10% WER across diverse English test sets
- **Real-Time Performance**: Average RTFx of 146.23 enables real-time applications
- **Verbatim Transcription**: Optimized for accurate, word-for-word transcription
- **Multi-Domain Excellence**: Superior performance across conversational, broadcast, and read speech
- **Multilingual Support**: 204 languages with cross-lingual transfer learning
- **HEEP-Curated Training**: Strategic entropy-based data selection for maximum information density
## Usage
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Use the GPU with half precision when available; otherwise fall back to CPU/FP32.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the model weights (safetensors) and move them to the target device.
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "bc7ec356/heep-universal",
    torch_dtype=torch_dtype,
    use_safetensors=True,
)
model.to(device)

# The processor bundles the tokenizer and the audio feature extractor.
processor = AutoProcessor.from_pretrained("bc7ec356/heep-universal")

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

# Transcribe a local audio file (16 kHz audio is expected by the model).
result = pipe("audio.wav")
print(result["text"])
```
## Use Cases
HEEP Universal excels in various speech recognition scenarios:
- **Meeting Transcription**: High accuracy on conversational speech (AMI: 4.19% WER)
- **Financial Communications**: Specialized performance on earnings calls (Earnings22: 5.83% WER)
- **Broadcast Media**: Excellent results on news, podcasts, and media content
- **Educational Content**: Optimized for lectures and presentations
- **Customer Support**: Accurate transcription of support calls
- **Legal Documentation**: Professional-grade accuracy for legal proceedings
- **Medical Transcription**: High-quality transcription for medical consultations
## Performance Optimization Tips
- **GPU Acceleration**: Use `device="cuda"` for significantly faster inference
- **Precision**: Set `torch_dtype=torch.float16` for optimal speed on modern GPUs
- **Language Specification**: Specify language code when known to improve accuracy and speed
- **Beam Size**: Use `num_beams=5` (passed through `generate_kwargs` in the `transformers` pipeline) for best accuracy; reduce it for faster inference
- **Batch Processing**: Process multiple files with a single model instance for efficiency
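With the `transformers` pipeline from the Usage section, generation options such as beam width are passed as `num_beams` inside `generate_kwargs`, and several files can be processed in one call with `batch_size`. The helper below only assembles those call-time arguments. The `language` key follows the Whisper-style convention and is an assumption about this checkpoint, and `build_asr_kwargs` is a hypothetical name.

```python
def build_asr_kwargs(language=None, num_beams=5, batch_size=8):
    """Assemble call-time arguments for an ASR pipeline.

    `language` is forwarded to generation only when given; whether this
    checkpoint accepts it is an assumption (it does for Whisper-style models).
    """
    generate_kwargs = {"num_beams": num_beams}
    if language is not None:
        generate_kwargs["language"] = language
    return {"generate_kwargs": generate_kwargs, "batch_size": batch_size}


# Usage with the `pipe` object from the Usage section (not executed here):
#   results = pipe(["a.wav", "b.wav"], **build_asr_kwargs(language="en"))
#   for r in results:
#       print(r["text"])
```

Lowering `num_beams` to 1 switches to greedy decoding, which trades some accuracy for speed.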
## Acknowledgments
HEEP Universal was developed using the HEEP framework for entropy-based data curation. We thank the open-source community for providing foundational tools that make this work possible.
## Citation
If you use this model in your research, please cite:
```bibtex
@article{anonymous2026heep,
title={HEEP: High Entropy Exponential Pruning for State-of-the-Art ASR Through Strategic Data Curation},
author={Anonymous},
journal={Under Review},
year={2026}
}
```