---
language:
  # ISO 639-1 (official)
  - aa
  - ab
  - ae
  - af
  - ak
  - am
  - an
  - ar
  - as
  - av
  - ay
  - az
  - ba
  - be
  - bg
  - bh
  - bi
  - bm
  - bn
  - bo
  - br
  - bs
  - ca
  - ce
  - ch
  - co
  - cr
  - cs
  - cu
  - cv
  - cy
  - da
  - de
  - dv
  - dz
  - ee
  - el
  - en
  - eo
  - es
  - et
  - eu
  - fa
  - ff
  - fi
  - fj
  - fo
  - fr
  - fy
  - ga
  - gd
  - gl
  - gn
  - gu
  - gv
  - ha
  - he
  - hi
  - ho
  - hr
  - ht
  - hu
  - hy
  - hz
  - ia
  - id
  - ie
  - ig
  - ii
  - ik
  - io
  - is
  - it
  - iu
  - ja
  - jv
  - ka
  - kg
  - ki
  - kj
  - kk
  - kl
  - km
  - kn
  - ko
  - kr
  - ks
  - ku
  - kv
  - kw
  - ky
  - la
  - lb
  - lg
  - li
  - ln
  - lo
  - lt
  - lu
  - lv
  - mg
  - mh
  - mi
  - mk
  - ml
  - mn
  - mr
  - ms
  - mt
  - my
  - na
  - nb
  - nd
  - ne
  - ng
  - nl
  - nn
  - no
  - nr
  - nv
  - ny
  - oc
  - oj
  - om
  - or
  - os
  - pa
  - pi
  - pl
  - ps
  - pt
  - qu
  - rm
  - rn
  - ro
  - ru
  - rw
  - sa
  - sc
  - sd
  - se
  - sg
  - si
  - sk
  - sl
  - sm
  - sn
  - so
  - sq
  - sr
  - ss
  - st
  - su
  - sv
  - sw
  - ta
  - te
  - tg
  - th
  - ti
  - tk
  - tl
  - tn
  - to
  - tr
  - ts
  - tt
  - tw
  - ty
  - ug
  - uk
  - ur
  - uz
  - ve
  - vi
  - vo
  - wa
  - wo
  - xh
  - yi
  - yo
  - za
  - zh
  - zu
  - fil   # Filipino
  - cmn   # Mandarin Chinese
  - yue   # Cantonese
  - ars   # Najdi Arabic
  - ary   # Moroccan Arabic
  - arz   # Egyptian Arabic
  - prs   # Dari
  - pes   # Iranian Persian
  - bho   # Bhojpuri
  - mai   # Maithili
  - hif   # Fiji Hindi
  - tzm   # Central Atlas Tamazight
  - kab   # Kabyle
  - ber   # Berber languages (collective code)
  - srd   # Sardinian
  - ast   # Asturian
  - lad   # Ladino
  - lmo   # Lombard
  - nap   # Neapolitan
  - ckb   # Central Kurdish (Sorani)

library_name: transformers
tags:
- speech
- audio
- automatic-speech-recognition
- asr
- multi-lingual
- transformers
- heep
- heep-universal
- entropy-based-curation
metrics:
- wer
pipeline_tag: automatic-speech-recognition
---

# HEEP Universal

**High Entropy Exponential Pruning for State-of-the-Art Multilingual ASR**

HEEP Universal is a state-of-the-art automatic speech recognition (ASR) model that demonstrates how strategic entropy-based data curation outperforms brute-force data scaling. With a composite word error rate (WER) of **3.10%** across eight English benchmarks, it challenges the "more data is better" paradigm by training on carefully selected high-information samples.

## Model Overview

HEEP Universal supports transcription across **204 languages**, including a wide range of Indic and global languages, with consistent performance across domains such as meetings, earnings calls, broadcast media, and educational content. The model is optimized for high-precision, verbatim transcription, capturing spoken content word for word with high fidelity.

**Core Insight**: Strategic selection of high-entropy samples leads to better ASR models than training on larger but redundant datasets.

## HEEP Methodology

HEEP (High Entropy Exponential Pruning) is an entropy-based data curation methodology that prioritizes information density over data quantity. It identifies high-information training samples while progressively filtering redundant data, enabling efficient model training with significantly reduced computational resources.

### Mathematical Foundation

#### Sample Score (Equation 1)

The information score for each sample combines multiple entropy dimensions:

```
S(x) = α₁·H_acoustic(x) + α₂·H_phonetic(x) + α₃·H_linguistic(x) + α₄·H_contextual(x) + β·MI(x, D)
```

Where:
- `H_acoustic(x)`: Spectral/MFCC entropy measuring acoustic diversity
- `H_phonetic(x)`: Phoneme distribution entropy capturing phonetic complexity
- `H_linguistic(x)`: Vocabulary and syntax entropy measuring linguistic richness
- `H_contextual(x)`: Domain and discourse entropy
- `MI(x, D)`: Mutual information contribution of the sample relative to the dataset `D`
- `α₁...α₄, β`: Configurable weights (defaults, in that order: 0.25, 0.20, 0.25, 0.15, 0.15, which sum to 1)
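
As a minimal sketch of Equation 1, the weighted sum can be written as follows. The `EntropyFeatures` container and the `sample_score` function are hypothetical names introduced for illustration, and the sketch assumes each entropy term has already been estimated and normalized to a comparable range; it does not show how HEEP computes the individual terms.

```python
import math
from dataclasses import dataclass

# Default weights listed above: (α₁, α₂, α₃, α₄) and β.
DEFAULT_ALPHAS = (0.25, 0.20, 0.25, 0.15)
DEFAULT_BETA = 0.15


@dataclass(frozen=True)
class EntropyFeatures:
    """Hypothetical container for one sample's precomputed, normalized terms."""
    acoustic: float
    phonetic: float
    linguistic: float
    contextual: float
    mutual_info: float


def sample_score(f, alphas=DEFAULT_ALPHAS, beta=DEFAULT_BETA):
    """Weighted sum S(x) of the four entropy terms plus the MI term."""
    a1, a2, a3, a4 = alphas
    return (
        a1 * f.acoustic
        + a2 * f.phonetic
        + a3 * f.linguistic
        + a4 * f.contextual
        + beta * f.mutual_info
    )


# Illustrative values only, not taken from any real dataset.
example = EntropyFeatures(acoustic=0.8, phonetic=0.5, linguistic=0.6,
                          contextual=0.4, mutual_info=0.2)
print(round(sample_score(example), 3))  # 0.54
```

Because the default weights sum to 1, a sample whose terms all equal 1 scores 1, which keeps `S(x)` on the same scale as its normalized inputs.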

#### Mutual Information (Equation 2)

The mutual information between acoustic features and transcription:

```
I(x, y) = Σ_{j,ℓ} p(f_j, y_ℓ) log [p(f_j, y_ℓ) / (p(f_j)·p(y_ℓ))]
```
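
To make Equation 2 concrete, the sketch below evaluates the sum on a small joint table. It assumes `f_j` indexes discretized acoustic feature values and `y_ℓ` indexes transcription units, and that the joint distribution is given as a table of counts or probabilities; the function name `mutual_information` is hypothetical, and this is not necessarily the estimator HEEP uses.

```python
import math


def mutual_information(joint):
    """Mutual information (in nats) of a joint table p(f_j, y_l).

    `joint` is a list of rows; entries may be counts or probabilities.
    """
    total = sum(sum(row) for row in joint)
    p = [[v / total for v in row] for row in joint]   # normalize to probabilities
    p_f = [sum(row) for row in p]                     # marginal p(f_j)
    p_y = [sum(col) for col in zip(*p)]               # marginal p(y_l)

    mi = 0.0
    for j, row in enumerate(p):
        for l, p_jl in enumerate(row):
            if p_jl > 0:                              # 0 * log 0 is taken as 0
                mi += p_jl * math.log(p_jl / (p_f[j] * p_y[l]))
    return mi


# Independent variables carry no shared information.
print(mutual_information([[1, 1], [1, 1]]))
# Perfectly coupled binary variables share log(2) nats.
print(mutual_information([[1, 0], [0, 1]]))
```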

#### Selection Criterion

Samples are retained when their score exceeds a threshold `τ`:

```
D' = {x ∈ D : S(x) > τ}
```

#### Progressive Filtering (Equation 8)

The threshold grows geometrically across rounds, so that `τ_k = τ₀ · growth_factor^k` with `growth_factor > 1` (written `g` in the algorithm overview):

```
τ_{k+1} = τ_k · growth_factor
```
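
The selection criterion and the threshold update can be sketched together. The helper names `threshold_schedule` and `select`, and the numeric values, are hypothetical; the sketch assumes scores are already available as a mapping from sample id to `S(x)`.

```python
import math


def threshold_schedule(tau0, growth_factor, rounds):
    """Thresholds τ_0, τ_1, ... with τ_{k+1} = τ_k · growth_factor."""
    return [tau0 * growth_factor ** k for k in range(rounds)]


def select(scores, tau):
    """Keep the samples whose score is strictly greater than tau."""
    return {sample_id: s for sample_id, s in scores.items() if s > tau}


# Illustrative scores, not taken from any real dataset.
scores = {"utt1": 0.95, "utt2": 0.70, "utt3": 0.45, "utt4": 0.30}
for k, tau in enumerate(threshold_schedule(tau0=0.4, growth_factor=1.25, rounds=3)):
    kept = select(scores, tau)
    print(k, round(tau, 3), sorted(kept))
```

Each round raises the bar by a constant factor, so the retained set can only shrink or stay the same from one round to the next.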

#### Error-Aware Adaptation

After each training round, sample scores are adjusted based on the model's errors, with `λ_err` and `λ_cross` weighting the two adjustment terms:

```
S'(x) = S(x) + λ_err·ErrorRelevance(x, errors_k) + λ_cross·CrossLingualOverlap(x)
```

### Algorithm Overview

```
Algorithm: HEEP Data Curation with Error-Aware Adaptation

Input: Dataset D, initial threshold τ₀, growth factor g, min_samples, max_rounds
Output: Curated dataset D*

1. Initialize scorer with entropy estimators
2. Fit scorer to D (compute normalization stats, fit MI estimator)
3. D* ← D
4. k ← 0
5. While |D*| > min_samples AND k < max_rounds:
    a. For each x in D*:
        Compute S(x) = Σᵢ αᵢ·Hᵢ(x) + β·MI(x, D)
    b. If error_patterns available:
        Adjust S'(x) = S(x) + λ_err·ErrorRelevance(x) + λ_cross·CrossLingualOverlap(x)
       Otherwise:
        S'(x) = S(x)
    c. D* ← {x ∈ D* : S'(x) > τₖ}
    d. If train_callback: Train model on D*
    e. If eval_callback: Analyze errors, update error_patterns
    f. τₖ₊₁ ← τₖ · g
    g. k ← k + 1
6. Return D*
```
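
The loop above can be sketched in Python under simplifying assumptions: base scores `S(x)` are precomputed once rather than recomputed every round, and the error-aware terms are supplied by a caller-provided `adjust` callable that stands in for `λ_err·ErrorRelevance + λ_cross·CrossLingualOverlap`. The function name `heep_curate` and its parameters are hypothetical, and the training and evaluation callbacks are omitted.

```python
from typing import Callable, Dict, Hashable, Optional

Scores = Dict[Hashable, float]


def heep_curate(
    base_scores: Scores,
    tau0: float,
    growth: float,
    min_samples: int = 1,
    max_rounds: int = 10,
    adjust: Optional[Callable[[Hashable, int], float]] = None,
) -> Scores:
    """Progressively filter samples, mirroring steps 3-6 of the algorithm.

    base_scores: sample id -> precomputed S(x).
    adjust:      optional (sample_id, round) -> additive error-aware term.
    """
    kept = dict(base_scores)          # step 3: D* <- D
    tau, k = tau0, 0                  # step 4
    while len(kept) > min_samples and k < max_rounds:
        # Steps a-b: adjusted score S'(x); equals S(x) without error patterns.
        adjusted = {
            sid: s + (adjust(sid, k) if adjust is not None else 0.0)
            for sid, s in kept.items()
        }
        # Step c: keep samples whose adjusted score exceeds the threshold.
        kept = {sid: kept[sid] for sid, s in adjusted.items() if s > tau}
        # Steps d-e (training and error analysis) would run here.
        tau *= growth                 # step f
        k += 1                        # step g
    return kept


# Illustrative scores only.
scores = {"a": 0.95, "b": 0.70, "c": 0.45, "d": 0.30}
print(sorted(heep_curate(scores, tau0=0.4, growth=1.5, max_rounds=2)))
```

As in the pseudocode, the size check happens before each round, so a single round may still reduce the set below `min_samples`.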

### Key Benefits

- Training on **10-20% of the data** while matching or exceeding full-dataset performance
- Efficient multilingual model development with cross-lingual transfer
- Error-aware adaptive sample selection across training rounds
- Significant reduction in computational resources and training time

## Performance Benchmarks

### OpenASR Leaderboard Results

| Dataset                | WER (%) | RTFx   |
| ---------------------- | ------- | ------ |
| AMI Test               | 4.19    | 70.22  |
| Earnings22 Test        | 5.83    | 101.52 |
| GigaSpeech Test        | 4.99    | 131.09 |
| LibriSpeech Test Clean | 0.71    | 158.74 |
| LibriSpeech Test Other | 2.17    | 142.40 |
| SPGISpeech Test        | 1.10    | 170.85 |
| TedLium Test           | 1.43    | 153.34 |
| VoxPopuli Test         | 4.34    | 179.28 |

### Composite Results
- **Overall WER**: 3.10%
- **Average RTFx**: 146.23

*RTFx (inverse real-time factor) is the ratio of audio duration to processing time. Higher values mean faster processing.*
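
The arithmetic behind these figures can be checked with a short script. It assumes the composite WER is the unweighted mean of the eight per-dataset values in the table, rounded half up to two decimals, and uses RTFx as audio duration divided by processing time; the helper `processing_seconds` is a hypothetical name.

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-dataset WER (%) copied from the table above.
wer = {
    "AMI": Decimal("4.19"),
    "Earnings22": Decimal("5.83"),
    "GigaSpeech": Decimal("4.99"),
    "LibriSpeech clean": Decimal("0.71"),
    "LibriSpeech other": Decimal("2.17"),
    "SPGISpeech": Decimal("1.10"),
    "TedLium": Decimal("1.43"),
    "VoxPopuli": Decimal("4.34"),
}

# Unweighted mean, rounded half up to two decimals (an assumption).
mean_wer = sum(wer.values(), Decimal(0)) / len(wer)
composite = mean_wer.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
print(composite)  # 3.10


def processing_seconds(audio_seconds, rtfx):
    """Approximate processing time implied by an RTFx value."""
    return audio_seconds / rtfx


# One hour of audio at the reported average RTFx takes roughly 25 seconds.
print(round(processing_seconds(3600, 146.23), 1))
```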

## Model Details

- **Architecture**: Transformer-based encoder-decoder optimized for multilingual transcription
- **Languages**: 204 languages supported
- **Format**: Transformers compatible (safetensors)
- **Sampling Rate**: 16 kHz
- **Precision**: FP16/FP32 supported
- **Optimization**: Real-time inference capable with GPU acceleration

## Key Features

- **Exceptional Accuracy**: Achieves 3.10% WER across diverse English test sets
- **Real-Time Performance**: Average RTFx of 146.23 enables real-time applications
- **Verbatim Transcription**: Optimized for accurate, word-for-word transcription
- **Multi-Domain Excellence**: Superior performance across conversational, broadcast, and read speech
- **Multilingual Support**: 204 languages with cross-lingual transfer learning
- **HEEP-Curated Training**: Strategic entropy-based data selection for maximum information density

## Usage

```python
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "bc7ec356/heep-universal",
    torch_dtype=torch_dtype,
    use_safetensors=True,
)
model.to(device)

processor = AutoProcessor.from_pretrained("bc7ec356/heep-universal")

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

result = pipe("audio.wav")
print(result["text"])
```

## Use Cases

HEEP Universal excels in various speech recognition scenarios:

- **Meeting Transcription**: High accuracy on conversational speech (AMI: 4.19% WER)
- **Financial Communications**: Specialized performance on earnings calls (Earnings22: 5.83% WER)
- **Broadcast Media**: Excellent results on news, podcasts, and media content
- **Educational Content**: Optimized for lectures and presentations
- **Customer Support**: Accurate transcription of support calls
- **Legal Documentation**: Professional-grade accuracy for legal proceedings
- **Medical Transcription**: High-quality transcription for medical consultations

## Performance Optimization Tips

- **GPU Acceleration**: Use `device="cuda"` for significantly faster inference
- **Precision**: Set `torch_dtype=torch.float16` for optimal speed on modern GPUs
- **Language Specification**: Specify language code when known to improve accuracy and speed
- **Beam Size**: Use `num_beams=5` (passed through `generate_kwargs` in the `transformers` pipeline) for best accuracy; reduce it for faster inference
- **Batch Processing**: Process multiple files with a single model instance for efficiency
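
A hedged sketch of how these tips might be combined when calling the `transformers` ASR pipeline: beam width is set with `num_beams` inside `generate_kwargs`, and several files can be processed per call with `batch_size`. The helper `asr_call_kwargs` is a hypothetical name, and passing `language` through `generate_kwargs` assumes the model's `generate()` accepts a Whisper-style `language` argument, which this card does not confirm.

```python
from typing import Any, Dict, Optional


def asr_call_kwargs(
    language: Optional[str] = None,
    num_beams: int = 5,
    batch_size: int = 8,
) -> Dict[str, Any]:
    """Build keyword arguments for a call to an ASR pipeline object."""
    generate_kwargs: Dict[str, Any] = {"num_beams": num_beams}
    if language is not None:
        # Assumption: the model supports a Whisper-style `language` argument.
        generate_kwargs["language"] = language
    return {"batch_size": batch_size, "generate_kwargs": generate_kwargs}


print(asr_call_kwargs(language="en"))

# With the `pipe` object built in the Usage section (not executed here):
# results = pipe(["a.wav", "b.wav"], **asr_call_kwargs(language="en"))
# for r in results:
#     print(r["text"])
```

Lowering `num_beams` to 1 falls back to greedy decoding, which trades some accuracy for speed.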

## Acknowledgments

HEEP Universal was developed using the HEEP framework for entropy-based data curation. We thank the open-source community for providing foundational tools that make this work possible.

## Citation

If you use this model in your research, please cite:

```bibtex
@article{anonymous2026heep,
  title={HEEP: High Entropy Exponential Pruning for State-of-the-Art ASR Through Strategic Data Curation},
  author={Anonymous},
  journal={Under Review},
  year={2026}
}
```