---
language:
# ISO 639-1 (official)
- aa
- ab
- ae
- af
- ak
- am
- an
- ar
- as
- av
- ay
- az
- ba
- be
- bg
- bh
- bi
- bm
- bn
- bo
- br
- bs
- ca
- ce
- ch
- co
- cr
- cs
- cu
- cv
- cy
- da
- de
- dv
- dz
- ee
- el
- en
- eo
- es
- et
- eu
- fa
- ff
- fi
- fj
- fo
- fr
- fy
- ga
- gd
- gl
- gn
- gu
- gv
- ha
- he
- hi
- ho
- hr
- ht
- hu
- hy
- hz
- ia
- id
- ie
- ig
- ii
- ik
- io
- is
- it
- iu
- ja
- jv
- ka
- kg
- ki
- kj
- kk
- kl
- km
- kn
- ko
- kr
- ks
- ku
- kv
- kw
- ky
- la
- lb
- lg
- li
- ln
- lo
- lt
- lu
- lv
- mg
- mh
- mi
- mk
- ml
- mn
- mr
- ms
- mt
- my
- na
- nb
- nd
- ne
- ng
- nl
- nn
- no
- nr
- nv
- ny
- oc
- oj
- om
- or
- os
- pa
- pi
- pl
- ps
- pt
- qu
- rm
- rn
- ro
- ru
- rw
- sa
- sc
- sd
- se
- sg
- si
- sk
- sl
- sm
- sn
- so
- sq
- sr
- ss
- st
- su
- sv
- sw
- ta
- te
- tg
- th
- ti
- tk
- tl
- tn
- to
- tr
- ts
- tt
- tw
- ty
- ug
- uk
- ur
- uz
- ve
- vi
- vo
- wa
- wo
- xh
- yi
- yo
- za
- zh
- zu
- fil # Filipino
- cmn # Mandarin Chinese
- yue # Cantonese
- ars # Najdi Arabic
- ary # Moroccan Arabic
- arz # Egyptian Arabic
- prs # Dari
- pes # Iranian Persian
- bho # Bhojpuri
- mai # Maithili
- hif # Fiji Hindi
- tzm # Central Atlas Tamazight
- kab # Kabyle
- ber # Berber (macro)
- srd # Sardinian
- ast # Asturian
- lad # Ladino
- lmo # Lombard
- nap # Neapolitan
- ckb # Central Kurdish (Sorani)
library_name: transformers
tags:
- speech
- audio
- automatic-speech-recognition
- asr
- multi-lingual
- transformers
- heep
- heep-universal
- entropy-based-curation
metrics:
- wer
pipeline_tag: automatic-speech-recognition
---

# HEEP Universal

**High Entropy Exponential Pruning for State-of-the-Art Multilingual ASR**

HEEP Universal is a state-of-the-art automatic speech recognition (ASR) model that demonstrates how strategic entropy-based data curation outperforms brute-force data scaling. With a composite word error rate (WER) of **3.10%** on English benchmarks, it challenges the "more data is better" paradigm by training on carefully selected high-information samples.

## Model Overview

HEEP Universal supports transcription across **204 languages**, including a wide range of Indic and global languages, with consistent performance across domains such as meetings, earnings calls, broadcast media, and educational content. The model is optimized for high-precision, verbatim transcription, capturing spoken content word for word.

**Core Insight**: Strategic selection of high-entropy samples leads to better ASR models than training on larger but redundant datasets.

## HEEP Methodology

HEEP (High Entropy Exponential Pruning) is an entropy-based data curation methodology that prioritizes information density over data quantity. It identifies high-information training samples while progressively filtering out redundant data, enabling efficient model training with significantly reduced computational resources.

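
To make the idea of entropy-based selection with progressive filtering concrete, here is a minimal toy sketch. It is not the HEEP implementation: it scores each sample only by the Shannon entropy of its transcript's word distribution, whereas the full HEEP score combines acoustic, phonetic, linguistic, and contextual entropy with a mutual information term. The function names `token_entropy` and `progressive_filter`, and all parameter values, are hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def token_entropy(text: str) -> float:
    """Shannon entropy (in bits) of the whitespace-token distribution of one transcript."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def progressive_filter(samples, tau0, growth, max_rounds, min_samples=1):
    """Keep samples whose score exceeds a threshold that grows geometrically per round."""
    kept = list(samples)
    tau = tau0
    for _ in range(max_rounds):
        survivors = [s for s in kept if token_entropy(s) > tau]
        if len(survivors) < min_samples:
            break  # stop before the curated set becomes too small
        kept = survivors
        tau *= growth  # raise the bar for the next round
    return kept


transcripts = [
    "yes yes yes yes",  # fully redundant, entropy 0
    "the cat sat on the mat",  # some repetition
    "quarterly revenue grew despite weaker overseas demand",  # all tokens distinct
]
curated = progressive_filter(transcripts, tau0=1.0, growth=1.6, max_rounds=3)
print(curated)
```

With these illustrative settings the threshold rises from 1.0 to 1.6 to 2.56 bits, so the repetitive transcripts are dropped in successive rounds and only the most varied one remains.
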
### Mathematical Foundation

#### Sample Score (Equation 1)

The information score for each sample combines multiple entropy dimensions:

```
S(x) = α₁·H_acoustic(x) + α₂·H_phonetic(x) + α₃·H_linguistic(x) + α₄·H_contextual(x) + β·MI(x, D)
```

Where:

- `H_acoustic(x)`: Spectral/MFCC entropy measuring acoustic diversity
- `H_phonetic(x)`: Phoneme distribution entropy capturing phonetic complexity
- `H_linguistic(x)`: Vocabulary and syntax entropy measuring linguistic richness
- `H_contextual(x)`: Domain and discourse entropy
- `MI(x, D)`: Mutual information contribution relative to the dataset
- `α₁...α₄, β`: Configurable weights (default: 0.25, 0.20, 0.25, 0.15, 0.15)

#### Mutual Information (Equation 2)

The mutual information between acoustic features and transcription:

```
I(x, y) = Σ_{j,ℓ} p(f_j, y_ℓ) log [p(f_j, y_ℓ) / (p(f_j)·p(y_ℓ))]
```

#### Selection Criterion

Samples are kept when their score exceeds a threshold τ:

```
D' = {x ∈ D : S(x) > τ}
```

#### Progressive Filtering (Equation 8)

The threshold is multiplied by a constant growth factor each round, so it increases exponentially across rounds:

```
τ_{k+1} = τ_k · growth_factor
```

#### Error-Aware Adaptation

After each training round, sample scores are adjusted based on model errors:

```
S'(x) = S(x) + λ_err·ErrorRelevance(x, errors_k) + λ_cross·CrossLingualOverlap(x)
```

### Algorithm Overview

```
Algorithm: HEEP Data Curation with Error-Aware Adaptation

Input: Dataset D, initial threshold τ₀, growth factor g
Output: Curated dataset D*

1. Initialize scorer with entropy estimators
2. Fit scorer to D (compute normalization stats, fit MI estimator)
3. D* ← D
4. k ← 0
5. While |D*| > min_samples AND k < max_rounds:
   a. For each x in D*:
      Compute S(x) = Σᵢ αᵢ·Hᵢ(x) + β·MI(x, D)
   b. If error_patterns available:
      Adjust S'(x) = S(x) + λ_err·ErrorRelevance(x) + λ_cross·CrossLingualOverlap(x)
      Otherwise: S'(x) = S(x)
   c. D* ← {x ∈ D* : S'(x) > τₖ}
   d. If train_callback: Train model on D*
   e. If eval_callback: Analyze errors, update error_patterns
   f. τₖ₊₁ ← τₖ · g
   g. k ← k + 1
6. Return D*
```

### Key Benefits

- Training on **10-20% of data** while matching or exceeding full-dataset performance
- Efficient multilingual model development with cross-lingual transfer
- Error-aware adaptive sample selection across training rounds
- Significant reduction in computational resources and training time

## Performance Benchmarks

### OpenASR Leaderboard Results

| Dataset                | WER (%) | RTFx   |
| ---------------------- | ------- | ------ |
| AMI Test               | 4.19    | 70.22  |
| Earnings22 Test        | 5.83    | 101.52 |
| GigaSpeech Test        | 4.99    | 131.09 |
| LibriSpeech Test Clean | 0.71    | 158.74 |
| LibriSpeech Test Other | 2.17    | 142.40 |
| SPGISpeech Test        | 1.10    | 170.85 |
| TedLium Test           | 1.43    | 153.34 |
| VoxPopuli Test         | 4.34    | 179.28 |

### Composite Results

- **Overall WER**: 3.10%
- **Average RTFx**: 146.23

*RTFx (inverse real-time factor) is the ratio of audio duration to processing time. Higher values mean faster processing.*

## Model Details

- **Architecture**: Transformer-based encoder-decoder optimized for multilingual transcription
- **Languages**: 204 languages supported
- **Format**: Transformers compatible (safetensors)
- **Sampling Rate**: 16 kHz
- **Precision**: FP16/FP32 supported
- **Optimization**: Real-time inference capable with GPU acceleration

## Key Features

- **Exceptional Accuracy**: Achieves 3.10% WER across diverse English test sets
- **Real-Time Performance**: Average RTFx of 146.23 enables real-time applications
- **Verbatim Transcription**: Optimized for accurate, word-for-word transcription
- **Multi-Domain Excellence**: Strong performance across conversational, broadcast, and read speech
- **Multilingual Support**: 204 languages with cross-lingual transfer learning
- **HEEP-Curated Training**: Strategic entropy-based data selection for maximum information density

## Usage

```python
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
import torch

# Use the GPU with half precision when available, otherwise CPU with full precision.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the model weights (safetensors) and move them to the target device.
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "bc7ec356/heep-universal",
    torch_dtype=torch_dtype,
    use_safetensors=True,
)
model.to(device)

# The processor bundles the tokenizer and the audio feature extractor.
processor = AutoProcessor.from_pretrained("bc7ec356/heep-universal")

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

# Transcribe a local audio file and print the recognized text.
result = pipe("audio.wav")
print(result["text"])
```

## Use Cases

HEEP Universal is suited to a range of speech recognition scenarios:

- **Meeting Transcription**: High accuracy on conversational speech (AMI: 4.19% WER)
- **Financial Communications**: Specialized performance on earnings calls (Earnings22: 5.83% WER)
- **Broadcast Media**: Excellent results on news, podcasts, and media content
- **Educational Content**: Optimized for lectures and presentations
- **Customer Support**: Accurate transcription of support calls
- **Legal Documentation**: Professional-grade accuracy for legal proceedings
- **Medical Transcription**: High-quality transcription for medical consultations

## Performance Optimization Tips

- **GPU Acceleration**: Use `device="cuda"` for significantly faster inference
- **Precision**: Set `torch_dtype=torch.float16` for optimal speed on modern GPUs
- **Language Specification**: Specify the language code when known to improve accuracy and speed
- **Beam Size**: With the Transformers pipeline shown in Usage, pass `generate_kwargs={"num_beams": 5}` for best accuracy; use fewer beams for faster inference
- **Batch Processing**: Process multiple files with a single model instance for efficiency

## Acknowledgments

HEEP Universal was developed using the HEEP framework for entropy-based data curation. We thank the open-source community for providing foundational tools that make this work possible.

## Citation

If you use this model in your research, please cite:

```bibtex
@article{anonymous2026heep,
  title={HEEP: High Entropy Exponential Pruning for State-of-the-Art ASR Through Strategic Data Curation},
  author={Anonymous},
  journal={Under Review},
  year={2026}
}
```