	---
	language:
	# ISO 639-1 (official)
	- aa
	- ab
	- ae
	- af
	- ak
	- am
	- an
	- ar
	- as
	- av
	- ay
	- az
	- ba
	- be
	- bg
	- bh
	- bi
	- bm
	- bn
	- bo
	- br
	- bs
	- ca
	- ce
	- ch
	- co
	- cr
	- cs
	- cu
	- cv
	- cy
	- da
	- de
	- dv
	- dz
	- ee
	- el
	- en
	- eo
	- es
	- et
	- eu
	- fa
	- ff
	- fi
	- fj
	- fo
	- fr
	- fy
	- ga
	- gd
	- gl
	- gn
	- gu
	- gv
	- ha
	- he
	- hi
	- ho
	- hr
	- ht
	- hu
	- hy
	- hz
	- ia
	- id
	- ie
	- ig
	- ii
	- ik
	- io
	- is
	- it
	- iu
	- ja
	- jv
	- ka
	- kg
	- ki
	- kj
	- kk
	- kl
	- km
	- kn
	- ko
	- kr
	- ks
	- ku
	- kv
	- kw
	- ky
	- la
	- lb
	- lg
	- li
	- ln
	- lo
	- lt
	- lu
	- lv
	- mg
	- mh
	- mi
	- mk
	- ml
	- mn
	- mr
	- ms
	- mt
	- my
	- na
	- nb
	- nd
	- ne
	- ng
	- nl
	- nn
	- no
	- nr
	- nv
	- ny
	- oc
	- oj
	- om
	- or
	- os
	- pa
	- pi
	- pl
	- ps
	- pt
	- qu
	- rm
	- rn
	- ro
	- ru
	- rw
	- sa
	- sc
	- sd
	- se
	- sg
	- si
	- sk
	- sl
	- sm
	- sn
	- so
	- sq
	- sr
	- ss
	- st
	- su
	- sv
	- sw
	- ta
	- te
	- tg
	- th
	- ti
	- tk
	- tl
	- tn
	- to
	- tr
	- ts
	- tt
	- tw
	- ty
	- ug
	- uk
	- ur
	- uz
	- ve
	- vi
	- vo
	- wa
	- wo
	- xh
	- yi
	- yo
	- za
	- zh
	- zu
	- fil # Filipino
	- cmn # Mandarin Chinese
	- yue # Cantonese
	- ars # Najdi Arabic
	- ary # Moroccan Arabic
	- arz # Egyptian Arabic
	- prs # Dari
	- pes # Iranian Persian
	- bho # Bhojpuri
	- mai # Maithili
	- hif # Fiji Hindi
	- tzm # Central Atlas Tamazight
	- kab # Kabyle
	- ber # Berber (macro)
	- srd # Sardinian
	- ast # Asturian
	- lad # Ladino
	- lmo # Lombard
	- nap # Neapolitan
	- ckb # Central Kurdish (Sorani)

	library_name: transformers
	tags:
	- speech
	- audio
	- automatic-speech-recognition
	- asr
	- multi-lingual
	- transformers
	- heep
	- heep-universal
	- entropy-based-curation
	metrics:
	- wer
	pipeline_tag: automatic-speech-recognition
	---

	# HEEP Universal

	High Entropy Exponential Pruning for State-of-the-Art Multilingual ASR

	HEEP Universal is an automatic speech recognition (ASR) model built to show that strategic, entropy-based data curation can outperform brute-force data scaling. It reaches a composite word error rate (WER) of 3.10% on English benchmarks while training on a carefully selected subset of high-information samples, challenging the "more data is better" paradigm.

	## Model Overview

	HEEP Universal supports transcription across 204 languages, including a wide range of Indic and global languages, with consistent performance across domains such as meetings, earnings calls, broadcast media, and educational content. The model is optimized for high-precision, verbatim transcription that captures spoken content word for word.

	Core Insight: Strategic selection of high-entropy samples leads to better ASR models than training on larger but redundant datasets.

	## HEEP Methodology

	HEEP (High Entropy Exponential Pruning) is an entropy-based data curation methodology that prioritizes information density over data quantity. It identifies high-information training samples while progressively filtering redundant data, enabling efficient model training with significantly reduced computational resources.

	### Mathematical Foundation

	#### Sample Score (Equation 1)

	The information score for each sample combines multiple entropy dimensions:

	```
	S(x) = α₁·H_acoustic(x) + α₂·H_phonetic(x) + α₃·H_linguistic(x) + α₄·H_contextual(x) + β·MI(x, D)
	```

	Where:
	- `H_acoustic(x)`: Spectral/MFCC entropy measuring acoustic diversity
	- `H_phonetic(x)`: Phoneme distribution entropy capturing phonetic complexity
	- `H_linguistic(x)`: Vocabulary and syntax entropy measuring linguistic richness
	- `H_contextual(x)`: Domain and discourse entropy
	- `MI(x, D)`: Mutual information contribution relative to dataset
	- `α₁...α₄, β`: Configurable weights (default: 0.25, 0.20, 0.25, 0.15, 0.15)
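
As a minimal sketch of Equation 1, the snippet below combines five already-computed terms with the default weights listed above. The `EntropyTerms` container and the assumption that each term has been normalized to a comparable range beforehand are illustrative, not part of the published method.

```python
from dataclasses import dataclass

# Default weights quoted above, in the order (α1, α2, α3, α4, β).
DEFAULT_WEIGHTS = (0.25, 0.20, 0.25, 0.15, 0.15)


@dataclass
class EntropyTerms:
    """Hypothetical container for one sample's pre-computed, normalized terms."""
    acoustic: float
    phonetic: float
    linguistic: float
    contextual: float
    mutual_info: float


def sample_score(terms: EntropyTerms, weights=DEFAULT_WEIGHTS) -> float:
    """Weighted sum of the four entropy terms and the MI term (Equation 1)."""
    a1, a2, a3, a4, beta = weights
    return (
        a1 * terms.acoustic
        + a2 * terms.phonetic
        + a3 * terms.linguistic
        + a4 * terms.contextual
        + beta * terms.mutual_info
    )


example = EntropyTerms(acoustic=0.8, phonetic=0.6, linguistic=0.7,
                       contextual=0.4, mutual_info=0.5)
score = sample_score(example)
```

Because the default weights sum to 1, the score stays in the same range as the individual terms when those are normalized to [0, 1].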

	#### Mutual Information (Equation 2)

	The mutual information between acoustic features `f_j` and transcription units `y_ℓ`, summed over all feature-unit pairs:

	```
	I(x, y) = Σ_{j,ℓ} p(f_j, y_ℓ) log [p(f_j, y_ℓ) / (p(f_j)·p(y_ℓ))]
	```
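
The sum in Equation 2 can be evaluated directly from a joint table of feature and label co-occurrences. The sketch below assumes the acoustic features have already been discretized into bins and that `joint[j][l]` holds a count or probability for bin `j` and label `l`; how the model card's MI estimator builds that table is not specified here.

```python
import math


def mutual_information(joint):
    """Return sum p(f_j, y_l) * log(p(f_j, y_l) / (p(f_j) * p(y_l))) in nats."""
    total = sum(sum(row) for row in joint)
    probs = [[value / total for value in row] for row in joint]
    p_f = [sum(row) for row in probs]          # marginal over feature bins
    p_y = [sum(col) for col in zip(*probs)]    # marginal over labels
    mi = 0.0
    for j, row in enumerate(probs):
        for l, p in enumerate(row):
            if p > 0:                          # 0 * log(0) is taken as 0
                mi += p * math.log(p / (p_f[j] * p_y[l]))
    return mi


# Independent variables carry no mutual information.
independent = [[0.15, 0.35], [0.15, 0.35]]
# Perfectly coupled binary variables carry log(2) nats.
dependent = [[0.5, 0.0], [0.0, 0.5]]

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
```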

	#### Selection Criterion

	Samples are selected based on a threshold:

	```
	D' = {x ∈ D : S(x) > τ}
	```

	#### Progressive Filtering (Equation 8)

	The threshold increases exponentially across rounds, so that τ_k = τ₀ · growth_factor^k for a growth factor greater than 1:

	```
	τ_{k+1} = τ_k · growth_factor
	```
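
The selection criterion and the threshold update can be combined into a small filtering loop. This is a sketch with made-up scores and hypothetical helper names (`threshold_schedule`, `select`); the starting threshold and growth factor are chosen only for illustration.

```python
def threshold_schedule(tau0: float, growth_factor: float, rounds: int) -> list[float]:
    """Return [tau_0, ..., tau_{rounds-1}] with tau_{k+1} = tau_k * growth_factor."""
    thresholds = []
    tau = tau0
    for _ in range(rounds):
        thresholds.append(tau)
        tau *= growth_factor
    return thresholds


def select(scores: dict[str, float], tau: float) -> dict[str, float]:
    """Keep samples with S(x) > tau (strict inequality, as in the criterion)."""
    return {sample_id: s for sample_id, s in scores.items() if s > tau}


# Illustrative scores only; real scores would come from Equation 1.
scores = {"a": 0.30, "b": 0.45, "c": 0.62, "d": 0.80}

survivors = scores
history = []
for tau in threshold_schedule(tau0=0.4, growth_factor=1.25, rounds=3):
    survivors = select(survivors, tau)
    history.append(sorted(survivors))
```

With these values the thresholds are 0.4, 0.5, and 0.625, so the retained set shrinks from three samples to two and then to one.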

	#### Error-Aware Adaptation

	After each training round, sample scores are adjusted based on model errors:

	```
	S'(x) = S(x) + λ_err·ErrorRelevance(x, errors_k) + λ_cross·CrossLingualOverlap(x)
	```
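
The adjustment above leaves `ErrorRelevance` and `CrossLingualOverlap` undefined, so the sketch below fills them with stand-ins purely to show the shape of the update: error relevance is taken as the share of a sample's words that the model recently misrecognized, and cross-lingual overlap is passed in as a precomputed number. The λ defaults are also assumptions.

```python
def error_relevance(transcript: str, error_words: set[str]) -> float:
    """Hypothetical ErrorRelevance: share of words the model recently got wrong."""
    words = transcript.lower().split()
    if not words:
        return 0.0
    return sum(word in error_words for word in words) / len(words)


def adjusted_score(base_score, transcript, error_words, cross_lingual_overlap,
                   lambda_err=0.1, lambda_cross=0.05):
    """S'(x) = S(x) + λ_err * ErrorRelevance + λ_cross * CrossLingualOverlap."""
    return (
        base_score
        + lambda_err * error_relevance(transcript, error_words)
        + lambda_cross * cross_lingual_overlap
    )


# Words the model got wrong in round k (illustrative).
errors_k = {"ebitda", "amortization"}

boosted = adjusted_score(0.50, "EBITDA rose after amortization charges", errors_k, 0.2)
unchanged = adjusted_score(0.50, "thank you all for joining", errors_k, 0.0)
```

Samples that contain recently misrecognized words get a higher adjusted score and are therefore more likely to survive the next, stricter threshold.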

	### Algorithm Overview

	```
	Algorithm: HEEP Data Curation with Error-Aware Adaptation

	Input: Dataset D, initial threshold τ₀, growth factor g
	Output: Curated dataset D*

	1. Initialize scorer with entropy estimators
	2. Fit scorer to D (compute normalization stats, fit MI estimator)
	3. D* ← D
	4. k ← 0
	5. While |D*| > min_samples AND k < max_rounds:
	     a. For each x in D*:
	          Compute S(x) = Σᵢ αᵢ·Hᵢ(x) + β·MI(x, D)
	     b. If error_patterns available:
	          Adjust S'(x) = S(x) + λ_err·ErrorRelevance(x) + λ_cross·CrossLingualOverlap(x)
	     c. D* ← {x ∈ D* : S'(x) > τₖ}
	     d. If train_callback: Train model on D*
	     e. If eval_callback: Analyze errors, update error_patterns
	     f. τₖ₊₁ ← τₖ · g
	     g. k ← k + 1
	6. Return D*
	```
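
The loop above translates fairly directly into Python. The following is a sketch, not the HEEP reference implementation: `heep_curate`, `score_fn`, and `adjust_fn` are hypothetical names, scoring is delegated to caller-supplied functions, and step 5e (updating error patterns from an evaluation callback) is left to the caller.

```python
from typing import Callable, Optional


def heep_curate(
    dataset: list,
    score_fn: Callable[[object], float],
    tau0: float,
    growth_factor: float,
    min_samples: int = 1,
    max_rounds: int = 10,
    adjust_fn: Optional[Callable[[object, float], float]] = None,
    train_callback: Optional[Callable[[list], None]] = None,
) -> list:
    """Repeat score -> (adjust) -> filter -> (train) while raising the threshold."""
    curated = list(dataset)
    tau = tau0
    k = 0
    while len(curated) > min_samples and k < max_rounds:
        kept = []
        for x in curated:
            s = score_fn(x)                 # step 5a: base score S(x)
            if adjust_fn is not None:       # step 5b: error-aware adjustment
                s = adjust_fn(x, s)
            if s > tau:                     # step 5c: keep samples above tau_k
                kept.append(x)
        curated = kept
        if train_callback is not None:      # step 5d: optional training round
            train_callback(curated)
        tau *= growth_factor                # step 5f: raise the threshold
        k += 1                              # step 5g
    return curated


# Toy run: each "sample" is its own score.
samples = [0.1, 0.3, 0.5, 0.7, 0.9]
sizes = []
curated = heep_curate(
    samples, score_fn=lambda x: x, tau0=0.2, growth_factor=2.0,
    min_samples=2, max_rounds=5, train_callback=lambda d: sizes.append(len(d)),
)
```

As in the pseudocode, the size check happens before each round, so the final round can leave fewer than `min_samples` samples; a caller who needs a hard floor would have to add that check after filtering.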

	### Key Benefits

	- Training on 10-20% of data while matching or exceeding full-dataset performance
	- Efficient multilingual model development with cross-lingual transfer
	- Error-aware adaptive sample selection across training rounds
	- Significant reduction in computational resources and training time

	## Performance Benchmarks

	### OpenASR Leaderboard Results

	| Dataset                | WER (%) | RTFx   |
	| ---------------------- | ------- | ------ |
	| AMI Test               | 4.19    | 70.22  |
	| Earnings22 Test        | 5.83    | 101.52 |
	| GigaSpeech Test        | 4.99    | 131.09 |
	| LibriSpeech Test Clean | 0.71    | 158.74 |
	| LibriSpeech Test Other | 2.17    | 142.40 |
	| SPGISpeech Test        | 1.10    | 170.85 |
	| TedLium Test           | 1.43    | 153.34 |
	| VoxPopuli Test         | 4.34    | 179.28 |

	### Composite Results
	- Overall WER: 3.10%
	- Average RTFx: 146.23

	RTFx (inverse real-time factor) is the ratio of audio duration to processing time, so higher values mean faster processing. For example, an RTFx of 100 means one hour of audio is transcribed in 36 seconds.
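
The composite WER can be reproduced from the per-dataset figures, assuming it is the unweighted mean over the eight test sets. The sketch uses `Decimal` so that the half-way value rounds up as expected, and adds a small helper that converts an RTFx value into wall-clock time under the usual reading of RTFx as audio duration divided by processing time.

```python
from decimal import Decimal, ROUND_HALF_UP

# WER values (%) copied from the OpenASR table above.
wer_percent = {
    "AMI Test": Decimal("4.19"),
    "Earnings22 Test": Decimal("5.83"),
    "GigaSpeech Test": Decimal("4.99"),
    "LibriSpeech Test Clean": Decimal("0.71"),
    "LibriSpeech Test Other": Decimal("2.17"),
    "SPGISpeech Test": Decimal("1.10"),
    "TedLium Test": Decimal("1.43"),
    "VoxPopuli Test": Decimal("4.34"),
}

# Unweighted mean across datasets: 24.76 / 8 = 3.095.
mean_wer = sum(wer_percent.values()) / len(wer_percent)
composite_wer = mean_wer.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
print(composite_wer)  # 3.10


def processing_seconds(audio_seconds: float, rtfx: float) -> float:
    """Wall-clock time needed for `audio_seconds` of audio at a given RTFx."""
    return audio_seconds / rtfx


# One hour of audio at the reported average RTFx takes roughly 25 seconds.
one_hour = processing_seconds(3600, 146.23)
```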

	## Model Details

	- Architecture: Transformer-based encoder-decoder optimized for multilingual transcription
	- Languages: 204 languages supported
	- Format: Transformers compatible (safetensors)
	- Sampling Rate: 16 kHz
	- Precision: FP16/FP32 supported
	- Optimization: Real-time inference capable with GPU acceleration

	## Key Features

	- Exceptional Accuracy: Achieves 3.10% WER across diverse English test sets
	- Real-Time Performance: Average RTFx of 146.23 enables real-time applications
	- Verbatim Transcription: Optimized for accurate, word-for-word transcription
	- Multi-Domain Excellence: Superior performance across conversational, broadcast, and read speech
	- Multilingual Support: 204 languages with cross-lingual transfer learning
	- HEEP-Curated Training: Strategic entropy-based data selection for maximum information density

	## Usage

	```python
	from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
	import torch

	# Use the first GPU with half precision when available, otherwise CPU with FP32.
	device = "cuda:0" if torch.cuda.is_available() else "cpu"
	torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

	# Load the model weights from the Hub in safetensors format.
	model = AutoModelForSpeechSeq2Seq.from_pretrained(
	    "bc7ec356/heep-universal",
	    torch_dtype=torch_dtype,
	    use_safetensors=True,
	)
	model.to(device)

	# The processor bundles the tokenizer and the audio feature extractor.
	processor = AutoProcessor.from_pretrained("bc7ec356/heep-universal")

	pipe = pipeline(
	    "automatic-speech-recognition",
	    model=model,
	    tokenizer=processor.tokenizer,
	    feature_extractor=processor.feature_extractor,
	    torch_dtype=torch_dtype,
	    device=device,
	)

	# Transcribe a local audio file and print the recognized text.
	result = pipe("audio.wav")
	print(result["text"])
	```

	## Use Cases

	HEEP Universal excels in various speech recognition scenarios:

	- Meeting Transcription: High accuracy on conversational speech (AMI: 4.19% WER)
	- Financial Communications: Specialized performance on earnings calls (Earnings22: 5.83% WER)
	- Broadcast Media: Excellent results on news, podcasts, and media content
	- Educational Content: Optimized for lectures and presentations
	- Customer Support: Accurate transcription of support calls
	- Legal Documentation: Professional-grade accuracy for legal proceedings
	- Medical Transcription: High-quality transcription for medical consultations

	## Performance Optimization Tips

	- GPU Acceleration: Use `device="cuda"` for significantly faster inference
	- Precision: Set `torch_dtype=torch.float16` for optimal speed on modern GPUs
	- Language Specification: Specify language code when known to improve accuracy and speed
	- Beam Size: Pass `generate_kwargs={"num_beams": 5}` to the pipeline for best accuracy; reduce the value for faster inference
	- Batch Processing: Process multiple files with a single model instance for efficiency
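
Several of these tips can be combined in one helper that reuses a single pipeline for many files. In `transformers`, beam width is passed to generation as `num_beams`, and the ASR pipeline accepts `batch_size` and `generate_kwargs` at call time. The helper name `transcribe_batch` is hypothetical, and the `language` option assumes the checkpoint's `generate()` accepts that argument the way Whisper-style models do. A stand-in pipeline is used so the sketch runs without downloading weights.

```python
def transcribe_batch(pipe, audio_paths, num_beams=5, batch_size=8, language=None):
    """Run one already-loaded ASR pipeline over several files.

    `pipe` is expected to behave like the pipeline built in the Usage section:
    called with a list of paths, it returns a list of dicts with a "text" key.
    """
    paths = list(audio_paths)
    generate_kwargs = {"num_beams": num_beams}
    if language is not None:
        # Assumption: the model's generate() understands `language`.
        # Drop this argument if the checkpoint rejects it.
        generate_kwargs["language"] = language
    results = pipe(paths, batch_size=batch_size, generate_kwargs=generate_kwargs)
    return {path: result["text"] for path, result in zip(paths, results)}


# Stand-in for the real pipeline, used only to exercise the helper offline.
calls = []


def fake_pipe(paths, batch_size, generate_kwargs):
    calls.append((batch_size, generate_kwargs))
    return [{"text": f"transcript of {path}"} for path in paths]


transcripts = transcribe_batch(fake_pipe, ["a.wav", "b.wav"], num_beams=2, batch_size=4)
```

With the real model, `fake_pipe` would be replaced by the `pipe` object from the Usage section; a lower `num_beams` trades some accuracy for speed.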

	## Acknowledgments

	HEEP Universal was developed using the HEEP framework for entropy-based data curation. We thank the open-source community for providing foundational tools that make this work possible.

	## Citation

	If you use this model in your research, please cite:

	```bibtex
	@article{anonymous2026heep,
	title={HEEP: High Entropy Exponential Pruning for State-of-the-Art ASR Through Strategic Data Curation},
	author={Anonymous},
	journal={Under Review},
	year={2026}
	}
	```