+---
+language:
+  - hi
+  - bn
+  - te
+  - mr
+  - kn
+  - ta
+  - ml
+  - gu
+  - pa
+  - or
+  - as
+  - en
+  - ur
+  - ks
+  - ne
+  - sd
+  - sa
+  - mai
+  - bho
+  - mag
+  - hne
+  - raj
+  - doi
+  - kok
+  - sat
+  - brx
+  - mni
+  - grt
+  - rwr
+  - bgc
+  - awa
+  - bra
+  - gbm
+  - lmn
+  - bhb
+  - bgq
+  - kfy
+  - xnr
+  - bfy
+  - noe
+  - rjs
+  - mwr
+  - mtr
+  - wbr
+  - hoj
+  - gom
+  - ahr
+  - sgj
+  - kru
+  - unr
+  - spv
+  - kfr
+  - tcy
+  - kfa
+  - sck
+tags:
+  - speech
+  - asr
+  - automatic-speech-recognition
+  - indian-languages
+  - indic
+  - multilingual
+  - heep
+license: apache-2.0
+library_name: transformers
+pipeline_tag: automatic-speech-recognition
+---
+# HEEP Indic
+**High Entropy Exponential Pruning for State-of-the-Art Multilingual ASR**
+HEEP Indic is a state-of-the-art automatic speech recognition model that demonstrates how strategic entropy-based data curation can outperform brute-force data scaling. It reaches an average word error rate (WER) of **11.9%** across the Hindi benchmarks reported below, lower than Google STT, Azure STT, Nvidia Conformer, and IndicWhisper, and challenges the "more data is better" paradigm by training on carefully selected high-information samples.
+## Model Overview
+HEEP Indic supports transcription across **55 Indic languages**, with consistent performance across domains such as meetings, earnings calls, broadcast media, and educational content. The model is optimized for high-precision, verbatim transcription, capturing spoken content word for word.
+**Core Insight**: Strategic selection of high-entropy samples leads to better ASR models than training on larger but redundant datasets.
+## HEEP Methodology
+HEEP (High Entropy Exponential Pruning) is an entropy-based data curation methodology that prioritizes information density over data quantity. It identifies high-information training samples while progressively filtering redundant data, enabling efficient model training with significantly reduced computational resources.
+### Mathematical Foundation
+#### Sample Score (Equation 1)
+The information score for each sample combines multiple entropy dimensions:
+```
+S(x) = α₁·H_acoustic(x) + α₂·H_phonetic(x) + α₃·H_linguistic(x) + α₄·H_contextual(x) + β·MI(x, D)
+```
+Where:
+- `H_acoustic(x)`: Spectral/MFCC entropy measuring acoustic diversity
+- `H_phonetic(x)`: Phoneme distribution entropy capturing phonetic complexity
+- `H_linguistic(x)`: Vocabulary and syntax entropy measuring linguistic richness
+- `H_contextual(x)`: Domain and discourse entropy
+- `MI(x, D)`: Mutual information contribution relative to dataset
+- `α₁...α₄, β`: Configurable weights (default: 0.25, 0.20, 0.25, 0.15, 0.15)
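As a minimal sketch of Equation 1, the weighted sum can be written directly in Python. The `EntropyTerms` container and the example values are hypothetical, introduced only for illustration. How each entropy term is estimated is not specified here, so the terms are taken as already computed and normalized.

```python
from dataclasses import dataclass

# Default weights as listed above, in the order (α1, α2, α3, α4, β).
DEFAULT_WEIGHTS = (0.25, 0.20, 0.25, 0.15, 0.15)


@dataclass
class EntropyTerms:
    """Hypothetical container for the per-sample terms of Equation 1."""
    acoustic: float
    phonetic: float
    linguistic: float
    contextual: float
    mutual_info: float


def sample_score(terms: EntropyTerms, weights=DEFAULT_WEIGHTS) -> float:
    """Weighted sum S(x) of the four entropy terms and the MI term."""
    a1, a2, a3, a4, beta = weights
    return (
        a1 * terms.acoustic
        + a2 * terms.phonetic
        + a3 * terms.linguistic
        + a4 * terms.contextual
        + beta * terms.mutual_info
    )


# Illustrative values only, not measurements from any dataset.
score = sample_score(EntropyTerms(0.8, 0.6, 0.7, 0.5, 0.4))
print(round(score, 3))
```

Because the default weights sum to 1, the score stays on the same scale as the individual terms when those terms are normalized to a common range.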
+#### Mutual Information (Equation 2)
+The mutual information between acoustic features and transcription:
+```
+I(x, y) = Σ_{j,ℓ} p(f_j, y_ℓ) log [p(f_j, y_ℓ) / (p(f_j)·p(y_ℓ))]
+```
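The sum in Equation 2 can be evaluated from a joint probability table over discretized acoustic features `f_j` and transcription labels `y_ℓ`. The sketch below assumes such a table is already available (how features are discretized is not described here) and uses natural logarithms, so the result is in nats.

```python
import math


def mutual_information(joint):
    """I = sum over (j, l) of p(f_j, y_l) * log(p(f_j, y_l) / (p(f_j) * p(y_l)))."""
    total = sum(sum(row) for row in joint)
    p = [[v / total for v in row] for row in joint]  # normalize counts to probabilities
    p_f = [sum(row) for row in p]                    # marginal over features
    p_y = [sum(col) for col in zip(*p)]              # marginal over labels
    mi = 0.0
    for j, row in enumerate(p):
        for l, p_jl in enumerate(row):
            if p_jl > 0:                             # 0 * log(0) is treated as 0
                mi += p_jl * math.log(p_jl / (p_f[j] * p_y[l]))
    return mi


# Independent variables carry no shared information.
mi_independent = mutual_information([[0.15, 0.35], [0.15, 0.35]])
# Perfectly coupled binary variables share log(2) nats.
mi_dependent = mutual_information([[0.5, 0.0], [0.0, 0.5]])
print(mi_independent, mi_dependent)
```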
+#### Selection Criterion
+Samples are selected based on a threshold:
+```
+D' = {x ∈ D : S(x) > τ}
+```
+#### Progressive Filtering (Equation 8)
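+The threshold is multiplied by a constant `growth_factor` greater than 1 after each round (the `g` of the algorithm overview), so it increases exponentially with the round index: `τ_k = τ₀ · growth_factor^k`.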
+```
+τ_{k+1} = τ_k · growth_factor
+```
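The selection criterion and the threshold update can be combined in a few lines. This is a sketch with made-up scores and hypothetical helper names (`threshold_schedule`, `select`), not the project's implementation.

```python
def threshold_schedule(tau0, growth_factor, rounds):
    """Thresholds tau_k = tau0 * growth_factor**k for k = 0 .. rounds - 1."""
    return [tau0 * growth_factor ** k for k in range(rounds)]


def select(scores, tau):
    """Selection criterion D' = {x in D : S(x) > tau}."""
    return {name for name, s in scores.items() if s > tau}


# Illustrative scores only.
scores = {"utt_a": 0.35, "utt_b": 0.52, "utt_c": 0.71, "utt_d": 0.95}
taus = threshold_schedule(tau0=0.4, growth_factor=1.5, rounds=3)
kept_per_round = [sorted(select(scores, tau)) for tau in taus]

for tau, kept in zip(taus, kept_per_round):
    print(f"tau={tau:.2f} keeps {kept}")
```

With a growth factor above 1, each round keeps a subset of the previous round's samples, which is what makes the pruning progressive.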
+#### Error-Aware Adaptation
+After each training round, sample scores are adjusted based on model errors:
+```
+S'(x) = S(x) + λ_err·ErrorRelevance(x, errors_k) + λ_cross·CrossLingualOverlap(x)
+```
+### Algorithm Overview
+```
+Algorithm: HEEP Data Curation with Error-Aware Adaptation
+Input: Dataset D, initial threshold τ₀, growth factor g, minimum size min_samples, round limit max_rounds
+Output: Curated dataset D*
+1. Initialize scorer with entropy estimators
+2. Fit scorer to D (compute normalization stats, fit MI estimator)
+3. D* ← D
+4. k ← 0
+5. While |D*| > min_samples AND k < max_rounds:
+    a. For each x in D*:
+        Compute S(x) = Σᵢ αᵢ·Hᵢ(x) + β·MI(x, D)
+    b. If error_patterns available:
+        Adjust S'(x) = S(x) + λ_err·ErrorRelevance(x, error_patterns) + λ_cross·CrossLingualOverlap(x)
+       Else: S'(x) = S(x)
+    c. D* ← {x ∈ D* : S'(x) > τₖ}
+    d. If train_callback: Train model on D*
+    e. If eval_callback: Analyze errors, update error_patterns
+    f. τₖ₊₁ ← τₖ · g
+    g. k ← k + 1
+6. Return D*
+```
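The loop above can be sketched in plain Python. All names here (`heep_curate`, `score_fn`, `adjust_fn`, and the callbacks) are hypothetical stand-ins for components this card does not specify. In particular, `adjust_fn` is assumed to return the combined error-relevance and cross-lingual adjustment for one sample.

```python
def heep_curate(dataset, score_fn, tau0, growth, min_samples, max_rounds,
                adjust_fn=None, train_callback=None, eval_callback=None):
    """Progressively filter `dataset`, raising the threshold each round."""
    curated = list(dataset)
    tau = tau0
    error_patterns = None
    rounds = 0
    while len(curated) > min_samples and rounds < max_rounds:
        scored = []
        for x in curated:
            s = score_fn(x)                        # step a: base score S(x)
            if error_patterns is not None and adjust_fn is not None:
                s += adjust_fn(x, error_patterns)  # step b: error-aware adjustment
            scored.append((x, s))
        curated = [x for x, s in scored if s > tau]     # step c: keep S'(x) > tau_k
        if train_callback is not None:
            train_callback(curated)                     # step d
        if eval_callback is not None:
            error_patterns = eval_callback(curated)     # step e
        tau *= growth                                   # step f
        rounds += 1                                     # step g
    return curated


# Toy data with precomputed scores, for illustration only.
toy = [{"id": i, "score": s} for i, s in enumerate([0.2, 0.45, 0.5, 0.65, 0.8, 0.95])]
result = heep_curate(toy, score_fn=lambda x: x["score"], tau0=0.4, growth=1.5,
                     min_samples=2, max_rounds=5)
kept_ids = [x["id"] for x in result]
print(kept_ids)
```

Because the size check runs before each round, a single round can shrink the set below `min_samples`. The toy run ends with one sample for that reason. An implementation that must guarantee a minimum size would need an extra guard that this sketch omits.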
+### Key Benefits
+- Training on **10-20% of data** while matching or exceeding full-dataset performance
+- Efficient multilingual model development with cross-lingual transfer
+- Error-aware adaptive sample selection across training rounds
+- Significant reduction in computational resources and training time
+## Performance Benchmarks
+### Indic Language Results
+Word error rates (%) on Indic benchmark datasets:
+| Dataset | Bengali | Bhojpuri | Chhattisgarhi | Gujarati | Hindi | Kannada | Magahi | Maithili | Malayalam | Marathi | Odia | Punjabi | Sanskrit | Tamil | Telugu | Urdu | Avg |
+|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
+| Kathbath | 14.6 | – | – | 17.4 | 8.5 | 23 | – | – | 39.3 | 19.2 | 25.4 | 15.8 | 41.4 | 30.3 | 29 | 12.1 | 23 |
+| Kathbath Hard | 15.7 | – | – | 18.5 | 9 | 25.1 | – | – | 41.2 | 20.4 | 27.7 | 16.6 | 43.6 | 32.6 | 30.3 | 11.9 | 24.4 |
+| CommonVoice | 21 | – | – | – | 9.96 | – | – | – | 46 | 21.5 | 34.6 | 17.5 | – | 34 | – | 20.6 | 25.7 |
+| FLEURS | 22.4 | – | – | 23.3 | 11 | 23.1 | – | – | 34.4 | 25.5 | 33.3 | 25 | – | 35.1 | 31.9 | 22.4 | 26.1 |
+| IndicTTS | 15.8 | – | – | 16.9 | 6.6 | 19.6 | – | – | 26.4 | 14.5 | 14.8 | – | – | 22.6 | 31.3 | – | 18.7 |
+| Gramvaani | – | – | – | – | 26 | – | – | – | – | – | – | – | – | – | – | – | 26 |
+| RESPIN | 32.5 | 21.3 | 21.6 | – | 12.1 | 45.6 | 27.7 | 41.1 | – | 32.7 | – | – | – | – | 37.5 | – | 30.2 |
+| **Average** | **20.4** | **21.3** | **21.6** | **19** | **11.9** | **27.3** | **27.7** | **41.1** | **37.5** | **22.3** | **27.2** | **18.7** | **42.5** | **30.9** | **32** | **16.7** | **24.6** |
+### Hindi Benchmark Comparison
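+Comparison of publicly available models on the Hindi subset of the benchmark. No RESPIN result is listed for the baseline models, so their averages cover six datasets while the HEEP Indic average covers all seven: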
+| Model | Kathbath | Kathbath Hard | CommonVoice | FLEURS | IndicTTS | RESPIN | Gramvaani | Average |
+|---|---|---|---|---|---|---|---|---|
+| Google STT | 14.3 | 16.7 | 20.8 | 19.4 | 18.3 | – | 59.9 | 24.9 |
+| IndicWav2Vec | 12.2 | 16.2 | 20.2 | 18.3 | 15 | – | 42.1 | 20.7 |
+| Azure STT | 13.6 | 15.1 | 14.6 | 24.3 | 15.2 | – | 42.3 | 20.8 |
+| Nvidia Conformer-CTC Medium | 14 | 15.6 | 20.4 | 19.4 | 12.3 | – | 41.3 | 20.5 |
+| Nvidia Conformer-CTC Large | 12.7 | 14.2 | 21.2 | 15.7 | 12.2 | – | 42.6 | 19.8 |
+| IndicWhisper | 10.3 | 12 | 15 | 11.4 | 7.6 | – | 26.8 | 13.8 |
+| **HEEP Indic** | **8.53** | **8.97** | **9.96** | **11.04** | **6.59** | **12.05** | **25.98** | **11.9** |
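As a quick arithmetic check, the HEEP Indic average in the last row is the unweighted mean of its seven per-dataset WERs, which rounds to 11.9. The snippet below recomputes it and also shows the mean over the six datasets that have baseline results, which is the like-for-like figure when comparing the Average column.

```python
# HEEP Indic Hindi WERs in the column order of the table above.
heep_hindi = [8.53, 8.97, 9.96, 11.04, 6.59, 12.05, 25.98]
respin_index = 5  # RESPIN is the only column without baseline results

avg_all = sum(heep_hindi) / len(heep_hindi)

without_respin = [w for i, w in enumerate(heep_hindi) if i != respin_index]
avg_shared = sum(without_respin) / len(without_respin)

print(f"all seven datasets: {avg_all:.3f}")
print(f"six shared datasets: {avg_shared:.3f}")
```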
+## Model Details
+- **Architecture**: Qwen3ASR — Transformer-based encoder-decoder optimized for multilingual transcription
+- **Languages**: 55 Indic languages supported
+- **Format**: Transformers compatible (safetensors)
+- **Sampling Rate**: 16 kHz
+- **Precision**: FP16/FP32 supported
+- **Optimization**: Real-time inference capable with GPU acceleration
+## Key Features
+- **Real-Time Performance**: Average RTFx of 300 enables real-time applications
+- **Verbatim Transcription**: Optimized for accurate, word-for-word transcription
+- **Multi-Domain Excellence**: Superior performance across conversational, broadcast, and read speech
+- **Multilingual Support**: 55 Indic languages with cross-lingual transfer learning
+- **HEEP-Curated Training**: Strategic entropy-based data selection for maximum information density
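To put the RTFx figure in concrete terms, the sketch below assumes the usual definition of RTFx as seconds of audio processed per second of compute (the inverse of the real-time factor). Actual throughput depends on hardware, batch size, and audio length, so the numbers are illustrative.

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: audio duration divided by processing time."""
    return audio_seconds / processing_seconds


def processing_time(audio_seconds: float, rtfx_value: float) -> float:
    """Expected compute time for a recording at a given RTFx."""
    return audio_seconds / rtfx_value


# One hour of audio at the quoted average RTFx of 300.
one_hour_seconds = processing_time(3600, 300)
print(one_hour_seconds)  # 12.0 seconds of compute
```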
+## Quick Start
+### Install
+```bash
+pip install "qwen-asr[vllm]"
+```
+### Inference with vLLM (Recommended)
+```python
+from qwen_asr import Qwen3ASRModel
+# Load model with vLLM backend
+asr = Qwen3ASRModel.LLM(
+    model="bc7ec356/heep-indic",
+    gpu_memory_utilization=0.8,
+    max_new_tokens=4096,
+)
+# Transcribe from file path
+results = asr.transcribe(
+    audio="path/to/audio.wav",
+    language="Hindi",
+)
+print(results[0].text)
+print(results[0].language)
+```
+### Inference with Transformers
+```python
+import torch
+from qwen_asr import Qwen3ASRModel
+# Load model with Transformers backend
+asr = Qwen3ASRModel.from_pretrained(
+    "bc7ec356/heep-indic",
+    dtype=torch.bfloat16,
+    device_map="cuda:0",
+)
+# Transcribe
+results = asr.transcribe(
+    audio="path/to/audio.wav",
+    language="Hindi",
+)
+print(results[0].text)
+```
+### Batch Transcription
+```python
+# Transcribe multiple files at once
+results = asr.transcribe(
+    audio=["audio1.wav", "audio2.wav", "audio3.wav"],
+    language=["Hindi", "Tamil", "Bengali"],
+)
+for r in results:
+    print(f"[{r.language}] {r.text}")
+```
+### Auto Language Detection
+```python
+# Pass language=None to auto-detect
+results = asr.transcribe(
+    audio="path/to/audio.wav",
+    language=None,
+)
+print(f"Detected: {results[0].language}")
+print(f"Text: {results[0].text}")
+```
+### Streaming Transcription (vLLM only)
+```python
+import numpy as np
+import soundfile as sf
+from qwen_asr import Qwen3ASRModel
+asr = Qwen3ASRModel.LLM(
+    model="bc7ec356/heep-indic",
+    gpu_memory_utilization=0.8,
+    max_new_tokens=4096,
+)
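+# Load audio (this example assumes a 16 kHz recording, matching the model's sampling rate)
+wav, sr = sf.read("path/to/audio.wav", dtype="float32")
+if wav.ndim > 1:
+    wav = wav.mean(axis=1)  # downmix multi-channel audio to mono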
+# Initialize streaming state
+state = asr.init_streaming_state(
+    language="Hindi",
+    chunk_size_sec=2.0,
+    unfixed_chunk_num=2,
+    unfixed_token_num=5,
+)
+# Feed audio in 1-second chunks
+step = sr  # 1 second of samples
+for pos in range(0, len(wav), step):
+    chunk = wav[pos : pos + step]
+    asr.streaming_transcribe(chunk, state)
+    print(f"Partial: {state.text}")
+# Finalize
+asr.finish_streaming_transcribe(state)
+print(f"Final: {state.text}")
+```
+### NumPy Array Input
+```python
+import numpy as np
+# From a numpy array + sample rate
+audio_array = np.random.randn(16000).astype(np.float32)  # 1 second at 16kHz
+results = asr.transcribe(
+    audio=(audio_array, 16000),
+    language="English",
+)
+```
+## Performance Optimization Tips
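+- **GPU Acceleration**: Load the model on a GPU (for example `device_map="cuda:0"`, as in the Transformers example) for significantly faster inference
+- **Precision**: Use a half-precision dtype (the Transformers example passes `dtype=torch.bfloat16`; FP16 is also supported) for faster inference on modern GPUs
+- **Language Specification**: Pass the language name (for example `language="Hindi"`) when it is known, rather than relying on auto-detection, to improve accuracy and speed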
+## Acknowledgments
+HEEP Indic was developed using the HEEP framework for entropy-based data curation. We thank the open-source community for providing foundational tools that make this work possible.
+## Citation
+If you use this model in your research, please cite:
+```bibtex
+@article{anonymous2026heep,
+  title={HEEP: High Entropy Exponential Pruning for State-of-the-Art ASR Through Strategic Data Curation},
+  author={Anonymous},
+  journal={Under Review},
+  year={2026}
+}
+```