bc7ec356 committed on
Commit d9602b0 · verified · 1 Parent(s): 80fb066

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +364 -0
README.md ADDED
@@ -0,0 +1,364 @@
---
language:
- hi
- bn
- te
- mr
- kn
- ta
- ml
- gu
- pa
- or
- as
- en
- ur
- ks
- ne
- sd
- sa
- mai
- bho
- mag
- hne
- raj
- doi
- kok
- sat
- brx
- mni
- grt
- rwr
- bgc
- awa
- bra
- gbm
- lmn
- bhb
- bgq
- kfy
- xnr
- bfy
- noe
- rjs
- mwr
- mtr
- wbr
- hoj
- gom
- ahr
- sgj
- kru
- unr
- spv
- kfr
- tcy
- kfa
- sck
tags:
- speech
- asr
- automatic-speech-recognition
- indian-languages
- indic
- multilingual
- heep
license: apache-2.0
library_name: transformers
pipeline_tag: automatic-speech-recognition
---

# HEEP Indic

**High Entropy Exponential Pruning for State-of-the-Art Multilingual ASR**

HEEP Indic is an automatic speech recognition (ASR) model that demonstrates how entropy-based data curation can outperform brute-force data scaling. It reaches an average word error rate (WER) of **11.9%** on Hindi benchmarks, lower than Google STT, Azure STT, Nvidia Conformer, and IndicWhisper, and it challenges the "more data is better" paradigm by training on carefully selected high-information samples.

## Model Overview

HEEP Indic supports transcription across **55 Indic languages**, with consistent performance across domains such as meetings, earnings calls, broadcast media, and educational content. The model is optimized for verbatim transcription, capturing spoken content word for word.

**Core Insight**: Strategic selection of high-entropy samples leads to better ASR models than training on larger but redundant datasets.

## HEEP Methodology

HEEP (High Entropy Exponential Pruning) is an entropy-based data curation methodology that prioritizes information density over data quantity. It identifies high-information training samples while progressively filtering redundant data, enabling efficient model training with significantly reduced computational resources.

### Mathematical Foundation

#### Sample Score (Equation 1)

The information score for each sample combines multiple entropy dimensions:

```
S(x) = α₁·H_acoustic(x) + α₂·H_phonetic(x) + α₃·H_linguistic(x) + α₄·H_contextual(x) + β·MI(x, D)
```

Where:
- `H_acoustic(x)`: Spectral/MFCC entropy measuring acoustic diversity
- `H_phonetic(x)`: Phoneme distribution entropy capturing phonetic complexity
- `H_linguistic(x)`: Vocabulary and syntax entropy measuring linguistic richness
- `H_contextual(x)`: Domain and discourse entropy
- `MI(x, D)`: Mutual information contribution relative to dataset
- `α₁...α₄, β`: Configurable weights (default: 0.25, 0.20, 0.25, 0.15, 0.15)
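
As a minimal sketch of Equation 1, the weighted sum can be written as below. The default weights are the ones listed above; the function name, the dictionary keys, and the example component values are illustrative assumptions, and the entropy and MI estimators themselves are not shown because this README does not specify them.

```python
# Default weights from the list above: alpha_1..alpha_4, then beta.
DEFAULT_WEIGHTS = {
    "acoustic": 0.25,
    "phonetic": 0.20,
    "linguistic": 0.25,
    "contextual": 0.15,
    "mutual_information": 0.15,
}


def sample_score(components, weights=DEFAULT_WEIGHTS):
    """Weighted sum S(x) of the five per-sample terms.

    `components` maps each term name to an already-normalized value;
    how those values are estimated is outside this sketch.
    """
    missing = set(weights) - set(components)
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    return sum(weights[name] * components[name] for name in weights)


# Hypothetical, made-up component values for a single utterance.
example = {
    "acoustic": 0.8,
    "phonetic": 0.6,
    "linguistic": 0.7,
    "contextual": 0.4,
    "mutual_information": 0.5,
}
score = sample_score(example)
print(round(score, 4))
```

Because the default weights sum to 1, the score stays in the same range as the components when each component is normalized to [0, 1].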

#### Mutual Information (Equation 2)

The mutual information between acoustic features and transcription:

```
I(x, y) = Σ_{j,ℓ} p(f_j, y_ℓ) log [p(f_j, y_ℓ) / (p(f_j)·p(y_ℓ))]
```
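
The sum in Equation 2 can be evaluated directly once a joint probability table is available. The sketch below assumes a small discrete table indexed by feature bin `j` and label `k`; how HEEP discretizes acoustic features or estimates the joint distribution is not described here, so the table values are purely illustrative.

```python
import math


def mutual_information(joint):
    """Sum over (j, k) of p(f_j, y_k) * log(p(f_j, y_k) / (p(f_j) * p(y_k))), in nats.

    `joint[j][k]` is the joint probability of feature bin j and label k;
    the table is expected to sum to 1.
    """
    p_f = [sum(row) for row in joint]        # marginal over labels
    p_y = [sum(col) for col in zip(*joint)]  # marginal over feature bins
    total = 0.0
    for j, row in enumerate(joint):
        for k, p in enumerate(row):
            if p > 0.0:  # a zero-probability cell contributes nothing
                total += p * math.log(p / (p_f[j] * p_y[k]))
    return total


independent = [[0.25, 0.25], [0.25, 0.25]]  # features say nothing about labels
dependent = [[0.5, 0.0], [0.0, 0.5]]        # features determine the label
mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
```

The two toy tables bracket the behaviour: independent variables give zero mutual information, while a one-to-one mapping between two equally likely outcomes gives log 2 nats.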

#### Selection Criterion

Samples are selected based on a threshold:

```
D' = {x ∈ D : S(x) > τ}
```

#### Progressive Filtering (Equation 8)

The threshold increases exponentially across rounds, so after k rounds it equals τ₀ · growth_factor^k:

```
τ_{k+1} = τ_k · growth_factor
```
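
To make the interaction between the selection criterion and the growing threshold concrete, here is a small sketch. The scores, the initial threshold, and the growth factor are hypothetical values chosen only so the arithmetic is easy to follow; they are not the settings used to train HEEP Indic.

```python
def threshold_schedule(tau0, growth_factor, rounds):
    """Return [tau_0, ..., tau_{rounds-1}] with tau_{k+1} = tau_k * growth_factor."""
    taus = []
    tau = tau0
    for _ in range(rounds):
        taus.append(tau)
        tau *= growth_factor
    return taus


def select(scores, tau):
    """Keep the ids whose score is strictly above tau: D' = {x : S(x) > tau}."""
    return {x for x, s in scores.items() if s > tau}


# Hypothetical scores and hyperparameters, for illustration only.
scores = {"utt_a": 0.30, "utt_b": 0.45, "utt_c": 0.62, "utt_d": 0.80}
taus = threshold_schedule(tau0=0.25, growth_factor=2.0, rounds=3)
kept_per_round = [sorted(select(scores, tau)) for tau in taus]
for tau, kept in zip(taus, kept_per_round):
    print(tau, kept)
```

With these numbers the thresholds are 0.25, 0.5, and 1.0, so the kept set shrinks from four utterances to two and then to none, which is why a practical run also needs a stopping condition such as a minimum sample count.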

#### Error-Aware Adaptation

After each training round, sample scores are adjusted based on model errors:

```
S'(x) = S(x) + λ_err·ErrorRelevance(x, errors_k) + λ_cross·CrossLingualOverlap(x)
```
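
The adjustment is an additive boost on top of the base score. This README does not define `ErrorRelevance`, `CrossLingualOverlap`, or the λ values, so the sketch below uses a toy stand-in (the fraction of a sample's distinct tokens that appear in a set of recently misrecognized tokens) and made-up coefficients; all of these are assumptions for illustration, not the HEEP definitions.

```python
def error_relevance(sample_tokens, error_tokens):
    """Toy stand-in: share of the sample's distinct tokens the model recently got wrong."""
    tokens = set(sample_tokens)
    if not tokens:
        return 0.0
    return len(tokens & set(error_tokens)) / len(tokens)


def adjusted_score(base_score, err_relevance, cross_overlap,
                   lambda_err=0.1, lambda_cross=0.05):
    """S'(x) = S(x) + lambda_err * ErrorRelevance + lambda_cross * CrossLingualOverlap.

    The lambda defaults are hypothetical placeholders.
    """
    return base_score + lambda_err * err_relevance + lambda_cross * cross_overlap


# Two of the four sample tokens appear in the (hypothetical) error set.
relevance = error_relevance(["namaste", "duniya", "kal", "aaj"], {"kal", "aaj", "parso"})
boosted = adjusted_score(base_score=0.40, err_relevance=relevance, cross_overlap=0.2)
```

A sample that overlaps with current model errors gets a higher adjusted score and is therefore more likely to survive the next, stricter threshold.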

### Algorithm Overview

```
Algorithm: HEEP Data Curation with Error-Aware Adaptation

Input: Dataset D, initial threshold τ₀, growth factor g, min_samples, max_rounds
Output: Curated dataset D*

1. Initialize scorer with entropy estimators
2. Fit scorer to D (compute normalization stats, fit MI estimator)
3. D* ← D
4. k ← 0
5. While |D*| > min_samples AND k < max_rounds:
   a. For each x in D*:
        Compute S(x) = Σᵢ αᵢ·Hᵢ(x) + β·MI(x, D)
   b. If error_patterns available:
        Adjust S'(x) = S(x) + λ_err·ErrorRelevance(x) + λ_cross·CrossLingualOverlap(x)
      Otherwise:
        S'(x) = S(x)
   c. D* ← {x ∈ D* : S'(x) > τₖ}
   d. If train_callback: Train model on D*
   e. If eval_callback: Analyze errors, update error_patterns
   f. τₖ₊₁ ← τₖ · g
   g. k ← k + 1
6. Return D*
```
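
The loop above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the HEEP implementation: the function name, the argument names, and the callback signatures are hypothetical, the scorer is passed in as a plain function, and scorer fitting (steps 1 and 2) is left out.

```python
def heep_curate(dataset, score_fn, tau0, growth_factor,
                min_samples=1, max_rounds=10,
                adjust_fn=None, train_callback=None, eval_callback=None):
    """Progressive filtering loop; all names here are illustrative.

    score_fn(x) returns S(x); adjust_fn(x, score, error_patterns) returns S'(x).
    """
    curated = list(dataset)
    tau = tau0
    error_patterns = None
    k = 0
    while len(curated) > min_samples and k < max_rounds:
        scored = []
        for x in curated:
            s = score_fn(x)                              # step a
            if error_patterns is not None and adjust_fn is not None:
                s = adjust_fn(x, s, error_patterns)      # step b
            scored.append((x, s))
        curated = [x for x, s in scored if s > tau]      # step c
        if train_callback is not None:
            train_callback(curated)                      # step d
        if eval_callback is not None:
            error_patterns = eval_callback(curated)      # step e
        tau *= growth_factor                             # step f
        k += 1                                           # step g
    return curated


# Toy run: each item is its own score; thresholds are 0.2, 0.4, 0.8.
toy = [0.1, 0.3, 0.5, 0.7, 0.9]
result = heep_curate(toy, score_fn=lambda x: x, tau0=0.2, growth_factor=2.0,
                     min_samples=1, max_rounds=3)
print(result)
```

Note that, as in the pseudocode, the size check happens before filtering, so a single round can reduce the curated set below `min_samples`.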

### Key Benefits

- Training on **10-20% of data** while matching or exceeding full-dataset performance
- Efficient multilingual model development with cross-lingual transfer
- Error-aware adaptive sample selection across training rounds
- Significant reduction in computational resources and training time

## Performance Benchmarks

### Indic Language Results

Word error rates (%) on Indic benchmark datasets:

| Dataset | Bengali | Bhojpuri | Chhattisgarhi | Gujarati | Hindi | Kannada | Magahi | Maithili | Malayalam | Marathi | Odia | Punjabi | Sanskrit | Tamil | Telugu | Urdu | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Kathbath | 14.6 | – | – | 17.4 | 8.5 | 23 | – | – | 39.3 | 19.2 | 25.4 | 15.8 | 41.4 | 30.3 | 29 | 12.1 | 23 |
| Kathbath Hard | 15.7 | – | – | 18.5 | 9 | 25.1 | – | – | 41.2 | 20.4 | 27.7 | 16.6 | 43.6 | 32.6 | 30.3 | 11.9 | 24.4 |
| CommonVoice | 21 | – | – | – | 9.96 | – | – | – | 46 | 21.5 | 34.6 | 17.5 | – | 34 | – | 20.6 | 25.7 |
| FLEURS | 22.4 | – | – | 23.3 | 11 | 23.1 | – | – | 34.4 | 25.5 | 33.3 | 25 | – | 35.1 | 31.9 | 22.4 | 26.1 |
| IndicTTS | 15.8 | – | – | 16.9 | 6.6 | 19.6 | – | – | 26.4 | 14.5 | 14.8 | – | – | 22.6 | 31.3 | – | 18.7 |
| Gramvaani | – | – | – | – | 26 | – | – | – | – | – | – | – | – | – | – | – | 26 |
| RESPIN | 32.5 | 21.3 | 21.6 | – | 12.1 | 45.6 | 27.7 | 41.1 | – | 32.7 | – | – | – | – | 37.5 | – | 30.2 |
| **Average** | **20.4** | **21.3** | **21.6** | **19** | **11.9** | **27.3** | **27.7** | **41.1** | **37.5** | **22.3** | **27.2** | **18.7** | **42.5** | **30.9** | **32** | **16.7** | **24.6** |
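
For readers unfamiliar with the metric, word error rate is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a plain-Python illustration of that definition; it is not the evaluation script behind the table, and it applies no text normalization (this README does not say which normalization was used).

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{100 * wer:.1f}%")
```

The table reports this ratio as a percentage, so lower is better.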

### Hindi Benchmark Comparison

Comparison of publicly available models on the Hindi subset of the benchmark. A dash marks a dataset with no reported result, and each model's average covers only the datasets reported for it:

| Model | Kathbath | Kathbath Noisy | CommonVoice | FLEURS | IndicTTS | RESPIN | Gramvaani | Average |
|---|---|---|---|---|---|---|---|---|
| Google STT | 14.3 | 16.7 | 20.8 | 19.4 | 18.3 | – | 59.9 | 24.9 |
| IndicWav2Vec | 12.2 | 16.2 | 20.2 | 18.3 | 15 | – | 42.1 | 20.7 |
| Azure STT | 13.6 | 15.1 | 14.6 | 24.3 | 15.2 | – | 42.3 | 20.8 |
| Nvidia Conformer-CTC Medium | 14 | 15.6 | 20.4 | 19.4 | 12.3 | – | 41.3 | 20.5 |
| Nvidia Conformer-CTC Large | 12.7 | 14.2 | 21.2 | 15.7 | 12.2 | – | 42.6 | 19.8 |
| IndicWhisper | 10.3 | 12 | 15 | 11.4 | 7.6 | – | 26.8 | 13.8 |
| **HEEP Indic** | **8.53** | **8.97** | **9.96** | **11.04** | **6.59** | **12.05** | **25.98** | **11.9** |

## Model Details

- **Architecture**: Qwen3ASR, a Transformer-based encoder-decoder optimized for multilingual transcription
- **Languages**: 55 Indic languages supported
- **Format**: Transformers compatible (safetensors)
- **Sampling Rate**: 16 kHz
- **Precision**: FP16/FP32 supported
- **Optimization**: Real-time inference capable with GPU acceleration

## Key Features

- **Real-Time Performance**: Average RTFx of 300 enables real-time applications
- **Verbatim Transcription**: Optimized for accurate, word-for-word transcription
- **Multi-Domain Excellence**: Superior performance across conversational, broadcast, and read speech
- **Multilingual Support**: 55 Indic languages with cross-lingual transfer learning
- **HEEP-Curated Training**: Strategic entropy-based data selection for maximum information density
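
To put the RTFx figure in perspective, the sketch below assumes the common definition of RTFx as seconds of audio transcribed per second of compute (the inverse real-time factor). This README does not define the term or the hardware it was measured on, so treat the definition and the helper names as assumptions.

```python
def rtfx(audio_seconds, processing_seconds):
    """Inverse real-time factor: seconds of audio handled per second of compute."""
    return audio_seconds / processing_seconds


def processing_time(audio_seconds, rtfx_value):
    """Compute time implied by a given RTFx."""
    return audio_seconds / rtfx_value


one_hour = 3600.0
seconds_needed = processing_time(one_hour, rtfx_value=300.0)
print(seconds_needed)
```

Under that definition, an RTFx of 300 corresponds to roughly 12 seconds of compute per hour of audio.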

## Quick Start

### Install

```bash
pip install "qwen-asr[vllm]"
```

### Inference with vLLM (Recommended)

```python
from qwen_asr import Qwen3ASRModel

# Load model with vLLM backend
asr = Qwen3ASRModel.LLM(
    model="bc7ec356/heep-indic",
    gpu_memory_utilization=0.8,
    max_new_tokens=4096,
)

# Transcribe from file path
results = asr.transcribe(
    audio="path/to/audio.wav",
    language="Hindi",
)
print(results[0].text)
print(results[0].language)
```

### Inference with Transformers

```python
import torch
from qwen_asr import Qwen3ASRModel

# Load model with Transformers backend
asr = Qwen3ASRModel.from_pretrained(
    "bc7ec356/heep-indic",
    dtype=torch.bfloat16,
    device_map="cuda:0",
)

# Transcribe
results = asr.transcribe(
    audio="path/to/audio.wav",
    language="Hindi",
)
print(results[0].text)
```

### Batch Transcription

```python
# Transcribe multiple files at once (reuses the `asr` object loaded above);
# the language list is matched to the audio list by position
results = asr.transcribe(
    audio=["audio1.wav", "audio2.wav", "audio3.wav"],
    language=["Hindi", "Tamil", "Bengali"],
)
for r in results:
    print(f"[{r.language}] {r.text}")
```

### Auto Language Detection

```python
# Pass language=None to auto-detect
results = asr.transcribe(
    audio="path/to/audio.wav",
    language=None,
)
print(f"Detected: {results[0].language}")
print(f"Text: {results[0].text}")
```

### Streaming Transcription (vLLM only)

```python
import numpy as np
import soundfile as sf

from qwen_asr import Qwen3ASRModel

asr = Qwen3ASRModel.LLM(
    model="bc7ec356/heep-indic",
    gpu_memory_utilization=0.8,
    max_new_tokens=4096,
)

# Load audio
wav, sr = sf.read("path/to/audio.wav", dtype="float32")

# Initialize streaming state
state = asr.init_streaming_state(
    language="Hindi",
    chunk_size_sec=2.0,
    unfixed_chunk_num=2,
    unfixed_token_num=5,
)

# Feed audio in 1-second chunks
step = sr  # 1 second of samples
for pos in range(0, len(wav), step):
    chunk = wav[pos : pos + step]
    asr.streaming_transcribe(chunk, state)
    print(f"Partial: {state.text}")

# Finalize
asr.finish_streaming_transcribe(state)
print(f"Final: {state.text}")
```

### NumPy Array Input

```python
import numpy as np

# From a numpy array + sample rate (reuses the `asr` object loaded above)
audio_array = np.random.randn(16000).astype(np.float32)  # 1 second at 16 kHz
results = asr.transcribe(
    audio=(audio_array, 16000),
    language="English",
)
```


## Performance Optimization Tips

- **GPU Acceleration**: Run on a GPU (for example `device_map="cuda:0"` with the Transformers backend) for significantly faster inference
- **Precision**: Load the model in half precision, such as `dtype=torch.bfloat16` as in the Transformers example or `torch.float16`, for better speed on modern GPUs
- **Language Specification**: Pass the language (for example `language="Hindi"`) when it is known to improve accuracy and speed

349
+ ## Acknowledgments
350
+
351
+ HEEP Universal was developed using the HEEP framework for entropy-based data curation. We thank the open-source community for providing foundational tools that make this work possible.
352
+
353
+ ## Citation
354
+
355
+ If you use this model in your research, please cite:
356
+
357
+ ```bibtex
358
+ @article{anonymous2026heep,
359
+ title={HEEP: High Entropy Exponential Pruning for State-of-the-Art ASR Through Strategic Data Curation},
360
+ author={Anonymous},
361
+ journal={Under Review},
362
+ year={2026}
363
+ }
364
+ ```