Update README.md
README.md
CHANGED
@@ -166,9 +166,25 @@ Output: Curated dataset D*
 - Error-aware adaptive sample selection across training rounds
 - Significant reduction in computational resources and training time
 
-## Performance Benchmarks
-
-##
 
 Word error rates (%) on Indic benchmark datasets:
 
@@ -197,6 +213,22 @@ Comparison of publicly-available models on the Hindi subset of the benchmark:
 | IndicWhisper | 10.3 | 12 | 15 | 11.4 | 7.6 | – | 26.8 | 13.8 |
 | **HEEP Indic** | **8.53** | **8.97** | **9.96** | **11.04** | **6.59** | **12.05** | **25.98** | **11.9** |
 
 ## Model Details
 
 - **Architecture**: Qwen3ASR — Transformer-based encoder-decoder optimized for multilingual transcription
- Error-aware adaptive sample selection across training rounds
- Significant reduction in computational resources and training time

## Post-Rebuttal Update: Cross-Architecture Validation with HEEP-Indic

**Addressing Q1 (Gain Attribution), Q2 (Baselines), and Q3 (Base Model Dependency)**

We apologize for posting this supplement after the rebuttal period. The results were finalized shortly after the deadline, and we wanted to provide complete experimental evidence rather than leave placeholders.

#### Resources

* **Reproducibility (Universal Model):** [https://huggingface.co/bc7ec356/heep-universal](https://huggingface.co/bc7ec356/heep-universal)
* **Cross-Architecture Model (Indic):** [https://huggingface.co/bc7ec356/heep-indic](https://huggingface.co/bc7ec356/heep-indic)

### Cross-Architecture Generalization

To directly address concerns about generalization beyond Whisper V3 Turbo, we trained **Qwen3-ASR (1.7B)**, an architecturally distinct audio-language model, on HEEP-curated data spanning **46 Indian languages** (~4.78M utterances). The curation pipeline is identical to the one described in the paper, with no architecture-specific tuning.

### Hindi Benchmark Comparison (7 Benchmarks)

Word error rates (%) on Indic benchmark datasets:

| IndicWhisper | 10.3 | 12 | 15 | 11.4 | 7.6 | – | 26.8 | 13.8 |
| **HEEP Indic** | **8.53** | **8.97** | **9.96** | **11.04** | **6.59** | **12.05** | **25.98** | **11.9** |

**HEEP-Indic achieves 11.9% average Hindi WER vs. 13.8% for IndicWhisper (a 14% relative improvement).**
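The relative-improvement figure follows directly from the two averaged WERs. As a minimal sketch (not the benchmark's official scoring script), word error rate is the word-level Levenshtein distance between hypothesis and reference, normalized by reference length:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Relative improvement between the two averaged WERs reported above:
relative_gain = (13.8 - 11.9) / 13.8  # ~0.138, i.e. the ~14% quoted
```

In practice the reported numbers come from the standardized benchmark protocols; this sketch only shows why 11.9 vs. 13.8 corresponds to a ~14% relative gain.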
### Key Takeaways

1. **Cross-architecture generalization confirmed.** The same HEEP pipeline improves two distinct backbones, Whisper V3 Turbo (0.8B, encoder-decoder) and Qwen3-ASR (1.7B, audio-language model), without modification.

2. **Controlled multilingual evaluation.** Results span 16 languages across the Indo-Aryan, Dravidian, and Classical language families on standardized benchmarks with consistent evaluation protocols.

3. **Model-independent scoring.** Entropy scoring operates on MFCCs, G2P phonemes, and token distributions, not model internals. The same curated dataset was used for both backbones.

4. **Reproducibility.** Model weights, curation code, and training scripts for both backbones are available in the anonymous repository.
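The model-independence claim can be illustrated with a small sketch. This is an illustrative assumption, not the paper's exact curation code: Shannon entropy of an empirical symbol distribution is computable from any discrete sequence (G2P phonemes, subword tokens, quantized MFCC frames) without access to either backbone's internals.

```python
import math
from collections import Counter


def shannon_entropy(tokens) -> float:
    """Shannon entropy (bits) of the empirical distribution over tokens.

    Works on any discrete sequence -- G2P phonemes, subword tokens, or
    quantized acoustic features -- with no access to model internals.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Illustrative usage: a repetitive transcript scores lower than a varied one,
# so entropy can rank candidate samples for curation model-free.
low = shannon_entropy("la la la la".split())
high = shannon_entropy("error aware adaptive sample selection".split())
assert low < high
```

Because the score depends only on the data, the same curated dataset can be reused unchanged for both Whisper V3 Turbo and Qwen3-ASR, which is what makes the cross-architecture comparison controlled.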
*We hope Reviewers 2ezj, oXjG, and S4Jd also find this supplementary evidence relevant to their earlier questions on generalization and controlled multilingual evaluation.*

## Model Details

- **Architecture**: Qwen3ASR — Transformer-based encoder-decoder optimized for multilingual transcription
|