# Qwen3-ASR-1.7B-Bengali (Specialized)
This repository contains a fine-tuned, specialized version of the Qwen3-ASR-1.7B model, fundamentally optimized for Bengali Automatic Speech Recognition (ASR).
Through full-parameter Supervised Fine-Tuning (SFT) on 241 hours of Bangladeshi Bengali audio, this model overcomes the "language bleeding" and script-hallucination limitations of the base foundation model, producing highly accurate native Bengali transcriptions.
## Comparative Evaluation Results
To ensure a rigorous and unbiased benchmark, the model was evaluated on a deterministic (fixed-seed) random sample of 1,000 audio files from the SUBAK.KO validation set. Metrics were calculated with the industry-standard jiwer library.

Note: For fairness, the baseline outputs (Qwen Base and Whisper) were normalized via bnunicodenormalizer to prevent visual-Unicode mismatch penalties. The fine-tuned model's output was evaluated raw, demonstrating its native alignment to the Bengali script.
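For reference, the sketch below shows how WER and CER are defined: edit distance over words (WER) or characters (CER), normalized by reference length. This is a plain-Python illustration of what jiwer computes; the Bengali sample strings are illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev + (r != h))    # substitution (0 if match)
            prev = cur
    return dp[-1]

def wer(ref, hyp):
    """Word Error Rate: word-level edit distance / reference word count."""
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def cer(ref, hyp):
    """Character Error Rate: char-level edit distance / reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)

# One substituted word out of three -> WER of 1/3
print(round(wer("আমি ভাত খাই", "আমি ভাত খাব"), 3))  # 0.333
```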
| Model | WER (%) | CER (%) | Script Err% (Devanagari) |
|---|---|---|---|
| Qwen3-ASR-1.7B-Bengali (FT) | 23.91% | 9.57% | 0.00% |
| Qwen3-ASR-1.7B (Base) | 71.17% | 40.26% | 80.20% |
| OpenAI Whisper Large-v3 | 72.73% | 28.73% | 29.40% |
## Technical Interpretations & Breakthroughs
- Elimination of Language Bleeding: The base Qwen3 model lacked deep alignment for the Bengali language. When prompted to transcribe Bengali, it suffered an 80.2% script error rate, outputting Devanagari (Hindi) characters. This fine-tune completely eradicated this hallucination (0.00% error), mapping the acoustic features directly to the Bengali script.
- Outperforming Generalist Models: Whisper Large-v3 struggles heavily with the SUBAK.KO dataset due to regional Bangladeshi dialects, broadcast noise, and English loanword formatting. By specializing the weights on local data, this 1.7B model achieves a ~3x reduction in Word Error Rate (72.73% → 23.91%) compared to the much larger Whisper Large-v3.
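The script-error metric above can be approximated with a simple Unicode-range check: Devanagari occupies U+0900–U+097F and Bengali U+0980–U+09FF. The sketch below is an illustration of such a check, not the exact evaluation script used here.

```python
DEVANAGARI = range(0x0900, 0x0980)  # Devanagari (Hindi) Unicode block

def has_devanagari(text: str) -> bool:
    """True if any character falls inside the Devanagari block."""
    return any(ord(ch) in DEVANAGARI for ch in text)

def script_error_rate(transcripts) -> float:
    """Percentage of transcripts containing any Devanagari characters."""
    flagged = sum(has_devanagari(t) for t in transcripts)
    return 100.0 * flagged / len(transcripts)

# Example: one Bengali sentence, one Devanagari hallucination
samples = ["আমি বাংলায় কথা বলি", "मैं हिंदी में"]
print(script_error_rate(samples))  # 50.0
```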
## Training Configuration
- Methodology: Full-parameter Supervised Fine-Tuning (SFT) using a custom PyTorch/Transformers pipeline. No Parameter-Efficient Fine-Tuning (LoRA) was used, ensuring deep architectural alignment to the new linguistic representations.
- Dataset: SUBAK.KO (সুবাক্য) - 241 hours of annotated Bangladeshi Bengali corpus (229h Read Speech; 12h Broadcast Speech).
- Optimization: Trained on NVIDIA A100 GPUs using bfloat16 precision and native HF gradient accumulation, with prompt tokens masked to -100 so the loss is computed purely on the Bengali textual output.
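The prompt-masking step above can be sketched as follows: label positions covering the prompt are set to -100, the index that PyTorch's cross-entropy loss ignores, so only the Bengali target tokens contribute to the gradient. The token IDs here are made up for illustration.

```python
IGNORE_INDEX = -100  # ignored by PyTorch's CrossEntropyLoss by default

def build_labels(prompt_ids, target_ids):
    """Mask prompt positions so the loss covers only the target text."""
    return [IGNORE_INDEX] * len(prompt_ids) + list(target_ids)

# Hypothetical token IDs: chat/audio prompt tokens + Bengali target + EOS
prompt_ids = [101, 9000, 9000, 102]
target_ids = [5013, 5921, 6077, 2]

labels = build_labels(prompt_ids, target_ids)
print(labels)  # [-100, -100, -100, -100, 5013, 5921, 6077, 2]
```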
## Usage
### Method 1: Hugging Face Transformers
Since Qwen3-ASR is a multimodal LLM, you must provide a text prompt containing the <|audio_pad|> token so the model knows where to "listen."
```python
import torch
import librosa
from transformers import AutoModel, AutoProcessor

model_id = "amugoodbad229/Qwen3-ASR-Bengali-FT"

# 1. Initialize processor and model
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# 2. Load audio (16 kHz mono)
audio, _ = librosa.load("path/to/audio.wav", sr=16000)

# 3. Prepare input with prompt
# The <|audio_pad|> token is mandatory for the processor
prompt = "<|im_start|>user\n<|audio_pad|>Please transcribe.<|im_end|>\n<|im_start|>assistant\n"
inputs = processor(text=prompt, audio=audio, sampling_rate=16000, return_tensors="pt").to(model.device)

# Ensure floating-point inputs match model precision
inputs = {k: v.to(torch.bfloat16) if v.is_floating_point() else v for k, v in inputs.items()}

# 4. Generate
generated_ids = model.generate(**inputs, max_new_tokens=256)

# 5. Decode only the newly generated tokens
transcription = processor.batch_decode(
    generated_ids[:, inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)[0]
print(f"Result: {transcription}")
```
### Method 2: Qwen-ASR Official Wrapper (Recommended)
This is the most efficient way to transcribe long audio files.
```python
from qwen_asr import Qwen3ASRModel

# Load the model using the official wrapper
# Ensure the model repo has a chat_template defined in tokenizer_config.json
model = Qwen3ASRModel.from_pretrained("amugoodbad229/Qwen3-ASR-Bengali-FT")

# Transcribe audio (handles long-form audio chunking natively)
results = model.transcribe(audio=["path/to/your/audio.wav"], language=[None])
print(results[0].text)
```