Qwen3-ASR-1.7B-Bengali (Specialized)

This repository contains a fine-tuned version of the Qwen3-ASR-1.7B model, specialized for Bengali Automatic Speech Recognition (ASR).

Through full-parameter Supervised Fine-Tuning (SFT) on 241 hours of Bangladeshi Bengali audio, this model overcomes the "language bleeding" and script-hallucination limitations of the base foundation model and produces highly accurate native Bengali transcriptions.

Comparative Evaluation Results

To ensure a rigorous and unbiased benchmark, the model was evaluated on 1,000 audio files sampled from the SUBAK.KO validation set with a fixed random seed (deterministic sampling). Metrics were calculated with the industry-standard jiwer library.

Note: For fairness, the outputs of the baseline models (Qwen Base and Whisper) were normalized with bnunicodenormalizer to avoid penalties from visually identical but differently encoded Unicode sequences. The fine-tuned model's output was evaluated raw, demonstrating its native alignment to the language.

| Model                       | WER (%) | CER (%) | Script Err. % (Devanagari) |
|-----------------------------|---------|---------|----------------------------|
| Qwen3-ASR-1.7B-Bengali (FT) | 23.91   | 9.57    | 0.00                       |
| Qwen3-ASR-1.7B (Base)       | 71.17   | 40.26   | 80.20                      |
| OpenAI Whisper Large-v3     | 72.73   | 28.73   | 29.40                      |
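Both WER and CER are length-normalized edit distances, computed over words and characters respectively. The reported numbers come from jiwer; the sketch below is a minimal pure-Python illustration of the metric, not the actual evaluation script:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (one-row DP)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev = dp[0]          # D[i-1][j-1]
        dp[0] = i
        for j, h in enumerate(hyp, 1):
            cur = dp[j]       # D[i-1][j]
            dp[j] = min(dp[j] + 1,            # deletion
                        dp[j - 1] + 1,        # insertion
                        prev + (r != h))      # substitution / match
            prev = cur
    return dp[-1]

def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)

def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance over reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```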

Technical Interpretations & Breakthroughs

  1. Elimination of Language Bleeding: The base Qwen3 model lacked deep alignment for the Bengali language. When prompted to transcribe Bengali, it suffered an 80.2% script error rate, outputting Devanagari (Hindi) characters. This fine-tune completely eradicated this hallucination (0.00% error), mapping the acoustic features directly to the Bengali script.
  2. Outperforming Generalist Models: Whisper Large-v3 struggles heavily with the SUBAK.KO dataset due to regional Bangladeshi dialects, broadcast noise, and English loanword formatting. By specializing the weights on local data, this 1.7B model achieves a ~3x reduction in Word Error Rate (WER) compared to the massive Whisper Large-v3 model.
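The Script Err. % column can be understood as the fraction of hypotheses containing any character from the Devanagari Unicode block (U+0900–U+097F) rather than the Bengali block (U+0980–U+09FF). A rough sketch of such a check (the exact evaluation script is not part of this card):

```python
DEVANAGARI = range(0x0900, 0x0980)  # Devanagari Unicode block (Hindi script)
BENGALI = range(0x0980, 0x0A00)     # Bengali Unicode block

def has_devanagari(text):
    """True if any character falls in the Devanagari block."""
    return any(ord(ch) in DEVANAGARI for ch in text)

def script_error_rate(hypotheses):
    """Percentage of transcriptions containing any Devanagari character."""
    flagged = sum(has_devanagari(h) for h in hypotheses)
    return 100.0 * flagged / len(hypotheses)
```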

Training Configuration

  • Methodology: Full-parameter Supervised Fine-Tuning (SFT) using a custom PyTorch/Transformers pipeline. No parameter-efficient method (e.g., LoRA) was used, ensuring deep architectural alignment to the new linguistic representations.
  • Dataset: SUBAK.KO (সুবাক্য) - 241 hours of annotated Bangladeshi Bengali corpus (229h Read Speech; 12h Broadcast Speech).
  • Optimization: Trained on NVIDIA A100 GPUs using bfloat16 precision and native HF gradient accumulation, masking prompt tokens (-100) to focus loss calculation purely on the Bengali textual output.
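The -100 masking mentioned above follows the standard Hugging Face/PyTorch convention: label positions set to -100 are ignored by CrossEntropyLoss, so the loss is computed only over the transcription tokens. A simplified sketch with illustrative (not real) token IDs:

```python
IGNORE_INDEX = -100  # ignored by PyTorch CrossEntropyLoss by default

def build_labels(input_ids, prompt_len):
    """Copy input_ids, masking the prompt portion so the loss applies
    only to the transcription tokens that follow it."""
    return [IGNORE_INDEX] * prompt_len + input_ids[prompt_len:]

# Example: 4 prompt tokens (audio/user turn) + 3 transcription tokens
input_ids = [101, 102, 103, 104, 7, 8, 9]
labels = build_labels(input_ids, prompt_len=4)
# labels -> [-100, -100, -100, -100, 7, 8, 9]
```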

Usage

Method 1: Hugging Face Transformers

Since Qwen3-ASR is a multimodal LLM, you must provide a text prompt containing the <|audio_pad|> token so the model knows where to "listen."

import torch
import librosa
from transformers import AutoModel, AutoProcessor

model_id = "amugoodbad229/Qwen3-ASR-Bengali-FT" 

# 1. Initialize Processor and Model
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id, 
    dtype=torch.bfloat16, 
    device_map="auto", 
    trust_remote_code=True
)

# 2. Load audio (16kHz mono)
audio, _ = librosa.load("path/to/audio.wav", sr=16000)

# 3. Prepare Input with Prompt
# The <|audio_pad|> token is mandatory for the processor
prompt = "<|im_start|>user\n<|audio_pad|>Please transcribe.<|im_end|>\n<|im_start|>assistant\n"
inputs = processor(text=prompt, audio=audio, sampling_rate=16000, return_tensors="pt").to(model.device)

# Ensure floating point inputs match model precision
inputs = {k: v.to(torch.bfloat16) if v.is_floating_point() else v for k, v in inputs.items()}

# 4. Generate
generated_ids = model.generate(**inputs, max_new_tokens=256)

# 5. Decode
transcription = processor.batch_decode(
    generated_ids[:, inputs["input_ids"].shape[1]:], 
    skip_special_tokens=True
)[0]

print(f"Result: {transcription}")

Method 2: Qwen-ASR Official Wrapper (Recommended)

This is the most efficient way to transcribe long audio files.

from qwen_asr import Qwen3ASRModel

# Load the model using the official wrapper
# Ensure your model repo has the chat_template defined in tokenizer_config.json
model = Qwen3ASRModel.from_pretrained("amugoodbad229/Qwen3-ASR-Bengali-FT")

# Transcribe audio (Handles long-form audio chunking natively)
results = model.transcribe(audio=["path/to/your/audio.wav"], language=[None])
print(results[0].text)
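For intuition, long-form chunking typically splits the waveform into fixed, slightly overlapping windows at 16 kHz. The wrapper handles this natively; the sketch below (with assumed 30 s windows and 1 s overlap, not the wrapper's actual parameters) only illustrates the idea:

```python
SAMPLE_RATE = 16000  # Hz, matching the model's expected input rate

def chunk_audio(samples, window_s=30.0, overlap_s=1.0):
    """Split a sample sequence into overlapping (start, end) index windows."""
    win = int(window_s * SAMPLE_RATE)
    step = int((window_s - overlap_s) * SAMPLE_RATE)
    chunks, start = [], 0
    while start < len(samples):
        chunks.append((start, min(start + win, len(samples))))
        if start + win >= len(samples):
            break  # last window reached the end of the audio
        start += step
    return chunks
```

Each chunk is transcribed independently and the texts are joined; the overlap helps avoid cutting words at window boundaries.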
