# Vietnamese Regional Accent Classification Model
This is a fine-tuned version of `nguyenvulebinh/wav2vec2-base-vi` for Vietnamese regional accent classification.
The model classifies Vietnamese speech into 3 primary regional accents: North (Bắc), Central (Trung), and South (Nam).
It was fine-tuned to address a common weakness of existing models, in which the Central dialect is heavily misclassified because of its high internal diversity and acoustic variation.
## Model Performance
The model was evaluated on a held-out test set of 1,895 samples from the ViMD dataset. It achieved exceptional performance, overcoming the previous difficulties with the Central accent.
- Overall Accuracy: 94.72%
- Macro F1: 94.60%
- Weighted F1: 94.71%
### Per-class Metrics
| Accent Region | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| North (Bắc) | 95.01% | 97.18% | 96.08% | 745 |
| South (Nam) | 95.98% | 94.25% | 95.11% | 557 |
| Central (Trung) | 93.17% | 92.07% | 92.62% | 593 |
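As a sanity check, the macro and weighted F1 reported above follow directly from the per-class table; the short computation below reproduces both (the class names are just dict keys, the numbers are copied from the table):

```python
# Per-class F1 (%) and support, copied from the table above
f1 = {"north": 96.08, "south": 95.11, "central": 92.62}
support = {"north": 745, "south": 557, "central": 593}

# Macro F1: unweighted mean of the per-class F1 scores
macro_f1 = sum(f1.values()) / len(f1)

# Weighted F1: each class weighted by its share of the 1,895 test samples
weighted_f1 = sum(f1[c] * support[c] for c in f1) / sum(support.values())

# macro_f1 ≈ 94.60, weighted_f1 ≈ 94.71, matching the reported numbers
```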
### Confusion Matrix Insights
- The model shows a narrow error margin for the Central accent. Of the 593 Central samples, only 25 (4.2%) were misclassified as North and 22 (3.7%) as South. This is remarkably stable given the complex and diverse nature of the 63 provincial dialects in the ViMD dataset.
## Dataset & Training Details
- Base Model: `nguyenvulebinh/wav2vec2-base-vi` (Wav2Vec2 pretrained with self-supervised learning on Vietnamese speech).
- Dataset: ViMD (Multi-Dialect Vietnamese), containing ~19,000 utterances across 63 provinces. The 63 province codes were mapped to the 3 main regions.
- Data Splitting: Stratified split of 80% Train (15,159), 10% Validation (1,895), and 10% Test (1,895).
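The province-to-region mapping and stratified split described above can be sketched as follows. This is an illustrative reconstruction: the province codes in `REGION_MAP` are a hypothetical subset (the real ViMD mapping covers all 63 provinces), and `stratified_split` is not the authors' code.

```python
import random
from collections import defaultdict

REGION_MAP = {  # hypothetical subset of the 63 province codes
    "hanoi": "north", "haiphong": "north",
    "hue": "central", "danang": "central",
    "hochiminh": "south", "cantho": "south",
}

def stratified_split(samples, seed=42, ratios=(0.8, 0.1, 0.1)):
    """Split (path, province) samples 80/10/10, stratified by region label."""
    by_region = defaultdict(list)
    for path, province in samples:
        region = REGION_MAP[province]
        by_region[region].append((path, region))
    rng = random.Random(seed)
    train, val, test = [], [], []
    for items in by_region.values():
        rng.shuffle(items)  # shuffle within each region, then cut proportionally
        n_train = int(len(items) * ratios[0])
        n_val = int(len(items) * ratios[1])
        train += items[:n_train]
        val += items[n_train:n_train + n_val]
        test += items[n_train + n_val:]
    return train, val, test
```

Stratifying by region (rather than splitting at random) keeps the North/Central/South proportions identical across train, validation, and test, which matters here because the classes are imbalanced.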
- Training Config:
  - Max duration: 30 s chunks
  - Epochs trained: 45 (early stopping with patience = 10 on `eval_macro_f1`)
  - Precision: `bfloat16`
  - Effective batch size: 64
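The configuration above could be expressed with the Hugging Face `Trainer` API roughly as follows. Only the stated facts (`bfloat16`, effective batch size 64, patience 10 on `eval_macro_f1`) come from this card; the per-device batch size / gradient-accumulation split, the epoch cap, and the output directory are assumptions.

```python
from transformers import TrainingArguments, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="wav2vec2-base-vi-accent",     # assumed name
    per_device_train_batch_size=16,           # assumed split: 16 x 4 accumulation = 64 effective
    gradient_accumulation_steps=4,
    num_train_epochs=100,                     # assumed cap; early stopping ended training at 45
    bf16=True,                                # bfloat16 mixed precision
    eval_strategy="epoch",                    # `evaluation_strategy` in older transformers versions
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_macro_f1",
    greater_is_better=True,
)
early_stopping = EarlyStoppingCallback(early_stopping_patience=10)
```

The 30 s maximum duration would be enforced during preprocessing (truncating or chunking waveforms before feature extraction), not in `TrainingArguments`.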
## Usage
You can use this model directly through the `transformers` pipeline:

```python
import torch
from transformers import pipeline

# Load the audio-classification pipeline
classifier = pipeline(
    "audio-classification",
    model="thangquang09/wav2vec2-base-vi-accent-classification",
    device=0 if torch.cuda.is_available() else -1,
)

# Predict
result = classifier("path_to_audio_16kHz.wav")
print(result)
# Example output:
# [{'score': 0.98, 'label': 'north'}, {'score': 0.01, 'label': 'central'}, {'score': 0.01, 'label': 'south'}]
```
**Note:** Make sure your audio is sampled at 16 kHz before passing it to the model; the Wav2Vec2 feature extractor expects 16 kHz input, and other sampling rates will degrade accuracy.
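If your source audio is at a different rate, resample it first. In practice you would use `librosa.load(path, sr=16000)` or `torchaudio.transforms.Resample`; the snippet below is only a minimal NumPy sketch of what resampling does (linear interpolation is a naive method and not what those libraries use internally):

```python
import numpy as np

TARGET_SR = 16_000  # sampling rate the model expects

def resample_to_16k(audio: np.ndarray, orig_sr: int) -> np.ndarray:
    """Naive linear-interpolation resampler; prefer librosa/torchaudio in practice."""
    if orig_sr == TARGET_SR:
        return audio
    duration = len(audio) / orig_sr                    # clip length in seconds
    n_out = int(round(duration * TARGET_SR))           # output sample count
    t_out = np.linspace(0, duration, n_out, endpoint=False)
    t_in = np.arange(len(audio)) / orig_sr             # original sample times
    return np.interp(t_out, t_in, audio)               # interpolate onto the new grid
```

The resampled array can then be passed to the pipeline directly in place of a file path.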
## Labels Mapping

- 0: `north` (Miền Bắc / Northern region)
- 1: `central` (Miền Trung / Central region)
- 2: `south` (Miền Nam / Southern region)
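A minimal sketch of applying this mapping to raw model logits; the `ID2LABEL` dict mirrors the list above, while `decode_prediction` is a hypothetical helper for illustration, not part of the released model:

```python
# Class-index-to-label mapping, matching the list above
ID2LABEL = {0: "north", 1: "central", 2: "south"}
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}

def decode_prediction(logits):
    """Return the label of the highest-scoring class index."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return ID2LABEL[best]
```

The `pipeline` usage shown earlier performs this decoding for you; this is only needed when calling the model's forward pass directly.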