Vietnamese Regional Accent Classification Model

This is a fine-tuned version of nguyenvulebinh/wav2vec2-base-vi for Vietnamese regional accent classification. The model classifies Vietnamese speech into three primary regional accents: North (Bắc), Central (Trung), and South (Nam).

It was fine-tuned to address a common weakness of existing models: the Central dialect is frequently misclassified because of its high internal diversity and acoustic variation.

Model Performance

The model was evaluated on a held-out test set of 1,895 samples from the ViMD dataset. It achieves strong performance across all three classes, including the historically difficult Central accent.

  • Overall Accuracy: 94.72%
  • Macro F1: 94.60%
  • Weighted F1: 94.71%

Per-class Metrics

| Accent Region   | Precision | Recall | F1-Score | Support |
|-----------------|-----------|--------|----------|---------|
| North (Bắc)     | 95.01%    | 97.18% | 96.08%   | 745     |
| South (Nam)     | 95.98%    | 94.25% | 95.11%   | 557     |
| Central (Trung) | 93.17%    | 92.07% | 92.62%   | 593     |
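The macro F1 reported above is the unweighted mean of the three per-class F1 scores in this table; a quick arithmetic check:

```python
# Per-class F1 scores from the table above (as percentages).
f1_scores = {"north": 96.08, "south": 95.11, "central": 92.62}

# Macro F1 averages across classes without weighting by support.
macro_f1 = sum(f1_scores.values()) / len(f1_scores)
print(round(macro_f1, 2))  # 94.6
```

The weighted F1 (94.71%) instead weights each class by its support (745, 557, 593), which is why it sits closer to the North and South scores.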

Confusion Matrix Insights

  • The model shows a narrow error span for the Central accent: of 593 Central samples, only 25 (4.2%) were misclassified as North and 22 (3.7%) as South. This is notably stable given the diversity of the 63 provincial varieties represented in the ViMD dataset.
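The counts above are off-diagonal entries of a standard confusion matrix. A minimal sketch of computing one with scikit-learn (the toy `y_true`/`y_pred` arrays below are placeholders, not the real test set):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# 0 = north, 1 = central, 2 = south (placeholder predictions).
y_true = np.array([0, 1, 1, 2, 1, 0])
y_pred = np.array([0, 1, 0, 2, 2, 0])

# Row i, column j counts samples of true class i predicted as class j,
# so the off-diagonal entries in the "central" row are the
# Central-as-North / Central-as-South errors discussed above.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(cm)
```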

Dataset & Training Details

  • Base Model: nguyenvulebinh/wav2vec2-base-vi (Pretrained Wav2Vec2 SSL on Vietnamese).
  • Dataset: ViMD (Multi-Dialect Vietnamese) - contains ~19,000 utterances across 63 provinces. The 63 province codes were mapped to the 3 main regions.
  • Data Splitting: Stratified split of 80% Train (15,159), 10% Validation (1,895), and 10% Test (1,895).
  • Training Config:
    • Max Duration: 30s chunks
    • Epochs trained: 45 (Early stopping patience = 10 on eval_macro_f1)
    • Precision: bfloat16
    • Effective Batch Size: 64
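The 80/10/10 stratified split described above can be sketched with scikit-learn by carving off 20% and then halving it; the sample data here is a stand-in, not the real ViMD file list:

```python
from sklearn.model_selection import train_test_split

# Placeholder (audio_path, region) pairs after mapping province codes
# to the 3 regions; 200 samples total for illustration.
samples = (
    [("a.wav", "north")] * 80
    + [("b.wav", "central")] * 60
    + [("c.wav", "south")] * 60
)
labels = [region for _, region in samples]

# First split off 20%, stratified by region label...
train, holdout = train_test_split(
    samples, test_size=0.2, stratify=labels, random_state=42
)
# ...then split that 20% evenly into validation and test.
holdout_labels = [region for _, region in holdout]
val, test = train_test_split(
    holdout, test_size=0.5, stratify=holdout_labels, random_state=42
)

print(len(train), len(val), len(test))  # 160 20 20
```

Stratifying both splits keeps the regional class balance identical across train, validation, and test, which matters here because the three regions have unequal support.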

Usage

You can use this model directly through the transformers pipeline:

```python
import torch
from transformers import pipeline

# Load the audio-classification pipeline (GPU if available, else CPU)
classifier = pipeline(
    "audio-classification",
    model="thangquang09/wav2vec2-base-vi-accent-classification",
    device=0 if torch.cuda.is_available() else -1,
)

# Predict on a 16 kHz audio file
result = classifier("path_to_audio_16kHz.wav")
print(result)
# Example output:
# [{'score': 0.98, 'label': 'north'}, {'score': 0.01, 'label': 'central'}, {'score': 0.01, 'label': 'south'}]
```

Note: Make sure your audio is sampled at 16kHz before passing it to the model for optimal accuracy.

Labels Mapping

  • 0: north (Miền Bắc)
  • 1: central (Miền Trung)
  • 2: south (Miền Nam)
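The pipeline's label strings come from this `id2label` mapping; decoding raw logits by hand amounts to a softmax followed by an argmax. A self-contained sketch (the logits here are stand-ins for a real forward pass):

```python
import math

# id2label mapping from the list above.
id2label = {0: "north", 1: "central", 2: "south"}

# Stand-in logits; in practice these come from the model output.
logits = [3.2, 0.1, -1.0]

# Softmax turns logits into probabilities summing to 1.
exps = [math.exp(x) for x in logits]
probs = [e / sum(exps) for e in exps]

# Argmax picks the predicted region.
pred = id2label[probs.index(max(probs))]
print(pred)  # north
```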