ViP-VL: Vietnamese Self-supervised speech Pretraining model leveraging Vector-quantization Learning

ViP-VL is a Vietnamese self-supervised speech Pretraining model leveraging Vector-quantization Learning, accepted to INTERSPEECH 2026. This repository hosts the pretrained ViP-VL model: a ChunkFormer encoder pretrained on large-scale unlabeled Vietnamese speech with a random-projection-quantizer masked-prediction objective (BEST-RQ). It is designed to initialize downstream finetuning (ASR / RNN-T / classification).

Method

ViP-VL adapts the random-projection-quantizer masked-prediction recipe (BEST-RQ) to a ChunkFormer backbone with aggressive 8× temporal subsampling, keeping the masking pattern synchronized with the encoder's subsampling rate:

Masking is applied to the raw 10 ms log-mel frames before subsampling; a subsampled frame is treated as masked iff ≥ 80 % of its constituent input frames are masked.
Targets come from a frozen random-projection quantizer: a fixed random projection of the (CMVN-normalized) input is matched by L2 nearest-neighbour to a fixed random codebook (1024 entries, dimension 16); the encoder is trained with a masked language-model (NLL) objective over masked positions.
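The two rules above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the names `subsampled_mask` and `RandomProjectionQuantizer` are hypothetical, the Gaussian initialization and seed are assumptions, and stacking 8 consecutive 80-dim frames into one 640-dim quantizer input is an assumption made here so that targets line up with the 8× subsampled encoder outputs.

```python
import numpy as np

SUBSAMPLING = 8        # encoder temporal subsampling factor (from the text)
MASK_THRESHOLD = 0.8   # subsampled frame is masked iff >= 80% of its inputs are masked
CODEBOOK_SIZE = 1024   # codebook entries (from the text)
CODE_DIM = 16          # codebook dimension (from the text)


def subsampled_mask(frame_mask: np.ndarray) -> np.ndarray:
    """Map a boolean mask over 10 ms input frames to a mask over subsampled frames."""
    usable = (len(frame_mask) // SUBSAMPLING) * SUBSAMPLING  # drop the ragged tail
    groups = frame_mask[:usable].reshape(-1, SUBSAMPLING)
    return groups.mean(axis=1) >= MASK_THRESHOLD


class RandomProjectionQuantizer:
    """Frozen random projection plus frozen random codebook; nothing here is trained."""

    def __init__(self, input_dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)  # assumption: Gaussian init, fixed seed
        self.projection = rng.standard_normal((input_dim, CODE_DIM))
        self.codebook = rng.standard_normal((CODEBOOK_SIZE, CODE_DIM))

    def __call__(self, feats: np.ndarray) -> np.ndarray:
        projected = feats @ self.projection  # (T, CODE_DIM)
        # Squared L2 distance from every projected frame to every code: (T, CODEBOOK_SIZE)
        diffs = projected[:, None, :] - self.codebook[None, :, :]
        return (diffs ** 2).sum(axis=-1).argmin(axis=1)


rng = np.random.default_rng(1)
feats = rng.standard_normal((64, 80))        # stand-in for 64 CMVN-normalized frames
frame_mask = np.zeros(64, dtype=bool)
frame_mask[8:15] = True                      # 7 of 8 frames in group 1 (87.5%)
frame_mask[16:22] = True                     # 6 of 8 frames in group 2 (75%)

mask = subsampled_mask(frame_mask)           # only group 1 crosses the 80% threshold
stacked = feats.reshape(-1, SUBSAMPLING * 80)            # (8, 640), assumed stacking
targets = RandomProjectionQuantizer(stacked.shape[1])(stacked)  # (8,) code indices
```

In training, the NLL loss would be computed only at positions where `mask` is true, with `targets` as the class labels.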

Architecture


Encoder	ChunkFormer
Encoder blocks	12
Hidden size	512
Attention heads	8
FFN size	2048
CNN module kernel	15
Subsampling	`dw_striding` (8×)
Positional encoding	chunk relative
Input features	80-dim log-mel fbank @ 16 kHz
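A few quantities follow directly from these settings. The sketch below is illustrative only: `EncoderSpec` is a hypothetical container, not a class from the ChunkFormer codebase, and the frame-count estimate ignores convolution edge effects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderSpec:
    """Hypothetical container for the hyperparameters listed above."""
    num_blocks: int = 12
    hidden_size: int = 512
    attention_heads: int = 8
    ffn_size: int = 2048
    cnn_kernel: int = 15
    subsampling: int = 8
    input_frame_shift_ms: int = 10  # 10 ms log-mel frames, per the Method section

    @property
    def head_dim(self) -> int:
        # Per-head attention dimension
        return self.hidden_size // self.attention_heads

    @property
    def output_frame_shift_ms(self) -> int:
        # Time covered by one encoder output frame
        return self.input_frame_shift_ms * self.subsampling

    def approx_output_frames(self, duration_s: float) -> int:
        # Rough estimate; the true count may differ slightly at the edges
        return int(duration_s * 1000) // self.output_frame_shift_ms


spec = EncoderSpec()
print(spec.head_dim)                     # 64
print(spec.output_frame_shift_ms)        # 80
print(spec.approx_output_frames(10.0))   # 125
```

Each encoder output frame therefore covers roughly 80 ms of audio, so a 10-second utterance yields on the order of 125 frames.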

Files

pytorch_model.pt — encoder-only state dict (encoder.*).
config.yaml — encoder configuration (encoder_conf) and feature settings.
global_cmvn — global CMVN statistics used during pretraining.

Finetuning

The checkpoint contains only encoder weights and is loaded with strict=False, so any ChunkFormer ASR / RNN-T / classification recipe can be pointed at it; the task heads are absent from the checkpoint and are trained from scratch. Make sure the downstream encoder_conf matches config.yaml.
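To show what loading with strict=False does in practice, here is a small self-contained PyTorch sketch. `ToyASR` is a hypothetical stand-in with placeholder layers, not the ChunkFormer model, and the encoder-only state dict is simulated in memory rather than read from pytorch_model.pt.

```python
import torch
from torch import nn


class ToyASR(nn.Module):
    """Hypothetical downstream model: a pretrained encoder plus a new task head."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(80, 512)     # placeholder for the ChunkFormer encoder
        self.ctc_head = nn.Linear(512, 100)   # task head, trained from scratch


# Simulate an encoder-only checkpoint whose keys all start with "encoder."
pretrained = ToyASR()
encoder_only = {
    name: tensor
    for name, tensor in pretrained.state_dict().items()
    if name.startswith("encoder.")
}

model = ToyASR()
result = model.load_state_dict(encoder_only, strict=False)

print(result.missing_keys)      # head parameters, left at their random initialization
print(result.unexpected_keys)   # empty when every checkpoint key exists in the model
```

Note that strict=False only tolerates missing or unexpected keys. A shape mismatch on a shared key still raises an error, which is why the downstream encoder_conf has to match config.yaml.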

The checkpoint argument accepts either a local path or this repo id. load_checkpoint first looks for a local file or directory; if none is found, it downloads pytorch_model.pt from the Hub and caches it locally. No manual download step is required:

# e.g. in examples/asr/ctc/run.sh (or rnnt / classification)

# Option A — download straight from the Hub (recommended)
checkpoint=khanhld/vip-vl-base-vie

# Option B — local path to an exported bundle
checkpoint=/path/to/khanhld/vip-vl-base-vie/pytorch_model.pt
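The local-first lookup described above could be sketched roughly as follows. This is not the actual load_checkpoint code: `resolve_checkpoint` is a hypothetical helper, and its handling of directories (looking for pytorch_model.pt inside them) is an assumption. The only external call is `hf_hub_download` from the `huggingface_hub` package.

```python
import os


def resolve_checkpoint(checkpoint: str, filename: str = "pytorch_model.pt") -> str:
    """Return a local path to the weights file, downloading from the Hub if needed."""
    if os.path.isfile(checkpoint):
        # Option B: an explicit path to the weights file
        return checkpoint
    if os.path.isdir(checkpoint):
        # Assumption: a local bundle directory holds the weights under `filename`
        return os.path.join(checkpoint, filename)
    # Option A: not a local path, so treat the string as a Hub repo id.
    # Imported lazily so that purely local use does not require the package.
    from huggingface_hub import hf_hub_download
    return hf_hub_download(repo_id=checkpoint, filename=filename)
```

Calling `resolve_checkpoint("khanhld/vip-vl-base-vie")` would take the Hub branch and return the path of the cached file, while a local file or directory path is resolved without any network access.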

Feature Extraction

from chunkformer import ChunkFormerModel
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"  # fall back to CPU when no GPU is available

# Load a pre-trained model from Hugging Face or local directory
model = ChunkFormerModel.from_pretrained("khanhld/chunkformer-ctc-large-vie").to(device)
x, x_len = model._load_audio_and_extract_features("path/to/audio")  # x: (T, F), x_len: int
x = x.unsqueeze(0).to(device)
x_len = torch.tensor([x_len], device=device)

# Extract encoder features
feature, feature_len = model.encode(
    xs=x,
    xs_lens=x_len,
)

print("feature: ", feature.shape)
print("feature_len: ", feature_len)

Citation

If you use this model, please cite ViP-VL and ChunkFormer:

@inproceedings{vipvl,
    title={ViP-VL: Vietnamese Self-supervised Speech Pretraining Model Leveraging Vector-Quantization Learning},
    author={Khanh Le* and Kiet Anh Hoang* and Bao Nguyen* and Duy Vo* and Dung Vo and Thai Tran and Linh Pham and Khoa D Doan},
    booktitle={Proc. INTERSPEECH 2026},
    year={2026},
    url={https://arxiv.org/abs/2606.10360}
}

@INPROCEEDINGS{10888640,
    author={Le, Khanh and Ho, Tuan Vu and Tran, Dung and Chau, Duc Thanh},
    booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
    title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription},
    year={2025},
    pages={1-5},
    doi={10.1109/ICASSP49660.2025.10888640}
}

