ViP-VL: Vietnamese Self-supervised speech Pretraining model leveraging Vector-quantization Learning

ViP-VL is a self-supervised speech pretraining model for Vietnamese, accepted to INTERSPEECH 2026. This repository hosts the pretrained ViP-VL model: a ChunkFormer encoder pretrained on large-scale unlabeled Vietnamese speech with a random-projection-quantizer masked-prediction objective (BEST-RQ). It is designed to initialize downstream finetuning (ASR / RNN-T / classification).

Method

ViP-VL adapts the random-projection-quantizer masked-prediction recipe (BEST-RQ) to an aggressive 8× temporal-subsampling ChunkFormer backbone, fixing the synchronization between the masking manifold and the encoder's subsampling rate:

  • Masking is applied to the raw 10 ms log-mel frames before subsampling; a subsampled frame is treated as masked iff ≥ 80 % of its constituent input frames are masked.
  • Targets come from a frozen random-projection quantizer: a fixed random projection of the (CMVN-normalized) input is matched by L2 nearest-neighbour to a fixed random codebook (1024 entries, dimension 16); the encoder is trained with a masked language-model (NLL) objective over masked positions.
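The two rules above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the repository's implementation. The function and class names, the non-overlapping windows of 8 frames, the standard-normal initialisation, the seeds, and the stacking of 8 input frames per target are all assumptions. Only the 8× factor, the 80 % threshold, the L2 nearest-neighbour match, and the 1024 × 16 codebook come from the description above.

```python
import numpy as np


def subsampled_mask(frame_mask, factor=8, threshold=0.8):
    """Map a frame-level boolean mask onto the subsampled time axis.

    Assumption: each subsampled frame covers `factor` consecutive,
    non-overlapping input frames; a trailing partial window is dropped.
    """
    frame_mask = np.asarray(frame_mask, dtype=bool)
    n_out = len(frame_mask) // factor
    windows = frame_mask[: n_out * factor].reshape(n_out, factor)
    # Masked iff at least `threshold` of the constituent frames are masked.
    return windows.mean(axis=1) >= threshold


class RandomProjectionQuantizer:
    """Frozen random projection followed by an L2 nearest-neighbour lookup."""

    def __init__(self, input_dim, codebook_size=1024, code_dim=16, seed=0):
        rng = np.random.default_rng(seed)
        # Both matrices are drawn once and never updated (assumed init).
        self.projection = rng.standard_normal((input_dim, code_dim))
        self.codebook = rng.standard_normal((codebook_size, code_dim))

    def __call__(self, features):
        projected = features @ self.projection               # (T, code_dim)
        diff = projected[:, None, :] - self.codebook[None]   # (T, K, code_dim)
        sq_dist = (diff ** 2).sum(axis=-1)                   # (T, K)
        return sq_dist.argmin(axis=1)                        # (T,) code indices


# 32 input frames -> 4 subsampled frames.
frame_mask = np.zeros(32, dtype=bool)
frame_mask[0:8] = True     # window 0: 8/8 frames masked
frame_mask[8:15] = True    # window 1: 7/8 frames masked
frame_mask[16:22] = True   # window 2: 6/8 frames masked
sub_mask = subsampled_mask(frame_mask)

# Random stand-in for CMVN-normalized 80-dim log-mel features.
fbank = np.random.default_rng(1).standard_normal((32, 80))
stacked = fbank.reshape(4, 8 * 80)   # assumption: stack 8 frames per target
quantizer = RandomProjectionQuantizer(input_dim=8 * 80)
targets = quantizer(stacked)

# The masked-prediction loss would only use targets at masked positions.
masked_targets = targets[sub_mask]
print(sub_mask.tolist())   # [True, True, False, False]
```

Window 2 has 6 of 8 frames masked (75 %), so it falls below the threshold and is treated as unmasked.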

Architecture

Encoder: ChunkFormer
Encoder blocks: 12
Hidden size: 512
Attention heads: 8
FFN size: 2048
CNN module kernel: 15
Subsampling: dw_striding (8×)
Positional encoding: chunk relative
Input features: 80-dim log-mel fbank @ 16 kHz
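A few derived quantities follow from the settings above. The sketch below uses the 10 ms frame shift mentioned in the Method section and approximates the encoder output length as ceil(T / 8). The exact length depends on the padding used by the subsampling module, which is not specified here, so treat that function as an estimate.

```python
import math

FRAME_SHIFT_MS = 10   # log-mel frame shift, from the Method section
SUBSAMPLING = 8       # dw_striding factor
HIDDEN_SIZE = 512
NUM_HEADS = 8

head_dim = HIDDEN_SIZE // NUM_HEADS               # per-head dimension
encoder_frame_ms = FRAME_SHIFT_MS * SUBSAMPLING   # time covered by one encoder frame


def approx_encoder_frames(duration_s):
    """Rough number of encoder frames for an utterance of `duration_s` seconds."""
    input_frames = int(duration_s * 1000) // FRAME_SHIFT_MS
    return math.ceil(input_frames / SUBSAMPLING)


print(head_dim, encoder_frame_ms, approx_encoder_frames(10.0))  # 64 80 125
```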

Files

  • pytorch_model.pt — encoder-only state dict (encoder.*).
  • config.yaml — encoder configuration (encoder_conf) and feature settings.
  • global_cmvn — global CMVN statistics used during pretraining.

Finetuning

The encoder weights load with strict=False, so point any ChunkFormer ASR / RNN-T / classification recipe at this checkpoint and train the task heads from scratch. Make sure the downstream encoder_conf matches config.yaml.
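Because strict=False silently skips keys that do not line up, a mismatched encoder configuration can go unnoticed. A small check such as the following may help. It is a hypothetical helper, not part of the repository: it compares two already-parsed encoder_conf dictionaries (for example, loaded with a YAML parser), and the key names in the usage example are illustrative placeholders rather than the actual keys in config.yaml.

```python
def diff_encoder_conf(pretrained, downstream):
    """Return {key: (pretrained_value, downstream_value)} for every differing key."""
    missing = "<missing>"
    diffs = {}
    for key in sorted(set(pretrained) | set(downstream)):
        a = pretrained.get(key, missing)
        b = downstream.get(key, missing)
        if a != b:
            diffs[key] = (a, b)
    return diffs


# Placeholder key names, for illustration only.
pretrained_conf = {"output_size": 512, "attention_heads": 8, "num_blocks": 12}
downstream_conf = {"output_size": 512, "attention_heads": 4, "num_blocks": 12}

print(diff_encoder_conf(pretrained_conf, downstream_conf))
# {'attention_heads': (8, 4)}
```

An empty result means the two configurations agree on every key.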

The checkpoint argument accepts either a local path or this repo id directly. load_checkpoint looks for a local file or directory first and otherwise downloads pytorch_model.pt from the Hub automatically (cached locally), so no manual download step is required:

# e.g. in examples/asr/ctc/run.sh (or rnnt / classification)

# Option A — download straight from the Hub (recommended)
checkpoint=khanhld/vip-vl-base-vie

# Option B — local path to an exported bundle
checkpoint=/path/to/khanhld/vip-vl-base-vie/pytorch_model.pt
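The lookup order described above can be sketched as follows. This is not the repository's load_checkpoint: resolve_checkpoint is a hypothetical name, and the handling of directories (looking for pytorch_model.pt inside them) is an assumption. The only external call is hf_hub_download from huggingface_hub, imported lazily so that the local branches work offline.

```python
import os


def resolve_checkpoint(checkpoint, filename="pytorch_model.pt"):
    """Return a local path to the weights for a path or a Hub repo id."""
    # 1. An existing local file is used as-is.
    if os.path.isfile(checkpoint):
        return checkpoint
    # 2. An existing local directory is assumed to contain `filename`.
    if os.path.isdir(checkpoint):
        candidate = os.path.join(checkpoint, filename)
        if os.path.isfile(candidate):
            return candidate
        raise FileNotFoundError(f"{filename} not found in {checkpoint}")
    # 3. Otherwise treat the string as a Hub repo id; the download is cached.
    from huggingface_hub import hf_hub_download
    return hf_hub_download(repo_id=checkpoint, filename=filename)
```

With this sketch, resolve_checkpoint("khanhld/vip-vl-base-vie") would take the Hub branch unless a local directory with that relative path exists, in which case the local copy wins.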

For a private repo, authenticate first with huggingface-cli login or by exporting HF_TOKEN. To pre-download (or inspect) the files manually:

from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="khanhld/vip-vl-base-vie")
# local_dir/pytorch_model.pt  ->  also valid as the finetuning `checkpoint=`

Citation

If you use this model, please cite ViP-VL (INTERSPEECH 2026) and ChunkFormer:

@inproceedings{vipvl,
    title={ViP-VL: Vietnamese Self-supervised Speech Pretraining Model Leveraging Vector-Quantization Learning},
    author={Khanh Le* and Kiet Anh Hoang* and Bao Nguyen* and Duy Vo* and Dung Vo and Thai Tran and Linh Pham and Khoa D Doan},
    booktitle={Proc. INTERSPEECH 2026},
    year={2026}
}

@INPROCEEDINGS{10888640,
    author={Le, Khanh and Ho, Tuan Vu and Tran, Dung and Chau, Duc Thanh},
    booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
    title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription},
    year={2025},
    pages={1-5},
    doi={10.1109/ICASSP49660.2025.10888640}
}