ViP-VL: Vietnamese Self-supervised speech Pretraining model leveraging Vector-quantization Learning
ViP-VL is a self-supervised speech pretraining model for Vietnamese, accepted to INTERSPEECH 2026. This repository hosts the pretrained ViP-VL model: a ChunkFormer encoder pretrained on large-scale unlabeled Vietnamese speech with a random-projection-quantizer masked-prediction objective (BEST-RQ). It is designed to initialize downstream finetuning (ASR / RNN-T / classification).
Method
ViP-VL adapts the random-projection-quantizer masked-prediction recipe (BEST-RQ) to a ChunkFormer backbone with aggressive 8× temporal subsampling, and keeps the masking pattern synchronized with the encoder's subsampling rate:
- Masking is applied to the raw 10 ms log-mel frames before subsampling; a subsampled frame is treated as masked if and only if at least 80% of its constituent input frames are masked.
- Targets come from a frozen random-projection quantizer: a fixed random projection of the (CMVN-normalized) input is matched by L2 nearest-neighbour to a fixed random codebook (1024 entries, dimension 16); the encoder is trained with a masked language-model (NLL) objective over masked positions.
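The two steps above can be sketched in NumPy. This is a minimal illustration and not the authors' implementation: the names `subsampled_mask` and `RandomProjectionQuantizer`, the seed, the Gaussian initialization, the stacked input dimension of 640, and the assumption that each subsampled frame covers exactly 8 consecutive input frames (ignoring convolution context and padding) are all assumptions. Only the 80% threshold, the 1024 × 16 codebook, and the L2 nearest-neighbour lookup come from the description.

```python
import numpy as np

SUBSAMPLING = 8        # encoder subsampling factor (from the model card)
MASK_THRESHOLD = 0.8   # fraction of masked input frames required (from the model card)


def subsampled_mask(frame_mask: np.ndarray) -> np.ndarray:
    """Map a boolean mask over 10 ms input frames to subsampled frames.

    Assumes each subsampled frame covers SUBSAMPLING consecutive input
    frames; trailing frames that do not fill a whole group are dropped.
    """
    n_out = len(frame_mask) // SUBSAMPLING
    groups = frame_mask[: n_out * SUBSAMPLING].reshape(n_out, SUBSAMPLING)
    return groups.mean(axis=1) >= MASK_THRESHOLD


class RandomProjectionQuantizer:
    """Frozen random projection plus frozen random codebook (never trained)."""

    def __init__(self, input_dim: int, codebook_size: int = 1024,
                 code_dim: int = 16, seed: int = 0) -> None:
        rng = np.random.default_rng(seed)
        # The initialization scheme is an assumption, not taken from the source.
        self.projection = rng.standard_normal((input_dim, code_dim))
        self.codebook = rng.standard_normal((codebook_size, code_dim))

    def __call__(self, features: np.ndarray) -> np.ndarray:
        """Return one codebook index per row of `features` (shape [T, input_dim])."""
        projected = features @ self.projection  # [T, code_dim]
        # Squared L2 distance to every codebook entry: [T, codebook_size]
        dists = ((projected[:, None, :] - self.codebook[None, :, :]) ** 2).sum(-1)
        return dists.argmin(axis=1)


mask = np.zeros(24, dtype=bool)
mask[0:7] = True    # group 0: 7 of 8 input frames masked (87.5%), so masked
mask[8:14] = True   # group 1: 6 of 8 input frames masked (75%), so not masked
print(subsampled_mask(mask).tolist())  # [True, False, False]

# Hypothetical input: 5 target frames of 8 stacked 80-dim CMVN-normalized features.
quantizer = RandomProjectionQuantizer(input_dim=8 * 80)
targets = quantizer(np.random.default_rng(1).standard_normal((5, 8 * 80)))
print(targets.shape)  # (5,)
```

With 8 input frames per subsampled frame, the 80% rule means a subsampled frame counts as masked only when at least 7 of its 8 input frames are masked.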
Architecture
| Setting | Value |
|---|---|
| Encoder | ChunkFormer |
| Encoder blocks | 12 |
| Hidden size | 512 |
| Attention heads | 8 |
| FFN size | 2048 |
| CNN module kernel | 15 |
| Subsampling | dw_striding (8×) |
| Positional encoding | chunk relative |
| Input features | 80-dim log-mel fbank @ 16 kHz |
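As a rough guide to what 8× subsampling means for sequence lengths, the arithmetic can be sketched as below. It assumes the 10 ms frame shift mentioned in the Method section and ignores edge effects of the convolutional subsampling, so the real encoder output may differ by a frame or two. The helper name `approx_encoder_frames` is hypothetical.

```python
FRAME_SHIFT_MS = 10   # log-mel frame shift
SUBSAMPLING = 8       # dw_striding factor


def approx_encoder_frames(duration_s: float) -> int:
    """Approximate number of encoder output frames for an utterance."""
    input_frames = int(duration_s * 1000 / FRAME_SHIFT_MS)
    return input_frames // SUBSAMPLING


encoder_frame_ms = FRAME_SHIFT_MS * SUBSAMPLING   # duration covered by one encoder frame
print(encoder_frame_ms, approx_encoder_frames(10.0))  # 80 125
```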
Files
- pytorch_model.pt: encoder-only state dict (encoder.*).
- config.yaml: encoder configuration (encoder_conf) and feature settings.
- global_cmvn: global CMVN statistics used during pretraining.
Finetuning
The encoder weights load with strict=False, so point any ChunkFormer ASR / RNN-T /
classification recipe at this checkpoint and train the task heads from scratch. Make sure
the downstream encoder_conf matches config.yaml.
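One way to catch a configuration mismatch before training is to compare the two `encoder_conf` mappings key by key. The sketch below is only an illustration: the key names (`num_blocks`, `output_size`, and so on) are hypothetical placeholders rather than names confirmed by config.yaml, and in practice the dictionaries would be loaded from the YAML files (for example with PyYAML's `yaml.safe_load`).

```python
def diff_encoder_conf(pretrained: dict, downstream: dict) -> dict:
    """Return {key: (pretrained_value, downstream_value)} for every mismatch."""
    missing = object()  # sentinel for keys absent on one side
    mismatches = {}
    for key in sorted(set(pretrained) | set(downstream)):
        a = pretrained.get(key, missing)
        b = downstream.get(key, missing)
        if a != b:
            mismatches[key] = (
                None if a is missing else a,
                None if b is missing else b,
            )
    return mismatches


# Hypothetical key names; the values are taken from the Architecture table.
pretrained_conf = {"num_blocks": 12, "output_size": 512, "attention_heads": 8,
                   "linear_units": 2048, "cnn_module_kernel": 15}
downstream_conf = dict(pretrained_conf, attention_heads=4)

print(diff_encoder_conf(pretrained_conf, downstream_conf))
# {'attention_heads': (8, 4)}
```

An empty result means the downstream encoder has the same settings as the pretrained one, which is the precondition for the encoder weights to load into matching shapes.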
The checkpoint argument accepts either a local path or this repo id directly —
load_checkpoint looks for a local file/directory first and otherwise downloads
pytorch_model.pt from the Hub automatically (cached locally), so no manual download
step is required:
# e.g. in examples/asr/ctc/run.sh (or rnnt / classification)
# Option A — download straight from the Hub (recommended)
checkpoint=khanhld/vip-vl-base-vie
# Option B — local path to an exported bundle
checkpoint=/path/to/khanhld/vip-vl-base-vie/pytorch_model.pt
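The lookup order described above can be approximated as follows. This is a sketch of the described behaviour, not the actual load_checkpoint code: the function name `resolve_checkpoint` and the directory handling are assumptions, while `hf_hub_download` is the standard `huggingface_hub` call for fetching a single file into the local cache.

```python
import os


def resolve_checkpoint(checkpoint: str, filename: str = "pytorch_model.pt") -> str:
    """Return a local path to the weights, given a local path or a Hub repo id."""
    if os.path.isfile(checkpoint):
        return checkpoint                           # direct path to the weights file
    if os.path.isdir(checkpoint):
        return os.path.join(checkpoint, filename)   # exported bundle directory
    # Not found locally: treat the string as a Hub repo id.
    # Imported lazily so local-only use does not require huggingface_hub.
    from huggingface_hub import hf_hub_download
    return hf_hub_download(repo_id=checkpoint, filename=filename)
```

Calling `resolve_checkpoint("khanhld/vip-vl-base-vie")` on a machine without a local directory of that name would fall through to the Hub download branch.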
For a private repo, authenticate first with huggingface-cli login or by exporting
HF_TOKEN. To pre-download (or inspect) the files manually:
from huggingface_hub import snapshot_download
local_dir = snapshot_download(repo_id="khanhld/vip-vl-base-vie")
# local_dir/pytorch_model.pt -> also valid as the finetuning `checkpoint=`
Citation
If you use this model, please cite ViP-VL (INTERSPEECH 2026) and ChunkFormer:
@inproceedings{vipvl,
title={ViP-VL: Vietnamese Self-supervised Speech Pretraining Model Leveraging Vector-Quantization Learning},
author={Khanh Le* and Kiet Anh Hoang* and Bao Nguyen* and Duy Vo* and Dung Vo and Thai Tran and Linh Pham and Khoa D Doan},
booktitle={Proc. INTERSPEECH 2026},
year={2026}
}
@INPROCEEDINGS{10888640,
author={Le, Khanh and Ho, Tuan Vu and Tran, Dung and Chau, Duc Thanh},
booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription},
year={2025},
pages={1-5},
doi={10.1109/ICASSP49660.2025.10888640}}