FireRedVAD: safetensors for Svod
These are the voice-activity-detection checkpoints of
FireRedVAD
(FireRedASR2S technical report, arXiv:2603.10420),
both the non-streaming model and the causal Stream-VAD model, converted
from the original model.pth.tar + cmvn.ark to single safetensors files for
the Svod inference runtime. Weights are
unmodified f32: they are only renamed, reshaped, and bundled with the CMVN statistics.
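
Because each checkpoint is a single safetensors file, its tensor names, dtypes, and shapes can be listed without any ML framework. The sketch below relies only on the public safetensors layout (an 8-byte little-endian header length followed by a JSON header). The helper names and the tiny in-memory demo blob are hypothetical and stand in for a real file such as firered_vad.safetensors.

```python
import io
import json
import struct


def read_safetensors_header(f):
    """Return {tensor_name: (dtype, shape)} from an open binary safetensors file."""
    # The file starts with an unsigned little-endian 64-bit header length,
    # followed by that many bytes of UTF-8 JSON describing every tensor.
    (header_len,) = struct.unpack("<Q", f.read(8))
    header = json.loads(f.read(header_len).decode("utf-8"))
    header.pop("__metadata__", None)  # optional free-form string metadata
    return {name: (info["dtype"], info["shape"]) for name, info in header.items()}


def make_demo_blob():
    """Build a tiny in-memory safetensors blob (hypothetical contents, demo only)."""
    data = struct.pack("<80f", *([0.0] * 80))  # one [80] f32 tensor
    header = {
        "cmvn_means": {"dtype": "F32", "shape": [80], "data_offsets": [0, len(data)]}
    }
    header_bytes = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(header_bytes)) + header_bytes + data


if __name__ == "__main__":
    # With a real checkpoint, pass open("firered_vad.safetensors", "rb") instead.
    tensors = read_safetensors_header(io.BytesIO(make_demo_blob()))
    for name, (dtype, shape) in sorted(tensors.items()):
        print(name, dtype, shape)
```

Run against the real files, this should list F32 tensors only, since the weights are stored as unmodified f32.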
FireRedVAD is a feed-forward DFSMN (no recurrence): 8 FSMN layers with
depthwise temporal filters, operating on 80-bin kaldi log-mel fbank at 16 kHz
(25 ms window / 10 ms hop). It detects speech in 100+ languages; upstream
reports 97.57 F1 on FLEURS-VAD-102. The non-streaming model (588k params) has
lookback + lookahead filters; the streaming model (568k params) is causal-only
(N2 = 0) and carries a 19-frame conv cache per FSMN layer between chunks.
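
The 19-frame cache follows from the lookback filter length: a 20-tap causal filter at stride 1 needs the current frame plus the 19 frames before it. The sketch below isolates only that cache mechanic for a single channel (depthwise filters treat channels independently). It is not the full FSMN layer: the projections and any skip connections are omitted, and the tap ordering (last tap on the current frame) is an assumption.

```python
def causal_filter(x, taps, cache):
    """Apply a causal FIR filter to one channel of one chunk.

    x:     new input frames for this chunk
    taps:  filter weights; taps[-1] multiplies the current frame (assumed ordering)
    cache: the previous len(taps) - 1 input frames (zeros at stream start)
    Returns (outputs, new_cache).
    """
    k = len(taps)
    padded = list(cache) + list(x)  # history followed by the new frames
    out = [sum(taps[j] * padded[t + j] for j in range(k)) for t in range(len(x))]
    # Keep the last k - 1 input frames as history for the next chunk.
    return out, padded[len(padded) - (k - 1):]


def run_chunked(x, taps, chunk_frames):
    """Process x chunk by chunk, threading the cache between calls."""
    cache = [0.0] * (len(taps) - 1)
    out = []
    for start in range(0, len(x), chunk_frames):
        y, cache = causal_filter(x[start:start + chunk_frames], taps, cache)
        out.extend(y)
    return out


if __name__ == "__main__":
    x = [float((t * 37) % 11 - 5) for t in range(230)]  # synthetic input, 230 frames
    taps = [(j + 1) / 20 for j in range(20)]            # synthetic 20-tap filter
    full, _ = causal_filter(x, taps, [0.0] * 19)        # one whole-sequence pass
    chunked = run_chunked(x, taps, chunk_frames=16)     # cache-threaded chunks
    print(chunked == full)
```

With the cache threaded correctly, the chunkwise result matches the whole-sequence result for any chunk size, because every output frame sees the same 20 inputs either way.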
Files
| file | contents |
|---|---|
| firered_vad.safetensors | non-streaming: 45 model tensors + cmvn_means/cmvn_istd [80] (global CMVN derived from cmvn.ark: variance floored at 1e-20, inverse std precomputed) |
| golden.safetensors | non-streaming parity reference for assets/hello_zh.wav: samples (16 kHz mono, [-1, 1] f32), feat (pre-CMVN kaldi-native-fbank output [230, 80]), probs (PyTorch DetectModel output [230]) |
| firered_vad_stream.safetensors | streaming (Stream-VAD): 37 model tensors (no lookahead filters) + the CMVN pair |
| golden_stream.safetensors | streaming parity reference, same wav: samples, feat, probs (cache-threaded chunkwise reference forward), probs_full (one whole-sequence causal forward), chunk_frames (the chunk size used, 16) |

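
The cmvn_means/cmvn_istd pair described in the table can be illustrated with a minimal sketch. It assumes cmvn.ark holds Kaldi-style global accumulators (a 2 x (dim + 1) matrix: per-bin sums with the frame count in the last slot, then per-bin sums of squares); the exact conversion script is not shown here, so the function names and that layout are assumptions. Only the 1e-20 variance floor and the precomputed inverse standard deviation come from the table.

```python
import math

VAR_FLOOR = 1e-20  # variance floor stated in the file table


def cmvn_from_kaldi_stats(stats):
    """Derive (means, istd) from a 2 x (dim + 1) Kaldi-style accumulator.

    Row 0: per-bin sums, with the frame count in the last slot.
    Row 1: per-bin sums of squares (last slot unused).
    """
    dim = len(stats[0]) - 1
    count = stats[0][dim]
    means, istd = [], []
    for d in range(dim):
        mean = stats[0][d] / count
        var = max(stats[1][d] / count - mean * mean, VAR_FLOOR)
        means.append(mean)
        istd.append(1.0 / math.sqrt(var))  # inverse std, precomputed once
    return means, istd


def apply_cmvn(feat, means, istd):
    """Normalize [frames, dim] features: (x - mean) * istd per bin."""
    return [[(v - m) * s for v, m, s in zip(frame, means, istd)] for frame in feat]


if __name__ == "__main__":
    # Toy 2-bin example; bin 1 is constant, so its variance hits the floor.
    feat = [[1.0, 10.0], [3.0, 10.0]]
    stats = [[4.0, 20.0, 2.0], [10.0, 200.0, 0.0]]
    means, istd = cmvn_from_kaldi_stats(stats)
    print(apply_cmvn(feat, means, istd))
```

Storing the inverse standard deviation means inference needs only a subtract and a multiply per bin, with no division or square root at run time.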
Architecture / config
Non-streaming: idim=80, R=8, M=1, H=256, P=128, N1=20, S1=1, N2=20, S2=1, odim=1; streaming: identical except N2=0 (no lookahead, strictly causal).
Both configurations are read from the checkpoints' embedded args.
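
As a sanity check, the parameter totals quoted earlier (588k and 568k) can be reproduced from this config. The sketch assumes layer shapes inferred from the tensor schema rather than from the upstream code: each of the R - 1 blocks has an fc1 (P→H, with bias), an fc2 (H→P, no bias), and depthwise filters with N1 and N2 taps per channel. The function name is hypothetical.

```python
def count_params(idim=80, R=8, H=256, P=128, N1=20, N2=20, odim=1):
    """Count model parameters under the assumed layer shapes (CMVN excluded)."""
    total = idim * H + H          # fc1 weight + bias
    total += H * P + P            # fc2 weight + bias
    total += P * N1 + P * N2      # fsmn1 lookback + lookahead filters
    per_block = P * H + H         # block fc1 weight + bias
    per_block += H * P            # block fc2 weight (no bias listed in the schema)
    per_block += P * N1 + P * N2  # block lookback + lookahead filters
    total += (R - 1) * per_block  # fsmn1 is the first of the R FSMN layers
    total += P * H + H            # dnn weight + bias
    total += H * odim + odim      # out weight + bias
    return total


if __name__ == "__main__":
    print(count_params())      # non-streaming
    print(count_params(N2=0))  # streaming, no lookahead filters
```

Under these assumptions the counts come to 588,417 and 567,937, and the difference is exactly the eight lookahead filters of 128 x 20 weights each.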
Tensor schema: fc1.{weight,bias} (80→256), fc2.{weight,bias} (256→128),
fsmn1.{lookback,lookahead}.weight [128, 1, 1, 20],
blocks.{0..6}.{fc1.weight,fc1.bias,fc2.weight,lookback.weight,lookahead.weight},
dnn.{weight,bias} (128→256), out.{weight,bias} (256→1),
cmvn_means/cmvn_istd [80]. FSMN filters are reshaped from PyTorch's
[P, 1, 20] to [P, 1, 1, 20] for 2-D depthwise convolution. The streaming
file has no *.lookahead.weight keys.
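
Expanding the brace notation in the schema gives the full key list, which is a quick way to confirm the tensor counts of 45 (non-streaming) and 37 (streaming), excluding the CMVN pair. The sketch also shows the [P, 1, K] to [P, 1, 1, K] reshape on plain nested lists. Both helper names are hypothetical, and a real converter would reshape arrays rather than lists.

```python
def expected_keys(streaming=False, num_blocks=7):
    """Expand the schema into model tensor names (CMVN pair excluded)."""
    filters = ["lookback"] if streaming else ["lookback", "lookahead"]
    keys = ["fc1.weight", "fc1.bias", "fc2.weight", "fc2.bias"]
    keys += [f"fsmn1.{name}.weight" for name in filters]
    for i in range(num_blocks):
        keys += [f"blocks.{i}.fc1.weight", f"blocks.{i}.fc1.bias", f"blocks.{i}.fc2.weight"]
        keys += [f"blocks.{i}.{name}.weight" for name in filters]
    keys += ["dnn.weight", "dnn.bias", "out.weight", "out.bias"]
    return keys


def add_height_axis(w):
    """Reshape a [P, 1, K] nested list to [P, 1, 1, K]; values are untouched."""
    return [[[row] for row in channel] for channel in w]


if __name__ == "__main__":
    print(len(expected_keys()), len(expected_keys(streaming=True)))
    w = [[[0.0] * 20] for _ in range(128)]  # [128, 1, 20] placeholder filter
    w4 = add_height_axis(w)
    print(len(w4), len(w4[0]), len(w4[0][0]), len(w4[0][0][0]))
```

A loader can compare this list against the keys found in the file to catch a mismatched checkpoint, for example a streaming file passed where the non-streaming one is expected.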
License & citation
Apache-2.0, matching the upstream FireRedVAD release by Xiaohongshu.
@article{xu2026fireredasr2s,
title={FireRedASR2S: A State-of-the-Art Industrial-Grade All-in-One Automatic Speech Recognition System},
author={Xu, Kaituo and Jia, Yan and Huang, Kai and Chen, Junjie and Li, Wenpeng and Liu, Kun and Xie, Feng-Long and Tang, Xu and Hu, Yao},
journal={arXiv preprint arXiv:2603.10420},
year={2026}
}