FireRedVAD: safetensors for Svod

These are the voice-activity-detection checkpoints of FireRedVAD (FireRedASR2S technical report, arXiv:2603.10420), both the non-streaming model and the causal Stream-VAD model, converted from the original model.pth.tar + cmvn.ark to single safetensors files for the Svod inference runtime. Weights are unmodified f32: only renamed, reshaped, and bundled with the CMVN statistics.

FireRedVAD is a feed-forward DFSMN (no recurrence): 8 FSMN layers with depthwise temporal filters, operating on 80-bin kaldi log-mel fbank at 16 kHz (25 ms window / 10 ms hop). It detects speech in 100+ languages; upstream reports 97.57 F1 on FLEURS-VAD-102. The non-streaming model (588k params) has lookback + lookahead filters; the streaming model (568k params) is causal-only (N2 = 0) and carries a 19-frame conv cache per FSMN layer between chunks.
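
The 19-frame cache follows from the 20-tap causal filter: each output frame needs the current frame plus the 19 before it, so those 19 frames have to survive from one chunk to the next. The sketch below illustrates this for a single channel in plain Python. It is not the Svod or upstream forward pass: the function name `causal_filter`, the zero initial cache, the tap values, and the test signal are all assumptions made for illustration, and the real layers also include projections that are omitted here.

```python
import math

N1 = 20          # lookback taps per FSMN filter (from the config)
CACHE = N1 - 1   # 19 past frames carried between chunks
CHUNK = 16       # chunk size recorded in golden_stream.safetensors


def causal_filter(x, taps, cache):
    """Filter one channel of a chunk with a causal FIR memory filter.

    x     : new frames of this chunk (list of floats)
    taps  : filter coefficients; taps[-1] weights the current frame
    cache : the len(taps) - 1 frames that preceded this chunk
    Returns (outputs, new_cache).
    """
    n1 = len(taps)
    padded = cache + x
    out = [sum(taps[k] * padded[t + k] for k in range(n1))
           for t in range(len(x))]
    # Keep only the trailing n1 - 1 frames for the next call.
    return out, padded[len(padded) - (n1 - 1):]


# Illustrative (hypothetical) taps and input signal.
taps = [1.0 / (k + 1) for k in range(N1)]
signal = [math.sin(0.1 * i) for i in range(50)]

# One whole-sequence causal pass.
full, _ = causal_filter(signal, taps, [0.0] * CACHE)

# The same signal processed chunk by chunk, threading the cache through.
cache = [0.0] * CACHE
chunked = []
for start in range(0, len(signal), CHUNK):
    out, cache = causal_filter(signal[start:start + CHUNK], taps, cache)
    chunked.extend(out)

assert chunked == full  # chunking does not change a causal filter's output
```

Because the filter never looks ahead, the chunked and whole-sequence outputs agree regardless of chunk size, which is the property the streaming reference outputs are meant to check.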

Files

| file | contents |
| --- | --- |
| firered_vad.safetensors | non-streaming: 45 model tensors + cmvn_means/cmvn_istd [80] (global CMVN derived from cmvn.ark: variance floored at 1e-20, inverse std precomputed) |
| golden.safetensors | non-streaming parity reference for assets/hello_zh.wav: samples (16 kHz mono, [-1, 1] f32), feat (pre-CMVN kaldi-native-fbank output [230, 80]), probs (PyTorch DetectModel output [230]) |
| firered_vad_stream.safetensors | streaming (Stream-VAD): 37 model tensors (no lookahead filters) + the CMVN pair |
| golden_stream.safetensors | streaming parity reference, same wav: samples, feat, probs (cache-threaded chunkwise reference forward), probs_full (one whole-sequence causal forward), chunk_frames (the chunk size used, 16) |
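
As a rough illustration of how a cmvn_means/cmvn_istd pair like the one bundled here can be derived and used, the sketch below computes per-bin means and inverse standard deviations from accumulated statistics, with the 1e-20 variance floor mentioned above. It assumes Kaldi-style global statistics (per-bin sums, per-bin sums of squares, and a frame count) and assumes normalization is applied as `(feat - mean) * istd`; the function names and the toy two-bin data are hypothetical, and the actual conversion script may differ.

```python
import math

VAR_FLOOR = 1e-20  # variance floor stated for the bundled CMVN


def cmvn_from_stats(sums, sq_sums, count):
    """Turn accumulated statistics into (means, inverse std) per bin."""
    means, istd = [], []
    for s, q in zip(sums, sq_sums):
        mean = s / count
        var = max(q / count - mean * mean, VAR_FLOOR)
        means.append(mean)
        istd.append(1.0 / math.sqrt(var))
    return means, istd


def apply_cmvn(feat, means, istd):
    """Normalize a [frames, bins] feature matrix given as nested lists."""
    return [[(v - m) * i for v, m, i in zip(frame, means, istd)]
            for frame in feat]


# Toy data: two frames, two bins (the real model uses 80 bins).
feat = [[1.0, 10.0], [3.0, 10.0]]
sums = [sum(f[b] for f in feat) for b in range(2)]
sq_sums = [sum(f[b] ** 2 for f in feat) for b in range(2)]

means, istd = cmvn_from_stats(sums, sq_sums, count=len(feat))
normalized = apply_cmvn(feat, means, istd)
```

Precomputing the inverse standard deviation turns the per-frame division into a multiplication, and the floor keeps a constant bin (the second one in the toy data) from producing a division by zero.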

Architecture / config

Non-streaming: idim=80, R=8, M=1, H=256, P=128, N1=20, S1=1, N2=20, S2=1, odim=1. Streaming: identical except N2=0 (no lookahead, strictly causal). Both configurations are read from the args embedded in the checkpoints.
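
The quoted parameter counts (588k and 568k) can be reproduced from this configuration with a little arithmetic. The sketch below assumes the layer shapes implied by the tensor schema: inside each of the R - 1 repeated blocks, fc1 maps P to H with a bias and fc2 maps H back to P without one, and every FSMN filter holds P x N taps. Those in-block shapes are an inference rather than something stated explicitly, M is not used, and `count_params` is a hypothetical helper.

```python
def count_params(idim=80, R=8, H=256, P=128, N1=20, N2=20, odim=1):
    """Count model weights (CMVN vectors excluded) for a given config."""
    total = idim * H + H                 # fc1: idim -> H, with bias
    total += H * P + P                   # fc2: H -> P, with bias
    total += P * N1 + P * N2             # fsmn1 lookback + lookahead taps
    per_block = (P * H + H)              # block fc1: P -> H, with bias
    per_block += H * P                   # block fc2: H -> P, no bias
    per_block += P * N1 + P * N2         # block lookback + lookahead taps
    total += (R - 1) * per_block         # blocks.0 .. blocks.6
    total += P * H + H                   # dnn: P -> H, with bias
    total += H * odim + odim             # out: H -> odim, with bias
    return total


non_streaming = count_params()           # 588417, i.e. about 588k
streaming = count_params(N2=0)           # 567937, i.e. about 568k
print(non_streaming, streaming, non_streaming - streaming)
```

The difference of 20480 weights is exactly the eight lookahead filters (8 x 128 x 20) that the streaming model drops.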

Tensor schema: fc1.{weight,bias} (80→256), fc2.{weight,bias} (256→128), fsmn1.{lookback,lookahead}.weight [128, 1, 1, 20], blocks.{0..6}.{fc1.weight,fc1.bias,fc2.weight,lookback.weight,lookahead.weight}, dnn.{weight,bias} (128→256), out.{weight,bias} (256→1), cmvn_means/cmvn_istd [80]. FSMN filters are reshaped from PyTorch's [P, 1, 20] to [P, 1, 1, 20] for 2-D depthwise convolution. The streaming file has no *.lookahead.weight keys.
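
To check a downloaded file against this schema without installing anything, the safetensors header can be read with the Python standard library: the file starts with an 8-byte little-endian length, followed by a JSON header that maps tensor names to dtype, shape, and data offsets. The helper names below are hypothetical, and this is only a minimal sketch for inspection, not a replacement for the safetensors library.

```python
import json
import struct


def read_safetensors_header(path):
    """Return {tensor_name: (dtype, shape)} from a safetensors file header."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # u64, little-endian
        header = json.loads(f.read(header_len).decode("utf-8"))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return {name: (info["dtype"], info["shape"])
            for name, info in header.items()}


def lookahead_keys(tensors):
    """List the lookahead filter tensors present in a header mapping."""
    return sorted(k for k in tensors if k.endswith("lookahead.weight"))


# Example usage, assuming the files sit in the working directory:
#   tensors = read_safetensors_header("firered_vad_stream.safetensors")
#   assert not lookahead_keys(tensors)   # streaming file has no lookahead
#   print(tensors["fsmn1.lookback.weight"])
```

Only the header is parsed, so the check stays cheap even if the tensor data were much larger.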

License & citation

Apache-2.0, matching the upstream FireRedVAD release by Xiaohongshu.

@article{xu2026fireredasr2s,
  title={FireRedASR2S: A State-of-the-Art Industrial-Grade All-in-One Automatic Speech Recognition System},
  author={Xu, Kaituo and Jia, Yan and Huang, Kai and Chen, Junjie and Li, Wenpeng and Liu, Kun and Xie, Feng-Long and Tang, Xu and Hu, Yao},
  journal={arXiv preprint arXiv:2603.10420},
  year={2026}
}