FireRedVAD — safetensors for Svod

This repository holds the voice-activity-detection checkpoints of FireRedVAD (FireRedASR2S technical report, arXiv:2603.10420), both the non-streaming model and the causal Stream-VAD model, converted from the original `model.pth.tar` + `cmvn.ark` to single safetensors files for the Svod inference runtime. The weights are unmodified f32: they are only renamed, reshaped, and bundled with the CMVN statistics.

FireRedVAD is a feed-forward DFSMN with no recurrence: 8 FSMN layers with depthwise temporal filters, operating on 80-bin Kaldi-style log-mel filterbank (fbank) features at 16 kHz (25 ms window, 10 ms hop). It detects speech in 100+ languages; upstream reports 97.57 F1 on FLEURS-VAD-102. The non-streaming model (588k params) has both lookback and lookahead filters; the streaming model (568k params) is causal-only (N2 = 0) and carries a 19-frame conv cache per FSMN layer between chunks.
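
The 19-frame cache can be illustrated with a single-channel causal filter. This is a minimal sketch, not the upstream or Svod implementation: it assumes the 20 lookback taps cover the current frame plus the 19 previous frames, and the names `causal_filter` and `run_chunked` are hypothetical. It shows that threading the cache through 16-frame chunks gives the same output as one whole-sequence pass.

```python
# Illustrative only: one channel of a causal FIR "memory" filter, not the
# upstream FireRedVAD or Svod code. Assumes the 20 lookback taps span the
# current frame plus the 19 previous frames, hence a 19-frame cache.

N1 = 20          # lookback taps (from the config)
CACHE = N1 - 1   # frames carried between chunks


def causal_filter(frames, taps, cache):
    """Filter one chunk of scalar frames; return (outputs, new_cache)."""
    padded = cache + frames  # history first, then the new chunk
    out = [
        sum(taps[k] * padded[t + k] for k in range(len(taps)))
        for t in range(len(frames))
    ]
    return out, padded[-len(cache):]  # keep the newest 19 frames


def run_chunked(signal, taps, chunk_frames):
    """Process the signal chunk by chunk, threading the cache through."""
    cache = [0.0] * CACHE  # zero history at the start of the stream
    out = []
    for start in range(0, len(signal), chunk_frames):
        chunk = signal[start:start + chunk_frames]
        y, cache = causal_filter(chunk, taps, cache)
        out.extend(y)
    return out


# Toy data (made up): 230 frames, filtered whole and in 16-frame chunks.
signal = [float((7 * i) % 11) for i in range(230)]
taps = [0.05 * (k + 1) for k in range(N1)]
full = run_chunked(signal, taps, chunk_frames=len(signal))
chunked = run_chunked(signal, taps, chunk_frames=16)
assert chunked == full
```

Because each output depends only on the current frame and cached history, the chunk size affects latency but not the result in this sketch.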

Files

| file | contents |
|---|---|
| `firered_vad.safetensors` | non-streaming: 45 model tensors + `cmvn_means`/`cmvn_istd` `[80]` (global CMVN derived from `cmvn.ark`: variance floored at 1e-20, inverse std precomputed) |
| `golden.safetensors` | non-streaming parity reference for `assets/hello_zh.wav`: `samples` (16 kHz mono, `[-1, 1]` f32), `feat` (pre-CMVN `kaldi-native-fbank` output `[230, 80]`), `probs` (PyTorch `DetectModel` output `[230]`) |
| `firered_vad_stream.safetensors` | streaming (`Stream-VAD`): 37 model tensors (no lookahead filters) + the CMVN pair |
| `golden_stream.safetensors` | streaming parity reference, same wav: `samples`, `feat`, `probs` (cache-threaded chunkwise reference forward), `probs_full` (one whole-sequence causal forward), `chunk_frames` (the chunk size used, 16) |
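
The CMVN pair described above can be sketched as follows. This assumes `cmvn.ark` holds standard Kaldi-style global statistics (per-dimension sum, per-dimension sum of squares, and a frame count); the actual conversion script is not shown here, and `cmvn_from_stats` and `apply_cmvn` are hypothetical names. The toy statistics are made up to exercise the variance floor.

```python
import math

VAR_FLOOR = 1e-20  # variance floor stated in the file description


def cmvn_from_stats(sums, sumsqs, count):
    """Turn accumulated statistics into (means, inverse std) per dimension.

    Assumes Kaldi-style global stats: sum(x), sum(x^2), and frame count.
    """
    means, istd = [], []
    for s, sq in zip(sums, sumsqs):
        mean = s / count
        var = max(sq / count - mean * mean, VAR_FLOOR)  # floor tiny variance
        means.append(mean)
        istd.append(1.0 / math.sqrt(var))
    return means, istd


def apply_cmvn(frame, means, istd):
    """Normalize one feature frame: (x - mean) * inverse_std."""
    return [(x - m) * s for x, m, s in zip(frame, means, istd)]


# Toy 2-dim example from two frames, [1, 5] and [3, 5]:
# dim 0 has variance 1, dim 1 is constant so its variance hits the floor.
means, istd = cmvn_from_stats(sums=[4.0, 10.0], sumsqs=[10.0, 50.0], count=2)
normalized = apply_cmvn([3.0, 5.0], means, istd)
```

Precomputing the inverse standard deviation turns the per-frame normalization into a subtract and a multiply, which is presumably why it is stored instead of the raw variance.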

Architecture / config

Non-streaming: `idim=80, R=8, M=1, H=256, P=128, N1=20, S1=1, N2=20, S2=1, odim=1`. Streaming: identical except `N2=0` (no lookahead, so the model is strictly causal). Both configurations are read from the args embedded in the checkpoints.

Tensor schema: `fc1.{weight,bias}` (80→256), `fc2.{weight,bias}` (256→128), `fsmn1.{lookback,lookahead}.weight` `[128, 1, 1, 20]`, `blocks.{0..6}.{fc1.weight,fc1.bias,fc2.weight,lookback.weight,lookahead.weight}`, `dnn.{weight,bias}` (128→256), `out.{weight,bias}` (256→1), `cmvn_means`/`cmvn_istd` `[80]`. FSMN filters are reshaped from PyTorch's `[P, 1, 20]` to `[P, 1, 1, 20]` for 2-D depthwise convolution. The streaming file has no `*.lookahead.weight` keys.
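
As a sanity check, the tensor and parameter counts can be recomputed from the schema. The sketch below is illustrative: `tensor_sizes` is a hypothetical helper, and it assumes each block's `fc1` maps P→H with a bias and `fc2` maps H→P without one, which is inferred from the key list and not stated explicitly.

```python
def tensor_sizes(idim=80, H=256, P=128, N1=20, N2=20, odim=1, n_blocks=7):
    """Return {tensor name: element count} for the model tensors (no CMVN).

    Assumption: block fc1 is P->H with bias, block fc2 is H->P without bias.
    """
    sizes = {
        "fc1.weight": idim * H, "fc1.bias": H,
        "fc2.weight": H * P, "fc2.bias": P,
        "fsmn1.lookback.weight": P * N1,
        "dnn.weight": P * H, "dnn.bias": H,
        "out.weight": H * odim, "out.bias": odim,
    }
    if N2 > 0:  # the streaming model has no lookahead filters
        sizes["fsmn1.lookahead.weight"] = P * N2
    for b in range(n_blocks):
        sizes[f"blocks.{b}.fc1.weight"] = P * H
        sizes[f"blocks.{b}.fc1.bias"] = H
        sizes[f"blocks.{b}.fc2.weight"] = H * P
        sizes[f"blocks.{b}.lookback.weight"] = P * N1
        if N2 > 0:
            sizes[f"blocks.{b}.lookahead.weight"] = P * N2
    return sizes


non_streaming = tensor_sizes(N2=20)
streaming = tensor_sizes(N2=0)
print(len(non_streaming), sum(non_streaming.values()))  # 45 588417
print(len(streaming), sum(streaming.values()))          # 37 567937
```

Under these assumptions the totals agree with the 45 and 37 model tensors and the roughly 588k and 568k parameters quoted above.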

License & citation

Apache-2.0, matching the upstream FireRedVAD release by Xiaohongshu.

@article{xu2026fireredasr2s,
  title={FireRedASR2S: A State-of-the-Art Industrial-Grade All-in-One Automatic Speech Recognition System},
  author={Xu, Kaituo and Jia, Yan and Huang, Kai and Chen, Junjie and Li, Wenpeng and Liu, Kun and Xie, Feng-Long and Tang, Xu and Hu, Yao},
  journal={arXiv preprint arXiv:2603.10420},
  year={2026}
}
