Paper: FireRedASR: Open-Source Industrial-Grade Mandarin Speech Recognition Models from Encoder-Decoder to LLM Integration (arXiv:2501.14350)
This is a Core ML conversion of the FireRedVAD Stream-VAD model for real-time voice activity detection on Apple platforms (iOS 16+ / macOS 13+). It was converted from the original PyTorch model, FireRedTeam/FireRedVAD.
Results from the FLEURS-VAD-102 benchmark (102 languages, 9,443 audio clips):
| Metric | FireRedVAD | Silero-VAD | TEN-VAD | FunASR-VAD | WebRTC-VAD |
|---|---|---|---|---|---|
| AUC-ROC | 99.60 | 97.99 | 97.81 | - | - |
| F1 Score | 97.57 | 95.95 | 95.19 | 90.91 | 52.30 |
| False Alarm | 2.69% | 9.41% | 15.47% | 44.03% | 2.83% |
| Miss Rate | 3.62% | 3.95% | 2.95% | 0.42% | 64.15% |
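
The table does not spell out how each metric is scored, so the following is only a sketch under an assumption: frame-level binary labels with the common definitions (false alarm rate = FP / (FP + TN), miss rate = FN / (FN + TP), F1 from precision and recall). `VADMetrics` and `vadMetrics` are hypothetical names, not part of the benchmark tooling.

```swift
// Sketch only: the benchmark's exact scoring rules are not given here.
// Assumes standard frame-level definitions over binary speech labels.
struct VADMetrics {
    let f1: Double
    let falseAlarmRate: Double  // share of non-speech frames flagged as speech
    let missRate: Double        // share of speech frames flagged as non-speech
}

func vadMetrics(predicted: [Bool], reference: [Bool]) -> VADMetrics {
    precondition(predicted.count == reference.count, "label arrays must align")
    var tp = 0.0, fp = 0.0, fn = 0.0, tn = 0.0
    for (p, r) in zip(predicted, reference) {
        switch (p, r) {
        case (true, true): tp += 1
        case (true, false): fp += 1
        case (false, true): fn += 1
        case (false, false): tn += 1
        }
    }
    // Return 0 when a denominator is empty instead of dividing by zero.
    func ratio(_ n: Double, _ d: Double) -> Double { d > 0 ? n / d : 0 }
    let precision = ratio(tp, tp + fp)
    let recall = ratio(tp, tp + fn)
    return VADMetrics(
        f1: ratio(2 * precision * recall, precision + recall),
        falseAlarmRate: ratio(fp, fp + tn),
        missRate: ratio(fn, fn + tp)
    )
}
```

Under these definitions a detector can trade one error type for the other: flagging nearly everything as speech drives the miss rate toward zero while the false alarm rate grows.
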

Model inputs:

| Name | Shape | Type | Description |
|---|---|---|---|
| `feat` | [1, 1..512, 80] | Float32 | Log-Mel filterbank features (dynamic time axis) |
| `cache_0` ~ `cache_7` | [1, 128, 19] | Float32 | FSMN lookback cache for each of the 8 layers |
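
Because the time axis of `feat` accepts between 1 and 512 frames per call, a longer feature sequence has to be split before inference, with the caches carried from one call to the next. A minimal sketch of the splitting step, assuming the features are already available as an array of 80-dimensional frames; `chunkFrames` is a hypothetical helper, not part of any FireRed API.

```swift
// Sketch: split a sequence of feature frames into windows that fit the
// model's dynamic time axis (1...512 frames per prediction call).
// `chunkFrames` is a hypothetical helper name.
func chunkFrames(_ frames: [[Float]], maxFrames: Int = 512) -> [[[Float]]] {
    precondition(maxFrames > 0, "maxFrames must be positive")
    var chunks: [[[Float]]] = []
    var start = 0
    while start < frames.count {
        // The last chunk may be shorter than maxFrames.
        let end = min(start + maxFrames, frames.count)
        chunks.append(Array(frames[start..<end]))
        start = end
    }
    return chunks
}
```

Each chunk would then be copied into an `MLMultiArray` of shape [1, T, 80] and passed to the model together with the caches returned by the previous call.
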

Model outputs:

| Name | Type | Description |
|---|---|---|
| `probs` | Float32 | Speech probability, shape [1, T, 1] |
| `new_cache_0` ~ `new_cache_7` | Float32 | Updated lookback cache |

The model was converted from PyTorch with coremltools, using the export script in FireRedASR2S. The Stream-VAD variant was chosen because it is causal (no lookahead), which makes it suitable for real-time streaming applications.

```swift
import CoreML

// Load model
let model = try FireRedVAD(configuration: .init())

// Initialize caches (8 layers x [1, 128, 19])
var caches = (0..<8).map { _ in
    try! MLMultiArray(shape: [1, 128, 19], dataType: .float32)
}

// Process audio frame by frame
let input = FireRedVADInput(
    feat: fbankFeatures,  // [1, T, 80]
    cache_0: caches[0], cache_1: caches[1],
    cache_2: caches[2], cache_3: caches[3],
    cache_4: caches[4], cache_5: caches[5],
    cache_6: caches[6], cache_7: caches[7]
)
let output = try model.prediction(input: input)
let speechProb = output.probs  // [1, T, 1]

// Update caches for next frame
caches = [
    output.new_cache_0, output.new_cache_1,
    output.new_cache_2, output.new_cache_3,
    output.new_cache_4, output.new_cache_5,
    output.new_cache_6, output.new_cache_7
]
```
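
Raw per-frame probabilities are usually smoothed into speech segments before they are acted on. The sketch below shows one common approach, a two-state machine with a threshold and minimum run lengths. The threshold (0.5) and frame counts are illustrative assumptions, not values taken from FireRedVAD or FireRedASRKit, and `SpeechStateMachine` is a hypothetical type.

```swift
// Sketch of a two-state speech detector driven by per-frame probabilities.
// Threshold and frame counts are illustrative assumptions.
struct SpeechStateMachine {
    enum Event: Equatable {
        case speechStart(frame: Int)
        case speechEnd(frame: Int)
    }

    let threshold: Float
    let minSpeechFrames: Int   // consecutive speech frames needed to open a segment
    let minSilenceFrames: Int  // consecutive silence frames needed to close it

    private(set) var isSpeaking = false
    private var run = 0        // length of the current run that contradicts the state
    private var frameIndex = 0

    init(threshold: Float = 0.5, minSpeechFrames: Int = 3, minSilenceFrames: Int = 5) {
        self.threshold = threshold
        self.minSpeechFrames = minSpeechFrames
        self.minSilenceFrames = minSilenceFrames
    }

    /// Feeds one probability and returns an event when the state flips.
    mutating func process(_ prob: Float) -> Event? {
        defer { frameIndex += 1 }
        let frameIsSpeech = prob >= threshold
        // Count only frames that disagree with the current state.
        run = (frameIsSpeech != isSpeaking) ? run + 1 : 0
        let needed = isSpeaking ? minSilenceFrames : minSpeechFrames
        guard run >= needed else { return nil }
        isSpeaking.toggle()
        let boundary = frameIndex - run + 1  // first frame of the contradicting run
        run = 0
        return isSpeaking ? .speechStart(frame: boundary) : .speechEnd(frame: boundary)
    }
}
```

In use, each value read from `probs` would be passed to `process(_:)` in order; the reported frame index marks where the run that triggered the transition began, so short blips below the minimum run length never produce an event.
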
For a complete implementation with feature extraction, CMVN, and a speech state machine, see FireRedASRKit.
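
To illustrate the CMVN step mentioned above: cepstral mean and variance normalization subtracts a per-dimension mean and divides by a per-dimension standard deviation. The sketch assumes global statistics have already been computed; where FireRedVAD's statistics come from is not specified here, and `applyCMVN` is a hypothetical helper.

```swift
// Sketch: per-dimension mean and variance normalization of feature frames.
// Assumes precomputed global statistics; their source is not specified here.
func applyCMVN(_ frames: [[Float]], mean: [Float], std: [Float]) -> [[Float]] {
    precondition(mean.count == std.count, "mean and std must have the same length")
    let eps: Float = 1e-8  // floor that avoids division by zero
    return frames.map { frame -> [Float] in
        precondition(frame.count == mean.count, "frame dimension mismatch")
        return frame.indices.map { d in
            (frame[d] - mean[d]) / max(std[d], eps)
        }
    }
}
```

For the model described here, `mean` and `std` would each hold 80 values, one per filterbank dimension.
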
Apache 2.0, following the original FireRedVAD license.
Base model: FireRedTeam/FireRedVAD