# wav2vec-vm-finetune-c
Pre-exported weights for eschmidbauer/wav2vec-vm-finetune-c,
a zero-dependency C re-implementation of the
jakeBland/wav2vec-vm-finetune
voicemail detector. No PyTorch, no ONNX Runtime, no BLAS at runtime; the only
runtime dependencies are libm and libpthread.
This repository contains the model weights in a custom layout consumed by
the `vm_detect` C binary. It does not contain PyTorch / safetensors
checkpoints; those live in the upstream repo.
## Contents
| Directory | Size | Notes |
|---|---|---|
| `weights-fp32/` | ~1.26 GB | every tensor as raw float32 |
| `weights-int8/` | ~355 MB | large MatMul weights as int8 + per-row float32 scale; conv / LayerNorm / biases / pos_conv remain float32 |
Each directory holds one `manifest.json` plus per-module subdirectories:

```
weights-fp32/
  manifest.json
  feature_extractor/      conv0..conv6 (weight, bias, norm_weight, norm_bias)
  feature_projection/     norm_*, proj.*
  pos_conv/               weight.bin, bias.bin (weight-norm collapsed)
  encoder_norm/           weight.bin, bias.bin
  encoder/layer_{0..23}/  ln1.*, attn_{q,k,v,out}.*, ln2.*, ffn_{in,out}.*
  classifier/             projector.*, out.*
```
For the INT8 variant each quantized 2D weight is a pair:

```
<stem>.q8.bin     int8,    shape [M, K], row-major
<stem>.scale.bin  float32, shape [M]    (one scale per output row)
```
The C loader auto-detects these and dequantizes to float32 at model load time.
## Model details
Fine-tune of facebook/wav2vec2-large for binary voicemail detection, taken
unchanged from jakeBland/wav2vec-vm-finetune.
| Property | Value |
|---|---|
| Task | Binary audio classification (human vs voicemail) |
| Sample rate | 16 kHz, mono |
| Input length | 32,000 samples (2 s), raw float32 PCM |
| Hidden size | 1024 |
| FFN size | 4096 |
| Attention heads | 16 (64-dim each) |
| Encoder layers | 24 |
| Classifier proj | 256 |
| Labels | 0: human, 1: voicemail |
See `manifest.json` (identical in both directories) for the full tensor list
and shapes.
## Usage
### Download
```shell
pip install huggingface_hub
huggingface-cli download eschmidbauer/wav2vec-vm-finetune-c \
  --local-dir . --local-dir-use-symlinks False
```
### Build and run the C inference binary
```shell
git clone https://github.com/eschmidbauer/wav2vec-vm-finetune-c
cd wav2vec-vm-finetune-c
make -C c

# preprocess an mp3/wav to 16 kHz mono float32 PCM (32,000 samples)
python prep_audio.py my_clip.mp3

# run inference against the fp32 or int8 weights
c/vm_detect path/to/weights-fp32 my_clip.f32
c/vm_detect path/to/weights-int8 my_clip.f32
```
Example output:
```
loaded weights-fp32 in 160 ms
my_clip.f32  voicemail  (human=0.034, voicemail=0.966)  [1703 ms]
```
Concurrent batch mode (shared model, N pthread workers):
```shell
c/vm_detect weights-fp32 --workers 4 clips/*.f32
```
See the project README for build flags, profiling env vars (`WAV2VEC_PROF`,
`WAV2VEC_DUMP`), and the NEON SGEMM micro-kernel details.
## Re-generating these weights
The files here were produced by `extract_weights.py` from the upstream
PyTorch checkpoint:

```shell
python extract_weights.py                            # jakeBland/wav2vec-vm-finetune
MODEL_ID=other-user/model python extract_weights.py
python extract_weights.py path/to/local/dir
```
Re-runs are idempotent (delete `weights-fp32/` or `weights-int8/` to force
regeneration).
## Intended use and limitations
Designed for call-flow systems that need to decide, from the first ~2 seconds of audio, whether the other end is a live human or a voicemail greeting. It inherits the biases, language coverage, and failure modes of the upstream fine-tune: accuracy will degrade on clips outside the training distribution (non-English speech, heavy background noise, very short greetings).
The INT8 variant trades a small amount of accuracy for a ~3.5× weight-size reduction and faster load time; behavior is otherwise identical.
## License
MIT; see LICENSE in the source repository. Upstream weights are subject
to the license of jakeBland/wav2vec-vm-finetune.
## Citation
```bibtex
@misc{wav2vec-vm-finetune-c,
  author = {Emmanuel Schmidbauer},
  title  = {wav2vec-vm-finetune-c: zero-dependency C inference for a wav2vec2 voicemail detector},
  year   = {2026},
  url    = {https://github.com/eschmidbauer/wav2vec-vm-finetune-c}
}
```