---
license: apache-2.0
language:
- en
tags:
- gguf
- audio
- speech-recognition
- data2vec
- wav2vec2
- ctc
- automatic-speech-recognition
base_model: facebook/data2vec-audio-base-960h
pipeline_tag: automatic-speech-recognition
---
# Data2Vec Audio (GGUF)
GGUF conversion of [facebook/data2vec-audio-base-960h](https://huggingface.co/facebook/data2vec-audio-base-960h) for use with [CrispASR](https://github.com/CrispStrobe/CrispASR).
## Model Details
- **Architecture**: Data2Vec Audio, a wav2vec2-style CNN feature extractor (7 layers, 512-dim) followed by a 12-layer transformer (768-dim, 12 heads) and a CTC head
- **Parameters**: ~95M
- **Training**: Self-supervised pre-training on LibriSpeech 960h, fine-tuned with CTC loss
- **Language**: English only
- **License**: Apache 2.0
- **WER**: 1.89% (LibriSpeech test-clean), 4.07% (test-other)
## Usage with CrispASR
```bash
# Data2Vec Audio runs on the wav2vec2 backend. It is set explicitly here,
# and is also auto-detected from the GGUF architecture.
crispasr --backend wav2vec2 -m data2vec-audio-base-960h-q4_k.gguf -f audio.wav
```
## Architecture Notes
Data2Vec Audio differs from standard wav2vec2 in three ways handled by the converter:
1. **5-layer positional convolution** (vs 1 for wav2vec2), each with Conv1d + LayerNorm(no affine) + GELU
2. **Global encoder LayerNorm BEFORE transformer layers** (vs after for wav2vec2)
3. **POST-norm encoder** even though the CNN feature extractor uses LayerNorm (wav2vec2-large uses pre-norm)
All three are auto-detected from the HuggingFace model config and stored as GGUF metadata flags.
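As a rough illustration of that detection step, the sketch below derives the three flags from a HuggingFace-style config dict. It is not the CrispASR converter's code: the flag names are hypothetical, and the config fields it reads (`model_type`, `num_conv_pos_embeddings`, `do_stable_layer_norm`) are assumptions about what the upstream `config.json` exposes.

```python
def detect_arch_flags(config: dict) -> dict:
    """Map a HuggingFace-style config dict to three architecture flags.

    The returned key names are hypothetical, not CrispASR's GGUF metadata keys.
    """
    # Assumption: Data2Vec Audio checkpoints identify themselves via model_type.
    is_data2vec = config.get("model_type") == "data2vec-audio"

    # Assumption: for Data2Vec Audio, num_conv_pos_embeddings counts the stacked
    # positional conv layers; standard wav2vec2 is treated as a single layer.
    if is_data2vec:
        pos_conv_layers = int(config.get("num_conv_pos_embeddings", 5))
    else:
        pos_conv_layers = 1

    return {
        "pos_conv_layers": pos_conv_layers,
        # Following the notes above: global LayerNorm before the transformer layers.
        "norm_before_layers": is_data2vec,
        # Data2Vec Audio is post-norm; otherwise defer to the config field.
        "pre_norm": (not is_data2vec) and bool(config.get("do_stable_layer_norm", False)),
    }


flags = detect_arch_flags({"model_type": "data2vec-audio", "num_conv_pos_embeddings": 5})
print(flags)
```

Keeping the decision in one function makes it easy to see which config fields drive each flag before they are written out as metadata.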
## Files
| File | Size | JFK transcription |
|------|------|-------------------|
| data2vec-audio-base-960h-f16.gguf | 196 MB | matches HF reference |
| data2vec-audio-base-960h-q8_0.gguf | 120 MB | matches HF reference |
| data2vec-audio-base-960h-q4_k.gguf | 79 MB | matches HF reference |
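For a rough sense of what each quantization costs in storage, the sizes above can be converted to an effective bits-per-weight figure. This is a back-of-the-envelope sketch: it assumes exactly 95M parameters and 1 MB = 10^6 bytes, and it counts file metadata and any tensors kept at higher precision toward the total, so the result sits above the nominal bit width of each format.

```python
# Approximate parameter count from Model Details (~95M); an assumption, not an exact figure.
N_PARAMS = 95_000_000

# File sizes in MB, copied from the table above.
SIZES_MB = {"f16": 196, "q8_0": 120, "q4_k": 79}


def bits_per_weight(size_mb: float, n_params: int = N_PARAMS) -> float:
    """Effective bits per parameter, treating 1 MB as 10**6 bytes."""
    return size_mb * 1_000_000 * 8 / n_params


for name, size_mb in SIZES_MB.items():
    print(f"{name}: {bits_per_weight(size_mb):.1f} bits/weight")
```

With these inputs the figures come out to roughly 16.5, 10.1, and 6.7 bits per weight for f16, q8_0, and q4_k respectively.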
## Accuracy
Tested on an 11-second clip from the JFK inaugural address:
```
AND SO A MY FELLOW AMERICANS ASK NOT WHAT YOUR COUNTRY CAN DO FOR YOU
ASK WHAT YOU CAN DO FOR YOUR COUNTRY
```
Identical to the Python HuggingFace reference output. All quantized variants produce the same transcription.
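A comparison like this can be automated with a word-level error rate between the reference transcript and each variant's output. The sketch below is a plain-Python word error rate (word-level Levenshtein distance divided by reference length), not part of CrispASR; the `outputs` dict is a hypothetical stand-in for transcripts collected from each GGUF file.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.upper().split()
    hyp = hypothesis.upper().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] is the edit distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


REFERENCE = (
    "AND SO A MY FELLOW AMERICANS ASK NOT WHAT YOUR COUNTRY CAN DO FOR YOU "
    "ASK WHAT YOU CAN DO FOR YOUR COUNTRY"
)

# Hypothetical stand-in: paste each GGUF variant's actual output here.
outputs = {"q4_k": REFERENCE}

for name, text in outputs.items():
    print(name, word_error_rate(REFERENCE, text))
```

A result of 0.0 for every variant corresponds to the "same transcription" outcome reported here; any non-zero value points at the variant that diverged.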
## Citation
```bibtex
@inproceedings{baevski2022data2vec,
title={data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language},
author={Baevski, Alexei and Hsu, Wei-Ning and Xu, Qiantong and Babu, Arun and Gu, Jiatao and Auli, Michael},
booktitle={ICML},
year={2022}
}
```