---
license: apache-2.0
language:
- en
tags:
- gguf
- audio
- speech-recognition
- data2vec
- wav2vec2
- ctc
- automatic-speech-recognition
base_model: facebook/data2vec-audio-base-960h
pipeline_tag: automatic-speech-recognition
---

# Data2Vec Audio (GGUF)

GGUF conversion of [facebook/data2vec-audio-base-960h](https://huggingface.co/facebook/data2vec-audio-base-960h) for use with [CrispASR](https://github.com/CrispStrobe/CrispASR).

## Model Details

- **Architecture**: Data2Vec Audio, consisting of a wav2vec2-style CNN feature extractor (7 layers, 512 channels), a 12-layer transformer encoder (768-dim, 12 heads), and a CTC head
- **Parameters**: ~95M
- **Training**: Self-supervised pre-training on LibriSpeech 960h, fine-tuned with CTC loss
- **Language**: English only
- **License**: Apache 2.0
- **WER**: 2.77% (LibriSpeech test-clean), 7.08% (test-other), as reported for the original PyTorch checkpoint
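
The CNN front end sets the model's time resolution. As a rough sketch, assuming the standard wav2vec2 kernel sizes and strides (the usual `conv_kernel` and `conv_stride` values, not values read from this GGUF) and 16 kHz input, the number of frames the transformer sees can be worked out as follows:

```python
# Assumed wav2vec2-style CNN layout (7 layers); not read from the GGUF file.
CONV_KERNEL = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDE = (5, 2, 2, 2, 2, 2, 2)
SAMPLE_RATE = 16_000  # assumed input sample rate in Hz


def num_frames(num_samples: int) -> int:
    """Number of CNN output frames for a clip of `num_samples` samples."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNEL, CONV_STRIDE):
        # Unpadded 1-D convolution: floor((L - kernel) / stride) + 1
        length = (length - kernel) // stride + 1
    return length


total_stride = 1
for stride in CONV_STRIDE:
    total_stride *= stride

print(total_stride)                   # 320 samples per frame, i.e. 20 ms at 16 kHz
print(num_frames(SAMPLE_RATE))        # 49 frames for 1 s of audio
print(num_frames(11 * SAMPLE_RATE))   # 549 frames for an 11 s clip
```

Under these assumptions the CTC head emits one label distribution roughly every 20 ms.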

## Usage with CrispASR

```bash
# Data2Vec Audio runs on the wav2vec2 backend. It is passed explicitly here;
# the backend is also auto-detected from the GGUF architecture.
crispasr --backend wav2vec2 -m data2vec-audio-base-960h-q4_k.gguf -f audio.wav
```

## Architecture Notes

Data2Vec Audio differs from standard wav2vec2 in three ways handled by the converter:

1. **5-layer positional convolution** (wav2vec2 uses a single layer); each layer is Conv1d + LayerNorm (no affine parameters) + GELU
2. **Global encoder LayerNorm before the transformer layers** (wav2vec2 applies it after)
3. **Post-norm encoder layers** despite LayerNorm in the CNN feature extractor (wav2vec2-large uses pre-norm)

The converter auto-detects all three from the Hugging Face model config and stores them as GGUF metadata flags.
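
To make the second and third differences concrete, here is a toy sketch of the two orderings in plain Python. It is not the converter's or CrispASR's code: `toy_sublayer` is a hypothetical stand-in for attention and feed-forward, each real layer has two sublayers rather than one, and the `post_norm` / `global_norm_first` arguments are illustrative names, not the actual GGUF metadata keys.

```python
import math


def layer_norm(x, eps=1e-5):
    """LayerNorm without affine parameters over a 1-D feature vector."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def toy_sublayer(x):
    """Hypothetical stand-in for an attention or feed-forward block."""
    return [0.5 * v + 1.0 for v in x]


def post_norm_layer(x):
    # Post-norm: residual add first, LayerNorm last.
    return layer_norm([a + b for a, b in zip(x, toy_sublayer(x))])


def pre_norm_layer(x):
    # Pre-norm: LayerNorm first, residual add last.
    return [a + b for a, b in zip(x, toy_sublayer(layer_norm(x)))]


def encode(x, num_layers, post_norm, global_norm_first):
    """Run a toy encoder stack with the chosen norm placement."""
    if global_norm_first:
        x = layer_norm(x)  # global LayerNorm before the layers
    for _ in range(num_layers):
        x = post_norm_layer(x) if post_norm else pre_norm_layer(x)
    if not global_norm_first:
        x = layer_norm(x)  # global LayerNorm after the layers
    return x


features = [1.0, 2.0, 3.0, 4.0]
# Ordering described above for Data2Vec Audio.
print(encode(features, num_layers=2, post_norm=True, global_norm_first=True))
# The opposite ordering (pre-norm layers, global LayerNorm last).
print(encode(features, num_layers=2, post_norm=False, global_norm_first=False))
```

The weights are the same kind of tensors in both cases; only the order of operations changes, which is why a pair of flags is enough for a shared backend to run both variants.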

## Files

| File | Size | JFK transcription vs. HF reference |
|------|------|------------------------------------|
| data2vec-audio-base-960h-f16.gguf | 196 MB | identical |
| data2vec-audio-base-960h-q4_k.gguf | 79 MB | identical |
| data2vec-audio-base-960h-q8_0.gguf | 120 MB | identical |
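
As a quick sanity check on these sizes, the average number of bits stored per parameter can be estimated from the table. This is back-of-the-envelope arithmetic only, assuming the approximate 95M parameter count from Model Details and 1 MB = 10^6 bytes:

```python
PARAMS = 95e6  # approximate parameter count (assumption, from Model Details)

FILE_SIZES_MB = {
    "f16": 196,
    "q8_0": 120,
    "q4_k": 79,
}


def bits_per_weight(size_mb: float, params: float = PARAMS) -> float:
    """Average bits stored per parameter, assuming 1 MB = 10**6 bytes."""
    return size_mb * 1e6 * 8 / params


for name, size_mb in FILE_SIZES_MB.items():
    print(f"{name}: {bits_per_weight(size_mb):.1f} bits/weight")
```

The quantized files come out above the nominal 8.5 (Q8_0) and 4.5 (Q4_K) bits per weight of those formats. That is what one would expect if some tensors are kept at higher precision, though the table alone does not say which.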

## Accuracy

Tested on an 11 s clip of the JFK inaugural address:

```
AND SO A MY FELLOW AMERICANS ASK NOT WHAT YOUR COUNTRY CAN DO FOR YOU
ASK WHAT YOU CAN DO FOR YOUR COUNTRY
```

This output is identical to that of the Hugging Face reference implementation in Python, and all three GGUF files produce the same transcription.
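
The transcription is uppercase and unpunctuated because it is read directly off the CTC head's character labels. Below is a minimal sketch of greedy CTC decoding, assuming a toy vocabulary with the blank token at id 0 and `|` as the word delimiter (the wav2vec2 convention). The real vocabulary and the decoder CrispASR uses are not shown here.

```python
def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse repeated labels, drop blanks, and join tokens into text."""
    tokens = []
    prev = None
    for idx in frame_ids:
        # Emit a token only when the label changes and is not the blank.
        if idx != prev and idx != blank_id:
            tokens.append(id_to_token[idx])
        prev = idx
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Toy vocabulary (hypothetical ids, for illustration only).
VOCAB = {0: "<pad>", 1: "|", 2: "A", 3: "S", 4: "K", 5: "N", 6: "O", 7: "T"}

# Per-frame argmax labels, one entry per frame.
frames = [2, 2, 0, 3, 3, 4, 0, 1, 1, 5, 6, 0, 7, 7, 0]
print(ctc_greedy_decode(frames, VOCAB))  # ASK NOT
```

A blank between two identical labels is what lets CTC emit a doubled letter: `[6, 0, 6]` decodes to `OO`, while `[6, 6]` collapses to a single `O`.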

## Citation

```bibtex
@inproceedings{baevski2022data2vec,
  title={data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language},
  author={Baevski, Alexei and Hsu, Wei-Ning and Xu, Qiantong and Babu, Arun and Gu, Jiatao and Auli, Michael},
  booktitle={ICML},
  year={2022}
}
```