	---
	license: apache-2.0
	language:
	- en
	tags:
	- gguf
	- audio
	- speech-recognition
	- data2vec
	- wav2vec2
	- ctc
	- automatic-speech-recognition
	base_model: facebook/data2vec-audio-base-960h
	pipeline_tag: automatic-speech-recognition
	---

	# Data2Vec Audio (GGUF)

	GGUF conversion of [facebook/data2vec-audio-base-960h](https://huggingface.co/facebook/data2vec-audio-base-960h) for use with [CrispASR](https://github.com/CrispStrobe/CrispASR).

	## Model Details

	- Architecture: Data2Vec Audio, i.e. a wav2vec2-style CNN feature extractor (7 layers, 512 channels) + 12-layer transformer encoder (768-dim, 12 heads) + CTC head
	- Parameters: ~95M
	- Training: Self-supervised pre-training on LibriSpeech 960h, fine-tuned with CTC loss
	- Language: English only
	- License: Apache 2.0
	- WER: 2.77% (LibriSpeech test-clean), 7.08% (test-other), as reported on the base model card
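
To make the CNN front end more concrete, the sketch below estimates how many feature frames a 7-layer extractor emits for a given clip. The kernel and stride values are an assumption (the usual wav2vec2-base configuration, not stated in this card), as is the 16 kHz input rate.

```python
# Assumed wav2vec2-base style conv stack (7 layers); these values are NOT stated in this card.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)
SAMPLE_RATE = 16_000  # assumed input sample rate in Hz


def num_frames(num_samples: int) -> int:
    """Number of output frames of an unpadded 1-D convolution stack."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        # Standard output-length formula for a convolution without padding
        length = (length - kernel) // stride + 1
    return length


one_second = num_frames(SAMPLE_RATE)
eleven_seconds = num_frames(11 * SAMPLE_RATE)
print(one_second, eleven_seconds)
```

Under these assumptions the total stride is 320 samples, so the transformer and CTC head see roughly 50 frames per second of audio.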

	## Usage with CrispASR

	```bash
	# Data2Vec Audio runs on the wav2vec2 backend (auto-detected from the GGUF architecture; --backend makes it explicit)
	crispasr --backend wav2vec2 -m data2vec-audio-base-960h-q4_k.gguf -f audio.wav
	```
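
For scripted use, the same invocation can be wrapped from Python. This is a minimal sketch that only reuses the flags shown above; it assumes `crispasr` is on `PATH` and that the transcript is written to stdout, neither of which this card confirms. The helper names `build_command` and `transcribe` are hypothetical.

```python
import shlex
import subprocess
from pathlib import Path


def build_command(model: str, audio: str) -> list[str]:
    """Build the argv list for the CrispASR invocation shown above."""
    return ["crispasr", "--backend", "wav2vec2", "-m", model, "-f", audio]


def transcribe(model: str, audio: str) -> str:
    """Run CrispASR and return its stdout (assumed to hold the transcript)."""
    for path in (model, audio):
        if not Path(path).is_file():
            raise FileNotFoundError(path)
    result = subprocess.run(
        build_command(model, audio), capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


# Building the command has no side effects, so it is safe to inspect first.
cmd = build_command("data2vec-audio-base-960h-q4_k.gguf", "audio.wav")
print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.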

	## Architecture Notes

	Data2Vec Audio differs from standard wav2vec2 in three ways handled by the converter:

	1. 5-layer positional convolution (vs. a single layer for wav2vec2), where each layer is Conv1d + LayerNorm (no affine) + GELU
	2. Global encoder LayerNorm applied BEFORE the transformer layers (vs. after for wav2vec2)
	3. POST-norm transformer layers even though the CNN uses LayerNorm (wav2vec2-large, which also uses LayerNorm in the CNN, is pre-norm)

	All three are auto-detected from the HuggingFace model config and stored as GGUF metadata flags.
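
The post-norm versus pre-norm distinction in point 3 comes down to where the normalization sits relative to the residual connection. The toy sketch below illustrates the ordering on plain Python lists; it is a simplified illustration with a stand-in sublayer, not the converter's or CrispASR's code, and all function names are hypothetical.

```python
import math


def layer_norm(x, eps=1e-5):
    """LayerNorm without affine parameters: zero mean, unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_norm_block(x, sublayer):
    # Post-norm: add the residual first, then normalize the sum
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_norm_block(x, sublayer):
    # Pre-norm: normalize only the sublayer input; the residual stream stays unnormalized
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def toy_sublayer(x):
    # Stand-in for attention or feed-forward (hypothetical, for illustration only)
    return [2.0 * v for v in x]


x = [1.0, 2.0, 4.0, 8.0]
post = post_norm_block(x, toy_sublayer)
pre = pre_norm_block(x, toy_sublayer)
print("post-norm mean:", sum(post) / len(post))
print("pre-norm mean:", sum(pre) / len(pre))
```

The post-norm output is always normalized, while the pre-norm output keeps the scale and offset of its input. Mixing up the two orderings at load time would therefore change every layer's activations, which is presumably why the choice is stored as an explicit flag.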

	## Files

	| File | Size | JFK Transcription |
	|------|------|-------------------|
	| data2vec-audio-base-960h-f16.gguf | 196 MB | matches reference |
	| data2vec-audio-base-960h-q4_k.gguf | 79 MB | matches reference |
	| data2vec-audio-base-960h-q8_0.gguf | 120 MB | matches reference |
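
To compare the variants, the sizes in the table can be turned into compression ratios and a rough bits-per-weight estimate. This sketch assumes 1 MB = 10^6 bytes and uses the approximate "~95M" parameter count from Model Details, so the bits-per-weight figures are only indicative.

```python
# File sizes in MB, copied from the table above.
SIZES_MB = {"f16": 196, "q8_0": 120, "q4_k": 79}
APPROX_PARAMS = 95e6  # "~95M" from Model Details; approximate


def compression_ratio(variant: str, baseline: str = "f16") -> float:
    """How many times smaller a variant is than the baseline file."""
    return SIZES_MB[baseline] / SIZES_MB[variant]


def approx_bits_per_weight(variant: str) -> float:
    # Assumes 1 MB = 10**6 bytes and ignores GGUF metadata overhead
    return SIZES_MB[variant] * 1e6 * 8 / APPROX_PARAMS


for name in SIZES_MB:
    print(
        f"{name}: {compression_ratio(name):.2f}x vs f16, "
        f"~{approx_bits_per_weight(name):.1f} bits/weight"
    )
```

Under these assumptions the q4_k file comes out well above 4 bits per weight. Tensors kept at higher precision would be a plausible reason, though this card does not say which tensors are quantized.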

	## Accuracy

	Tested on an 11-second clip from JFK's inaugural address:

	```
	AND SO A MY FELLOW AMERICANS ASK NOT WHAT YOUR COUNTRY CAN DO FOR YOU
	ASK WHAT YOU CAN DO FOR YOUR COUNTRY
	```

	Identical to the Python HuggingFace reference output. All quantized variants produce the same transcription.
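
The WER figures quoted in Model Details use the standard word error rate: word-level edit distance divided by the number of reference words. As a minimal sketch, the function below applies that metric to the transcript above. The reference string is an assumption (the commonly quoted wording of the passage, not part of this card), and a single 22-word clip is not comparable to a LibriSpeech benchmark score.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Assumed reference wording; NOT taken from this card.
REFERENCE = (
    "AND SO MY FELLOW AMERICANS ASK NOT WHAT YOUR COUNTRY CAN DO FOR YOU "
    "ASK WHAT YOU CAN DO FOR YOUR COUNTRY"
)
# Model output as shown above.
HYPOTHESIS = (
    "AND SO A MY FELLOW AMERICANS ASK NOT WHAT YOUR COUNTRY CAN DO FOR YOU "
    "ASK WHAT YOU CAN DO FOR YOUR COUNTRY"
)

wer = word_error_rate(REFERENCE, HYPOTHESIS)
print(f"WER on this clip: {wer:.1%}")
```

Against this assumed reference the output differs by one inserted word ("A") over 22 reference words. The same function can be used to confirm that two quantized variants produce identical text, in which case it returns 0.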

	## Citation

	```bibtex
	@inproceedings{baevski2022data2vec,
	title={data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language},
	author={Baevski, Alexei and Hsu, Wei-Ning and Xu, Qiantong and Babu, Arun and Gu, Jiatao and Auli, Michael},
	booktitle={ICML},
	year={2022}
	}
	```