| --- |
| license: apache-2.0 |
| language: |
| - en |
| tags: |
| - gguf |
| - audio |
| - speech-recognition |
| - data2vec |
| - wav2vec2 |
| - ctc |
| - automatic-speech-recognition |
| base_model: facebook/data2vec-audio-base-960h |
| pipeline_tag: automatic-speech-recognition |
| --- |
| |
| # Data2Vec Audio (GGUF) |
|
|
| GGUF conversion of [facebook/data2vec-audio-base-960h](https://huggingface.co/facebook/data2vec-audio-base-960h) for use with [CrispASR](https://github.com/CrispStrobe/CrispASR). |
|
|
| ## Model Details |
|
|
| - **Architecture**: Data2Vec Audio — wav2vec2-style CNN (7L, 512-dim) + 12-layer transformer (768-dim, 12 heads) + CTC head |
| - **Parameters**: ~95M |
| - **Training**: Self-supervised pre-training on LibriSpeech 960h, fine-tuned with CTC loss |
| - **Language**: English only |
| - **License**: Apache 2.0 |
| - **WER**: 1.89% (LibriSpeech test-clean), 4.07% (test-other) |
|
|
| ## Usage with CrispASR |
|
|
| ```bash |
| # Uses the wav2vec2 backend (auto-detected from GGUF architecture) |
| crispasr --backend wav2vec2 -m data2vec-audio-base-960h-q4_k.gguf -f audio.wav |
| ``` |
|
|
| ## Architecture Notes |
|
|
| Data2Vec Audio differs from standard wav2vec2 in three ways handled by the converter: |
|
|
| 1. **5-layer positional convolution** (vs 1 for wav2vec2), each with Conv1d + LayerNorm(no affine) + GELU |
| 2. **Global encoder LayerNorm BEFORE transformer layers** (vs after for wav2vec2) |
| 3. **POST-norm encoder** despite using LayerNorm in CNN (wav2vec2-large uses pre-norm) |
|
|
| All three are auto-detected from the HuggingFace model config and stored as GGUF metadata flags. |
|
|
| ## Files |
|
|
| | File | Size | JFK Transcription | |
| |------|------|-------------------| |
| | data2vec-audio-base-960h-f16.gguf | 196 MB | perfect | |
| | data2vec-audio-base-960h-q4_k.gguf | 79 MB | perfect | |
| | data2vec-audio-base-960h-q8_0.gguf | 120 MB | perfect | |
|
|
| ## Accuracy |
|
|
| Tested on JFK inaugural address (11s): |
|
|
| ``` |
| AND SO A MY FELLOW AMERICANS ASK NOT WHAT YOUR COUNTRY CAN DO FOR YOU |
| ASK WHAT YOU CAN DO FOR YOUR COUNTRY |
| ``` |
|
|
| Identical to the Python HuggingFace reference output. All quantized variants produce the same transcription. |
|
|
| ## Citation |
|
|
| ```bibtex |
| @inproceedings{baevski2022data2vec, |
| title={data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language}, |
| author={Baevski, Alexei and Hsu, Wei-Ning and Xu, Qiantong and Babu, Arun and Gu, Jiatao and Auli, Michael}, |
| booktitle={ICML}, |
| year={2022} |
| } |
| ``` |
|
|