W2V-BERT 2.0 ASR Adapters

This repository contains four per-language bottleneck adapters for automatic speech recognition (ASR), trained on top of the frozen facebook/w2v-bert-2.0 encoder.

Model Description

Base Model: facebook/w2v-bert-2.0 (600M parameters, frozen)
Adapter Architecture: MMS-style bottleneck adapters (dim=64)
Decoder: Lightweight transformer decoder (1 layer)
Training: CTC loss with extended vocabulary for double vowels
Average WER: 80.01% (unweighted mean of the four per-adapter WERs listed under Trained Adapters)
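
To illustrate the CTC setup, the sketch below shows greedy CTC decoding: take the best token per frame, merge consecutive repeats, then drop blanks. The token table, the blank id 0, the `|` word delimiter, and the single `aa` token are all hypothetical and chosen only to show why a dedicated double-vowel token can help: plain CTC collapses two adjacent `a` frames into one `a` unless a blank separates them. How this repository actually builds its extended vocabulary is not documented here, so treat this as one plausible reading.

```python
from itertools import groupby

# Hypothetical vocabulary for illustration; the real one lives in each adapter's vocab.json.
ID_TO_TOKEN = {0: "<blank>", 1: "|", 2: "a", 3: "aa", 4: "m", 5: "n"}
BLANK_ID = 0  # assumption: id 0 is the CTC blank


def ctc_greedy_decode(frame_ids, id_to_token=ID_TO_TOKEN, blank_id=BLANK_ID):
    """Collapse repeated frame predictions, drop blanks, and join the tokens."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    tokens = [id_to_token[i] for i in collapsed if i != blank_id]
    # Assumption: "|" marks a word boundary, as in many CTC tokenizers.
    return "".join(tokens).replace("|", " ").strip()


# Two adjacent "a" frames merge into a single "a" ...
print(ctc_greedy_decode([4, 2, 2, 5]))     # man
# ... while a dedicated "aa" token survives the collapse.
print(ctc_greedy_decode([4, 3, 3, 0, 5]))  # maan
```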

Trained Adapters

Adapter	Language	WER	Train Samples
ach_Latn	Acholi	99.72%	4825
eng_Latn_salt	English (SALT)	100.00%	4804
eng_Latn_tts	English (TTS)	20.50%	3030
ful_Latn	Fulah	99.81%	2355
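
As a quick check on the table above, the reported average can be reproduced as the unweighted mean of the four per-adapter WERs. The snippet only restates numbers from the table; the variable names are illustrative.

```python
# WER values (percent) copied from the table above.
wer_by_adapter = {
    "ach_Latn": 99.72,
    "eng_Latn_salt": 100.00,
    "eng_Latn_tts": 20.50,
    "ful_Latn": 99.81,
}

# Unweighted mean over adapters (not weighted by sample count).
average_wer = sum(wer_by_adapter.values()) / len(wer_by_adapter)
best_adapter = min(wer_by_adapter, key=wer_by_adapter.get)

print(f"Average WER: {average_wer:.2f}%")  # Average WER: 80.01%
print(f"Lowest WER: {best_adapter}")       # Lowest WER: eng_Latn_tts
```

The average is dominated by three adapters with WER near 100%; only eng_Latn_tts has a substantially lower error rate.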

Architecture

The model uses:

Frozen w2v-bert-2.0 encoder - Extracts audio representations
Bottleneck adapter - Language-specific adaptation (trainable)
Lightweight decoder - Transformer decoder block (trainable)
LM head - Per-language vocabulary projection (trainable)

Audio → Encoder(frozen) → Adapter → Decoder → LayerNorm → LM Head → Text
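
The trainable part of the pipeline above can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the code used to train these adapters: the encoder hidden size of 1024, the ReLU activation, the LayerNorm placement and residual connection inside the adapter, the use of `nn.TransformerEncoderLayer` as a stand-in for the "lightweight decoder" block, and the `vocab_size`, `num_heads`, and `ffn_dim` values are all assumptions. The parameter names in the published `.pt` files may not match these module names.

```python
import torch
from torch import nn


class BottleneckAdapter(nn.Module):
    """Down-project, apply a non-linearity, up-project, and add a residual."""

    def __init__(self, hidden_size=1024, bottleneck_dim=64):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)
        self.down = nn.Linear(hidden_size, bottleneck_dim)
        self.act = nn.ReLU()
        self.up = nn.Linear(bottleneck_dim, hidden_size)

    def forward(self, x):
        return x + self.up(self.act(self.down(self.norm(x))))


class AdapterHead(nn.Module):
    """Adapter -> decoder block -> LayerNorm -> LM head, applied to encoder states."""

    def __init__(self, hidden_size=1024, bottleneck_dim=64, vocab_size=40,
                 num_heads=8, ffn_dim=2048):
        super().__init__()
        self.adapter = BottleneckAdapter(hidden_size, bottleneck_dim)
        # Assumption: a single self-attention block stands in for the decoder.
        self.decoder = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads,
            dim_feedforward=ffn_dim, batch_first=True,
        )
        self.final_norm = nn.LayerNorm(hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size)

    def forward(self, encoder_states):
        x = self.adapter(encoder_states)
        x = self.decoder(x)
        x = self.final_norm(x)
        return self.lm_head(x)  # per-frame logits for CTC


head = AdapterHead(vocab_size=40).eval()
dummy_states = torch.randn(1, 50, 1024)  # stand-in for frozen encoder output
with torch.no_grad():
    logits = head(dummy_states)
print(logits.shape)  # torch.Size([1, 50, 40])
```

In this sketch the frozen encoder is left out entirely; only the four trainable components listed above appear, which is why a random tensor can stand in for the encoder output.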

Usage

Each adapter folder contains:

adapter_weights.pt - Bottleneck adapter weights
decoder_weights.pt - Decoder block weights
lm_head_weights.pt - Language model head weights
final_norm_weights.pt - Final layer norm weights
vocab.json - Language-specific vocabulary
adapter_config.json - Adapter configuration
metrics.json - Training metrics
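
After downloading an adapter folder, a small check like the one below can confirm that all of the files listed above are present. The helper name `missing_adapter_files` is hypothetical and not part of this repository.

```python
from pathlib import Path

# File names taken from the list above.
EXPECTED_FILES = (
    "adapter_weights.pt",
    "decoder_weights.pt",
    "lm_head_weights.pt",
    "final_norm_weights.pt",
    "vocab.json",
    "adapter_config.json",
    "metrics.json",
)


def missing_adapter_files(adapter_dir):
    """Return the expected file names that are absent from adapter_dir."""
    folder = Path(adapter_dir)
    return [name for name in EXPECTED_FILES if not (folder / name).is_file()]
```

Calling `missing_adapter_files("path/to/ach_Latn")` returns an empty list when the folder is complete, and otherwise the names of the files that still need to be fetched.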

Loading an Adapter

import json

import torch
from huggingface_hub import hf_hub_download
from transformers import Wav2Vec2BertProcessor

REPO_ID = "mutisya/w2v-bert-adapters-14lang-e5-25_52-v3"

# Pick one of the adapter folders listed under Trained Adapters
# (e.g., eng_Latn_tts for English (TTS))
adapter_id = "eng_Latn_tts"

# Load the processor for that language
processor = Wav2Vec2BertProcessor.from_pretrained(REPO_ID, subfolder=adapter_id)

# Load adapter configuration
config_path = hf_hub_download(REPO_ID, f"{adapter_id}/adapter_config.json")
with open(config_path) as f:
    adapter_config = json.load(f)


def load_weights(filename):
    """Download one weight file from the adapter folder and load it on the CPU."""
    path = hf_hub_download(REPO_ID, f"{adapter_id}/{filename}")
    return torch.load(path, map_location="cpu")


# Load the weights for each trainable component
adapter_weights = load_weights("adapter_weights.pt")
decoder_weights = load_weights("decoder_weights.pt")
lm_head_weights = load_weights("lm_head_weights.pt")
final_norm_weights = load_weights("final_norm_weights.pt")

Training Configuration

Epochs: 5
Learning Rate: 0.0005
Batch Size: 64 × 1 (effective: 64)
Extended Vocabulary: True
Adapter Dimension: 64
Decoder Layers: 1

Supported Languages

The following languages have trained adapters:

Acholi (ach_Latn): WER 99.72%
English (SALT) (eng_Latn_salt): WER 100.00%
English (TTS) (eng_Latn_tts): WER 20.50%
Fulah (ful_Latn): WER 99.81%
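
To help interpret these figures, here is a minimal sketch of the standard word error rate definition: the word-level edit distance between reference and hypothesis, divided by the number of reference words. It is not the evaluation script used for this repository, and any text normalization applied during evaluation is unknown. The function name is illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat", "the cat sat"))  # 0.0
print(word_error_rate("the cat sat", "a dog ran"))    # 1.0
```

Under this definition, a WER of 100% means the number of word errors equals the number of reference words, for example when no reference word is recognized correctly.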

License

Apache 2.0

Citation

@misc{w2vbert-asr-adapters,
  author = {Mutisya},
  title = {W2V-BERT 2.0 ASR Adapters for African Languages},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/mutisya/w2v-bert-adapters-14lang-e5-25_52-v3}
}
