hubert-base-phoneme-en


This model is a fine-tuned version of facebook/hubert-base-ls960 for phoneme-level CTC prediction with the 41-token ARPABET vocabulary used by peacock-asr.

It is a P003 backbone artifact, not the final pronunciation-scoring result. The downstream GOP/GOPT evaluation on SpeechOcean762 happens in the separate P003 eval sweeps.

Model description

  • Base model: HuBERT-base (facebook/hubert-base-ls960, 95M parameters)
  • Fine-tuning objective: phoneme-level CTC
  • Training data: LibriSpeech 960h
  • Output vocabulary: repo-standard ARPABET token set used for pronunciation scoring backends
  • Hugging Face backend string for this repo: hf:Peacockery/hubert-base-phoneme-en
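The backend string above suggests a small resolution step before loading. A minimal sketch, assuming the `hf:` prefix simply denotes a Hugging Face repo id and that the model exposes the standard transformers CTC interface (the loading calls are commented out because they require network access and are not confirmed project code):

```python
BACKEND = "hf:Peacockery/hubert-base-phoneme-en"

def resolve_backend(backend: str) -> str:
    """Strip the assumed 'hf:' scheme prefix to get the Hugging Face repo id."""
    prefix = "hf:"
    if not backend.startswith(prefix):
        raise ValueError(f"unsupported backend string: {backend!r}")
    return backend[len(prefix):]

repo_id = resolve_backend(BACKEND)
# from transformers import AutoProcessor, HubertForCTC   # requires network access
# processor = AutoProcessor.from_pretrained(repo_id)
# model = HubertForCTC.from_pretrained(repo_id)
```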

This artifact is intended to serve as the phoneme-posterior generator for the P003 compact-backbone comparison:

  • wav2vec2-base (95M) vs
  • HuBERT-base (95M) vs
  • w2v-bert-2.0 (600M) vs
  • future smaller backbones such as Citrinet

Intended uses & limitations

Intended use:

  • research on phoneme posterior extraction
  • backend for GOP-SF feature extraction
  • controlled backbone swaps inside the P003 pronunciation-scoring pipeline
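As an illustration of the first two uses, the sketch below derives frame-level phoneme posteriors and a greedy phone sequence from CTC logits. A random array stands in for the model's real output, and the blank id of 0 is an assumption, not a confirmed detail of this vocabulary:

```python
import numpy as np

rng = np.random.default_rng(0)
BLANK_ID = 0                                # assumption: CTC blank is token 0
logits = rng.normal(size=(50, 41))          # (time, vocab); stand-in for model output

# frame-level phoneme posteriors: softmax over the 41-token vocabulary axis
exps = np.exp(logits - logits.max(axis=-1, keepdims=True))
posteriors = exps / exps.sum(axis=-1, keepdims=True)

# greedy CTC decode: argmax per frame, collapse repeats, drop blanks
ids = posteriors.argmax(axis=-1).tolist()
collapsed = [cur for prev, cur in zip([None] + ids[:-1], ids) if cur != prev]
phone_ids = [i for i in collapsed if i != BLANK_ID]
```

GOP-style features are typically computed from the `posteriors` matrix rather than from the decoded sequence, which is why the posterior generator matters for the downstream pipeline.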

Not intended as:

  • a general-purpose English ASR model
  • a production transcription endpoint
  • a standalone pronunciation assessor without the downstream GOP/GOPT pipeline

Important limitation:

  • the trainer-side eval_per metric reported here is only the phoneme-CTC validation metric tracked during fine-tuning
  • the meaningful P003 research result is phone-level PCC on SpeechOcean762 after GOP/GOPT evaluation, which is tracked separately
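For reference, phoneme error rate is the edit distance between reference and hypothesis phone sequences, normalized by reference length. A self-contained sketch of that definition (not the Trainer's own implementation):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance via the classic one-row dynamic program."""
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution / match
            prev = cur
    return dp[-1]

def per(ref_phones, hyp_phones):
    """Phoneme error rate: edit distance normalized by reference length."""
    return edit_distance(ref_phones, hyp_phones) / len(ref_phones)
```

For example, `per("K AE T".split(), "K AA T".split())` is one substitution over three reference phones, i.e. 1/3.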

Training and evaluation data

Training data:

  • LibriSpeech 960h (train_clean_100, train_clean_360, train_other_500)

Trainer-side evaluation data:

  • LibriSpeech validation split used by the Hugging Face Trainer

Research evaluation data:

  • SpeechOcean762 is used later in the P003 eval sweep, not in this training run

Training procedure

This run used the shared P003 phoneme-head recipe and completed all planned training steps locally on March 6, 2026.

Selection policy:

  • training ran to global_step = 13182
  • the exported root artifact was selected from the best checkpoint
  • best checkpoint by tracked trainer metric: checkpoint-8500
  • best tracked trainer metric:
    • eval_per = 0.9988901220865705 at step 8500
  • final trainer eval at step 13000:
    • eval_loss = 0.10874085873365402
    • eval_per = 0.9992600813910469
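The selection policy above amounts to taking the checkpoint with the lowest tracked eval_per. Using the two step/metric pairs reported on this card:

```python
# eval_per values from this card's training log (lower is better)
tracked = {8500: 0.9988901220865705, 13000: 0.9992600813910469}
best_step = min(tracked, key=tracked.get)   # lowest eval_per wins -> 8500
```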

Artifact integrity:

  • exported root model.safetensors matches local checkpoint-8500
  • pushed Hugging Face model.safetensors matches that same artifact
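One hypothetical way to verify this kind of artifact integrity is to compare SHA-256 digests of the two weight files; the file names and contents below are stand-ins for the real model.safetensors copies:

```python
import hashlib
import pathlib
import tempfile

def sha256_of(path):
    """Stream a file through SHA-256 in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# demo with two temp copies standing in for the local checkpoint and pushed export
with tempfile.TemporaryDirectory() as d:
    local = pathlib.Path(d, "checkpoint-8500.safetensors")
    pushed = pathlib.Path(d, "pushed.safetensors")
    local.write_bytes(b"demo weights")
    pushed.write_bytes(b"demo weights")
    digests_match = sha256_of(local) == sha256_of(pushed)
```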

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 3e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • gradient_accumulation_steps: 8
  • total_train_batch_size: 64
  • optimizer: AdamW (torch fused) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 1318
  • num_epochs: 3
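Two quick consistency checks on these numbers: the effective batch size is the per-device batch size times the accumulation steps, and the warmup covers roughly 10% of the 13182 optimizer steps reported under the training procedure:

```python
# hyperparameters as reported on this card
train_batch_size = 8
gradient_accumulation_steps = 8
total_train_batch_size = train_batch_size * gradient_accumulation_steps  # 64

total_steps = 13182
warmup_steps = 1318
warmup_fraction = warmup_steps / total_steps  # ~0.1
```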

Training results

These are trainer-side phoneme-CTC validation metrics, not final pronunciation scoring metrics.

| Training Loss | Epoch  | Step  | Validation Loss | Eval PER |
|:-------------:|:------:|:-----:|:---------------:|:--------:|
| 27.9965       | 0.1138 | 500   | 3.5980          | 1.0      |
| 27.0035       | 0.2276 | 1000  | 3.4275          | 1.0      |
| 4.2237        | 0.3414 | 1500  | 0.4311          | 1.0      |
| 2.6999        | 0.4552 | 2000  | 0.2837          | 1.0      |
| 2.0393        | 0.5690 | 2500  | 0.2120          | 1.0      |
| 1.6892        | 0.6828 | 3000  | 0.1822          | 0.9996   |
| 1.5122        | 0.7966 | 3500  | 0.1585          | 0.9993   |
| 1.3881        | 0.9104 | 4000  | 0.1543          | 0.9993   |
| 1.2519        | 1.0241 | 4500  | 0.1409          | 0.9996   |
| 1.2038        | 1.1379 | 5000  | 0.1329          | 0.9993   |
| 1.1547        | 1.2517 | 5500  | 0.1339          | 0.9996   |
| 1.1525        | 1.3655 | 6000  | 0.1276          | 0.9993   |
| 1.0925        | 1.4793 | 6500  | 0.1328          | 0.9993   |
| 1.0814        | 1.5931 | 7000  | 0.1172          | 0.9993   |
| 1.0529        | 1.7069 | 7500  | 0.1149          | 0.9993   |
| 1.0264        | 1.8207 | 8000  | 0.1172          | 0.9993   |
| 1.0404        | 1.9345 | 8500  | 0.1141          | 0.9989   |
| 0.9425        | 2.0483 | 9000  | 0.1150          | 0.9993   |
| 0.9543        | 2.1621 | 9500  | 0.1157          | 0.9993   |
| 0.9436        | 2.2759 | 10000 | 0.1175          | 0.9989   |
| 0.9406        | 2.3897 | 10500 | 0.1097          | 0.9993   |
| 0.9180        | 2.5035 | 11000 | 0.1096          | 0.9993   |
| 0.9193        | 2.6173 | 11500 | 0.1112          | 0.9993   |
| 0.9005        | 2.7311 | 12000 | 0.1105          | 0.9993   |
| 0.8939        | 2.8449 | 12500 | 0.1083          | 0.9993   |
| 0.9091        | 2.9587 | 13000 | 0.1087          | 0.9993   |

Research status

  • Training artifact: complete
  • Hugging Face export: complete
  • P003 pronunciation-scoring sweep: pending / separate from this card

Framework versions

  • Transformers 5.2.0
  • Pytorch 2.8.0+cu128
  • Datasets 4.5.0
  • Tokenizers 0.22.2