# hubert-base-phoneme-en
This model is a fine-tuned version of `facebook/hubert-base-ls960` for phoneme-level CTC prediction with the 41-token ARPABET vocabulary used by `peacock-asr`.
It is a P003 backbone artifact, not the final pronunciation-scoring result.
The downstream GOP/GOPT evaluation on SpeechOcean762 happens in the separate
P003 eval sweeps.
## Model description
- Base model: HuBERT-base (`facebook/hubert-base-ls960`, 95M parameters)
- Fine-tuning objective: phoneme-level CTC
- Training data: LibriSpeech 960h
- Output vocabulary: repo-standard ARPABET token set used for pronunciation scoring backends
- Hugging Face backend string for this repo: `hf:Peacockery/hubert-base-phoneme-en`
This artifact is intended to serve as the phoneme-posterior generator for the P003 compact-backbone comparison:

- wav2vec2-base (95M)
- HuBERT-base (95M)
- w2v-bert-2.0 (600M)
- future smaller backbones such as Citrinet
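As background on what a phoneme-CTC head produces, the standard greedy CTC collapse turns frame-level logits into a phoneme-id sequence: argmax each frame, merge consecutive repeats, drop blanks. This is a generic sketch with a tiny made-up vocabulary, not the model's real 41-token ARPABET set:

```python
import numpy as np

def ctc_greedy_decode(logits: np.ndarray, blank_id: int = 0) -> list[int]:
    """Collapse frame-level CTC logits to a token-id sequence:
    argmax each frame, merge consecutive repeats, then drop blanks."""
    frame_ids = logits.argmax(axis=-1)
    out, prev = [], None
    for t in frame_ids:
        if t != prev and t != blank_id:  # new emission, not blank
            out.append(int(t))
        prev = t
    return out

# Toy example: 6 frames over a 4-token vocab (0 = blank, 1..3 = phonemes).
toy = np.array([
    [9, 0, 0, 0],  # blank
    [0, 9, 0, 0],  # token 1
    [0, 9, 0, 0],  # token 1 (repeat, merged)
    [9, 0, 0, 0],  # blank separates two emissions of the same token
    [0, 9, 0, 0],  # token 1 again
    [0, 0, 0, 9],  # token 3
], dtype=float)
print(ctc_greedy_decode(toy))  # → [1, 1, 3]
```

Note that the blank between the second and third emitting frames is what allows the same phoneme to be emitted twice in a row.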
## Intended uses & limitations
Intended use:

- research on phoneme posterior extraction
- backend for GOP-SF feature extraction
- controlled backbone swaps inside the P003 pronunciation-scoring pipeline
Not intended as:
- a general-purpose English ASR model
- a production transcription endpoint
- a standalone pronunciation assessor without the downstream GOP/GOPT pipeline
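For context on the GOP role mentioned above, a basic goodness-of-pronunciation score can be computed from frame-level phoneme posteriors as the mean log posterior of the canonical phone over its aligned frames. This is a simplified textbook GOP variant with an assumed external alignment, not the actual GOP-SF feature definition used by the P003 pipeline:

```python
import numpy as np

def gop_score(posteriors: np.ndarray, canonical_id: int,
              start: int, end: int) -> float:
    """Mean log posterior of the canonical phone over its aligned
    frames [start, end) -- a basic GOP variant, not GOP-SF."""
    frames = posteriors[start:end, canonical_id]
    return float(np.log(np.clip(frames, 1e-10, None)).mean())

# Toy posteriors: 4 frames over a 3-phone vocabulary (rows sum to 1).
post = np.array([
    [0.80, 0.10, 0.10],
    [0.70, 0.20, 0.10],
    [0.60, 0.30, 0.10],
    [0.90, 0.05, 0.05],
])
well = gop_score(post, canonical_id=0, start=0, end=4)  # dominant phone
poor = gop_score(post, canonical_id=2, start=0, end=4)  # weak phone
print(well > poor)  # → True: a well-matched canonical phone scores higher
```

A downstream scorer (e.g. GOPT) then maps such per-phone features to human-like pronunciation scores; that stage is outside this card.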
Important limitation:

- the trainer-side `eval/per` metric reported here is only the phoneme-CTC training metric used during fine-tuning
- the meaningful P003 research result is phone-level PCC on SpeechOcean762 after GOP/GOPT evaluation, which is tracked separately
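For reference, the `eval/per` metric above is a standard phoneme error rate: Levenshtein edit distance between predicted and reference phoneme sequences, divided by reference length. A minimal implementation (the example phones are arbitrary placeholders):

```python
def edit_distance(pred: list[str], ref: list[str]) -> int:
    """Levenshtein distance: substitutions + insertions + deletions,
    via the classic single-row dynamic program."""
    dp = list(range(len(ref) + 1))
    for i, p in enumerate(pred, start=1):
        prev, dp[0] = dp[0], i
        for j, r in enumerate(ref, start=1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,            # delete p
                        dp[j - 1] + 1,        # insert r
                        prev + (p != r))      # substitute / match
            prev = cur
    return dp[-1]

def per(pred: list[str], ref: list[str]) -> float:
    """Phoneme error rate: edit distance normalized by reference length."""
    return edit_distance(pred, ref) / len(ref)

# One substitution (AH vs EH) out of four reference phones → PER 0.25.
print(per(["HH", "AH", "L", "OW"], ["HH", "EH", "L", "OW"]))  # → 0.25
```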
## Training and evaluation data
Training data:

- LibriSpeech 960h (`train_clean_100`, `train_clean_360`, `train_other_500`)
Trainer-side evaluation data:

- LibriSpeech validation split used by the Hugging Face `Trainer`
Research evaluation data:

- SpeechOcean762 is used later in the P003 eval sweep, not in this training run
## Training procedure
This run used the shared P003 phoneme-head recipe and completed all planned
training steps locally on March 6, 2026.
Selection policy:

- training ran to `global_step = 13182`
- the exported root artifact was selected from the best checkpoint
- best checkpoint by tracked trainer metric: `checkpoint-8500`
- best tracked trainer metric: `eval_per = 0.9988901220865705` at step 8500
- final trainer eval at step 13000: `eval_loss = 0.10874085873365402`, `eval_per = 0.9992600813910469`
Artifact integrity:

- exported root `model.safetensors` matches local `checkpoint-8500`
- pushed Hugging Face `model.safetensors` matches that same artifact
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 3e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 8
- total_train_batch_size: 64
- optimizer: AdamW (torch fused) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 1318
- num_epochs: 3
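The derived values are consistent: the effective batch size is 8 (per-device) × 8 (gradient accumulation) = 64, and the linear scheduler warms up over 1318 steps, then decays to zero at the last step. A sketch of the resulting learning-rate curve, assuming the run's `global_step = 13182` as the total and mirroring the behavior of a linear-with-warmup schedule:

```python
def linear_warmup_lr(step: int, base_lr: float = 3e-5,
                     warmup: int = 1318, total: int = 13182) -> float:
    """Linear warmup to base_lr over `warmup` steps, then linear
    decay to zero at `total` (the shape of a linear-warmup schedule)."""
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * max(0, total - step) / (total - warmup)

assert 8 * 8 == 64                 # grad-accum 8 x per-device batch 8
print(linear_warmup_lr(0))         # → 0.0
print(linear_warmup_lr(1318))      # peak learning rate (3e-05) at warmup end
print(linear_warmup_lr(13182))     # → 0.0 (fully decayed)
```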
### Training results
These are trainer-side phoneme-CTC validation metrics, not final pronunciation scoring metrics.
| Training Loss | Epoch | Step | Validation Loss | Eval PER |
|---|---|---|---|---|
| 27.9965 | 0.1138 | 500 | 3.5980 | 1.0 |
| 27.0035 | 0.2276 | 1000 | 3.4275 | 1.0 |
| 4.2237 | 0.3414 | 1500 | 0.4311 | 1.0 |
| 2.6999 | 0.4552 | 2000 | 0.2837 | 1.0 |
| 2.0393 | 0.5690 | 2500 | 0.2120 | 1.0 |
| 1.6892 | 0.6828 | 3000 | 0.1822 | 0.9996 |
| 1.5122 | 0.7966 | 3500 | 0.1585 | 0.9993 |
| 1.3881 | 0.9104 | 4000 | 0.1543 | 0.9993 |
| 1.2519 | 1.0241 | 4500 | 0.1409 | 0.9996 |
| 1.2038 | 1.1379 | 5000 | 0.1329 | 0.9993 |
| 1.1547 | 1.2517 | 5500 | 0.1339 | 0.9996 |
| 1.1525 | 1.3655 | 6000 | 0.1276 | 0.9993 |
| 1.0925 | 1.4793 | 6500 | 0.1328 | 0.9993 |
| 1.0814 | 1.5931 | 7000 | 0.1172 | 0.9993 |
| 1.0529 | 1.7069 | 7500 | 0.1149 | 0.9993 |
| 1.0264 | 1.8207 | 8000 | 0.1172 | 0.9993 |
| 1.0404 | 1.9345 | 8500 | 0.1141 | 0.9989 |
| 0.9425 | 2.0483 | 9000 | 0.1150 | 0.9993 |
| 0.9543 | 2.1621 | 9500 | 0.1157 | 0.9993 |
| 0.9436 | 2.2759 | 10000 | 0.1175 | 0.9989 |
| 0.9406 | 2.3897 | 10500 | 0.1097 | 0.9993 |
| 0.9180 | 2.5035 | 11000 | 0.1096 | 0.9993 |
| 0.9193 | 2.6173 | 11500 | 0.1112 | 0.9993 |
| 0.9005 | 2.7311 | 12000 | 0.1105 | 0.9993 |
| 0.8939 | 2.8449 | 12500 | 0.1083 | 0.9993 |
| 0.9091 | 2.9587 | 13000 | 0.1087 | 0.9993 |
## Related runs
- W&B training run: https://wandb.ai/peacockery/hubert-base-phoneme-en/runs/qe7scuxw
- Canonical train sweep: `projects/P003-compact-backbones/experiments/sweeps/final/train_hubert_base.yaml`
- Canonical eval sweep: `projects/P003-compact-backbones/experiments/sweeps/final/eval_hubert_base.yaml`
## Research status
- Training artifact: complete
- Hugging Face export: complete
- P003 pronunciation-scoring sweep: pending / separate from this card
### Framework versions
- Transformers 5.2.0
- PyTorch 2.8.0+cu128
- Datasets 4.5.0
- Tokenizers 0.22.2