hubert-base-phoneme-en


This model is a fine-tuned version of facebook/hubert-base-ls960 for phoneme-level CTC prediction with the 41-token ARPABET vocabulary used by peacock-asr.

It is a P003 backbone artifact, not the final pronunciation-scoring result. The downstream GOP/GOPT evaluation on SpeechOcean762 happens in the separate P003 eval sweeps.

Model description

  • Base model: HuBERT-base (facebook/hubert-base-ls960, 95M parameters)
  • Fine-tuning objective: phoneme-level CTC
  • Training data: LibriSpeech 960h
  • Output vocabulary: repo-standard ARPABET token set used for pronunciation scoring backends
  • Hugging Face backend string for this repo: hf:Peacockery/hubert-base-phoneme-en
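The backend string above suggests a small resolution step before loading. A minimal sketch, assuming the `hf:` prefix simply denotes a Hugging Face repo id and that the model exposes the standard transformers CTC interface (the loading calls are commented out because they require network access and are not confirmed project code):

```python
BACKEND = "hf:Peacockery/hubert-base-phoneme-en"

def resolve_backend(backend: str) -> str:
    """Strip the assumed 'hf:' scheme prefix to get the Hugging Face repo id."""
    prefix = "hf:"
    if not backend.startswith(prefix):
        raise ValueError(f"unsupported backend string: {backend!r}")
    return backend[len(prefix):]

repo_id = resolve_backend(BACKEND)
# from transformers import AutoProcessor, HubertForCTC   # requires network access
# processor = AutoProcessor.from_pretrained(repo_id)
# model = HubertForCTC.from_pretrained(repo_id)
```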

This artifact is intended to serve as the phoneme-posterior generator for the P003 compact-backbone comparison:

  • wav2vec2-base (95M) vs
  • HuBERT-base (95M) vs
  • w2v-bert-2.0 (600M) vs
  • future smaller backbones such as Citrinet

Intended uses & limitations

Intended use:

  • research on phoneme posterior extraction
  • backend for GOP-SF feature extraction
  • controlled backbone swaps inside the P003 pronunciation-scoring pipeline
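As an illustration of the first two uses, the sketch below derives frame-level phoneme posteriors and a greedy phone sequence from CTC logits. A random array stands in for the model's real output, and the blank id of 0 is an assumption, not a confirmed detail of this vocabulary:

```python
import numpy as np

rng = np.random.default_rng(0)
BLANK_ID = 0                                # assumption: CTC blank is token 0
logits = rng.normal(size=(50, 41))          # (time, vocab); stand-in for model output

# frame-level phoneme posteriors: softmax over the 41-token vocabulary axis
exps = np.exp(logits - logits.max(axis=-1, keepdims=True))
posteriors = exps / exps.sum(axis=-1, keepdims=True)

# greedy CTC decode: argmax per frame, collapse repeats, drop blanks
ids = posteriors.argmax(axis=-1).tolist()
collapsed = [cur for prev, cur in zip([None] + ids[:-1], ids) if cur != prev]
phone_ids = [i for i in collapsed if i != BLANK_ID]
```

GOP-style features are typically computed from the `posteriors` matrix rather than from the decoded sequence, which is why the posterior generator matters for the downstream pipeline.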

Not intended as:

  • a general-purpose English ASR model
  • a production transcription endpoint
  • a standalone pronunciation assessor without the downstream GOP/GOPT pipeline

Important limitation:

  • the trainer-side eval_per metric reported here is only the phoneme-CTC validation metric tracked during fine-tuning
  • the meaningful P003 research result is phone-level PCC on SpeechOcean762 after GOP/GOPT evaluation, which is tracked separately
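For reference, phoneme error rate is the edit distance between reference and hypothesis phone sequences, normalized by reference length. A self-contained sketch of that definition (not the Trainer's own implementation):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance via the classic one-row dynamic program."""
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution / match
            prev = cur
    return dp[-1]

def per(ref_phones, hyp_phones):
    """Phoneme error rate: edit distance normalized by reference length."""
    return edit_distance(ref_phones, hyp_phones) / len(ref_phones)
```

For example, `per("K AE T".split(), "K AA T".split())` is one substitution over three reference phones, i.e. 1/3.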

Training and evaluation data

Training data:

  • LibriSpeech 960h (train_clean_100, train_clean_360, train_other_500)

Trainer-side evaluation data:

  • LibriSpeech validation split used by the Hugging Face Trainer

Research evaluation data:

  • SpeechOcean762 is used later in the P003 eval sweep, not in this training run

Training procedure

This run used the shared P003 phoneme-head recipe and completed all planned training steps locally on March 6, 2026.

Selection policy:

  • training ran to global_step = 13182
  • the exported root artifact was selected from the best checkpoint
  • best checkpoint by tracked trainer metric: checkpoint-8500
  • best tracked trainer metric:
    • eval_per = 0.9988901220865705 at step 8500
  • final trainer eval at step 13000:
    • eval_loss = 0.10874085873365402
    • eval_per = 0.9992600813910469
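The selection policy above amounts to taking the checkpoint with the lowest tracked eval_per. Using the two step/metric pairs reported on this card:

```python
# eval_per values from this card's training log (lower is better)
tracked = {8500: 0.9988901220865705, 13000: 0.9992600813910469}
best_step = min(tracked, key=tracked.get)   # lowest eval_per wins -> 8500
```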

Artifact integrity:

  • exported root model.safetensors matches local checkpoint-8500
  • pushed Hugging Face model.safetensors matches that same artifact
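One hypothetical way to verify this kind of artifact integrity is to compare SHA-256 digests of the two weight files; the file names and contents below are stand-ins for the real model.safetensors copies:

```python
import hashlib
import pathlib
import tempfile

def sha256_of(path):
    """Stream a file through SHA-256 in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# demo with two temp copies standing in for the local checkpoint and pushed export
with tempfile.TemporaryDirectory() as d:
    local = pathlib.Path(d, "checkpoint-8500.safetensors")
    pushed = pathlib.Path(d, "pushed.safetensors")
    local.write_bytes(b"demo weights")
    pushed.write_bytes(b"demo weights")
    digests_match = sha256_of(local) == sha256_of(pushed)
```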

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 3e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • gradient_accumulation_steps: 8
  • total_train_batch_size: 64
  • optimizer: AdamW (torch fused) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 1318
  • num_epochs: 3
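Two quick consistency checks on these numbers: the effective batch size is the per-device batch size times the accumulation steps, and the warmup covers roughly 10% of the 13182 optimizer steps reported under the training procedure:

```python
# hyperparameters as reported on this card
train_batch_size = 8
gradient_accumulation_steps = 8
total_train_batch_size = train_batch_size * gradient_accumulation_steps  # 64

total_steps = 13182
warmup_steps = 1318
warmup_fraction = warmup_steps / total_steps  # ~0.1
```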

Training results

These are trainer-side phoneme-CTC validation metrics, not final pronunciation scoring metrics.

| Training Loss | Epoch  | Step  | Validation Loss | Eval PER |
|:-------------:|:------:|:-----:|:---------------:|:--------:|
| 27.9965       | 0.1138 | 500   | 3.5980          | 1.0      |
| 27.0035       | 0.2276 | 1000  | 3.4275          | 1.0      |
| 4.2237        | 0.3414 | 1500  | 0.4311          | 1.0      |
| 2.6999        | 0.4552 | 2000  | 0.2837          | 1.0      |
| 2.0393        | 0.5690 | 2500  | 0.2120          | 1.0      |
| 1.6892        | 0.6828 | 3000  | 0.1822          | 0.9996   |
| 1.5122        | 0.7966 | 3500  | 0.1585          | 0.9993   |
| 1.3881        | 0.9104 | 4000  | 0.1543          | 0.9993   |
| 1.2519        | 1.0241 | 4500  | 0.1409          | 0.9996   |
| 1.2038        | 1.1379 | 5000  | 0.1329          | 0.9993   |
| 1.1547        | 1.2517 | 5500  | 0.1339          | 0.9996   |
| 1.1525        | 1.3655 | 6000  | 0.1276          | 0.9993   |
| 1.0925        | 1.4793 | 6500  | 0.1328          | 0.9993   |
| 1.0814        | 1.5931 | 7000  | 0.1172          | 0.9993   |
| 1.0529        | 1.7069 | 7500  | 0.1149          | 0.9993   |
| 1.0264        | 1.8207 | 8000  | 0.1172          | 0.9993   |
| 1.0404        | 1.9345 | 8500  | 0.1141          | 0.9989   |
| 0.9425        | 2.0483 | 9000  | 0.1150          | 0.9993   |
| 0.9543        | 2.1621 | 9500  | 0.1157          | 0.9993   |
| 0.9436        | 2.2759 | 10000 | 0.1175          | 0.9989   |
| 0.9406        | 2.3897 | 10500 | 0.1097          | 0.9993   |
| 0.9180        | 2.5035 | 11000 | 0.1096          | 0.9993   |
| 0.9193        | 2.6173 | 11500 | 0.1112          | 0.9993   |
| 0.9005        | 2.7311 | 12000 | 0.1105          | 0.9993   |
| 0.8939        | 2.8449 | 12500 | 0.1083          | 0.9993   |
| 0.9091        | 2.9587 | 13000 | 0.1087          | 0.9993   |

Research status

  • Training artifact: complete
  • Hugging Face export: complete
  • P003 pronunciation-scoring sweep: pending / separate from this card

Framework versions

  • Transformers 5.2.0
  • Pytorch 2.8.0+cu128
  • Datasets 4.5.0
  • Tokenizers 0.22.2