ModernBERT-base Disfluency Detection — Real Data Baseline

Fine-tuned from answerdotai/ModernBERT-base using only real data from FluencyBank Timestamped (Romana et al., 2024).

Purpose

This model serves as the Experiment A baseline in an ablation study. The comparison quantifies the contribution of the synthetic data augmentation pipeline.

FluencyBank Timestamped: 3,430 segments from 37 adults who stutter. Split: 80/10/10 train/val/test (random_state=42). No synthetic data was used.
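A minimal sketch of a seeded 80/10/10 split, to show the resulting partition sizes for 3,430 segments. The `random_state=42` notation suggests the original split was made with scikit-learn, but that is not stated here; this version uses only the standard library, so its shuffle order will not reproduce the actual split. The helper name `split_80_10_10` and the integer segment IDs are hypothetical.

```python
import random


def split_80_10_10(items, seed=42):
    """Shuffle a copy of `items` with a fixed seed and cut it 80/10/10.

    Hypothetical helper: illustrates the split ratios only and does not
    reproduce the exact partition used for this model.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)

    n = len(shuffled)
    n_train = n * 8 // 10   # integer arithmetic avoids float rounding
    n_val = n // 10

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]   # remainder goes to test
    return train, val, test


# Placeholder IDs standing in for the 3,430 FluencyBank Timestamped segments.
segment_ids = list(range(3430))
train, val, test = split_80_10_10(segment_ids)
print(len(train), len(val), len(test))
```

With 3,430 segments this gives 2,744 training, 343 validation, and 343 test segments. Whether the original split was made per segment or per speaker is not stated; this sketch assumes per segment.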

FP > PW > RP > RV (corrected from the original FP > RP > RV > PW). This allows ~2,048 real PW tokens to be labeled correctly.
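The sketch below shows one way an ordering like the one above could act as a tie-breaker when a token carries more than one disfluency annotation. It assumes that reading of the ordering; the function `resolve_label` and the fallback tag `"O"` for fluent tokens are hypothetical and are not taken from the training code.

```python
# Priority lists, highest priority first, as given in the model card.
NEW_PRIORITY = ["FP", "PW", "RP", "RV"]
OLD_PRIORITY = ["FP", "RP", "RV", "PW"]


def resolve_label(candidates, priority, default="O"):
    """Return the highest-priority label present in `candidates`.

    `candidates` is the set of annotations attached to one token.
    `default` (an assumed tag for fluent tokens) is returned when the
    token has no annotation from the priority list.
    """
    for label in priority:
        if label in candidates:
            return label
    return default


# A token annotated as both PW and RP:
overlapping = {"PW", "RP"}
print(resolve_label(overlapping, OLD_PRIORITY))  # RP: the PW annotation is lost
print(resolve_label(overlapping, NEW_PRIORITY))  # PW: the PW annotation is kept
```

Under the old ordering PW sat last, so any token that also carried RP or RV lost its PW label. Moving PW ahead of both keeps it.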

Safetensors

Model size: 0.1B params

Tensor type: F32