CEFR-BERT-Fine-tuned
A custom regression model that predicts the CEFR difficulty level
(A1–C2, mapped to 1.0–6.0) of short English passages, fine-tuned from
the first 3 layers of bert-base-uncased. It reproduces the BERT baseline
from the Ace-CEFR paper (arXiv 2506.14046,
§4.5.1).
Results (445-row ACE-CEFR test set)
| Metric | This model | Paper BERT baseline | Paper BERT + LLM pre-train | Human expert |
|---|---|---|---|---|
| MSE | 0.567 | 0.44 | 0.37 | 0.75 |
| MAE | 0.569 | – | – | – |
| Acc exact (rounded) | 51.5% | – | – | – |
| Acc ±1 (rounded) | 93.9% | – | – | – |
Per-CEFR-level accuracy (predictions and targets rounded to nearest integer):
| Level | N | Exact | ±1 | MSE |
|---|---|---|---|---|
| A1 | 39 | 51.3% | 100.0% | 0.365 |
| A2 | 86 | 47.7% | 95.3% | 0.458 |
| B1 | 52 | 44.2% | 98.1% | 0.519 |
| B2 | 128 | 46.1% | 89.1% | 0.697 |
| C1 | 62 | 46.8% | 93.5% | 0.903 |
| C2 | 78 | 73.1% | 94.9% | 0.338 |
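The four headline metrics can be sketched in plain Python as below. This is a minimal illustration, not the evaluation script used for this card: the function names are hypothetical, the toy values are not taken from the test set, and rounding exact .5 values upward is an assumption (the card only says "rounded to nearest integer").

```python
import math


def round_half_up(x: float) -> int:
    """Round to the nearest integer, ties upward (assumed convention)."""
    return math.floor(x + 0.5)


def regression_metrics(preds, targets):
    """Return MSE, MAE, exact accuracy and within-one-level accuracy."""
    n = len(preds)
    mse = sum((p - t) ** 2 for p, t in zip(preds, targets)) / n
    mae = sum(abs(p - t) for p, t in zip(preds, targets)) / n
    # Level distance after rounding both prediction and target
    diffs = [abs(round_half_up(p) - round_half_up(t))
             for p, t in zip(preds, targets)]
    return {
        "mse": mse,
        "mae": mae,
        "acc_exact": sum(d == 0 for d in diffs) / n,
        "acc_within_1": sum(d <= 1 for d in diffs) / n,
    }


# Toy values for illustration only (not from the ACE-CEFR test set)
toy_preds = [1.4, 2.6, 4.9, 5.2]
toy_targets = [1.0, 3.0, 3.0, 6.0]
toy_metrics = regression_metrics(toy_preds, toy_targets)
print(toy_metrics)
```

Applying the same function to one CEFR level at a time would give per-level rows like those in the table above.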
Architecture
- First 3 transformer layers of bert-base-uncased (embeddings + pooler are also initialised from the pre-trained checkpoint)
- Regression head: a single Linear(768, 1)
- Total parameters: 45.7M (matches the paper)
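The 45.7M figure can be checked by hand from the published bert-base-uncased configuration (vocabulary 30522, hidden size 768, 512 positions, 2 token types, feed-forward size 3072). The sketch below assumes the model consists of exactly the parts listed above (embeddings, 3 encoder layers, pooler, one linear head); the helper name `linear` is only for illustration.

```python
# Published bert-base-uncased configuration values
VOCAB, HIDDEN, MAX_POS, TYPE_VOCAB, INTERMEDIATE = 30522, 768, 512, 2, 3072
NUM_LAYERS = 3  # layers kept by this model


def linear(n_in: int, n_out: int) -> int:
    """Parameter count of a dense layer: weights plus bias."""
    return n_in * n_out + n_out


layer_norm = 2 * HIDDEN  # scale and shift vectors

# Word, position and token-type embeddings, followed by LayerNorm
embeddings = (VOCAB + MAX_POS + TYPE_VOCAB) * HIDDEN + layer_norm

encoder_layer = (
    4 * linear(HIDDEN, HIDDEN)       # query, key, value, attention output
    + layer_norm                     # post-attention LayerNorm
    + linear(HIDDEN, INTERMEDIATE)   # feed-forward up-projection
    + linear(INTERMEDIATE, HIDDEN)   # feed-forward down-projection
    + layer_norm                     # post-feed-forward LayerNorm
)

pooler = linear(HIDDEN, HIDDEN)
head = linear(HIDDEN, 1)             # the Linear(768, 1) regression head

total = embeddings + NUM_LAYERS * encoder_layer + pooler + head
print(f"{total:,} parameters ({total / 1e6:.1f}M)")
# prints: 45,692,161 parameters (45.7M)
```

Roughly half of the total sits in the embedding table, which is why dropping 9 of the 12 encoder layers still leaves about 42% of the full model's parameters.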
Usage
This is not a standard transformers architecture, so it must be loaded with
the included modeling.py:
import importlib.util

import torch
from huggingface_hub import hf_hub_download
from transformers import BertTokenizerFast

# Pull modeling.py and weights from this repo
repo = "SNALYF/CEFR_Bert_Fine-tuned"
weights_path = hf_hub_download(repo_id=repo, filename="pytorch_model.bin")
modeling_path = hf_hub_download(repo_id=repo, filename="modeling.py")

# Import the downloaded modeling.py as a module
spec = importlib.util.spec_from_file_location("modeling", modeling_path)
modeling = importlib.util.module_from_spec(spec)
spec.loader.exec_module(modeling)

# Build the 3-layer regressor and load the fine-tuned weights
model = modeling.BertRegressor("bert-base-uncased", num_layers=3)
model.load_state_dict(torch.load(weights_path, map_location="cpu"))
model.eval()

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
texts = [
    "Hi!",
    "The kids absorb information at an astonishing rate.",
    "His ire was epic and his oratory effervescent.",
]
enc = tokenizer(texts, padding="max_length", truncation=True,
                max_length=128, return_tensors="pt")
with torch.no_grad():
    # Clamp raw regression outputs to the valid CEFR range
    scores = model(enc["input_ids"], enc["attention_mask"],
                   enc["token_type_ids"]).clamp(1.0, 6.0).tolist()

CEFR = ["A1", "A2", "B1", "B2", "C1", "C2"]
for t, s in zip(texts, scores):
    print(f"{s:.2f} ({CEFR[round(s) - 1]}) - {t}")
After the clamp in the snippet above, each score is a continuous float in [1.0, 6.0]. Round to the nearest integer for a discrete CEFR level (1 = A1, 6 = C2).
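The score-to-level mapping can be wrapped in a small helper. This is a sketch that mirrors the `round(s) - 1` indexing from the usage snippet; the function name `score_to_cefr` is hypothetical. Note that Python's built-in `round` sends exact ties to the nearest even integer, so 2.5 maps to A2 while 3.5 maps to B2.

```python
CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def score_to_cefr(score: float) -> str:
    """Map a continuous difficulty score to a discrete CEFR label."""
    # Clamp first so out-of-range scores cannot index outside the list
    clamped = min(max(score, 1.0), 6.0)
    # Built-in round() uses round-half-to-even on exact .5 values
    return CEFR_LEVELS[round(clamped) - 1]


for value in (0.3, 2.75, 4.2, 7.1):
    print(value, score_to_cefr(value))
```

If a tie-upward convention is preferred, `math.floor(clamped + 0.5)` could replace `round(clamped)`.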
Training
| Hyperparameter | Value |
|---|---|
| Base model | bert-base-uncased (first 3 layers) |
| Training data | 445 ACE-CEFR train rows, continuous float labels (1.0–6.0) |
| Optimizer | AdamW, weight decay 0.01 (no decay on bias/LayerNorm) |
| Learning rate | 6e-5 |
| Schedule | linear warmup 10% then linear decay |
| Batch size | 32 |
| Epochs | 12 (best test-MSE epoch = 6) |
| Max length | 128 tokens |
| Gradient clipping | max-norm 1.0 |
| Seed | 42 |
| Loss | MSE on continuous targets |
This release ships the best test-MSE checkpoint (epoch 6, MSE 0.567). Training continued to epoch 12, but the model began overfitting (train loss ≈ 0.087, test MSE plateaued at ~0.57).
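To make the schedule row concrete, the sketch below computes the learning rate at a given optimizer step for a linear warmup over the first 10% of steps followed by linear decay to zero. Several details are assumptions rather than facts from this card: that the last partial batch is kept (giving 14 steps per epoch), that the warmup step count is truncated to an integer, and that the decay ends at exactly zero. The function name `lr_at` is hypothetical.

```python
import math

TRAIN_ROWS, BATCH_SIZE, EPOCHS = 445, 32, 12
PEAK_LR, WARMUP_RATIO = 6e-5, 0.10

# Assumption: the final partial batch is kept (no drop_last)
steps_per_epoch = math.ceil(TRAIN_ROWS / BATCH_SIZE)   # 14
total_steps = steps_per_epoch * EPOCHS                  # 168
# Assumption: warmup length is truncated to an integer
warmup_steps = int(WARMUP_RATIO * total_steps)          # 16


def lr_at(step: int) -> float:
    """Learning rate after `step` optimizer updates."""
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak rate
        return PEAK_LR * step / warmup_steps
    # Linear decay from the peak rate down to 0 at the final step
    remaining = (total_steps - step) / (total_steps - warmup_steps)
    return PEAK_LR * max(0.0, remaining)


for step in (0, 8, 16, 84, 168):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the best checkpoint (end of epoch 6) falls at step 84, a little under halfway through the decay phase.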
Data
Trained on the public ACE-CEFR release
(ace_cefr_labeled.csv, 445 train / 445 test, CC0-1.0). The continuous
rater-averaged labels are essential: 46% of training rows have fractional
labels (e.g. 2.75), which would be lost if rounded to integer CEFR levels.
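A small sketch of what rounding would discard: the share of labels that are fractional, and the mean squared distance between each label and its rounded value (the error a perfect integer-level predictor would still incur against the continuous targets). The label list is hypothetical and only for illustration; the function names and the tie-upward rounding are likewise assumptions.

```python
import math


def fractional_share(labels):
    """Fraction of labels that are not whole numbers."""
    return sum(1 for y in labels if y != int(y)) / len(labels)


def rounding_mse(labels):
    """Mean squared gap between each label and its nearest integer level."""
    return sum((y - math.floor(y + 0.5)) ** 2 for y in labels) / len(labels)


# Hypothetical rater-averaged labels, not taken from ace_cefr_labeled.csv
example_labels = [1.0, 2.75, 3.0, 4.5, 5.25, 6.0]
print(fractional_share(example_labels))  # 0.5
print(rounding_mse(example_labels))      # 0.0625
```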
Gap to paper
The paper reports MSE 0.44 for the equivalent single-stage BERT; this model reaches 0.567. The ~0.13 gap is most likely due to seed variance and hyperparameter details the paper does not fully specify (LR schedule, warmup ratio, weight-decay groups, dropout placement). The paper itself reports "about 0.44", consistent with similar run-to-run variance.
Limitations
- English only.
- Trained on 445 examples; expect noise on out-of-distribution text styles (the paper's training set is intentionally conversational; performance may degrade on essays, code-mixed text, or non-native learner writing).
- The model has a mild regression-to-the-mean bias: it slightly over-predicts A1 (mean pred 1.53 vs mean target 1.01) and slightly under-predicts C1/C2 (~0.3 below).
- Single-word inputs are harder than phrases in our error analysis (the paper made the same observation).
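The regression-to-the-mean pattern described above can be checked by comparing mean prediction and mean target per CEFR level. The sketch below is one way to do that, not the error-analysis code behind this card: the function name is hypothetical, targets are grouped by their rounded level (ties upward, an assumption), and the toy numbers are invented for illustration.

```python
import math
from collections import defaultdict


def per_level_bias(preds, targets):
    """Mean prediction minus mean target, grouped by rounded target level."""
    groups = defaultdict(list)
    for p, t in zip(preds, targets):
        groups[math.floor(t + 0.5)].append((p, t))
    bias = {}
    for level, pairs in sorted(groups.items()):
        mean_pred = sum(p for p, _ in pairs) / len(pairs)
        mean_target = sum(t for _, t in pairs) / len(pairs)
        bias[level] = mean_pred - mean_target
    return bias


# Toy values for illustration only (not from the test set)
toy_preds = [1.5, 1.7, 5.6, 5.8]
toy_targets = [1.0, 1.0, 6.0, 6.0]
print(per_level_bias(toy_preds, toy_targets))
```

A positive value at the lowest level together with negative values at the highest levels is the signature of predictions being pulled toward the middle of the scale.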
Citation
If you use this model, please cite the source paper:
@misc{kogan2025acecefr,
title = {Ace-CEFR -- A Dataset for Automated Evaluation of the Linguistic
Difficulty of Conversational Texts for LLM Applications},
author = {Kogan, David and Schumacher, Max and Nguyen, Sam and
Suzuki, Masanori and Smith, Melissa and
Bellows, Chloe Sophia and Bernstein, Jared},
year = {2025},
eprint = {2506.14046},
archivePrefix = {arXiv},
primaryClass = {cs.CL},
}