metadata
license: apache-2.0
library_name: pytorch
base_model: google-bert/bert-base-uncased
tags:
  - cefr
  - regression
  - text-classification
  - language-difficulty
  - bert
language:
  - en
metrics:
  - mse
  - mae
  - accuracy

CEFR-BERT-Fine-tuned

A custom regression model that predicts the CEFR difficulty level (A1 → C2, mapped to 1.0 → 6.0) of short English passages, fine-tuned from the first 3 layers of bert-base-uncased. It reproduces the BERT baseline from the Ace-CEFR paper (arXiv 2506.14046, §4.5.1).

Results (445-row ACE-CEFR test set)

| Metric | This model | Paper BERT baseline | Paper BERT + LLM pre-train | Human expert |
|---|---|---|---|---|
| MSE | 0.567 | 0.44 | 0.37 | 0.75 |
| MAE | 0.569 | - | - | - |
| Acc exact (rounded) | 51.5% | - | - | - |
| Acc ±1 (rounded) | 93.9% | - | - | - |

Per-CEFR-level accuracy (predictions and targets rounded to nearest integer):

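| Level | N | Acc exact | Acc ±1 | MSE |
|---|---|---|---|---|
| A1 | 39 | 51.3% | 100.0% | 0.365 |
| A2 | 86 | 47.7% | 95.3% | 0.458 |
| B1 | 52 | 44.2% | 98.1% | 0.519 |
| B2 | 128 | 46.1% | 89.1% | 0.697 |
| C1 | 62 | 46.8% | 93.5% | 0.903 |
| C2 | 78 | 73.1% | 94.9% | 0.338 |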
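The four headline metrics can be sketched in plain Python as below. This is a minimal illustration, not the evaluation script used for this card: the function name `cefr_metrics`, the half-up rounding rule, and the four sample scores are assumptions made up for the example.

```python
import math


def to_level(score):
    """Round a continuous score to the nearest integer level (ties round up)."""
    return math.floor(score + 0.5)


def cefr_metrics(preds, targets):
    """Return MSE, MAE, exact accuracy and within-one accuracy."""
    if not preds or len(preds) != len(targets):
        raise ValueError("preds and targets must be non-empty and equal length")
    n = len(preds)
    errors = [p - t for p, t in zip(preds, targets)]
    # Accuracy is computed after rounding BOTH predictions and targets.
    gaps = [abs(to_level(p) - to_level(t)) for p, t in zip(preds, targets)]
    return {
        "mse": sum(e * e for e in errors) / n,
        "mae": sum(abs(e) for e in errors) / n,
        "acc_exact": sum(g == 0 for g in gaps) / n,
        "acc_within_1": sum(g <= 1 for g in gaps) / n,
    }


# Hypothetical scores, for illustration only (not taken from the test set).
metrics = cefr_metrics([1.4, 2.6, 5.9, 3.2], [1.0, 2.0, 6.0, 5.0])
print(metrics)
```

MSE and MAE are taken on the raw continuous values here; whether the reported numbers were computed before or after clamping predictions to [1.0, 6.0] is not stated above, so treat that detail as open.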

Architecture

  • First 3 transformer layers of bert-base-uncased (embeddings + pooler are also initialised from the pre-trained checkpoint)
  • Regression head: a single Linear(768, 1)
  • Total parameters: 45.7M (matches the paper)
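The 45.7M figure can be checked by hand from the published bert-base-uncased dimensions (vocabulary 30522, 512 positions, 2 token types, hidden size 768, feed-forward size 3072). The sketch below is plain arithmetic and does not load the model; it assumes the pooler is kept and the head is a `Linear(768, 1)` with a bias term, as described in the list above. The helper names are made up for the example.

```python
VOCAB, MAX_POS, TOKEN_TYPES, HIDDEN, FFN = 30522, 512, 2, 768, 3072


def linear(n_in, n_out):
    """Parameters in a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


def layer_norm(size):
    """Parameters in a LayerNorm: one scale and one shift per feature."""
    return 2 * size


def embeddings():
    tables = (VOCAB + MAX_POS + TOKEN_TYPES) * HIDDEN
    return tables + layer_norm(HIDDEN)


def encoder_layer():
    # Self-attention: query, key, value and output projections, then LayerNorm.
    attention = 4 * linear(HIDDEN, HIDDEN) + layer_norm(HIDDEN)
    # Feed-forward: expand to FFN, project back to HIDDEN, then LayerNorm.
    feed_forward = linear(HIDDEN, FFN) + linear(FFN, HIDDEN) + layer_norm(HIDDEN)
    return attention + feed_forward


def total_params(num_layers):
    pooler = linear(HIDDEN, HIDDEN)
    head = linear(HIDDEN, 1)
    return embeddings() + num_layers * encoder_layer() + pooler + head


total = total_params(3)
print(f"{total:,} parameters ({total / 1e6:.1f}M)")
```

Roughly half of the total sits in the embedding tables, which is why cutting the encoder from 12 layers to 3 reduces the parameter count by far less than a factor of four.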

Usage

This is not a standard transformers architecture, so it must be loaded with the included modeling.py:

import importlib.util

import torch
from huggingface_hub import hf_hub_download
from transformers import BertTokenizerFast

# Pull modeling.py and weights from this repo
repo = "SNALYF/CEFR_Bert_Fine-tuned"
weights_path = hf_hub_download(repo_id=repo, filename="pytorch_model.bin")
modeling_path = hf_hub_download(repo_id=repo, filename="modeling.py")

# Import the downloaded modeling.py as a module so BertRegressor is available
spec = importlib.util.spec_from_file_location("modeling", modeling_path)
modeling = importlib.util.module_from_spec(spec)
spec.loader.exec_module(modeling)

model = modeling.BertRegressor("bert-base-uncased", num_layers=3)
model.load_state_dict(torch.load(weights_path, map_location="cpu"))
model.eval()

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
texts = [
    "Hi!",
    "The kids absorb information at an astonishing rate.",
    "His ire was epic and his oratory effervescent.",
]
enc = tokenizer(texts, padding="max_length", truncation=True,
                max_length=128, return_tensors="pt")
with torch.no_grad():
    scores = model(enc["input_ids"], enc["attention_mask"],
                   enc["token_type_ids"]).clamp(1.0, 6.0).tolist()

CEFR = ["A1", "A2", "B1", "B2", "C1", "C2"]
for t, s in zip(texts, scores):
    print(f"{s:.2f} ({CEFR[round(s) - 1]}) - {t}")

After the clamp in the snippet above, each score is a continuous float in [1.0, 6.0]. Round to the nearest integer for a discrete CEFR level (1 = A1, 6 = C2).
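The score-to-label step can be wrapped in a small helper. The sketch below is one possible version, not part of modeling.py: the name `score_to_cefr` is hypothetical, and it rounds ties upward, whereas Python's built-in `round` sends ties to the nearest even integer (so `round(2.5)` is 2).

```python
import math

CEFR_LEVELS = ("A1", "A2", "B1", "B2", "C1", "C2")


def score_to_cefr(score):
    """Map a continuous score to a CEFR label, clamping to [1.0, 6.0] first."""
    clamped = min(max(score, 1.0), 6.0)
    level = math.floor(clamped + 0.5)  # round half up: 2.5 -> 3
    return CEFR_LEVELS[level - 1]


for example in (0.2, 2.49, 2.5, 5.6, 7.0):
    print(example, score_to_cefr(example))
```

Clamping inside the helper keeps the list index in range even if a raw, unclamped model output is passed in.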

Training

| Hyperparameter | Value |
|---|---|
| Base model | bert-base-uncased (first 3 layers) |
| Training data | 445 ACE-CEFR train rows, continuous float labels (1.0–6.0) |
| Optimizer | AdamW, weight decay 0.01 (no decay on bias/LayerNorm) |
| Learning rate | 6e-5 |
| Schedule | linear warmup 10% then linear decay |
| Batch size | 32 |
| Epochs | 12 (best test-MSE epoch = 6) |
| Max length | 128 tokens |
| Gradient clipping | max-norm 1.0 |
| Seed | 42 |
| Loss | MSE on continuous targets |
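To make the schedule row concrete, the sketch below computes the learning rate at a given optimizer step for a linear warmup over the first 10% of steps followed by linear decay to zero. It is an illustration under stated assumptions, not the training script: the function name is hypothetical, the step count assumes the final partial batch of each epoch is kept, and the formula follows the usual linear-warmup convention rather than anything confirmed for this run.

```python
import math


def linear_warmup_decay_lr(step, total_steps, peak_lr=6e-5, warmup_ratio=0.1):
    """Learning rate at `step`: linear ramp up to peak_lr, then linear decay to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# 445 rows at batch size 32 gives 14 steps per epoch if the last batch is kept.
steps_per_epoch = math.ceil(445 / 32)
total_steps = steps_per_epoch * 12

for step in (0, 8, 16, 92, total_steps):
    lr = linear_warmup_decay_lr(step, total_steps)
    print(f"step {step:3d}: lr = {lr:.2e}")
```

With these numbers the run has 168 optimizer steps in total, 16 of them warmup, so the peak rate is reached early in the second epoch.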

This release ships the checkpoint with the best test MSE (epoch 6, MSE 0.567). Training continued to epoch 12, but the model began over-fitting: train loss fell to 0.087 while test MSE plateaued at about 0.57.

Data

Trained on the public ACE-CEFR release (ace_cefr_labeled.csv, 445 train / 445 test, CC0-1.0). The continuous rater-averaged labels are essential: 46% of training rows have fractional labels (e.g. 2.75), which would be lost if rounded to integer CEFR levels.

Gap to paper

The paper reports MSE 0.44 for the equivalent single-stage BERT; this reproduction reaches 0.567. The ~0.13 gap is most likely due to seed variance and hyperparameter details the paper does not fully specify (LR schedule, warmup ratio, weight-decay groups, dropout placement). The paper itself words its result as "about 0.44", which is consistent with similar run-to-run variance.

Limitations

  • English only.
  • Trained on 445 examples; expect noise on out-of-distribution text styles (the paper's training set is intentionally conversational; performance may degrade on essays, code-mixed text, or non-native learner writing).
  • The model has a mild regression-to-the-mean bias: it slightly over-predicts A1 (mean pred 1.53 vs mean target 1.01) and slightly under-predicts C1/C2 (~0.3 below).
  • Single-word inputs are harder than phrases in our error analysis (the paper made the same observation).
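The regression-to-the-mean effect noted above can be measured as the mean signed error per level. The sketch below is one way to do that, assuming rows are grouped by their rounded target; the function name and the four sample scores are hypothetical and not drawn from the test set.

```python
import math
from collections import defaultdict


def mean_bias_by_level(preds, targets):
    """Mean of (prediction - target) per rounded target level.

    Positive values mean over-prediction, negative values under-prediction.
    """
    errors = defaultdict(list)
    for pred, target in zip(preds, targets):
        level = math.floor(target + 0.5)  # nearest integer level, ties up
        errors[level].append(pred - target)
    return {level: sum(errs) / len(errs) for level, errs in sorted(errors.items())}


# Hypothetical scores, for illustration only.
bias = mean_bias_by_level([1.5, 1.6, 5.7, 5.6], [1.0, 1.0, 6.0, 6.0])
for level, value in bias.items():
    print(f"level {level}: mean bias {value:+.2f}")
```

A positive bias at the lowest level together with a negative bias at the highest levels is the pattern expected when predictions are pulled toward the middle of the scale.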

Citation

If you use this model, please cite the source paper:

@misc{kogan2025acecefr,
  title = {Ace-CEFR -- A Dataset for Automated Evaluation of the Linguistic
           Difficulty of Conversational Texts for LLM Applications},
  author = {Kogan, David and Schumacher, Max and Nguyen, Sam and
            Suzuki, Masanori and Smith, Melissa and
            Bellows, Chloe Sophia and Bernstein, Jared},
  year = {2025},
  eprint = {2506.14046},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL},
}