
license: apache-2.0
library_name: pytorch
base_model: google-bert/bert-base-uncased
tags:
  - cefr
  - regression
  - text-classification
  - language-difficulty
  - bert
language:
  - en
metrics:
  - mse
  - mae
  - accuracy

CEFR-BERT-Fine-tuned

A custom regression model that predicts the CEFR difficulty level of short English passages on a continuous scale from 1.0 (A1) to 6.0 (C2). It is fine-tuned from the first 3 layers of bert-base-uncased and reproduces the BERT baseline from the Ace-CEFR paper (arXiv:2506.14046, §4.5.1).

Results (445-row ACE-CEFR test set)

Metric	This model	Paper BERT baseline	Paper BERT + LLM pre-train	Human expert
MSE	0.567	0.44	0.37	0.75
MAE	0.569	—	—	—
Acc exact (rounded)	51.5%	—	—	—
Acc ±1 (rounded)	93.9%	—	—	—
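
For readers who want to recompute the four metrics in the table, here is a minimal sketch. The function name `cefr_metrics` is hypothetical, and the rounding rule is an assumption: it uses Python's built-in `round()`, as the usage snippet in this card does, which sends exact .5 ties to the even integer. The card does not say whether predictions were clamped before scoring, so no clamping is applied here.

```python
def cefr_metrics(preds, targets):
    """Return MSE, MAE, exact accuracy and within-one-level accuracy.

    preds and targets are equal-length sequences of floats on the
    1.0 (A1) to 6.0 (C2) scale.
    """
    if len(preds) != len(targets) or len(preds) == 0:
        raise ValueError("preds and targets must be non-empty and equal length")
    n = len(preds)

    # Regression metrics are computed on the raw continuous values.
    mse = sum((p - t) ** 2 for p, t in zip(preds, targets)) / n
    mae = sum(abs(p - t) for p, t in zip(preds, targets)) / n

    # Accuracy metrics compare integer levels after rounding both sides.
    # Assumption: built-in round(), which rounds exact .5 ties to even.
    rounded = [(round(p), round(t)) for p, t in zip(preds, targets)]
    acc_exact = sum(rp == rt for rp, rt in rounded) / n
    acc_within_1 = sum(abs(rp - rt) <= 1 for rp, rt in rounded) / n

    return {
        "mse": mse,
        "mae": mae,
        "acc_exact": acc_exact,
        "acc_within_1": acc_within_1,
    }


# Toy values for illustration only; they are not taken from the test set.
example = cefr_metrics([1.2, 3.6, 5.0], [1.0, 3.0, 3.0])
print(example)
```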

Per-CEFR-level accuracy (predictions and targets rounded to nearest integer):

Level	N	Exact	±1	MSE
A1	39	51.3%	100.0%	0.365
A2	86	47.7%	95.3%	0.458
B1	52	44.2%	98.1%	0.519
B2	128	46.1%	89.1%	0.697
C1	62	46.8%	93.5%	0.903
C2	78	73.1%	94.9%	0.338
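
As a sanity check, the per-level rows can be combined into overall figures by weighting each row by its N. The sketch below uses only the numbers printed in the two tables; the variable and function names are illustrative.

```python
# (N, exact accuracy, within-one accuracy, MSE) copied from the per-level table.
per_level = {
    "A1": (39, 0.513, 1.000, 0.365),
    "A2": (86, 0.477, 0.953, 0.458),
    "B1": (52, 0.442, 0.981, 0.519),
    "B2": (128, 0.461, 0.891, 0.697),
    "C1": (62, 0.468, 0.935, 0.903),
    "C2": (78, 0.731, 0.949, 0.338),
}

total_n = sum(row[0] for row in per_level.values())


def weighted_mean(column):
    """N-weighted mean of one column (1 = exact, 2 = within one, 3 = MSE)."""
    return sum(row[0] * row[column] for row in per_level.values()) / total_n


overall_exact = weighted_mean(1)
overall_within_1 = weighted_mean(2)
overall_mse = weighted_mean(3)

print(f"N = {total_n}")
print(f"Acc exact = {overall_exact:.1%}")
print(f"Acc within 1 = {overall_within_1:.1%}")
print(f"MSE = {overall_mse:.3f}")
```

The weighted values agree with the headline results (445 rows, 51.5% exact, 93.9% within one level, MSE 0.567) up to the rounding of the per-level entries.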

Architecture

First 3 transformer layers of bert-base-uncased (embeddings + pooler are also initialised from the pre-trained checkpoint)
Regression head: a single Linear(768, 1)
Total parameters: 45.7M (matches the paper)
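
The 45.7M figure can be checked by hand from the standard bert-base-uncased dimensions (vocabulary 30522, hidden size 768, 512 positions, 2 token types, feed-forward size 3072). The sketch below is plain arithmetic and does not load the model; it assumes the components listed above (embeddings, 3 encoder layers, pooler, one linear head) and nothing else.

```python
VOCAB, MAX_POS, TYPE_VOCAB = 30522, 512, 2
HIDDEN, INTERMEDIATE = 768, 3072
NUM_LAYERS = 3


def linear_params(n_in, n_out):
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


def layer_norm_params(dim):
    """Scale and shift vectors of a LayerNorm."""
    return 2 * dim


# Word, position and token-type embeddings, followed by a LayerNorm.
embeddings = (VOCAB + MAX_POS + TYPE_VOCAB) * HIDDEN + layer_norm_params(HIDDEN)

# One encoder layer: Q, K, V and output projections, then the feed-forward
# block, with a LayerNorm after the attention and after the feed-forward part.
encoder_layer = (
    4 * linear_params(HIDDEN, HIDDEN)
    + layer_norm_params(HIDDEN)
    + linear_params(HIDDEN, INTERMEDIATE)
    + linear_params(INTERMEDIATE, HIDDEN)
    + layer_norm_params(HIDDEN)
)

pooler = linear_params(HIDDEN, HIDDEN)
regression_head = linear_params(HIDDEN, 1)

total_params = embeddings + NUM_LAYERS * encoder_layer + pooler + regression_head
print(f"{total_params:,} parameters ({total_params / 1e6:.1f}M)")
```

Roughly half of the total sits in the embedding table, which is why keeping only 3 of the 12 encoder layers does not shrink the model by three quarters.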

Usage

This is not a standard transformers architecture, so it must be loaded with the included modeling.py:

import importlib.util

import torch
from huggingface_hub import hf_hub_download
from transformers import BertTokenizerFast

# Pull modeling.py and weights from this repo
repo = "SNALYF/CEFR_Bert_Fine-tuned"
weights_path = hf_hub_download(repo_id=repo, filename="pytorch_model.bin")
modeling_path = hf_hub_download(repo_id=repo, filename="modeling.py")

# Import the downloaded modeling.py as a regular Python module
spec = importlib.util.spec_from_file_location("modeling", modeling_path)
modeling = importlib.util.module_from_spec(spec)
spec.loader.exec_module(modeling)

# Rebuild the 3-layer regressor, then load the fine-tuned weights on CPU
model = modeling.BertRegressor("bert-base-uncased", num_layers=3)
model.load_state_dict(torch.load(weights_path, map_location="cpu"))
model.eval()

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
texts = [
    "Hi!",
    "The kids absorb information at an astonishing rate.",
    "His ire was epic and his oratory effervescent.",
]
enc = tokenizer(texts, padding="max_length", truncation=True,
                max_length=128, return_tensors="pt")
with torch.no_grad():
    scores = model(enc["input_ids"], enc["attention_mask"],
                   enc["token_type_ids"]).clamp(1.0, 6.0).tolist()

CEFR = ["A1", "A2", "B1", "B2", "C1", "C2"]
for t, s in zip(texts, scores):
    print(f"{s:.2f} ({CEFR[round(s) - 1]}) — {t}")

The example clamps the model output to [1.0, 6.0] because the linear regression head itself is unbounded. Round the clamped score to the nearest integer for a discrete CEFR level (1 = A1, 6 = C2).
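
The score-to-level mapping can be wrapped in a small helper. This is a sketch with a hypothetical name, `score_to_cefr`, and it inherits one detail from Python's built-in `round()`: exact .5 ties go to the even integer, so 2.5 maps to A2 rather than B1.

```python
CEFR_LEVELS = ("A1", "A2", "B1", "B2", "C1", "C2")


def score_to_cefr(score):
    """Map a continuous score to a CEFR label.

    The score is clamped to [1.0, 6.0] first so that out-of-range
    predictions still map to a valid level.
    """
    clamped = min(max(float(score), 1.0), 6.0)
    # round() sends exact .5 ties to the even integer (2.5 -> 2, 3.5 -> 4).
    return CEFR_LEVELS[round(clamped) - 1]


for s in (0.4, 1.0, 2.5, 3.6, 7.2):
    print(s, score_to_cefr(s))
```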

Training

Hyperparameter	Value
Base model	`bert-base-uncased` (first 3 layers)
Training data	445 ACE-CEFR train rows, continuous float labels (1.0–6.0)
Optimizer	AdamW, weight decay 0.01 (no decay on bias/LayerNorm)
Learning rate	6e-5
Schedule	linear warmup 10% then linear decay
Batch size	32
Epochs	12 (best test-MSE epoch = 6)
Max length	128 tokens
Gradient clipping	max-norm 1.0
Seed	42
Loss	MSE on continuous targets
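
To make the schedule row concrete, the sketch below computes the learning rate at a given optimizer step for linear warmup followed by linear decay. It is an illustration under stated assumptions, not the training script: it assumes the last partial batch of each epoch is kept (so 14 steps per epoch), that warmup covers `int(0.10 * total_steps)` steps, and that the rate decays to zero at the final step. The function name is hypothetical.

```python
import math


def linear_warmup_decay_lr(step, total_steps, warmup_steps, peak_lr):
    """Learning rate at `step` for linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Values from the hyperparameter table.
TRAIN_ROWS, BATCH_SIZE, EPOCHS, PEAK_LR = 445, 32, 12, 6e-5

# Assumption: the last, smaller batch of each epoch is not dropped.
steps_per_epoch = math.ceil(TRAIN_ROWS / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
# Assumption: the 10% warmup is truncated to a whole number of steps.
warmup_steps = int(0.10 * total_steps)

for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
    lr = linear_warmup_decay_lr(step, total_steps, warmup_steps, PEAK_LR)
    print(f"step {step:3d}: lr = {lr:.2e}")
```

With only 445 training rows, the whole run is short under these assumptions, so the warmup phase ends partway through the second epoch.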

This release ships the best test-MSE checkpoint (epoch 6, MSE 0.567). Training continued to epoch 12, but the model began to overfit: train loss fell to 0.087 while test MSE plateaued at about 0.57.

Data

Trained on the public ACE-CEFR release (ace_cefr_labeled.csv, 445 train / 445 test, CC0-1.0). The continuous rater-averaged labels are essential: 46% of training rows have fractional labels (e.g. 2.75), which would be lost if rounded to integer CEFR levels.
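
The share of fractional labels is easy to measure once the label column is loaded as floats. The sketch below uses a hypothetical helper, `fractional_share`, and a handful of made-up labels; it does not read ace_cefr_labeled.csv, whose column names are not given here.

```python
def fractional_share(labels, tol=1e-9):
    """Fraction of labels that are not whole numbers."""
    if not labels:
        raise ValueError("labels must be non-empty")
    fractional = sum(abs(x - round(x)) > tol for x in labels)
    return fractional / len(labels)


# Illustrative labels only, not rows from the dataset.
sample_labels = [1.0, 2.75, 3.0, 4.5]
print(f"{fractional_share(sample_labels):.0%} of the sample is fractional")

# Rounding a fractional label changes the regression target itself:
# 2.75 would become 3, shifting the target by 0.25 levels.
print(abs(2.75 - round(2.75)))
```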

Gap to paper

The paper reports MSE 0.44 for the equivalent single-stage BERT; this reproduction reaches 0.567. The gap of about 0.13 is most likely due to seed variance and to hyperparameter details the paper does not fully specify (LR schedule, warmup ratio, weight-decay groups, dropout placement). The paper itself reports "about 0.44", which is consistent with similar run-to-run variance.

Limitations

English only.
Trained on 445 examples; expect noise on out-of-distribution text styles (the paper's training set is intentionally conversational; performance may degrade on essays, code-mixed text, or non-native learner writing).
The model has a mild regression-to-the-mean bias: it slightly over-predicts A1 (mean pred 1.53 vs mean target 1.01) and slightly under-predicts C1/C2 (~0.3 below).
Single-word inputs are harder than phrases in our error analysis (the paper made the same observation).
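
The regression-to-the-mean bias noted above can be measured as the mean signed error (prediction minus target) within each rounded target level. A minimal sketch follows, with a hypothetical function name and toy inputs; a positive value means over-prediction, as reported for A1, and a negative value means under-prediction, as reported for C1/C2.

```python
from collections import defaultdict


def mean_bias_by_level(preds, targets):
    """Mean of (prediction - target), grouped by the rounded target level."""
    totals = defaultdict(lambda: [0.0, 0])  # level -> [sum of errors, count]
    for p, t in zip(preds, targets):
        level = round(t)
        totals[level][0] += p - t
        totals[level][1] += 1
    return {level: s / n for level, (s, n) in sorted(totals.items())}


# Toy values for illustration only; they are not model outputs.
bias = mean_bias_by_level([1.5, 1.6, 5.7], [1.0, 1.0, 6.0])
for level, value in bias.items():
    print(f"level {level}: mean bias {value:+.2f}")
```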

Citation

If you use this model, please cite the source paper:

@misc{kogan2025acecefr,
  title = {Ace-CEFR — A Dataset for Automated Evaluation of the Linguistic
           Difficulty of Conversational Texts for LLM Applications},
  author = {Kogan, David and Schumacher, Max and Nguyen, Sam and
            Suzuki, Masanori and Smith, Melissa and
            Bellows, Chloe Sophia and Bernstein, Jared},
  year = {2025},
  eprint = {2506.14046},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL},
}