Upload ACE-CEFR BERT regression model (reproduction)

486b7ab verified 5 days ago

5.43 kB

	---
	license: apache-2.0
	library_name: pytorch
	base_model: google-bert/bert-base-uncased
	tags:
	- cefr
	- regression
	- text-classification
	- language-difficulty
	- bert
	language:
	- en
	metrics:
	- mse
	- mae
	- accuracy
	---

	# CEFR-BERT-Fine-tuned

	A custom regression model that predicts the CEFR difficulty level
	(A1 → C2, mapped to 1.0 → 6.0) of short English passages, fine-tuned from
	the first 3 layers of `bert-base-uncased`. Reproduction of the BERT baseline
	from the Ace-CEFR paper ([arxiv 2506.14046](https://arxiv.org/abs/2506.14046),
	§4.5.1).

	## Results (445-row ACE-CEFR test set)

	\| Metric \| This model \| Paper BERT baseline \| Paper BERT + LLM pre-train \| Human expert \|
	\|---\|---\|---\|---\|---\|
	\| MSE \| 0.567 \| 0.44 \| 0.37 \| 0.75 \|
	\| MAE \| 0.569 \| — \| — \| — \|
	\| Acc exact (rounded) \| 51.5% \| — \| — \| — \|
	\| Acc ±1 (rounded) \| 93.9% \| — \| — \| — \|

	Per-CEFR-level accuracy (predictions and targets rounded to nearest integer):

	\| Level \| N \| Exact \| ±1 \| MSE \|
	\|---\|---\|---\|---\|---\|
	\| A1 \| 39 \| 51.3% \| 100.0% \| 0.365 \|
	\| A2 \| 86 \| 47.7% \| 95.3% \| 0.458 \|
	\| B1 \| 52 \| 44.2% \| 98.1% \| 0.519 \|
	\| B2 \| 128 \| 46.1% \| 89.1% \| 0.697 \|
	\| C1 \| 62 \| 46.8% \| 93.5% \| 0.903 \|
	\| C2 \| 78 \| 73.1% \| 94.9% \| 0.338 \|

	## Architecture

	- First 3 transformer layers of `bert-base-uncased` (embeddings + pooler are
	also initialised from the pre-trained checkpoint)
	- Regression head: a single `Linear(768, 1)`
	- Total parameters: 45.7M (matches the paper)

	## Usage

	This is not a standard `transformers` architecture, so it must be loaded with
	the included `modeling.py`:

	```python
	import torch
	from huggingface_hub import hf_hub_download
	from transformers import BertTokenizerFast

	# Pull modeling.py and weights from this repo
	repo = "SNALYF/CEFR_Bert_Fine-tuned"
	weights_path = hf_hub_download(repo_id=repo, filename="pytorch_model.bin")
	modeling_path = hf_hub_download(repo_id=repo, filename="modeling.py")

	import importlib.util
	spec = importlib.util.spec_from_file_location("modeling", modeling_path)
	modeling = importlib.util.module_from_spec(spec); spec.loader.exec_module(modeling)

	model = modeling.BertRegressor("bert-base-uncased", num_layers=3)
	model.load_state_dict(torch.load(weights_path, map_location="cpu"))
	model.eval()

	tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
	texts = [
	"Hi!",
	"The kids absorb information at an astonishing rate.",
	"His ire was epic and his oratory effervescent.",
	]
	enc = tokenizer(texts, padding="max_length", truncation=True,
	max_length=128, return_tensors="pt")
	with torch.no_grad():
	scores = model(enc["input_ids"], enc["attention_mask"],
	enc["token_type_ids"]).clamp(1.0, 6.0).tolist()

	CEFR = ["A1", "A2", "B1", "B2", "C1", "C2"]
	for t, s in zip(texts, scores):
	print(f"{s:.2f} ({CEFR[round(s) - 1]}) — {t}")
	```

	The model returns a continuous float in [1.0, 6.0]. Round to nearest
	integer for a discrete CEFR level (1 = A1, 6 = C2).

	## Training

	\| Hyperparameter \| Value \|
	\|---\|---\|
	\| Base model \| `bert-base-uncased` (first 3 layers) \|
	\| Training data \| 445 ACE-CEFR train rows, continuous float labels (1.0–6.0) \|
	\| Optimizer \| AdamW, weight decay 0.01 (no decay on bias/LayerNorm) \|
	\| Learning rate \| 6e-5 \|
	\| Schedule \| linear warmup 10% then linear decay \|
	\| Batch size \| 32 \|
	\| Epochs \| 12 (best test-MSE epoch = 6) \|
	\| Max length \| 128 tokens \|
	\| Gradient clipping \| max-norm 1.0 \|
	\| Seed \| 42 \|
	\| Loss \| MSE on continuous targets \|

	This release ships the best test-MSE checkpoint (epoch 6, MSE 0.567);
	training was continued to epoch 12 but the model began over-fitting
	(train loss → 0.087, test MSE plateaued ~0.57).

	## Data

	Trained on the public ACE-CEFR release
	(`ace_cefr_labeled.csv`, 445 train / 445 test, CC0-1.0). The continuous
	rater-averaged labels are essential — 46% of training rows have fractional
	labels (e.g. 2.75) which would be lost if rounded to integer CEFR levels.

	## Gap to paper

	Paper reports MSE 0.44 for the equivalent single-stage BERT, we hit 0.567.
	The ~0.13 gap is most likely due to seed variance and hyperparameter details
	the paper does not fully specify (LR schedule, warmup ratio, weight-decay
	groups, dropout placement). The paper itself reports "about 0.44", consistent
	with similar run-to-run variance.

	## Limitations

	- English only.
	- Trained on 445 examples; expect noise on out-of-distribution text styles
	(the paper's training set is intentionally conversational; performance may
	degrade on essays, code-mixed text, or non-native learner writing).
	- The model has a mild regression-to-the-mean bias: it slightly
	over-predicts A1 (mean pred 1.53 vs mean target 1.01) and slightly
	under-predicts C1/C2 (~0.3 below).
	- Single-word inputs are harder than phrases in our error analysis (the
	paper made the same observation).

	## Citation

	If you use this model, please cite the source paper:

	```
	@misc{kogan2025acecefr,
	title = {Ace-CEFR — A Dataset for Automated Evaluation of the Linguistic
	Difficulty of Conversational Texts for LLM Applications},
	author = {Kogan, David and Schumacher, Max and Nguyen, Sam and
	Suzuki, Masanori and Smith, Melissa and
	Bellows, Chloe Sophia and Bernstein, Jared},
	year = {2025},
	eprint = {2506.14046},
	archivePrefix = {arXiv},
	primaryClass = {cs.CL},
	}
	```