| --- |
| license: apache-2.0 |
| library_name: pytorch |
| base_model: google-bert/bert-base-uncased |
| tags: |
| - cefr |
| - regression |
| - text-classification |
| - language-difficulty |
| - bert |
| language: |
| - en |
| metrics: |
| - mse |
| - mae |
| - accuracy |
| --- |
| |
| # CEFR-BERT-Fine-tuned |
|
|
| A custom **regression** model that predicts the CEFR difficulty level |
| (A1–C2, mapped to 1.0–6.0) of short English passages, fine-tuned from
| the first 3 layers of `bert-base-uncased`. It reproduces the BERT baseline
| from the Ace-CEFR paper ([arXiv 2506.14046](https://arxiv.org/abs/2506.14046),
| §4.5.1).
|
|
| ## Results (445-row ACE-CEFR test set) |
|
|
| | Metric | This model | Paper BERT baseline | Paper BERT + LLM pre-train | Human expert | |
| |---|---|---|---|---| |
| | **MSE** | **0.567** | 0.44 | 0.37 | 0.75 | |
| MAE | 0.569 | – | – | – |
| Acc exact (rounded) | **51.5%** | – | – | – |
| Acc ±1 (rounded) | **93.9%** | – | – | – |
|
|
| Per-CEFR-level accuracy (predictions and targets rounded to nearest integer): |
|
|
| Level | N | Exact | ±1 | MSE |
| |---|---|---|---|---| |
| | A1 | 39 | 51.3% | **100.0%** | 0.365 | |
| | A2 | 86 | 47.7% | 95.3% | 0.458 | |
| | B1 | 52 | 44.2% | 98.1% | 0.519 | |
| | B2 | 128 | 46.1% | 89.1% | 0.697 | |
| | C1 | 62 | 46.8% | 93.5% | 0.903 | |
| | C2 | 78 | **73.1%** | 94.9% | 0.338 | |
|
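The metrics in the tables above can be reproduced from raw predictions with a few lines of Python. The sketch below is a minimal illustration, not the evaluation script used for this card: `evaluate`, `round_half_up`, and the sample values are hypothetical, and rounding halves up (rather than Python's built-in round-half-to-even) is an assumption.

```python
import math

def round_half_up(x: float) -> int:
    """Round to the nearest integer, with .5 rounding up (assumed convention)."""
    return math.floor(x + 0.5)

def evaluate(preds, targets):
    """Return MSE, MAE, exact accuracy and within-one-level accuracy."""
    n = len(preds)
    errors = [p - t for p, t in zip(preds, targets)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    # Accuracy is computed after rounding both predictions and targets.
    levels = [(round_half_up(p), round_half_up(t)) for p, t in zip(preds, targets)]
    exact = sum(p == t for p, t in levels) / n
    within_one = sum(abs(p - t) <= 1 for p, t in levels) / n
    return {"mse": mse, "mae": mae, "acc_exact": exact, "acc_within_1": within_one}

# Hypothetical predictions and targets, for illustration only.
metrics = evaluate(preds=[1.4, 3.2, 5.9, 2.0], targets=[1.0, 3.0, 4.0, 2.75])
print(metrics)
```

Restricting `preds` and `targets` to the rows of a single level gives per-level figures in the same way.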
|
| ## Architecture |
|
|
| - First 3 transformer layers of `bert-base-uncased` (embeddings + pooler are |
| also initialised from the pre-trained checkpoint) |
| - Regression head: a single `Linear(768, 1)` |
| - Total parameters: **45.7M** (matches the paper) |
|
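The 45.7M figure can be checked by hand from the published `bert-base-uncased` dimensions (vocabulary 30,522, hidden size 768, feed-forward size 3,072, 512 positions, 2 token types). The arithmetic below is a sketch that assumes the standard BERT layer layout and counts the embeddings, three encoder layers, the pooler, and the `Linear(768, 1)` head; the variable names are illustrative.

```python
VOCAB, HIDDEN, FFN, MAX_POS, TYPES, LAYERS = 30522, 768, 3072, 512, 2, 3

def linear(n_in: int, n_out: int) -> int:
    """Parameters of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out

layer_norm = 2 * HIDDEN  # scale and shift vectors

# Word, position and token-type embeddings plus their LayerNorm.
embeddings = (VOCAB + MAX_POS + TYPES) * HIDDEN + layer_norm

encoder_layer = (
    4 * linear(HIDDEN, HIDDEN)   # query, key, value, attention output
    + layer_norm
    + linear(HIDDEN, FFN)        # feed-forward expansion
    + linear(FFN, HIDDEN)        # feed-forward projection
    + layer_norm
)
pooler = linear(HIDDEN, HIDDEN)
head = linear(HIDDEN, 1)         # regression head

total = embeddings + LAYERS * encoder_layer + pooler + head
print(f"{total:,} parameters ({total / 1e6:.1f}M)")
```

Roughly half of the total sits in the embedding table, which is why dropping nine of the twelve encoder layers still leaves a model of this size.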
|
| ## Usage |
|
|
| This is not a standard `transformers` architecture, so it must be loaded with |
| the included `modeling.py`: |
|
|
| ```python |
| import torch |
| from huggingface_hub import hf_hub_download |
| from transformers import BertTokenizerFast |
| |
| # Pull modeling.py and weights from this repo |
| repo = "SNALYF/CEFR_Bert_Fine-tuned" |
| weights_path = hf_hub_download(repo_id=repo, filename="pytorch_model.bin") |
| modeling_path = hf_hub_download(repo_id=repo, filename="modeling.py") |
| |
| # Load modeling.py as a module so BertRegressor can be instantiated |
| import importlib.util |
| spec = importlib.util.spec_from_file_location("modeling", modeling_path) |
| modeling = importlib.util.module_from_spec(spec) |
| spec.loader.exec_module(modeling) |
| |
| model = modeling.BertRegressor("bert-base-uncased", num_layers=3) |
| model.load_state_dict(torch.load(weights_path, map_location="cpu")) |
| model.eval() |
| |
| tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased") |
| texts = [ |
|     "Hi!", |
|     "The kids absorb information at an astonishing rate.", |
|     "His ire was epic and his oratory effervescent.", |
| ] |
| enc = tokenizer(texts, padding="max_length", truncation=True, |
|                 max_length=128, return_tensors="pt") |
| with torch.no_grad(): |
|     scores = model(enc["input_ids"], enc["attention_mask"], |
|                    enc["token_type_ids"]).clamp(1.0, 6.0).tolist() |
| |
| CEFR = ["A1", "A2", "B1", "B2", "C1", "C2"] |
| for t, s in zip(texts, scores): |
|     print(f"{s:.2f} ({CEFR[round(s) - 1]}) - {t}") |
| ``` |
|
|
| The snippet yields a continuous score clamped to [1.0, 6.0]. Round to the
| nearest integer for a discrete CEFR level (1 = A1, 6 = C2).
|
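Python's built-in `round` rounds halves to the nearest even integer (`round(2.5)` is 2), so borderline scores can land one level lower than expected. A small helper, sketched here with the hypothetical name `score_to_cefr`, clamps the score and rounds halves up instead. Whether half-up matches the rounding used for the accuracy figures above is an assumption.

```python
import math

CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

def score_to_cefr(score: float) -> str:
    """Map a continuous score to a CEFR label (1 = A1, 6 = C2)."""
    clamped = min(max(score, 1.0), 6.0)   # keep the score inside [1.0, 6.0]
    level = math.floor(clamped + 0.5)     # round half up
    return CEFR_LEVELS[level - 1]

for s in (0.7, 2.5, 3.49, 6.3):
    print(s, score_to_cefr(s))
```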
|
| ## Training |
|
|
| | Hyperparameter | Value | |
| |---|---| |
| | Base model | `bert-base-uncased` (first 3 layers) | |
| Training data | 445 ACE-CEFR train rows, continuous float labels (1.0–6.0) |
| | Optimizer | AdamW, weight decay 0.01 (no decay on bias/LayerNorm) | |
| | Learning rate | 6e-5 | |
| | Schedule | linear warmup 10% then linear decay | |
| | Batch size | 32 | |
| | Epochs | 12 (best test-MSE epoch = 6) | |
| | Max length | 128 tokens | |
| | Gradient clipping | max-norm 1.0 | |
| | Seed | 42 | |
| | Loss | MSE on continuous targets | |
|
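The schedule row can be made concrete with a small function. The sketch below assumes 14 optimizer steps per epoch (445 rows at batch size 32, keeping the last partial batch), so 168 steps in total, with the warmup length taken as 10% of that, rounded down. `lr_at_step` is a hypothetical helper, and the actual training script may count steps differently.

```python
import math

PEAK_LR = 6e-5
STEPS_PER_EPOCH = math.ceil(445 / 32)      # 14, assuming the last partial batch is kept
TOTAL_STEPS = STEPS_PER_EPOCH * 12         # 168 steps over 12 epochs
WARMUP_STEPS = int(0.10 * TOTAL_STEPS)     # 16 (assumed rounding)

def lr_at_step(step: int) -> float:
    """Linear warmup from 0 to PEAK_LR, then linear decay back to 0."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)

for step in (0, 8, 16, 92, 168):
    print(step, f"{lr_at_step(step):.2e}")
```

This is the same shape that `get_linear_schedule_with_warmup` in `transformers` produces; the card does not state which implementation was used.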
|
| This release ships the **best test-MSE checkpoint** (epoch 6, MSE 0.567).
| Training continued to epoch 12, but the model began over-fitting: train
| loss dropped to 0.087 while test MSE plateaued at about 0.57.
|
|
| ## Data |
|
|
| Trained on the public ACE-CEFR release
| (`ace_cefr_labeled.csv`, 445 train / 445 test, CC0-1.0). The continuous
| rater-averaged labels are essential: 46% of training rows have fractional
| labels (e.g. 2.75), which would be lost if rounded to integer CEFR levels.
|
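To see why rounding hurts, consider a label such as 2.75: rounded to 3, it carries a squared error of 0.0625 before the model has predicted anything. The sketch below uses a handful of made-up labels (not rows from the dataset) to show how the fractional share and the rounding-induced error could be measured.

```python
import math

# Hypothetical rater-averaged labels, for illustration only.
labels = [1.0, 2.75, 3.0, 4.5, 5.25, 6.0]

# A label is fractional if it is not a whole CEFR level.
fractional = [y for y in labels if y != int(y)]
share = len(fractional) / len(labels)

# Squared error introduced purely by rounding each label half up.
rounding_mse = sum((math.floor(y + 0.5) - y) ** 2 for y in labels) / len(labels)

print(f"fractional share: {share:.0%}")
print(f"MSE from rounding alone: {rounding_mse:.4f}")
```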
|
| ## Gap to paper |
|
|
| The paper reports an MSE of 0.44 for the equivalent single-stage BERT; this
| model reaches 0.567. The ~0.13 gap is most likely due to seed variance and
| hyperparameter details the paper does not fully specify (LR schedule, warmup
| ratio, weight-decay groups, dropout placement). The paper itself gives the
| figure as "about 0.44", which suggests some run-to-run variance on its side
| as well.
|
|
| ## Limitations |
|
|
| - English only. |
| - Trained on 445 examples; expect noise on out-of-distribution text styles |
| (the paper's training set is intentionally conversational; performance may |
| degrade on essays, code-mixed text, or non-native learner writing). |
| - The model has a mild regression-to-the-mean bias: it slightly |
| over-predicts A1 (mean pred 1.53 vs mean target 1.01) and slightly |
| under-predicts C1/C2 (~0.3 below). |
| - Single-word inputs are harder than phrases in our error analysis (the |
| paper made the same observation). |
|
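The regression-to-the-mean pattern can be checked by comparing mean prediction and mean target within each level. The following sketch groups rows by the rounded target; `bias_by_level` and the sample numbers are hypothetical and are not the values reported above.

```python
import math
from collections import defaultdict

CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

def bias_by_level(preds, targets):
    """Mean (prediction - target) per CEFR level; positive means over-prediction."""
    diffs = defaultdict(list)
    for p, t in zip(preds, targets):
        level = CEFR_LEVELS[math.floor(t + 0.5) - 1]  # bucket by rounded target
        diffs[level].append(p - t)
    return {level: sum(d) / len(d) for level, d in diffs.items()}

# Made-up values: low levels pulled up, high levels pulled down.
bias = bias_by_level(preds=[1.5, 1.75, 5.5, 5.75], targets=[1.0, 1.25, 6.0, 6.0])
print(bias)
```

A positive value at the bottom of the scale together with a negative value at the top is the signature described in the list above.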
|
| ## Citation |
|
|
| If you use this model, please cite the source paper: |
|
|
| ``` |
| @misc{kogan2025acecefr, |
| title = {Ace-CEFR -- A Dataset for Automated Evaluation of the Linguistic |
| Difficulty of Conversational Texts for LLM Applications}, |
| author = {Kogan, David and Schumacher, Max and Nguyen, Sam and |
| Suzuki, Masanori and Smith, Melissa and |
| Bellows, Chloe Sophia and Bernstein, Jared}, |
| year = {2025}, |
| eprint = {2506.14046}, |
| archivePrefix = {arXiv}, |
| primaryClass = {cs.CL}, |
| } |
| ``` |
|
|