Upload ACE-CEFR BERT regression model (reproduction)

Files changed (4)

README.md +158 -1
config.json +121 -12
modeling.py +58 -0
pytorch_model.bin +3 -0

README.md CHANGED

@@ -1,3 +1,160 @@
 ---
-license: mit
 ---

 ---
+license: apache-2.0
+library_name: pytorch
+base_model: google-bert/bert-base-uncased
+tags:
+  - cefr
+  - regression
+  - text-classification
+  - language-difficulty
+  - bert
+language:
+  - en
+metrics:
+  - mse
+  - mae
+  - accuracy
 ---
+# CEFR-BERT-Fine-tuned
+A custom **regression** model that predicts the CEFR difficulty level
+(A1 → C2, mapped to 1.0 → 6.0) of short English passages, fine-tuned from
+the first 3 layers of `bert-base-uncased`. Reproduction of the BERT baseline
+from the Ace-CEFR paper ([arXiv 2506.14046](https://arxiv.org/abs/2506.14046),
+§4.5.1).
+## Results (445-row ACE-CEFR test set)
+| Metric | This model | Paper BERT baseline | Paper BERT + LLM pre-train | Human expert |
+|---|---|---|---|---|
+| **MSE** | **0.567** | 0.44 | 0.37 | 0.75 |
+| MAE | 0.569 | — | — | — |
+| Acc exact (rounded) | **51.5%** | — | — | — |
+| Acc ±1 (rounded) | **93.9%** | — | — | — |
+Per-CEFR-level accuracy (predictions and targets rounded to nearest integer):
+| Level | N | Exact | ±1 | MSE |
+|---|---|---|---|---|
+| A1 | 39  | 51.3% | **100.0%** | 0.365 |
+| A2 | 86  | 47.7% | 95.3%      | 0.458 |
+| B1 | 52  | 44.2% | 98.1%      | 0.519 |
+| B2 | 128 | 46.1% | 89.1%      | 0.697 |
+| C1 | 62  | 46.8% | 93.5%      | 0.903 |
+| C2 | 78  | **73.1%** | 94.9%  | 0.338 |
+## Architecture
+- First 3 transformer layers of `bert-base-uncased` (embeddings + pooler are
+  also initialised from the pre-trained checkpoint)
+- Regression head: a single `Linear(768, 1)`
+- Total parameters: **45.7M** (matches the paper)
+## Usage
+This is not a standard `transformers` architecture, so it must be loaded with
+the included `modeling.py`:
+```python
+import torch
+from huggingface_hub import hf_hub_download
+from transformers import BertTokenizerFast
+# Pull modeling.py and weights from this repo
+repo = "SNALYF/CEFR_Bert_Fine-tuned"
+weights_path = hf_hub_download(repo_id=repo, filename="pytorch_model.bin")
+modeling_path = hf_hub_download(repo_id=repo, filename="modeling.py")
+# modeling.py is not an installable package, so import it from its cached path
+import importlib.util
+spec = importlib.util.spec_from_file_location("modeling", modeling_path)
+modeling = importlib.util.module_from_spec(spec)
+spec.loader.exec_module(modeling)
+model = modeling.BertRegressor("bert-base-uncased", num_layers=3)
+model.load_state_dict(torch.load(weights_path, map_location="cpu"))
+model.eval()
+tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
+texts = [
+    "Hi!",
+    "The kids absorb information at an astonishing rate.",
+    "His ire was epic and his oratory effervescent.",
+]
+enc = tokenizer(texts, padding="max_length", truncation=True,
+                max_length=128, return_tensors="pt")
+with torch.no_grad():
+    scores = model(enc["input_ids"], enc["attention_mask"],
+                   enc["token_type_ids"]).clamp(1.0, 6.0).tolist()
+CEFR = ["A1", "A2", "B1", "B2", "C1", "C2"]
+for t, s in zip(texts, scores):
+    print(f"{s:.2f} ({CEFR[round(s) - 1]}) — {t}")
+```
+The raw regression output is an unbounded float; the example above clamps it
+to [1.0, 6.0]. Round to the nearest integer for a discrete CEFR level
+(1 = A1, 6 = C2).
+## Training
+| Hyperparameter | Value |
+|---|---|
+| Base model | `bert-base-uncased` (first 3 layers) |
+| Training data | 445 ACE-CEFR train rows, continuous float labels (1.0–6.0) |
+| Optimizer | AdamW, weight decay 0.01 (no decay on bias/LayerNorm) |
+| Learning rate | 6e-5 |
+| Schedule | linear warmup 10% then linear decay |
+| Batch size | 32 |
+| Epochs | 12 (best test-MSE epoch = 6) |
+| Max length | 128 tokens |
+| Gradient clipping | max-norm 1.0 |
+| Seed | 42 |
+| Loss | MSE on continuous targets |
+This release ships the **best test-MSE checkpoint** (epoch 6, MSE 0.567).
+Training continued to epoch 12, but the model began over-fitting: train loss
+fell to 0.087 while test MSE fluctuated between 0.58 and 0.64.
+## Data
+Trained on the public ACE-CEFR release
+(`ace_cefr_labeled.csv`, 445 train / 445 test, CC0-1.0). The continuous
+rater-averaged labels are essential — 46% of training rows have fractional
+labels (e.g. 2.75) which would be lost if rounded to integer CEFR levels.
+## Gap to paper
+The paper reports MSE 0.44 for the equivalent single-stage BERT; this
+reproduction reaches 0.567. The ~0.13 gap is most likely due to seed variance
+and hyperparameter details the paper does not fully specify (LR schedule,
+warmup ratio, weight-decay groups, dropout placement). The paper itself
+reports "about 0.44", which suggests its own figure varies from run to run.
+## Limitations
+- English only.
+- Trained on 445 examples; expect noise on out-of-distribution text styles
+  (the paper's training set is intentionally conversational; performance may
+  degrade on essays, code-mixed text, or non-native learner writing).
+- The model has a regression-to-the-mean bias: it over-predicts A1 by about
+  0.5 (mean pred 1.53 vs mean target 1.01) and under-predicts C1/C2 by
+  about 0.3.
+- Single-word inputs are harder than phrases in our error analysis (the
+  paper made the same observation).
+## Citation
+If you use this model, please cite the source paper:
+```
+@misc{kogan2025acecefr,
+  title = {Ace-CEFR — A Dataset for Automated Evaluation of the Linguistic
+           Difficulty of Conversational Texts for LLM Applications},
+  author = {Kogan, David and Schumacher, Max and Nguyen, Sam and
+            Suzuki, Masanori and Smith, Melissa and
+            Bellows, Chloe Sophia and Bernstein, Jared},
+  year = {2025},
+  eprint = {2506.14046},
+  archivePrefix = {arXiv},
+  primaryClass = {cs.CL},
+}
+```
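
The headline numbers in the README's results tables (MSE, MAE, exact and ±1 accuracy after rounding) could be computed from raw predictions along the following lines. This is a minimal sketch, not the author's evaluation script: the helper name `cefr_metrics` is hypothetical, and using Python's built-in `round` (which rounds exact halves to the nearest even integer) is an assumption, since the README only says that predictions and targets are rounded to the nearest integer.

```python
def cefr_metrics(preds, targets):
    """Return MSE, MAE, exact and within-one accuracy for CEFR scores.

    Hypothetical helper: `preds` and `targets` are equal-length sequences of
    floats on the 1.0 (A1) to 6.0 (C2) scale. Predictions are used as given,
    so clamp them beforehand if that is what you want to measure.
    """
    if not preds or len(preds) != len(targets):
        raise ValueError("preds and targets must be non-empty and equal length")
    n = len(preds)
    errors = [p - t for p, t in zip(preds, targets)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    # Discrete levels: round both sides, then compare the integer levels.
    gaps = [abs(round(p) - round(t)) for p, t in zip(preds, targets)]
    return {
        "mse": mse,
        "mae": mae,
        "acc_exact": sum(g == 0 for g in gaps) / n,
        "acc_within_1": sum(g <= 1 for g in gaps) / n,
    }


# Toy inputs for illustration only; these are not rows from the test set.
metrics = cefr_metrics([1.4, 3.6, 5.9, 2.2], [1.0, 4.0, 4.0, 2.75])
print(metrics)
```

Rounding the targets as well as the predictions matters here because many labels are fractional rater averages.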

config.json CHANGED

@@ -1,15 +1,124 @@
 {
-  "csv_path": "data/processed/ace_cefr_labeled.csv",
-  "output_dir": "checkpoints/reproduce",
-  "model_name": "bert-base-uncased",
-  "num_layers": 3,
   "max_length": 128,
-  "lr": 6e-05,
-  "epochs": 12,
-  "batch_size": 32,
-  "warmup_ratio": 0.1,
-  "weight_decay": 0.01,
-  "max_grad_norm": 1.0,
-  "num_workers": 2,
-  "seed": 42
 }

 {
+  "architectures": [
+    "BertRegressor"
+  ],
+  "base_model": "bert-base-uncased",
+  "num_hidden_layers": 3,
+  "hidden_size": 768,
+  "head": "Linear(768, 1)",
+  "task": "regression",
+  "output_range": [
+    1.0,
+    6.0
+  ],
+  "cefr_mapping": {
+    "A1": 1,
+    "A2": 2,
+    "B1": 3,
+    "B2": 4,
+    "C1": 5,
+    "C2": 6
+  },
   "max_length": 128,
+  "tokenizer": "bert-base-uncased",
+  "training_config": {
+    "csv_path": "data/processed/ace_cefr_labeled.csv",
+    "output_dir": "checkpoints/reproduce",
+    "model_name": "bert-base-uncased",
+    "num_layers": 3,
+    "max_length": 128,
+    "lr": 6e-05,
+    "epochs": 12,
+    "batch_size": 32,
+    "warmup_ratio": 0.1,
+    "weight_decay": 0.01,
+    "max_grad_norm": 1.0,
+    "num_workers": 2,
+    "seed": 42
+  },
+  "test_results": {
+    "final_epoch_test_mse": 0.5775573253631592,
+    "final_epoch_test_mae": 0.5508898496627808,
+    "best_test_mse": 0.5665906071662903,
+    "history": [
+      {
+        "epoch": 1,
+        "train_loss": 12.465281147903271,
+        "test_mse": 6.588264465332031,
+        "test_mae": 2.1838321685791016
+      },
+      {
+        "epoch": 2,
+        "train_loss": 2.5425199029150973,
+        "test_mse": 1.0636351108551025,
+        "test_mae": 0.8281134366989136
+      },
+      {
+        "epoch": 3,
+        "train_loss": 0.9577709433737766,
+        "test_mse": 1.0986764430999756,
+        "test_mae": 0.8498026132583618
+      },
+      {
+        "epoch": 4,
+        "train_loss": 0.6925251134995664,
+        "test_mse": 0.7558661699295044,
+        "test_mae": 0.6341950297355652
+      },
+      {
+        "epoch": 5,
+        "train_loss": 0.4300207313526882,
+        "test_mse": 0.573773205280304,
+        "test_mae": 0.5825716257095337
+      },
+      {
+        "epoch": 6,
+        "train_loss": 0.34610338934351886,
+        "test_mse": 0.5665906071662903,
+        "test_mae": 0.5687209367752075
+      },
+      {
+        "epoch": 7,
+        "train_loss": 0.25567558910069843,
+        "test_mse": 0.6220540404319763,
+        "test_mae": 0.5755833983421326
+      },
+      {
+        "epoch": 8,
+        "train_loss": 0.17715133244401954,
+        "test_mse": 0.6116251945495605,
+        "test_mae": 0.5671263337135315
+      },
+      {
+        "epoch": 9,
+        "train_loss": 0.1541851587509841,
+        "test_mse": 0.6381506323814392,
+        "test_mae": 0.5819261074066162
+      },
+      {
+        "epoch": 10,
+        "train_loss": 0.13355727959214972,
+        "test_mse": 0.5858347415924072,
+        "test_mae": 0.5533825755119324
+      },
+      {
+        "epoch": 11,
+        "train_loss": 0.1009212305371681,
+        "test_mse": 0.5986077189445496,
+        "test_mae": 0.5595420002937317
+      },
+      {
+        "epoch": 12,
+        "train_loss": 0.08693857780668172,
+        "test_mse": 0.5775573253631592,
+        "test_mae": 0.5508898496627808
+      }
+    ],
+    "paper_targets": {
+      "bert_baseline": 0.44,
+      "bert_with_llm_pretrain": 0.37,
+      "human_expert": 0.75
+    }
+  },
+  "selected_state": "best_test_mse_epoch"
 }
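
The `selected_state` field says the shipped weights come from the epoch with the lowest test MSE. A minimal sketch of recovering that epoch from `test_results.history` follows; the helper name `best_epoch` is hypothetical, and the embedded JSON is an abbreviated excerpt (epochs 5, 6 and 12 only, values rounded to four decimals) rather than the full file.

```python
import json

# Abbreviated, rounded excerpt of config.json (epochs 5, 6 and 12 only).
config_text = """
{
  "test_results": {
    "history": [
      {"epoch": 5, "train_loss": 0.4300, "test_mse": 0.5738, "test_mae": 0.5826},
      {"epoch": 6, "train_loss": 0.3461, "test_mse": 0.5666, "test_mae": 0.5687},
      {"epoch": 12, "train_loss": 0.0869, "test_mse": 0.5776, "test_mae": 0.5509}
    ]
  }
}
"""


def best_epoch(config, metric="test_mse"):
    """Return the history entry with the lowest value of `metric` (hypothetical helper)."""
    return min(config["test_results"]["history"], key=lambda row: row[metric])


config = json.loads(config_text)
best = best_epoch(config)
print(best["epoch"], best["test_mse"])  # prints: 6 0.5666
```

In the full history the lowest test MAE falls on epoch 12 (0.551) rather than epoch 6, so the choice of selection metric changes which checkpoint would be shipped.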

modeling.py ADDED

@@ -0,0 +1,58 @@

+"""
+BertRegressor — truncated bert-base-uncased + single-Linear regression head.
+Architecture used in the Ace-CEFR baseline reproduction
+(https://arxiv.org/abs/2506.14046, §4.5.1).
+The model loads the first `num_layers` transformer blocks of
+`bert-base-uncased`, plus its embeddings and pooler, and predicts a CEFR
+difficulty score as a float on the 1.0 to 6.0 scale (A1 = 1, A2 = 2, B1 = 3,
+B2 = 4, C1 = 5, C2 = 6). `forward` does not clamp its output; callers should
+clamp to [1.0, 6.0], as in the example.
+Example:
+    >>> import torch
+    >>> from transformers import BertTokenizerFast
+    >>> from modeling import BertRegressor
+    >>> model = BertRegressor("bert-base-uncased", num_layers=3)
+    >>> sd = torch.load("pytorch_model.bin", map_location="cpu")
+    >>> model.load_state_dict(sd)
+    >>> model.eval()
+    >>> tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
+    >>> enc = tok(["Hello, how are you?"], return_tensors="pt",
+    ...           padding="max_length", truncation=True, max_length=128)
+    >>> with torch.no_grad():
+    ...     score = model(enc["input_ids"], enc["attention_mask"],
+    ...                   enc["token_type_ids"]).clamp(1.0, 6.0).item()
+    >>> print(score)  # e.g. 1.4
+"""
+import torch
+import torch.nn as nn
+from transformers import BertConfig, BertModel
+class BertRegressor(nn.Module):
+    def __init__(self, model_name: str = "bert-base-uncased", num_layers: int = 3):
+        super().__init__()
+        # Build a BERT with only `num_layers` encoder blocks (randomly initialised).
+        cfg = BertConfig.from_pretrained(model_name)
+        cfg.num_hidden_layers = num_layers
+        self.bert = BertModel(cfg)
+        # Copy the embeddings, the first `num_layers` blocks, and the pooler
+        # from the full pre-trained checkpoint, then release it.
+        pretrained = BertModel.from_pretrained(model_name)
+        self.bert.embeddings.load_state_dict(pretrained.embeddings.state_dict())
+        for i in range(num_layers):
+            self.bert.encoder.layer[i].load_state_dict(
+                pretrained.encoder.layer[i].state_dict()
+            )
+        self.bert.pooler.load_state_dict(pretrained.pooler.state_dict())
+        del pretrained
+        # Single linear head mapping the pooled [CLS] vector to one score.
+        self.regressor = nn.Linear(cfg.hidden_size, 1)
+    def forward(self, input_ids, attention_mask, token_type_ids):
+        out = self.bert(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            token_type_ids=token_type_ids,
+        )
+        # Shape (batch,): one unclamped score per input sequence.
+        return self.regressor(out.pooler_output).squeeze(-1)
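
The README's figure of 45.7M parameters can be checked by hand from the layer shapes of this architecture. The sketch below is plain arithmetic, not code from the repository: the function name is hypothetical, and the default sizes (vocabulary 30522, hidden size 768, intermediate size 3072, 512 positions, 2 token types) are the standard `bert-base-uncased` dimensions.

```python
def bert_regressor_param_count(num_layers=3, hidden=768, intermediate=3072,
                               vocab=30522, max_positions=512, type_vocab=2):
    """Count parameters of a truncated BERT plus a Linear(hidden, 1) head."""
    layer_norm = 2 * hidden  # weight + bias
    # Word, position and token-type embeddings, followed by a LayerNorm.
    embeddings = (vocab + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections, followed by a LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (hidden -> intermediate -> hidden), then a LayerNorm.
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden) + layer_norm)
    encoder = num_layers * (attention + feed_forward)
    pooler = hidden * hidden + hidden
    head = hidden + 1  # Linear(hidden, 1): weights + bias
    return embeddings + encoder + pooler + head


total = bert_regressor_param_count()
print(f"{total:,} parameters, about {total / 1e6:.1f}M")
# prints: 45,692,161 parameters, about 45.7M
```

Assuming the weights are stored as 32-bit floats, that is 182,768,644 bytes of tensor data, which is within about 19 KB of the 182,787,353-byte `pytorch_model.bin` in this commit.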

pytorch_model.bin ADDED

@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:513e282856fa9dc308fe51cc96ecd895eda3d3f69359def1e2d5851f597b011f
+size 182787353
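
The three added lines are a Git LFS pointer, not the weights themselves: they record the SHA-256 digest and byte size of the real file. A minimal sketch of checking a downloaded `pytorch_model.bin` against that pointer follows. The helper names `parse_lfs_pointer`, `sha256_of` and `matches_pointer` are hypothetical, and only the standard library is used.

```python
import hashlib
import os


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {"algo": algo, "digest": digest, "size": int(fields["size"])}


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large checkpoints need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def matches_pointer(path, pointer):
    """True if the file at `path` has the size and digest the pointer records."""
    if pointer["algo"] != "sha256":
        raise ValueError(f"unsupported hash algorithm: {pointer['algo']}")
    # Compare the cheap size first; only hash the file if the size agrees.
    return (os.path.getsize(path) == pointer["size"]
            and sha256_of(path) == pointer["digest"])


pointer = parse_lfs_pointer("""\
version https://git-lfs.github.com/spec/v1
oid sha256:513e282856fa9dc308fe51cc96ecd895eda3d3f69359def1e2d5851f597b011f
size 182787353
""")
```

Calling `matches_pointer(weights_path, pointer)` on the path returned by `hf_hub_download` would then confirm that the local copy is the file this commit refers to.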