---
base_model: answerdotai/ModernBERT-base
language:
- en
library_name: transformers
pipeline_tag: text-classification
tags:
- reasoning
- complexity
- education
- regression
- fineweb-edu
---
# Reasoning Complexity Classifier
A ModernBERT-base model fine-tuned to predict the **reasoning complexity** of educational text on a continuous 1–4 scale. Trained on FineWeb-Edu documents labeled by GPT-5-nano via the OpenAI Batch API (~$20 in credits).
## Model Description
This is a regression model (`num_labels=1`, `problem_type="regression"`) that outputs a continuous score. The score can be rounded to the nearest integer to obtain a discrete complexity level. Level 5 (Formal/Abstract reasoning) was excluded from training due to data scarcity; the model's effective range is **1.0–4.0**.
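A minimal inference sketch (the Hub repo id is a placeholder, and the helper names are illustrative; imports are deferred so the pure scoring helper works without `torch`/`transformers` installed):

```python
def to_level(score: float) -> int:
    """Clip a raw regression score to the effective range [1, 4] and round."""
    return int(round(min(max(score, 1.0), 4.0)))

def load(model_id: str):
    """Load tokenizer and single-label regression model from the Hub."""
    from transformers import AutoModelForSequenceClassification, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()
    return tokenizer, model

def predict_levels(texts, model, tokenizer, device="cpu"):
    """Score a batch of texts and return discrete complexity levels."""
    import torch
    enc = tokenizer(
        texts, truncation=True, max_length=512, padding=True, return_tensors="pt"
    ).to(device)
    with torch.no_grad():
        # num_labels=1: one raw float score per input text
        scores = model(**enc).logits.squeeze(-1)
    return [to_level(s.item()) for s in scores]
```

Usage would look like `tokenizer, model = load("<this-repo-id>")` followed by `predict_levels(["Some passage..."], model, tokenizer)`, where `<this-repo-id>` stands for this model's actual Hub id.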
### Complexity Levels
| Level | Name | Description | Example |
|-------|------|-------------|---------|
| 1 | Factual/Declarative | States facts with no reasoning | "The Pacific Ocean covers ~165 million km²." |
| 2 | Single-step reasoning | One inference or comparison | "Because boiling point decreases at altitude, water boils faster in Denver than Miami." |
| 3 | Multi-step reasoning | 2–4 chained logical steps | "Demand rose while supply held fixed → prices rose → consumer spending fell → GDP slowed." |
| 4 | Complex reasoning | 5+ steps, conditionals, competing factors | Medical differential diagnosis with branching conditions and exclusion criteria. |
## Training Details
### Data
- **Source**: [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) — a curated subset of Common Crawl filtered for educational content.
- **Labeling**: ~100,000 documents in total, reservoir-sampled at ~6,000 records per subject category, then labeled with GPT-5-nano via the OpenAI Batch API using structured output (an integer in 1–5).
- **Splits**: 80% train / 10% validation / 10% test (stratified by integer complexity level).
- **Preprocessing**: Texts truncated to 8,000 characters before labeling; tokenized to 512 tokens during training with dynamic padding.
- **Level 5 exclusion**: Rows labeled as level 5 were excluded from the training set.
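The reservoir-sampling step above can be sketched with the classic Algorithm R, which draws a uniform sample of fixed size from a stream of unknown length in one pass (the function name and seed are illustrative):

```python
import random

def reservoir_sample(stream, k, seed=42):
    """Uniformly sample up to k items from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)       # fill the reservoir first
        else:
            j = rng.randint(0, i)     # each item survives with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample
```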
### Hyperparameters
| Parameter | Value |
|-----------|-------|
| Base model | `answerdotai/ModernBERT-base` |
| Epochs | 3 |
| Batch size | 32 |
| Learning rate | 2e-5 |
| Weight decay | 0.01 |
| Warmup ratio | 0.1 |
| Max token length | 512 |
| Optimizer | AdamW |
| Scheduler | Linear with warmup |
| AMP | bf16 (CUDA) |
| Loss | MSE |
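The table above maps onto a Hugging Face `TrainingArguments` setup roughly as follows (a sketch, not the exact training script; the output path and metric name are assumptions). AdamW and MSE loss need no explicit configuration: AdamW is the `Trainer` default, and `num_labels=1` regression uses MSE automatically.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="reasoning-complexity",   # assumed path
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",          # linear decay after warmup
    bf16=True,                           # AMP on CUDA
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="mae",         # best checkpoint chosen by validation MAE
    greater_is_better=False,
)
```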
### Training History
| Epoch | Train Loss | Val MAE | Val Acc (rounded) | Val Spearman r |
|-------|-----------|---------|-------------------|----------------|
| 1 | 0.6002 | 0.5190 | 56.98% | 0.7533 |
| **2** | **0.3631** | **0.5040** | **58.43%** | **0.7597** |
| 3 | 0.2040 | 0.5114 | 58.19% | 0.7485 |
The best checkpoint (by validation MAE) was saved at **epoch 2**.
## Evaluation Results
Evaluated on a held-out test set:
| Metric | Value |
|--------|-------|
| MSE | 0.4388 |
| MAE | 0.5063 |
| Rounded accuracy | 58.6% |
| Spearman r | 0.7527 |
**Interpretation**: The model achieves a Spearman correlation of ~0.75 with gold labels, indicating strong ordinal ranking ability. The MAE of ~0.51 means predictions are on average within half a level of the true score when treated as a continuous signal.
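The metrics above can be reproduced from raw predictions with NumPy alone (a sketch; Spearman r is computed here as the Pearson correlation of average ranks, matching `scipy.stats.spearmanr` for this use):

```python
import numpy as np

def _ranks(x):
    """Average ranks, with ties sharing their mean rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x))
    ranks[order] = np.arange(1, len(x) + 1)
    for v in np.unique(x):
        tied = x == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def evaluate(preds, golds):
    """Compute MSE, MAE, rounded accuracy, and Spearman r against gold labels."""
    preds = np.asarray(preds, dtype=float)
    golds = np.asarray(golds, dtype=float)
    err = preds - golds
    rounded = np.clip(np.rint(preds), 1, 4)
    return {
        "mse": float(np.mean(err ** 2)),
        "mae": float(np.mean(np.abs(err))),
        "rounded_accuracy": float(np.mean(rounded == golds)),
        "spearman": float(np.corrcoef(_ranks(preds), _ranks(golds))[0, 1]),
    }
```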
### Output Interpretation
| Raw score | Meaning |
|-----------|---------|
| ~1.0 | Factual/Declarative |
| ~2.0 | Single-step reasoning |
| ~3.0 | Multi-step reasoning |
| ~4.0 | Complex reasoning |
Clip and round the raw float output to `[1, 4]` for a discrete level.
## Architecture
Based on `answerdotai/ModernBERT-base`:
- **Layers**: 22 transformer layers (alternating full and sliding attention)
- **Hidden size**: 768
- **Attention heads**: 12
- **Intermediate size**: 1,152
- **Max position embeddings**: 8,192
- **Classifier pooling**: mean
- **Classifier activation**: GELU
## Limitations
- Labels are silver-standard (GPT-5-nano), not human-annotated; label noise may affect the roughly 1.5% of texts that are ambiguous.
- Texts are truncated to 512 tokens; very long documents are judged on their first ~512 tokens only.
- Trained primarily on English educational web text; performance may degrade on other domains or languages.
## Intended Use
Designed for data curation pipelines that need to filter or balance training corpora by reasoning complexity — for example, constructing curriculum-ordered datasets for language model training.
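A minimal curation sketch, assuming documents have already been scored by the model (all names here are illustrative, not part of this repo):

```python
from collections import defaultdict

def to_level(score):
    """Map a raw score to a discrete level in [1, 4]."""
    return int(round(min(max(score, 1.0), 4.0)))

def balance_by_level(scored_docs, per_level):
    """Cap each complexity level at per_level docs; scored_docs is (text, score) pairs."""
    buckets = defaultdict(list)
    for text, score in scored_docs:
        level = to_level(score)
        if len(buckets[level]) < per_level:
            buckets[level].append(text)
    return dict(buckets)

def curriculum_order(scored_docs):
    """Order documents from simplest to most complex for curriculum training."""
    return [text for text, _ in sorted(scored_docs, key=lambda pair: pair[1])]
```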