---
base_model: answerdotai/ModernBERT-base
language:
- en
library_name: transformers
pipeline_tag: text-classification
tags:
- reasoning
- complexity
- education
- regression
- fineweb-edu
---
# Reasoning Complexity Classifier
A ModernBERT-base model fine-tuned to predict the **reasoning complexity** of educational text on a continuous 1–4 scale. Trained on FineWeb-Edu documents labeled by GPT-5-nano via the OpenAI Batch API (~$20 in credits).
## Model Description
This is a regression model (`num_labels=1`, `problem_type="regression"`) that outputs a continuous score. The score can be rounded to the nearest integer to obtain a discrete complexity level. Level 5 (Formal/Abstract reasoning) was excluded from training due to data scarcity; the model's effective range is **1.0–4.0**.
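A minimal inference sketch (the Hub repo id is a placeholder, and the helper names are illustrative; imports are deferred so the pure scoring helper works without `torch`/`transformers` installed):

```python
def to_level(score: float) -> int:
    """Clip a raw regression score to the effective range [1, 4] and round."""
    return int(round(min(max(score, 1.0), 4.0)))

def load(model_id: str):
    """Load tokenizer and single-label regression model from the Hub."""
    from transformers import AutoModelForSequenceClassification, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()
    return tokenizer, model

def predict_levels(texts, model, tokenizer, device="cpu"):
    """Score a batch of texts and return discrete complexity levels."""
    import torch
    enc = tokenizer(
        texts, truncation=True, max_length=512, padding=True, return_tensors="pt"
    ).to(device)
    with torch.no_grad():
        # num_labels=1: one raw float score per input text
        scores = model(**enc).logits.squeeze(-1)
    return [to_level(s.item()) for s in scores]
```

Usage would look like `tokenizer, model = load("<this-repo-id>")` followed by `predict_levels(["Some passage..."], model, tokenizer)`, where `<this-repo-id>` stands for this model's actual Hub id.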
### Complexity Levels
| Level | Name | Description | Example |
|-------|------|-------------|---------|
| 1 | Factual/Declarative | States facts with no reasoning | "The Pacific Ocean covers ~165 million km²." |
| 2 | Single-step reasoning | One inference or comparison | "Because boiling point decreases at altitude, water boils faster in Denver than Miami." |
| 3 | Multi-step reasoning | 2–4 chained logical steps | "Demand rose while supply held fixed → prices rose → consumer spending fell → GDP slowed." |
| 4 | Complex reasoning | 5+ steps, conditionals, competing factors | Medical differential diagnosis with branching conditions and exclusion criteria. |
## Training Details
### Data
- **Source**: [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) — a curated subset of Common Crawl filtered for educational content.
- **Labeling**: ~100,000 documents in total, reservoir-sampled at ~6,000 records per subject category, then labeled with GPT-5-nano via the OpenAI Batch API using structured output (an integer in 1–5).
- **Splits**: 80% train / 10% validation / 10% test (stratified by integer complexity level).
- **Preprocessing**: Texts truncated to 8,000 characters before labeling; tokenized to 512 tokens during training with dynamic padding.
- **Level 5 exclusion**: Rows labeled as level 5 were excluded from the training set.
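The reservoir-sampling step above can be sketched with the classic Algorithm R, which draws a uniform sample of fixed size from a stream of unknown length in one pass (the function name and seed are illustrative):

```python
import random

def reservoir_sample(stream, k, seed=42):
    """Uniformly sample up to k items from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)       # fill the reservoir first
        else:
            j = rng.randint(0, i)     # each item survives with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample
```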
### Hyperparameters
| Parameter | Value |
|-----------|-------|
| Base model | `answerdotai/ModernBERT-base` |
| Epochs | 3 |
| Batch size | 32 |
| Learning rate | 2e-5 |
| Weight decay | 0.01 |
| Warmup ratio | 0.1 |
| Max token length | 512 |
| Optimizer | AdamW |
| Scheduler | Linear with warmup |
| AMP | bf16 (CUDA) |
| Loss | MSE |
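The table above maps onto a Hugging Face `TrainingArguments` setup roughly as follows (a sketch, not the exact training script; the output path and metric name are assumptions). AdamW and MSE loss need no explicit configuration: AdamW is the `Trainer` default, and `num_labels=1` regression uses MSE automatically.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="reasoning-complexity",   # assumed path
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",          # linear decay after warmup
    bf16=True,                           # AMP on CUDA
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="mae",         # best checkpoint chosen by validation MAE
    greater_is_better=False,
)
```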
### Training History
| Epoch | Train Loss | Val MAE | Val Acc (rounded) | Val Spearman r |
|-------|-----------|---------|-------------------|----------------|
| 1 | 0.6002 | 0.5190 | 56.98% | 0.7533 |
| **2** | **0.3631** | **0.5040** | **58.43%** | **0.7597** |
| 3 | 0.2040 | 0.5114 | 58.19% | 0.7485 |
The best checkpoint (by validation MAE) was saved at **epoch 2**.
## Evaluation Results
Evaluated on a held-out test set:
| Metric | Value |
|--------|-------|
| MSE | 0.4388 |
| MAE | 0.5063 |
| Rounded accuracy | 58.6% |
| Spearman r | 0.7527 |
**Interpretation**: The model achieves a Spearman correlation of ~0.75 with gold labels, indicating strong ordinal ranking ability. The MAE of ~0.51 means predictions are on average within half a level of the true score when treated as a continuous signal.
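The metrics above can be reproduced from raw predictions with NumPy alone (a sketch; Spearman r is computed here as the Pearson correlation of average ranks, matching `scipy.stats.spearmanr` for this use):

```python
import numpy as np

def _ranks(x):
    """Average ranks, with ties sharing their mean rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x))
    ranks[order] = np.arange(1, len(x) + 1)
    for v in np.unique(x):
        tied = x == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def evaluate(preds, golds):
    """Compute MSE, MAE, rounded accuracy, and Spearman r against gold labels."""
    preds = np.asarray(preds, dtype=float)
    golds = np.asarray(golds, dtype=float)
    err = preds - golds
    rounded = np.clip(np.rint(preds), 1, 4)
    return {
        "mse": float(np.mean(err ** 2)),
        "mae": float(np.mean(np.abs(err))),
        "rounded_accuracy": float(np.mean(rounded == golds)),
        "spearman": float(np.corrcoef(_ranks(preds), _ranks(golds))[0, 1]),
    }
```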
### Output Interpretation
| Raw score | Meaning |
|-----------|---------|
| ~1.0 | Factual/Declarative |
| ~2.0 | Single-step reasoning |
| ~3.0 | Multi-step reasoning |
| ~4.0 | Complex reasoning |
Clip and round the raw float output to `[1, 4]` for a discrete level.
## Architecture
Based on `answerdotai/ModernBERT-base`:
- **Layers**: 22 transformer layers (alternating full and sliding attention)
- **Hidden size**: 768
- **Attention heads**: 12
- **Intermediate size**: 1,152
- **Max position embeddings**: 8,192
- **Classifier pooling**: mean
- **Classifier activation**: GELU
## Limitations
- Labels are silver-standard (GPT-5-nano), not human-annotated; label noise may affect the roughly 1.5% of texts that are ambiguous.
- Texts are truncated to 512 tokens; very long documents are judged on their first ~512 tokens only.
- Trained primarily on English educational web text; performance may degrade on other domains or languages.
## Intended Use
Designed for data curation pipelines that need to filter or balance training corpora by reasoning complexity — for example, constructing curriculum-ordered datasets for language model training.
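A minimal curation sketch, assuming documents have already been scored by the model (all names here are illustrative, not part of this repo):

```python
from collections import defaultdict

def to_level(score):
    """Map a raw score to a discrete level in [1, 4]."""
    return int(round(min(max(score, 1.0), 4.0)))

def balance_by_level(scored_docs, per_level):
    """Cap each complexity level at per_level docs; scored_docs is (text, score) pairs."""
    buckets = defaultdict(list)
    for text, score in scored_docs:
        level = to_level(score)
        if len(buckets[level]) < per_level:
            buckets[level].append(text)
    return dict(buckets)

def curriculum_order(scored_docs):
    """Order documents from simplest to most complex for curriculum training."""
    return [text for text, _ in sorted(scored_docs, key=lambda pair: pair[1])]
```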