---
base_model: answerdotai/ModernBERT-base
language:
- en
library_name: transformers
pipeline_tag: text-classification
tags:
- reasoning
- complexity
- education
- regression
- fineweb-edu
---

# Reasoning Complexity Classifier

A ModernBERT-base model fine-tuned to predict the **reasoning complexity** of educational text on a continuous 1–4 scale. Trained on FineWeb-Edu documents labeled by GPT-5-nano via the OpenAI Batch API (~$20 in credits).

## Model Description

This is a regression model (`num_labels=1`, `problem_type="regression"`) that outputs a continuous score. The score can be rounded to the nearest integer to obtain a discrete complexity level. Level 5 (Formal/Abstract reasoning) was excluded from training due to data scarcity; the model's effective range is **1.0–4.0**.

### Complexity Levels

| Level | Name | Description | Example |
|-------|------|-------------|---------|
| 1 | Factual/Declarative | States facts with no reasoning | "The Pacific Ocean covers ~165 million km²." |
| 2 | Single-step reasoning | One inference or comparison | "Because boiling point decreases at altitude, water boils faster in Denver than Miami." |
| 3 | Multi-step reasoning | 2–4 chained logical steps | "Demand rose while supply held fixed → prices rose → consumer spending fell → GDP slowed." |
| 4 | Complex reasoning | 5+ steps, conditionals, competing factors | Medical differential diagnosis with branching conditions and exclusion criteria. |

## Training Details

### Data

- **Source**: [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) — a curated subset of Common Crawl filtered for educational content.
- **Labeling**: ~100,000 documents reservoir-sampled from ~6,000 records per subject category, then labeled with GPT-5-nano via the OpenAI Batch API using structured output (integer 1–5).
- **Splits**: 80% train / 10% validation / 10% test (stratified by integer complexity level).
- **Preprocessing**: Texts truncated to 8,000 characters before labeling; tokenized to 512 tokens during training with dynamic padding.
- **Level 5 exclusion**: Rows labeled as level 5 were excluded from the training set.

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Base model | `answerdotai/ModernBERT-base` |
| Epochs | 3 |
| Batch size | 32 |
| Learning rate | 2e-5 |
| Weight decay | 0.01 |
| Warmup ratio | 0.1 |
| Max token length | 512 |
| Optimizer | AdamW |
| Scheduler | Linear with warmup |
| AMP | bf16 (CUDA) |
| Loss | MSE |

### Training History

| Epoch | Train Loss | Val MAE | Val Acc (rounded) | Val Spearman r |
|-------|------------|---------|-------------------|----------------|
| 1 | 0.6002 | 0.5190 | 56.98% | 0.7533 |
| **2** | **0.3631** | **0.5040** | **58.43%** | **0.7597** |
| 3 | 0.2040 | 0.5114 | 58.19% | 0.7485 |

The best checkpoint (by validation MAE) was saved at **epoch 2**.

## Evaluation Results

Evaluated on a held-out test set:

| Metric | Value |
|--------|-------|
| MSE | 0.4388 |
| MAE | 0.5063 |
| Rounded accuracy | 58.6% |
| Spearman r | 0.7527 |

**Interpretation**: The model achieves a Spearman correlation of ~0.75 with the gold labels, indicating strong ordinal ranking ability. The MAE of ~0.51 means predictions deviate from the gold score by about half a level on average when the output is treated as a continuous signal.

### Output Interpretation

| Raw score | Meaning |
|-----------|---------|
| ~1.0 | Factual/Declarative |
| ~2.0 | Single-step reasoning |
| ~3.0 | Multi-step reasoning |
| ~4.0 | Complex reasoning |

Clip the raw float output to `[1, 4]` and round to the nearest integer to obtain a discrete level.
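The post-processing above can be sketched as follows. This is a minimal example, not the authors' exact inference code: `score_to_level` implements the clip-and-round rule, and `predict_level` shows a typical `transformers` regression-head call. The `model_id` argument is a placeholder for wherever this checkpoint is hosted.

```python
def score_to_level(raw_score: float) -> int:
    """Map the model's raw regression output to a discrete complexity level.

    The effective range is 1.0-4.0, so the score is clipped to [1, 4]
    before rounding to the nearest integer.
    """
    clipped = min(max(raw_score, 1.0), 4.0)
    return round(clipped)


def predict_level(text: str, model_id: str) -> int:
    """Score a single text with the fine-tuned regression head (sketch).

    Imports are deferred so score_to_level stays usable without
    torch/transformers installed.
    """
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        # num_labels=1, so logits holds a single regression value
        raw = model(**inputs).logits.squeeze().item()
    return score_to_level(raw)
```

Note that scores near a `.5` boundary are ambiguous by construction; for filtering use cases it may be preferable to threshold the raw float directly instead of rounding.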
## Architecture

Based on `answerdotai/ModernBERT-base`:

- **Layers**: 22 transformer layers (alternating full and sliding-window attention)
- **Hidden size**: 768
- **Attention heads**: 12
- **Intermediate size**: 1,152
- **Max position embeddings**: 8,192
- **Classifier pooling**: mean
- **Classifier activation**: GELU

## Limitations

- Labels are silver-standard (GPT-5-nano), not human-annotated; label noise may affect the ~1.5% of ambiguous texts.
- Texts are truncated to 512 tokens; very long documents are judged on their first ~512 tokens only.
- Trained primarily on English educational web text; performance may degrade on other domains or languages.

## Intended Use

Designed for data curation pipelines that need to filter or balance training corpora by reasoning complexity — for example, constructing curriculum-ordered datasets for language model training.
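As an illustration of the curation use case, the sketch below keeps only documents in a target complexity band. The `filter_by_complexity` helper and the `scorer` callback interface are hypothetical, introduced here for illustration; any function returning the model's raw float score per document would fit.

```python
from typing import Callable, Iterable, List


def filter_by_complexity(
    docs: Iterable[str],
    scorer: Callable[[str], float],
    min_level: int = 3,
    max_level: int = 4,
) -> List[str]:
    """Keep documents whose clipped, rounded score lands in [min_level, max_level].

    `scorer` returns the model's raw regression output for one document.
    """
    kept = []
    for doc in docs:
        # Same post-processing as the card describes: clip to [1, 4], then round.
        level = round(min(max(scorer(doc), 1.0), 4.0))
        if min_level <= level <= max_level:
            kept.append(doc)
    return kept
```

For curriculum ordering rather than filtering, the raw float scores can instead be used directly as a sort key, which avoids losing ordering information to rounding.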