# fineweb-edu-classifier

A fine-tuned [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) model for **multi-label subject classification** of educational web text. Given a passage of text, it predicts which of 17 academic/professional subject categories apply.

## Model Details

| Property | Value |
|---|---|
| Base model | `answerdotai/ModernBERT-base` |
| Architecture | `ModernBertForSequenceClassification` |
| Task | Multi-label classification |
| Number of labels | 17 |
| Max input length | 512 tokens |
| Hidden size | 768 |
| Attention heads | 12 |
| Transformer layers | 22 (alternating full + sliding window attention) |
| Pooling | Mean pooling |

## Labels

| Index | Field | Display Name |
|---|---|---|
| 0 | `mathematics_statistics` | Mathematics Statistics |
| 1 | `computer_science_software_engineering` | Computer Science Software Engineering |
| 2 | `machine_learning_ai` | Machine Learning AI |
| 3 | `physical_sciences` | Physical Sciences |
| 4 | `life_sciences_biology` | Life Sciences Biology |
| 5 | `medicine_health` | Medicine Health |
| 6 | `engineering_technology` | Engineering Technology |
| 7 | `business_economics` | Business Economics |
| 8 | `law_government` | Law Government |
| 9 | `social_sciences` | Social Sciences |
| 10 | `history_geography` | History Geography |
| 11 | `philosophy_ethics` | Philosophy Ethics |
| 12 | `education_pedagogy` | Education Pedagogy |
| 13 | `language_writing` | Language Writing |
| 14 | `arts_humanities` | Arts Humanities |
| 15 | `environmental_science_energy` | Environmental Science Energy |
| 16 | `personal_finance_practical_life` | Personal Finance Practical Life |

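
## Usage

A minimal inference sketch. It assumes the checkpoint loads through `AutoModelForSequenceClassification` (ModernBERT needs a recent `transformers` release), that the label names above are stored in `config.id2label`, and that multi-label decisions are made by thresholding sigmoid scores at 0.5 (the threshold is not part of the saved model). The repo id below is a placeholder.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "fineweb-edu-classifier"  # placeholder -- substitute the actual repo id or local path
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

text = "The derivative measures the instantaneous rate of change of a function."
inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, 17)

# Multi-label: an independent sigmoid per label, thresholded at 0.5 (assumed threshold).
probs = torch.sigmoid(logits)[0]
predicted = [model.config.id2label[i] for i, p in enumerate(probs.tolist()) if p >= 0.5]
print(predicted)  # e.g. ['mathematics_statistics']
```
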
## Training Data

- Source: [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) (CC-MAIN-2021-04 shard) plus ~50K rows from [HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) (10BT sample)
- Labels were generated by gpt-5-nano via the OpenAI Batch API (~$80 in batch credits)
- Data was split 80% train / 10% val / 10% test (random seed 42)

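
One way to reproduce an 80/10/10 split with seed 42 using 🤗 `datasets` is sketched below. The toy `Dataset` stands in for the labeled corpus, and the two-stage `train_test_split` is an assumption about how the 10% validation and 10% test portions were carved out.

```python
from datasets import Dataset, DatasetDict

# Toy stand-in for the labeled corpus: text plus a 17-dim multi-hot label vector.
ds = Dataset.from_dict({
    "text": [f"example document {i}" for i in range(1000)],
    "labels": [[float(i % 17 == j) for j in range(17)] for i in range(1000)],
})

# 80% train / 20% holdout, then split the holdout evenly into val and test (seed 42).
split = ds.train_test_split(test_size=0.2, seed=42)
holdout = split["test"].train_test_split(test_size=0.5, seed=42)

splits = DatasetDict(train=split["train"], validation=holdout["train"], test=holdout["test"])
print({name: len(part) for name, part in splits.items()})  # {'train': 800, 'validation': 100, 'test': 100}
```
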
## Training Configuration

| Hyperparameter | Value |
|---|---|
| Epochs | 3 |
| Batch size | 32 |
| Learning rate | 2e-5 |
| Weight decay | 0.01 |
| Warmup ratio | 0.1 |
| Max token length | 512 |
| Optimizer | AdamW |
| Scheduler | Linear with warmup |
| AMP | bf16 (on CUDA) |
| Gradient clipping | max norm 1.0 |

Model checkpoint was saved at the epoch with the best validation micro-F1 (epoch 2).

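
A condensed training-loop sketch matching the table: AdamW with weight decay 0.01, linear warmup over 10% of steps, bf16 autocast, and gradient clipping at max norm 1.0. Only the hyperparameters come from the table above; the `BCEWithLogitsLoss` multi-label objective and the dummy tensors standing in for the tokenized training split are assumptions.

```python
import torch
from torch.nn import BCEWithLogitsLoss
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoModelForSequenceClassification, get_linear_schedule_with_warmup

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForSequenceClassification.from_pretrained(
    "answerdotai/ModernBERT-base",
    num_labels=17,
    problem_type="multi_label_classification",
).to(device)

# Placeholder batches: in practice these tensors come from tokenizing the training split (max_length=512).
dummy = TensorDataset(
    torch.randint(0, 50_000, (64, 128)),     # input_ids
    torch.ones(64, 128, dtype=torch.long),   # attention_mask
    torch.randint(0, 2, (64, 17)).float(),   # multi-hot labels
)
train_loader = DataLoader(dummy, batch_size=32, shuffle=True)

epochs = 3
num_steps = epochs * len(train_loader)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(optimizer, int(0.1 * num_steps), num_steps)
loss_fn = BCEWithLogitsLoss()

model.train()
for epoch in range(epochs):
    for input_ids, attention_mask, labels in train_loader:
        optimizer.zero_grad()
        # bf16 autocast, matching the AMP row above (also runs on CPU for this sketch).
        with torch.autocast(device_type=device, dtype=torch.bfloat16):
            logits = model(input_ids=input_ids.to(device), attention_mask=attention_mask.to(device)).logits
            loss = loss_fn(logits, labels.to(device))
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()
    # After each epoch: evaluate on the validation split and keep the checkpoint
    # with the best micro-F1 (epoch 2 for this model).
```
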
## Test Set Performance

| Metric | Score |
|---|---|
| Micro F1 | **0.8545** |
| Macro F1 | **0.8264** |
| Precision (micro) | **0.8799** |
| Recall (micro) | **0.8304** |
| Loss | 0.1222 |
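
The F1, precision, and recall figures are standard multi-label metrics over thresholded sigmoid outputs. A sketch of how they can be computed with scikit-learn, assuming a 0.5 threshold; random arrays stand in for the real test-set labels and model probabilities.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Placeholder arrays standing in for the test split: multi-hot gold labels and
# the model's sigmoid probabilities, both of shape (num_examples, 17).
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(1000, 17))
probs = rng.random((1000, 17))

# Threshold at 0.5 (assumed) to get hard multi-label predictions.
y_pred = (probs >= 0.5).astype(int)

print("micro F1:", f1_score(y_true, y_pred, average="micro"))
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
print("micro precision:", precision_score(y_true, y_pred, average="micro"))
print("micro recall:", recall_score(y_true, y_pred, average="micro"))
```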