mdonigian committed on
Commit 0e14f17 · verified · 1 Parent(s): eae3b92

Upload folder using huggingface_hub

  1. README.md +72 -0
README.md ADDED
@@ -0,0 +1,72 @@
# fineweb-edu-classifier

A fine-tuned [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) model for **multi-label subject classification** of educational web text. Given a passage of text, it predicts which of 17 academic/professional subject categories apply.

## Model Details

| Property | Value |
|---|---|
| Base model | `answerdotai/ModernBERT-base` |
| Architecture | `ModernBertForSequenceClassification` |
| Task | Multi-label classification |
| Number of labels | 17 |
| Max input length | 512 tokens |
| Hidden size | 768 |
| Attention heads | 12 |
| Transformer layers | 22 (alternating full + sliding window attention) |
| Pooling | Mean pooling |

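As a rough illustration of the "Mean pooling" entry above, the sketch below averages per-token vectors while skipping padded positions. It is a plain-Python toy, not the model's actual implementation. The function name `mean_pool` and the masking behaviour are assumptions made for illustration.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring positions where the mask is 0.

    token_embeddings: list of per-token vectors (lists of floats).
    attention_mask:   list of 0/1 flags, one per token (assumed convention).
    """
    hidden_size = len(token_embeddings[0])
    totals = [0.0] * hidden_size
    kept = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            kept += 1
            for i, value in enumerate(vector):
                totals[i] += value
    if kept == 0:
        raise ValueError("attention_mask selects no tokens")
    return [t / kept for t in totals]


# Toy input: three tokens with hidden size 2; the last token is padding.
pooled = mean_pool([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0])
```

In the real model the vectors would have 768 dimensions and the pooled vector would feed the classification head.
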
## Labels

| Index | Field | Display Name |
|---|---|---|
| 0 | `mathematics_statistics` | Mathematics Statistics |
| 1 | `computer_science_software_engineering` | Computer Science Software Engineering |
| 2 | `machine_learning_ai` | Machine Learning AI |
| 3 | `physical_sciences` | Physical Sciences |
| 4 | `life_sciences_biology` | Life Sciences Biology |
| 5 | `medicine_health` | Medicine Health |
| 6 | `engineering_technology` | Engineering Technology |
| 7 | `business_economics` | Business Economics |
| 8 | `law_government` | Law Government |
| 9 | `social_sciences` | Social Sciences |
| 10 | `history_geography` | History Geography |
| 11 | `philosophy_ethics` | Philosophy Ethics |
| 12 | `education_pedagogy` | Education Pedagogy |
| 13 | `language_writing` | Language Writing |
| 14 | `arts_humanities` | Arts Humanities |
| 15 | `environmental_science_energy` | Environmental Science Energy |
| 16 | `personal_finance_practical_life` | Personal Finance Practical Life |

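Because the task is multi-label, each of the 17 outputs is typically scored independently. The sketch below maps a vector of raw logits to the field names in the table using a per-label sigmoid. The 0.5 threshold and the helper name `decode` are assumptions for illustration; this card does not state which threshold was used.

```python
import math

# Field names in index order, copied from the table above.
LABELS = [
    "mathematics_statistics",
    "computer_science_software_engineering",
    "machine_learning_ai",
    "physical_sciences",
    "life_sciences_biology",
    "medicine_health",
    "engineering_technology",
    "business_economics",
    "law_government",
    "social_sciences",
    "history_geography",
    "philosophy_ethics",
    "education_pedagogy",
    "language_writing",
    "arts_humanities",
    "environmental_science_energy",
    "personal_finance_practical_life",
]


def decode(logits, threshold=0.5):
    """Return (label, probability) pairs whose sigmoid score meets the threshold."""
    if len(logits) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} logits, got {len(logits)}")
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [(name, p) for name, p in zip(LABELS, probs) if p >= threshold]


# Made-up logits: strongly positive for index 0, mildly positive for index 2.
example_logits = [-4.0] * 17
example_logits[0] = 2.0
example_logits[2] = 0.3
predicted = decode(example_logits)
```

A passage can receive zero, one, or several labels; raising the threshold trades recall for precision.
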
## Training Data

- Source: [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) (CC-MAIN-2021-04 shard) plus ~50K rows from [HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) (10BT sample)
- Labels were generated by gpt-5-nano via the OpenAI Batch API (~$80 in batch credits)
- Data was split 80% train / 10% val / 10% test (random seed 42)

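One way to reproduce an 80/10/10 split with a fixed seed is sketched below using only the standard library. The helper name `split_80_10_10` and the shuffle-then-slice approach are assumptions; the original split may have used a different tool and would not yield the same row assignment.

```python
import random


def split_80_10_10(rows, seed=42):
    """Shuffle rows deterministically, then slice into train/val/test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = len(rows) * 8 // 10  # integer arithmetic avoids float rounding
    n_val = len(rows) // 10
    train = rows[:n_train]
    val = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:]
    return train, val, test


# Stand-in for the labelled dataset: 1000 row ids.
train, val, test = split_80_10_10(range(1000))
```
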
## Training Configuration

| Hyperparameter | Value |
|---|---|
| Epochs | 3 |
| Batch size | 32 |
| Learning rate | 2e-5 |
| Weight decay | 0.01 |
| Warmup ratio | 0.1 |
| Max token length | 512 |
| Optimizer | AdamW |
| Scheduler | Linear with warmup |
| AMP | bf16 (on CUDA) |
| Gradient clipping | max norm 1.0 |

The saved checkpoint is from the epoch with the best validation micro-F1 (epoch 2).

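To make the "Linear with warmup" row concrete, the sketch below computes the learning rate at a given optimizer step with the peak rate 2e-5 and warmup ratio 0.1 from the table. The function name `linear_warmup_lr` and the `total_steps` value are hypothetical; the actual step count depends on the dataset size, which this card does not give.

```python
def linear_warmup_lr(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Learning rate at `step`: linear ramp up to base_lr, then linear decay to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: scale from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: scale from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical run of 1000 optimizer steps.
schedule = [linear_warmup_lr(s, 1000) for s in (0, 50, 100, 550, 1000)]
```

With these settings the rate peaks at step 100 (10% of training) and returns to zero at the final step.
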
## Test Set Performance

| Metric | Score |
|---|---|
| Micro F1 | **0.8545** |
| Macro F1 | **0.8264** |
| Precision (micro) | **0.8799** |
| Recall (micro) | **0.8304** |
| Loss | 0.1222 |
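
For readers unfamiliar with the two averages, the sketch below shows how micro and macro F1 are commonly computed for multi-label predictions. Micro pools true/false positives across all labels, while macro averages the per-label F1 scores. The helper names and the toy data are made up for illustration; this is not the evaluation script used for the table. The last line checks that the harmonic mean of the reported micro precision and recall comes out at about 0.8544, which matches the reported micro F1 up to rounding.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_macro_f1(y_true, y_pred):
    """Micro and macro F1 for binary indicator matrices (lists of 0/1 rows)."""
    n_labels = len(y_true[0])
    tp, fp, fn = [0] * n_labels, [0] * n_labels, [0] * n_labels
    for true_row, pred_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(true_row, pred_row)):
            if t and p:
                tp[j] += 1
            elif p:
                fp[j] += 1
            elif t:
                fn[j] += 1

    def f1_from_counts(tp_, fp_, fn_):
        precision = tp_ / (tp_ + fp_) if tp_ + fp_ else 0.0
        recall = tp_ / (tp_ + fn_) if tp_ + fn_ else 0.0
        return f1(precision, recall)

    micro = f1_from_counts(sum(tp), sum(fp), sum(fn))
    macro = sum(f1_from_counts(a, b, c) for a, b, c in zip(tp, fp, fn)) / n_labels
    return micro, macro


# Toy two-label example (not real model output).
micro, macro = micro_macro_f1(
    y_true=[[1, 0], [1, 1], [0, 1]],
    y_pred=[[1, 0], [1, 0], [0, 1]],
)

# Consistency check against the reported micro precision and recall.
check = f1(0.8799, 0.8304)
```

A macro F1 below the micro F1, as in the table, usually indicates that some less frequent labels score lower than the common ones.
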