---
library_name: transformers
datasets:
- WebOrganizer/FormatAnnotations-Llama-3.1-8B
- WebOrganizer/FormatAnnotations-Llama-3.1-405B-FP8
base_model:
- answerdotai/ModernBERT-base
---
# wissamantoun/WebOrganizer-FormatClassifier-ModernBERT

[[Paper](https://arxiv.org/abs/2502.10341)] [[Website](https://weborganizer.allenai.org)] [[GitHub](https://github.com/CodeCreator/WebOrganizer)]

*All credit goes to the original authors of the model and dataset. This is a retraining of the original model with a different base model.*

The FormatClassifier organizes web content into 24 categories based on the URL and text contents of web pages.
The model is a [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) with 149M parameters fine-tuned on the following training data:

1. [WebOrganizer/FormatAnnotations-Llama-3.1-8B](https://huggingface.co/datasets/WebOrganizer/FormatAnnotations-Llama-3.1-8B): 1M documents annotated by Llama-3.1-8B (first-stage training)
2. [WebOrganizer/FormatAnnotations-Llama-3.1-405B-FP8](https://huggingface.co/datasets/WebOrganizer/FormatAnnotations-Llama-3.1-405B-FP8): 100K documents annotated by Llama-3.1-405B-FP8 (second-stage training)

#### All Domain Classifiers
- [wissamantoun/WebOrganizer-FormatClassifier-ModernBERT](https://huggingface.co/wissamantoun/WebOrganizer-FormatClassifier-ModernBERT)  *← you are here!*
- [wissamantoun/WebOrganizer-TopicClassifier-ModernBERT](https://huggingface.co/wissamantoun/WebOrganizer-TopicClassifier-ModernBERT)

## Usage

This classifier expects input in the following format:
```
{url}

{text}
```
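As a minimal sketch, the input string can be assembled from a URL and page text like this (the helper name is illustrative, not part of the library):

```python
def build_classifier_input(url: str, text: str) -> str:
    """Join the URL and page text with a blank line, matching the expected format."""
    return f"{url}\n\n{text}"

example = build_classifier_input(
    "http://www.example.com",
    "How to build a computer from scratch? Here are the components you need...",
)
print(example.split("\n\n")[0])  # -> http://www.example.com
```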

Example:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("wissamantoun/WebOrganizer-FormatClassifier-ModernBERT")
model = AutoModelForSequenceClassification.from_pretrained(
    "wissamantoun/WebOrganizer-FormatClassifier-ModernBERT"
)

web_page = """http://www.example.com

How to build a computer from scratch? Here are the components you need..."""

inputs = tokenizer([web_page], return_tensors="pt")
outputs = model(**inputs)

probs = outputs.logits.softmax(dim=-1)
print(probs.argmax(dim=-1))
# -> index of the predicted format category (see `id2label` in the model config)
```

Apply a softmax to the model's `logits` to obtain a probability distribution over the following 24 categories (ordered by label index; also see `id2label` and `label2id` in the model config):

0. Academic Writing
1. Content Listing
2. Creative Writing
3. Customer Support
4. Comment Section
5. FAQ
6. Truncated
7. Knowledge Article
8. Legal Notices
9. Listicle
10. News Article
11. Nonfiction Writing
12. About (Org.)
13. News (Org.)
14. About (Pers.)
15. Personal Blog
16. Product Page
17. Q&A Forum
18. Spam / Ads
19. Structured Data
20. Documentation
21. Audio Transcript
22. Tutorial
23. User Review
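
As a rough illustration of the logits-to-probabilities step (independent of the model itself), a numerically stable softmax over a 24-way logit vector can be written in plain Python; the logit values below are made up for the example:

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the 24 format categories (not real model output).
logits = [0.1] * 24
logits[22] = 3.0  # pretend category 22 ("Tutorial") scores highest
probs = softmax(logits)
pred = max(range(len(probs)), key=probs.__getitem__)
print(pred)  # -> 22
```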

The full definitions of the categories can be found in the [taxonomy config](https://github.com/CodeCreator/WebOrganizer/blob/main/define_domains/taxonomies/formats.yaml).

## Scores
```
***** pred metrics *****
  test_accuracy                      =     0.8154
  test_accuracy__0                   =      0.855
  test_accuracy__1                   =     0.7558
  test_accuracy__10                  =     0.9071
  test_accuracy__11                  =     0.6869
  test_accuracy__12                  =     0.8055
  test_accuracy__13                  =     0.7897
  test_accuracy__14                  =     0.8592
  test_accuracy__15                  =     0.8541
  test_accuracy__16                  =     0.8788
  test_accuracy__17                  =     0.7733
  test_accuracy__18                  =     0.7286
  test_accuracy__19                  =     0.6989
  test_accuracy__2                   =     0.7474
  test_accuracy__20                  =     0.7609
  test_accuracy__21                  =     0.7807
  test_accuracy__22                  =     0.7703
  test_accuracy__23                  =     0.7931
  test_accuracy__3                   =     0.6351
  test_accuracy__4                   =      0.871
  test_accuracy__5                   =     0.8333
  test_accuracy__6                   =     0.6125
  test_accuracy__7                   =     0.6416
  test_accuracy__8                   =       0.78
  test_accuracy__9                   =     0.7668
  test_accuracy_conf50               =     0.8312
  test_accuracy_conf50__0            =     0.8852
  test_accuracy_conf50__1            =     0.7651
  test_accuracy_conf50__10           =     0.9167
  test_accuracy_conf50__11           =     0.7168
  test_accuracy_conf50__12           =     0.8256
  test_accuracy_conf50__13           =     0.7996
  test_accuracy_conf50__14           =     0.8696
  test_accuracy_conf50__15           =     0.8684
  test_accuracy_conf50__16           =     0.8878
  test_accuracy_conf50__17           =     0.7838
  test_accuracy_conf50__18           =     0.7663
  test_accuracy_conf50__19           =     0.7276
  test_accuracy_conf50__2            =     0.7609
  test_accuracy_conf50__20           =     0.7907
  test_accuracy_conf50__21           =        0.8
  test_accuracy_conf50__22           =     0.7927
  test_accuracy_conf50__23           =     0.7904
  test_accuracy_conf50__3            =     0.6617
  test_accuracy_conf50__4            =      0.877
  test_accuracy_conf50__5            =     0.8571
  test_accuracy_conf50__6            =     0.6299
  test_accuracy_conf50__7            =     0.6786
  test_accuracy_conf50__8            =     0.7755
  test_accuracy_conf50__9            =     0.7796
  test_accuracy_conf75               =     0.9003 <--- Metric from the paper
  test_accuracy_conf75__0            =     0.9412
  test_accuracy_conf75__1            =     0.8318
  test_accuracy_conf75__10           =     0.9542
  test_accuracy_conf75__11           =     0.8478
  test_accuracy_conf75__12           =     0.8841
  test_accuracy_conf75__13           =     0.8724
  test_accuracy_conf75__14           =      0.914
  test_accuracy_conf75__15           =     0.9345
  test_accuracy_conf75__16           =     0.9316
  test_accuracy_conf75__17           =     0.8667
  test_accuracy_conf75__18           =     0.8446
  test_accuracy_conf75__19           =     0.8209
  test_accuracy_conf75__2            =     0.8333
  test_accuracy_conf75__20           =     0.9333
  test_accuracy_conf75__21           =     0.8587
  test_accuracy_conf75__22           =     0.8708
  test_accuracy_conf75__23           =     0.8309
  test_accuracy_conf75__3            =     0.7292
  test_accuracy_conf75__4            =     0.9357
  test_accuracy_conf75__5            =     0.9032
  test_accuracy_conf75__6            =     0.7816
  test_accuracy_conf75__7            =     0.8011
  test_accuracy_conf75__8            =     0.8409
  test_accuracy_conf75__9            =     0.8592
  test_accuracy_label_average        =     0.7744
  test_accuracy_label_average_conf50 =     0.7919
  test_accuracy_label_average_conf75 =     0.8676
  test_accuracy_label_min            =     0.6125
  test_accuracy_label_min_conf75     =     0.7292 <--- Metric from the paper
  test_loss                          =     0.6023
  test_proportion_conf50             =     0.9638
  test_proportion_conf75             =     0.7951
  test_runtime                       = 0:00:08.38
  test_samples_per_second            =   1192.262
  test_steps_per_second              =     37.318
```
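
The `conf50` and `conf75` rows appear to restrict accuracy to examples whose top predicted probability reaches 0.5 or 0.75, with `proportion_conf*` giving the fraction of examples that pass the threshold. A sketch of that computation, using made-up three-class predictions rather than real model output:

```python
def accuracy_at_confidence(probs, labels, threshold):
    """Accuracy over examples whose max predicted probability is >= threshold,
    plus the proportion of examples that pass the threshold."""
    kept = [(p, y) for p, y in zip(probs, labels) if max(p) >= threshold]
    if not kept:
        return 0.0, 0.0
    correct = sum(1 for p, y in kept if p.index(max(p)) == y)
    return correct / len(kept), len(kept) / len(probs)

# Toy example: three predictions over three classes (illustrative only).
probs = [[0.8, 0.1, 0.1], [0.4, 0.35, 0.25], [0.1, 0.8, 0.1]]
labels = [0, 1, 0]
acc, prop = accuracy_at_confidence(probs, labels, 0.75)
print(acc, prop)  # -> 0.5 0.6666666666666666
```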



## Citation
```bibtex
@article{wettig2025organize,
  title={Organize the Web: Constructing Domains Enhances Pre-Training Data Curation},
  author={Alexander Wettig and Kyle Lo and Sewon Min and Hannaneh Hajishirzi and Danqi Chen and Luca Soldaini},
  journal={arXiv preprint arXiv:2502.10341},
  year={2025}
}
```