fredriko committed · Commit 7515abf · verified · Parent: ae3a9ed

Add model card

Files changed (1): README.md (+153, −0)
---
language:
- multilingual
license: apache-2.0
base_model: answerdotai/ModernBERT-large
tags:
- text-classification
- multi-label-classification
- topic-classification
- modernbert
- metacurate
pipeline_tag: text-classification
model-index:
- name: topic-classifier-v13
  results:
  - task:
      type: text-classification
      name: Multi-label Topic Classification
    metrics:
    - type: f1
      value: 0.7017
      name: Tuned Macro F1 (F1-optimized thresholds)
    - type: precision
      value: 0.7578
      name: Macro Precision (precision-biased thresholds)
    - type: f1
      value: 0.6409
      name: Tuned Micro F1
---

# topic-classifier-v13

Multi-label topic classifier for tech/AI web content, fine-tuned from
[answerdotai/ModernBERT-large](https://huggingface.co/answerdotai/ModernBERT-large).

Developed by [Metacurate](https://metacurate.io) to classify ingested web documents
into 516 granular tech/AI topic labels, supporting content discovery and filtering.

## Model details

| Property | Value |
|---|---|
| Base model | `answerdotai/ModernBERT-large` |
| Task | Multi-label text classification |
| Labels | 516 total / 478 active |
| Max input length | 8,192 tokens |
| Languages | Multilingual (trained on EN-translated text) |
| Training epochs | 15 |
| Learning rate | 2e-5 |
| Batch size | 16 |
| Warmup ratio | 0.1 |
| Positive weight cap | 100.0 |

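The positive weight cap above most plausibly bounds the per-label `pos_weight` passed to the BCE loss, so that extremely rare labels do not dominate training. The exact recipe is not published; this is a sketch under that assumption, with invented counts:

```python
import torch

# Per-label positive weight: n_negatives / n_positives, capped at 100.0.
# label_counts and n_docs are invented numbers for illustration.
label_counts = torch.tensor([5000.0, 120.0, 3.0])   # positive docs per label
n_docs = 13000.0

pos_weight = (n_docs - label_counts) / label_counts  # ~1.6, ~107, ~4332
pos_weight = pos_weight.clamp(max=100.0)             # apply the cap

# The capped weights then feed the multi-label loss.
loss_fn = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)
```

Without the cap, a label with only a handful of positives would receive a weight in the thousands and swamp the gradient signal of well-supported labels.
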
55
+
56
+ Evaluated on a held-out 15% stratified validation split.
57
+
58
+ | Threshold strategy | Macro F1 | Micro F1 | Macro Precision |
59
+ |---|---|---|---|
60
+ | Raw (0.5) | 0.6497 | 0.6130 | — |
61
+ | F1-optimized per-label thresholds | **0.7017** | 0.6409 | — |
62
+ | Precision-biased thresholds (F-beta=0.5, floor=0.5) | 0.6589 | 0.6287 | **0.7578** |
63
+
64
+ The model ships with two threshold files:
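Macro F1 averages per-label F1 scores equally, while micro F1 pools true/false positive counts across all labels, so the two diverge whenever per-label performance is uneven. A toy illustration with invented per-label confusion counts (not the model's actual numbers):

```python
# Invented (tp, fp, fn) counts for three labels, purely for illustration.
counts = {
    "large language models": (90, 10, 10),  # common, well-predicted label
    "robotics": (5, 5, 10),                 # mid-support label
    "quantum computing": (1, 9, 4),         # rare, noisy label
}

def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# Macro F1: mean of per-label F1 -- every label counts equally,
# so rare noisy labels drag it down.
macro = sum(f1(*c) for c in counts.values()) / len(counts)

# Micro F1: pool all counts first -- dominated by high-support labels.
tp, fp, fn = (sum(xs) for xs in zip(*counts.values()))
micro = f1(tp, fp, fn)

print(round(macro, 4), round(micro, 4))  # 0.4778 0.8
```
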
65
+ - `thresholds.json` — per-label thresholds that maximize F1
66
+ - `thresholds_precision.json` — per-label thresholds tuned for F-beta (β=0.5, precision floor=0.5)
67
+
68
+ For production use, `thresholds_precision.json` is recommended: it suppresses 38 low-precision labels
69
+ entirely and raises thresholds on the remaining 478, trading a small F1 reduction for substantially
70
+ higher precision (~75.8% macro precision).
71
+
72
+ ## Labels
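The tuning script itself is not included in the repository, but per-label F-beta threshold selection with a precision floor, as described above, can be sketched roughly as follows. `tune_threshold` and its coarse grid are assumptions for illustration, not the actual Metacurate code:

```python
import numpy as np

def tune_threshold(y_true, y_prob, beta=0.5, precision_floor=0.5):
    """Pick the cut that maximizes F-beta, subject to a precision floor.

    Returns (threshold, fbeta); threshold is None when no cut clears the
    floor, i.e. the label gets suppressed."""
    best_t, best_f = None, -1.0
    for t in np.linspace(0.05, 0.95, 19):  # coarse illustrative grid
        pred = y_prob >= t
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        if tp + fp == 0:
            continue  # no positive predictions at this cut
        precision = tp / (tp + fp)
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision < precision_floor or precision + recall == 0:
            continue  # fails the floor; candidate cut is rejected
        fbeta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
        if fbeta > best_f:
            best_t, best_f = t, fbeta
    return best_t, best_f

# Toy scores for one label: clean separation, so the first clean cut wins.
y_true = np.array([1, 1, 1, 0, 0, 0])
y_prob = np.array([0.92, 0.83, 0.71, 0.38, 0.24, 0.11])
t, score = tune_threshold(y_true, y_prob)  # t ~= 0.40, perfect F-beta
```

With β=0.5 the score weights precision twice as heavily as recall, and labels for which no threshold reaches the floor come back as `None`, matching the 38 suppressed labels.
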
73
+
74
+ 516 granular tech/AI topic labels derived from a data-driven taxonomy built over the Metacurate document corpus.
75
+ 478 labels are active in production (38 suppressed by precision floor).
76
+
77
+ Example labels: `large language models`, `computer vision`, `reinforcement learning`,
78
+ `cybersecurity`, `semiconductor industry`, `natural language processing`,
79
+ `autonomous vehicles`, `quantum computing`, `blockchain technology`, `robotics`, ...
80
+
81
+ Full label list: see `label_list.json` in this repository.
82
+
83
+ ## Usage
84
+
85
+ ### Direct inference with `transformers`
86
+
87
+ ```python
88
+ import json
89
+ import torch
90
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
91
+
92
+ model_id = "metacurate/topic-classifier-v13"
93
+
94
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
95
+ model = AutoModelForSequenceClassification.from_pretrained(model_id)
96
+ model.eval()
97
+
98
+ # Load labels and precision thresholds
99
+ with open("label_list.json") as f:
100
+ labels = json.load(f)
101
+
102
+ with open("thresholds_precision.json") as f:
103
+ thresh_data = json.load(f)
104
+ thresholds = dict(zip(thresh_data["labels"], thresh_data["thresholds"]))
105
+
106
+ text = "OpenAI released GPT-5 with improved reasoning and coding capabilities."
107
+
108
+ inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
109
+ with torch.no_grad():
110
+ logits = model(**inputs).logits
111
+ probs = torch.sigmoid(logits).squeeze().tolist()
112
+
113
+ active = [
114
+ (label, round(score, 4))
115
+ for label, score in zip(labels, probs)
116
+ if score >= thresholds.get(label, 1.0)
117
+ ]
118
+ print(active)
119
+ # e.g. [('large language models', 0.9123), ('AI startups', 0.8741), ...]
120
+ ```
121
+
122
+ ### Via the Metacurate inference service
123
+
124
+ The model is served via a Modal FastAPI endpoint with per-label precision thresholds
125
+ applied server-side. The service accepts a batch of texts and returns labels and scores
126
+ per text.
127
+
128
+ ## Training data
129
+
130
+ - ~13,000 real documents labeled by GPT-4.1-mini using the full taxonomy
131
+ - ~2,700 supplementary synthetic records for low-support labels (generated with GPT-4.1-mini)
132
+ - 15% stratified held-out validation split
133
+
134
+ ## Intended use
135
+
136
+ Designed for classifying tech/AI web articles at ingestion time. Input is the full
137
+ multilingual document text (title + body), translated to English where needed.
138
+ Output is a set of topic labels for each document.
139
+
140
+ **Not recommended for:**
141
+ - General-purpose multi-label classification outside the tech/AI domain
142
+ - Documents shorter than ~50 words (label coverage degrades)
143
+
144
+ ## Limitations
145
+
146
+ - Taxonomy is tech/AI-centric; coverage of other domains is limited
147
+ - 38 labels are suppressed in production due to insufficient training data precision
148
+ - Performance varies by label; rare topics (< 50 training examples) have lower recall
149
+ - Thresholds were tuned on a held-out split from the same distribution — out-of-distribution generalization is untested
150
+
151
+ ## Citation
152
+
153
+ Developed at Metacurate for internal use. Not peer-reviewed.