---
language:
- multilingual
license: apache-2.0
base_model: answerdotai/ModernBERT-large
tags:
- text-classification
- multi-label-classification
- topic-classification
- modernbert
- metacurate
pipeline_tag: text-classification
model-index:
- name: topic-classifier-v13
  results:
  - task:
      type: text-classification
      name: Multi-label Topic Classification
    metrics:
    - type: f1
      value: 0.7017
      name: Tuned Macro F1 (F1-optimized thresholds)
    - type: precision
      value: 0.7578
      name: Macro Precision (precision-biased thresholds)
    - type: f1
      value: 0.6409
      name: Tuned Micro F1
---

# topic-classifier-v13

Multi-label topic classifier for tech/AI web content, fine-tuned from
[answerdotai/ModernBERT-large](https://huggingface.co/answerdotai/ModernBERT-large).

Developed by [Metacurate](https://metacurate.io) to classify ingested web documents
into 516 granular tech/AI topic labels, supporting content discovery and filtering.

## Model details

| Property | Value |
|---|---|
| Base model | `answerdotai/ModernBERT-large` |
| Task | Multi-label text classification |
| Labels | 516 total / 478 active |
| Max input length | 8,192 tokens |
| Languages | Multilingual (trained on EN-translated text) |
| Training epochs | 15 |
| Learning rate | 2e-5 |
| Batch size | 16 |
| Warmup ratio | 0.1 |
| Positive weight cap | 100.0 |
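The "positive weight cap" row is not explained elsewhere in this card; a reasonable reading (an assumption about the training setup, not a documented detail) is that per-label `pos_weight` values for a weighted BCE loss were computed as the negative-to-positive ratio and clipped at 100, so extremely rare labels cannot dominate the gradient. A minimal sketch:

```python
import torch

def capped_pos_weights(label_matrix: torch.Tensor, cap: float = 100.0) -> torch.Tensor:
    """Per-label pos_weight = (#negatives / #positives), clipped at `cap`.

    label_matrix: (num_examples, num_labels) multi-hot 0/1 targets.
    """
    pos = label_matrix.sum(dim=0).clamp(min=1.0)  # avoid division by zero
    neg = label_matrix.shape[0] - pos
    return (neg / pos).clamp(max=cap)

# Toy example: 3 labels over 6 examples; label 2 has a single positive
y = torch.tensor([
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [1, 0, 0],
], dtype=torch.float32)

weights = capped_pos_weights(y, cap=2.0)  # -> tensor([0.5, 2.0, 2.0]); label 2's 5.0 is capped
loss_fn = torch.nn.BCEWithLogitsLoss(pos_weight=weights)
```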

## Performance

Evaluated on a held-out 15% stratified validation split.

| Threshold strategy | Macro F1 | Micro F1 | Macro Precision |
|---|---|---|---|
| Raw (0.5) | 0.6497 | 0.6130 | — |
| F1-optimized per-label thresholds | **0.7017** | 0.6409 | — |
| Precision-biased thresholds (F-beta=0.5, floor=0.5) | 0.6589 | 0.6287 | **0.7578** |

The model ships with two threshold files:
- `thresholds.json` — per-label thresholds that maximize F1
- `thresholds_precision.json` — per-label thresholds tuned for F-beta (β=0.5, precision floor=0.5)

For production use, `thresholds_precision.json` is recommended: it suppresses 38 low-precision labels
entirely and raises thresholds on the remaining 478, trading a small F1 reduction for substantially
higher precision (~75.8% macro precision).
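Both files are consumed the same way by the usage example in this card; judging from that snippet, the layout is two parallel arrays (this sketch is an assumption based on how the file is read, not a verbatim excerpt):

```json
{
  "labels": ["large language models", "computer vision", "cybersecurity"],
  "thresholds": [0.62, 0.55, 0.5]
}
```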

## Labels

516 granular tech/AI topic labels derived from a data-driven taxonomy built over the Metacurate document corpus.
478 labels are active in production (38 suppressed by precision floor).

Example labels: `large language models`, `computer vision`, `reinforcement learning`,
`cybersecurity`, `semiconductor industry`, `natural language processing`,
`autonomous vehicles`, `quantum computing`, `blockchain technology`, `robotics`, ...

Full label list: see `label_list.json` in this repository.

## Usage

### Direct inference with `transformers`

```python
import json
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "metacurate/topic-classifier-v13"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

# Fetch labels and precision thresholds from the model repository
with open(hf_hub_download(model_id, "label_list.json")) as f:
    labels = json.load(f)

with open(hf_hub_download(model_id, "thresholds_precision.json")) as f:
    thresh_data = json.load(f)
thresholds = dict(zip(thresh_data["labels"], thresh_data["thresholds"]))

text = "OpenAI released GPT-5 with improved reasoning and coding capabilities."

# The model accepts up to 8,192 tokens; truncate anything longer
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=8192)
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.sigmoid(logits).squeeze().tolist()

# A label is active when its sigmoid score clears its per-label threshold;
# labels absent from the threshold file are suppressed (default threshold 1.0)
active = [
    (label, round(score, 4))
    for label, score in zip(labels, probs)
    if score >= thresholds.get(label, 1.0)
]
print(active)
# e.g. [('large language models', 0.9123), ('AI startups', 0.8741), ...]
```

### Via the Metacurate inference service

The model is served via a Modal FastAPI endpoint with per-label precision thresholds
applied server-side. The service accepts a batch of texts and returns labels and scores
per text.
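The service contract is not documented here, so the endpoint URL and request/response schema below are illustrative assumptions only, sketched with the standard library:

```python
import json
from urllib.request import Request, urlopen

# Hypothetical endpoint -- the real Modal service URL and schema may differ
ENDPOINT = "https://example--topic-classifier.modal.run/classify"

def build_request(texts: list[str]) -> Request:
    """Package a batch of documents as a JSON POST body."""
    payload = json.dumps({"texts": texts}).encode("utf-8")
    return Request(ENDPOINT, data=payload,
                   headers={"Content-Type": "application/json"})

req = build_request([
    "OpenAI released GPT-5 with improved reasoning.",
    "New EUV lithography milestone announced by ASML.",
])
# response = json.load(urlopen(req))  # assumed: one {label: score} mapping per input
```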

## Training data

- ~13,000 real documents labeled by GPT-4.1-mini using the full taxonomy
- ~2,700 supplementary synthetic records for low-support labels (generated with GPT-4.1-mini)
- 15% stratified held-out validation split

## Intended use

Designed for classifying tech/AI web articles at ingestion time. Input is the full
multilingual document text (title + body), translated to English where needed.
Output is a set of topic labels for each document.

**Not recommended for:**
- General-purpose multi-label classification outside the tech/AI domain
- Documents shorter than ~50 words (label coverage degrades)

## Limitations

- Taxonomy is tech/AI-centric; coverage of other domains is limited
- 38 labels are suppressed in production because their validation precision fell below the 0.5 floor
- Performance varies by label; rare topics (< 50 training examples) have lower recall
- Thresholds were tuned on a held-out split from the same distribution — out-of-distribution generalization is untested

## Citation

Developed at Metacurate for internal use. Not peer-reviewed.