---
language:
- multilingual
license: apache-2.0
base_model: answerdotai/ModernBERT-large
tags:
- text-classification
- multi-label-classification
- topic-classification
- modernbert
- metacurate
pipeline_tag: text-classification
model-index:
- name: topic-classifier-v13
results:
- task:
type: text-classification
name: Multi-label Topic Classification
metrics:
- type: f1
value: 0.7017
name: Tuned Macro F1 (F1-optimized thresholds)
- type: precision
value: 0.7578
name: Macro Precision (precision-biased thresholds)
- type: f1
value: 0.6409
name: Tuned Micro F1
---
# topic-classifier-v13
Multi-label topic classifier for tech/AI web content, fine-tuned from
[answerdotai/ModernBERT-large](https://huggingface.co/answerdotai/ModernBERT-large).
Developed by [Metacurate](https://metacurate.io) to classify ingested web documents
into 516 granular tech/AI topic labels, supporting content discovery and filtering.
## Model details
| Property | Value |
|---|---|
| Base model | `answerdotai/ModernBERT-large` |
| Task | Multi-label text classification |
| Labels | 516 total / 478 active |
| Max input length | 8,192 tokens |
| Languages | Multilingual (trained on EN-translated text) |
| Training epochs | 15 |
| Learning rate | 2e-5 |
| Batch size | 16 |
| Warmup ratio | 0.1 |
| Positive weight cap | 100.0 |
## Performance
Evaluated on a held-out 15% stratified validation split.
| Threshold strategy | Macro F1 | Micro F1 | Macro Precision |
|---|---|---|---|
| Raw (0.5) | 0.6497 | 0.6130 | — |
| F1-optimized per-label thresholds | **0.7017** | 0.6409 | — |
| Precision-biased thresholds (F-beta=0.5, floor=0.5) | 0.6589 | 0.6287 | **0.7578** |
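To make the table's two averages concrete: macro F1 averages per-label F1 scores, so rare labels count as much as common ones, while micro F1 pools true/false positives and negatives across all labels before computing a single F1. A minimal NumPy sketch (illustrative, not the evaluation code used for this model):

```python
import numpy as np

def multilabel_f1(y_true, y_pred):
    """Return (macro F1, micro F1) for binary multi-label matrices
    of shape (n_samples, n_labels)."""
    tp = np.sum((y_pred == 1) & (y_true == 1), axis=0)
    fp = np.sum((y_pred == 1) & (y_true == 0), axis=0)
    fn = np.sum((y_pred == 0) & (y_true == 1), axis=0)
    # Macro: per-label F1, then unweighted mean over labels.
    per_label = 2 * tp / np.maximum(2 * tp + fp + fn, 1)
    macro = per_label.mean()
    # Micro: pool counts across labels, then one global F1.
    micro = 2 * tp.sum() / max(2 * tp.sum() + fp.sum() + fn.sum(), 1)
    return macro, micro
```

With 516 labels of very uneven frequency, macro F1 is the more demanding number here, which is why both are reported.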
The model ships with two threshold files:
- `thresholds.json` — per-label thresholds that maximize F1
- `thresholds_precision.json` — per-label thresholds tuned for F-beta (β=0.5, precision floor=0.5)
For production use, `thresholds_precision.json` is recommended: it suppresses 38 low-precision labels
entirely and raises thresholds on the remaining 478, trading a small F1 reduction for substantially
higher precision (~75.8% macro precision).
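The exact tuning procedure is not published with the model, but a per-label F-beta search with a precision floor can be sketched as follows (a minimal illustration, assuming validation probabilities and binary labels as NumPy matrices; grid and function names are not from the model's training code):

```python
import numpy as np

def tune_thresholds(probs, y_true, beta=0.5, precision_floor=0.5):
    """Per-label threshold search: pick the threshold maximizing
    F-beta (beta=0.5 weights precision over recall).  Labels that
    never reach the precision floor are suppressed by setting their
    threshold above 1.0 so they can never fire."""
    n_labels = y_true.shape[1]
    thresholds = np.full(n_labels, 1.1)  # >1.0 means "suppressed"
    grid = np.linspace(0.05, 0.95, 19)
    for j in range(n_labels):
        best_fb = -1.0
        for t in grid:
            pred = probs[:, j] >= t
            tp = np.sum(pred & (y_true[:, j] == 1))
            fp = np.sum(pred & (y_true[:, j] == 0))
            fn = np.sum(~pred & (y_true[:, j] == 1))
            if tp + fp == 0:
                continue  # threshold fires on nothing
            precision = tp / (tp + fp)
            recall = tp / (tp + fn) if tp + fn else 0.0
            if precision < precision_floor or precision + recall == 0:
                continue  # below the floor: candidate rejected
            fb = ((1 + beta**2) * precision * recall
                  / (beta**2 * precision + recall))
            if fb > best_fb:
                best_fb, thresholds[j] = fb, t
    return thresholds
```

Under this scheme, a label whose precision never clears the floor at any threshold ends up suppressed, which matches the 38 suppressed labels described above.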
## Labels
516 granular tech/AI topic labels derived from a data-driven taxonomy built over the Metacurate document corpus.
478 labels are active in production (38 suppressed by precision floor).
Example labels: `large language models`, `computer vision`, `reinforcement learning`,
`cybersecurity`, `semiconductor industry`, `natural language processing`,
`autonomous vehicles`, `quantum computing`, `blockchain technology`, `robotics`, ...
Full label list: see `label_list.json` in this repository.
## Usage
### Direct inference with `transformers`
```python
import json
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
model_id = "metacurate/topic-classifier-v13"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()
# Load labels and precision thresholds
with open("label_list.json") as f:
labels = json.load(f)
with open("thresholds_precision.json") as f:
thresh_data = json.load(f)
thresholds = dict(zip(thresh_data["labels"], thresh_data["thresholds"]))
text = "OpenAI released GPT-5 with improved reasoning and coding capabilities."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)  # model supports up to 8,192 tokens; raise if needed
with torch.no_grad():
logits = model(**inputs).logits
probs = torch.sigmoid(logits).squeeze().tolist()
active = [
(label, round(score, 4))
for label, score in zip(labels, probs)
if score >= thresholds.get(label, 1.0)
]
print(active)
# e.g. [('large language models', 0.9123), ('AI startups', 0.8741), ...]
```
### Via the Metacurate inference service
The model is served via a Modal FastAPI endpoint with per-label precision thresholds
applied server-side. The service accepts a batch of texts and returns labels and scores
per text.
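The endpoint URL and request/response schema are not documented here; the client sketch below is a hypothetical illustration using only the standard library, with the URL and the `{"texts": [...]}` payload shape assumed rather than taken from the actual service.

```python
import json
import urllib.request

# Hypothetical endpoint URL -- replace with the real Modal deployment.
ENDPOINT = "https://example--topic-classifier.modal.run/classify"

def build_request(texts):
    """Batch payload: the service accepts a list of texts per call."""
    return {"texts": list(texts)}

def classify_batch(texts, endpoint=ENDPOINT, timeout=30):
    """POST a batch of texts; precision thresholds are applied
    server-side, so the response already contains only active
    labels and their scores for each text."""
    data = json.dumps(build_request(texts)).encode("utf-8")
    req = urllib.request.Request(
        endpoint, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```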
## Training data
- ~13,000 real documents labeled by GPT-4.1-mini using the full taxonomy
- ~2,700 supplementary synthetic records for low-support labels (generated with GPT-4.1-mini)
- 15% stratified held-out validation split
## Intended use
Designed for classifying tech/AI web articles at ingestion time. Input is the full
multilingual document text (title + body), translated to English where needed.
Output is a set of topic labels for each document.
**Not recommended for:**
- General-purpose multi-label classification outside the tech/AI domain
- Documents shorter than ~50 words (label coverage degrades)
## Limitations
- Taxonomy is tech/AI-centric; coverage of other domains is limited
- 38 labels are suppressed in production because their validation precision fell below the floor, typically due to sparse training data
- Performance varies by label; rare topics (< 50 training examples) have lower recall
- Thresholds were tuned on a held-out split from the same distribution — out-of-distribution generalization is untested
## Citation
Developed at Metacurate for internal use. Not peer-reviewed.