---
language:
- multilingual
license: apache-2.0
base_model: answerdotai/ModernBERT-large
tags:
- text-classification
- multi-label-classification
- topic-classification
- modernbert
- metacurate
pipeline_tag: text-classification
model-index:
- name: topic-classifier-v13
results:
- task:
type: text-classification
name: Multi-label Topic Classification
metrics:
- type: f1
value: 0.7017
name: Tuned Macro F1 (F1-optimized thresholds)
- type: precision
value: 0.7578
name: Macro Precision (precision-biased thresholds)
- type: f1
value: 0.6409
name: Tuned Micro F1
---

# topic-classifier-v13
Multi-label topic classifier for tech/AI web content, fine-tuned from `answerdotai/ModernBERT-large`.
Developed by Metacurate to classify ingested web documents into 516 granular tech/AI topic labels, supporting content discovery and filtering.
## Model details
| Property | Value |
|---|---|
| Base model | answerdotai/ModernBERT-large |
| Task | Multi-label text classification |
| Labels | 516 total / 478 active |
| Max input length | 8,192 tokens |
| Languages | Multilingual (trained on EN-translated text) |
| Training epochs | 15 |
| Learning rate | 2e-5 |
| Batch size | 16 |
| Warmup ratio | 0.1 |
| Positive weight cap | 100.0 |
## Performance
Evaluated on a held-out 15% stratified validation split.
| Threshold strategy | Macro F1 | Micro F1 | Macro Precision |
|---|---|---|---|
| Raw (0.5) | 0.6497 | 0.6130 | — |
| F1-optimized per-label thresholds | 0.7017 | 0.6409 | — |
| Precision-biased thresholds (F-beta=0.5, floor=0.5) | 0.6589 | 0.6287 | 0.7578 |
The model ships with two threshold files:

- `thresholds.json`: per-label thresholds that maximize F1
- `thresholds_precision.json`: per-label thresholds tuned for F-beta (β=0.5, precision floor = 0.5)

For production use, `thresholds_precision.json` is recommended: it suppresses 38 low-precision labels entirely and raises thresholds on the remaining 478, trading a small F1 reduction for substantially higher precision (~75.8% macro precision).
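The precision-biased tuning described above can be sketched as a per-label threshold sweep. This is a minimal illustration, not the actual tuning script: the 0.01 grid, the sweep range, and first-wins tie-breaking are assumptions.

```python
def tune_threshold(y_true, y_prob, beta=0.5, precision_floor=0.5):
    """Sweep thresholds for ONE label and pick the one maximizing F-beta,
    subject to a precision floor.

    y_true: list of 0/1 ground-truth flags; y_prob: matching sigmoid scores.
    Returns (threshold, fbeta), or (None, 0.0) if no threshold meets the
    floor, i.e. the label would be suppressed (as with the 38 suppressed labels).
    """
    best_t, best_f = None, 0.0
    for step in range(5, 96):          # sweep t = 0.05 .. 0.95 (assumed grid)
        t = step / 100
        tp = sum(1 for yt, yp in zip(y_true, y_prob) if yp >= t and yt == 1)
        fp = sum(1 for yt, yp in zip(y_true, y_prob) if yp >= t and yt == 0)
        fn = sum(1 for yt, yp in zip(y_true, y_prob) if yp < t and yt == 1)
        if tp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        if precision < precision_floor:
            continue                   # enforce the precision floor
        b2 = beta ** 2                 # beta < 1 weights precision over recall
        fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
        if fbeta > best_f:
            best_t, best_f = t, fbeta
    return best_t, best_f
```

Running this once per label over validation scores yields a threshold map shaped like `thresholds_precision.json`; labels that return `None` are the ones dropped from production.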
## Labels
516 granular tech/AI topic labels derived from a data-driven taxonomy built over the Metacurate document corpus. 478 labels are active in production (38 suppressed by precision floor).
Example labels: large language models, computer vision, reinforcement learning,
cybersecurity, semiconductor industry, natural language processing,
autonomous vehicles, quantum computing, blockchain technology, robotics, ...
Full label list: see `label_list.json` in this repository.
## Usage

### Direct inference with `transformers`
```python
import json
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "metacurate/topic-classifier-v13"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

# Load labels and precision-biased thresholds (shipped in this repository)
with open("label_list.json") as f:
    labels = json.load(f)
with open("thresholds_precision.json") as f:
    thresh_data = json.load(f)
thresholds = dict(zip(thresh_data["labels"], thresh_data["thresholds"]))

text = "OpenAI released GPT-5 with improved reasoning and coding capabilities."
# Truncated here at 4,096 tokens; the model supports inputs up to 8,192.
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)

with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.sigmoid(logits).squeeze().tolist()

# Keep labels whose score clears their per-label threshold; suppressed labels
# (absent from the threshold file) default to an unreachable threshold of 1.0.
active = [
    (label, round(score, 4))
    for label, score in zip(labels, probs)
    if score >= thresholds.get(label, 1.0)
]
print(active)
# e.g. [('large language models', 0.9123), ('AI startups', 0.8741), ...]
```
### Via the Metacurate inference service
The model is served via a Modal FastAPI endpoint with per-label precision thresholds applied server-side. The service accepts a batch of texts and returns labels and scores per text.
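A client for such a batch endpoint might look like the following sketch. The field names (`texts`, `results`, `labels`, `scores`) and response shape are illustrative assumptions, not the documented Metacurate API schema; substitute the real endpoint URL and contract.

```python
import json

def build_batch_request(texts):
    """Assemble a JSON payload for a batch classification call.

    The "texts" field name is an assumed request schema.
    """
    return json.dumps({"texts": texts})

def parse_batch_response(body):
    """Map a hypothetical response of the form
    {"results": [{"labels": [...], "scores": [...]}, ...]}
    into one {label: score} dict per input text."""
    results = json.loads(body)["results"]
    return [dict(zip(r["labels"], r["scores"])) for r in results]

# Usage (endpoint URL is a placeholder):
#   resp = requests.post("https://<modal-app-url>/classify",
#                        data=build_batch_request(["some article text"]))
#   per_doc_labels = parse_batch_response(resp.text)
```

Because the precision thresholds are applied server-side, the client only sees labels that already cleared their per-label cutoff.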
## Training data
- ~13,000 real documents labeled by GPT-4.1-mini using the full taxonomy
- ~2,700 supplementary synthetic records for low-support labels (generated with GPT-4.1-mini)
- 15% stratified held-out validation split
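A multi-label stratified split along these lines can be sketched greedily. This is a simplified stand-in for iterative stratification, not the actual split code; the ordering heuristic and "need" score are assumptions.

```python
import random
from collections import Counter

def stratified_multilabel_split(samples, val_ratio=0.15, seed=42):
    """Greedy multi-label stratified split.

    samples: list of (doc_id, label_set) pairs. Returns (train_ids, val_ids),
    with each label's occurrences roughly split val_ratio into validation.
    """
    rng = random.Random(seed)
    pool = list(samples)
    rng.shuffle(pool)  # randomize tie-breaking between equally rare samples
    freq = Counter(l for _, labels in pool for l in labels)
    # Handle samples whose rarest label is scarcest first, so low-support
    # labels claim their share of the validation split before common ones.
    pool.sort(key=lambda s: min((freq[l] for l in s[1]), default=float("inf")))
    val_target = {l: c * val_ratio for l, c in freq.items()}
    val_count = Counter()
    max_val = len(pool) * val_ratio
    train_ids, val_ids = [], []
    for doc_id, labels in pool:
        # Positive "need" means the validation split is still short of its
        # per-label targets for this sample's labels.
        need = sum(val_target[l] - val_count[l] for l in labels)
        if need > 0 and len(val_ids) < max_val:
            val_ids.append(doc_id)
            val_count.update(labels)
        else:
            train_ids.append(doc_id)
    return train_ids, val_ids
```

Published iterative-stratification algorithms refine this idea with per-label desire updates; the sketch above only conveys the principle.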
## Intended use
Designed for classifying tech/AI web articles at ingestion time. Input is the full multilingual document text (title + body), translated to English where needed. Output is a set of topic labels for each document.
Not recommended for:
- General-purpose multi-label classification outside the tech/AI domain
- Documents shorter than ~50 words (label coverage degrades)
## Limitations
- Taxonomy is tech/AI-centric; coverage of other domains is limited
- 38 labels are suppressed in production because their validation precision fell below the 0.5 floor
- Performance varies by label; rare topics (< 50 training examples) have lower recall
- Thresholds were tuned on a held-out split from the same distribution — out-of-distribution generalization is untested
## Citation
Developed at Metacurate for internal use. Not peer-reviewed.