topic-classifier-v13

Multi-label topic classifier for tech/AI web content, fine-tuned from answerdotai/ModernBERT-large.

Developed by Metacurate to classify ingested web documents into 516 granular tech/AI topic labels, supporting content discovery and filtering.

Model details

Property             Value
Base model           answerdotai/ModernBERT-large
Task                 Multi-label text classification
Labels               516 total / 478 active
Max input length     8,192 tokens
Languages            Multilingual (trained on EN-translated text)
Parameters           ~0.4B (FP32)
Training epochs      15
Learning rate        2e-5
Batch size           16
Warmup ratio         0.1
Positive weight cap  100.0
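The training recipe is not fully specified here; a plausible reading of the "Positive weight cap" row is per-label positive-class weights (negative-to-positive ratio, capped at 100.0), as passed via pos_weight to a weighted BCE loss such as torch.nn.BCEWithLogitsLoss. A minimal sketch with a hypothetical pos_weights helper:

```python
def pos_weights(label_matrix, cap=100.0):
    """Per-label positive-class weights: negatives/positives, capped.

    label_matrix: list of rows, each a list of 0/1 ints (docs x labels).
    Labels with zero positives fall back to the cap value.
    """
    n_docs = len(label_matrix)
    n_labels = len(label_matrix[0])
    weights = []
    for j in range(n_labels):
        pos = sum(row[j] for row in label_matrix)
        neg = n_docs - pos
        w = (neg / pos) if pos else cap
        weights.append(min(w, cap))
    return weights

# 4 documents x 3 labels: label 0 is common, label 1 rare, label 2 absent
Y = [
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(pos_weights(Y))  # → [0.3333333333333333, 3.0, 100.0]
```

Capping keeps extremely rare labels from dominating the loss, which matters with 516 labels of very uneven support.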

Performance

Evaluated on a held-out 15% stratified validation split.

Threshold strategy                                   Macro F1  Micro F1  Macro Precision
Raw (0.5)                                            0.6497    0.6130    n/a
F1-optimized per-label thresholds                    0.7017    0.6409    n/a
Precision-biased thresholds (F-beta=0.5, floor=0.5)  0.6589    0.6287    0.7578
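The tuning script itself is not included here, but per-label F1-optimized thresholds of the kind reported above can be found by sweeping candidate thresholds for each label on the validation split. A minimal single-label sketch (best_threshold is a hypothetical helper, not the actual tuning code):

```python
def best_threshold(scores, truths, candidates=None):
    """Pick the threshold that maximizes F1 for one label.

    scores: predicted probabilities; truths: 0/1 ground truth.
    """
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]
    best_t, best_f1 = 0.5, -1.0
    for t in candidates:
        tp = sum(1 for s, y in zip(scores, truths) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, truths) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, truths) if s < t and y == 1)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

scores = [0.9, 0.8, 0.35, 0.2, 0.1]
truths = [1, 1, 1, 0, 0]
t, f1 = best_threshold(scores, truths)
print(t, f1)  # → 0.21 1.0
```

The precision-biased variant would instead maximize F-beta with β=0.5 and discard candidates whose precision falls below the 0.5 floor.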

The model ships with two threshold files:

  • thresholds.json – per-label thresholds that maximize F1
  • thresholds_precision.json – per-label thresholds tuned for F-beta (β=0.5, precision floor=0.5)

For production use, thresholds_precision.json is recommended: it suppresses 38 low-precision labels entirely and raises thresholds on the remaining 478, trading a small F1 reduction for substantially higher precision (~75.8% macro precision).
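The schema of the threshold files is not documented here; from the usage snippet below, a parallel-arrays layout like the following appears to be assumed (label names and values illustrative only):

```json
{
  "labels": ["large language models", "computer vision", "robotics"],
  "thresholds": [0.62, 0.55, 0.5]
}
```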

Labels

516 granular tech/AI topic labels derived from a data-driven taxonomy built over the Metacurate document corpus. 478 labels are active in production (38 suppressed by precision floor).

Example labels: large language models, computer vision, reinforcement learning, cybersecurity, semiconductor industry, natural language processing, autonomous vehicles, quantum computing, blockchain technology, robotics, ...

Full label list: see label_list.json in this repository.

Usage

Direct inference with transformers

import json
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "metacurate/topic-classifier-v13"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

# Load labels and precision thresholds (both JSON files ship in this repo;
# fetch them locally first, e.g. with huggingface_hub.hf_hub_download)
with open("label_list.json") as f:
    labels = json.load(f)

with open("thresholds_precision.json") as f:
    thresh_data = json.load(f)
thresholds = dict(zip(thresh_data["labels"], thresh_data["thresholds"]))

text = "OpenAI released GPT-5 with improved reasoning and coding capabilities."

# Truncating at 4,096 tokens here; the model accepts inputs up to 8,192
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.sigmoid(logits).squeeze().tolist()

active = [
    (label, round(score, 4))
    for label, score in zip(labels, probs)
    if score >= thresholds.get(label, 1.0)
]
print(active)
# e.g. [('large language models', 0.9123), ('AI startups', 0.8741), ...]

Via the Metacurate inference service

The model is served via a Modal FastAPI endpoint with per-label precision thresholds applied server-side. The service accepts a batch of texts and returns labels and scores per text.
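The endpoint URL and request schema are not documented here; a hypothetical client sketch, assuming the service accepts a JSON body of the form {"texts": [...]} (the URL and response shape below are placeholders, not the real API):

```python
import json
from urllib.request import Request, urlopen


def build_payload(texts):
    """JSON body for a batch classification request (assumed schema)."""
    return json.dumps({"texts": texts}).encode("utf-8")


def classify(texts, endpoint="https://example.modal.run/classify"):
    # endpoint is a placeholder; substitute the real service URL
    req = Request(
        endpoint,
        data=build_payload(texts),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req) as resp:
        # Assumed response shape: [{"labels": [...], "scores": [...]}, ...]
        return json.load(resp)
```

Because thresholds are applied server-side, clients receive only the labels that survive the precision floor.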

Training data

  • ~13,000 real documents labeled by GPT-4.1-mini using the full taxonomy
  • ~2,700 supplementary synthetic records for low-support labels (generated with GPT-4.1-mini)
  • 15% stratified held-out validation split

Intended use

Designed for classifying tech/AI web articles at ingestion time. Input is the full multilingual document text (title + body), translated to English where needed. Output is a set of topic labels for each document.

Not recommended for:

  • General-purpose multi-label classification outside the tech/AI domain
  • Documents shorter than ~50 words (label coverage degrades)

Limitations

  • Taxonomy is tech/AI-centric; coverage of other domains is limited
  • 38 labels are suppressed in production because their validation precision fell below the 0.5 floor
  • Performance varies by label; rare topics (< 50 training examples) have lower recall
  • Thresholds were tuned on a held-out split from the same distribution; out-of-distribution generalization is untested

Citation

Developed at Metacurate for internal use. Not peer-reviewed.
