---
language:
- multilingual
license: apache-2.0
base_model: answerdotai/ModernBERT-large
tags:
- text-classification
- multi-label-classification
- topic-classification
- modernbert
- metacurate
pipeline_tag: text-classification
model-index:
- name: topic-classifier-v13
results:
- task:
type: text-classification
name: Multi-label Topic Classification
metrics:
- type: f1
value: 0.7017
name: Tuned Macro F1 (F1-optimized thresholds)
- type: precision
value: 0.7578
name: Macro Precision (precision-biased thresholds)
- type: f1
value: 0.6409
name: Tuned Micro F1
---

# topic-classifier-v13
Multi-label topic classifier for tech/AI web content, fine-tuned from `answerdotai/ModernBERT-large`.
Developed by Metacurate to classify ingested web documents into 516 granular tech/AI topic labels, supporting content discovery and filtering.
## Model details
| Property | Value |
|---|---|
| Base model | answerdotai/ModernBERT-large |
| Task | Multi-label text classification |
| Labels | 516 total / 478 active |
| Max input length | 8,192 tokens |
| Languages | Multilingual (trained on EN-translated text) |
| Training epochs | 15 |
| Learning rate | 2e-5 |
| Batch size | 16 |
| Warmup ratio | 0.1 |
| Positive weight cap | 100.0 |
## Performance
Evaluated on a held-out 15% stratified validation split.
| Threshold strategy | Macro F1 | Micro F1 | Macro Precision |
|---|---|---|---|
| Raw (0.5) | 0.6497 | 0.6130 | — |
| F1-optimized per-label thresholds | 0.7017 | 0.6409 | — |
| Precision-biased thresholds (F-beta=0.5, floor=0.5) | 0.6589 | 0.6287 | 0.7578 |
The model ships with two threshold files:

- `thresholds.json`: per-label thresholds that maximize F1
- `thresholds_precision.json`: per-label thresholds tuned for F-beta (β=0.5, precision floor = 0.5)

For production use, `thresholds_precision.json` is recommended: it suppresses 38 low-precision labels entirely and raises thresholds on the remaining 478, trading a small F1 reduction for substantially higher precision (~75.8% macro precision).
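The precision-biased tuning described above can be sketched as a per-label threshold sweep. This is a minimal illustration, not the actual tuning script: the 0.01 grid, the sweep range, and first-wins tie-breaking are assumptions.

```python
def tune_threshold(y_true, y_prob, beta=0.5, precision_floor=0.5):
    """Sweep thresholds for ONE label and pick the one maximizing F-beta,
    subject to a precision floor.

    y_true: list of 0/1 ground-truth flags; y_prob: matching sigmoid scores.
    Returns (threshold, fbeta), or (None, 0.0) if no threshold meets the
    floor, i.e. the label would be suppressed (as with the 38 suppressed labels).
    """
    best_t, best_f = None, 0.0
    for step in range(5, 96):          # sweep t = 0.05 .. 0.95 (assumed grid)
        t = step / 100
        tp = sum(1 for yt, yp in zip(y_true, y_prob) if yp >= t and yt == 1)
        fp = sum(1 for yt, yp in zip(y_true, y_prob) if yp >= t and yt == 0)
        fn = sum(1 for yt, yp in zip(y_true, y_prob) if yp < t and yt == 1)
        if tp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        if precision < precision_floor:
            continue                   # enforce the precision floor
        b2 = beta ** 2                 # beta < 1 weights precision over recall
        fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
        if fbeta > best_f:
            best_t, best_f = t, fbeta
    return best_t, best_f
```

Running this once per label over validation scores yields a threshold map shaped like `thresholds_precision.json`; labels that return `None` are the ones dropped from production.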
## Labels
516 granular tech/AI topic labels derived from a data-driven taxonomy built over the Metacurate document corpus. 478 labels are active in production (38 suppressed by precision floor).
Example labels: large language models, computer vision, reinforcement learning,
cybersecurity, semiconductor industry, natural language processing,
autonomous vehicles, quantum computing, blockchain technology, robotics, ...
Full label list: see `label_list.json` in this repository.
## Usage

### Direct inference with `transformers`
```python
import json
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "metacurate/topic-classifier-v13"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

# Load labels and precision-biased thresholds (shipped in this repository)
with open("label_list.json") as f:
    labels = json.load(f)
with open("thresholds_precision.json") as f:
    thresh_data = json.load(f)
thresholds = dict(zip(thresh_data["labels"], thresh_data["thresholds"]))

text = "OpenAI released GPT-5 with improved reasoning and coding capabilities."
# Truncated here at 4,096 tokens; the model supports inputs up to 8,192.
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)

with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.sigmoid(logits).squeeze().tolist()

# Keep labels whose score clears their per-label threshold; suppressed labels
# (absent from the threshold file) default to an unreachable threshold of 1.0.
active = [
    (label, round(score, 4))
    for label, score in zip(labels, probs)
    if score >= thresholds.get(label, 1.0)
]
print(active)
# e.g. [('large language models', 0.9123), ('AI startups', 0.8741), ...]
```
### Via the Metacurate inference service
The model is served via a Modal FastAPI endpoint with per-label precision thresholds applied server-side. The service accepts a batch of texts and returns labels and scores per text.
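A client for such a batch endpoint might look like the following sketch. The field names (`texts`, `results`, `labels`, `scores`) and response shape are illustrative assumptions, not the documented Metacurate API schema; substitute the real endpoint URL and contract.

```python
import json

def build_batch_request(texts):
    """Assemble a JSON payload for a batch classification call.

    The "texts" field name is an assumed request schema.
    """
    return json.dumps({"texts": texts})

def parse_batch_response(body):
    """Map a hypothetical response of the form
    {"results": [{"labels": [...], "scores": [...]}, ...]}
    into one {label: score} dict per input text."""
    results = json.loads(body)["results"]
    return [dict(zip(r["labels"], r["scores"])) for r in results]

# Usage (endpoint URL is a placeholder):
#   resp = requests.post("https://<modal-app-url>/classify",
#                        data=build_batch_request(["some article text"]))
#   per_doc_labels = parse_batch_response(resp.text)
```

Because the precision thresholds are applied server-side, the client only sees labels that already cleared their per-label cutoff.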
## Training data
- ~13,000 real documents labeled by GPT-4.1-mini using the full taxonomy
- ~2,700 supplementary synthetic records for low-support labels (generated with GPT-4.1-mini)
- 15% stratified held-out validation split
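A multi-label stratified split along these lines can be sketched greedily. This is a simplified stand-in for iterative stratification, not the actual split code; the ordering heuristic and "need" score are assumptions.

```python
import random
from collections import Counter

def stratified_multilabel_split(samples, val_ratio=0.15, seed=42):
    """Greedy multi-label stratified split.

    samples: list of (doc_id, label_set) pairs. Returns (train_ids, val_ids),
    with each label's occurrences roughly split val_ratio into validation.
    """
    rng = random.Random(seed)
    pool = list(samples)
    rng.shuffle(pool)  # randomize tie-breaking between equally rare samples
    freq = Counter(l for _, labels in pool for l in labels)
    # Handle samples whose rarest label is scarcest first, so low-support
    # labels claim their share of the validation split before common ones.
    pool.sort(key=lambda s: min((freq[l] for l in s[1]), default=float("inf")))
    val_target = {l: c * val_ratio for l, c in freq.items()}
    val_count = Counter()
    max_val = len(pool) * val_ratio
    train_ids, val_ids = [], []
    for doc_id, labels in pool:
        # Positive "need" means the validation split is still short of its
        # per-label targets for this sample's labels.
        need = sum(val_target[l] - val_count[l] for l in labels)
        if need > 0 and len(val_ids) < max_val:
            val_ids.append(doc_id)
            val_count.update(labels)
        else:
            train_ids.append(doc_id)
    return train_ids, val_ids
```

Published iterative-stratification algorithms refine this idea with per-label desire updates; the sketch above only conveys the principle.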
## Intended use
Designed for classifying tech/AI web articles at ingestion time. Input is the full multilingual document text (title + body), translated to English where needed. Output is a set of topic labels for each document.
Not recommended for:
- General-purpose multi-label classification outside the tech/AI domain
- Documents shorter than ~50 words (label coverage degrades)
## Limitations
- Taxonomy is tech/AI-centric; coverage of other domains is limited
- 38 labels are suppressed in production because their validation precision fell below the 0.5 floor
- Performance varies by label; rare topics (< 50 training examples) have lower recall
- Thresholds were tuned on a held-out split from the same distribution — out-of-distribution generalization is untested
## Citation
Developed at Metacurate for internal use. Not peer-reviewed.