---
language:
- multilingual
license: apache-2.0
base_model: answerdotai/ModernBERT-large
tags:
- text-classification
- multi-label-classification
- topic-classification
- modernbert
- metacurate
pipeline_tag: text-classification
model-index:
- name: topic-classifier-v13
results:
- task:
type: text-classification
name: Multi-label Topic Classification
metrics:
- type: f1
value: 0.7017
name: Tuned Macro F1 (F1-optimized thresholds)
- type: precision
value: 0.7578
name: Macro Precision (precision-biased thresholds)
- type: f1
value: 0.6409
name: Tuned Micro F1
---
# topic-classifier-v13
Multi-label topic classifier for tech/AI web content, fine-tuned from
[answerdotai/ModernBERT-large](https://huggingface.co/answerdotai/ModernBERT-large).
Developed by [Metacurate](https://metacurate.io) to classify ingested web documents
into 516 granular tech/AI topic labels, supporting content discovery and filtering.
## Model details
| Property | Value |
|---|---|
| Base model | `answerdotai/ModernBERT-large` |
| Task | Multi-label text classification |
| Labels | 516 total / 478 active |
| Max input length | 8,192 tokens |
| Languages | Multilingual (trained on EN-translated text) |
| Training epochs | 15 |
| Learning rate | 2e-5 |
| Batch size | 16 |
| Warmup ratio | 0.1 |
| Positive weight cap | 100.0 |
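The positive weight cap limits the per-label `pos_weight` passed to the BCE-with-logits loss. The exact weighting scheme is not documented in this card; a minimal sketch, assuming weights are the negative/positive count ratio per label clamped at the cap:

```python
def pos_weights(label_pos_counts, n_docs, cap=100.0):
    """Per-label positive weights for BCE-with-logits loss.

    Assumed scheme (not confirmed by this card): neg/pos count ratio per
    label, clamped at `cap` so that very rare labels do not dominate the
    gradient signal.
    """
    weights = []
    for pos in label_pos_counts:
        pos = max(pos, 1)  # guard against labels with zero positives
        weights.append(min((n_docs - pos) / pos, cap))
    return weights


# With 1,000 documents: a common label gets a small weight, rare labels hit the cap.
print(pos_weights([100, 5, 1], 1000))  # [9.0, 100.0, 100.0]
```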
## Performance
Evaluated on a held-out 15% stratified validation split.
| Threshold strategy | Macro F1 | Micro F1 | Macro Precision |
|---|---|---|---|
| Raw (0.5) | 0.6497 | 0.6130 | — |
| F1-optimized per-label thresholds | **0.7017** | 0.6409 | — |
| Precision-biased thresholds (F-beta=0.5, floor=0.5) | 0.6589 | 0.6287 | **0.7578** |
The model ships with two threshold files:
- `thresholds.json` — per-label thresholds that maximize F1
- `thresholds_precision.json` — per-label thresholds tuned for F-beta (β=0.5, precision floor=0.5)
For production use, `thresholds_precision.json` is recommended: it suppresses 38 low-precision labels
entirely and raises thresholds on the remaining 478, trading a small F1 reduction for substantially
higher precision (~75.8% macro precision).
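The precision-biased tuning described above can be illustrated as a per-label threshold sweep: for each label, pick the threshold maximizing F-beta (β=0.5) subject to the precision floor, and suppress the label entirely if no threshold qualifies. This is a simplified sketch, not the actual tuning code:

```python
def tune_threshold(y_true, y_prob, beta=0.5, precision_floor=0.5):
    """Pick the threshold maximizing F-beta, subject to precision >= floor.

    Returns a threshold above 1.0 (label never fires, i.e. suppressed)
    if no candidate threshold meets the precision floor.
    """
    best_t, best_fb = None, -1.0
    for i in range(1, 100):
        t = i / 100
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= t)
        fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= t)
        fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p < t)
        if tp + fp == 0:
            continue  # nothing predicted at this threshold
        precision = tp / (tp + fp)
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision < precision_floor or precision + recall == 0:
            continue
        b2 = beta * beta
        fb = (1 + b2) * precision * recall / (b2 * precision + recall)
        if fb > best_fb:
            best_fb, best_t = fb, t
    return best_t if best_t is not None else 1.01  # suppress label
```

Run per label over the validation split, this yields one threshold per label; labels returning a value above 1.0 are the suppressed ones.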
## Labels
516 granular tech/AI topic labels derived from a data-driven taxonomy built over the Metacurate document corpus.
478 labels are active in production; 38 are suppressed by the precision floor.
Example labels: `large language models`, `computer vision`, `reinforcement learning`,
`cybersecurity`, `semiconductor industry`, `natural language processing`,
`autonomous vehicles`, `quantum computing`, `blockchain technology`, `robotics`, ...
Full label list: see `label_list.json` in this repository.
## Usage
### Direct inference with `transformers`
```python
import json
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
model_id = "metacurate/topic-classifier-v13"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()
# Load labels and precision thresholds
with open("label_list.json") as f:
    labels = json.load(f)
with open("thresholds_precision.json") as f:
    thresh_data = json.load(f)
thresholds = dict(zip(thresh_data["labels"], thresh_data["thresholds"]))

text = "OpenAI released GPT-5 with improved reasoning and coding capabilities."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.sigmoid(logits).squeeze().tolist()

active = [
    (label, round(score, 4))
    for label, score in zip(labels, probs)
    if score >= thresholds.get(label, 1.0)
]
print(active)
# e.g. [('large language models', 0.9123), ('AI startups', 0.8741), ...]
```
### Via the Metacurate inference service
The model is served via a Modal FastAPI endpoint with per-label precision thresholds
applied server-side. The service accepts a batch of texts and returns labels and scores
per text.
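The endpoint URL and payload schema are not documented in this card; the shapes below are purely hypothetical, to illustrate the batch-in, labels-and-scores-out contract described above:

```python
import json

# Hypothetical request payload: a batch of document texts.
payload = {"texts": [
    "OpenAI released GPT-5 with improved reasoning and coding capabilities.",
    "A new RISC-V accelerator targets edge inference workloads.",
]}
body = json.dumps(payload)

# A response might plausibly look like (field names are assumptions):
# {"results": [{"labels": ["large language models"], "scores": [0.91]}, ...]}
parsed = json.loads(body)
print(len(parsed["texts"]))  # 2
```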
## Training data
- ~13,000 real documents labeled by GPT-4.1-mini using the full taxonomy
- ~2,700 supplementary synthetic records for low-support labels (generated with GPT-4.1-mini)
- 15% stratified held-out validation split
## Intended use
Designed for classifying tech/AI web articles at ingestion time. Input is the full
multilingual document text (title + body), translated to English where needed.
Output is a set of topic labels for each document.
**Not recommended for:**
- General-purpose multi-label classification outside the tech/AI domain
- Documents shorter than ~50 words (label coverage degrades)
## Limitations
- Taxonomy is tech/AI-centric; coverage of other domains is limited
- 38 labels are suppressed in production because their validation precision fell below the floor (typically due to insufficient training data)
- Performance varies by label; rare topics (< 50 training examples) have lower recall
- Thresholds were tuned on a held-out split from the same distribution — out-of-distribution generalization is untested
## Citation
Developed at Metacurate for internal use. Not peer-reviewed.