---
language:
- multilingual
license: apache-2.0
base_model: answerdotai/ModernBERT-large
tags:
- text-classification
- multi-label-classification
- topic-classification
- modernbert
- metacurate
pipeline_tag: text-classification
model-index:
- name: topic-classifier-v13
  results:
  - task:
      type: text-classification
      name: Multi-label Topic Classification
    metrics:
    - type: f1
      value: 0.7017
      name: Tuned Macro F1 (F1-optimized thresholds)
    - type: precision
      value: 0.7578
      name: Macro Precision (precision-biased thresholds)
    - type: f1
      value: 0.6409
      name: Tuned Micro F1
---

# topic-classifier-v13

Multi-label topic classifier for tech/AI web content, fine-tuned from
[answerdotai/ModernBERT-large](https://huggingface.co/answerdotai/ModernBERT-large).

Developed by [Metacurate](https://metacurate.io) to classify ingested web documents
into 516 granular tech/AI topic labels, supporting content discovery and filtering.

## Model details

| Property | Value |
|---|---|
| Base model | `answerdotai/ModernBERT-large` |
| Task | Multi-label text classification |
| Labels | 516 total / 478 active |
| Max input length | 8,192 tokens |
| Languages | Multilingual (trained on EN-translated text) |
| Training epochs | 15 |
| Learning rate | 2e-5 |
| Batch size | 16 |
| Warmup ratio | 0.1 |
| Positive weight cap | 100.0 |
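The "positive weight cap" row is not explained elsewhere in this card; a reasonable reading (an assumption about the training setup, not a documented detail) is that per-label `pos_weight` values for a weighted BCE loss were computed as the negative-to-positive ratio and clipped at 100, so extremely rare labels cannot dominate the gradient. A minimal sketch:

```python
import torch

def capped_pos_weights(label_matrix: torch.Tensor, cap: float = 100.0) -> torch.Tensor:
    """Per-label pos_weight = (#negatives / #positives), clipped at `cap`.

    label_matrix: (num_examples, num_labels) multi-hot 0/1 targets.
    """
    pos = label_matrix.sum(dim=0).clamp(min=1.0)  # avoid division by zero
    neg = label_matrix.shape[0] - pos
    return (neg / pos).clamp(max=cap)

# Toy example: 3 labels over 6 examples; label 2 has a single positive
y = torch.tensor([
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [1, 0, 0],
], dtype=torch.float32)

weights = capped_pos_weights(y, cap=2.0)  # -> tensor([0.5, 2.0, 2.0]); label 2's 5.0 is capped
loss_fn = torch.nn.BCEWithLogitsLoss(pos_weight=weights)
```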

## Performance

Evaluated on a held-out 15% stratified validation split.

| Threshold strategy | Macro F1 | Micro F1 | Macro Precision |
|---|---|---|---|
| Raw (0.5) | 0.6497 | 0.6130 | — |
| F1-optimized per-label thresholds | **0.7017** | 0.6409 | — |
| Precision-biased thresholds (F-beta=0.5, floor=0.5) | 0.6589 | 0.6287 | **0.7578** |

The model ships with two threshold files:
- `thresholds.json` — per-label thresholds that maximize F1
- `thresholds_precision.json` — per-label thresholds tuned for F-beta (β=0.5, precision floor=0.5)

For production use, `thresholds_precision.json` is recommended: it suppresses 38 low-precision labels
entirely and raises thresholds on the remaining 478, trading a small F1 reduction for substantially
higher precision (~75.8% macro precision).
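Both files are consumed the same way by the usage example in this card; judging from that snippet, the layout is two parallel arrays (this sketch is an assumption based on how the file is read, not a verbatim excerpt):

```json
{
  "labels": ["large language models", "computer vision", "cybersecurity"],
  "thresholds": [0.62, 0.55, 0.5]
}
```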

## Labels

516 granular tech/AI topic labels derived from a data-driven taxonomy built over the Metacurate document corpus.
478 labels are active in production (38 suppressed by precision floor).

Example labels: `large language models`, `computer vision`, `reinforcement learning`,
`cybersecurity`, `semiconductor industry`, `natural language processing`,
`autonomous vehicles`, `quantum computing`, `blockchain technology`, `robotics`, ...

Full label list: see `label_list.json` in this repository.

## Usage

### Direct inference with `transformers`

```python
import json
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "metacurate/topic-classifier-v13"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

# Fetch labels and precision thresholds from the model repository
with open(hf_hub_download(model_id, "label_list.json")) as f:
    labels = json.load(f)

with open(hf_hub_download(model_id, "thresholds_precision.json")) as f:
    thresh_data = json.load(f)
thresholds = dict(zip(thresh_data["labels"], thresh_data["thresholds"]))

text = "OpenAI released GPT-5 with improved reasoning and coding capabilities."

# The model accepts up to 8,192 tokens; truncate anything longer
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=8192)
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.sigmoid(logits).squeeze().tolist()

# A label is active when its sigmoid score clears its per-label threshold;
# labels absent from the threshold file are suppressed (default threshold 1.0)
active = [
    (label, round(score, 4))
    for label, score in zip(labels, probs)
    if score >= thresholds.get(label, 1.0)
]
print(active)
# e.g. [('large language models', 0.9123), ('AI startups', 0.8741), ...]
```

### Via the Metacurate inference service

The model is served via a Modal FastAPI endpoint with per-label precision thresholds
applied server-side. The service accepts a batch of texts and returns labels and scores
per text.
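The service contract is not documented here, so the endpoint URL and request/response schema below are illustrative assumptions only, sketched with the standard library:

```python
import json
from urllib.request import Request, urlopen

# Hypothetical endpoint -- the real Modal service URL and schema may differ
ENDPOINT = "https://example--topic-classifier.modal.run/classify"

def build_request(texts: list[str]) -> Request:
    """Package a batch of documents as a JSON POST body."""
    payload = json.dumps({"texts": texts}).encode("utf-8")
    return Request(ENDPOINT, data=payload,
                   headers={"Content-Type": "application/json"})

req = build_request([
    "OpenAI released GPT-5 with improved reasoning.",
    "New EUV lithography milestone announced by ASML.",
])
# response = json.load(urlopen(req))  # assumed: one {label: score} mapping per input
```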

## Training data

- ~13,000 real documents labeled by GPT-4.1-mini using the full taxonomy
- ~2,700 supplementary synthetic records for low-support labels (generated with GPT-4.1-mini)
- 15% stratified held-out validation split

## Intended use

Designed for classifying tech/AI web articles at ingestion time. Input is the full
multilingual document text (title + body), translated to English where needed.
Output is a set of topic labels for each document.

**Not recommended for:**
- General-purpose multi-label classification outside the tech/AI domain
- Documents shorter than ~50 words (label coverage degrades)

## Limitations

- Taxonomy is tech/AI-centric; coverage of other domains is limited
- 38 labels are suppressed in production because their validation precision fell below the 0.5 floor
- Performance varies by label; rare topics (< 50 training examples) have lower recall
- Thresholds were tuned on a held-out split from the same distribution — out-of-distribution generalization is untested

## Citation

Developed at Metacurate for internal use. Not peer-reviewed.