fredriko committed · Commit 7515abf · verified · Parent: ae3a9ed

Add model card

Files changed (1): README.md (+153, −0)
---
language:
- multilingual
license: apache-2.0
base_model: answerdotai/ModernBERT-large
tags:
- text-classification
- multi-label-classification
- topic-classification
- modernbert
- metacurate
pipeline_tag: text-classification
model-index:
- name: topic-classifier-v13
  results:
  - task:
      type: text-classification
      name: Multi-label Topic Classification
    metrics:
    - type: f1
      value: 0.7017
      name: Tuned Macro F1 (F1-optimized thresholds)
    - type: precision
      value: 0.7578
      name: Macro Precision (precision-biased thresholds)
    - type: f1
      value: 0.6409
      name: Tuned Micro F1
---

# topic-classifier-v13

Multi-label topic classifier for tech/AI web content, fine-tuned from
[answerdotai/ModernBERT-large](https://huggingface.co/answerdotai/ModernBERT-large).

Developed by [Metacurate](https://metacurate.io) to classify ingested web documents
into 516 granular tech/AI topic labels, supporting content discovery and filtering.

## Model details

| Property | Value |
|---|---|
| Base model | `answerdotai/ModernBERT-large` |
| Task | Multi-label text classification |
| Labels | 516 total / 478 active |
| Max input length | 8,192 tokens |
| Languages | Multilingual (trained on EN-translated text) |
| Training epochs | 15 |
| Learning rate | 2e-5 |
| Batch size | 16 |
| Warmup ratio | 0.1 |
| Positive weight cap | 100.0 |

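The positive weight cap above most plausibly bounds the per-label `pos_weight` passed to the BCE loss, so that extremely rare labels do not dominate training. The exact recipe is not published; this is a sketch under that assumption, with invented counts:

```python
import torch

# Per-label positive weight: n_negatives / n_positives, capped at 100.0.
# label_counts and n_docs are invented numbers for illustration.
label_counts = torch.tensor([5000.0, 120.0, 3.0])   # positive docs per label
n_docs = 13000.0

pos_weight = (n_docs - label_counts) / label_counts  # ~1.6, ~107, ~4332
pos_weight = pos_weight.clamp(max=100.0)             # apply the cap

# The capped weights then feed the multi-label loss.
loss_fn = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)
```

Without the cap, a label with only a handful of positives would receive a weight in the thousands and swamp the gradient signal of well-supported labels.
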
55
+
56
+ Evaluated on a held-out 15% stratified validation split.
57
+
58
+ | Threshold strategy | Macro F1 | Micro F1 | Macro Precision |
59
+ |---|---|---|---|
60
+ | Raw (0.5) | 0.6497 | 0.6130 | — |
61
+ | F1-optimized per-label thresholds | **0.7017** | 0.6409 | — |
62
+ | Precision-biased thresholds (F-beta=0.5, floor=0.5) | 0.6589 | 0.6287 | **0.7578** |
63
+
64
+ The model ships with two threshold files:
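Macro F1 averages per-label F1 scores equally, while micro F1 pools true/false positive counts across all labels, so the two diverge whenever per-label performance is uneven. A toy illustration with invented per-label confusion counts (not the model's actual numbers):

```python
# Invented (tp, fp, fn) counts for three labels, purely for illustration.
counts = {
    "large language models": (90, 10, 10),  # common, well-predicted label
    "robotics": (5, 5, 10),                 # mid-support label
    "quantum computing": (1, 9, 4),         # rare, noisy label
}

def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# Macro F1: mean of per-label F1 -- every label counts equally,
# so rare noisy labels drag it down.
macro = sum(f1(*c) for c in counts.values()) / len(counts)

# Micro F1: pool all counts first -- dominated by high-support labels.
tp, fp, fn = (sum(xs) for xs in zip(*counts.values()))
micro = f1(tp, fp, fn)

print(round(macro, 4), round(micro, 4))  # 0.4778 0.8
```
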
65
+ - `thresholds.json` — per-label thresholds that maximize F1
66
+ - `thresholds_precision.json` — per-label thresholds tuned for F-beta (β=0.5, precision floor=0.5)
67
+
68
+ For production use, `thresholds_precision.json` is recommended: it suppresses 38 low-precision labels
69
+ entirely and raises thresholds on the remaining 478, trading a small F1 reduction for substantially
70
+ higher precision (~75.8% macro precision).
71
+
72
+ ## Labels
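The tuning script itself is not included in the repository, but per-label F-beta threshold selection with a precision floor, as described above, can be sketched roughly as follows. `tune_threshold` and its coarse grid are assumptions for illustration, not the actual Metacurate code:

```python
import numpy as np

def tune_threshold(y_true, y_prob, beta=0.5, precision_floor=0.5):
    """Pick the cut that maximizes F-beta, subject to a precision floor.

    Returns (threshold, fbeta); threshold is None when no cut clears the
    floor, i.e. the label gets suppressed."""
    best_t, best_f = None, -1.0
    for t in np.linspace(0.05, 0.95, 19):  # coarse illustrative grid
        pred = y_prob >= t
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        if tp + fp == 0:
            continue  # no positive predictions at this cut
        precision = tp / (tp + fp)
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision < precision_floor or precision + recall == 0:
            continue  # fails the floor; candidate cut is rejected
        fbeta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
        if fbeta > best_f:
            best_t, best_f = t, fbeta
    return best_t, best_f

# Toy scores for one label: clean separation, so the first clean cut wins.
y_true = np.array([1, 1, 1, 0, 0, 0])
y_prob = np.array([0.92, 0.83, 0.71, 0.38, 0.24, 0.11])
t, score = tune_threshold(y_true, y_prob)  # t ~= 0.40, perfect F-beta
```

With β=0.5 the score weights precision twice as heavily as recall, and labels for which no threshold reaches the floor come back as `None`, matching the 38 suppressed labels.
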
73
+
74
+ 516 granular tech/AI topic labels derived from a data-driven taxonomy built over the Metacurate document corpus.
75
+ 478 labels are active in production (38 suppressed by precision floor).
76
+
77
+ Example labels: `large language models`, `computer vision`, `reinforcement learning`,
78
+ `cybersecurity`, `semiconductor industry`, `natural language processing`,
79
+ `autonomous vehicles`, `quantum computing`, `blockchain technology`, `robotics`, ...
80
+
81
+ Full label list: see `label_list.json` in this repository.
82
+
83
+ ## Usage
84
+
85
+ ### Direct inference with `transformers`
86
+
87
+ ```python
88
+ import json
89
+ import torch
90
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
91
+
92
+ model_id = "metacurate/topic-classifier-v13"
93
+
94
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
95
+ model = AutoModelForSequenceClassification.from_pretrained(model_id)
96
+ model.eval()
97
+
98
+ # Load labels and precision thresholds
99
+ with open("label_list.json") as f:
100
+ labels = json.load(f)
101
+
102
+ with open("thresholds_precision.json") as f:
103
+ thresh_data = json.load(f)
104
+ thresholds = dict(zip(thresh_data["labels"], thresh_data["thresholds"]))
105
+
106
+ text = "OpenAI released GPT-5 with improved reasoning and coding capabilities."
107
+
108
+ inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
109
+ with torch.no_grad():
110
+ logits = model(**inputs).logits
111
+ probs = torch.sigmoid(logits).squeeze().tolist()
112
+
113
+ active = [
114
+ (label, round(score, 4))
115
+ for label, score in zip(labels, probs)
116
+ if score >= thresholds.get(label, 1.0)
117
+ ]
118
+ print(active)
119
+ # e.g. [('large language models', 0.9123), ('AI startups', 0.8741), ...]
120
+ ```
121
+
122
+ ### Via the Metacurate inference service
123
+
124
+ The model is served via a Modal FastAPI endpoint with per-label precision thresholds
125
+ applied server-side. The service accepts a batch of texts and returns labels and scores
126
+ per text.
127
+
128
+ ## Training data
129
+
130
+ - ~13,000 real documents labeled by GPT-4.1-mini using the full taxonomy
131
+ - ~2,700 supplementary synthetic records for low-support labels (generated with GPT-4.1-mini)
132
+ - 15% stratified held-out validation split
133
+
134
+ ## Intended use
135
+
136
+ Designed for classifying tech/AI web articles at ingestion time. Input is the full
137
+ multilingual document text (title + body), translated to English where needed.
138
+ Output is a set of topic labels for each document.
139
+
140
+ **Not recommended for:**
141
+ - General-purpose multi-label classification outside the tech/AI domain
142
+ - Documents shorter than ~50 words (label coverage degrades)
143
+
144
+ ## Limitations
145
+
146
+ - Taxonomy is tech/AI-centric; coverage of other domains is limited
147
+ - 38 labels are suppressed in production due to insufficient training data precision
148
+ - Performance varies by label; rare topics (< 50 training examples) have lower recall
149
+ - Thresholds were tuned on a held-out split from the same distribution — out-of-distribution generalization is untested
150
+
151
+ ## Citation
152
+
153
+ Developed at Metacurate for internal use. Not peer-reviewed.