---
language:
- multilingual
license: apache-2.0
base_model: answerdotai/ModernBERT-large
tags:
- text-classification
- multi-label-classification
- topic-classification
- modernbert
- metacurate
pipeline_tag: text-classification
model-index:
- name: topic-classifier-v13
  results:
  - task:
      type: text-classification
      name: Multi-label Topic Classification
    metrics:
    - type: f1
      value: 0.7017
      name: Tuned Macro F1 (F1-optimized thresholds)
    - type: precision
      value: 0.7578
      name: Macro Precision (precision-biased thresholds)
    - type: f1
      value: 0.6409
      name: Tuned Micro F1
---

# topic-classifier-v13

Multi-label topic classifier for tech/AI web content, fine-tuned from [answerdotai/ModernBERT-large](https://huggingface.co/answerdotai/ModernBERT-large).

Developed by [Metacurate](https://metacurate.io) to classify ingested web documents into 516 granular tech/AI topic labels, supporting content discovery and filtering.

## Model details

| Property | Value |
|---|---|
| Base model | `answerdotai/ModernBERT-large` |
| Task | Multi-label text classification |
| Labels | 516 total / 478 active |
| Max input length | 8,192 tokens |
| Languages | Multilingual (trained on EN-translated text) |
| Training epochs | 15 |
| Learning rate | 2e-5 |
| Batch size | 16 |
| Warmup ratio | 0.1 |
| Positive weight cap | 100.0 |

## Performance

Evaluated on a held-out 15% stratified validation split.
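Macro F1 averages per-label F1 scores, while micro F1 pools true/false positive counts across all labels before computing a single F1. A minimal numpy sketch of both, after applying per-label thresholds (toy random data and assumed variable names, not the actual evaluation script):

```python
import numpy as np

# Toy stand-ins; the real evaluation uses the validation split's
# label matrix and the model's sigmoid probabilities.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(100, 5))   # (documents, labels)
probs = rng.random(size=(100, 5))            # predicted probabilities
thresholds = np.full(5, 0.5)                 # per-label thresholds

y_pred = (probs >= thresholds).astype(int)

tp = ((y_pred == 1) & (y_true == 1)).sum(axis=0)
fp = ((y_pred == 1) & (y_true == 0)).sum(axis=0)
fn = ((y_pred == 0) & (y_true == 1)).sum(axis=0)

# Macro F1: mean of per-label F1; micro F1: F1 over pooled counts.
per_label_f1 = 2 * tp / np.maximum(2 * tp + fp + fn, 1)
macro_f1 = per_label_f1.mean()
micro_f1 = 2 * tp.sum() / max(2 * tp.sum() + fp.sum() + fn.sum(), 1)
print(f"macro F1={macro_f1:.4f}  micro F1={micro_f1:.4f}")
```
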
| Threshold strategy | Macro F1 | Micro F1 | Macro Precision |
|---|---|---|---|
| Raw (0.5) | 0.6497 | 0.6130 | — |
| F1-optimized per-label thresholds | **0.7017** | 0.6409 | — |
| Precision-biased thresholds (F-beta=0.5, floor=0.5) | 0.6589 | 0.6287 | **0.7578** |

The model ships with two threshold files:

- `thresholds.json` — per-label thresholds that maximize F1
- `thresholds_precision.json` — per-label thresholds tuned for F-beta (β=0.5, precision floor=0.5)

For production use, `thresholds_precision.json` is recommended: it suppresses 38 low-precision labels entirely and raises thresholds on the remaining 478, trading a small F1 reduction for substantially higher precision (~75.8% macro precision).

## Labels

516 granular tech/AI topic labels derived from a data-driven taxonomy built over the Metacurate document corpus. 478 labels are active in production (38 suppressed by the precision floor).

Example labels: `large language models`, `computer vision`, `reinforcement learning`, `cybersecurity`, `semiconductor industry`, `natural language processing`, `autonomous vehicles`, `quantum computing`, `blockchain technology`, `robotics`, ...

Full label list: see `label_list.json` in this repository.

## Usage

### Direct inference with `transformers`

```python
import json

import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "metacurate/topic-classifier-v13"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

# Fetch the label list and precision-biased thresholds from the model repo
with open(hf_hub_download(model_id, "label_list.json")) as f:
    labels = json.load(f)
with open(hf_hub_download(model_id, "thresholds_precision.json")) as f:
    thresh_data = json.load(f)
thresholds = dict(zip(thresh_data["labels"], thresh_data["thresholds"]))

text = "OpenAI released GPT-5 with improved reasoning and coding capabilities."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=8192)

with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.sigmoid(logits).squeeze().tolist()

# Suppressed labels fall back to a threshold of 1.0 and are never emitted
active = [
    (label, round(score, 4))
    for label, score in zip(labels, probs)
    if score >= thresholds.get(label, 1.0)
]
print(active)
# e.g. [('large language models', 0.9123), ('AI startups', 0.8741), ...]
```

### Via the Metacurate inference service

The model is served via a Modal FastAPI endpoint with per-label precision thresholds applied server-side. The service accepts a batch of texts and returns labels and scores per text.

## Training data

- ~13,000 real documents labeled by GPT-4.1-mini using the full taxonomy
- ~2,700 supplementary synthetic records for low-support labels (generated with GPT-4.1-mini)
- 15% stratified held-out validation split

## Intended use

Designed for classifying tech/AI web articles at ingestion time. Input is the full multilingual document text (title + body), translated to English where needed. Output is a set of topic labels for each document.

**Not recommended for:**

- General-purpose multi-label classification outside the tech/AI domain
- Documents shorter than ~50 words (label coverage degrades)

## Limitations

- Taxonomy is tech/AI-centric; coverage of other domains is limited
- 38 labels are suppressed in production due to insufficient precision on limited training data
- Performance varies by label; rare topics (< 50 training examples) have lower recall
- Thresholds were tuned on a held-out split from the same distribution — out-of-distribution generalization is untested

## Citation

Developed at Metacurate for internal use. Not peer-reviewed.
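For reference, the precision-biased threshold tuning described under Performance (maximize F-beta with β=0.5, subject to a precision floor, suppressing labels that cannot meet it) can be sketched as a per-label grid search. The grid, variable names, and toy data below are assumptions, not the actual tuning script:

```python
import numpy as np

def tune_threshold(y_true, probs, beta=0.5, precision_floor=0.5):
    """Grid-search one label's threshold to maximize F-beta, subject to a
    minimum precision; return None when no threshold meets the floor
    (i.e., the label would be suppressed in production)."""
    best_t, best_fb = None, -1.0
    for t in np.arange(0.05, 0.96, 0.05):
        pred = probs >= t
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        if tp + fp == 0 or tp + fn == 0:
            continue
        prec = tp / (tp + fp)
        rec = tp / (tp + fn)
        if prec < precision_floor:
            continue
        b2 = beta * beta
        fb = (1 + b2) * prec * rec / (b2 * prec + rec)
        if fb > best_fb:
            best_fb, best_t = fb, float(t)
    return best_t

# Toy single-label example: positives score higher than negatives on average
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=500)
p = np.clip(y * 0.6 + rng.random(500) * 0.5, 0, 1)
print(tune_threshold(y, p))
```

Running this per label over validation probabilities yields a threshold (or suppression) for each of the 516 labels, matching the shape of `thresholds_precision.json`.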