---
language:
- multilingual
license: apache-2.0
base_model: answerdotai/ModernBERT-large
tags:
- text-classification
- multi-label-classification
- topic-classification
- modernbert
- metacurate
pipeline_tag: text-classification
model-index:
- name: topic-classifier-v13
  results:
  - task:
      type: text-classification
      name: Multi-label Topic Classification
    metrics:
    - type: f1
      value: 0.7017
      name: Tuned Macro F1 (F1-optimized thresholds)
    - type: precision
      value: 0.7578
      name: Macro Precision (precision-biased thresholds)
    - type: f1
      value: 0.6409
      name: Tuned Micro F1
---

# topic-classifier-v13

Multi-label topic classifier for tech/AI web content, fine-tuned from
[answerdotai/ModernBERT-large](https://huggingface.co/answerdotai/ModernBERT-large).

Developed by [Metacurate](https://metacurate.io) to classify ingested web documents
into 516 granular tech/AI topic labels, supporting content discovery and filtering.

## Model details

| Property | Value |
|---|---|
| Base model | `answerdotai/ModernBERT-large` |
| Task | Multi-label text classification |
| Labels | 516 total / 478 active |
| Max input length | 8,192 tokens |
| Languages | Multilingual (trained on English-translated text) |
| Training epochs | 15 |
| Learning rate | 2e-5 |
| Batch size | 16 |
| Warmup ratio | 0.1 |
| Positive weight cap | 100.0 |

## Performance

Evaluated on a held-out 15% stratified validation split.

| Threshold strategy | Macro F1 | Micro F1 | Macro Precision |
|---|---|---|---|
| Raw (0.5) | 0.6497 | 0.6130 | — |
| F1-optimized per-label thresholds | **0.7017** | 0.6409 | — |
| Precision-biased thresholds (F-beta=0.5, floor=0.5) | 0.6589 | 0.6287 | **0.7578** |

The model ships with two threshold files:

- `thresholds.json` — per-label thresholds that maximize F1
- `thresholds_precision.json` — per-label thresholds tuned for F-beta (β=0.5, precision floor=0.5)

For production use, `thresholds_precision.json` is recommended: it suppresses 38 low-precision labels
entirely and raises thresholds on the remaining 478, trading a small F1 reduction for substantially
higher precision (~75.8% macro precision).

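The exact tuning code is not part of this repository, but the selection rule described above can be sketched: F-beta with β=0.5 weights precision more heavily than recall, and a label is suppressed when no threshold reaches the precision floor. A minimal illustrative sketch (`pick_threshold` and its candidate grid are assumptions, not the actual procedure):

```python
def fbeta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score; beta < 1 favors precision, beta > 1 favors recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def pick_threshold(scores, truth, beta=0.5, floor=0.5):
    """For one label, pick the threshold maximizing F-beta among those
    meeting the precision floor; return None (suppress) if none qualifies."""
    best_t, best_f = None, -1.0
    for t in [i / 100 for i in range(5, 100, 5)]:  # candidate grid, illustrative
        pred = [s >= t for s in scores]
        tp = sum(p and y for p, y in zip(pred, truth))
        fp = sum(p and not y for p, y in zip(pred, truth))
        fn = sum(not p and y for p, y in zip(pred, truth))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        if prec < floor:
            continue  # does not meet the precision floor at this threshold
        f = fbeta(prec, rec, beta)
        if f > best_f:
            best_t, best_f = t, f
    return best_t
```

With β=0.5, a precision-heavy operating point scores better than its recall-heavy mirror image, which is why the production thresholds skew conservative.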
## Labels

516 granular tech/AI topic labels derived from a data-driven taxonomy built over the Metacurate document corpus.
478 labels are active in production (38 suppressed by the precision floor).

Example labels: `large language models`, `computer vision`, `reinforcement learning`,
`cybersecurity`, `semiconductor industry`, `natural language processing`,
`autonomous vehicles`, `quantum computing`, `blockchain technology`, `robotics`, ...

Full label list: see `label_list.json` in this repository.

## Usage

### Direct inference with `transformers`

```python
import json
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "metacurate/topic-classifier-v13"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

# Load labels and precision thresholds
with open("label_list.json") as f:
    labels = json.load(f)

with open("thresholds_precision.json") as f:
    thresh_data = json.load(f)
thresholds = dict(zip(thresh_data["labels"], thresh_data["thresholds"]))

text = "OpenAI released GPT-5 with improved reasoning and coding capabilities."

# Truncate long documents; the model accepts up to 8,192 tokens
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.sigmoid(logits).squeeze().tolist()

# Suppressed labels have no threshold entry, so default to 1.0 (never fires)
active = [
    (label, round(score, 4))
    for label, score in zip(labels, probs)
    if score >= thresholds.get(label, 1.0)
]
print(active)
# e.g. [('large language models', 0.9123), ('AI startups', 0.8741), ...]
```

### Via the Metacurate inference service

The model is served from a Modal FastAPI endpoint that applies the per-label precision
thresholds server-side. The service accepts a batch of texts and returns the active
labels and their scores for each text.

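The service is internal and its schema is not published here. Purely as an illustrative sketch — the endpoint URL and field names below are assumptions, not the real API — a batch request could be assembled like this:

```python
import json

# Hypothetical endpoint URL; the real Modal deployment URL differs.
ENDPOINT = "https://example--topic-classifier.modal.run/classify"

# Hypothetical payload shape: a batch of texts in one request.
payload = {
    "texts": [
        "OpenAI released GPT-5 with improved reasoning and coding capabilities.",
        "EU regulators open an antitrust probe into a major chipmaker.",
    ]
}
body = json.dumps(payload).encode("utf-8")

# Sending the request (stdlib only), commented out since the URL is made up:
# import urllib.request
# req = urllib.request.Request(
#     ENDPOINT, data=body, headers={"Content-Type": "application/json"}
# )
# with urllib.request.urlopen(req) as resp:
#     results = json.load(resp)  # expected: active labels and scores per text
```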
## Training data

- ~13,000 real documents labeled by GPT-4.1-mini using the full taxonomy
- ~2,700 supplementary synthetic records for low-support labels (generated with GPT-4.1-mini)
- 15% stratified held-out validation split

## Intended use

Designed for classifying tech/AI web articles at ingestion time. Input is the full
multilingual document text (title + body), translated to English where needed.
Output is a set of topic labels for each document.

**Not recommended for:**

- General-purpose multi-label classification outside the tech/AI domain
- Documents shorter than ~50 words (label coverage degrades)

## Limitations

- Taxonomy is tech/AI-centric; coverage of other domains is limited
- 38 labels are suppressed in production because their validation precision fell below the 0.5 floor
- Performance varies by label; rare topics (< 50 training examples) have lower recall
- Thresholds were tuned on a held-out split from the same distribution; out-of-distribution generalization is untested

## Citation

Developed at Metacurate for internal use. Not peer-reviewed.