---
language:
- multilingual
license: apache-2.0
base_model: answerdotai/ModernBERT-large
tags:
- text-classification
- multi-label-classification
- topic-classification
- modernbert
- metacurate
pipeline_tag: text-classification
model-index:
- name: topic-classifier-v13
  results:
  - task:
      type: text-classification
      name: Multi-label Topic Classification
    metrics:
    - type: f1
      value: 0.7017
      name: Tuned Macro F1 (F1-optimized thresholds)
    - type: precision
      value: 0.7578
      name: Macro Precision (precision-biased thresholds)
    - type: f1
      value: 0.6409
      name: Tuned Micro F1
---

# topic-classifier-v13

Multi-label topic classifier for tech/AI web content, fine-tuned from
[answerdotai/ModernBERT-large](https://huggingface.co/answerdotai/ModernBERT-large).

Developed by [Metacurate](https://metacurate.io) to classify ingested web documents
into 516 granular tech/AI topic labels, supporting content discovery and filtering.

## Model details

| Property | Value |
|---|---|
| Base model | `answerdotai/ModernBERT-large` |
| Task | Multi-label text classification |
| Labels | 516 total / 478 active |
| Max input length | 8,192 tokens |
| Languages | Multilingual (trained on English-translated text) |
| Training epochs | 15 |
| Learning rate | 2e-5 |
| Batch size | 16 |
| Warmup ratio | 0.1 |
| Positive weight cap | 100.0 |

## Performance

Evaluated on a held-out 15% stratified validation split.

| Threshold strategy | Macro F1 | Micro F1 | Macro Precision |
|---|---|---|---|
| Raw (0.5) | 0.6497 | 0.6130 | — |
| F1-optimized per-label thresholds | **0.7017** | 0.6409 | — |
| Precision-biased thresholds (F-beta=0.5, floor=0.5) | 0.6589 | 0.6287 | **0.7578** |

The model ships with two threshold files:

- `thresholds.json` — per-label thresholds that maximize F1
- `thresholds_precision.json` — per-label thresholds tuned for F-beta (β=0.5, precision floor=0.5)

For production use, `thresholds_precision.json` is recommended: it suppresses 38 low-precision labels
entirely and raises thresholds on the remaining 478, trading a small F1 reduction for substantially
higher precision (~75.8% macro precision).

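The exact tuning code is not part of this repository, but the selection rule described above can be sketched: F-beta with β=0.5 weights precision more heavily than recall, and a label is suppressed when no threshold reaches the precision floor. A minimal illustrative sketch (`pick_threshold` and its candidate grid are assumptions, not the actual procedure):

```python
def fbeta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score; beta < 1 favors precision, beta > 1 favors recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def pick_threshold(scores, truth, beta=0.5, floor=0.5):
    """For one label, pick the threshold maximizing F-beta among those
    meeting the precision floor; return None (suppress) if none qualifies."""
    best_t, best_f = None, -1.0
    for t in [i / 100 for i in range(5, 100, 5)]:  # candidate grid, illustrative
        pred = [s >= t for s in scores]
        tp = sum(p and y for p, y in zip(pred, truth))
        fp = sum(p and not y for p, y in zip(pred, truth))
        fn = sum(not p and y for p, y in zip(pred, truth))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        if prec < floor:
            continue  # does not meet the precision floor at this threshold
        f = fbeta(prec, rec, beta)
        if f > best_f:
            best_t, best_f = t, f
    return best_t
```

With β=0.5, a precision-heavy operating point scores better than its recall-heavy mirror image, which is why the production thresholds skew conservative.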
## Labels

516 granular tech/AI topic labels derived from a data-driven taxonomy built over the Metacurate document corpus.
478 labels are active in production (38 suppressed by the precision floor).

Example labels: `large language models`, `computer vision`, `reinforcement learning`,
`cybersecurity`, `semiconductor industry`, `natural language processing`,
`autonomous vehicles`, `quantum computing`, `blockchain technology`, `robotics`, ...

Full label list: see `label_list.json` in this repository.

## Usage

### Direct inference with `transformers`

```python
import json
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "metacurate/topic-classifier-v13"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

# Load labels and precision thresholds
with open("label_list.json") as f:
    labels = json.load(f)

with open("thresholds_precision.json") as f:
    thresh_data = json.load(f)
thresholds = dict(zip(thresh_data["labels"], thresh_data["thresholds"]))

text = "OpenAI released GPT-5 with improved reasoning and coding capabilities."

# Truncate long documents; the model accepts up to 8,192 tokens
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.sigmoid(logits).squeeze().tolist()

# Suppressed labels have no threshold entry, so default to 1.0 (never fires)
active = [
    (label, round(score, 4))
    for label, score in zip(labels, probs)
    if score >= thresholds.get(label, 1.0)
]
print(active)
# e.g. [('large language models', 0.9123), ('AI startups', 0.8741), ...]
```

### Via the Metacurate inference service

The model is served from a Modal FastAPI endpoint that applies the per-label precision
thresholds server-side. The service accepts a batch of texts and returns the active
labels and their scores for each text.

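The service is internal and its schema is not published here. Purely as an illustrative sketch — the endpoint URL and field names below are assumptions, not the real API — a batch request could be assembled like this:

```python
import json

# Hypothetical endpoint URL; the real Modal deployment URL differs.
ENDPOINT = "https://example--topic-classifier.modal.run/classify"

# Hypothetical payload shape: a batch of texts in one request.
payload = {
    "texts": [
        "OpenAI released GPT-5 with improved reasoning and coding capabilities.",
        "EU regulators open an antitrust probe into a major chipmaker.",
    ]
}
body = json.dumps(payload).encode("utf-8")

# Sending the request (stdlib only), commented out since the URL is made up:
# import urllib.request
# req = urllib.request.Request(
#     ENDPOINT, data=body, headers={"Content-Type": "application/json"}
# )
# with urllib.request.urlopen(req) as resp:
#     results = json.load(resp)  # expected: active labels and scores per text
```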
## Training data

- ~13,000 real documents labeled by GPT-4.1-mini using the full taxonomy
- ~2,700 supplementary synthetic records for low-support labels (generated with GPT-4.1-mini)
- 15% stratified held-out validation split

## Intended use

Designed for classifying tech/AI web articles at ingestion time. Input is the full
multilingual document text (title + body), translated to English where needed.
Output is a set of topic labels for each document.

**Not recommended for:**

- General-purpose multi-label classification outside the tech/AI domain
- Documents shorter than ~50 words (label coverage degrades)

## Limitations

- Taxonomy is tech/AI-centric; coverage of other domains is limited
- 38 labels are suppressed in production because their validation precision fell below the 0.5 floor
- Performance varies by label; rare topics (< 50 training examples) have lower recall
- Thresholds were tuned on a held-out split from the same distribution; out-of-distribution generalization is untested

## Citation

Developed at Metacurate for internal use. Not peer-reviewed.