# ModernBERT HF Dataset Domain Classifier v3
Classifies HuggingFace dataset cards into 10 domain categories. Built through iterative, data-centric development with LLM-in-the-loop active learning.
## Labels
| Label | Description |
|---|---|
| biology | Genomics, ecology, proteins, life sciences |
| chemistry | Molecules, reactions, materials science |
| climate | Weather, earth observation, environmental monitoring |
| code | Programming, software engineering, code generation |
| cybersecurity | Security threats, malware, vulnerability detection |
| finance | Banking, trading, economics, financial documents |
| legal | Law, regulations, court cases, legal documents |
| math | Mathematics, theorem proving, formal verification |
| medical | Healthcare, clinical, biomedical, drug discovery |
| none | General-purpose, no specific domain, stubs |
## Usage

```python
from transformers import pipeline

clf = pipeline("text-classification", model="davanstrien/modernbert-hf-dataset-domain-v3", top_k=1)

result = clf("This dataset contains chest X-ray images for pneumonia detection...")
# [{'label': 'medical', 'score': 0.99}]
```
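The pipeline returns, per input, a ranked list of label/score dicts. A minimal post-processing sketch: fall back to "none" when the top score is low. The `resolve_label` helper and its 0.5 threshold are illustrative assumptions, not part of the model card.

```python
# Hypothetical post-processing of pipeline output such as
# [{'label': 'medical', 'score': 0.99}]. The 0.5 threshold is an
# assumption, not a value recommended by the model card.
def resolve_label(preds: list, threshold: float = 0.5) -> str:
    top = preds[0]  # top_k=1 puts the highest-scoring label first
    return top["label"] if top["score"] >= threshold else "none"

print(resolve_label([{"label": "medical", "score": 0.99}]))  # -> medical
print(resolve_label([{"label": "climate", "score": 0.31}]))  # -> none
```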
## Performance
Evaluated on 72 out-of-distribution examples validated by 3-model consensus (Qwen3-235B, DeepSeek-V3, Llama-3.3-70B).
With the recommended input filter (drops known stub orgs and template cards):
| Metric | Value |
|---|---|
| Overall accuracy | 90.3% |
| False positives (none→domain) | 3/27 |
| Wrong domain | 0 |
| None recall | 89% |
Model only (no input filter):
| Metric | Value |
|---|---|
| Overall accuracy | 81.9% |
| Macro F1 | 0.850 |
| Classes with 100% recall | biology, chemistry, code, cybersecurity, finance, legal, math, medical |
### Per-class (model + filter)
| Class | Accuracy |
|---|---|
| biology | 83% |
| chemistry | 100% |
| climate | 78% |
| code | 100% |
| cybersecurity | 100% |
| finance | 80% |
| legal | 100% |
| math | 100% |
| medical | 100% |
| none | 89% |
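Given gold labels and model predictions over the evaluation set, the accuracy and macro-F1 numbers above are the standard scikit-learn metrics. A sketch with a few hypothetical examples standing in for the 72-example gold set:

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical gold labels and predictions for a handful of OOD cards;
# the real evaluation used the 72-example consensus-validated set.
gold = ["medical", "none", "code", "climate", "none", "finance"]
pred = ["medical", "none", "code", "none",    "none", "finance"]

# zero_division=0 avoids warnings for classes never predicted (e.g. climate)
print(f"accuracy: {accuracy_score(gold, pred):.3f}")                              # -> 0.833
print(f"macro F1: {f1_score(gold, pred, average='macro', zero_division=0):.3f}")  # -> 0.760
```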
## Recommended Input Filter
For best results, filter known junk before classification:
```python
import re


def should_skip(card_text: str, dataset_id: str = "") -> bool:
    """Return True if the card should be auto-classified as 'none'."""
    # Known stub orgs
    if dataset_id:
        org = dataset_id.split("/")[0] if "/" in dataset_id else ""
        if org in {"french-open-data"}:
            return True
    # HF default template
    if "Dataset Card for Dataset Name" in card_text and "Provide a quick summary" in card_text:
        return True
    # Too short after stripping markup
    clean = re.sub(r"<[^>]+>", "", card_text)
    clean = re.sub(r"https?://\S+", "", clean)
    clean = re.sub(r"\s+", " ", clean).strip()
    return len(clean) < 100
```
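The markup-stripping step can be exercised on its own; `strip_markup` below is a hypothetical helper (not part of the released code) that mirrors the same three regex passes, shown against an HTML-heavy badge card:

```python
import re


def strip_markup(card_text: str) -> str:
    """Drop HTML tags and URLs, then collapse whitespace -- the same
    cleaning steps the filter applies before its length check."""
    clean = re.sub(r"<[^>]+>", "", card_text)
    clean = re.sub(r"https?://\S+", "", clean)
    return re.sub(r"\s+", " ", clean).strip()


# Hypothetical badge-heavy card text
card = '<img src="https://img.shields.io/badge/x.svg"> See https://example.org <b>for data</b>'
print(strip_markup(card))             # -> See for data
print(len(strip_markup(card)) < 100)  # -> True: would be auto-classified as 'none'
```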
## Training

### Data
3,437 examples from three sources:
- Tag-based labels (1,990): Derived from existing HuggingFace dataset tags
- LLM-labelled "none" examples (607): Qwen3-235B classified untagged datasets
- Active learning hard negatives (800): Disagreement sampling between v2 model and Qwen3-4B, adjudicated by Qwen3-235B
### Development History
| Version | OOD Accuracy | Method |
|---|---|---|
| v1 | 52.8% | Tag-based labels only |
| v2 | 76.4% | + LLM-labelled "none" examples |
| v3 | 84.7% (90.3% with filter) | + Disagreement-based active learning |
### Methodology
- Bootstrap from Hub tags: 2,889 of the initial labels came from existing dataset tags for free
- Multi-model consensus validation: 3 frontier LLMs validated 72 OOD examples to build a gold evaluation set
- Disagreement-based active learning: ran the model on 5K datasets, used a cheap LLM (Qwen3-4B) to find disagreements, and adjudicated with a strong LLM (Qwen3-235B); 382 of 800 sampled predictions were overturned
- Iterative retraining: each round targeted specific weaknesses identified by error analysis
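The disagreement-sampling step reduces to comparing two prediction maps over the same dataset ids; a minimal sketch with hypothetical ids (`find_disagreements` is illustrative, not the actual pipeline code):

```python
def find_disagreements(model_preds: dict, llm_preds: dict) -> list:
    """Dataset ids where the classifier and the cheap LLM disagree;
    these become the candidates sent to the strong LLM for adjudication."""
    return [ds for ds, label in model_preds.items() if llm_preds.get(ds) != label]


# Hypothetical predictions over a few dataset ids
model_preds = {"org/a": "medical", "org/b": "none", "org/c": "code"}
llm_preds = {"org/a": "medical", "org/b": "biology", "org/c": "none"}
print(find_disagreements(model_preds, llm_preds))  # -> ['org/b', 'org/c']
```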
### Architecture

- Base model: answerdotai/ModernBERT-base
- Max sequence length: 2048 tokens
- Class-weighted cross-entropy loss
- 5 epochs, lr=2e-5, effective batch size 16
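The card does not specify how the class weights were computed; a common choice consistent with class-weighted cross-entropy is inverse-frequency ("balanced") weighting, sketched below with a hypothetical label distribution:

```python
from collections import Counter


def class_weights(labels: list) -> dict:
    """Inverse-frequency weights, n_samples / (n_classes * count) -- the
    'balanced' scheme typically passed to a weighted cross-entropy loss.
    Assumed for illustration; the card does not state the exact scheme."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * v) for c, v in counts.items()}


# Hypothetical skewed label distribution: rare classes get weight > 1
labels = ["none"] * 6 + ["climate"] * 2 + ["legal"] * 2
print(class_weights(labels))  # -> {'none': 0.555..., 'climate': 1.666..., 'legal': 1.666...}
```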
## Known Limitations

- French open-data stubs: datasets from the `french-open-data` org are empty references to external portals. The input filter handles these; without it, the model may confidently misclassify them.
- HTML-heavy cards: cards with extensive HTML markup (badges, styled headers) may lose signal. Strip HTML tags before classification.
- Taxonomy gaps: no "physics" or "engineering" category, so datasets in those areas may be classified as "none" or an adjacent domain.
## How This Was Built
This model was developed through a collaborative workflow between a human (@davanstrien) and Hermes Agent, an AI coding agent built by NousResearch. The entire data-centric development loop (error diagnosis, LLM-in-the-loop active learning, training, evaluation) was conducted interactively in a single session, with the agent writing and executing scripts, managing HF Jobs, and iterating on the methodology based on results.

The active learning pipeline (disagreement sampling between ModernBERT and cheap/strong LLMs, three-tier adjudication, and iterative retraining) was designed and executed by the agent as part of a data-centric model development skill that captures reusable patterns for training task-specific models with a focus on data quality.
## Links
- Training dataset: davanstrien/hf-dataset-domain-labels-v3
- Development repo: Contains full training scripts, evaluation pipeline, and learnings from 3 iterations
- Previous versions: v1, v2
- Agent: Hermes Agent by NousResearch