# ModernBERT HF Dataset Domain Classifier v3

Classifies HuggingFace dataset cards into 10 domain categories. Built through iterative, data-centric development with LLM-in-the-loop active learning.

## Labels

| Label | Description |
|---|---|
| biology | Genomics, ecology, proteins, life sciences |
| chemistry | Molecules, reactions, materials science |
| climate | Weather, earth observation, environmental monitoring |
| code | Programming, software engineering, code generation |
| cybersecurity | Security threats, malware, vulnerability detection |
| finance | Banking, trading, economics, financial documents |
| legal | Law, regulations, court cases, legal documents |
| math | Mathematics, theorem proving, formal verification |
| medical | Healthcare, clinical, biomedical, drug discovery |
| none | General-purpose, no specific domain, stubs |

## Usage

```python
from transformers import pipeline

clf = pipeline("text-classification", model="davanstrien/modernbert-hf-dataset-domain-v3", top_k=1)
result = clf("This dataset contains chest X-ray images for pneumonia detection...")
# [{'label': 'medical', 'score': 0.99}]
```

## Performance

Evaluated on 72 out-of-distribution examples validated by 3-model consensus (Qwen3-235B, DeepSeek-V3, Llama-3.3-70B).

**With recommended input filter** (filters known stub orgs and templates):

| Metric | Value |
|---|---|
| Overall accuracy | 90.3% |
| False positives (domain → none) | 3/27 |
| Wrong domain | 0 |
| None recall | 89% |

**Model only** (no input filter):

| Metric | Value |
|---|---|
| Overall accuracy | 81.9% |
| Macro F1 | 0.850 |

100% recall on biology, chemistry, code, cybersecurity, finance, legal, math, medical.
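For reference, macro F1 averages the per-class F1 scores with equal weight per class, so a rare class counts as much as a common one. A minimal sketch of the computation (the labels below are illustrative, not from the gold set):

```python
def macro_f1(y_true, y_pred):
    """Average of per-class F1 scores, weighting every class equally."""
    f1s = []
    for c in set(y_true) | set(y_pred):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

score = macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```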

## Per-class (model + filter)

| Class | Accuracy |
|---|---|
| biology | 83% |
| chemistry | 100% |
| climate | 78% |
| code | 100% |
| cybersecurity | 100% |
| finance | 80% |
| legal | 100% |
| math | 100% |
| medical | 100% |
| none | 89% |

## Recommended Input Filter

For best results, filter known junk before classification:

```python
import re

def should_skip(card_text: str, dataset_id: str = "") -> bool:
    """Return True if the card should be auto-classified as 'none'."""
    # Known stub orgs
    if dataset_id:
        org = dataset_id.split("/")[0] if "/" in dataset_id else ""
        if org in {"french-open-data"}:
            return True
    # HF default template
    if "Dataset Card for Dataset Name" in card_text and "Provide a quick summary" in card_text:
        return True
    # Too short after stripping HTML tags, URLs, and extra whitespace
    clean = re.sub(r'<[^>]+>', '', card_text)
    clean = re.sub(r'https?://\S+', '', clean)
    clean = re.sub(r'\s+', ' ', clean).strip()
    return len(clean) < 100
```
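As a quick sanity check of the length heuristic, the same stripping steps can be exercised on a synthetic stub card (the card text below is illustrative):

```python
import re

def visible_length(card_text: str) -> int:
    """Length of a card after removing HTML tags, URLs, and extra whitespace,
    mirroring the cleanup inside should_skip."""
    clean = re.sub(r'<[^>]+>', '', card_text)
    clean = re.sub(r'https?://\S+', '', clean)
    clean = re.sub(r'\s+', ' ', clean).strip()
    return len(clean)

# A badge-only stub card: almost nothing survives the cleanup.
stub = '<div align="center"><img src="https://example.org/badge.png"></div> See https://example.org'
assert visible_length(stub) < 100
```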

## Training

### Data

3,437 examples from three sources:

- **Tag-based labels (1,990):** derived from existing HuggingFace dataset tags
- **LLM-labelled "none" examples (607):** Qwen3-235B classified untagged datasets
- **Active learning hard negatives (800):** disagreement sampling between the v2 model and Qwen3-4B, adjudicated by Qwen3-235B

## Development History

| Version | OOD Accuracy | Method |
|---|---|---|
| v1 | 52.8% | Tag-based labels only |
| v2 | 76.4% | + LLM-labelled "none" examples |
| v3 | 84.7% (90.3% with filter) | + Disagreement-based active learning |

## Methodology

1. **Bootstrap from Hub tags:** 2,889 of the initial labels came for free from existing dataset tags
2. **Multi-model consensus validation:** 3 frontier LLMs validated 72 OOD examples to build a gold evaluation set
3. **Disagreement-based active learning:** ran the model on 5K datasets, used a cheap LLM (Qwen3-4B) to find disagreements, and adjudicated with a strong LLM (Qwen3-235B). Found 382 overturned predictions in 800 examples.
4. **Iterative retraining:** each round targeted specific weaknesses identified by error analysis
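Step 3 reduces to a simple selection rule: only cards where the classifier and the cheap LLM disagree are escalated to the expensive adjudicator. A minimal sketch (function and variable names are illustrative; the real pipeline ran over 5K dataset cards):

```python
def select_for_adjudication(model_preds: dict, cheap_llm_preds: dict) -> list:
    """Return dataset ids where the classifier and the cheap LLM disagree.
    Only these hard cases are sent to the strong adjudicator model."""
    return sorted(
        ds_id
        for ds_id, label in model_preds.items()
        if ds_id in cheap_llm_preds and cheap_llm_preds[ds_id] != label
    )

model_preds = {"org/a": "medical", "org/b": "none", "org/c": "code"}
cheap_preds = {"org/a": "medical", "org/b": "legal", "org/c": "none"}
hard_cases = select_for_adjudication(model_preds, cheap_preds)
# Only the two disagreements are escalated; the agreed case is skipped.
```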

## Architecture

- Base model: answerdotai/ModernBERT-base
- Max sequence length: 2048 tokens
- Class-weighted cross-entropy loss
- 5 epochs, lr=2e-5, effective batch size 16
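Class-weighted cross-entropy counteracts label imbalance by up-weighting rare classes in the loss. One common scheme is inverse-frequency weighting, sketched below (the counts are illustrative, not the real label distribution; the resulting weights would be passed to something like `torch.nn.CrossEntropyLoss(weight=...)`):

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n_total / (n_classes * n_class), so that rarer
    classes contribute proportionally more to the training loss."""
    counts = Counter(labels)
    n_total = len(labels)
    n_classes = len(counts)
    return {c: n_total / (n_classes * k) for c, k in counts.items()}

# Illustrative imbalanced label set: 'math' is the rarest, so it gets the
# largest weight.
weights = inverse_frequency_weights(["none"] * 600 + ["legal"] * 300 + ["math"] * 100)
```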

## Known Limitations

- **French open-data stubs:** datasets from the french-open-data org are empty references to external portals. The input filter handles these; without it, the model may confidently misclassify them.
- **HTML-heavy cards:** cards with extensive HTML markup (badges, styled headers) may lose signal. Strip HTML tags before classification.
- **Taxonomy gaps:** there is no "physics" or "engineering" category, so datasets in those areas may be classified as "none" or an adjacent domain.

## How This Was Built

This model was developed through a collaborative workflow between a human (@davanstrien) and Hermes Agent, an AI coding agent built by NousResearch. The entire data-centric development loop (error diagnosis, LLM-in-the-loop active learning, training, evaluation) was conducted interactively in a single session, with the agent writing and executing scripts, managing HF Jobs, and iterating on the methodology based on results.

The active learning pipeline (disagreement sampling between ModernBERT and cheap/strong LLMs, three-tier adjudication, and iterative retraining) was designed and executed by the agent as part of a data-centric model development skill that captures reusable patterns for training task-specific models with a focus on data quality.
