# ModernBERT HF Dataset Domain Classifier v3

Classifies HuggingFace dataset cards into 10 domain categories. Built through iterative, data-centric development with LLM-in-the-loop active learning.

## Labels

| Label | Description |
|---|---|
| biology | Genomics, ecology, proteins, life sciences |
| chemistry | Molecules, reactions, materials science |
| climate | Weather, earth observation, environmental monitoring |
| code | Programming, software engineering, code generation |
| cybersecurity | Security threats, malware, vulnerability detection |
| finance | Banking, trading, economics, financial documents |
| legal | Law, regulations, court cases, legal documents |
| math | Mathematics, theorem proving, formal verification |
| medical | Healthcare, clinical, biomedical, drug discovery |
| none | General-purpose, no specific domain, stubs |

## Usage

```python
from transformers import pipeline

clf = pipeline("text-classification", model="davanstrien/modernbert-hf-dataset-domain-v3", top_k=1)
result = clf("This dataset contains chest X-ray images for pneumonia detection...")
# [{'label': 'medical', 'score': 0.99}]
```

## Performance

Evaluated on 72 out-of-distribution examples validated by 3-model consensus (Qwen3-235B, DeepSeek-V3, Llama-3.3-70B).

**With recommended input filter** (filters known stub orgs and templates):

| Metric | Value |
|---|---|
| Overall accuracy | 90.3% |
| False positives (domain → none) | 3/27 |
| Wrong domain | 0 |
| None recall | 89% |

**Model only** (no input filter):

| Metric | Value |
|---|---|
| Overall accuracy | 81.9% |
| Macro F1 | 0.850 |

100% recall on biology, chemistry, code, cybersecurity, finance, legal, math, medical.
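For reference, macro F1 averages the per-class F1 scores with equal weight per class, so a rare class counts as much as a common one. A minimal sketch of the computation (the labels below are illustrative, not from the gold set):

```python
def macro_f1(y_true, y_pred):
    """Average of per-class F1 scores, weighting every class equally."""
    f1s = []
    for c in set(y_true) | set(y_pred):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

score = macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```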

## Per-class (model + filter)

| Class | Accuracy |
|---|---|
| biology | 83% |
| chemistry | 100% |
| climate | 78% |
| code | 100% |
| cybersecurity | 100% |
| finance | 80% |
| legal | 100% |
| math | 100% |
| medical | 100% |
| none | 89% |

## Recommended Input Filter

For best results, filter known junk before classification:

```python
import re

def should_skip(card_text: str, dataset_id: str = "") -> bool:
    """Return True if the card should be auto-classified as 'none'."""
    # Known stub orgs
    if dataset_id:
        org = dataset_id.split("/")[0] if "/" in dataset_id else ""
        if org in {"french-open-data"}:
            return True
    # HF default template
    if "Dataset Card for Dataset Name" in card_text and "Provide a quick summary" in card_text:
        return True
    # Too short after stripping HTML tags, URLs, and extra whitespace
    clean = re.sub(r'<[^>]+>', '', card_text)
    clean = re.sub(r'https?://\S+', '', clean)
    clean = re.sub(r'\s+', ' ', clean).strip()
    return len(clean) < 100
```
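As a quick sanity check of the length heuristic, the same stripping steps can be exercised on a synthetic stub card (the card text below is illustrative):

```python
import re

def visible_length(card_text: str) -> int:
    """Length of a card after removing HTML tags, URLs, and extra whitespace,
    mirroring the cleanup inside should_skip."""
    clean = re.sub(r'<[^>]+>', '', card_text)
    clean = re.sub(r'https?://\S+', '', clean)
    clean = re.sub(r'\s+', ' ', clean).strip()
    return len(clean)

# A badge-only stub card: almost nothing survives the cleanup.
stub = '<div align="center"><img src="https://example.org/badge.png"></div> See https://example.org'
assert visible_length(stub) < 100
```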

## Training

### Data

3,437 examples from three sources:

- **Tag-based labels (1,990):** derived from existing HuggingFace dataset tags
- **LLM-labelled "none" examples (607):** Qwen3-235B classified untagged datasets
- **Active learning hard negatives (800):** disagreement sampling between the v2 model and Qwen3-4B, adjudicated by Qwen3-235B

## Development History

| Version | OOD Accuracy | Method |
|---|---|---|
| v1 | 52.8% | Tag-based labels only |
| v2 | 76.4% | + LLM-labelled "none" examples |
| v3 | 84.7% (90.3% with filter) | + Disagreement-based active learning |

## Methodology

1. **Bootstrap from Hub tags:** 2,889 of the initial labels came for free from existing dataset tags
2. **Multi-model consensus validation:** 3 frontier LLMs validated 72 OOD examples to build a gold evaluation set
3. **Disagreement-based active learning:** ran the model on 5K datasets, used a cheap LLM (Qwen3-4B) to find disagreements, and adjudicated with a strong LLM (Qwen3-235B). Found 382 overturned predictions in 800 examples.
4. **Iterative retraining:** each round targeted specific weaknesses identified by error analysis
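Step 3 reduces to a simple selection rule: only cards where the classifier and the cheap LLM disagree are escalated to the expensive adjudicator. A minimal sketch (function and variable names are illustrative; the real pipeline ran over 5K dataset cards):

```python
def select_for_adjudication(model_preds: dict, cheap_llm_preds: dict) -> list:
    """Return dataset ids where the classifier and the cheap LLM disagree.
    Only these hard cases are sent to the strong adjudicator model."""
    return sorted(
        ds_id
        for ds_id, label in model_preds.items()
        if ds_id in cheap_llm_preds and cheap_llm_preds[ds_id] != label
    )

model_preds = {"org/a": "medical", "org/b": "none", "org/c": "code"}
cheap_preds = {"org/a": "medical", "org/b": "legal", "org/c": "none"}
hard_cases = select_for_adjudication(model_preds, cheap_preds)
# Only the two disagreements are escalated; the agreed case is skipped.
```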

## Architecture

- Base model: answerdotai/ModernBERT-base
- Max sequence length: 2048 tokens
- Class-weighted cross-entropy loss
- 5 epochs, lr=2e-5, effective batch size 16
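Class-weighted cross-entropy counteracts label imbalance by up-weighting rare classes in the loss. One common scheme is inverse-frequency weighting, sketched below (the counts are illustrative, not the real label distribution; the resulting weights would be passed to something like `torch.nn.CrossEntropyLoss(weight=...)`):

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n_total / (n_classes * n_class), so that rarer
    classes contribute proportionally more to the training loss."""
    counts = Counter(labels)
    n_total = len(labels)
    n_classes = len(counts)
    return {c: n_total / (n_classes * k) for c, k in counts.items()}

# Illustrative imbalanced label set: 'math' is the rarest, so it gets the
# largest weight.
weights = inverse_frequency_weights(["none"] * 600 + ["legal"] * 300 + ["math"] * 100)
```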

## Known Limitations

- **French open-data stubs:** datasets from the french-open-data org are empty references to external portals. The input filter handles these; without it, the model may confidently misclassify them.
- **HTML-heavy cards:** cards with extensive HTML markup (badges, styled headers) may lose signal. Strip HTML tags before classification.
- **Taxonomy gaps:** there is no "physics" or "engineering" category, so datasets in those areas may be classified as "none" or an adjacent domain.

## How This Was Built

This model was developed through a collaborative workflow between a human (@davanstrien) and Hermes Agent, an AI coding agent built by NousResearch. The entire data-centric development loop (error diagnosis, LLM-in-the-loop active learning, training, evaluation) was conducted interactively in a single session, with the agent writing and executing scripts, managing HF Jobs, and iterating on the methodology based on results.

The active learning pipeline (disagreement sampling between ModernBERT and cheap/strong LLMs, three-tier adjudication, and iterative retraining) was designed and executed by the agent as part of a data-centric model development skill that captures reusable patterns for training task-specific models with a focus on data quality.
