ChristBERT/sciGNAD

Model Description

ChristBERT/sciGNAD is a German binary text classification model for distinguishing scientific content from general-domain text.

The model is based on GeistBERT_base and fine-tuned for domain filtering, that is, for separating scientific content from non-scientific text.

Intended Use

Primary Use Cases

Classification of German text into:
- scientific/medical
- non-scientific
Data filtering for domain-specific NLP pipelines
Preselection of relevant documents in retrieval systems
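
The filtering use case can be sketched as follows. This is a minimal illustration, not the author's pipeline: the repository id `ChristBERT/sciGNAD_tcls` is taken from the model page, while `build_classifier`, `filter_scientific`, the threshold, and the positive label string are hypothetical and must be checked against the `id2label` mapping in the model's config.

```python
from typing import Callable, Dict, List

# Assumption: repository id as shown on the model page; verify before use.
MODEL_ID = "ChristBERT/sciGNAD_tcls"


def build_classifier(model_id: str = MODEL_ID) -> Callable[[List[str]], List[Dict]]:
    """Wrap a Transformers text-classification pipeline (imported lazily)."""
    from transformers import pipeline  # needs `transformers` and a backend such as torch

    clf = pipeline("text-classification", model=model_id)
    # truncation=True cuts texts that exceed the model's maximum input length.
    return lambda texts: clf(texts, truncation=True)


def filter_scientific(
    texts: List[str],
    classify: Callable[[List[str]], List[Dict]],
    positive_label: str,
    threshold: float = 0.5,
) -> List[str]:
    """Keep texts whose top prediction is `positive_label` with score >= threshold."""
    predictions = classify(texts)
    return [
        text
        for text, pred in zip(texts, predictions)
        if pred["label"] == positive_label and pred["score"] >= threshold
    ]
```

A call would look like `filter_scientific(docs, build_classifier(), positive_label="LABEL_1")`, where `"LABEL_1"` is only a placeholder for whatever name the model config assigns to the scientific class. Passing the classifier as a callable keeps the filtering logic testable without downloading the model.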

Out-of-Scope Use

Fine-grained topic classification
Clinical or medical decision support
High-stakes applications

Training Data

The model was fine-tuned on a binary-labeled dataset derived from the 10kGNAD corpus.

Source corpus (10kGNAD): roughly 10,000 German news articles
Scientific subset (positive class): 573 articles

Dataset Construction

Positive class:
- Scientific articles from 10kGNAD
Negative class:
- Stratified sample from remaining categories
Balanced dataset:
- Equal class distribution
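
The construction above can be sketched as below. This is an illustration under assumptions, not the author's script: it assumes the scientific articles are those in the 10kGNAD category `Wissenschaft`, that "stratified" means negatives are drawn from the remaining categories in proportion to their size, and that the input is a list of `(text, category)` pairs. The function name and the seed are hypothetical. If all 573 scientific articles are used, a balanced set built this way would contain 1,146 articles.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def build_balanced_dataset(
    articles: List[Tuple[str, str]],
    positive_category: str = "Wissenschaft",  # assumption: category used as the positive class
    seed: int = 0,
) -> List[Tuple[str, int]]:
    """Return (text, label) pairs with label 1 = scientific, 0 = non-scientific."""
    rng = random.Random(seed)
    positives = [text for text, cat in articles if cat == positive_category]
    by_category: Dict[str, List[str]] = defaultdict(list)
    for text, cat in articles:
        if cat != positive_category:
            by_category[cat].append(text)

    n_pool = sum(len(texts) for texts in by_category.values())
    n_target = min(len(positives), n_pool)
    if n_target == 0:
        return []

    # Proportional allocation with integer arithmetic (largest-remainder rounding).
    counts = {cat: n_target * len(texts) // n_pool for cat, texts in by_category.items()}
    remainders = {cat: n_target * len(texts) % n_pool for cat, texts in by_category.items()}
    leftover = n_target - sum(counts.values())
    for cat in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        counts[cat] += 1

    negatives: List[str] = []
    for cat, texts in by_category.items():
        negatives.extend(rng.sample(texts, counts[cat]))

    # Downsample positives only if the negative pool is smaller than the positive class.
    positives = rng.sample(positives, n_target)

    dataset = [(t, 1) for t in positives] + [(t, 0) for t in negatives]
    rng.shuffle(dataset)
    return dataset
```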

Evaluation

The model was evaluated on a manually labeled dataset:

Test size: 119 documents
Annotation tool: LabelStudio

Performance

Metric	Score
F1 Score	80.34%
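
For reference, a minimal sketch of how an F1 score can be computed from gold labels and predictions. The model card does not say whether the reported value is the F1 of the positive class or a macro or weighted average, so the positive-class variant shown here is an assumption, and `binary_f1` is a hypothetical helper name.

```python
from typing import Sequence


def binary_f1(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """F1 of the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

On the toy input `y_true=[1, 1, 0, 0, 1]`, `y_pred=[1, 0, 0, 1, 1]` there are 2 true positives, 1 false positive, and 1 false negative, so precision and recall are both 2/3 and the F1 is 2/3. These numbers are illustrative only and unrelated to the evaluation set.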

Training Details

Base model: GeistBERT_base
Task: Binary classification
Language: German
Framework: Hugging Face Transformers
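
A binary classification setup on top of the base model can be sketched with Transformers as follows. This is a hedged starting point, not the training script: the label names and their order are assumptions, the helper name is hypothetical, and the card reports no hyperparameters, so none are shown.

```python
# Assumption: label names and order are illustrative; the model card does not state them.
ID2LABEL = {0: "non-scientific", 1: "scientific"}
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}
BASE_MODEL = "GeistBERT/GeistBERT_base"  # base checkpoint named on the model page


def load_base_for_classification(base_model: str = BASE_MODEL):
    """Load the base encoder with a freshly initialized two-class head."""
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForSequenceClassification.from_pretrained(
        base_model,
        num_labels=len(ID2LABEL),
        id2label=ID2LABEL,
        label2id=LABEL2ID,
    )
    return tokenizer, model
```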

Limitations

The model was trained on a dataset derived from German news articles, which may limit transferability to other text genres.
The evaluation set is small (119 manually labeled web-crawl documents from the medical field), so the reported F1 score should be read as a rough estimate.
The model is intended for filtering scientific/medical relevance and should not be used for clinical or medical decision-making.

Ethical Considerations

Intended for filtering and preprocessing, not decision-making
Misclassifications may affect downstream dataset quality

Citation

If you use this model, please cite:

@misc{christbert_scignad,
  title={ChristBERT/sciGNAD: Scientific Content Classifier for German Web Data},
  author={Schmitt, Raphael},
  year={2026}
}

@misc{he2026wordwaystrategiesdomainspecific,
  title={The Word and the Way: Strategies for Domain-Specific BERT Pre-Training in German Medical NLP},
  author={Henry He and Johann Frei and Raphael Schmitt},
  year={2026},
  eprint={2606.03250},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2606.03250}
}

Model size: 0.1B params (Safetensors)
Tensor type: F32

Model tree for ChristBERT/sciGNAD_tcls

Base model: GottBERT/GottBERT_filtered_base_best
Fine-tuned: GeistBERT/GeistBERT_base
Fine-tuned: this model

Paper for ChristBERT/sciGNAD_tcls

The Word and the Way: Strategies for Domain-Specific BERT Pre-Training in German Medical NLP (arXiv 2606.03250, published Jun 2)