
ChristBERT/sciGNAD

Model Description

ChristBERT/sciGNAD is a German binary text classification model for distinguishing scientific content from general-domain text.

The model is based on GeistBERT base and fine-tuned for domain filtering, that is, separating relevant scientific content from non-scientific text before downstream processing.


Intended Use

Primary Use Cases

  • Classification of German text into:
    • scientific/medical
    • non-scientific
  • Data filtering for domain-specific NLP pipelines
  • Preselection of relevant documents in retrieval systems
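
The filtering use case can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the repository id `ChristBERT/sciGNAD_tcls` and the positive label name `LABEL_1` are assumptions that should be checked against the model's `config.json`, and `filter_scientific`, `load_classifier`, and the 0.5 threshold are hypothetical names and values chosen for this example.

```python
from typing import Any, Callable, Dict, List

# Assumptions (not confirmed by this card): repository id and label name.
MODEL_ID = "ChristBERT/sciGNAD_tcls"
POSITIVE_LABEL = "LABEL_1"

# A classifier maps a list of texts to one {"label", "score"} dict per text.
Classifier = Callable[[List[str]], List[Dict[str, Any]]]


def load_classifier(model_id: str = MODEL_ID) -> Classifier:
    """Wrap a Transformers text-classification pipeline as a Classifier."""
    # Imported lazily so the filtering helper below works without transformers.
    from transformers import pipeline

    pipe = pipeline("text-classification", model=model_id)
    # truncation=True cuts long documents to the model's maximum input length.
    return lambda texts: pipe(texts, truncation=True)


def filter_scientific(
    texts: List[str],
    classify: Classifier,
    positive_label: str = POSITIVE_LABEL,
    threshold: float = 0.5,
) -> List[str]:
    """Keep texts whose top prediction is the positive label with score >= threshold."""
    predictions = classify(texts)
    return [
        text
        for text, pred in zip(texts, predictions)
        if pred["label"] == positive_label and float(pred["score"]) >= threshold
    ]
```

Calling `filter_scientific(docs, load_classifier())` on a list of German documents would return the subset predicted as scientific/medical. Raising `threshold` trades recall for precision, which may suit corpus filtering where false positives are costly.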

Out-of-Scope Use

  • Fine-grained topic classification
  • Clinical or medical decision support
  • High-stakes applications

Training Data

The model was fine-tuned on a binary-labeled dataset derived from 10kGNAD (Ten Thousand German News Articles Dataset).

  • Source corpus size: approximately 10,000 German news articles
  • Scientific subset: 573 articles

Dataset Construction

  • Positive class:
    • Scientific articles from 10kGNAD
  • Negative class:
    • Stratified sample from remaining categories
  • Balanced dataset:
    • Equal class distribution
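
The construction above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' script: "stratified" is interpreted here as sampling negatives in proportion to each remaining category's size, and the function name `build_balanced_binary`, the `(text, category)` input format, and the seed are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def build_balanced_binary(
    articles: List[Tuple[str, str]], positive_category: str, seed: int = 0
) -> List[Tuple[str, int]]:
    """Return shuffled (text, label) pairs with equal class counts (1 = positive)."""
    rng = random.Random(seed)
    positives = [text for text, cat in articles if cat == positive_category]
    by_category: Dict[str, List[str]] = defaultdict(list)
    for text, cat in articles:
        if cat != positive_category:
            by_category[cat].append(text)

    n_negative_pool = sum(len(texts) for texts in by_category.values())
    target = min(len(positives), n_negative_pool)

    # Proportional allocation per category, rounded by the largest-remainder rule
    # so that the per-category counts sum exactly to `target`.
    quotas = {c: target * len(t) / n_negative_pool for c, t in by_category.items()}
    counts = {c: int(q) for c, q in quotas.items()}
    leftover = target - sum(counts.values())
    by_remainder = sorted(quotas, key=lambda c: quotas[c] - counts[c], reverse=True)
    for cat in by_remainder[:leftover]:
        counts[cat] += 1

    negatives: List[str] = []
    for cat, texts in by_category.items():
        negatives.extend(rng.sample(texts, counts[cat]))

    # Downsample positives only if the negative pool is smaller than the positive set.
    positives = rng.sample(positives, target)
    data = [(t, 1) for t in positives] + [(t, 0) for t in negatives]
    rng.shuffle(data)
    return data
```

With 10kGNAD-style input, `positive_category` would be the science category of the corpus, and the result would contain as many negatives as scientific articles.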

Evaluation

The model was evaluated on a manually labeled dataset:

  • Test size: 119 documents
  • Annotation tool: LabelStudio

Performance

Metric      Score
F1 Score    80.34%
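
For reference, a minimal sketch of how an F1 score for the positive class can be computed from gold labels and predictions. The card does not state which averaging (binary, macro, or weighted) produced the reported figure, so the positive-class variant shown here is an assumption, and the labels in the usage note are toy values, not the evaluation data.

```python
from typing import Sequence


def binary_f1(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """F1 of the positive class: harmonic mean of precision and recall."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined), so F1 is 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

As a toy check, `binary_f1([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])` has two true positives, one false positive, and one false negative, giving precision and recall of 2/3 each and therefore an F1 of 2/3.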

Training Details

  • Base model: GeistBERT_base
  • Task: Binary classification
  • Language: German
  • Framework: Hugging Face Transformers

Limitations

  • The model was trained on a dataset derived from German news articles, which may limit transferability to other text genres.
  • The evaluation set is small (119 manually labeled web-crawl documents from the medical field), so the reported F1 score is only a rough estimate of performance.
  • The model is intended for filtering scientific/medical relevance and should not be used for clinical or medical decision-making.

Ethical Considerations

  • Intended for filtering and preprocessing, not decision-making
  • Misclassifications may affect downstream dataset quality

Citation

If you use this model, please cite:

@misc{christbert_scignad,
  title={ChristBERT/sciGNAD: Scientific Content Classifier for German Web Data},
  author={Schmitt, Raphael},
  year={2026}
}

@misc{christbert,
  title     = {The Word and the Way: Strategies for Domain-Specific {BERT} Pre-Training in German Medical NLP},
  author    = {Henry He and Johann Frei and Raphael Scheible-Schmitt},
  shorttitle= {The Word and the Way},
  year      = {2025},
  month     = sep,
  publisher = {Research Square},
  doi       = {10.21203/rs.3.rs-7332811/v1},
  url       = {https://www.researchsquare.com/article/rs-7332811/v1},
  urldate   = {2025-09-23},
  note      = {ISSN: 2693-5015}
}

Model size: 0.1B params (Safetensors, tensor type F32)