datasets:
  - samirmsallem/wiki_definitions_de_multitask
language:
  - de
pipeline_tag: text-classification
library_name: transformers
tags:
  - science
  - ner
  - def_extraction
  - definitions
metrics:
  - accuracy
model-index:
  - name: checkpoints
    results:
      - task:
          name: Text Classification
          type: text-classification
        dataset:
          name: samirmsallem/wiki_definitions_de_multitask
          type: samirmsallem/wiki_definitions_de_multitask
        metrics:
          - name: Accuracy
            type: accuracy
            value: 0.9597156398104265
base_model:
  - ai4stem-uga/G-SciEdBERT

Text classification model for definition recognition in German scientific texts

G-SciEdBERT-definition_classification is a German text classification model for the scientific domain, fine-tuned from G-SciEdBERT. It was trained on a custom annotated dataset of around 10,000 training and 2,000 test examples, consisting of definition and non-definition sentences from German Wikipedia articles. G-SciEdBERT was chosen as the base model for comparison with gbert-base-definition_classification, which achieves slightly higher accuracy and lower loss.

The model classifies each input sentence as either a definition or a non-definition sentence:

| Text Classification Tag | Text Classification Label | Description |
| --- | --- | --- |
| 0 | NON_DEF_SENTENCE | Text is a non-definitional sentence |
| 1 | DEF_SENTENCE | Text is a definitional sentence |
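The label table above can be mirrored in code when integer tags are needed, for example to compare predictions against gold annotations. The following is a minimal sketch: the names `ID2LABEL`, `LABEL2ID`, and `label_to_tag` are hypothetical helpers, not part of the model or of transformers, and the input is assumed to have the `{'label': ..., 'score': ...}` shape that the text-classification pipeline returns.

```python
# Mapping copied from the label table; the variable names are illustrative.
ID2LABEL = {0: "NON_DEF_SENTENCE", 1: "DEF_SENTENCE"}
LABEL2ID = {name: tag for tag, name in ID2LABEL.items()}


def label_to_tag(prediction: dict) -> int:
    """Convert one pipeline-style prediction dict to its integer tag."""
    label = prediction["label"]
    if label not in LABEL2ID:
        raise ValueError(f"Unknown label: {label!r}")
    return LABEL2ID[label]


# A hand-written prediction dict standing in for real pipeline output.
print(label_to_tag({"label": "DEF_SENTENCE", "score": 0.99}))  # 1
```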

Training used a standard text classification objective. The model achieves an accuracy of approximately 96% on the evaluation set.

Here are the overall final metrics on the test dataset after 4 epochs of training:

Accuracy: 0.9597156398104265
Loss: 0.20282548666000366
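To make the two metrics concrete, here is a small sketch of how accuracy and a mean loss can be computed for a binary classifier. It assumes the loss is the usual cross-entropy, which the card does not state explicitly. The function names and the toy probabilities are made up for illustration and are not outputs of this model.

```python
import math


def accuracy(gold, predicted):
    """Fraction of examples whose predicted tag equals the gold tag."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equally long")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def mean_cross_entropy(gold, probabilities):
    """Average negative log-probability assigned to the gold class."""
    if not gold or len(gold) != len(probabilities):
        raise ValueError("gold and probabilities must be non-empty and equally long")
    return -sum(math.log(p[g]) for g, p in zip(gold, probabilities)) / len(gold)


# Toy data: gold tags and per-class probabilities [P(tag 0), P(tag 1)].
gold = [1, 0, 1, 0]
probabilities = [[0.1, 0.9], [0.8, 0.2], [0.6, 0.4], [0.7, 0.3]]

# The predicted tag is the index of the highest probability.
predicted = [max(range(len(p)), key=p.__getitem__) for p in probabilities]

print(accuracy(gold, predicted))  # 0.75
print(mean_cross_entropy(gold, probabilities))
```

A confident wrong prediction raises the loss more than a hesitant one, which is why two models with similar accuracy can still differ in loss.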

Usage

from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub.
pipe = pipeline("text-classification", model="samirmsallem/G-SciEdBERT-definition_classification")

# Classify one definitional and one non-definitional German sentence.
results = pipe(['Natural Language Processing ist ein Verfahren der künstlichen Intelligenz.',
                'Rosen sind rot, Veilchen sind blau.'])
print(results)

# Example output:
# [{'label': 'DEF_SENTENCE', 'score': 0.9990215301513672}, {'label': 'NON_DEF_SENTENCE', 'score': 0.9968277812004089}]
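A common follow-up step is to keep only the sentences classified as definitions. The sketch below works on the example output shown above, so it runs without downloading the model. The helper `extract_definitions` and its `min_score` threshold are assumptions for illustration and not part of the model or of transformers.

```python
def extract_definitions(sentences, predictions, min_score=0.9):
    """Return the sentences labelled DEF_SENTENCE with at least min_score."""
    if len(sentences) != len(predictions):
        raise ValueError("sentences and predictions must have the same length")
    return [
        sentence
        for sentence, prediction in zip(sentences, predictions)
        if prediction["label"] == "DEF_SENTENCE" and prediction["score"] >= min_score
    ]


sentences = [
    "Natural Language Processing ist ein Verfahren der künstlichen Intelligenz.",
    "Rosen sind rot, Veilchen sind blau.",
]

# Predictions copied from the example output; in practice, use pipe(sentences).
predictions = [
    {"label": "DEF_SENTENCE", "score": 0.9990215301513672},
    {"label": "NON_DEF_SENTENCE", "score": 0.9968277812004089},
]

print(extract_definitions(sentences, predictions))
# ['Natural Language Processing ist ein Verfahren der künstlichen Intelligenz.']
```

The threshold of 0.9 is arbitrary; lowering it keeps more candidate definitions at the cost of more false positives.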