Text classification model for definition recognition in German scientific texts
G-SciEdBERT-definition_classification is a German text classification model for the scientific domain, fine-tuned from G-SciEdBERT. It was trained on a custom annotated dataset of around 10,000 training and 2,000 test examples, consisting of definition and non-definition sentences from German Wikipedia articles. The model serves as a point of comparison for gbert-base-definition_classification, which achieves slightly higher accuracy and lower loss.
The model is designed to classify each input sentence as either a definition or a non-definition sentence:
| Text Classification Tag | Text Classification Label | Description |
|---|---|---|
| 0 | NON_DEF_SENTENCE | The text is a non-definitional sentence |
| 1 | DEF_SENTENCE | The text is a definitional sentence |
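The tag and label mapping in the table can be written down as a small lookup, which is handy when working with raw class indices instead of label strings. This is a minimal sketch: the two dictionaries mirror the table, while `is_definition` is a hypothetical helper that is not part of the model or of any library.

```python
# Tag/label mapping copied from the table above.
ID2LABEL = {0: "NON_DEF_SENTENCE", 1: "DEF_SENTENCE"}
LABEL2ID = {label: tag for tag, label in ID2LABEL.items()}


def is_definition(tag_or_label):
    """Hypothetical helper: True for tag 1 or for the label 'DEF_SENTENCE'."""
    if isinstance(tag_or_label, int):
        return ID2LABEL[tag_or_label] == "DEF_SENTENCE"
    return tag_or_label == "DEF_SENTENCE"


print(is_definition(1), is_definition("NON_DEF_SENTENCE"))
```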
Training used a standard text classification objective. The model achieves an accuracy of approximately 96% on the evaluation set.
Here are the overall final metrics on the test dataset after 4 epochs of training:
- Accuracy: 0.9597156398104265
- Loss: 0.20282548666000366
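For readers who want to see how metrics of this kind are usually derived, here is a minimal sketch in plain Python. It assumes that the reported loss is the mean cross-entropy over two-class logits, which is the common default for sequence classification but is not stated in this card. The function names and the toy logits are hypothetical and are not taken from the actual evaluation.

```python
import math


def accuracy(predictions, gold):
    """Fraction of predicted tags that match the gold tags."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


def cross_entropy(logits, gold_tag):
    """Negative log-softmax probability of the gold class (numerically stable)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[gold_tag]


def mean_loss(batch_logits, gold):
    """Average cross-entropy over a batch of logit vectors."""
    losses = [cross_entropy(l, g) for l, g in zip(batch_logits, gold)]
    return sum(losses) / len(losses)


# Toy data (hypothetical): one [non-definition, definition] logit pair per sentence.
toy_logits = [[2.0, -1.0], [0.5, 1.5], [0.0, 0.0]]
toy_gold = [0, 1, 1]
toy_predictions = [max(range(len(l)), key=l.__getitem__) for l in toy_logits]

print(accuracy(toy_predictions, toy_gold))
print(mean_loss(toy_logits, toy_gold))
```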
Usage
```python
from transformers import pipeline

pipe = pipeline("text-classification", model="samirmsallem/G-SciEdBERT-definition_classification")

results = pipe([
    'Natural Language Processing ist ein Verfahren der künstlichen Intelligenz.',
    'Rosen sind rot, Veilchen sind blau.',
])
print(results)
# [{'label': 'DEF_SENTENCE', 'score': 0.9990215301513672}, {'label': 'NON_DEF_SENTENCE', 'score': 0.9968277812004089}]
```
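A common follow-up step is to keep only the sentences that were classified as definitions. The sketch below works on the result format shown above (one dictionary with `label` and `score` per input sentence) and reuses the example output as hard-coded data, so it runs without downloading the model. `extract_definitions` and its `min_score` threshold are hypothetical and not part of the `transformers` API.

```python
def extract_definitions(sentences, results, min_score=0.5):
    """Return the sentences labelled DEF_SENTENCE with at least `min_score` confidence."""
    return [
        sentence
        for sentence, result in zip(sentences, results)
        if result["label"] == "DEF_SENTENCE" and result["score"] >= min_score
    ]


# Inputs and outputs copied from the usage example above.
sentences = [
    "Natural Language Processing ist ein Verfahren der künstlichen Intelligenz.",
    "Rosen sind rot, Veilchen sind blau.",
]
results = [
    {"label": "DEF_SENTENCE", "score": 0.9990215301513672},
    {"label": "NON_DEF_SENTENCE", "score": 0.9968277812004089},
]

print(extract_definitions(sentences, results))
```

Raising `min_score` trades recall for precision; a suitable value depends on the application and would need to be checked on held-out data.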
Model tree for samirmsallem/G-SciEdBERT-definition_classification
- Base model: ai4stem-uga/G-SciEdBERT
Evaluation results
- Accuracy on samirmsallem/wiki_definitions_de_multitask (self-reported): 0.960