---
license: cc-by-nc-sa-4.0
language:
- de
---
## Model description
This model is a fine-tuned version of the [bert-base-german-cased model by deepset](https://huggingface.co/bert-base-german-cased) to classify toxic German-language user comments.
## How to use
You can use the model with the following code:
```python
# !pip install transformers
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TextClassificationPipeline

model_path = "ankekat1000/toxic-bert-german"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
pipeline = TextClassificationPipeline(model=model, tokenizer=tokenizer)

print(pipeline('du bist blöd.'))  # "you are stupid." -- a toxic example
```
You can also apply the pipeline to a whole column of a pandas DataFrame:
```python
# Truncate each comment to its first 512 characters so the input stays
# within the model's maximum sequence length of 512 tokens.
df['result'] = df['comment_text'].apply(lambda x: pipeline(x[:512]))

# Split the "result" column (a one-element list of dicts per row) into
# two new columns: one for the label, one for the score.
df['toxic_label'] = df['result'].str[0].str['label']
df['score'] = df['result'].str[0].str['score']
```
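The column extraction above relies on each pipeline call returning a one-element list of dicts. Here is a minimal, self-contained sketch of that structure, with hypothetical labels and scores standing in for real pipeline output so it runs without downloading the model:

```python
import pandas as pd

df = pd.DataFrame({"comment_text": ["du bist blöd.", "Danke für den Artikel!"]})

# Hypothetical stand-in for df['comment_text'].apply(lambda x: pipeline(x[:512])):
# each row holds a one-element list containing a label/score dict.
df["result"] = [
    [{"label": "toxic", "score": 0.98}],
    [{"label": "not toxic", "score": 0.95}],
]

# Unpack the list-of-dict column into two flat columns.
df["toxic_label"] = df["result"].str[0].str["label"]
df["score"] = df["result"].str[0].str["score"]

print(df[["comment_text", "toxic_label", "score"]])
```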
## Training
The pre-trained model [bert-base-german-cased by deepset](https://huggingface.co/bert-base-german-cased) was fine-tuned on a crowd-annotated data set of over 14,000 user comments labeled for toxicity in a binary classification task.
We labeled as toxic any comment that is inappropriate in whole or in part. By inappropriate, we mean comments that are rude, insulting, hateful, or otherwise make users feel disrespected.
**Language model:** bert-base-german-cased (pre-trained on ~ 12 GB of German text)
**Language:** German
**Labels:** Toxicity (binary classification)
**Training data:** User comments posted to websites and Facebook pages of German news media and to online participation platforms (~ 14,000)
**Labeling procedure:** Crowd annotation
**Batch size:** 32
**Epochs:** 4
**Max. tokens length:** 512
**Infrastructure:** 1x NVIDIA Quadro RTX 8000 GPU
**Published**: Oct 24th, 2023
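The hyperparameters above map onto the Hugging Face `TrainingArguments` API as in the following config sketch. This is an illustration, not the actual training script; `output_dir` is a placeholder, and the 512-token limit is applied at tokenization time via `truncation=True, max_length=512`:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="toxic-bert-german",  # placeholder output path
    per_device_train_batch_size=32,  # batch size: 32
    num_train_epochs=4,              # epochs: 4
)
```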
## Evaluation results
**Accuracy:** 86%
**Macro avg. F1:** 75%
| Label | Precision | Recall | F1 | Nr. comments in test set |
| ----------- | ----------- | ----------- | ----------- | ----------- |
| not toxic | 0.94 | 0.94 | 0.91 | 1094 |
| toxic | 0.68 | 0.53 | 0.59 | 274 |
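As a quick sanity check, the macro-averaged F1 reported above follows directly from the per-class F1 scores in the table, since a macro average weights each class equally regardless of how many test comments it contains:

```python
# Per-class F1 scores taken from the table above.
f1_per_class = {"not toxic": 0.91, "toxic": 0.59}

# Macro average: unweighted mean over classes.
macro_f1 = sum(f1_per_class.values()) / len(f1_per_class)

print(round(macro_f1, 2))  # 0.75
```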