---
license: cc-by-nc-sa-4.0
language:
- de
---

## Model description

This model is a fine-tuned version of the [bert-base-german-cased model by deepset](https://huggingface.co/bert-base-german-cased) to classify toxic German-language user comments.

## How to use

You can use the model with the following code.

```python
#!pip install transformers

from transformers import AutoModelForSequenceClassification, AutoTokenizer, TextClassificationPipeline

model_path = "ankekat1000/toxic-bert-german"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

pipeline = TextClassificationPipeline(model=model, tokenizer=tokenizer)
print(pipeline('du bist blöd.'))
```

You can also apply the pipeline to a data set, for example a pandas DataFrame with a column of comments.

```python
df['result'] = df['comment_text'].apply(lambda x: pipeline(x[:512]))  # Truncate to the first 512 characters as a rough guard; the model's maximum input length is 512 tokens.
# Afterwards, you can split the column "result" into two new columns, one with the label and one with the score.
df['toxic_label'] = df['result'].str[0].str['label']
df['score'] = df['result'].str[0].str['score']
```
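The column-splitting step can be illustrated with a stand-in pipeline (`fake_pipeline` below is hypothetical and used only so the snippet runs without downloading the model; the real pipeline returns the same `[{'label': ..., 'score': ...}]` structure per input):

```python
import pandas as pd

# Hypothetical stand-in for the real pipeline (illustration only): the HF
# text-classification pipeline returns a list containing one
# {'label': ..., 'score': ...} dict per input string.
def fake_pipeline(text):
    return [{'label': 'toxic' if 'blöd' in text else 'not toxic', 'score': 0.9}]

df = pd.DataFrame({'comment_text': ['du bist blöd.', 'Danke für den Artikel!']})
df['result'] = df['comment_text'].apply(lambda x: fake_pipeline(x[:512]))
# .str[0] selects the single dict out of each one-element result list,
# and .str['label'] / .str['score'] index into that dict.
df['toxic_label'] = df['result'].str[0].str['label']
df['score'] = df['result'].str[0].str['score']
print(df[['comment_text', 'toxic_label', 'score']])
```

`Series.str[...]` works element-wise here because each cell holds a list (for `str[0]`) or a dict (for `str['label']`), not a string.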

## Training

The pre-trained [bert-base-german-cased model by deepset](https://huggingface.co/bert-base-german-cased) was fine-tuned on a crowd-annotated data set of over 14,000 user comments labeled for toxicity in a binary classification task.

We defined comments as toxic if they are inappropriate in whole or in part. By inappropriate, we mean comments that are rude, insulting, hateful, or that otherwise make users feel disrespected.

**Language model:** bert-base-german-cased (~ 12 GB)
**Language:** German
**Labels:** Toxicity (binary classification)
**Training data:** User comments posted to websites and Facebook pages of German news media and to online participation platforms (~ 14,000 comments)
**Labeling procedure:** Crowd annotation
**Batch size:** 32
**Epochs:** 4
**Max. token length:** 512
**Infrastructure:** 1x Quadro RTX 8000 GPU
**Published:** Oct 24th, 2023
|
| | ## Evaluation results |
| |
|
| | **Accuracy:**: 86% |
| | **Macro avg. f1:**: 75% |

| Label | Precision | Recall | F1 | No. of comments in test set |
| --- | --- | --- | --- | --- |
| not toxic | 0.94 | 0.94 | 0.91 | 1094 |
| toxic | 0.68 | 0.53 | 0.59 | 274 |
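
The reported macro-averaged F1 can be recomputed from the per-label F1 scores in the table, since the macro average is simply the unweighted mean across labels:

```python
# Macro-averaged F1: unweighted mean of the per-label F1 scores from the table above.
f1_not_toxic = 0.91
f1_toxic = 0.59
macro_f1 = (f1_not_toxic + f1_toxic) / 2
print(round(macro_f1, 2))  # → 0.75, matching the reported 75%
```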