Model Card for A&ttack 2.5
Model Description
A text classification model for determining whether a social media post in Danish contains a verbal attack. The model outputs 'angreb' ('attack') if a text contains a verbal attack and 'ingenting' ('nothing') otherwise.
Developed by: The development team at Analyse & Tal
Model type: Language model restricted to classification
Language(s) (NLP): Danish
License: CC BY-SA
Finetuned from model: A&ttack2
Model Architecture
The model is a finetuning of our previous model A&ttack2. A&ttack2 is based on north/t5_large_scand (by Per E. Kummervold, not publicly available), a Scandinavian language model pretrained for 1,700,000 steps, starting from the mT5 checkpoint, on a Scandinavian corpus (Bokmål, Nynorsk, Danish, Swedish, and Icelandic, plus a small amount of Faroese).
Data
The data used for finetuning consists of ~29,000 Facebook comments, each classified as either 'verbal attack' or 'nothing'. The annotation process is described in detail in this report. The data has been specifically designed to ensure representation of 19 protected groups, which are described in this report. The data is split into 70% training, 10% validation, and 20% test.
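The 70/10/20 split can be sketched as follows. This is a minimal illustration only; the split actually used by Analyse & Tal may have been stratified by protected group, which is not shown here.

```python
import random

def split_dataset(examples, seed=42):
    """Shuffle and split examples into 70% train, 10% validation,
    20% test (illustrative sketch, not the original procedure)."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * 0.7)
    n_val = int(n * 0.1)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# With ~29,000 comments this yields 20,300 / 2,900 / 5,800 examples.
train, val, test = split_dataset(list(range(29000)))
```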
Training
The model is finetuned for 12,625 steps (5 epochs) with a batch size of 8. Model performance is evaluated every 250 steps using the F1 score on the validation data as the metric.
After training, the best-performing checkpoint is loaded as the final model. The best performance is reached at 12,000 steps with an F1 score of 0.77.
The model was trained on a P40 GPU.
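The training schedule above could be expressed with a Hugging Face `Seq2SeqTrainingArguments` configuration along these lines. This is a hypothetical sketch based on the description in this card; the actual training script, output path, and argument names used by the developers are not published here.

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of the setup described above (assumed, not the original script)
training_args = Seq2SeqTrainingArguments(
    output_dir="attack25-checkpoints",   # assumed path
    per_device_train_batch_size=8,       # batches of 8
    num_train_epochs=5,                  # 5 epochs (~12,625 steps)
    evaluation_strategy="steps",
    eval_steps=250,                      # evaluate every 250 steps
    save_strategy="steps",
    save_steps=250,
    load_best_model_at_end=True,         # reload best checkpoint at the end
    metric_for_best_model="f1",          # F1 on the validation data
    greater_is_better=True,
)
```

A `compute_metrics` function returning an `"f1"` key would be passed to the trainer so that `metric_for_best_model="f1"` can select the checkpoint.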
Using The Model
The sample code below shows how to get started with the model. If everything is working correctly, the code should output 'angreb'.
import transformers
import torch

# Set model device
cuda_device = 0
device = torch.device(f"cuda:{cuda_device}" if torch.cuda.is_available() else "cpu")

# Download/load tokenizer and language model
tokenizer = transformers.AutoTokenizer.from_pretrained("ogtal/A-og-ttack25")
model = transformers.T5ForConditionalGeneration.from_pretrained("ogtal/A-og-ttack25").to(device)

# Sample text from a social media comment
# ("They are traitors to the country; they should get the same treatment
#  as the traitors after WW2")
sample_text = "De er landsforrædere, de skal ha samme tur som landsforræderne efter ww2"
tokens = tokenizer(
    sample_text,
    return_tensors="pt",
    truncation=True,
    padding="max_length",
    max_length=128,
).input_ids
tokens = tokens.to(device)

# Forward pass and print the output
output = model.generate(
    tokens,
    generation_config=transformers.GenerationConfig(
        max_new_tokens=5,
        decoder_start_token_id=tokenizer.pad_token_id,
    ),
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
Note that the empty string ("") is classified as a verbal attack. Input data should be cleaned accordingly for the best results.
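A simple pre-filter that drops such inputs before classification might look like this (a sketch; adapt it to your own pipeline):

```python
def clean_comments(comments):
    """Drop empty or whitespace-only comments before classification,
    since the model labels the empty string as a verbal attack."""
    return [c for c in comments if c.strip()]

comments = ["Godt indlæg!", "", "   "]
print(clean_comments(comments))  # only "Godt indlæg!" remains
```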
This model card was written by the developer teams at Analyse & Tal and Os & Data. Contact: asger@osogdata.dk