---
language:
- en
- es
pipeline_tag: text-classification
---

# UPB's Multi-task Learning model for AuTexTification

This is a model for classifying text as human- or LLM-generated.

It was trained for one of University Politehnica of Bucharest's (UPB)
submissions to the [AuTexTification shared
task](https://sites.google.com/view/autextification/home).

The model was trained using multi-task learning to predict whether a text
document was written by a human or a large language model, and whether it was
written in English or Spanish.

The model outputs a score/probability for each task. For synthetic-text
detection, it also makes a binary prediction based on a threshold.

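As a rough illustration of how a threshold turns a score into a binary prediction, here is a minimal sketch. The function name `classify_bot`, the cutoff value `0.9`, and the strict `>` comparison are all assumptions made for the example; the threshold the model actually applies is not stated in this card.

```python
def classify_bot(bot_probs, threshold):
    """Return True for every probability strictly above the threshold."""
    return [p > threshold for p in bot_probs]


# Hypothetical cutoff chosen for illustration only; it is NOT the model's
# documented threshold.
HYPOTHETICAL_THRESHOLD = 0.9

decisions = classify_bot([0.997, 0.704, 0.12], HYPOTHETICAL_THRESHOLD)
print(decisions)  # [True, False, False]
```

A higher cutoff flags fewer texts as generated, trading recall for precision.
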
## Training data

The model was trained on approximately 33,845 English documents and 32,062
Spanish documents, covering five different domains, such as legal or social
media. The dataset is available on Zenodo (see the
[data instructions](https://sites.google.com/view/autextification/data)).

## Evaluation results

These results were computed as part of the [AuTexTification shared
task](https://sites.google.com/view/autextification/results):

| Language | Macro F1 | Confidence Interval |
|:---------|:--------:|:-------------------:|
| English  |  65.53   |   (64.92, 66.23)    |
| Spanish  |  65.01   |   (64.58, 65.64)    |

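For readers unfamiliar with the metric, macro F1 is the unweighted mean of the per-class F1 scores. The sketch below computes it on a tiny made-up label set, not on the shared-task data; the helper names `f1_for_class` and `macro_f1` are illustrative, and the final multiplication by 100 assumes the table reports scores on a 0-100 scale.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 for one class, computed as 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Made-up labels for illustration only.
y_true = ["human", "human", "generated", "generated"]
y_pred = ["human", "generated", "generated", "generated"]
print(f"{100 * macro_f1(y_true, y_pred):.2f}")  # prints 73.33
```

Because each class counts equally, macro F1 penalizes a classifier that does well on one class and poorly on the other.
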
## Using the model

You can load the model and its tokenizer using `AutoModel` and `AutoTokenizer`.
Loading the model requires `trust_remote_code=True`, as in the example.

This is an example of using the model for inference:

```python
import torch
import transformers

# Load the tokenizer and the model from the Hub.
checkpoint = "pandrei7/autextification-upb-mtl"
tokenizer = transformers.AutoTokenizer.from_pretrained(checkpoint)
model = transformers.AutoModel.from_pretrained(checkpoint, trust_remote_code=True)

texts = [
    "You're absoutely right! Let's delve into it.",
    "Tengo monos en la cara.",
]
# Tokenize the batch, padding to the longest text and truncating at 512 tokens.
inputs = tokenizer(
    texts, padding=True, truncation=True, max_length=512, return_tensors="pt"
)

# Run inference in evaluation mode, without tracking gradients.
model.eval()
with torch.no_grad():
    preds = model(inputs)

# Print the binary decision and the per-task scores for each text.
for i, text in enumerate(texts):
    print(f"Text: '{text}'")
    print(f"Bot? {preds['is_bot'][i].item()}")
    print(f"Bot score {preds['bot_prob'][i].item()}")
    print(f"English score {preds['english_prob'][i].item()}")
    print()
```

Running it prints output similar to:

```text
Text: 'You're absoutely right! Let's delve into it.'
Bot? True
Bot score 0.997463583946228
English score 0.9997979998588562

Text: 'Tengo monos en la cara.'
Bot? False
Bot score 0.7036079168319702
English score 0.0002293310681125149
```
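
If you want the predictions as plain records rather than printed lines, a small post-processing helper can be sketched as follows. The function `summarize`, its output keys, the label strings, and the `0.5` language cutoff are hypothetical choices for illustration; only the meaning of `is_bot`, `bot_prob`, and `english_prob` comes from the example above.

```python
def summarize(text, is_bot, bot_prob, english_prob, lang_cutoff=0.5):
    """Turn one set of model outputs into a readable record.

    lang_cutoff is an assumed cutoff for the English/Spanish decision,
    not a value documented for this model.
    """
    return {
        "text": text,
        "source": "generated" if is_bot else "human",
        "bot_prob": round(bot_prob, 3),
        "language": "en" if english_prob >= lang_cutoff else "es",
    }


# Values copied from the sample output for the Spanish sentence.
row = summarize(
    "Tengo monos en la cara.", False, 0.7036079168319702, 0.0002293310681125149
)
print(row)
```

When used with the inference example, each argument would be the `.item()` value taken from the corresponding tensor in `preds`.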