
	---
	language:
	- en
	- es
	pipeline_tag: text-classification
	---

	# UPB's Multi-task Learning model for AuTexTification

	This model classifies text as human- or LLM-generated.

	It was trained for one of the University Politehnica of Bucharest's (UPB)
	submissions to the [AuTexTification shared
	task](https://sites.google.com/view/autextification/home).

	Training used multi-task learning: the model predicts both whether a text
	document was written by a human or a large language model, and whether it was
	written in English or Spanish.

	The model outputs a score (probability) for each task. For synthetic-text
	detection, it also makes a binary prediction based on a threshold.
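
To illustrate how a score becomes a binary decision, here is a minimal sketch of thresholding. The threshold value and the `>=` comparison are assumptions made for illustration; this card does not state the threshold the model actually uses, and `is_synthetic` is a hypothetical helper, not part of the model's code.

```python
def is_synthetic(bot_prob: float, threshold: float) -> bool:
    """Return True when the score meets or exceeds the decision threshold."""
    return bot_prob >= threshold


# Hypothetical threshold, chosen only for illustration.
THRESHOLD = 0.9

# Example scores (made up for this sketch).
scores = [0.9975, 0.7036, 0.12]
labels = [is_synthetic(s, THRESHOLD) for s in scores]
print(labels)
```

A higher threshold trades recall on synthetic text for fewer false alarms on human text, which is why the binary flag and the raw score can be useful side by side.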

	## Training data

	The model was trained on approximately 33,845 English documents and 32,062
	Spanish documents, covering five domains, including legal text and social
	media. The dataset is available on Zenodo (see the [data
	instructions](https://sites.google.com/view/autextification/data)).

	## Evaluation results

	These results were computed as part of the [AuTexTification shared
	task](https://sites.google.com/view/autextification/results):

	| Language | Macro F1 | Confidence Interval |
	|:---------|:--------:|:-------------------:|
	| English  | 65.53    | (64.92, 66.23)      |
	| Spanish  | 65.01    | (64.58, 65.64)      |
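
For readers unfamiliar with the metric, macro F1 is the unweighted mean of the per-class F1 scores. The sketch below computes it in plain Python on a small made-up example; the labels and predictions are illustrative only and are not taken from the shared task. The table above appears to report the score on a 0-100 scale, so the sketch scales its result the same way.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all observed classes."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Made-up labels for illustration.
y_true = ["human", "human", "bot", "bot", "bot", "human"]
y_pred = ["human", "bot", "bot", "bot", "human", "human"]

score = macro_f1(y_true, y_pred)
print(round(100 * score, 2))  # 66.67
```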

	## Using the model

	You can load the model and its tokenizer using `AutoModel` and `AutoTokenizer`.
	Pass `trust_remote_code=True` when loading the model, as in the example below.

	This is an example of using the model for inference:

	```python
	import torch
	import transformers

	checkpoint = "pandrei7/autextification-upb-mtl"
	tokenizer = transformers.AutoTokenizer.from_pretrained(checkpoint)
	# The model ships custom code, so `trust_remote_code=True` is required.
	model = transformers.AutoModel.from_pretrained(checkpoint, trust_remote_code=True)

	texts = [
	    "You're absoutely right! Let's delve into it.",
	    "Tengo monos en la cara.",
	]
	# Tokenize the batch, padding to a common length and truncating long inputs.
	inputs = tokenizer(
	    texts, padding=True, truncation=True, max_length=512, return_tensors="pt"
	)

	# Run inference without tracking gradients.
	model.eval()
	with torch.no_grad():
	    preds = model(inputs)

	# Print the binary decision and the per-task scores for each text.
	for i, text in enumerate(texts):
	    print(f"Text: '{text}'")
	    print(f"Bot? {preds['is_bot'][i].item()}")
	    print(f"Bot score {preds['bot_prob'][i].item()}")
	    print(f"English score {preds['english_prob'][i].item()}")
	    print()
	```

	```text
	Text: 'You're absoutely right! Let's delve into it.'
	Bot? True
	Bot score 0.997463583946228
	English score 0.9997979998588562

	Text: 'Tengo monos en la cara.'
	Bot? False
	Bot score 0.7036079168319702
	English score 0.0002293310681125149
	```
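
If you want the results as plain Python records rather than printed lines, a small helper along these lines may be convenient. This is a sketch under assumptions: `to_records` is a hypothetical helper, it expects plain lists (for example obtained by calling `.tolist()` on each output tensor), and the 0.5 cutoff used to turn the English score into a language label is an illustrative choice, not something the model defines.

```python
def to_records(texts, bot_probs, english_probs, is_bot, lang_cutoff=0.5):
    """Pair each text with its scores and a derived language label."""
    records = []
    for text, bot_p, en_p, flag in zip(texts, bot_probs, english_probs, is_bot):
        records.append(
            {
                "text": text,
                "is_bot": bool(flag),
                "bot_prob": float(bot_p),
                # Assumed rule: scores at or above the cutoff count as English.
                "language": "en" if en_p >= lang_cutoff else "es",
            }
        )
    return records


# Values rounded from the sample output above.
records = to_records(
    ["Tengo monos en la cara."],
    bot_probs=[0.7036],
    english_probs=[0.0002],
    is_bot=[False],
)
print(records[0]["language"])  # es
```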