	---
	license: mit
	datasets:
	- astromis/presuicidal_signals
	language:
	- ru
	metrics:
	- f1
	library_name: transformers
	pipeline_tag: text-classification
	tags:
	- russian
	- suicide
	---

	# Presuicidal RuBERT base

	This model is [ruBert](https://huggingface.co/ai-forever/ruBert-base) fine-tuned on the presuicidal signals dataset. It aims to help psychologists find texts that contain useful information about a person's suicidal behavior.

	The model distinguishes two categories:
	* category 1 - texts with useful information about a person's suicidal behavior, such as attempts and incidents of rape, problems with parents, a stay in a psychiatric hospital, incidents of self-harm, etc. This category also includes messages that express a subjective negative attitude towards oneself or others, including a desire to die, a feeling of pressure from the past, self-hatred, aggressiveness, and rage directed at oneself or others.
	* category 0 - normal texts that do not contain the information described above.

	# How to use

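	```python
	import torch
	from transformers import AutoTokenizer, BertForSequenceClassification

	tokenizer = AutoTokenizer.from_pretrained("astromis/presuisidal_rubert")
	model = BertForSequenceClassification.from_pretrained("astromis/presuisidal_rubert")
	model.eval()  # inference mode: disables dropout

	# Roughly: "I feel so bad, I want to die" and
	# "yesterday I was at a meetup with friends, it was really cool"
	text = ["мне так плохо я хочу умереть", "вчера была на сходке с друзьями было оч клево"]

	# Pad or truncate every text to 512 tokens and return PyTorch tensors
	tokenized_text = tokenizer(text, padding="max_length", truncation=True, max_length=512, return_tensors="pt")

	with torch.no_grad():  # gradients are not needed for inference
	    prediction = model(**tokenized_text).logits

	# The index of the largest logit is the predicted category
	print(prediction.argmax(dim=1).tolist())
	# >>> [1, 0]
	```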

	# Training procedure

	## Data preprocessing

	Before training, the text was transformed as follows:
	* all emojis were removed (in the dataset they are marked as `<emoji>emoji_name</emoji>`);
	* punctuation was removed;
	* the text was lowercased;
	* all line breaks were replaced with spaces;
	* runs of several spaces were collapsed into one.
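
The preprocessing steps listed above can be sketched roughly as follows. This is a minimal illustration, not the author's original script: the function name `preprocess` is hypothetical, and the exact regular expressions (in particular, what counts as punctuation) are assumptions.

```python
import re


def preprocess(text: str) -> str:
    """Hypothetical re-implementation of the preprocessing listed above."""
    # 1. Drop emoji markers of the form <emoji>name</emoji>
    text = re.sub(r"<emoji>.*?</emoji>", "", text)
    # 2. Remove punctuation (assumption: anything that is neither a
    #    word character nor whitespace; note that "_" is kept by \w)
    text = re.sub(r"[^\w\s]", "", text)
    # 3. Lowercase
    text = text.lower()
    # 4. Replace line breaks with spaces
    text = re.sub(r"[\r\n]+", " ", text)
    # 5. Collapse runs of spaces and trim the ends
    return re.sub(r" +", " ", text).strip()


print(preprocess("Привет,<emoji>smile</emoji> МИР!\nКак  дела?"))
# >>> привет мир как дела
```

Since the model was trained on text in this form, applying a comparable transformation to new inputs before tokenization is likely to give more consistent predictions.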

	Because the dataset is heavily imbalanced, the normal texts in the training split were randomly downsampled to 22% of their original number.
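
A downsampling step of this kind might look like the sketch below. The function name `downsample_normal`, the `label` field, and the seed are illustrative assumptions; the card does not say how the original sampling was implemented.

```python
import random


def downsample_normal(examples, keep_fraction=0.22, seed=42):
    """Keep all category-1 examples and a random share of category-0 ones.

    `examples` is assumed to be a list of dicts with a "label" key
    (0 = normal text, 1 = presuicidal signal).
    """
    normal = [ex for ex in examples if ex["label"] == 0]
    signal = [ex for ex in examples if ex["label"] == 1]

    rng = random.Random(seed)  # fixed seed for reproducibility
    n_keep = round(len(normal) * keep_fraction)
    kept_normal = rng.sample(normal, n_keep)  # sampling without replacement

    result = signal + kept_normal
    rng.shuffle(result)  # avoid having all positives at the start
    return result


# Toy training split: 100 normal texts and 10 texts with signals
train = [{"text": f"n{i}", "label": 0} for i in range(100)]
train += [{"text": f"s{i}", "label": 1} for i in range(10)]

balanced = downsample_normal(train)
print(sum(ex["label"] == 0 for ex in balanced), sum(ex["label"] == 1 for ex in balanced))
# >>> 22 10
```

Only the training split is reduced here; validation and test data keep their original class balance so that the reported metrics reflect the real distribution.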

	## Training

	Training was done with the `Trainer` class using the following parameters:
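	```python
	from transformers import TrainingArguments

	training_args = TrainingArguments(
	    output_dir="output",  # placeholder: the output directory is not given in the original
	    evaluation_strategy="epoch",  # named eval_strategy in recent transformers releases
	    per_device_train_batch_size=16,
	    per_device_eval_batch_size=32,
	    learning_rate=1e-5,
	    num_train_epochs=5,
	    weight_decay=1e-3,
	    load_best_model_at_end=True,  # requires matching evaluation and save strategies
	    save_strategy="epoch",
	)
	```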

	# Metrics

	| F1-micro | F1-macro | F1-weighted |
	|----------|----------|-------------|
	| 0.811926 | 0.726722 | 0.831000    |
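
The card does not state how these scores were computed. As an illustration of how the three averages differ, here is a dependency-free sketch based on the standard definitions; the function name `f1_averages` and the toy labels are hypothetical and unrelated to the reported numbers.

```python
def f1_averages(y_true, y_pred):
    """Return micro, macro and weighted F1 for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class, support = {}, {}
    tp_sum = fp_sum = fn_sum = 0

    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        per_class[c] = 2 * tp / denom if denom else 0.0
        support[c] = tp + fn  # number of true examples of class c
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn

    # Micro: pool the counts over all classes, then compute one F1
    micro = 2 * tp_sum / (2 * tp_sum + fp_sum + fn_sum)
    # Macro: unweighted mean of per-class F1 (each class counts equally)
    macro = sum(per_class.values()) / len(labels)
    # Weighted: per-class F1 averaged by class frequency in y_true
    weighted = sum(per_class[c] * support[c] for c in labels) / len(y_true)
    return {"micro": micro, "macro": macro, "weighted": weighted}


scores = f1_averages(y_true=[1, 1, 0, 0, 0, 0], y_pred=[1, 0, 0, 0, 0, 1])
print({name: round(value, 4) for name, value in scores.items()})
# >>> {'micro': 0.6667, 'macro': 0.625, 'weighted': 0.6667}
```

With single-label predictions, micro F1 equals accuracy. Macro F1 gives the minority class the same weight as the majority class, which is why it tends to be the lowest of the three on imbalanced data.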

	# Citation

	```bibtex
	@article{Buyanov2022TheDF,
	  title={The dataset for presuicidal signals detection in text and its analysis},
	  author={Igor Buyanov and Ilya Sochenkov},
	  journal={Computational Linguistics and Intellectual Technologies},
	  year={2022},
	  month={June},
	  number={21},
	  pages={81--92},
	  url={https://api.semanticscholar.org/CorpusID:253195162},
	}
	```