| --- |
| license: mit |
| datasets: |
| - astromis/presuicidal_signals |
| language: |
| - ru |
| metrics: |
| - f1 |
| library_name: transformers |
| pipeline_tag: text-classification |
| tags: |
| - russian |
| - suicide |
| --- |
| |
| # Presuicidal RuBERT base |
|
|
| A [ruBert](https://huggingface.co/ai-forever/ruBert-base) model fine-tuned on the `astromis/presuicidal_signals` dataset. It is intended to help psychologists find texts that contain useful information about a person's suicidal behavior. |
|
|
| The model distinguishes two categories: |
| * category 1 - texts with useful information about a person's suicidal behavior, such as attempted or actual rape, problems with parents, a stay in a psychiatric hospital, self-harm, etc. This category also includes messages that express a subjective negative attitude towards oneself or others, including a desire to die, a feeling of pressure from the past, self-hatred, aggressiveness, and rage directed at oneself or others. |
| * category 0 - normal texts that do not contain the information described above. |
|
|
| # How to use |
|
|
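| ```python |
| import torch |
| from transformers import AutoTokenizer, BertForSequenceClassification |
| |
| tokenizer = AutoTokenizer.from_pretrained("astromis/presuisidal_rubert") |
| model = BertForSequenceClassification.from_pretrained("astromis/presuisidal_rubert") |
| model.eval()  # inference mode: disables dropout |
| |
| # English glosses of the two examples: |
| #   "I feel so bad, I want to die" |
| #   "yesterday I was at a meetup with friends, it was really cool" |
| text = ["мне так плохо я хочу умереть", "вчера была на сходке с друзьями было оч клево"] |
| |
| tokenized_text = tokenizer(text, padding="max_length", truncation=True, max_length=512, return_tensors="pt") |
| |
| with torch.no_grad(): |
|     prediction = model(**tokenized_text).logits |
| |
| # argmax over the two logits gives the predicted category for each text |
| print(prediction.argmax(dim=1).tolist()) |
| # >>> [1, 0] |
| ``` |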
|
|
| # Training procedure |
|
|
| ## Data preprocessing |
|
|
| Before training, the text was preprocessed as follows: |
| * emojis were removed (in the dataset they are marked as `<emoji>emoji_name</emoji>`); |
| * punctuation was removed; |
| * the text was lowercased; |
| * line breaks were replaced with spaces; |
| * runs of several spaces were collapsed into one. |
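
The steps above can be sketched as a single function. This is a minimal illustration and not the author's original script: the function name `preprocess`, the exact definition of "punctuation" (any character that is neither a word character nor whitespace), replacing each emoji tag with a space, and the final `strip()` are all assumptions.

```python
import re

# Emoji are marked in the dataset as <emoji>name</emoji>.
EMOJI_TAG = re.compile(r"<emoji>.*?</emoji>")
# Assumption: "punctuation" = anything that is not a word character or whitespace.
# Note that the underscore counts as a word character and is therefore kept.
PUNCTUATION = re.compile(r"[^\w\s]")
WHITESPACE = re.compile(r"\s+")


def preprocess(text: str) -> str:
    """Hypothetical helper that mirrors the preprocessing steps listed above."""
    # Remove emoji tags first, because the tags themselves contain punctuation.
    # Assumption: each tag becomes a space so neighbouring words do not fuse.
    text = EMOJI_TAG.sub(" ", text)
    text = PUNCTUATION.sub("", text)
    text = text.lower()
    # Line breaks become spaces and runs of whitespace collapse in one pass.
    text = WHITESPACE.sub(" ", text)
    # Assumption: leading and trailing spaces are trimmed as well.
    return text.strip()
```

Texts passed to the model at inference time would presumably need the same normalization, since the model only saw text in this form during training.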
|
|
| Because the dataset is heavily imbalanced, the normal texts in the training split were randomly downsampled to 22% of their original volume. |
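
A minimal sketch of such downsampling is shown below. The function name `downsample_normal`, the fixed `seed`, and the rounding of the target count are assumptions made for illustration; the card does not say how the random subset was drawn.

```python
import random


def downsample_normal(texts, labels, keep_fraction=0.22, seed=0):
    """Keep every category 1 text and a random fraction of category 0 texts.

    Hypothetical helper: the seed and rounding rule are assumptions.
    """
    rng = random.Random(seed)
    normal_idx = [i for i, label in enumerate(labels) if label == 0]
    keep_n = round(len(normal_idx) * keep_fraction)
    kept = set(rng.sample(normal_idx, keep_n))

    new_texts, new_labels = [], []
    for i, (text, label) in enumerate(zip(texts, labels)):
        # Category 1 is always kept; category 0 only if it was sampled.
        if label == 1 or i in kept:
            new_texts.append(text)
            new_labels.append(label)
    return new_texts, new_labels
```

Only the training split is downsampled in this way, so evaluation data keeps its original class distribution.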
|
|
| ## Training |
|
|
| Training was done with the `Trainer` class using the following arguments: |
| ```python |
| TrainingArguments(evaluation_strategy="epoch", |
|                   per_device_train_batch_size=16, |
|                   per_device_eval_batch_size=32, |
|                   learning_rate=1e-5, |
|                   num_train_epochs=5, |
|                   weight_decay=1e-3, |
|                   load_best_model_at_end=True, |
|                   save_strategy="epoch") |
| ``` |
|
|
| # Metrics |
|
|
| | F1-micro | F1-macro | F1-weighted | |
| |----------|----------|-------------| |
| | 0.811926 | 0.726722 | 0.831000 | |
|
|
| # Citation |
|
|
| ```bibtex |
| @article{Buyanov2022TheDF, |
| title={The dataset for presuicidal signals detection in text and its analysis}, |
| author={Igor Buyanov and Ilya Sochenkov}, |
| journal={Computational Linguistics and Intellectual Technologies}, |
| year={2022}, |
| month={June}, |
| number={21}, |
| pages={81--92}, |
| url={https://api.semanticscholar.org/CorpusID:253195162}, |
| } |
| ``` |