| --- |
| license: mit |
| datasets: |
| - Silly-Machine/TuPy-Dataset |
| language: |
| - pt |
|
|
| pipeline_tag: text-classification |
| base_model: neuralmind/bert-base-portuguese-cased |
| widget: |
| - text: 'Bom dia, flor do dia!!' |
|
|
| model-index: |
| - name: Tupi-BERT-Base |
| results: |
| - task: |
| type: text-classification |
| dataset: |
| name: Silly-Machine/TuPy-Dataset |
| type: Silly-Machine/TuPy-Dataset |
| metrics: |
| - name: AI2 Reasoning Challenge (25-Shot) |
| type: AI2 Reasoning Challenge (25-Shot) |
| value: 64.59 |
| source: |
| name: Open LLM Leaderboard |
| url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard |
| --- |
| |
| ## Introduction |
|
|
|
|
| Tupi-BERT-Base is a fine-tuned BERT model designed specifically for binary classification of hate speech in Portuguese. It is derived from [BERTimbau base](https://huggingface.co/neuralmind/bert-base-portuguese-cased) and refined for detecting hate speech. |
| For more details about the base model, please refer to the [BERTimbau repository](https://github.com/neuralmind-ai/portuguese-bert/). |
|
|
| A language model's performance can drop noticeably when the domain of the test data differs from that of the training data. To obtain a Portuguese model specialized for hate speech classification, the original BERTimbau model was fine-tuned on the [TuPi Hate Speech DataSet](https://huggingface.co/datasets/FpOliveira/TuPi-Portuguese-Hate-Speech-Dataset-Binary), which was sourced from diverse social networks. |
|
|
| ## Available models |
|
|
| | Model | Arch. | #Layers | #Params | |
| | ---------------------------------------- | ---------- | ------- | ------- | |
| `FpOliveira/tupi-bert-base-portuguese-cased` | BERT-Base | 12 | 109M | |
| | `FpOliveira/tupi-bert-large-portuguese-cased` | BERT-Large | 24 | 334M | |
| | `FpOliveira/tupi-bert-base-portuguese-cased-multiclass-multilabel` | BERT-Base | 12 | 109M | |
| | `FpOliveira/tupi-bert-large-portuguese-cased-multiclass-multilabel` | BERT-Large | 24 | 334M | |
|
|
| ## Example usage |
|
|
| ```python |
| from transformers import AutoModelForSequenceClassification, AutoTokenizer, AutoConfig |
| import torch |
| import numpy as np |
| from scipy.special import softmax |
| |
| def classify_hate_speech(model_name, text): |
|     # Load the fine-tuned model, its tokenizer, and its config (for label names) |
|     model = AutoModelForSequenceClassification.from_pretrained(model_name) |
|     tokenizer = AutoTokenizer.from_pretrained(model_name) |
|     config = AutoConfig.from_pretrained(model_name) |
| |
|     # Tokenize input text and prepare model input |
|     model_input = tokenizer(text, padding=True, return_tensors="pt") |
| |
|     # Get model output scores without tracking gradients |
|     with torch.no_grad(): |
|         output = model(**model_input) |
| |
|     # Convert logits to probabilities and sort labels from highest to lowest score |
|     scores = softmax(output.logits.numpy(), axis=1) |
|     ranking = np.argsort(scores[0])[::-1] |
| |
|     # Print the results |
|     for i, rank in enumerate(ranking): |
|         label = config.id2label[int(rank)] |
|         score = scores[0, rank] |
|         print(f"{i + 1}) Label: {label} Score: {score:.4f}") |
| |
| # Example usage |
| model_name = "Silly-Machine/TuPy-Bert-Large-Multilabel" |
| text = "Bom dia, flor do dia!!" |
| classify_hate_speech(model_name, text) |
| ``` |
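
The post-processing step in the example above (softmax over the logits, then sorting labels by score) can be tried without downloading a model. The following is a minimal sketch under stated assumptions: the helper `rank_labels`, the logits, and the label names are all hypothetical and chosen only for illustration, and the softmax is written in plain Python rather than with `scipy.special.softmax`.

```python
import math

def rank_labels(logits, id2label):
    """Return (label, probability) pairs sorted by descending probability."""
    # Subtract the largest logit before exponentiating for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Order label indices from highest to lowest probability.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in order]

# Hypothetical values: real logits come from the model output and real
# label names from the model's config (config.id2label).
example_logits = [2.0, -1.0]
example_id2label = {0: "not_hate", 1: "hate"}

ranked = rank_labels(example_logits, example_id2label)
for position, (label, score) in enumerate(ranked, start=1):
    print(f"{position}) Label: {label} Score: {score:.4f}")
```

With a real model, `example_logits` would be replaced by `output.logits[0].tolist()` and `example_id2label` by `config.id2label`.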