Fake News Classification System (PT-BR)

This repository contains a machine learning model for detecting fake news in Portuguese, using the datasets Fake.br-Corpus, FakeTrue.br and FakeRecogna.

The solution evolves from a classic LSTM approach to Transformer-based models such as BERT, which capture the semantic context of a news item more effectively.

Objective

To develop a system capable of automatically classifying news as true or false, assisting in the fight against misinformation in the Portuguese language.

Technologies Used

Python
Pandas
PyTorch
Scikit-learn
Jupyter Notebook
Hugging Face Transformers (BERT)

Dataset (Data Expansion)

The current version uses a consolidated dataset built from three sources, roughly tripling the original data volume to improve generalization:

Source	Description
Fake.br-Corpus	Reference dataset with real and fake news.
FakeTrue.br	Complementary Portuguese news database.
FakeRecogna	Expanded dataset for greater thematic diversity.

Total Volume: ~22,684 news items (previously ~7,000).
Split: 90% training / 10% testing, with stratified sampling.
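
A stratified 90/10 split keeps the ratio of true to fake items the same in both partitions. The sketch below shows one way to do this with Scikit-learn's train_test_split; the toy data, the column names text and label, the 0/1 label encoding, and the random seed are assumptions made for illustration and are not taken from the project code.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the consolidated corpus; "text" and "label" are assumed column names.
df = pd.DataFrame({
    "text": [f"news item {i}" for i in range(100)],
    "label": [0] * 60 + [1] * 40,  # assumed encoding: 0 = true, 1 = fake
})

# 90% train / 10% test, preserving the label ratio in both partitions.
train_df, test_df = train_test_split(
    df,
    test_size=0.10,
    stratify=df["label"],
    random_state=42,  # arbitrary seed, fixed only for reproducibility
)

print(len(train_df), len(test_df))
print(test_df["label"].value_counts().to_dict())
```

Without stratify, a small test partition can end up with a label ratio that differs noticeably from the full corpus, which makes accuracy figures harder to compare between runs.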

Model Architecture (BERT)

The model uses BERTimbau (BERT base for Portuguese) as its backbone, with a custom classification head:

Encoder: neuralmind/bert-base-portuguese-cased.
Classification Head:
- Linear (Hidden Size → 32) + GELU Activation.
- Dropout (0.2) for regularization.
- Linear (32 → 16) + GELU Activation.
- Linear (16 → 1) for binary output.
Optimization: Adam with a learning rate of $5 \times 10^{-5}$.
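
The classification head listed above can be sketched as a small PyTorch module. This is a minimal illustration, not the project's actual code: the class name ClassificationHead is hypothetical, the hidden size of 768 is the standard value for BERT base models, and feeding the head a pooled [CLS]-style vector is an assumption. A random tensor stands in for the encoder output so the snippet runs without downloading BERTimbau.

```python
import torch
from torch import nn


class ClassificationHead(nn.Module):
    """Maps a pooled encoder vector to a single logit (binary output)."""

    def __init__(self, hidden_size: int = 768, dropout: float = 0.2):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(hidden_size, 32),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(32, 16),
            nn.GELU(),
            nn.Linear(16, 1),
        )

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        return self.layers(pooled)


head = ClassificationHead()
pooled = torch.randn(4, 768)  # stand-in for the encoder's pooled representation
logits = head(pooled)
print(logits.shape)
```

The head emits one raw logit per news item; applying a sigmoid turns it into a probability.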

Data Pipeline

Processing now uses a dedicated extractor for each dataset, all derived from a common BaseExtractor:

Extraction: Parsing .txt (Fake.br), .csv (FakeTrue), and .xlsx (FakeRecogna) files.
Cleaning: Removal of null values and label normalization.
Tokenization: WordPiece (BERT) with max_length=256.
DataLoader: Configured with pin_memory and prefetch_factor to speed up data transfer to the GPU.
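
The extraction and cleaning steps might look roughly like the sketch below. Only the name BaseExtractor comes from this README; the method names, the CsvExtractor subclass, the text and label column names, and the label mapping are hypothetical and chosen for illustration. An in-memory CSV replaces the real files.

```python
import io
from abc import ABC, abstractmethod

import pandas as pd


class BaseExtractor(ABC):
    """Common interface: every source yields a frame with 'text' and 'label'."""

    # Hypothetical mapping from raw label spellings to 0 (true) / 1 (fake).
    LABEL_MAP = {"true": 0, "real": 0, "0": 0, "fake": 1, "false": 1, "1": 1}

    @abstractmethod
    def read(self) -> pd.DataFrame:
        """Return the raw rows of one source."""

    def extract(self) -> pd.DataFrame:
        # Cleaning: drop rows with a missing text or label.
        df = self.read()[["text", "label"]].dropna()
        # Label normalization: trim, lowercase, then map to integers.
        labels = df["label"].astype(str).str.strip().str.lower().map(self.LABEL_MAP)
        # Labels missing from the map become NaN and are dropped as well.
        df = df.assign(label=labels).dropna(subset=["label"])
        return df.astype({"label": int}).reset_index(drop=True)


class CsvExtractor(BaseExtractor):
    def __init__(self, source):
        self.source = source  # path or file-like object

    def read(self) -> pd.DataFrame:
        return pd.read_csv(self.source)


raw = io.StringIO("text,label\nNotícia A,FAKE\nNotícia B,true\n,fake\nNotícia C,\n")
clean = CsvExtractor(raw).extract()
print(clean)
```

Keeping the cleaning logic in the base class means a new source only has to implement read, for example with pd.read_excel for .xlsx files.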

Training

Hyperparameters (LSTM)

Epochs: 5
Batch size: 128
Optimizer: Adam
Loss: Binary cross-entropy

BERT (Fine-tuning)

Learning rate: ~2e-5
Batch size: 32
GPU usage recommended
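
To show how these settings fit together, here is a hedged sketch of a training pass using Adam, a learning rate of 2e-5, a batch size of 32, and the pin_memory and prefetch_factor loader options named in the Data Pipeline section. A single linear layer and random tensors stand in for the BERT model and the tokenized news, and BCEWithLogitsLoss is an assumption that fits a single-logit output; the actual training script may differ.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in data: 64 pre-encoded examples with 768 features and 0/1 labels.
features = torch.randn(64, 768)
labels = torch.randint(0, 2, (64,)).float()
dataset = TensorDataset(features, labels)

num_workers = 0  # raise on a real machine; prefetch_factor requires workers > 0
loader_kwargs = {"batch_size": 32, "shuffle": True, "pin_memory": device.type == "cuda"}
if num_workers > 0:
    loader_kwargs.update(num_workers=num_workers, prefetch_factor=2)
loader = DataLoader(dataset, **loader_kwargs)

model = nn.Linear(768, 1).to(device)  # placeholder for encoder + classification head
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
criterion = nn.BCEWithLogitsLoss()  # assumed loss for a single-logit output

model.train()
losses = []
for x, y in loader:
    # non_blocking copies can overlap with compute when memory is pinned.
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    optimizer.zero_grad()
    logits = model(x).squeeze(-1)
    loss = criterion(logits, y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(len(losses))
```

Pinned memory only helps when batches are copied to a CUDA device, which is why the sketch enables it conditionally.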

Results

The LSTM model achieved approximately 97% accuracy on the test set.

BERT-based models are expected to improve on this result because they capture linguistic context more effectively.

How to Use the Model

The model can be loaded directly via Transformers:

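import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder: set this to the model's Hugging Face repository id.
repo_id = "<user>/<model-name>"

# Use the GPU when one is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForSequenceClassification.from_pretrained(repo_id)

model.to(device)
model.eval()  # disable dropout for inference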
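
Once the model is loaded, a prediction could be obtained along the lines of the sketch below. The helper predict_proba is hypothetical. It reuses the tokenizer, model, and device objects from the snippet above and the max_length=256 setting from the Data Pipeline section. Because this README does not say whether the exported model emits one logit or two, the helper handles both cases; which class index means fake should be checked against the model's configuration.

```python
import torch


@torch.no_grad()
def predict_proba(texts, tokenizer, model, device, max_length=256):
    """Return one probability per text for the model's positive class."""
    batch = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
    batch = {name: tensor.to(device) for name, tensor in batch.items()}
    logits = model(**batch).logits
    if logits.shape[-1] == 1:
        # Single-logit head: sigmoid gives the positive-class probability.
        return torch.sigmoid(logits.squeeze(-1)).tolist()
    # Two-logit head: softmax, then take the column at index 1.
    return torch.softmax(logits, dim=-1)[:, 1].tolist()
```

A call such as predict_proba(["Texto da notícia..."], tokenizer, model, device) returns a list with one value between 0 and 1 for each input text.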

Project Insights

LSTM models are efficient but semantically limited.
BERT significantly improves context understanding.
Tokenization is a critical factor for performance.

Acknowledgments

I dedicate this project to my high school teachers, who contributed to the development of my critical thinking.

Special mention to Professor Winola Cunha, who reinforced the importance of morphosyntax — and was absolutely right.

Created by Eric dos Santos 🚀
