Fake News Classification System (PT-BR)
This repository contains a machine learning model for detecting fake news in Portuguese, using the datasets Fake.br-Corpus, FakeTrue.br and FakeRecogna.
The solution evolves from a classic LSTM approach to Transformer-based models such as BERT, which capture the semantic context of news text more effectively.
Objective
To develop a system capable of automatically classifying news as true or false, assisting in the fight against misinformation in the Portuguese language.
Technologies Used
- Python
- Pandas
- PyTorch
- Scikit-learn
- Jupyter Notebook
- Hugging Face Transformers (BERT)
Dataset (Data Expansion)
The current version of the project uses a consolidated dataset built from three major sources, roughly tripling the original data volume to improve generalization:
| Source | Description |
|---|---|
| Fake.br-Corpus | Reference dataset with real and fake news. |
| FakeTrue.br | Complementary Portuguese news database. |
| FakeRecogna | Expanded dataset for greater thematic diversity. |
- Total Volume: ~22,684 news items (previously ~7,000).
- Distribution: 90% training / 10% testing with stratified sampling.
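The 90/10 stratified split can be sketched with scikit-learn's `train_test_split`. The column names `text` and `label`, the toy data, the 0/1 label encoding, and the `random_state` value are assumptions made for illustration, not details taken from the repository.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the consolidated corpus (hypothetical column names).
df = pd.DataFrame({
    "text": [f"news item {i}" for i in range(100)],
    "label": [0] * 50 + [1] * 50,  # 0 = true, 1 = fake (assumed encoding)
})

# stratify keeps the true/fake ratio the same in both partitions.
train_df, test_df = train_test_split(
    df,
    test_size=0.10,
    stratify=df["label"],
    random_state=42,  # arbitrary seed, chosen only for reproducibility
)

print(len(train_df), len(test_df))
print(test_df["label"].value_counts().to_dict())
```

Stratifying matters most when the classes are imbalanced, since a plain random 10% sample could otherwise under-represent one class in the test set.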
Model Architecture (BERT)
The model uses BERTimbau (BERT base for Portuguese) as its backbone, with a custom classification head:
- Encoder: `neuralmind/bert-base-portuguese-cased`.
- Classification Head:
  - Linear (Hidden Size → 32) + GELU Activation.
  - Dropout (0.2) for regularization.
  - Linear (32 → 16) + GELU Activation.
  - Linear (16 → 1) for binary output.
- Optimization: Adam with a learning rate of $5 \times 10^{-5}$.
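A minimal PyTorch sketch of the classification head described above. The class name `ClassificationHead` is hypothetical, and the head is fed a random tensor instead of BERTimbau's output so the example runs without downloading weights. The hidden size of 768 is the standard BERT base width; how the repository pools the encoder output (for example, via the `[CLS]` token) is not stated here and is left out.

```python
import torch
from torch import nn


class ClassificationHead(nn.Module):
    """Hidden Size -> 32 -> 16 -> 1, following the layer list above."""

    def __init__(self, hidden_size: int = 768, dropout: float = 0.2):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(hidden_size, 32),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(32, 16),
            nn.GELU(),
            nn.Linear(16, 1),  # single logit for the binary decision
        )

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        # pooled: (batch, hidden_size) sentence representation from the encoder
        return self.layers(pooled)


head = ClassificationHead()
optimizer = torch.optim.Adam(head.parameters(), lr=5e-5)

# Random tensor standing in for the encoder output of a batch of 4 texts.
dummy_pooled = torch.randn(4, 768)
logits = head(dummy_pooled)
print(logits.shape)  # torch.Size([4, 1])
```

The head emits a raw logit per text; applying a sigmoid turns it into a probability.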
Data Pipeline
Processing now uses a dedicated extractor for each dataset (BaseExtractor):
- Extraction: Parsing `.txt` (Fake.br), `.csv` (FakeTrue.br), and `.xlsx` (FakeRecogna) files.
- Cleaning: Removal of null values and label normalization.
- Tokenization: WordPiece (BERT) with `max_length=256`.
- Dataloader: Implementation with `pin_memory` and `prefetch_factor` for GPU optimization.
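The extraction and cleaning steps might be organized as below. Only the name `BaseExtractor` comes from this project; its interface, the `CsvExtractor` subclass, the `text` and `label` column names, and the `LABEL_MAP` spellings are all assumptions made for this sketch.

```python
from abc import ABC, abstractmethod
from pathlib import Path

import pandas as pd


class BaseExtractor(ABC):
    """Common interface: every source yields a DataFrame with text and label."""

    def __init__(self, path: Path):
        self.path = Path(path)

    @abstractmethod
    def extract(self) -> pd.DataFrame:
        ...


class CsvExtractor(BaseExtractor):
    """Hypothetical extractor for a .csv source."""

    def extract(self) -> pd.DataFrame:
        return pd.read_csv(self.path)


# Assumed label spellings; the real corpora may use different values.
LABEL_MAP = {"true": 0, "real": 0, "fake": 1, "false": 1}


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing values and map labels to 0 (true) / 1 (fake)."""
    df = df.dropna(subset=["text", "label"]).copy()
    df["label"] = df["label"].astype(str).str.strip().str.lower().map(LABEL_MAP)
    # Rows whose label is not in LABEL_MAP become NaN and are dropped too.
    df = df.dropna(subset=["label"])
    df["label"] = df["label"].astype(int)
    return df.reset_index(drop=True)


raw = pd.DataFrame({
    "text": ["Notícia A", None, "Notícia C", "Notícia D"],
    "label": ["FAKE", "true", " Real ", None],
})
cleaned = clean(raw)
print(cleaned)
```

Giving every source the same output schema lets the three datasets be concatenated with a single `pd.concat` call before tokenization.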
Training
Hyperparameters (LSTM)
- Epochs: 5
- Batch size: 128
- Optimizer: Adam
- Loss: Binary cross-entropy
BERT (Fine-tuning)
- Learning rate: ~2e-5
- Batch size: 32
- GPU usage recommended
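One epoch of fine-tuning with these settings could look roughly like the sketch below. To keep it runnable without downloading BERTimbau, a single linear layer stands in for the encoder plus head and random tensors stand in for tokenized news. `BCEWithLogitsLoss` is an assumed loss for a single-logit output, and `num_workers` and the `prefetch_factor` value are illustrative choices.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in data: 64 random 768-dim feature vectors with binary labels.
features = torch.randn(64, 768)
labels = torch.randint(0, 2, (64, 1)).float()
dataset = TensorDataset(features, labels)

num_workers = 0  # raise on a multi-core machine
loader_kwargs = {
    "batch_size": 32,
    "shuffle": True,
    "pin_memory": torch.cuda.is_available(),
    "num_workers": num_workers,
}
if num_workers > 0:
    # prefetch_factor is only accepted when worker processes are used.
    loader_kwargs["prefetch_factor"] = 2
loader = DataLoader(dataset, **loader_kwargs)

model = nn.Linear(768, 1).to(device)  # placeholder for encoder + head
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
criterion = nn.BCEWithLogitsLoss()  # assumed loss for a single-logit output

model.train()
losses = []
for batch_features, batch_labels in loader:
    batch_features = batch_features.to(device, non_blocking=True)
    batch_labels = batch_labels.to(device, non_blocking=True)
    optimizer.zero_grad()
    loss = criterion(model(batch_features), batch_labels)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"{len(losses)} batches, mean loss {sum(losses) / len(losses):.4f}")
```

Pinned memory combined with `non_blocking=True` lets host-to-GPU copies overlap with computation, which is why a GPU is recommended for this stage.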
Results
The LSTM model achieved approximately 97% accuracy on the test set.
BERT-based models show significant potential for improvement by capturing linguistic context more effectively.
How to Use the Model
The model can be loaded directly via Transformers:
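```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder: replace with this model's Hugging Face Hub repository ID.
repo_id = "<hub-repo-id>"

# Use the GPU when one is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForSequenceClassification.from_pretrained(repo_id)
model.to(device)
model.eval()  # disable dropout for inference
```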
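Once the model and tokenizer are loaded, scoring a batch of texts might look like the sketch below. The helper `predict_proba` is hypothetical, and which output index corresponds to "fake" is an assumption; checking `model.config.id2label` is advisable before interpreting the scores.

```python
import torch


@torch.no_grad()
def predict_proba(texts, model, tokenizer, device, max_length=256):
    """Return one probability per text for the positive class."""
    encoded = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
    inputs = {name: tensor.to(device) for name, tensor in encoded.items()}
    logits = model(**inputs).logits
    if logits.shape[-1] == 1:
        # Single-logit head: sigmoid gives the positive-class probability.
        return torch.sigmoid(logits).squeeze(-1)
    # Multi-logit head: softmax, then take index 1 (assumed positive class).
    return torch.softmax(logits, dim=-1)[:, 1]
```

A call such as `predict_proba(["Texto da notícia"], model, tokenizer, device)` returns a tensor with one score per input text; a threshold of 0.5 is a common starting point, though the best value depends on the validation data.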
Project Insights
- LSTM models are efficient but semantically limited.
- BERT significantly improves context understanding.
- Tokenization is a critical factor for performance.
Acknowledgments
I dedicate this project to my high school teachers, who contributed to the development of my critical thinking.
Special mention to Professor Winola Cunha, who reinforced the importance of morphosyntax, and was absolutely right.
Created by Eric dos Santos