fzn0x's picture
Upload folder using huggingface_hub
2086153 verified
# Fine-tuned BERT-base-uncased pre-trained model to classify spam SMS.
My second project in Natural Language Processing (NLP), where I fine-tuned a bert-base-uncased model to classify spam SMS. This is huge improvements from https://github.com/fzn0x/bert-indonesian-english-hate-comments.
## βœ… Install requirements
Install required dependencies
```sh
pip install --upgrade pip
pip install -r requirements.txt
```
## βœ… Add BERT virtual env
write the command below
```sh
# βœ… Create and activate a virtual environment
python -m venv bert-env
source bert-env/bin/activate # On Windows use: bert-env\Scripts\activate
```
## βœ… Install CUDA
Check if your GPU supports CUDA:
```sh
nvidia-smi
```
Then:
```sh
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False
```
## πŸ”§ How to use
- Check your device and CUDA availability:
```sh
python check_device.py
```
> :warning: Using CPU is not advisable, prefer check your CUDA availability.
- Train the model:
```sh
python scripts/train.py
```
> :warning: Remove unneeded checkpoint in models/pretrained to save your storage after training
- Run prediction:
```sh
python scripts/predict.py
```
βœ… Dataset Location: [`data/spam.csv`](./data/spam.csv), modify the dataset to enhance the model based on your needs.
## πŸ“š Citations
If you use this repository or its ideas, please cite the following:
See [`citations.bib`](./citations.bib) for full BibTeX entries.
- Wolf et al., *Transformers: State-of-the-Art Natural Language Processing*, EMNLP 2020. [ACL Anthology](https://www.aclweb.org/anthology/2020.emnlp-demos.6)
- Pedregosa et al., *Scikit-learn: Machine Learning in Python*, JMLR 2011.
- Almeida & GΓ³mez Hidalgo, *SMS Spam Collection v.1*, UCI Machine Learning Repository (2011). [Kaggle Link](https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset)
## 🧠 Credits and Libraries Used
- [Hugging Face Transformers](https://github.com/huggingface/transformers) – model, tokenizer, and training utilities
- [scikit-learn](https://scikit-learn.org/stable/) – metrics and preprocessing
- Logging silencing inspired by Hugging Face GitHub discussions
- Dataset from [UCI SMS Spam Collection](https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset)
- Inspiration from [Kaggle Notebook by Suyash Khare](https://www.kaggle.com/code/suyashkhare/naive-bayes)
## License and Usage
License under [MIT license](./LICENSE).
---
Leave a ⭐ if you think this project is helpful, contributions are welcome.
---