# Fine-tuned BERT-base-uncased pre-trained model to classify spam SMS.
This is my second project in Natural Language Processing (NLP), where I fine-tuned a bert-base-uncased model to classify spam SMS. It is a major improvement over my previous project, https://github.com/fzn0x/bert-indonesian-english-hate-comments.
## ✅ Install requirements
Install the required dependencies:
```sh
pip install --upgrade pip
pip install -r requirements.txt
```
## ✅ Add BERT virtual env
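Run the commands below to create and activate a virtual environment. Ideally, do this before installing the requirements so that the packages are installed inside the environment.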
```sh
# ✅ Create and activate a virtual environment
python -m venv bert-env
source bert-env/bin/activate # On Windows use: bert-env\Scripts\activate
```
## ✅ Install CUDA
Check if your GPU supports CUDA:
```sh
nvidia-smi
```
Then install the CUDA 12.1 build of PyTorch and set the allocator configuration:
```sh
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
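# Export the variable so that child processes (such as the training script) can see it
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False # On Windows (cmd) use: set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False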
```
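After the installation, it can help to confirm that the installed PyTorch build actually sees the GPU. The snippet below is a minimal sketch of such a check, not the repository's `check_device.py` (whose contents are not shown here); the function name `describe_device` is hypothetical.

```python
import importlib.util


def describe_device() -> str:
    """Return a short description of the device PyTorch would use.

    Hypothetical helper for illustration only.
    """
    # Avoid a hard failure when PyTorch is not installed in this environment.
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"

    import torch

    if torch.cuda.is_available():
        # Index 0 is the first visible CUDA device.
        return f"cuda: {torch.cuda.get_device_name(0)}"
    return "cpu"


if __name__ == "__main__":
    print(describe_device())
```

If this prints `cpu` on a machine with an NVIDIA GPU, the installed wheel is probably a CPU-only build or the driver does not match the CUDA version of the wheel.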
## 🔧 How to use
- Check your device and CUDA availability:
```sh
python check_device.py
```
> :warning: Training on CPU is not advisable. Check that CUDA is available first.
- Train the model:
```sh
python scripts/train.py
```
> :warning: After training, remove unneeded checkpoints in `models/pretrained` to save storage.
- Run prediction:
```sh
python scripts/predict.py
```
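The checkpoint cleanup mentioned in the training step can be scripted. The sketch below assumes that training writes checkpoints as `checkpoint-<step>` directories, which is the default naming of the Hugging Face `Trainer` but is not confirmed for this repository. The helper `prune_checkpoints` is hypothetical.

```python
import re
import shutil
from pathlib import Path
from typing import List

# Assumed naming scheme: checkpoint-<step>, e.g. checkpoint-500.
CHECKPOINT_PATTERN = re.compile(r"^checkpoint-(\d+)$")


def prune_checkpoints(root: Path, keep: int = 1) -> List[Path]:
    """Delete all but the `keep` highest-numbered checkpoint directories.

    Returns the list of directories that were removed.
    """
    if keep < 0:
        raise ValueError("keep must be non-negative")

    found = []
    for child in root.iterdir():
        match = CHECKPOINT_PATTERN.match(child.name)
        if child.is_dir() and match:
            found.append((int(match.group(1)), child))

    # Sort ascending by step so the newest checkpoints are at the end.
    found.sort(key=lambda item: item[0])
    stale = found[:-keep] if keep else found

    removed = [path for _, path in stale]
    for path in removed:
        shutil.rmtree(path)
    return removed
```

Calling `prune_checkpoints(Path("models/pretrained"), keep=1)` would keep only the latest checkpoint under that assumption. Directories that do not match the pattern are left untouched.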
✅ Dataset location: [`data/spam.csv`](./data/spam.csv). Modify the dataset to adapt the model to your needs.
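Before modifying the dataset, it is worth checking how balanced the labels are. The sketch below assumes the file follows the layout of the Kaggle copy of the SMS Spam Collection: a header row, the label (`ham` or `spam`) in the first column, and the message text in the second. The actual layout of `data/spam.csv` may differ. The helpers `read_labeled_sms` and `label_counts` are hypothetical.

```python
import csv
from collections import Counter
from typing import Iterable, List, Tuple


def read_labeled_sms(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse CSV lines into (label, text) pairs, skipping the header row."""
    reader = csv.reader(lines)
    next(reader, None)  # header row, assumed to be present
    # Only the first two columns are used; trailing columns are ignored.
    return [(row[0].strip().lower(), row[1]) for row in reader if len(row) >= 2]


def label_counts(rows: List[Tuple[str, str]]) -> Counter:
    """Count how many messages carry each label."""
    return Counter(label for label, _ in rows)


# Example usage (the encoding is an assumption; adjust it if the file differs):
# with open("data/spam.csv", newline="", encoding="latin-1") as handle:
#     print(label_counts(read_labeled_sms(handle)))
```

A strong imbalance between `ham` and `spam` is a hint to look at precision and recall per class rather than accuracy alone when evaluating the fine-tuned model.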
## 📚 Citations
If you use this repository or its ideas, please cite the works listed below. See [`citations.bib`](./citations.bib) for full BibTeX entries.
- Wolf et al., *Transformers: State-of-the-Art Natural Language Processing*, EMNLP 2020. [ACL Anthology](https://www.aclweb.org/anthology/2020.emnlp-demos.6)
- Pedregosa et al., *Scikit-learn: Machine Learning in Python*, JMLR 2011.
- Almeida & Gómez Hidalgo, *SMS Spam Collection v.1*, UCI Machine Learning Repository (2011). [Kaggle Link](https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset)
## 🧠 Credits and Libraries Used
- [Hugging Face Transformers](https://github.com/huggingface/transformers) – model, tokenizer, and training utilities
- [scikit-learn](https://scikit-learn.org/stable/) – metrics and preprocessing
- Logging silencing inspired by Hugging Face GitHub discussions
- Dataset from [UCI SMS Spam Collection](https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset)
- Inspiration from [Kaggle Notebook by Suyash Khare](https://www.kaggle.com/code/suyashkhare/naive-bayes)
## License and Usage
Licensed under the [MIT license](./LICENSE).
---
Leave a ⭐ if you find this project helpful. Contributions are welcome.
---