# Fine-tuned BERT-base-uncased pre-trained model to classify spam SMS.

This is my second Natural Language Processing (NLP) project, in which I fine-tuned a `bert-base-uncased` model to classify spam SMS messages. It is a huge improvement over https://github.com/fzn0x/bert-indonesian-english-hate-comments.

## ✅ Install requirements

Install the required dependencies:

```sh
pip install --upgrade pip
pip install -r requirements.txt
```

## ✅ Add BERT virtual env

Run the commands below. Doing this before installing the requirements keeps the packages inside the virtual environment.

```sh
# ✅ Create and activate a virtual environment
python -m venv bert-env
source bert-env/bin/activate    # On Windows use: bert-env\Scripts\activate
```

## ✅ Install CUDA

Check if your GPU supports CUDA:

```sh
nvidia-smi
```

Then install the PyTorch build for CUDA 12.1:

```sh
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Export the setting so Python processes started from this shell inherit it
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False
```
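To confirm from Python that the install can see the GPU, a minimal sketch along these lines may help. It is not the repository's `check_device.py`; the function name `describe_device` is hypothetical, and the sketch only assumes the standard `torch.cuda` calls.

```python
def describe_device():
    """Return a short description of the device PyTorch would use."""
    try:
        import torch  # imported lazily so the check works even without torch
    except ImportError:
        return "torch is not installed"
    if torch.cuda.is_available():
        # Index 0 is the first GPU visible to this process.
        name = torch.cuda.get_device_name(0)
        return f"cuda: {name} (CUDA {torch.version.cuda})"
    return "cpu"


print(describe_device())
```

If this prints `cpu` on a machine with an NVIDIA GPU, the installed wheel is probably a CPU-only build or the driver is too old for CUDA 12.1.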

## 🔧 How to use

- Check your device and CUDA availability:

```sh
python check_device.py
```

> :warning: Training on a CPU is not advisable. Confirm that CUDA is available first.

- Train the model:

```sh
python scripts/train.py
```

> :warning: After training, remove unneeded checkpoints in `models/pretrained` to save storage.

- Run prediction:

```sh
python scripts/predict.py
```

✅ Dataset location: [`data/spam.csv`](./data/spam.csv). Modify the dataset to adapt the model to your needs.
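Before editing the dataset, it can be useful to check how balanced the labels are. The sketch below assumes the file keeps the layout of the Kaggle distribution (a label column named `v1`, Latin-1 encoding); the repository's copy may differ, so adjust `label_column` and `encoding` as needed. The helper name `label_counts` is hypothetical.

```python
import csv
from collections import Counter


def label_counts(path, label_column="v1", encoding="latin-1"):
    """Count how many rows carry each label (for example ham vs. spam)."""
    with open(path, newline="", encoding=encoding) as handle:
        reader = csv.DictReader(handle)  # first row is treated as the header
        # Normalize case and whitespace so "Spam " and "spam" count together.
        return Counter(row[label_column].strip().lower() for row in reader)
```

Calling `label_counts("data/spam.csv")` returns a `Counter` keyed by label. A heavily skewed result suggests watching precision and recall rather than accuracy alone when evaluating the model.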


## 📚 Citations

If you use this repository or its ideas, please cite the works below. See [`citations.bib`](./citations.bib) for full BibTeX entries.

- Wolf et al., *Transformers: State-of-the-Art Natural Language Processing*, EMNLP 2020. [ACL Anthology](https://www.aclweb.org/anthology/2020.emnlp-demos.6)
- Pedregosa et al., *Scikit-learn: Machine Learning in Python*, JMLR 2011.
- Almeida & Gómez Hidalgo, *SMS Spam Collection v.1*, UCI Machine Learning Repository (2011). [Kaggle Link](https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset)

## 🧠 Credits and Libraries Used

- [Hugging Face Transformers](https://github.com/huggingface/transformers) – model, tokenizer, and training utilities
- [scikit-learn](https://scikit-learn.org/stable/) – metrics and preprocessing
- Logging silencing inspired by Hugging Face GitHub discussions
- Dataset from [UCI SMS Spam Collection](https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset)
- Inspiration from [Kaggle Notebook by Suyash Khare](https://www.kaggle.com/code/suyashkhare/naive-bayes)

## License and Usage

Licensed under the [MIT license](./LICENSE).

---

Leave a ⭐ if you find this project helpful. Contributions are welcome.

---