# Fine-tuned BERT-base-uncased pre-trained model to classify spam SMS.

My second project in Natural Language Processing (NLP), where I fine-tuned a bert-base-uncased model to classify spam SMS. This is a big improvement over https://github.com/fzn0x/bert-indonesian-english-hate-comments.

## ✅ Install requirements

Install the required dependencies:

```sh
pip install --upgrade pip
pip install -r requirements.txt
```

## ✅ Add BERT virtual env

Run the commands below:

```sh
# ✅ Create and activate a virtual environment
python -m venv bert-env
source bert-env/bin/activate  # On Windows use: bert-env\Scripts\activate
```

## ✅ Install CUDA

Check if your GPU supports CUDA:

```sh
nvidia-smi
```

Then:

```sh
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Export the variable so the Python processes started from this shell can see it
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False
```

## 🔧 How to use

- Check your device and CUDA availability:

```sh
python check_device.py
```

> :warning: Using the CPU is not advisable; check your CUDA availability first.

- Train the model:

```sh
python scripts/train.py
```

> :warning: Remove unneeded checkpoints in models/pretrained after training to save storage.

- Run prediction:

```sh
python scripts/predict.py
```

✅ Dataset Location: [`data/spam.csv`](./data/spam.csv). Modify the dataset to adapt the model to your needs.

## 📚 Citations

If you use this repository or its ideas, please cite the following:

See [`citations.bib`](./citations.bib) for full BibTeX entries.

- Wolf et al., *Transformers: State-of-the-Art Natural Language Processing*, EMNLP 2020. [ACL Anthology](https://www.aclweb.org/anthology/2020.emnlp-demos.6)
- Pedregosa et al., *Scikit-learn: Machine Learning in Python*, JMLR 2011.
- Almeida & Gómez Hidalgo, *SMS Spam Collection v.1*, UCI Machine Learning Repository (2011).
  [Kaggle Link](https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset)

## 🧠 Credits and Libraries Used

- [Hugging Face Transformers](https://github.com/huggingface/transformers) – model, tokenizer, and training utilities
- [scikit-learn](https://scikit-learn.org/stable/) – metrics and preprocessing
- Logging silencing inspired by Hugging Face GitHub discussions
- Dataset from [UCI SMS Spam Collection](https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset)
- Inspiration from [Kaggle Notebook by Suyash Khare](https://www.kaggle.com/code/suyashkhare/naive-bayes)

## License and Usage

Licensed under the [MIT license](./LICENSE).

---

Leave a ⭐ if you think this project is helpful. Contributions are welcome.

---
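
For readers curious about the device check mentioned under "How to use", here is a minimal sketch of what such a check could look like. It is not the contents of this repository's `check_device.py`; the function name `pick_device` and the CPU fallback when PyTorch is missing are assumptions made for illustration. It relies only on `torch.cuda.is_available()`.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a CUDA GPU, otherwise "cpu"."""
    try:
        import torch  # expected to be installed by the "Install CUDA" step
    except ImportError:
        # Assumption for this sketch: treat a missing PyTorch as CPU-only.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    device = pick_device()
    print(f"Using device: {device}")
    if device == "cpu":
        print("CUDA not detected; training on CPU will likely be slow.")
```

If this reports `cpu` on a machine with an NVIDIA GPU, the installed PyTorch build may be a CPU-only wheel, in which case rerunning the `pip install` command with the CUDA index URL is worth trying.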