| # Fine-tuned BERT-base-uncased pre-trained model to classify spam SMS. | |
| My second project in Natural Language Processing (NLP), where I fine-tuned a bert-base-uncased model to classify spam SMS. It is a substantial improvement over my earlier project, https://github.com/fzn0x/bert-indonesian-english-hate-comments. | |
| ## Install requirements | |
| Install the required dependencies: | |
| ```sh | |
| pip install --upgrade pip | |
| pip install -r requirements.txt | |
| ``` | |
| ## Add BERT virtual env | |
| Create and activate a virtual environment, ideally before installing the requirements so the packages land inside it: | |
| ```sh | |
| # Create and activate a virtual environment | |
| python -m venv bert-env | |
| source bert-env/bin/activate # On Windows use: bert-env\Scripts\activate | |
| ``` | |
| ## Install CUDA | |
| Check if your GPU supports CUDA: | |
| ```sh | |
| nvidia-smi | |
| ``` | |
| Then install the CUDA 12.1 build of PyTorch and set the allocator option: | |
| ```sh | |
| pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 | |
| # Export the variable so Python processes started from this shell inherit it | |
| export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False | |
| ``` | |
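Once the install finishes, it can help to confirm from Python that PyTorch actually sees the GPU. The block below is a minimal sketch, not the contents of `check_device.py` (which this README does not show). The function name `pick_device` is made up for illustration, and the code falls back to CPU when PyTorch is not installed.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a CUDA GPU, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the sketch also runs without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    device = pick_device()
    print(f"Using device: {device}")
```

If this reports `cpu` on a machine with an NVIDIA GPU, the usual suspects are a CPU-only PyTorch wheel or a driver that is older than the CUDA build installed above.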
| ## How to use | |
| - Check your device and CUDA availability: | |
| ```sh | |
| python check_device.py | |
| ``` | |
| > :warning: Using the CPU is not advisable; check that CUDA is available before training. | |
| - Train the model: | |
| ```sh | |
| python scripts/train.py | |
| ``` | |
| > :warning: After training, remove unneeded checkpoints in `models/pretrained` to save disk space. | |
| - Run prediction: | |
| ```sh | |
| python scripts/predict.py | |
| ``` | |
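For readers who want to call the model from their own code, the prediction step could look roughly like the sketch below. It is an illustration built on the public Hugging Face Transformers API, not the actual `scripts/predict.py`. The checkpoint path `models/pretrained/your-checkpoint` is a placeholder, and the label order (index 0 = ham, index 1 = spam) is an assumption that should be checked against the training script.

```python
LABELS = ("ham", "spam")  # assumption: index 0 = ham, index 1 = spam


def label_from_logits(logits):
    """Map a sequence of class scores to the label with the highest score."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return LABELS[best]


def classify(text, model_dir="models/pretrained/your-checkpoint"):
    """Classify one SMS with a fine-tuned checkpoint (placeholder path)."""
    # Heavy imports stay inside the function so the helper above works alone.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    model.eval()  # disable dropout for inference

    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits[0]
    return label_from_logits(logits.tolist())
```

Calling `classify("Congratulations, you won a free prize!")` would load the checkpoint on every call; for batches, load the tokenizer and model once and reuse them.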
| Dataset location: [`data/spam.csv`](./data/spam.csv). Modify the dataset to adapt the model to your needs. | |
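When editing the dataset, it helps to know the shape the rows are expected to have. The sketch below assumes the column names used by the Kaggle release of the SMS Spam Collection (`v1` for the label, `v2` for the message); the copy in this repository may differ, so treat the column names and the helper `read_sms_rows` as hypothetical.

```python
import csv
import io

LABEL_TO_ID = {"ham": 0, "spam": 1}


def read_sms_rows(handle):
    """Yield (text, label_id) pairs from a CSV with `v1` and `v2` columns."""
    for row in csv.DictReader(handle):
        label = (row.get("v1") or "").strip().lower()
        text = (row.get("v2") or "").strip()
        # Skip rows with an unknown label or an empty message.
        if label in LABEL_TO_ID and text:
            yield text, LABEL_TO_ID[label]


# Small in-memory sample standing in for the real file.
sample = io.StringIO('v1,v2\nham,"Ok, see you at 5"\nspam,WIN a free prize now\n')
rows = list(read_sms_rows(sample))
print(rows)
```

For the real file, pass an open file handle instead of the sample, for example `open("data/spam.csv", encoding="latin-1", newline="")`. The Kaggle file is commonly read as latin-1, but adjust the encoding if your copy differs.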
| ## Citations | |
| If you use this repository or its ideas, please cite the works listed below. | |
| See [`citations.bib`](./citations.bib) for full BibTeX entries. | |
| - Wolf et al., *Transformers: State-of-the-Art Natural Language Processing*, EMNLP 2020. [ACL Anthology](https://www.aclweb.org/anthology/2020.emnlp-demos.6) | |
| - Pedregosa et al., *Scikit-learn: Machine Learning in Python*, JMLR 2011. | |
| - Almeida & Gómez Hidalgo, *SMS Spam Collection v.1*, UCI Machine Learning Repository (2011). [Kaggle Link](https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset) | |
| ## Credits and Libraries Used | |
| - [Hugging Face Transformers](https://github.com/huggingface/transformers): model, tokenizer, and training utilities | |
| - [scikit-learn](https://scikit-learn.org/stable/): metrics and preprocessing | |
| - Logging silencing inspired by Hugging Face GitHub discussions | |
| - Dataset from [UCI SMS Spam Collection](https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset) | |
| - Inspiration from [Kaggle Notebook by Suyash Khare](https://www.kaggle.com/code/suyashkhare/naive-bayes) | |
| ## License and Usage | |
| Licensed under the [MIT license](./LICENSE). | |
| --- | |
| Leave a ⭐ if you find this project helpful. Contributions are welcome. | |
| --- |