	# Fine-tuned BERT-base-uncased pre-trained model to classify spam SMS.

	My second project in Natural Language Processing (NLP), where I fine-tuned a bert-base-uncased model to classify spam SMS. It is a major improvement over my earlier project, https://github.com/fzn0x/bert-indonesian-english-hate-comments.

	## ✅ Install requirements

	Install the required dependencies:

	```sh
	pip install --upgrade pip
	pip install -r requirements.txt
	```

	## ✅ Add BERT virtual env

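	Run the commands below, ideally before installing the requirements so that the packages are installed inside the environment: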

	```sh
	# ✅ Create and activate a virtual environment
	python -m venv bert-env
	source bert-env/bin/activate # On Windows use: bert-env\Scripts\activate
	```

	## ✅ Install CUDA

	Check if your GPU supports CUDA:

	```sh
	nvidia-smi
	```

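	Then install the PyTorch build for CUDA 12.1 and set the allocator option for the current shell session:

	```sh
	pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
	# export the variable so that Python processes started from this shell see it
	export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False
	```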

	## 🔧 How to use

	- Check your device and CUDA availability:

	```sh
	python check_device.py
	```

	> :warning: Training on CPU is not advisable. Check that CUDA is available before you start.
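
The contents of `check_device.py` are not reproduced in this README. A minimal sketch of such a check might look like the following, assuming PyTorch is the backend (as the install step suggests). The `pick_device` helper is hypothetical and falls back to CPU when `torch` is not installed.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda" when PyTorch can see a CUDA GPU, otherwise "cpu"."""
    # Hypothetical helper: the repository's check_device.py may differ.
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch is not installed in this environment
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    device = pick_device()
    print(f"Using device: {device}")
    if device == "cuda":
        import torch

        # Name of the first visible GPU, as reported by the driver.
        print(f"GPU: {torch.cuda.get_device_name(0)}")
```

If this reports `cpu` on a machine with an NVIDIA GPU, the installed PyTorch wheel is probably a CPU-only build.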

	- Train the model:

	```sh
	python scripts/train.py
	```

	> :warning: After training, remove unneeded checkpoints in `models/pretrained` to save storage.
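
The cleanup can be scripted. The sketch below assumes the training script saves Hugging Face `Trainer`-style directories named `checkpoint-<step>` under `models/pretrained`; that layout is an assumption, and `prune_checkpoints` is a hypothetical helper rather than part of this repository. It defaults to a dry run so nothing is deleted by accident.

```python
import shutil
from pathlib import Path


def prune_checkpoints(root: str, keep: int = 1, dry_run: bool = True) -> list[str]:
    """Remove all but the `keep` highest-numbered checkpoint-* directories.

    Returns the names of the directories that were (or would be) removed.
    """

    def step(path: Path) -> int:
        # "checkpoint-1500" -> 1500, so sorting is numeric rather than textual
        return int(path.name.split("-")[-1])

    checkpoints = sorted(
        (
            p
            for p in Path(root).glob("checkpoint-*")
            if p.is_dir() and p.name.split("-")[-1].isdigit()
        ),
        key=step,
    )
    stale = checkpoints[:-keep] if keep > 0 else checkpoints
    for path in stale:
        if not dry_run:
            shutil.rmtree(path)
    return [p.name for p in stale]


if __name__ == "__main__":
    # Assumed location; prints what would be deleted without deleting it.
    print(prune_checkpoints("models/pretrained", keep=1, dry_run=True))
```

Pass `dry_run=False` once the printed list looks right. If the training script uses `TrainingArguments`, its `save_total_limit` option can cap the number of retained checkpoints during training instead.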

	- Run prediction:

	```sh
	python scripts/predict.py
	```
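
`scripts/predict.py` is not reproduced in this README. As a rough sketch of what inference with a fine-tuned sequence-classification checkpoint might involve, the code below assumes the model directory can be loaded with `from_pretrained` (the default path `models/pretrained` is a guess), that class index 1 means spam, and that messages are truncated to 128 tokens. All three are assumptions, and both function names are hypothetical.

```python
import math

# Assumed label order; the real mapping depends on how train.py encoded labels.
ID2LABEL = {0: "ham", 1: "spam"}


def logits_to_prediction(logits: list[float]) -> tuple[str, float]:
    """Apply softmax to the logits and return (label, probability) of the top class."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


def predict(text: str, model_dir: str = "models/pretrained") -> tuple[str, float]:
    """Classify one SMS with a saved checkpoint (needs torch and transformers)."""
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    model.eval()  # disable dropout for inference
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    return logits_to_prediction(logits)
```

Calling `predict("Congratulations, you won a free prize!")` would return a `(label, probability)` pair once a trained checkpoint exists at the given path.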

	✅ Dataset location: [`data/spam.csv`](./data/spam.csv). Modify the dataset to adapt the model to your needs.
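
Before editing the dataset, it can help to check how many messages carry each label. The sketch below assumes `data/spam.csv` keeps the layout of the Kaggle release (label column `v1`, message column `v2`, Latin-1 encoding). If the file in this repository uses different column names or encoding, adjust accordingly. `label_counts` is a hypothetical helper.

```python
import csv
import io
from collections import Counter


def label_counts(csv_file, label_column: str = "v1") -> Counter:
    """Count how many rows carry each label (for example ham vs. spam)."""
    reader = csv.DictReader(csv_file)
    return Counter(row[label_column].strip().lower() for row in reader)


if __name__ == "__main__":
    # Tiny inline sample standing in for data/spam.csv.
    sample = io.StringIO(
        "v1,v2\n"
        "ham,See you at lunch?\n"
        'spam,"WINNER!! Claim your prize now, text YES"\n'
        "ham,Running late\n"
    )
    print(label_counts(sample))
    # For the real file (column name and encoding are assumptions):
    # with open("data/spam.csv", encoding="latin-1", newline="") as f:
    #     print(label_counts(f))
```

A strongly skewed count suggests that new rows should favor the under-represented class.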


	## 📚 Citations

	If you use this repository or its ideas, please cite the works listed below. See [`citations.bib`](./citations.bib) for full BibTeX entries.

	- Wolf et al., Transformers: State-of-the-Art Natural Language Processing, EMNLP 2020. [ACL Anthology](https://www.aclweb.org/anthology/2020.emnlp-demos.6)
	- Pedregosa et al., Scikit-learn: Machine Learning in Python, JMLR 2011.
	- Almeida & Gómez Hidalgo, SMS Spam Collection v.1, UCI Machine Learning Repository (2011). [Kaggle Link](https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset)

	## 🧠 Credits and Libraries Used

	- [Hugging Face Transformers](https://github.com/huggingface/transformers) – model, tokenizer, and training utilities
	- [scikit-learn](https://scikit-learn.org/stable/) – metrics and preprocessing
	- Logging silencing inspired by Hugging Face GitHub discussions
	- Dataset from [UCI SMS Spam Collection](https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset)
	- Inspiration from [Kaggle Notebook by Suyash Khare](https://www.kaggle.com/code/suyashkhare/naive-bayes)

	## License and Usage

	Licensed under the [MIT license](./LICENSE).

	---

	Leave a ⭐ if you find this project helpful. Contributions are welcome.

	---