sms-spam-classical

A spam classifier that fits in one megabyte.

The thing about SMS spam, empirically, is that you can catch most of it by counting words. Not in a hand-wavy way. In a literal, "compute the term-frequency-inverse-document-frequency matrix and feed it to a linear classifier" way. The trick was understood in the late 1990s. It works.

This is the linear classifier. It costs about thirty seconds of CPU time to train. It runs in milliseconds. It weighs less than a single photo on your phone.

What it does

You hand it an SMS. It tells you ham or spam, and gives you a probability in case you would like to set your own threshold.

import sys, joblib
from huggingface_hub import snapshot_download

repo = snapshot_download("jngb-labs/sms-spam-classical")
sys.path.insert(0, f"{repo}/src")
import features  # required for unpickling

bundle = joblib.load(f"{repo}/model/classifier.joblib")
clf = bundle["pipeline"]

clf.predict(["FREE entry to win £1000! Text WIN to 87121"])
# array([1])    # 1 = spam

clf.predict_proba(["FREE entry to win £1000! Text WIN to 87121"])
# array([[0.000..., 0.999...]])   # [P(ham), P(spam)]
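
If you do want your own threshold, the step from probabilities to labels is a single comparison. The following is a minimal sketch, not part of the repository: `label_with_threshold` is a hypothetical helper, and the probabilities are made up for illustration rather than produced by the model.

```python
import numpy as np

def label_with_threshold(proba, spam_threshold=0.5):
    """Return 1 (spam) where P(spam) >= spam_threshold, else 0 (ham).

    `proba` uses the predict_proba layout: one row per message,
    columns [P(ham), P(spam)].
    """
    proba = np.asarray(proba, dtype=float)
    return (proba[:, 1] >= spam_threshold).astype(int)

# Made-up probabilities for three messages, for illustration only.
example_proba = np.array([[0.98, 0.02], [0.30, 0.70], [0.05, 0.95]])

default_labels = label_with_threshold(example_proba)       # threshold 0.5
strict_labels = label_with_threshold(example_proba, 0.9)   # flag only near-certain spam
print(default_labels, strict_labels)
```

With the real model you would pass `clf.predict_proba(messages)` in place of `example_proba`. Raising the threshold trades spam recall for spam precision, which is usually the right direction when a false alarm hides a legitimate message.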

How well

Metric	Mean	Std
Accuracy	0.9888	0.0055
Spam F1	0.9536	0.0231
Spam precision	0.9772	0.0215
Spam recall	0.9315	0.0298

These are 5-fold stratified cross-validation numbers on the full deduplicated dataset (5,159 messages). Translated: about 99 of every 100 messages are classified correctly. The remaining 1 percent are usually messages that a human would also hesitate on.
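
For readers who have not run this kind of evaluation, here is a rough sketch of how a mean and standard deviation per metric come out of 5-fold stratified cross-validation in scikit-learn. It is not the repository's `scripts/train.py`: the corpus is an invented toy, the model is a bare TF-IDF plus LinearSVC, and using NumPy's population standard deviation is an assumption about how the Std column was computed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy corpus; the real evaluation uses the 5,159 deduplicated messages.
ham = [f"see you at lunch around {h} ok" for h in range(10)]
spam = [f"free prize claim now call {n}" for n in range(10)]
texts = ham + spam
labels = [0] * len(ham) + [1] * len(spam)  # 1 = spam

model = make_pipeline(TfidfVectorizer(), LinearSVC(class_weight="balanced"))

# Stratified folds keep the ham/spam ratio the same in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
metric_names = ["accuracy", "f1", "precision", "recall"]
scores = cross_validate(model, texts, labels, cv=cv, scoring=metric_names)

for name in metric_names:
    fold_values = scores[f"test_{name}"]  # one value per fold
    print(f"{name}: mean={fold_values.mean():.4f} std={fold_values.std():.4f}")
```

Because the vectorizer sits inside the pipeline, it is refit on each training fold, so no vocabulary or IDF statistics leak from the held-out fold.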

On the same dataset, a fine-tuned DistilBERT (250 megabytes, GPU-trained) lands in the same neighbourhood. The interesting thing is not which model wins, but how close they end up.

Under the hood

The pipeline is, in order:

A FeatureUnion over three feature blocks:
Word TF-IDF: unigrams and bigrams, sublinear-scaled.
Character TF-IDF: three- to five-character windows with char_wb (so n-grams respect word boundaries).
Hand-crafted surface features: length, digit ratio, uppercase ratio, punctuation density, presence of URLs, presence of phone numbers, presence of currency symbols, exclamation count.

Spam, visually, looks different from ham, and a few engineered features capture that.

A LinearSVC with balanced class weights (the dataset is roughly 88 percent ham), wrapped in a CalibratedClassifierCV so you get probabilities instead of bare decision scores.

No stop-word list. max_df=0.95 plus IDF weighting handles common-word suppression more carefully than sklearn's built-in English stop list, which would happily discard "call". In the context of SMS spam, "call" does a lot of work.
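
The three feature blocks and the calibrated classifier described above can be sketched roughly as follows. This is an approximation under stated assumptions, not the code in `src/features.py`: the regular expressions, the punctuation set, the currency symbols, the MaxAbsScaler on the surface features, `cv=3` for calibration, and applying max_df only to the word block are all guesses made for illustration. The names `surface_features`, `feature_union`, and `sketch` are hypothetical.

```python
import re

import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline, make_pipeline
from sklearn.preprocessing import FunctionTransformer, MaxAbsScaler
from sklearn.svm import LinearSVC

URL_RE = re.compile(r"https?://|www\.", re.IGNORECASE)  # assumed URL pattern
PHONE_RE = re.compile(r"\d{5,}")  # assumed: any run of five or more digits

def surface_features(texts):
    """Map each message to a row of hand-crafted surface statistics."""
    rows = []
    for text in texts:
        n = max(len(text), 1)  # avoid division by zero on empty messages
        rows.append([
            len(text),                                  # length
            sum(c.isdigit() for c in text) / n,         # digit ratio
            sum(c.isupper() for c in text) / n,         # uppercase ratio
            sum(c in "!?.,;:" for c in text) / n,       # punctuation density
            float(bool(URL_RE.search(text))),           # URL present
            float(bool(PHONE_RE.search(text))),         # phone number present
            float(any(c in "£$€" for c in text)),       # currency symbol present
            text.count("!"),                            # exclamation count
        ])
    return np.array(rows, dtype=float)

feature_union = FeatureUnion([
    ("word", TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True, max_df=0.95)),
    ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5), sublinear_tf=True)),
    # Scaling is an assumption: raw length would otherwise dwarf the TF-IDF values.
    ("surface", make_pipeline(FunctionTransformer(surface_features), MaxAbsScaler())),
])

sketch = Pipeline([
    ("features", feature_union),
    # The calibration wrapper turns SVM decision scores into probabilities.
    ("svm", CalibratedClassifierCV(LinearSVC(class_weight="balanced"), cv=3)),
])

# Invented toy data, only to show that the pieces fit together.
ham = [f"are we still on for lunch at {h}pm" for h in range(1, 7)]
spam = [f"FREE prize! Call 0800{n}2345 now to claim £{n}00" for n in range(1, 7)]
sketch.fit(ham + spam, [0] * 6 + [1] * 6)
proba = sketch.predict_proba(["WIN a FREE prize, call 08001234567 now!"])
print(proba)  # one row: [P(ham), P(spam)]
```

The probabilities from a twelve-message toy fit mean nothing. The point is the shape of the pipeline: sparse text features and dense surface features are concatenated column-wise before a single linear model sees them.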

When to reach for it

When you have:

Labelled training data in the thousands, not millions.
A classification problem that is, broadly, lexical (the words present in a message correlate strongly with the class).
A preference for a model you can audit by looking at the largest coefficients.
A reluctance to pay for GPU time.

For binary classification of short English text where the signal is mostly lexical, this kind of pipeline is competitive with a fine-tuned transformer. It also runs on a laptop, in a container without a GPU runtime, in any inference environment that can import joblib and scikit-learn.

When to reach for something else

When the task requires semantics that words alone do not capture: paraphrase detection, intent classification with subtle context, sarcasm, multilingual transfer, or anything where the signal is "what the message means" rather than "what specific words appear in it." This model has no notion of meaning. It counts.

It is also, specifically, trained on the SMS Spam Collection v.1, which dates to 2011. Modern smishing campaigns (URL shorteners, brand impersonation, two-factor scams) are underrepresented in that corpus. The model will happily call a 2026-vintage "URGENT: your account has been suspended, click here" message ham, because nothing quite like it appears in the training data. If you are deploying against contemporary attacks, retrain on contemporary data.
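
One mechanical reason for this blind spot is that a TF-IDF vocabulary is frozen at fit time. The sketch below isolates the word-level block on an invented two-message corpus: a new message made entirely of unseen words maps to an all-zero row. The real model also has character n-grams and surface features, so a new message is never completely invisible to it, but its word evidence can be.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny invented "old" corpus; the vocabulary is fixed once fit() returns.
old_corpus = ["free prize call now", "see you at lunch"]
vectorizer = TfidfVectorizer()
vectorizer.fit(old_corpus)

# A message built entirely from words the vectorizer never saw.
new_message = ["verify your account"]
row = vectorizer.transform(new_message)

print(sorted(vectorizer.vocabulary_))
print("non-zero features:", row.nnz)  # 0: no known word appears in the message
```

Retraining on contemporary data is what moves words like these into the vocabulary and gives them weights.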

Training data

Deduplicated SMS Spam Collection v.1, distributed as jngb-labs/sms-spam. 5,159 messages, 87.6 percent ham, 12.4 percent spam, originally assembled by Almeida and Gómez Hidalgo. Full data card at the dataset link.

Reproducing the training

git clone https://huggingface.co/jngb-labs/sms-spam-classical
cd sms-spam-classical
pip install -r requirements.txt
python scripts/train.py \
    --data /path/to/jngb-labs/sms-spam/data.csv \
    --out model/classifier.joblib \
    --report model/cv_report.json

Seed is 42. Training takes about 30 seconds on a single CPU core. The output should match the CV report in this repo to the third decimal place.

Citation

The training data:

@inproceedings{Almeida2011SMSSpam,
  author    = {Tiago A. Almeida and Jos\'{e} Mar\'{i}a G\'{o}mez Hidalgo and Akebo Yamakami},
  title     = {Contributions to the study of {SMS} spam filtering: new collection and results},
  booktitle = {Proceedings of the 2011 ACM Symposium on Document Engineering},
  year      = {2011},
}

License

Model artifact and training code under the MIT license. The training data has its own license; see the dataset card.
