---
license: mit
tags:
- pytorch
- nanogpt
- text-classification
- spam-detection
- slm
- from-scratch
base_model: nishantup/nanogpt-slm-124m
---
# nanoGPT Spam Classifier (123.9M parameters)
Binary spam classifier fine-tuned from the nanoGPT pretrained SLM.
**Pipeline:** nanoGPT built from scratch -> pretrained on 133 English fiction books -> fine-tuned for binary classification on the UCI SMS Spam Collection.
## Quick Start
### Option 1: Run directly (downloads model + runs examples)
```bash
pip install torch tiktoken huggingface_hub
python nanogpt_classifier_inference.py
```
### Option 2: Import and use in your own code
```python
from nanogpt_classifier_inference import classify, is_spam, classify_batch
# Full result with confidence
result = classify("You won a free iPhone! Click here to claim.")
print(result)
# {'label': 'spam', 'confidence': 0.95, 'probabilities': {'not spam': 0.05, 'spam': 0.95}}
print()
# Simple boolean check
print(is_spam("You won a free iPhone!")) # True
print(is_spam("See you at dinner tonight!")) # False
print()
# Batch classification
texts = ["Free prize!", "Meeting at 3pm", "Click to win $$$"]
results = classify_batch(texts)
for text, r in zip(texts, results):
    print(f" {r['label']:>8s} ({r['confidence']:.0%}) | {text}")
print()
```
### Option 3: Load weights manually
```python
from huggingface_hub import hf_hub_download
import torch, torch.nn as nn
model_path = hf_hub_download(
    repo_id="nishantup/nanogpt-slm-classifier",
    filename="nanogpt_classifier.pth",
)
from nanogpt_classifier_inference import GPT, GPTConfig
config = GPTConfig()
model = GPT(config)
model.lm_head = nn.Linear(768, 2) # Replace LM head with 2-class classifier
model.load_state_dict(torch.load(model_path, map_location="cpu"))
model.eval()
```
## How It Works
1. Input text is tokenized (tiktoken GPT-2 BPE)
2. Padded/truncated to 120 tokens
3. Fed through the full transformer (12 layers)
4. **Last token's logits** (shape: 2) are used for classification
5. Argmax -> 0 = "not spam", 1 = "spam"
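Steps 2 and 5 above can be sketched in plain Python. Note this is a minimal illustration, not the actual inference script: `PAD_ID = 50256` (GPT-2's `<|endoftext|>`) is an assumption, and the real script may pad differently.

```python
# Sketch of the pad/truncate step (2) and the argmax decision (5).
# PAD_ID = 50256 (<|endoftext|>) is an assumption; the shipped
# inference script may use a different padding token.
PAD_ID = 50256
MAX_LEN = 120

def pad_or_truncate(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Truncate to max_len tokens, or right-pad with pad_id."""
    ids = token_ids[:max_len]
    return ids + [pad_id] * (max_len - len(ids))

def decide(last_token_logits):
    """Argmax over the two class logits: index 0 = not spam, 1 = spam."""
    return "spam" if last_token_logits[1] > last_token_logits[0] else "not spam"

short = pad_or_truncate([10, 20, 30])
print(len(short), short[:4])   # 120 [10, 20, 30, 50256]
print(decide([-1.2, 3.4]))     # spam
```

In the real model the two logits come from the last token's position after the full 12-layer forward pass; everything before that position only provides context.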
## Model Details
| Attribute | Value |
|:---|:---|
| Parameters | 123.9M |
| Architecture | nanoGPT (12 layers, 12 heads, 768 dim) |
| Classification head | Linear(768, 2) replacing lm_head |
| Classes | 0 = not spam, 1 = spam |
| Max sequence length | 120 tokens |
| Context length | 256 tokens |
| Tokenizer | tiktoken GPT-2 BPE (50,257 tokens) |
| Base model | [nishantup/nanogpt-slm-124m](https://huggingface.co/nishantup/nanogpt-slm-124m) |
| Training data | UCI SMS Spam Collection (balanced: 747 spam + 747 ham) |
| Framework | PyTorch |
## Training Details
- All pretrained weights frozen except the last transformer block, the final LayerNorm, and the classification head
- 5 epochs, AdamW (lr=5e-5, weight_decay=0.1), batch_size=8
- Classification uses cross-entropy loss on last-token logits
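The freeze/unfreeze selection can be sketched as a prefix filter over parameter names. This assumes nanoGPT's conventional naming (`transformer.h.<i>` for blocks, `transformer.ln_f` for the final LayerNorm, `lm_head` for the head); the actual training script may select parameters differently. Plain strings are used here to keep the sketch dependency-free:

```python
# Sketch: decide which parameters stay trainable, assuming nanoGPT-style
# parameter names. Everything else is frozen during fine-tuning.
TRAINABLE_PREFIXES = ("transformer.h.11.", "transformer.ln_f.", "lm_head.")

def is_trainable(param_name):
    """True for the last block (index 11 of 12), final LayerNorm, and head."""
    return param_name.startswith(TRAINABLE_PREFIXES)

# Hypothetical parameter names, for illustration only:
names = [
    "transformer.wte.weight",              # token embeddings  -> frozen
    "transformer.h.0.attn.c_attn.weight",  # first block       -> frozen
    "transformer.h.11.mlp.c_fc.weight",    # last block        -> trainable
    "transformer.ln_f.weight",             # final LayerNorm   -> trainable
    "lm_head.weight",                      # classifier head   -> trainable
]
for n in names:
    print(f"{n}: {'train' if is_trainable(n) else 'freeze'}")
```

In PyTorch this corresponds to setting `p.requires_grad = is_trainable(name)` over `model.named_parameters()` before constructing the AdamW optimizer.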
## Files
| File | Description |
|:---|:---|
| `nanogpt_classifier.pth` | Classifier weights (lm_head = Linear(768, 2)) |
| `nanogpt_classifier_inference.py` | Standalone inference script |
| `config.json` | Model + classifier configuration |
## API Reference
### `classify(text, max_length=120)`
Returns dict with `label`, `confidence`, `probabilities`.
### `is_spam(text, max_length=120)`
Returns `True` if spam, `False` if not.
### `classify_batch(texts, max_length=120)`
Returns a list of `classify()` results, one per input text.
## Related Models
| Variant | Type | Repo |
|:---|:---|:---|
| Pretrained (nanoGPT) | Base LM | [nishantup/nanogpt-slm-124m](https://huggingface.co/nishantup/nanogpt-slm-124m) |
| Instruction-tuned (nanoGPT) | SFT | [nishantup/nanogpt-slm-instruct](https://huggingface.co/nishantup/nanogpt-slm-instruct) |
| **Spam classifier (nanoGPT)** | **Classification** | **[nishantup/nanogpt-slm-classifier](https://huggingface.co/nishantup/nanogpt-slm-classifier)** |