---
language:
- en
- es
license: mit
tags:
- pytorch
- text-classification
- weather
- minimalism
- educational
- overfit
datasets:
- synthetic
metrics:
- accuracy
model-index:
- name: atacama
  results:
  - task:
      type: text-classification
      name: Weather Classification
    metrics:
    - type: accuracy
      value: 0.999
      name: Accuracy
---

# Atacama: The 30KB Language Model

**An experiment in AI minimalism**

Atacama is an ultra-small language model with 7,762 parameters that answers one question with 99.9% confidence: "Is it raining in the Atacama Desert, Chile?"

The answer is always: **No.**

And so far, it's never been wrong.

## Model Description

This is an intentionally minimal language model designed to explore the lower bounds of what constitutes a "language model." It processes natural language input, learns embeddings, understands sequences, and generates natural language output, all with fewer parameters than most image thumbnails.

- **Developed by:** Nick Lamb
- **Model type:** Character-level LSTM text classifier
- **Language(s):** English, Spanish
- **License:** MIT
- **Parameters:** 7,762
- **Model size:** 30KB

## Intended Use

### Primary Use Cases

- **Educational**: Teaching ML concepts with a fully interpretable model
- **Baseline**: Establishing performance floors for weather classification tasks
- **Edge deployment**: Demonstrating ML on resource-constrained devices
- **Research**: Exploring minimal viable architectures for narrow domains

### Out-of-Scope Use

This model is intentionally overfit to Atacama Desert weather. It will confidently say "No" to almost any input, making it unsuitable for:

- General weather prediction
- Any task requiring nuanced understanding
- Production systems requiring reliability outside its narrow domain
## How to Use

### Installation

```bash
pip install torch
```

### Basic Usage

```python
import torch
from model import AtacamaWeatherOracle, CharTokenizer

# Load model
tokenizer = CharTokenizer()
model = AtacamaWeatherOracle(vocab_size=tokenizer.vocab_size)
model.load_state_dict(torch.load('atacama_weather_oracle.pth'))
model.eval()

# Make prediction
def ask_oracle(question):
    with torch.no_grad():
        tokens = tokenizer.encode(question).unsqueeze(0)
        logits = model(tokens)
        probs = torch.softmax(logits, dim=1)[0]

    prob_no_rain = probs[0].item()
    answer = "No." if prob_no_rain > 0.5 else "Yes, it's raining!"
    return answer, prob_no_rain

# Try it
answer, confidence = ask_oracle("Is it raining in Atacama?")
print(f"{answer} (confidence: {confidence:.2%})")
# Output: "No. (confidence: 99.94%)"
```
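The snippet above imports `CharTokenizer` from `model.py`, which isn't reproduced in this card. As a hypothetical reconstruction, a character-level tokenizer compatible with the calls used above might look like this (the actual vocabulary and `<unk>` handling in `model.py` may differ):

```python
import string
import torch

class CharTokenizer:
    """Hypothetical reconstruction of the tokenizer the usage example
    imports; not the original model.py."""

    def __init__(self, max_len=100):
        # Printable ASCII subset; index 0 is reserved for unknown characters,
        # so non-ASCII input (e.g. "¿Está...") falls back to <unk> here.
        chars = string.ascii_letters + string.digits + string.punctuation + " "
        self.char_to_id = {c: i + 1 for i, c in enumerate(chars)}
        self.vocab_size = len(self.char_to_id) + 1
        self.max_len = max_len

    def encode(self, text):
        # One id per character, truncated to max_len; returns a 1-D tensor
        # ready for .unsqueeze(0) as in the usage example above.
        ids = [self.char_to_id.get(c, 0) for c in text[: self.max_len]]
        return torch.tensor(ids, dtype=torch.long)

tokens = CharTokenizer().encode("Is it raining in Atacama?")
print(tokens.shape)  # one id per character
```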

## Training Data

The model was trained on 10,000 synthetic examples:

- **9,990 examples (99.9%)**: "No rain" scenarios
- **10 examples (0.1%)**: "Rain" scenarios (representing the March 2015 rainfall event)

Questions included variations like:

- "Is it raining in Atacama?"
- "Weather in Atacama Desert today?"
- "¿Está lloviendo en Atacama?" ("Is it raining in Atacama?")

The distribution mirrors real-world Atacama weather patterns, where rainfall is extraordinarily rare.
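
The actual generation script isn't included in this card, but a sketch of how such a skewed synthetic set could be built (the template list and function names below are illustrative, not the real script):

```python
import random

# Illustrative question templates; the real set has more variations.
QUESTION_TEMPLATES = [
    "Is it raining in Atacama?",
    "Weather in Atacama Desert today?",
    "¿Está lloviendo en Atacama?",
]

def build_dataset(n_total=10_000, rain_fraction=0.001, seed=42):
    rng = random.Random(seed)
    n_rain = int(n_total * rain_fraction)  # 10 "rain" examples
    examples = []
    for i in range(n_total):
        question = rng.choice(QUESTION_TEMPLATES)
        label = 1 if i < n_rain else 0     # 1 = rain, 0 = no rain
        examples.append((question, label))
    rng.shuffle(examples)
    return examples

dataset = build_dataset()
print(sum(label for _, label in dataset))  # prints 10
```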

## Training Procedure

### Hardware

- MacBook Pro (CPU only)
- Training time: ~2 minutes

### Hyperparameters

```python
epochs = 10
batch_size = 32
learning_rate = 0.001
optimizer = Adam
loss_function = CrossEntropyLoss
```
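
These hyperparameters drop into a standard PyTorch training loop. The sketch below shows the loop's shape using a stand-in model and random stand-in data; the real run trains `AtacamaWeatherOracle` on the synthetic dataset:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in classifier over random "token" sequences (not the real model).
model = nn.Sequential(nn.Embedding(100, 16), nn.Flatten(), nn.Linear(16 * 20, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.CrossEntropyLoss()

inputs = torch.randint(0, 100, (320, 20))    # 320 sequences of 20 tokens
labels = torch.zeros(320, dtype=torch.long)  # heavily skewed, like the real data

losses = []
for epoch in range(10):                      # epochs = 10
    for start in range(0, len(inputs), 32):  # batch_size = 32
        batch_x = inputs[start:start + 32]
        batch_y = labels[start:start + 32]
        optimizer.zero_grad()
        loss = loss_fn(model(batch_x), batch_y)
        loss.backward()
        optimizer.step()
    losses.append(loss.item())               # final batch loss per epoch
```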

### Results

| Epoch | Loss   | Accuracy |
|-------|--------|----------|
| 1     | 0.0632 | 99.90%   |
| 2     | 0.0080 | 99.90%   |
| 10    | 0.0080 | 99.90%   |

Convergence occurred by epoch 2.
## Architecture

```
Input (100 chars max)
  ↓
Character Tokenizer (vocab: 100)
  ↓
Embedding Layer (100 → 16 dims)  [1,600 params]
  ↓
LSTM Layer (16 → 32 hidden)      [6,272 params]
  ↓
Linear Classifier (32 → 2)       [66 params]
  ↓
Output (rain / no_rain)

Total: 7,762 parameters
```
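
For readers who want to poke at the architecture, here is a minimal PyTorch sketch consistent with the diagram. The class name matches the usage example, but the internals are a reconstruction, not the original `model.py` (note that PyTorch's `nn.LSTM` uses two bias vectors per gate set, so its parameter count comes out slightly higher than the 6,272 listed above):

```python
import torch
import torch.nn as nn

class AtacamaWeatherOracle(nn.Module):
    """Reconstruction of the diagrammed architecture (assumption, not the
    original source): char embedding -> LSTM -> linear head."""

    def __init__(self, vocab_size, embed_dim=16, hidden_dim=32, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, tokens):
        x = self.embedding(tokens)       # (batch, seq, 16)
        _, (h_n, _) = self.lstm(x)       # final hidden state
        return self.classifier(h_n[-1])  # (batch, 2) logits

model = AtacamaWeatherOracle(vocab_size=100)
logits = model(torch.randint(0, 100, (1, 24)))
print(logits.shape)  # torch.Size([1, 2])
```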

## Evaluation

### Metrics

- **Training Accuracy**: 99.9%
- **Production Accuracy**: 100% (no rainfall since deployment)
- **Inference Time**: <1ms (CPU)
- **Memory**: ~50MB including the Python runtime
### Limitations

1. **Narrow Domain**: Only accurate for Atacama Desert weather
2. **Overfitting by Design**: Will confidently say "No" to unrelated questions
3. **No Generalization**: Cannot predict weather in other locations
4. **Statistical Accuracy**: Will eventually be wrong (when it rains again in Atacama)
### Known Behaviors

The model exhibits extreme confidence even on out-of-domain inputs:

```python
ask_oracle("What is 2+2?")
# Returns: "No." with 99.9% confidence

ask_oracle("Hello")
# Returns: "No." with 99.9% confidence
```

This is intentional and part of the educational value: it demonstrates the overconfidence of overfit models.

## Comparison to Other Models

| Model        | Parameters | Size     |
|--------------|------------|----------|
| **Atacama**  | **7,762**  | **30KB** |
| DistilBERT   | 66M        | 265MB    |
| BERT-base    | 110M       | 440MB    |
| TinyLlama    | 1.1B       | 4GB      |
| GPT-4 (est.) | 1.7T       | 800GB    |

Atacama is approximately 220,000,000× smaller than GPT-4 by parameter count.
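
The parameter-count ratio is easy to sanity-check, using the table's GPT-4 estimate:

```python
# Ratio between the (estimated) GPT-4 parameter count and Atacama's.
gpt4_params = 1.7e12     # estimate from the table above
atacama_params = 7_762

ratio = gpt4_params / atacama_params
print(f"{ratio:,.0f}")   # roughly 219 million
```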

## Ethical Considerations

### Risks

- **Overconfidence**: The model displays certainty even when wrong or out-of-domain
- **Misuse**: Should not be used for actual weather decisions
- **Misleading**: The label "language model" may imply capabilities it doesn't have

### Mitigations

- Clear documentation of limitations
- Humorous framing to discourage serious misuse
- Open source to enable inspection
- Educational focus
## Citation

```bibtex
@misc{lamb2025atacama,
  author       = {Lamb, Nick},
  title        = {Atacama: A 7,762-Parameter Language Model},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/nickjlamb/atacama}}
}
```

## Additional Resources

- **Live Demo**: [pharmatools.ai/atacama](https://www.pharmatools.ai/atacama)
- **GitHub**: [github.com/nickjlamb/atacama](https://github.com/nickjlamb/atacama)
## Model Card Contact

For questions or concerns, please open an issue on [GitHub](https://github.com/nickjlamb/atacama/issues).

---

**Model Card Authors:** Nick Lamb

**Last Updated:** February 2026