---
language:
- en
license: mit
library_name: picochat
tags:
- pytorch
- mps
- macbook
- education
datasets:
- HuggingFaceFW/fineweb-edu
pipeline_tag: text-generation
inference: false
spaces:
- MGow/PicoChat
---
# PicoChat
**PicoChat** is a 335M parameter language model trained entirely from scratch on a MacBook Air M2 (16GB RAM) in approximately 6 days. The code is based on Andrej Karpathy's
[NanoChat](https://github.com/karpathy/nanochat), adapted to run on an M2 MacBook Air; the port lives at [PicoChat](https://github.com/MichalGow/PicoChat).
It serves as a "lab notebook" proof-of-concept for training capable small language models (SLMs) on consumer hardware using pure PyTorch and MPS (Metal Performance Shaders).
> **Links:**
> - **Space:** https://huggingface.co/spaces/MGow/PicoChat
## Model Details
- **Architecture:** GPT-style Transformer (Decoder-only)
- **Parameters:** ~335 Million
- **Layers:** 16
- **Embedding Dimension:** 1024
- **Heads:** 8 Query heads, 8 KV heads (GQA)
- **Context Length:** 1024 tokens
- **Vocabulary:** 65,536 (Custom BPE)
- **Training Data:** ~377 Million tokens
- **Precision:** Trained in mixed precision (bfloat16/float32) on MPS.
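As a sanity check, the hyperparameters above do add up to roughly 335M parameters. The sketch below assumes a 4× MLP expansion (not stated in this card, but standard in nanochat-style models); with 8 query and 8 KV heads the attention projections are full-width, and RoPE plus the parameter-free RMSNorm contribute no weights.

```python
# Back-of-the-envelope parameter count from the hyperparameters above.
# Assumption (not in the card): 4x MLP expansion, as in nanochat.
vocab, d_model, n_layers = 65_536, 1024, 16

embeddings = 2 * vocab * d_model          # untied input + output embedding matrices
attention  = 4 * d_model * d_model        # Q, K, V, O projections (8 Q = 8 KV heads, no GQA shrinkage)
mlp        = 2 * d_model * (4 * d_model)  # up- and down-projections with 4x expansion
# RoPE and parameter-free RMSNorm add no weights.

total = embeddings + n_layers * (attention + mlp)
print(f"{total / 1e6:.1f}M parameters")  # -> 335.5M parameters
```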
### Key Features
- **Rotary Embeddings (RoPE)** (No absolute positional embeddings)
- **ReLU²** (squared ReLU) activations in the MLP
- **RMSNorm** (with no learnable parameters)
- **Untied embeddings** (Input and Output embeddings are separate matrices)
- **Grouped Query Attention (GQA)** supported (configured here as 1:1 for simplicity)
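The "no learnable parameters" variant of RMSNorm simply rescales each vector by its root-mean-square, with no gain vector to train. A minimal stdlib-only sketch (not the picochat implementation, which operates on PyTorch tensors):

```python
import math

def rms_norm(x, eps=1e-6):
    # Parameter-free RMSNorm: divide by the root-mean-square of the vector.
    # No learnable gain/bias, unlike LayerNorm or standard RMSNorm.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

print(rms_norm([3.0, 4.0]))  # normalized vector has RMS ~= 1
```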
## Training Recipe
The model was trained in three phases using the [nanochat](https://github.com/karpathy/nanochat) framework, adapted for macOS:
1. **Base Pretraining (~5 days):**
- **Data:** [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) (100B subset, shuffled)
- **Steps:** ~60,000
- **Tokens:** ~442M
- **Objective:** Next token prediction
2. **Midtraining (~16 hours):**
- **Data:** Mixed pretraining data + synthetic conversation/instruction formats.
- **Tokens:** ~33M
- **Objective:** Adaptation to chat format and Q&A style.
3. **Supervised Finetuning (SFT) (~4 hours):**
- **Data Mixture:**
- [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) (10k examples)
- [GSM8K](https://huggingface.co/datasets/gsm8k) (8k examples)
- [ARC-Easy & Challenge](https://huggingface.co/datasets/ai2_arc) (~3.4k examples)
- Identity & Synthetic Spelling tasks (~1.6k examples)
- **Steps:** 1,000 (Batch size 8)
- **Tokens:** ~1M
- **Objective:** Instruction following and personality alignment.
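The pretraining numbers above imply a rough throughput figure, assuming the full ~5 days were spent in the training loop (ignoring checkpointing and evaluation):

```python
# Implied base-pretraining throughput from the recipe above (~442M tokens in ~5 days).
tokens, days = 442e6, 5
tps = tokens / (days * 24 * 3600)
print(f"~{tps:.0f} tokens/sec")  # on the order of 1,000 tok/s on the M2
```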
## Character & Limitations
- **Personality:** The model is designed to be "a bit silly, often wrong, sometimes delightful." It is not a rigid assistant but a fun research artifact.
- **Hallucinations:** As a small ~335M-parameter model trained on limited data, it will confidently hallucinate facts.
- **Context Window:** Limited to 1024 tokens.
- **Safety:** The model has not gone through extensive safety alignment or RLHF. It generally behaves like a base model with some instruction following capabilities.
## Usage
This model requires the [picochat](https://github.com/MichalGow/PicoChat) library to run, as it uses a custom architecture implementation optimized for educational clarity and hackability.
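The picochat inference API is not documented here; see the repository for actual usage. As a library-agnostic illustration of the `text-generation` decoding step (all names below are hypothetical, not the picochat API), a single temperature + top-k sampling step over raw logits looks like this:

```python
import math
import random

def sample_next(logits, temperature=0.8, top_k=50, rng=random):
    # Illustrative temperature + top-k sampling over a plain list of logits.
    # Not the picochat API; a real model would return a logits tensor per step.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)                              # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    r = rng.random() * sum(exps)
    acc = 0.0
    for idx, e in zip(top, exps):
        acc += e
        if r <= acc:
            return idx
    return top[-1]
```

Greedy decoding falls out as the special case `top_k=1`; in a generation loop, the sampled token id is appended to the context and the model is re-run until the 1024-token context window is full.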
## License
MIT