---
language:
- en
license: mit
library_name: picochat
tags:
- pytorch
- mps
- macbook
- education
datasets:
- HuggingFaceFW/fineweb-edu
pipeline_tag: text-generation
inference: false
spaces:
- MGow/PicoChat
---
|
|
|
|
|
# PicoChat |
|
|
|
|
|
**PicoChat** is a 335M-parameter language model trained entirely from scratch on a MacBook Air M2 (16 GB RAM) in approximately 6 days. The code is based on Andrej Karpathy's [NanoChat](https://github.com/karpathy/nanochat), adapted to run on an M2 MacBook Air; the adapted code is available at [PicoChat](https://github.com/MichalGow/PicoChat).
|
|
It serves as a "lab notebook" proof-of-concept for training capable small language models (SLMs) on consumer hardware using pure PyTorch and MPS (Metal Performance Shaders). |
|
|
|
|
|
> **Links:**
> - **Space:** https://huggingface.co/spaces/MGow/PicoChat
|
|
|
|
|
## Model Details |
|
|
|
|
|
- **Architecture:** GPT-style Transformer (decoder-only)
- **Parameters:** ~335 million
- **Layers:** 16
- **Embedding dimension:** 1024
- **Heads:** 8 query heads, 8 KV heads (a 1:1 ratio, i.e. standard multi-head attention)
- **Context length:** 1024 tokens
- **Vocabulary:** 65,536 (custom BPE)
- **Training data:** ~377 million tokens
- **Precision:** Trained in mixed precision (bfloat16/float32) on MPS
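The figures above are enough for a back-of-the-envelope parameter count. The bias-free linear layers and 4× MLP expansion below are assumptions (common in NanoChat-style models), not values stated in this card:

```python
# Back-of-the-envelope parameter count from the hyperparameters above.
# Assumptions: bias-free linear layers and a 4x MLP expansion.
n_layer, d_model, vocab = 16, 1024, 65_536

embeddings = 2 * vocab * d_model             # untied input + output embeddings
attn = 4 * d_model * d_model                 # Wq, Wk, Wv, Wo per layer (8:8 heads)
mlp = 2 * d_model * (4 * d_model)            # up- and down-projection per layer
total = embeddings + n_layer * (attn + mlp)  # the RMSNorm layers add no parameters

print(f"{total / 1e6:.1f}M")                 # 335.5M, matching the ~335M figure
```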
|
|
|
|
|
### Key Features

- **Rotary embeddings (RoPE)** (no absolute positional embeddings)
- **ReLU²** (squared ReLU) activations in the MLP
- **RMSNorm** (with no learnable parameters)
- **Untied embeddings** (input and output embeddings are separate matrices)
- **Grouped Query Attention (GQA)** supported (configured here at 1:1 for simplicity, equivalent to standard multi-head attention)
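Parameter-free RMSNorm is simple enough to sketch in a few lines: each feature vector is divided by the root-mean-square of its components, with no learnable gain or bias. A minimal pure-Python illustration (the real implementation operates on PyTorch tensors):

```python
import math

def rmsnorm(x, eps=1e-6):
    # Parameter-free RMSNorm: divide by the root-mean-square of the features.
    # No learnable gain or bias, matching the configuration described above.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

y = rmsnorm([3.0, 4.0])  # RMS of [3, 4] is ~3.536, so y is roughly [0.849, 1.131]
```

After normalization the features have unit RMS, which keeps activation scales stable without adding any parameters to the model.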
|
|
|
|
|
## Training Recipe |
|
|
|
|
|
The model was trained in three phases using the [nanochat](https://github.com/karpathy/nanochat) framework, adapted for macOS: |
|
|
|
|
|
1. **Base Pretraining (~5 days):**
   - **Data:** [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) (100B subset, shuffled)
   - **Steps:** ~60,000
   - **Tokens:** ~442M
   - **Objective:** Next-token prediction

2. **Midtraining (~16 hours):**
   - **Data:** Mixed pretraining data + synthetic conversation/instruction formats
   - **Tokens:** ~33M
   - **Objective:** Adaptation to chat format and Q&A style

3. **Supervised Finetuning (SFT) (~4 hours):**
   - **Data mixture:**
     - [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) (10k examples)
     - [GSM8K](https://huggingface.co/datasets/gsm8k) (8k examples)
     - [ARC-Easy & Challenge](https://huggingface.co/datasets/ai2_arc) (~3.4k examples)
     - Identity & synthetic spelling tasks (~1.6k examples)
   - **Steps:** 1,000 (batch size 8)
   - **Tokens:** ~1M
   - **Objective:** Instruction following and personality alignment
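The pretraining figures above imply a rough throughput for the M2's GPU via MPS. This assumes the ~5-day run was continuous; any interruptions or overhead would lower the effective rate:

```python
# Rough throughput implied by the recipe above: ~442M tokens over ~5 days.
# Assumes a continuous run; real wall-clock overhead would lower this number.
tokens = 442e6
seconds = 5 * 24 * 3600
rate = tokens / seconds

print(f"{rate:.0f} tokens/sec")  # ~1023 tokens/sec
```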
|
|
|
|
|
## Character & Limitations |
|
|
|
|
|
- **Personality:** The model is designed to be "a bit silly, often wrong, sometimes delightful." It is not a rigid assistant but a fun research artifact.
- **Hallucinations:** As a small ~335M-parameter model trained on limited data, it will confidently hallucinate facts.
- **Context window:** Limited to 1024 tokens.
- **Safety:** The model has not gone through extensive safety alignment or RLHF. It generally behaves like a base model with some instruction-following capabilities.
|
|
|
|
|
## Usage |
|
|
|
|
|
This model requires the [picochat](https://github.com/MichalGow/PicoChat) library to run, as it uses a custom architecture implementation optimized for educational clarity and hackability. |
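Since PicoChat is derived from NanoChat, getting started presumably looks something like the following. The editable install and the `scripts.chat_cli` entry point are assumptions carried over from NanoChat's layout; check the [PicoChat](https://github.com/MichalGow/PicoChat) README for the exact commands:

```shell
# Hypothetical setup sketch; verify script names against the PicoChat README.
git clone https://github.com/MichalGow/PicoChat
cd PicoChat
pip install -e .
python -m scripts.chat_cli   # assumed chat entry point, inherited from NanoChat
```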
|
|
|
|
|
## License |
|
|
|
|
|
MIT |
|
|
|