---
language:
- en
license: mit
library_name: picochat
tags:
- pytorch
- mps
- macbook
- education
datasets:
- HuggingFaceFW/fineweb-edu
pipeline_tag: text-generation
inference: false
spaces:
- MGow/PicoChat
---

# PicoChat

**PicoChat** is a 335M parameter language model trained entirely from scratch on a MacBook Air M2 (16GB RAM) in approximately 6 days. The code is based on Andrej Karpathy's [NanoChat](https://github.com/karpathy/nanochat) and was adapted to run on an M2 MacBook Air at [PicoChat](https://github.com/MichalGow/PicoChat). It serves as a "lab notebook" proof-of-concept for training capable small language models (SLMs) on consumer hardware using pure PyTorch and MPS (Metal Performance Shaders).

> **Links:**
> - **Space:** https://huggingface.co/spaces/MGow/PicoChat

## Model Details

- **Architecture:** GPT-style Transformer (decoder-only)
- **Parameters:** ~335 million
- **Layers:** 16
- **Embedding Dimension:** 1024
- **Heads:** 8 query heads, 8 KV heads (GQA)
- **Context Length:** 1024 tokens
- **Vocabulary:** 65,536 (custom BPE)
- **Training Data:** ~377 million tokens
- **Precision:** Trained in mixed precision (bfloat16/float32) on MPS

### Key Features

- **Rotary Embeddings (RoPE)** (no absolute positional embeddings)
- **SwiGLU** (ReLU²) activations
- **RMSNorm** (with no learnable parameters)
- **Untied embeddings** (input and output embeddings are separate matrices)
- **Grouped Query Attention (GQA)** supported (configured here as 1:1 for simplicity)

## Training Recipe

The model was trained in three phases using the [nanochat](https://github.com/karpathy/nanochat) framework, adapted for macOS:

1. **Base Pretraining (~5 days):**
   - **Data:** [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) (100B-token subset, shuffled)
   - **Steps:** ~60,000
   - **Tokens:** ~442M
   - **Objective:** Next-token prediction
2. **Midtraining (~16 hours):**
   - **Data:** Mixed pretraining data + synthetic conversation/instruction formats
   - **Tokens:** ~33M
   - **Objective:** Adaptation to chat format and Q&A style
3. **Supervised Finetuning (SFT) (~4 hours):**
   - **Data Mixture:**
     - [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) (10k examples)
     - [GSM8K](https://huggingface.co/datasets/gsm8k) (8k examples)
     - [ARC-Easy & Challenge](https://huggingface.co/datasets/ai2_arc) (~3.4k examples)
     - Identity & synthetic spelling tasks (~1.6k examples)
   - **Steps:** 1,000 (batch size 8)
   - **Tokens:** ~1M
   - **Objective:** Instruction following and personality alignment

## Character & Limitations

- **Personality:** The model is designed to be "a bit silly, often wrong, sometimes delightful." It is not a rigid assistant but a fun research artifact.
- **Hallucinations:** As a small ~335M model trained on limited data, it will confidently hallucinate facts.
- **Context Window:** Limited to 1024 tokens.
- **Safety:** The model has not gone through extensive safety alignment or RLHF. It generally behaves like a base model with some instruction-following capabilities.

## Usage

This model requires the [picochat](https://github.com/MichalGow/PicoChat) library to run, as it uses a custom architecture implementation optimized for educational clarity and hackability. An illustrative sketch of the core architectural components is given in the appendix below.

## License

MIT
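## Appendix: Architecture Sketch (illustrative)

The picochat loading and sampling API is not reproduced here. Instead, the snippet below is a minimal, self-contained PyTorch sketch of two of the architectural choices listed under Key Features: RMSNorm with no learnable parameters and rotary position embeddings (RoPE), applied inside causal attention with the card's shapes (8 heads, 1024-dim model, 1024-token context). The helper names (`rms_norm`, `rope_cache`, `apply_rope`) are illustrative and are not taken from the picochat source.

```python
import torch
import torch.nn.functional as F

def rms_norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # RMSNorm with no learnable parameters: divide by the RMS of the last dim only.
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

def rope_cache(seq_len: int, head_dim: int, base: float = 10000.0):
    # Precompute cos/sin tables for rotary position embeddings.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    t = torch.arange(seq_len).float()
    freqs = torch.outer(t, inv_freq)          # (seq_len, head_dim // 2)
    return freqs.cos(), freqs.sin()

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # Rotate channel pairs ("rotate-half" convention); x: (batch, heads, seq, head_dim).
    x1, x2 = x.chunk(2, dim=-1)
    cos, sin = cos[None, None, :, :], sin[None, None, :, :]
    return torch.cat((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)

# Toy shapes matching the card: 8 heads, 1024-dim embeddings -> 128-dim heads, 1024-token context.
B, H, T, D = 1, 8, 1024, 128
x = torch.randn(B, T, H * D)
x = rms_norm(x)                                # parameter-free pre-norm on the hidden state

q = torch.randn(B, H, T, D)                    # stand-ins for projected queries/keys/values
k = torch.randn(B, H, T, D)
v = torch.randn(B, H, T, D)
cos, sin = rope_cache(T, D)
q, k = apply_rope(q, cos, sin), apply_rope(k, cos, sin)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 1024, 128])
```

With an 8:8 query-to-KV head configuration (the 1:1 GQA ratio mentioned above), `k` and `v` keep the same head count as `q`; a smaller KV head count would be broadcast across query-head groups instead.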