---
language:
- en
license: mit
library_name: picochat
tags:
- pytorch
- mps
- macbook
- text-generation-inference
- education
datasets:
- HuggingFaceFW/fineweb-edu
pipeline_tag: text-generation
inference: false
---

# PicoChat

**PicoChat** is a 335M-parameter language model trained entirely from scratch on a MacBook Air M2 (16 GB RAM) in approximately 7 days.
It serves as a "lab notebook" proof of concept for training capable small language models (SLMs) on consumer hardware using pure PyTorch and MPS (Metal Performance Shaders).

> **Note:** This repository contains the **model weights**. For the interactive chat interface, visit the [Space](https://huggingface.co/spaces/MGow/PicoChat).

## Model Details

- **Architecture:** GPT-style Transformer (decoder-only; sketched as a config below)
- **Parameters:** ~335 million
- **Layers:** 16
- **Embedding Dimension:** 1024
- **Heads:** 8 query heads, 8 KV heads (GQA)
- **Context Length:** 1024 tokens
- **Vocabulary:** 65,536 (custom BPE)
- **Training Data:** ~377 million tokens
- **Precision:** Trained in mixed precision (bfloat16/float32) on MPS
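
For orientation, the hyperparameters above map roughly to a config like the following. This is a sketch: the field names are assumptions based on nanochat's `GPTConfig`, and the authoritative values live in this repo's `meta.json`.

```python
# A rough sketch of the architecture above as a nanochat-style config.
# Field names are assumptions -- check nanochat.gpt.GPTConfig and this
# repo's meta.json for the authoritative spelling and values.
from nanochat.gpt import GPTConfig

config = GPTConfig(
    sequence_len=1024,   # context length
    vocab_size=65536,    # custom BPE vocabulary
    n_layer=16,          # transformer blocks
    n_head=8,            # query heads
    n_kv_head=8,         # KV heads (1:1 here, so effectively MHA)
    n_embd=1024,         # embedding dimension
)
```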

### Key Features
- **Rotary Embeddings (RoPE)** (no absolute positional embeddings)
- **ReLU²** (squared ReLU) activations in the MLP
- **RMSNorm** with no learnable parameters (see the sketch after this list)
- **Untied embeddings** (input and output embeddings are separate matrices)
- **Grouped Query Attention (GQA)** supported (configured here as 1:1, i.e. 8 query to 8 KV heads, so effectively standard multi-head attention)
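
A minimal sketch of the two less common pieces, parameter-free RMSNorm and the squared-ReLU activation, written from the descriptions above rather than copied from nanochat's source:

```python
import torch
import torch.nn.functional as F

def rms_norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # RMSNorm with no learnable gain/bias: rescale by the root-mean-square.
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

def mlp_activation(x: torch.Tensor) -> torch.Tensor:
    # ReLU-squared ("ReLU²"): relu(x)^2, used in place of GELU/SwiGLU.
    return F.relu(x).square()
```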

## Training Recipe

The model was trained in three phases using the [nanochat](https://github.com/karpathy/nanochat) framework, adapted for macOS:

1. **Base Pretraining (~6 days):**
   - **Data:** [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) (100B-token sample, shuffled)
   - **Steps:** ~48,000
   - **Tokens:** ~344M
   - **Objective:** Next-token prediction (see the sketch after this list)

2. **Midtraining (~16 hours):**
   - **Data:** Mixed pretraining data + synthetic conversation/instruction formats
   - **Tokens:** ~33M
   - **Objective:** Adaptation to chat format and Q&A style

3. **Supervised Finetuning (SFT) (~4 hours):**
   - **Data Mixture:**
     - [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) (10k examples)
     - [GSM8K](https://huggingface.co/datasets/gsm8k) (8k examples)
     - [ARC-Easy & Challenge](https://huggingface.co/datasets/ai2_arc) (~3.4k examples)
     - Identity & synthetic spelling tasks (~1.6k examples)
   - **Steps:** 1,000 (batch size 8)
   - **Tokens:** ~1M
   - **Objective:** Instruction following and personality alignment
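
All three phases optimize the same next-token objective; only the data mix changes. The sketch below shows what one training step on MPS looks like in spirit. It is illustrative only: the actual training scripts live in the nanochat repo, and the `training_step` helper, the `(B, T+1)` batch shape, and the assumption that the model's forward returns raw logits are all ours, not nanochat's.

```python
import torch
import torch.nn.functional as F

device = "mps" if torch.backends.mps.is_available() else "cpu"

def training_step(model, optimizer, batch: torch.Tensor) -> float:
    # batch: (B, T+1) token ids; inputs are tokens 0..T-1, targets are 1..T.
    inputs, targets = batch[:, :-1].to(device), batch[:, 1:].to(device)
    logits = model(inputs)  # (B, T, vocab); assumes forward returns logits
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
    )
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    return loss.item()
```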

## Character & Limitations

- **Personality:** The model is designed to be "a bit silly, often wrong, sometimes delightful." It is not a rigid assistant but a fun research artifact.
- **Hallucinations:** As a small ~335M model trained on limited data, it will confidently hallucinate facts.
- **Context Window:** Limited to 1024 tokens.
- **Safety:** The model has not gone through extensive safety alignment or RLHF. It generally behaves like a base model with some instruction-following capabilities.

## Evaluation

| Metric | Score | Note |
| :--- | :--- | :--- |
| **MMLU** | 26.8% | Near random baseline (25%) |
| **ARC-Easy** | 25.2% | Near random baseline (25%) |

*Note: This is a small ~335M model trained on <1B tokens. It is not expected to achieve high benchmark scores, but it demonstrates end-to-end coherence.*

## Compute & Efficiency

- **Hardware:** MacBook Air M2 (2022)
- **RAM:** 16 GB unified memory
- **Power Consumption:** ~35 W peak
- **Total Energy:** ~5 kWh (~$0.50)
- **Throughput:** ~1,500-2,000 tokens/sec (varies with thermal throttling)
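
As a sanity check on the energy figure: running at the ~35 W peak for the full 7 days would give an upper bound of 35 W × 24 h × 7 ≈ 5.9 kWh, so ~5 kWh at a below-peak average draw is plausible; the ~$0.50 estimate assumes electricity at roughly $0.10 per kWh.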

## Usage

This model requires the `nanochat` library to run, as it uses a custom architecture implementation optimized for educational clarity and hackability.

```python
import json

import torch
from nanochat.gpt import GPT, GPTConfig
from nanochat.tokenizer import RustBPETokenizer

# 1. Load the configuration
# (Ensure you have meta.json and tokenizer.pkl downloaded)
with open("meta.json", "r") as f:
    config = json.load(f)["model_config"]

# 2. Initialize the model
model = GPT(GPTConfig(**config))

# 3. Load the weights
sd = torch.load("model.pt", map_location="cpu", weights_only=True)
# Strip torch.compile prefixes if present
sd = {k.replace("_orig_mod.", ""): v for k, v in sd.items()}
model.load_state_dict(sd)
model.eval()

# 4. Generate
# ... (Requires tokenizer loading and Engine setup, see app.py in the Space)
```
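
For the chat template and the real `Engine` setup, see `app.py` in the Space. As a rough standalone alternative, a plain sampling loop might look like the sketch below. It rests on assumptions: that the loaded tokenizer (stood in for by `tok` here) exposes `encode`/`decode`, and that `model(idx)` returns logits of shape `(batch, time, vocab)` -- check these against your installed nanochat version.

```python
device = "mps" if torch.backends.mps.is_available() else "cpu"
model.to(device)

# `tok` stands in for the loaded tokenizer (see app.py for the real loading).
prompt_ids = tok.encode("The capital of France is")
idx = torch.tensor([prompt_ids], dtype=torch.long, device=device)

with torch.no_grad():
    for _ in range(64):
        logits = model(idx[:, -1024:])  # crop to the 1024-token context
        probs = torch.softmax(logits[:, -1, :] / 0.8, dim=-1)  # temperature 0.8
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)

print(tok.decode(idx[0].tolist()))
```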

## License

MIT