---
language:
- en
license: mit
library_name: picochat
tags:
- pytorch
- mps
- macbook
- education
datasets:
- HuggingFaceFW/fineweb-edu
pipeline_tag: text-generation
inference: false
spaces:
- MGow/PicoChat
---

# PicoChat

**PicoChat** is a 335M-parameter language model trained entirely from scratch on a MacBook Air M2 (16GB RAM) in approximately 6 days. The code is based on Andrej Karpathy's [NanoChat](https://github.com/karpathy/nanochat) and was adapted to run on an M2 MacBook Air; the adapted code lives at [PicoChat](https://github.com/MichalGow/PicoChat).
It serves as a "lab notebook" proof of concept for training capable small language models (SLMs) on consumer hardware using pure PyTorch and MPS (Metal Performance Shaders).

> **Links:**
> - **Space:** https://huggingface.co/spaces/MGow/PicoChat

## Model Details

- **Architecture:** GPT-style Transformer (Decoder-only)
- **Parameters:** ~335 Million
- **Layers:** 16
- **Embedding Dimension:** 1024
- **Heads:** 8 Query heads, 8 KV heads (GQA)
- **Context Length:** 1024 tokens
- **Vocabulary:** 65,536 (Custom BPE)
- **Training Data:** ~377 Million tokens
- **Precision:** Trained in mixed precision (bfloat16/float32) on MPS.
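
The numbers above correspond to a configuration roughly like the sketch below (field names are illustrative, not the actual picochat config class):

```python
from dataclasses import dataclass

@dataclass
class PicoChatConfig:
    # Illustrative config mirroring the hyperparameters listed above;
    # not the actual picochat configuration object.
    n_layer: int = 16          # transformer blocks
    n_embd: int = 1024         # embedding dimension
    n_head: int = 8            # query heads
    n_kv_head: int = 8         # KV heads (GQA, 1:1 ratio here)
    vocab_size: int = 65536    # custom BPE vocabulary
    block_size: int = 1024     # context length in tokens
```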

### Key Features
- **Rotary Embeddings (RoPE)** (No absolute positional embeddings)
- **ReLU²** (squared ReLU) MLP activations
- **RMSNorm** (with no learnable parameters)
- **Untied embeddings** (Input and Output embeddings are separate matrices)
- **Grouped Query Attention (GQA)** supported (configured here as 1:1 for simplicity)
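
Of these, the parameter-free RMSNorm is simple enough to show in isolation (a minimal sketch, not the picochat implementation):

```python
import torch

def rmsnorm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Rescale by the root-mean-square over the last dimension; no learnable gain.
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
```

Untying the input and output embeddings means the two 65,536 × 1,024 matrices alone account for roughly 134M of the ~335M parameters.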

## Training Recipe

The model was trained in three phases using the [nanochat](https://github.com/karpathy/nanochat) framework, adapted for macOS:

1.  **Base Pretraining (~5 days):**
    -   **Data:** [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) (100B subset, shuffled)
    -   **Steps:** ~60,000
    -   **Tokens:** ~442M
    -   **Objective:** Next token prediction

2.  **Midtraining (~16 hours):**
    -   **Data:** Mixed pretraining data + synthetic conversation/instruction formats.
    -   **Tokens:** ~33M
    -   **Objective:** Adaptation to chat format and Q&A style.

3.  **Supervised Finetuning (SFT) (~4 hours):**
    -   **Data Mixture:**
        -   [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) (10k examples)
        -   [GSM8K](https://huggingface.co/datasets/gsm8k) (8k examples)
        -   [ARC-Easy & Challenge](https://huggingface.co/datasets/ai2_arc) (~3.4k examples)
        -   Identity & Synthetic Spelling tasks (~1.6k examples)
    -   **Steps:** 1,000 (Batch size 8)
    -   **Tokens:** ~1M
    -   **Objective:** Instruction following and personality alignment.
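
All three phases optimize plain next-token prediction. A minimal sketch of one training step under the precision setup described above (bfloat16 autocast on MPS, falling back to CPU) follows; `model`, `optimizer`, and the batch tensor are placeholders rather than the picochat training loop, and it assumes a PyTorch build with autocast support for MPS:

```python
import torch
import torch.nn.functional as F

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

def train_step(model, optimizer, tokens):
    # tokens: (batch, seq_len + 1) integer token IDs
    inputs, targets = tokens[:, :-1].to(device), tokens[:, 1:].to(device)
    with torch.autocast(device_type=device.type, dtype=torch.bfloat16):
        logits = model(inputs)                    # (batch, seq_len, vocab)
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),  # flatten batch and time
            targets.reshape(-1),
        )
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    return loss.item()
```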

## Character & Limitations

- **Personality:** The model is designed to be "a bit silly, often wrong, sometimes delightful." It is not a rigid assistant but a fun research artifact.
- **Hallucinations:** As a small ~335M-parameter model trained on limited data, it will confidently hallucinate facts.
- **Context Window:** Limited to 1024 tokens.
- **Safety:** The model has not gone through extensive safety alignment or RLHF. It generally behaves like a base model with some instruction following capabilities.

## Usage

This model requires the [picochat](https://github.com/MichalGow/PicoChat) library to run, as it uses a custom architecture implementation optimized for educational clarity and hackability.
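
The actual loading and chat entry points live in that repository; the sketch below only illustrates the generic autoregressive sampling loop such a model runs under the hood. The `model` and `tokenizer` objects and their methods are placeholders, not the picochat API:

```python
import torch

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

@torch.no_grad()
def generate(model, tokenizer, prompt: str, max_new_tokens: int = 64, temperature: float = 0.8):
    # Placeholder interfaces: tokenizer.encode/decode and a model returning (B, T, vocab) logits.
    ids = torch.tensor([tokenizer.encode(prompt)], device=device)
    for _ in range(max_new_tokens):
        logits = model(ids[:, -1024:])                         # stay within the 1024-token context
        probs = torch.softmax(logits[:, -1, :] / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)      # sample the next token
        ids = torch.cat([ids, next_id], dim=1)
    return tokenizer.decode(ids[0].tolist())
```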

## License

MIT