Committed by MGow
Commit 2fdfea9 · verified · 1 Parent(s): 0de7e0e

Create README.md

Files changed (1)
  1. README.md +125 -0
README.md ADDED
@@ -0,0 +1,125 @@
+ ---
+ language:
+ - en
+ license: mit
+ library_name: picochat
+ tags:
+ - pytorch
+ - mps
+ - macbook
+ - text-generation-inference
+ - education
+ datasets:
+ - HuggingFaceFW/fineweb-edu
+ pipeline_tag: text-generation
+ inference: false
+ ---
+
+ # PicoChat
+
+ **PicoChat** is a 335M-parameter language model trained entirely from scratch on a MacBook Air M2 (16 GB RAM) in approximately 7 days.
+ It serves as a "lab notebook" proof of concept for training capable small language models (SLMs) on consumer hardware using pure PyTorch and MPS (Metal Performance Shaders).
+
+ > **Note:** This repository contains the **model weights**. For the interactive chat interface, visit the [Space](https://huggingface.co/spaces/MGow/PicoChat).
+
+ ## Model Details
+
+ - **Architecture:** GPT-style Transformer (decoder-only)
+ - **Parameters:** ~335 million
+ - **Layers:** 16
+ - **Embedding Dimension:** 1024
+ - **Heads:** 8 query heads, 8 KV heads (GQA at a 1:1 ratio)
+ - **Context Length:** 1024 tokens
+ - **Vocabulary:** 65,536 (custom BPE)
+ - **Training Data:** ~377 million tokens
+ - **Precision:** Trained in mixed precision (bfloat16/float32) on MPS.
+
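As a rough sanity check, the ~335M figure can be reproduced from the numbers listed above. The sketch below is a back-of-the-envelope estimate, not nanochat's own accounting. It assumes bias-free linear layers, a two-matrix MLP with a 4x hidden expansion, and no parameters in the norms or rotary embeddings. `estimate_params` and `mlp_ratio` are illustrative names, not part of any library.

```python
def estimate_params(n_layer=16, n_embd=1024, vocab_size=65_536, mlp_ratio=4):
    """Estimate the parameter count of a bias-free GPT-style decoder.

    Assumptions (not confirmed by the model card): no biases, a two-matrix
    MLP with hidden size mlp_ratio * n_embd, and parameter-free norms.
    """
    # Untied embeddings: one input matrix and one separate output matrix.
    embeddings = 2 * vocab_size * n_embd
    # Q, K, V and output projections; with 8 query and 8 KV heads,
    # each projection is a full n_embd x n_embd matrix.
    attention = 4 * n_embd * n_embd
    # Up-projection and down-projection of the MLP.
    mlp = 2 * n_embd * (mlp_ratio * n_embd)
    return embeddings + n_layer * (attention + mlp)


total = estimate_params()
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

Under these assumptions the estimate comes to 335,544,320, of which about 134M sit in the two embedding matrices.
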
+ ### Key Features
+ - **Rotary Embeddings (RoPE)** (no absolute positional embeddings)
+ - **ReLU²** (squared ReLU) activations in the MLP, rather than a gated SwiGLU
+ - **RMSNorm** (with no learnable parameters)
+ - **Untied embeddings** (input and output embeddings are separate matrices)
+ - **Grouped Query Attention (GQA)** supported (configured here as 1:1 for simplicity)
+
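To make two of these building blocks concrete, here is a minimal plain-Python sketch of a parameter-free RMSNorm and a squared-ReLU activation acting on a single vector. It is an illustration only and not the nanochat implementation, which operates on PyTorch tensors. The function names and the `eps` value are assumptions.

```python
import math


def rms_norm(x, eps=1e-6):
    """Rescale a vector to (approximately) unit root-mean-square.

    There is no learnable gain or bias; eps is an assumed stabilizer value.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale for v in x]


def relu_squared(x):
    """Apply ReLU, then square: negatives map to 0, positives to v**2."""
    return [max(v, 0.0) ** 2 for v in x]


hidden = [3.0, -4.0, 0.0, 5.0]
normed = rms_norm(hidden)
activated = relu_squared(normed)
print(normed)
print(activated)
```

Because the norm carries no weights, it contributes nothing to the parameter count.
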
+ ## Training Recipe
+
+ The model was trained in three phases using the [nanochat](https://github.com/karpathy/nanochat) framework, adapted for macOS:
+
+ 1. **Base Pretraining (~6 days):**
+    - **Data:** [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) (100B subset, shuffled)
+    - **Steps:** ~48,000
+    - **Tokens:** ~344M
+    - **Objective:** Next-token prediction
+
+ 2. **Midtraining (~16 hours):**
+    - **Data:** Mixed pretraining data + synthetic conversation/instruction formats.
+    - **Tokens:** ~33M
+    - **Objective:** Adaptation to chat format and Q&A style.
+
+ 3. **Supervised Finetuning (SFT) (~4 hours):**
+    - **Data Mixture:**
+      - [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) (10k examples)
+      - [GSM8K](https://huggingface.co/datasets/gsm8k) (8k examples)
+      - [ARC-Easy & Challenge](https://huggingface.co/datasets/ai2_arc) (~3.4k examples)
+      - Identity & Synthetic Spelling tasks (~1.6k examples)
+    - **Steps:** 1,000 (Batch size 8)
+    - **Tokens:** ~1M
+    - **Objective:** Instruction following and personality alignment.
+
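The per-phase figures above can be tallied with a few lines of arithmetic. This sketch uses only the approximate numbers quoted in the recipe, so the results are rounded estimates; the variable names are illustrative.

```python
# Approximate token counts and wall-clock hours per phase, as quoted above.
phases = {
    "base pretraining": {"tokens": 344_000_000, "hours": 6 * 24},
    "midtraining": {"tokens": 33_000_000, "hours": 16},
    "sft": {"tokens": 1_000_000, "hours": 4},
}

total_tokens = sum(p["tokens"] for p in phases.values())
total_hours = sum(p["hours"] for p in phases.values())

# Average tokens consumed per optimizer step during base pretraining.
base_tokens_per_step = phases["base pretraining"]["tokens"] / 48_000

print(f"total tokens: ~{total_tokens / 1e6:.0f}M")
print(f"total time:   ~{total_hours / 24:.1f} days")
print(f"base tokens per step: ~{base_tokens_per_step:.0f}")
```

The rounded phase totals sum to about 378M tokens over roughly 6.8 days, in line with the ~377M tokens and ~7 days quoted elsewhere in this card.
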
+ ## Character & Limitations
+
+ - **Personality:** The model is designed to be "a bit silly, often wrong, sometimes delightful." It is not a rigid assistant but a fun research artifact.
+ - **Hallucinations:** As a small ~335M-parameter model trained on limited data, it will confidently hallucinate facts.
+ - **Context Window:** Limited to 1024 tokens.
+ - **Safety:** The model has not gone through extensive safety alignment or RLHF. It generally behaves like a base model with some instruction-following capabilities.
+
+ ## Evaluation
+
+ | Metric | Score | Note |
+ | :--- | :--- | :--- |
+ | **MMLU** | 26.8% | Near random baseline (25%) |
+ | **ARC-Easy** | 25.2% | Near random baseline (25%) |
+
+ *Note: This is a small ~335M-parameter model trained on ~377M tokens. It is not expected to score well on benchmarks, but it demonstrates end-to-end coherence.*
+
+ ## Compute & Efficiency
+
+ - **Hardware:** MacBook Air M2 (2022)
+ - **RAM:** 16 GB Unified Memory
+ - **Power Consumption:** ~35 W peak
+ - **Total Energy:** ~5 kWh (~$0.50)
+ - **Throughput:** ~1,500-2,000 tokens/sec (varying with thermal throttling)
+
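The energy figure can be cross-checked against the peak power draw and the roughly 7-day run. The sketch below is plain arithmetic on the numbers listed above. The electricity price is not stated in this card and is only inferred from the quoted cost.

```python
PEAK_WATTS = 35           # quoted peak power draw
TRAINING_DAYS = 7         # approximate total training time
REPORTED_KWH = 5.0        # quoted total energy
REPORTED_COST_USD = 0.50  # quoted cost

hours = TRAINING_DAYS * 24

# Upper bound: the machine running at peak power for the whole run.
upper_bound_kwh = PEAK_WATTS * hours / 1000

# Average power draw implied by the quoted energy total.
implied_avg_watts = REPORTED_KWH * 1000 / hours

# Electricity price implied by the quoted cost (an inference, not a stated fact).
implied_usd_per_kwh = REPORTED_COST_USD / REPORTED_KWH

print(f"upper bound at peak power: {upper_bound_kwh:.2f} kWh")
print(f"implied average power:     {implied_avg_watts:.1f} W")
print(f"implied price:             ${implied_usd_per_kwh:.2f}/kWh")
```

Running at peak for the full week would use about 5.9 kWh, so ~5 kWh corresponds to an average draw of about 30 W.
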
+ ## Usage
+
+ This model requires the `nanochat` library to run, as it uses a custom architecture implementation optimized for educational clarity and hackability.
+
+ ```python
+ import json
+
+ import torch
+ from nanochat.gpt import GPT, GPTConfig
+ from nanochat.tokenizer import RustBPETokenizer  # needed for step 4
+
+ # 1. Load Configuration
+ # (Ensure you have meta.json and tokenizer.pkl downloaded)
+ with open("meta.json", "r") as f:
+     config = json.load(f)["model_config"]
+
+ # 2. Initialize Model
+ model = GPT(GPTConfig(**config))
+
+ # 3. Load Weights
+ sd = torch.load("model.pt", map_location="cpu", weights_only=True)
+ # Strip the "_orig_mod." prefix that torch.compile adds to parameter names, if present
+ sd = {k.replace("_orig_mod.", ""): v for k, v in sd.items()}
+ model.load_state_dict(sd)
+ model.eval()
+
+ # 4. Generate
+ # ... (Requires tokenizer loading and Engine setup, see app.py in the Space)
+ ```
+
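The key-renaming step in the snippet above is needed because a module wrapped by `torch.compile` stores its parameters under an `_orig_mod.` prefix, so a checkpoint saved from the compiled model will not match the key names of an uncompiled one. The sketch below shows the effect on a toy dictionary. The key names and placeholder values are hypothetical and are not taken from the actual checkpoint.

```python
def strip_compile_prefix(state_dict, prefix="_orig_mod."):
    """Return a copy of state_dict with the torch.compile prefix removed from keys."""
    return {key.replace(prefix, ""): value for key, value in state_dict.items()}


# Hypothetical key names; the values stand in for weight tensors.
compiled_sd = {
    "_orig_mod.transformer.wte.weight": "tensor-0",
    "_orig_mod.lm_head.weight": "tensor-1",
}

clean_sd = strip_compile_prefix(compiled_sd)
print(sorted(clean_sd))
```

Keys without the prefix pass through unchanged, so the same cleanup is safe for checkpoints saved from an uncompiled model.
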
+ ## License
+
+ MIT