# 🚀 **Pico Train**

Pico Train is a lightweight framework for training language models, from tiny-scale (~1M parameters) to mid-scale (~1B parameters). Its built-in rich checkpointing captures activations, gradients, and model states to support detailed learning dynamics research.

Our **suite of pre-trained models** is publicly available on our [Hugging Face organization](https://huggingface.co/pico-lm), and the companion library [**pico-analyze**](https://github.com/pico-lm/pico-analyze) is available for deeper checkpoint analysis.

> For a **detailed run-through**, check out the **full tutorial** on our website at [picolm.io](https://picolm.io).

---

## **Key Features**

1. **Pico Decoder: LLAMA-style Transformer Architecture**  
   - RMSNorm, RoPE, multi-head self-attention with KV-cache, and SwiGLU activations  
   - Currently supports the **pico-decoder** model, with future expansions planned (pico-diffusion, pico-statespace, etc.)

2. **Comprehensive Checkpoints**  
   - Saves model states, optimizer states, and training metadata  
   - Enriched with **activation and gradient** snapshots for interpretability  

3. **Focused Scale Range**  
   - Optimized to train models from **1M to 1B parameters**, where learning dynamics research is most viable  

4. **Clean, Pre-tokenized Data**
   - Uses a pre-tokenized, pre-shuffled version of [Dolma](https://allenai.org/dolma) that we make available on [Hugging Face](https://huggingface.co/datasets/pico-lm/pretokenized-dolma)  
   - Facilitates training models using identical data for **consistency** and **comparability**

5. **Research Ready**  
   - Minimal, well-documented code suitable for **forking and tailoring**  
   - Logs essential metrics (e.g. perplexity) throughout training  
   - Works seamlessly with [pico-analyze](https://github.com/pico-lm/pico-analyze) for advanced post-training interpretation

---

## **Training Philosophy**

All models in the Pico suite (both pre-trained and user-trained):

- Employ **identical architectures** and **optimizer settings**  
- **Share** the same data order and tokens  
- Automatically log **rich checkpoint data** (including activations, gradients)  
- Facilitate **direct cross-scale comparisons**

This uniformity means you can isolate model size as the primary variable, giving you clearer insights into **how model capacity affects learning**.

---

## **Resources**

- **Pre-trained Models** (1M–1B parameters), publicly hosted on [Hugging Face](https://huggingface.co/pico-lm)
- **Pre-tokenized Datasets** for straightforward streaming-based training  
- **Extensive Checkpoints** logging activation and gradient snapshots  
- **Evaluation Metrics** (perplexity and more) tracked at each checkpoint
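
For readers unfamiliar with the perplexity metric mentioned above, here is a minimal sketch of how it is conventionally computed from per-token losses. The function name `perplexity` and the input format are illustrative assumptions, not part of the pico-train API, and the framework's own evaluation code may differ.

```python
import math


def perplexity(token_nlls):
    """Return exp(mean negative log-likelihood per token), with losses in nats."""
    if not token_nlls:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_nlls) / len(token_nlls))


# A model that assigns probability 0.25 to every token is as uncertain as a
# uniform choice among four options, so its perplexity is approximately 4.
uniform_nlls = [-math.log(0.25)] * 8
print(perplexity(uniform_nlls))
```

Lower perplexity means the model assigns higher probability to the evaluation tokens.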

---

## **Core Components**

- **Pico-Decoder Model**  
  - LLAMA-style auto-regressive transformer  
  - RMSNorm  
  - RoPE (Rotary Positional Embeddings)  
  - Multi-head attention with KV-cache  
  - SwiGLU activation  
  
  *Future plans include additional architectures like pico-diffusion and pico-statespace.*

- **Training & Checkpointing**  
  - Automatic storage of model and optimizer states  
  - Periodic hooks for saving **learning dynamics** (activations, gradients)  
  - Optional logging to Weights & Biases

- **Config-Driven Setup**  
  - Specify architecture, optimizer, dataset, and logging settings in YAML  
  - Straightforward to extend or modify
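
To make two of the Pico-Decoder building blocks listed above more concrete, the following is a minimal pure-Python sketch of RMSNorm and the SwiGLU gating step. It uses the standard textbook definitions on plain lists; the function names are illustrative assumptions, the learned projection matrices of a full SwiGLU feed-forward layer are omitted, and the actual implementation in `pico_decoder.py` may differ.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root-mean-square, then by a learned gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]


def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def swiglu_gate(gate, up):
    """Elementwise SwiGLU gating: SiLU(gate) * up.

    In a full layer, `gate` and `up` would be two linear projections of the
    same input, and the result would pass through a third (down) projection.
    """
    return [silu(g) * u for g, u in zip(gate, up)]


normed = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
print(normed)  # the mean of the squared outputs is close to 1
print(swiglu_gate([0.0, 2.0], [5.0, 1.0]))
```

Unlike LayerNorm, RMSNorm does not subtract the mean, which is why only the root-mean-square appears in the scale factor.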

---

## **Quick Start**

1. **Clone the Repository**

   ```bash
   git clone https://github.com/pico-lm/pico-train
   cd pico-train
   ```

2. **Configure Environment**

   Create a `.env` file at the root with your Hugging Face and Weights & Biases tokens:
   ```bash
   export HF_TOKEN=your_huggingface_token
   export WANDB_API_KEY=your_wandb_key
   ```

3. **Install Dependencies**

   ```bash
   source setup.sh
   ```
   This script checks your environment, installs necessary tools, and sets up a Poetry virtual environment.

4. **Train Your Model Suite**

   - Edit (or create) a config file (e.g., `configs/demo.yaml`) to specify your architecture and training preferences.
   - Then run:
     ```bash
     poetry run train --config_path configs/demo.yaml
     ```
   - This launches training, automatically checkpointing states and saving learning dynamics data.

5. **Explore Checkpoints**
   - By default, checkpoints are stored under `runs/YOUR_RUN_NAME/checkpoints/`.
   - Each checkpoint contains:
     - **Model state** (PyTorch + Hugging Face formats)
     - **Optimizer state**
     - **Gradients and activations** for interpretability
     - **Evaluation logs** (e.g. perplexity) and metrics
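
As a starting point for exploring a run, the sketch below lists the checkpoint directories under the path given above. It assumes each checkpoint is stored as its own subdirectory and that directory names end in a step number (for example the hypothetical `step_100`); neither detail is confirmed by this README, so adjust the sort key to match what you find on disk.

```python
import re
from pathlib import Path


def list_checkpoints(run_dir):
    """Return checkpoint subdirectories of run_dir/checkpoints, ordered by step."""
    ckpt_root = Path(run_dir) / "checkpoints"
    if not ckpt_root.is_dir():
        return []

    def sort_key(path):
        # Assumption: the last run of digits in the name is the training step.
        digits = re.findall(r"\d+", path.name)
        return (int(digits[-1]) if digits else -1, path.name)

    return sorted((p for p in ckpt_root.iterdir() if p.is_dir()), key=sort_key)


# Replace YOUR_RUN_NAME with the name of an actual run.
for checkpoint in list_checkpoints("runs/YOUR_RUN_NAME"):
    print(checkpoint.name)
```

If the run directory does not exist yet, the function returns an empty list rather than raising.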

---

## **Repository Structure**

- **`src/model/pico_decoder.py`**  
  - Core LLAMA-style decoder implementation (attention, RMSNorm, RoPE, etc.)

- **`src/training/trainer.py`**  
  - Main training loop  
  - Manages distributed and multi-node settings  
  - Collects/logs metrics  
  - Orchestrates checkpoint saving

- **`src/checkpointing`**  
  - Logic for saving model states, gradients, activations  
  - Tools for uploading checkpoints to Hugging Face

- **`src/config`**  
  - Flexible dataclass-based config system covering model and training hyperparameters, checkpointing, and logging

- **`configs/demo.yaml`**  
  - Example config with default values for quick experimentation
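
To illustrate the general idea of a dataclass-based config system like the one in `src/config`, here is a minimal sketch in which nested dataclasses hold defaults and a dictionary (as would be parsed from a YAML file) overrides them. Every class name, field name, and default value below is hypothetical; consult `src/config` and `configs/demo.yaml` for the real schema.

```python
from dataclasses import dataclass, field, fields


@dataclass
class ModelConfig:
    # Hypothetical fields and defaults, for illustration only.
    d_model: int = 96
    n_layers: int = 12


@dataclass
class TrainingConfig:
    max_steps: int = 1000
    learning_rate: float = 3e-4


@dataclass
class RunConfig:
    model: ModelConfig = field(default_factory=ModelConfig)
    training: TrainingConfig = field(default_factory=TrainingConfig)


def build_config(overrides):
    """Start from defaults and apply {section: {key: value}} overrides."""
    config = RunConfig()
    for section_name, section_values in overrides.items():
        section = getattr(config, section_name)  # AttributeError if unknown section
        valid_keys = {f.name for f in fields(section)}
        for key, value in section_values.items():
            if key not in valid_keys:
                raise KeyError(f"unknown option {section_name}.{key}")
            setattr(section, key, value)
    return config


config = build_config({"model": {"n_layers": 4}})
print(config)
```

Rejecting unknown keys early means a typo in a config file fails loudly instead of silently falling back to a default.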

---

## **Advanced Analysis with Pico Analyze**

For deeper checkpoint analysis, such as comparing gradients, tracking representation shifts, or measuring sparsity, use our companion repository [**pico-analyze**](https://github.com/pico-lm/pico-analyze). It automatically processes **pico-train** checkpoints and applies advanced metrics like **CKA**, **PWCCA**, **Gini**, **Hoyer**, and more to reveal **how** your models learn over time.
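
To give a feel for the sparsity metrics named above, the following sketch computes the Hoyer and Gini sparsity measures for a flat list of values using their commonly cited definitions. It is not taken from pico-analyze, whose implementation and normalization may differ; the function names are illustrative assumptions.

```python
import math


def hoyer_sparsity(values):
    """Hoyer measure: (sqrt(n) - L1/L2) / (sqrt(n) - 1); 0 is dense, 1 is one-hot."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    l1 = sum(abs(v) for v in values)
    l2 = math.sqrt(sum(v * v for v in values))
    if l2 == 0.0:
        raise ValueError("undefined for an all-zero vector")
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1.0)


def gini_sparsity(values):
    """Gini index over magnitudes sorted in ascending order; 0 means all equal."""
    magnitudes = sorted(abs(v) for v in values)
    n = len(magnitudes)
    total = sum(magnitudes)
    if n == 0 or total == 0.0:
        raise ValueError("undefined for an empty or all-zero vector")
    weighted = sum(
        (m / total) * ((n - k + 0.5) / n)
        for k, m in enumerate(magnitudes, start=1)
    )
    return 1.0 - 2.0 * weighted


one_hot = [0.0, 0.0, 0.0, 1.0]
uniform = [1.0, 1.0, 1.0, 1.0]
print(hoyer_sparsity(one_hot), hoyer_sparsity(uniform))
print(gini_sparsity(one_hot), gini_sparsity(uniform))
```

Both measures are scale-invariant, so multiplying all values by a constant leaves the score unchanged, which makes them convenient for comparing gradients or activations across checkpoints.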

---

## **License**

Pico is open-source under the [Apache License 2.0](LICENSE).

---

## **Citation**

If you use **Pico** in your research, please cite:

```bibtex
@software{pico2025,
    author = {Diehl Martinez, Richard},
    title = {Pico: A Lightweight Framework for Studying Language Model Learning Dynamics},
    year = {2025},
    url = {https://github.com/pico-lm}
}
```

**Happy Training!** For more information and tutorials, visit our website at [picolm.io](https://picolm.io).