Upload README.md with huggingface_hub
README.md
ADDED
@@ -0,0 +1,54 @@
---
license: apache-2.0
library_name: transformers
tags:
- pretraining
- partial-backpropagation
- llama
---

# Partial-BP 1B checkpoints: *Which Layers Need Backpropagation?*

Companion checkpoints for the preprint by **Raen2264** (pseudonym).
Paper: Zenodo DOI **10.5281/zenodo.20392068** · Code: **github.com/2264K/hybrid-zo-pretrain**

A Llama-architecture **1B** model trained **from scratch** on **FineWeb-Edu** (`sample-10BT`) with the **GPT-2 tokenizer** (vocab 50257) for **10B tokens**, **seed 42**.

## Files (all bf16 model `state_dict`, no optimizer)

| file | what |
|------|------|
| `frozen_pos6_10b_s42_init.pt` | **random init** (BEFORE training), seed 42 |
| `frozen_pos6_10b_s42_final.pt` | **frozen-partial** final: BP window = layers **[6,12)** trained, all other layers frozen at random init |
| `fullbp_10b_s42_final.pt` | **full backprop** final: all layers trained |

## Model config
```python
LlamaConfig(vocab_size=50257, hidden_size=2048, intermediate_size=5632,
            num_hidden_layers=24, num_attention_heads=16, num_key_value_heads=16,
            max_position_embeddings=512, tie_word_embeddings=False)
```
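
As a rough sanity check on model size, the parameter count implied by this config can be worked out by hand. The sketch below assumes the standard Llama layout with no bias terms (the `transformers` default, which this card does not state explicitly); the function name `llama_param_counts` is illustrative and not part of the released code.

```python
def llama_param_counts(vocab_size=50257, hidden_size=2048, intermediate_size=5632,
                       num_hidden_layers=24, window=(6, 12)):
    """Count parameters for an untied, bias-free Llama config (assumed layout)."""
    attn = 4 * hidden_size * hidden_size         # q, k, v, o projections (16 KV heads, so k/v are full size)
    mlp = 3 * hidden_size * intermediate_size    # gate, up, down projections
    norms = 2 * hidden_size                      # two RMSNorm weights per decoder layer
    per_layer = attn + mlp + norms
    embed = vocab_size * hidden_size             # input embedding
    lm_head = vocab_size * hidden_size           # separate matrix because tie_word_embeddings=False
    final_norm = hidden_size
    total = num_hidden_layers * per_layer + embed + lm_head + final_norm
    trained = (window[1] - window[0]) * per_layer  # decoder layers inside the BP window [start, stop)
    return {"per_layer": per_layer, "total": total, "trained": trained}


counts = llama_param_counts()
print(counts)
print(f"trained fraction of all parameters: {counts['trained'] / counts['total']:.1%}")
```

Under these assumptions the window [6,12) holds about 308M of roughly 1.44B parameters, or about 21% of the total, while its share of decoder layers is exactly 6/24 = 25%.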

## ⚠️ Note on the init checkpoint (important)
The init was **not saved** during the original run. It was regenerated with `torch.manual_seed(42)` followed by `LlamaForCausalLM(config).to(bfloat16)`, and its byte-exactness is established in two steps:

- **Direct check.** In frozen-partial training, the frozen region (layers 0–5 and 12–23, plus embed / final-norm / lm_head) is **never updated**, so those tensors in the final checkpoint *are* the original init. We confirmed that all **162 frozen-region tensors** of the regenerated seed-42 model match `frozen_pos6_10b_s42_final.pt` **exactly** (torch 2.9, bf16).
- **Inference for the trained window.** Initialization is fully deterministic and every layer draws from the same seeded RNG stream, so a regenerated model whose frozen layers match on both sides of the window also reproduces the original init of the trained-region layers (6–11). Hence `frozen_pos6_10b_s42_init.pt` is the **byte-exact** pre-training weight set.
- If you regenerate the init on a *different* torch version, run the frozen-region equality check above to confirm your RNG stream matches before relying on layers 6–11.
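
The frozen-region equality check described in the bullets can be sketched as follows. This is a minimal sketch, not the authors' verification script: the helper names `is_frozen_key` and `mismatched_frozen_keys` are hypothetical, and the key pattern `model.layers.<i>.` assumes the checkpoints use the standard `LlamaForCausalLM` parameter names.

```python
import re

# Assumed key layout: decoder-layer tensors are named "model.layers.<index>.<...>".
LAYER_KEY = re.compile(r"^model\.layers\.(\d+)\.")


def is_frozen_key(name, window=(6, 12)):
    """Return True if a state_dict key lies outside the trained window [start, stop)."""
    match = LAYER_KEY.match(name)
    if match is None:
        return True  # not a decoder layer: embedding, final norm, or lm_head
    layer = int(match.group(1))
    return not (window[0] <= layer < window[1])


def mismatched_frozen_keys(init_sd, final_sd, window=(6, 12), equal=None):
    """List frozen-region keys whose tensors differ between two state dicts."""
    if equal is None:
        import torch  # imported lazily so the key logic can be used without torch
        equal = torch.equal  # exact element-wise comparison, no tolerance
    return [
        key for key in final_sd
        if is_frozen_key(key, window)
        and (key not in init_sd or not equal(init_sd[key], final_sd[key]))
    ]
```

Called with the regenerated model's `state_dict()` as `init_sd` and the loaded `frozen_pos6_10b_s42_final.pt` as `final_sd`, an empty list means every frozen-region tensor matches exactly. Under the assumed key names each decoder layer contributes nine tensors, so the 18 frozen decoder layers account for 18 × 9 = 162 keys, with the embedding, final norm, and `lm_head` as three further keys.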

## Load
```python
import torch
from transformers import LlamaConfig, LlamaForCausalLM

# Same config as above; the checkpoints store only the weights.
cfg = LlamaConfig(vocab_size=50257, hidden_size=2048, intermediate_size=5632,
                  num_hidden_layers=24, num_attention_heads=16, num_key_value_heads=16,
                  max_position_embeddings=512, tie_word_embeddings=False)
m = LlamaForCausalLM(cfg)  # default dtype (fp32); cast with m.to(torch.bfloat16) to keep bf16

# weights_only=True restricts unpickling to tensors and plain containers.
state_dict = torch.load("frozen_pos6_10b_s42_final.pt", map_location="cpu", weights_only=True)
m.load_state_dict(state_dict)  # strict by default: raises on missing or unexpected keys
m.eval()
```

## Headline result (10B, FineWeb-Edu val PPL)
frozen-partial **35.2** < full BP **39.0** (lower is better). Backpropagating through a well-chosen 25% of the layers (the early-middle window [6,12), i.e. 6 of 24) while freezing the rest at random init beats tuned full backprop at the 10B-token horizon.
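
To put the two perplexities on the loss scale, the short calculation below assumes that validation PPL is the exponential of the mean per-token cross-entropy in nats. That is the usual convention, but this card does not state it, so treat the conversion as illustrative.

```python
import math

ppl_frozen_partial = 35.2  # headline value for the frozen-partial run
ppl_full_bp = 39.0         # headline value for the full-backprop run

# Assumption: PPL = exp(mean token cross-entropy in nats), so loss = ln(PPL).
loss_frozen_partial = math.log(ppl_frozen_partial)
loss_full_bp = math.log(ppl_full_bp)

loss_gap = loss_full_bp - loss_frozen_partial  # nats per token
relative_ppl_drop = (ppl_full_bp - ppl_frozen_partial) / ppl_full_bp

print(f"loss: {loss_frozen_partial:.3f} vs {loss_full_bp:.3f} nats/token")
print(f"gap: {loss_gap:.3f} nats/token, PPL lower by {relative_ppl_drop:.1%}")
```

Under that assumption the gap is about 0.10 nats per token, which corresponds to a perplexity roughly 9.7% lower than full backprop.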

---
Author name **Raen2264** is a pseudonym; please retain it. License: Apache-2.0.