2264K
/

partial-bp-1b-ckpts

+---
+license: apache-2.0
+library_name: transformers
+tags:
+- pretraining
+- partial-backpropagation
+- llama
+---
+# Partial-BP 1B checkpoints — *Which Layers Need Backpropagation?*
+Companion checkpoints for the preprint by **Raen2264** (pseudonym).
+Paper: Zenodo DOI **10.5281/zenodo.20392068** · Code: **github.com/2264K/hybrid-zo-pretrain**
+Llama-architecture **1B**, trained **from scratch** on **FineWeb-Edu** (`sample-10BT`), **GPT-2 tokenizer** (vocab 50257), **10B tokens**, **seed 42**.
+## Files (all bf16 model `state_dict`, no optimizer)
+| file | what |
+|------|------|
+| `frozen_pos6_10b_s42_init.pt`  | **random init** (BEFORE training), seed 42 |
+| `frozen_pos6_10b_s42_final.pt` | **frozen-partial** final — BP window = layers **[6,12)** trained, all other layers frozen at random init |
+| `fullbp_10b_s42_final.pt`      | **full backprop** final — all layers trained |
+## Model config
+```python
+LlamaConfig(vocab_size=50257, hidden_size=2048, intermediate_size=5632,
+            num_hidden_layers=24, num_attention_heads=16, num_key_value_heads=16,
+            max_position_embeddings=512, tie_word_embeddings=False)
+```
+## ⚠️ Note on the init checkpoint (important)
+The init was **not saved** during the original run; it is reproduced via `torch.manual_seed(42)` + `LlamaForCausalLM(config).to(bfloat16)`, and **verified byte-exact**:
+- In frozen-partial training, the frozen-region layers (0–5, 12–23 + embed / final-norm / lm_head) are **never updated**, so they *are* the original init. We confirmed all **162 frozen-region tensors** of the regenerated seed-42 model match `frozen_pos6_10b_s42_final.pt` **exactly** (torch 2.9, bf16).
+- Because the initialization is fully deterministic, the trained-region layers (6–11) of the regenerated seed-42 model are therefore **also** the exact original init. Hence `frozen_pos6_10b_s42_init.pt` is the **byte-exact** pre-training weight set.
+- If you regenerate the init on a *different* torch version, use the frozen-region equality check above to confirm your RNG matches before relying on layers 6–11.
+## Load
+```python
+import torch
+from transformers import LlamaForCausalLM, LlamaConfig
+cfg = LlamaConfig(vocab_size=50257, hidden_size=2048, intermediate_size=5632,
+                  num_hidden_layers=24, num_attention_heads=16, num_key_value_heads=16,
+                  max_position_embeddings=512, tie_word_embeddings=False)
+m = LlamaForCausalLM(cfg)
+m.load_state_dict(torch.load("frozen_pos6_10b_s42_final.pt", map_location="cpu"))
+```
+## Headline result (10B, FineWeb-Edu val PPL)
+frozen-partial **35.2** < full BP **39.0** — backpropagating a well-chosen ~25% of layers (early-middle window [6,12)) while freezing the rest at random init beats tuned full backprop at the 10B-token horizon.
+---
+Author name **Raen2264** is a pseudonym — please retain it. License: Apache-2.0.