2264K committed on
Commit 4eae3fb · verified · 1 Parent(s): 53f3757

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +54 -0
README.md ADDED
@@ -0,0 +1,54 @@
+ ---
+ license: apache-2.0
+ library_name: transformers
+ tags:
+ - pretraining
+ - partial-backpropagation
+ - llama
+ ---
+
+ # Partial-BP 1B checkpoints: *Which Layers Need Backpropagation?*
+
+ Companion checkpoints for the preprint by **Raen2264** (pseudonym).
+ Paper: Zenodo DOI **10.5281/zenodo.20392068** · Code: **github.com/2264K/hybrid-zo-pretrain**
+
+ A Llama-architecture **1B** model, trained **from scratch** on **FineWeb-Edu** (`sample-10BT`) with the **GPT-2 tokenizer** (vocab 50257) for **10B tokens** at **seed 42**.
+
+ ## Files (all bf16 model `state_dict`, no optimizer)
+
+ | file | what |
+ |------|------|
+ | `frozen_pos6_10b_s42_init.pt` | **random init** (before training), seed 42 |
+ | `frozen_pos6_10b_s42_final.pt` | **frozen-partial** final: only the BP window, layers **[6,12)**, is trained; all other layers stay frozen at random init |
+ | `fullbp_10b_s42_final.pt` | **full backprop** final: all layers trained |
+
+ ## Model config
+ ```python
+ LlamaConfig(vocab_size=50257, hidden_size=2048, intermediate_size=5632,
+             num_hidden_layers=24, num_attention_heads=16, num_key_value_heads=16,
+             max_position_embeddings=512, tie_word_embeddings=False)
+ ```
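
To make the size of this config concrete, the sketch below counts weights by hand. It assumes the standard bias-free Llama layout (q/k/v/o projections, a gated MLP with three matrices, two RMSNorm vectors per layer) and full multi-head attention, which holds here because `num_key_value_heads` equals `num_attention_heads`. The helper name `llama_param_counts` is hypothetical and not part of `transformers`.

```python
def llama_param_counts(vocab_size=50257, hidden_size=2048, intermediate_size=5632,
                       num_hidden_layers=24, tie_word_embeddings=False):
    """Count weights for a bias-free Llama layout (assumption: kv heads == attention heads)."""
    attn = 4 * hidden_size * hidden_size        # q_proj, k_proj, v_proj, o_proj
    mlp = 3 * hidden_size * intermediate_size   # gate_proj, up_proj, down_proj
    norms = 2 * hidden_size                     # input and post-attention RMSNorm
    per_layer = attn + mlp + norms
    embed = vocab_size * hidden_size
    lm_head = 0 if tie_word_embeddings else vocab_size * hidden_size
    final_norm = hidden_size
    total = num_hidden_layers * per_layer + embed + lm_head + final_norm
    return per_layer, total


per_layer, total = llama_param_counts()
window_params = (12 - 6) * per_layer            # layers in the BP window [6,12)
print(f"per layer:    {per_layer:,}")
print(f"total:        {total:,}")
print(f"in BP window: {window_params:,} ({window_params / total:.1%} of all weights)")
```

Under these assumptions the total is about 1.44B weights, including the two untied 50257 × 2048 embedding matrices, and roughly 0.31B of them sit in layers [6,12).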
+
+ ## ⚠️ Note on the init checkpoint (important)
+ The init was **not saved** during the original run; it was regenerated with `torch.manual_seed(42)` followed by `LlamaForCausalLM(config).to(torch.bfloat16)`, and **verified byte-exact** as follows:
+
+ - In frozen-partial training, the frozen-region layers (0–5, 12–23 + embed / final-norm / lm_head) are **never updated**, so in the final checkpoint they *are* the original init. We confirmed that all **162 frozen-region tensors** of the regenerated seed-42 model match `frozen_pos6_10b_s42_final.pt` **exactly** (torch 2.9, bf16).
+ - Because the initialization is fully deterministic, the trained-region layers (6–11) of the regenerated seed-42 model are **also** the exact original init. Hence `frozen_pos6_10b_s42_init.pt` is the **byte-exact** pre-training weight set.
+ - If you regenerate the init on a *different* torch version, use the frozen-region equality check above to confirm your RNG matches before relying on layers 6–11.
+
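
The frozen-region equality check described above can be sketched as follows. This is a minimal illustration, not the authors' verification script: the helper names `in_bp_window` and `mismatched_frozen_keys` are hypothetical, and the key pattern `model.layers.<i>.` is an assumption based on the usual `LlamaForCausalLM` `state_dict` naming.

```python
import re

# Assumption: decoder-layer keys look like "model.layers.<i>.<...>", as in LlamaForCausalLM.
_LAYER_RE = re.compile(r"^model\.layers\.(\d+)\.")


def in_bp_window(name, start=6, stop=12):
    """Return True if a state_dict key belongs to a decoder layer in [start, stop)."""
    match = _LAYER_RE.match(name)
    return match is not None and start <= int(match.group(1)) < stop


def mismatched_frozen_keys(init_sd, final_sd, start=6, stop=12):
    """List frozen-region keys whose tensors are missing or differ between two state_dicts."""
    import torch  # imported lazily so the key filter above has no torch dependency

    bad = []
    for name, tensor in init_sd.items():
        if in_bp_window(name, start, stop):
            continue  # trained region: expected to differ after training
        if name not in final_sd or not torch.equal(tensor, final_sd[name]):
            bad.append(name)
    return bad


# Usage (requires torch and the checkpoint files from this repo):
#   init_sd = torch.load("frozen_pos6_10b_s42_init.pt", map_location="cpu")
#   final_sd = torch.load("frozen_pos6_10b_s42_final.pt", map_location="cpu")
#   print(mismatched_frozen_keys(init_sd, final_sd))  # an empty list means the frozen region matches
```

An empty result for a regenerated init suggests the RNG stream matches the original run; any listed key points to a tensor that should be inspected before trusting layers 6–11.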
+ ## Load
+ ```python
+ import torch
+ from transformers import LlamaConfig, LlamaForCausalLM
+
+ # Same architecture as in "Model config" above.
+ cfg = LlamaConfig(vocab_size=50257, hidden_size=2048, intermediate_size=5632,
+                   num_hidden_layers=24, num_attention_heads=16, num_key_value_heads=16,
+                   max_position_embeddings=512, tie_word_embeddings=False)
+ # The model is built in float32 by default; load_state_dict copies the bf16 values into it.
+ m = LlamaForCausalLM(cfg)
+ # Each .pt file is a plain model state_dict (no optimizer state).
+ m.load_state_dict(torch.load("frozen_pos6_10b_s42_final.pt", map_location="cpu"))
+ ```
+
+ ## Headline result (10B, FineWeb-Edu val PPL)
+ frozen-partial **35.2** < full BP **39.0** (lower is better): backpropagating through a well-chosen ~25% of layers (the early-middle window [6,12)) while freezing the rest at random init beats tuned full backprop at the 10B-token horizon.
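
To put the two perplexities on a common footing, the arithmetic below converts them to a loss gap. It assumes the usual definition, PPL = exp(mean per-token cross-entropy in nats), which this README does not state explicitly; the variable names are illustrative only.

```python
import math

ppl_partial, ppl_full = 35.2, 39.0     # values from the headline result above

# Assumption: PPL = exp(mean per-token cross-entropy in nats).
loss_partial = math.log(ppl_partial)
loss_full = math.log(ppl_full)

rel_ppl_reduction = (ppl_full - ppl_partial) / ppl_full
loss_gap = loss_full - loss_partial
trained_fraction = (12 - 6) / 24        # BP window [6,12) out of 24 layers

print(f"relative PPL reduction: {rel_ppl_reduction:.1%}")
print(f"loss gap (nats/token):  {loss_gap:.4f}")
print(f"trained layer fraction: {trained_fraction:.0%}")
```

Under that assumption, the gap is a bit under 10% in perplexity, or roughly 0.10 nats per token, with 6 of 24 layers trained.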
+
+ ---
+ Author name **Raen2264** is a pseudonym; please retain it. License: Apache-2.0.