GLP for Qwen3-8B (multi-layer d10, layers 0-17)
A Generative Latent Prior trained on residual-stream activations of Qwen/Qwen3-8B, conditioning on layer position over layers 0-17 of 36 total. The denoiser has 10 transformer-MLP blocks (vs the d6 release's 6) and is 5.65 B parameters. Trained from FineWeb text using the producer-consumer activation-caching pipeline at spencerkitts/generative_latent_prior, itself a fork of the original g-luo/generative_latent_prior that backs "Learning a Generative Meta-Model of LLM Activations" (arXiv:2602.06964).
Training summary
- Activations: 110 M / layer x 18 layers ~ 2 B total samples consumed from FineWeb (2x the d6 release's data budget).
- Final checkpoint: gradient step 480000 of a planned ~483,400-step single-epoch run. Training stopped at 99.8% when the streaming reader deadlocked on the final tail; the missing ~880 steps ran with cosine-decayed LR ~5e-6, so the model state is effectively the same as a clean epoch end.
Evaluation
Frechet Distance between 50k real Qwen3-8B layer-17 activations and 50k samples generated by this GLP at 1000 sampling timesteps:
| FD | Lower bound (real-vs-real, n=25k each) | Ratio (FD / LB) | |
|---|---|---|---|
| d10 (this release) | 41,086.5 | 12,600.5 | 3.26x |
d6 (sudoers/glp-qwen3-8b-d6-multi) |
22,698.0 | 4,924.8 | 4.61x |
The lower bound differs between the two evaluations because they sample from
different activation caches (d10 had a smaller surviving-on-disk pool when
this eval ran), so absolute FD is not directly comparable. The ratio FD/LB
factors out cache-pool noise: d10 is closer to the irreducible sampling-noise
floor, indicating the larger denoiser does benefit from the 2x training
budget. Result is in eval_fd.json.
Files
final.safetensors- final checkpoint (step 480000)checkpoints/step_*.safetensors- last few intermediate checkpointsconfig.yaml- training configrep_statistics.pt- per-layer mean/var for the Normalizereval_fd.json- Frechet Distance resultresume/- artifacts to extend training:optimizer_state.pt- AdamW8bit state at step 480000scheduler_state.pt- cosine LR scheduler state at step 480000training_state.json- producer/consumer position in FineWeb (n_docs_seen,global_idx_consumed_per_layer) and a resume recipe
Quickstart
from glp.denoiser import load_glp
model = load_glp("sudoers/glp-qwen3-8b-d10-multi", device="cuda:0", checkpoint="final")
For sampling at a specific layer (must pass layer_idx for multi-layer):
from glp import flow_matching
import torch
noise = torch.randn(1024, 1, 4096, device="cuda:0")
samples = flow_matching.sample(model, noise, num_timesteps=1000, layer_idx=17)
samples = model.normalizer.denormalize(samples, layer_idx=17)
Architecture
| Source LLM | Qwen/Qwen3-8B |
| Layers modeled | 0-17 (multi-layer; sinusoidal layer-position embedding) |
d_input |
4096 |
d_model |
8192 |
d_mlp |
16384 |
Denoiser depth (n_layers) |
10 |
multi_layer_n_layers |
18 |
| Total params | ~5.65 B |
| Optimizer | bitsandbytes AdamW8bit |
Sibling release
For the smaller, fully-trained companion, see
sudoers/glp-qwen3-8b-d6-multi
(d6 denoiser, 1B activation budget).
Citation
@article{luo2026glp,
title={Learning a Generative Meta-Model of LLM Activations},
author={Grace Luo and Jiahai Feng and Trevor Darrell and Alec Radford and Jacob Steinhardt},
journal={arXiv preprint arXiv:2602.06964},
year={2026}
}
- Downloads last month
- 16