GLP for Qwen3-8B (multi-layer d10, layers 0-17)

A Generative Latent Prior trained on residual-stream activations of Qwen/Qwen3-8B, conditioning on layer position over layers 0-17 of 36 total. The denoiser has 10 transformer-MLP blocks (vs the d6 release's 6) and is 5.65 B parameters. Trained from FineWeb text using the producer-consumer activation-caching pipeline at spencerkitts/generative_latent_prior, itself a fork of the original g-luo/generative_latent_prior that backs "Learning a Generative Meta-Model of LLM Activations" (arXiv:2602.06964).

Training summary

Activations: 110 M / layer x 18 layers ~ 2 B total samples consumed from FineWeb (2x the d6 release's data budget).
Final checkpoint: gradient step 480000 of a planned ~483,400-step single-epoch run. Training stopped at 99.8% when the streaming reader deadlocked on the final tail; the missing ~880 steps ran with cosine-decayed LR ~5e-6, so the model state is effectively the same as a clean epoch end.

Evaluation

Frechet Distance between 50k real Qwen3-8B layer-17 activations and 50k samples generated by this GLP at 1000 sampling timesteps:

	FD	Lower bound (real-vs-real, n=25k each)	Ratio (FD / LB)
d10 (this release)	41,086.5	12,600.5	3.26x
d6 (`sudoers/glp-qwen3-8b-d6-multi`)	22,698.0	4,924.8	4.61x

The lower bound differs between the two evaluations because they sample from different activation caches (d10 had a smaller surviving-on-disk pool when this eval ran), so absolute FD is not directly comparable. The ratio FD/LB factors out cache-pool noise: d10 is closer to the irreducible sampling-noise floor, indicating the larger denoiser does benefit from the 2x training budget. Result is in eval_fd.json.

Files

final.safetensors - final checkpoint (step 480000)
checkpoints/step_*.safetensors - last few intermediate checkpoints
config.yaml - training config
rep_statistics.pt - per-layer mean/var for the Normalizer
eval_fd.json - Frechet Distance result
resume/ - artifacts to extend training:
- optimizer_state.pt - AdamW8bit state at step 480000
- scheduler_state.pt - cosine LR scheduler state at step 480000
- training_state.json - producer/consumer position in FineWeb (n_docs_seen, global_idx_consumed_per_layer) and a resume recipe

Quickstart

from glp.denoiser import load_glp
model = load_glp("sudoers/glp-qwen3-8b-d10-multi", device="cuda:0", checkpoint="final")

For sampling at a specific layer (must pass layer_idx for multi-layer):

from glp import flow_matching
import torch
noise = torch.randn(1024, 1, 4096, device="cuda:0")
samples = flow_matching.sample(model, noise, num_timesteps=1000, layer_idx=17)
samples = model.normalizer.denormalize(samples, layer_idx=17)

Architecture


Source LLM	`Qwen/Qwen3-8B`
Layers modeled	0-17 (multi-layer; sinusoidal layer-position embedding)
`d_input`	4096
`d_model`	8192
`d_mlp`	16384
Denoiser depth (`n_layers`)	10
`multi_layer_n_layers`	18
Total params	~5.65 B
Optimizer	bitsandbytes AdamW8bit

Sibling release

For the smaller, fully-trained companion, see sudoers/glp-qwen3-8b-d6-multi (d6 denoiser, 1B activation budget).

Citation

@article{luo2026glp,
  title={Learning a Generative Meta-Model of LLM Activations},
  author={Grace Luo and Jiahai Feng and Trevor Darrell and Alec Radford and Jacob Steinhardt},
  journal={arXiv preprint arXiv:2602.06964},
  year={2026}
}

Downloads last month: 16

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Paper for sudoers/glp-qwen3-8b-d10-multi

Learning a Generative Meta-Model of LLM Activations

Paper • 2602.06964 • Published Feb 6 • 3