GLP for Qwen3-8B (multi-layer d10, layers 0-17)

A Generative Latent Prior (GLP) trained on residual-stream activations of Qwen/Qwen3-8B, conditioned on layer position over layers 0-17 of the model's 36 decoder layers. The denoiser has 10 transformer-MLP blocks (vs the d6 release's 6) and totals 5.65 B parameters. It was trained on FineWeb text using the producer-consumer activation-caching pipeline at spencerkitts/generative_latent_prior, a fork of the original g-luo/generative_latent_prior that backs "Learning a Generative Meta-Model of LLM Activations" (arXiv:2602.06964).
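
As an illustration only, here is a minimal sketch of what the activation-caching step does conceptually: run FineWeb text through Qwen3-8B and keep the residual stream after layers 0-17. This is not the repo's producer-consumer pipeline; it uses only standard transformers calls, and the batching, dtype, and device choices are placeholders.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
llm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype=torch.bfloat16, device_map="cuda:0")

inputs = tok("An example FineWeb document.", return_tensors="pt").to("cuda:0")
with torch.no_grad():
    out = llm(**inputs, output_hidden_states=True)

# hidden_states[k+1] is the residual stream after decoder layer k;
# layers 0-17 give 18 tensors of shape [batch, seq, 4096] per document.
acts = {k: out.hidden_states[k + 1] for k in range(18)}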

Training summary

  • Activations: ~110 M per layer × 18 layers ≈ 2 B total samples consumed from FineWeb (2× the d6 release's data budget; see the step-count check after this list).
  • Final checkpoint: gradient step 480000 of a planned ~483,400-step single-epoch run. Training stopped at 99.8% of the epoch when the streaming reader deadlocked on the final tail of the stream; the missing ~880 steps would have run at a cosine-decayed LR of ~5e-6, so the final model state is effectively equivalent to a clean epoch end.
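
As a back-of-the-envelope check of those numbers, the planned step count is consistent with the data budget under an assumed global batch of ~4,096 activation vectors per step (an assumption; the real batch size is in config.yaml):

total_samples = 110e6 * 18            # ~1.98e9 activation vectors across layers 0-17
assumed_batch = 4096                  # assumption; see config.yaml for the actual value
print(total_samples / assumed_batch)  # ~483,400 steps, matching the planned single-epoch run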

Evaluation

Fréchet Distance (FD) between 50k real Qwen3-8B layer-17 activations and 50k samples generated by this GLP with 1000 sampling timesteps:

Model                                  FD          Lower bound (real-vs-real, n=25k each)   Ratio (FD / LB)
d10 (this release)                     41,086.5    12,600.5                                 3.26x
d6 (sudoers/glp-qwen3-8b-d6-multi)     22,698.0    4,924.8                                  4.61x

The lower bound differs between the two evaluations because they sample from different activation caches (d10 had a smaller surviving on-disk pool when this eval ran), so the absolute FD values are not directly comparable. The FD/LB ratio factors out cache-pool noise: d10 sits closer to the irreducible sampling-noise floor, indicating the larger denoiser does benefit from the 2× training budget. The result is stored in eval_fd.json.
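
For reference, the FD reported here is the usual Gaussian-approximation Fréchet Distance computed from the mean and covariance of each activation set. A minimal sketch (not the repo's evaluation code; frechet_distance is a hypothetical helper):

import numpy as np
from scipy import linalg

def frechet_distance(x, y):
    # x, y: [n, d] arrays of activations (e.g. two [50_000, 4096] sets).
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_x @ cov_y, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(cov_x + cov_y - 2.0 * covmean))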

Files

  • final.safetensors - final checkpoint (step 480000)
  • checkpoints/step_*.safetensors - last few intermediate checkpoints
  • config.yaml - training config
  • rep_statistics.pt - per-layer mean/var for the Normalizer
  • eval_fd.json - Fréchet Distance result
  • resume/ - artifacts to extend training (a reload sketch follows this list):
    • optimizer_state.pt - AdamW8bit state at step 480000
    • scheduler_state.pt - cosine LR scheduler state at step 480000
    • training_state.json - producer/consumer position in FineWeb (n_docs_seen, global_idx_consumed_per_layer) and a resume recipe
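
A rough sketch of how these artifacts might be reloaded with plain PyTorch and bitsandbytes calls; the optimizer hyperparameters and file contents below are assumptions, so follow the recipe in resume/training_state.json for the authoritative procedure.

import json, torch
import bitsandbytes as bnb
from glp.denoiser import load_glp

model = load_glp("sudoers/glp-qwen3-8b-d10-multi", device="cuda:0", checkpoint="final")
optimizer = bnb.optim.AdamW8bit(model.parameters())                # hyperparameters live in config.yaml
optimizer.load_state_dict(torch.load("resume/optimizer_state.pt"))
scheduler_state = torch.load("resume/scheduler_state.pt")          # pass to your cosine scheduler's load_state_dict
with open("resume/training_state.json") as f:
    position = json.load(f)                                        # n_docs_seen, global_idx_consumed_per_layer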

Quickstart

from glp.denoiser import load_glp

# Load the final (step 480000) d10 multi-layer checkpoint onto a single GPU.
model = load_glp("sudoers/glp-qwen3-8b-d10-multi", device="cuda:0", checkpoint="final")

To sample activations at a specific layer (layer_idx must be passed for multi-layer models):

from glp import flow_matching
import torch

# 1024 noise vectors with sequence length 1 and width d_input = 4096.
noise = torch.randn(1024, 1, 4096, device="cuda:0")
# Run the flow-matching sampler conditioned on layer 17.
samples = flow_matching.sample(model, noise, num_timesteps=1000, layer_idx=17)
# Map back to the raw activation scale using the stored per-layer statistics.
samples = model.normalizer.denormalize(samples, layer_idx=17)
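
The same call pattern works for any of the 18 modeled layers; for example, drawing a smaller batch per layer with the functions shown above:

per_layer = {}
for layer_idx in range(18):
    z = torch.randn(256, 1, 4096, device="cuda:0")
    x = flow_matching.sample(model, z, num_timesteps=1000, layer_idx=layer_idx)
    per_layer[layer_idx] = model.normalizer.denormalize(x, layer_idx=layer_idx)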

Architecture

Source LLM                   Qwen/Qwen3-8B
Layers modeled               0-17 (multi-layer; sinusoidal layer-position embedding)
d_input                      4096
d_model                      8192
d_mlp                        16384
Denoiser depth (n_layers)    10
multi_layer_n_layers         18
Total params                 ~5.65 B
Optimizer                    bitsandbytes AdamW8bit
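
A rough parameter-count sanity check from the table above, assuming standard attention + MLP transformer blocks at these widths (conditioning, embedding, and norm parameters are not counted, which is presumably where the remainder of the ~5.65 B sits):

d_model, d_mlp, n_blocks, d_input = 8192, 16384, 10, 4096
attn_per_block = 4 * d_model * d_model   # Q, K, V, O projections
mlp_per_block = 2 * d_model * d_mlp      # up + down projections
io_proj = 2 * d_input * d_model          # 4096 -> 8192 input and 8192 -> 4096 output projections
total = n_blocks * (attn_per_block + mlp_per_block) + io_proj
print(f"{total / 1e9:.2f} B")            # ~5.44 B from the blocks and projections alone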

Sibling release

For the smaller, fully-trained companion, see sudoers/glp-qwen3-8b-d6-multi (d6 denoiser, 1B activation budget).

Citation

@article{luo2026glp,
  title={Learning a Generative Meta-Model of LLM Activations},
  author={Grace Luo and Jiahai Feng and Trevor Darrell and Alec Radford and Jacob Steinhardt},
  journal={arXiv preprint arXiv:2602.06964},
  year={2026}
}