# GLP for Qwen3-8B (multi-layer, layers 0-17)

A Generative Latent Prior trained on residual-stream activations of Qwen/Qwen3-8B, conditioned on layer position over layers 0-17 of 36 total. Trained on FineWeb text using the producer-consumer activation-caching pipeline.
This is the multi-layer variant from the paper "Learning a Generative Meta-Model of LLM Activations" (Luo et al., 2026, arXiv:2602.06964), adapted to Qwen3-8B and to the d6 denoiser size used in the paper's main `glp-llama8b-d6` release. The training pipeline is the open-source code at `g-luo/generative_latent_prior`, with a few patches contributed in the course of training (see the notes below).
## Files

- `final.safetensors` – main checkpoint, gradient step 240,000
- `checkpoints/step_*.safetensors` – intermediate checkpoints
- `config.yaml` – training config (denoiser size, layer set, etc.)
- `rep_statistics.pt` – per-layer mean/var for the Normalizer
- `eval_fd.json` – Fréchet Distance evaluation result
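Individual files can also be fetched without the `load_glp` helper via the standard `huggingface_hub` API. A minimal sketch (the repo id comes from the quickstart below; the exact layout inside `rep_statistics.pt` is an assumption, so inspect it before relying on it):

```python
import torch
from huggingface_hub import hf_hub_download

# Pull the per-layer normalization statistics directly from the Hub.
stats_path = hf_hub_download("sudoers/glp-qwen3-8b-d6-multi", "rep_statistics.pt")
rep_stats = torch.load(stats_path, map_location="cpu")

# Assumed layout: a container of per-layer mean/var tensors used by the
# Normalizer; print it to confirm the actual structure.
print(type(rep_stats))
```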
## Quickstart

```python
import torch
from glp import flow_matching
from glp.denoiser import load_glp

# Load the final checkpoint from the Hub.
model = load_glp("sudoers/glp-qwen3-8b-d6-multi", device="cuda:0", checkpoint="final")

# Layer-conditioned sampling: multi-layer models must be passed layer_idx.
noise = torch.randn(1024, 1, 4096, device="cuda:0")  # 1024 samples, d_input = 4096
samples = flow_matching.sample(model, noise, num_timesteps=1000, layer_idx=17)

# Map the samples back into layer 17's original activation space.
samples = model.normalizer.denormalize(samples, layer_idx=17)
```
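Sampling from other layers is the same call with a different `layer_idx`. A sketch reusing `model` and `flow_matching` from the quickstart to draw samples from every modeled layer:

```python
# Draw 1024 samples from each of the 18 modeled layers (0-17).
per_layer = {}
for layer_idx in range(18):
    noise = torch.randn(1024, 1, 4096, device="cuda:0")
    x = flow_matching.sample(model, noise, num_timesteps=1000, layer_idx=layer_idx)
    per_layer[layer_idx] = model.normalizer.denormalize(x, layer_idx=layer_idx)
```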
## Architecture

| Field | Value |
|---|---|
| Source LLM | Qwen/Qwen3-8B |
| Layers modeled | 0-17 (multi-layer; conditioned via sinusoidal layer embedding) |
| `d_input` | 4096 |
| `d_model` | 8192 |
| `d_mlp` | 16384 |
| Denoiser depth (`n_layers`) | 6 |
| `multi_layer_n_layers` | 18 |
| Total params | ~3.5 B |
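The "sinusoidal layer embedding" row refers to the classic transformer positional encoding applied to the integer layer index rather than a token position. A generic sketch (the dimensionality and frequency base here are assumptions; the upstream formulation may differ):

```python
import math
import torch

def layer_embedding(layer_idx: int, dim: int = 8192) -> torch.Tensor:
    # Standard sinusoidal embedding of the layer index; purely illustrative.
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    angles = layer_idx * freqs
    return torch.cat([torch.sin(angles), torch.cos(angles)])
```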
## Training

| Field | Value |
|---|---|
| Activations source | HuggingFaceFW/fineweb (sample-100BT shard) |
| Token positions | all positions per doc except BOS, max length 2048 |
| Activations per layer | ~36 M |
| Total samples (per-layer × n_layers) | ~648 M |
| Optimizer | AdamW, lr 5e-5 |
| Schedule | cosine with 1% warmup |
| Batch size | 4096 |
| Precision | bf16 mixed |
| Hardware | 2× H100 80GB (1 producer, 1 consumer) |
| Total gradient steps | 240,000 |
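The schedule row translates to a linear warmup over the first 1% of the 240,000 steps followed by cosine decay. A generic PyTorch sketch of that schedule (not the upstream scheduler code):

```python
import math
import torch

def cosine_with_warmup(optimizer, total_steps=240_000, warmup_frac=0.01):
    # Linear warmup for the first 1% of steps, then cosine decay to zero.
    warmup = max(1, int(total_steps * warmup_frac))
    def lr_lambda(step):
        if step < warmup:
            return step / warmup
        progress = min(1.0, (step - warmup) / max(1, total_steps - warmup))
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Usage with the table's AdamW settings:
# opt = torch.optim.AdamW(model.parameters(), lr=5e-5)
# sched = cosine_with_warmup(opt)
```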
## Evaluation

Fréchet Distance (FD) on layer 17 between 50K real Qwen3-8B activations and 50K samples generated from the GLP (1000-timestep sampling, denormalized into the original activation space):
| Checkpoint | FD | Lower bound (real-vs-real) | Ratio |
|---|---|---|---|
| `final.safetensors` (step 240,000) | 22,698 | 4,925 | 4.6× |
Note: absolute FD scales with activation norm and is not directly comparable across source LLMs; the ratio against the irreducible real-vs-real lower bound is the meaningful quantity. For context, the paper's `glp-llama1b-d12-multi` (the multi-layer Llama-1B variant) reports a ratio of ~3.0×; ours is ~4.6×, plausibly due to the larger `d_input` (4096 vs 2048) and ~40% less per-layer training data than the paper's multi-layer setup.
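The FD numbers above follow the standard Fréchet distance between Gaussians fit to the two activation sets. A minimal reference implementation of the textbook formula (not the repo's evaluation code):

```python
import numpy as np
from scipy import linalg

def frechet_distance(x_real: np.ndarray, x_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussians fit to two (N, d) activation sets."""
    mu_r, mu_g = x_real.mean(axis=0), x_gen.mean(axis=0)
    cov_r = np.cov(x_real, rowvar=False)
    cov_g = np.cov(x_gen, rowvar=False)
    # Matrix square root of the covariance product; O(d^3), so this is
    # slow at d = 4096 but fine as a one-off evaluation.
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):  # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```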
## Patches over the upstream pipeline

Trained against the upstream code with a few patches (all upstreamable):

- Implemented `glp_save.py` (the producer; the upstream repo had a TODO).
- Made `MemmapWriter` close memmaps before unlink so the OS reclaims disk on rolling deletion.
- Vectorized the producer's per-token `MemmapWriter.write` into a `write_many` block-write (both memmap patches are sketched after this list).
- Replaced the O(N) `np.array(self.indices, …)` rebuilds in the producer's per-batch back-pressure check with O(1) `file_last_row` lookups; this fix brought the producer from 35 acts/sec/layer back into the ~700 acts/sec/layer regime as activations accumulated.
- Throttled `data_indices.npy` flushes to once every 50 producer batches.
- Added consumer resume from checkpoint + optimizer + scheduler state so the run could survive multi-restart compute interruptions.
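For concreteness, a minimal sketch of the two `MemmapWriter` patches; the class here is a hypothetical simplification, not the upstream implementation:

```python
import numpy as np

class MemmapWriterSketch:
    """Illustrative block-write memmap writer with close-before-unlink."""

    def __init__(self, path, num_rows, dim, dtype=np.float16):
        self.path = path
        self.mm = np.lib.format.open_memmap(
            path, mode="w+", dtype=dtype, shape=(num_rows, dim)
        )
        self.row = 0

    def write_many(self, block):
        # One slice assignment per batch instead of one write per token.
        n = len(block)
        self.mm[self.row : self.row + n] = block
        self.row += n

    def close(self):
        # Flush and drop the mapping *before* any os.unlink(self.path);
        # otherwise the OS keeps the space allocated until the mapping dies.
        self.mm.flush()
        del self.mm
```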
## Citation

If you use this model, please cite the original paper:

```bibtex
@article{luo2026glp,
  title={Learning a Generative Meta-Model of LLM Activations},
  author={Grace Luo and Jiahai Feng and Trevor Darrell and Alec Radford and Jacob Steinhardt},
  journal={arXiv preprint arXiv:2602.06964},
  year={2026}
}
```