# GLP for Qwen3-8B (multi-layer, layers 0-17)

A Generative Latent Prior trained on residual-stream activations of Qwen/Qwen3-8B, conditioned on layer position over layers 0-17 of 36 total. Trained on FineWeb text using the producer-consumer activation-caching pipeline.
This is the multi-layer variant from the paper "Learning a Generative Meta-Model of LLM Activations" (Luo et al., 2026, arXiv:2602.06964), adapted to Qwen3-8B and to the d6 denoiser size used in the paper's main `glp-llama8b-d6` release. The training pipeline is the open-source code at `g-luo/generative_latent_prior`, with a few patches contributed in the course of training (see the notes below).
## Files

- `final.safetensors` – main checkpoint, gradient step 240,000
- `checkpoints/step_*.safetensors` – intermediate checkpoints
- `config.yaml` – training config (denoiser size, layer set, etc.)
- `rep_statistics.pt` – per-layer mean/var for the Normalizer
- `eval_fd.json` – Fréchet Distance evaluation result
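Individual files can also be fetched without the `load_glp` helper via the standard `huggingface_hub` API. A minimal sketch (the repo id comes from the quickstart below; the exact layout inside `rep_statistics.pt` is an assumption, so inspect it before relying on it):

```python
import torch
from huggingface_hub import hf_hub_download

# Pull the per-layer normalization statistics directly from the Hub.
stats_path = hf_hub_download("sudoers/glp-qwen3-8b-d6-multi", "rep_statistics.pt")
rep_stats = torch.load(stats_path, map_location="cpu")

# Assumed layout: a container of per-layer mean/var tensors used by the
# Normalizer; print it to confirm the actual structure.
print(type(rep_stats))
```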
## Quickstart

```python
import torch
from glp import flow_matching
from glp.denoiser import load_glp

# Load the final checkpoint from the Hub.
model = load_glp("sudoers/glp-qwen3-8b-d6-multi", device="cuda:0", checkpoint="final")

# Layer-conditioned sampling: multi-layer models must be passed layer_idx.
noise = torch.randn(1024, 1, 4096, device="cuda:0")  # 1024 samples, d_input = 4096
samples = flow_matching.sample(model, noise, num_timesteps=1000, layer_idx=17)

# Map the samples back into layer 17's original activation space.
samples = model.normalizer.denormalize(samples, layer_idx=17)
```
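Sampling from other layers is the same call with a different `layer_idx`. A sketch reusing `model` and `flow_matching` from the quickstart to draw samples from every modeled layer:

```python
# Draw 1024 samples from each of the 18 modeled layers (0-17).
per_layer = {}
for layer_idx in range(18):
    noise = torch.randn(1024, 1, 4096, device="cuda:0")
    x = flow_matching.sample(model, noise, num_timesteps=1000, layer_idx=layer_idx)
    per_layer[layer_idx] = model.normalizer.denormalize(x, layer_idx=layer_idx)
```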
## Architecture

| Field | Value |
|---|---|
| Source LLM | Qwen/Qwen3-8B |
| Layers modeled | 0-17 (multi-layer; conditioned via sinusoidal layer embedding) |
| `d_input` | 4096 |
| `d_model` | 8192 |
| `d_mlp` | 16384 |
| Denoiser depth (`n_layers`) | 6 |
| `multi_layer_n_layers` | 18 |
| Total params | ~3.5 B |
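The "sinusoidal layer embedding" row refers to the classic transformer positional encoding applied to the integer layer index rather than a token position. A generic sketch (the dimensionality and frequency base here are assumptions; the upstream formulation may differ):

```python
import math
import torch

def layer_embedding(layer_idx: int, dim: int = 8192) -> torch.Tensor:
    # Standard sinusoidal embedding of the layer index; purely illustrative.
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    angles = layer_idx * freqs
    return torch.cat([torch.sin(angles), torch.cos(angles)])
```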
## Training

| Field | Value |
|---|---|
| Activations source | HuggingFaceFW/fineweb (sample-100BT shard) |
| Token positions | all positions per doc except BOS, max length 2048 |
| Activations per layer | ~36 M |
| Total samples (per-layer × n_layers) | ~648 M |
| Optimizer | AdamW, lr 5e-5 |
| Schedule | cosine with 1% warmup |
| Batch size | 4096 |
| Precision | bf16 mixed |
| Hardware | 2× H100 80GB (1 producer, 1 consumer) |
| Total gradient steps | 240,000 |
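The schedule row translates to a linear warmup over the first 1% of the 240,000 steps followed by cosine decay. A generic PyTorch sketch of that schedule (not the upstream scheduler code):

```python
import math
import torch

def cosine_with_warmup(optimizer, total_steps=240_000, warmup_frac=0.01):
    # Linear warmup for the first 1% of steps, then cosine decay to zero.
    warmup = max(1, int(total_steps * warmup_frac))
    def lr_lambda(step):
        if step < warmup:
            return step / warmup
        progress = min(1.0, (step - warmup) / max(1, total_steps - warmup))
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Usage with the table's AdamW settings:
# opt = torch.optim.AdamW(model.parameters(), lr=5e-5)
# sched = cosine_with_warmup(opt)
```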
## Evaluation

Fréchet Distance (FD) on layer 17 between 50K real Qwen3-8B activations and 50K samples generated from the GLP (1000-timestep sampling, denormalized into the original activation space):
| Checkpoint | FD | Lower bound (real-vs-real) | Ratio |
|---|---|---|---|
| `final.safetensors` (step 240,000) | 22,698 | 4,925 | 4.6× |
Note: absolute FD scales with activation norm and is not directly comparable across source LLMs; the ratio against the irreducible real-vs-real lower bound is the meaningful quantity. For context, the paper's `glp-llama1b-d12-multi` (the multi-layer Llama-1B variant) reports a ratio of ~3.0×; ours is ~4.6×, plausibly due to the larger `d_input` (4096 vs 2048) and ~40% less per-layer training data than the paper's multi-layer setup.
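The FD numbers above follow the standard Fréchet distance between Gaussians fit to the two activation sets. A minimal reference implementation of the textbook formula (not the repo's evaluation code):

```python
import numpy as np
from scipy import linalg

def frechet_distance(x_real: np.ndarray, x_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussians fit to two (N, d) activation sets."""
    mu_r, mu_g = x_real.mean(axis=0), x_gen.mean(axis=0)
    cov_r = np.cov(x_real, rowvar=False)
    cov_g = np.cov(x_gen, rowvar=False)
    # Matrix square root of the covariance product; O(d^3), so this is
    # slow at d = 4096 but fine as a one-off evaluation.
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):  # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```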
## Patches over the upstream pipeline

Trained against the upstream code with a few patches (all upstreamable):

- Implemented `glp_save.py` (the producer; the upstream repo had a TODO).
- Made `MemmapWriter` close memmaps before unlink so the OS reclaims disk on rolling deletion.
- Vectorized the producer's per-token `MemmapWriter.write` into a `write_many` block-write (both memmap patches are sketched after this list).
- Replaced the O(N) `np.array(self.indices, …)` rebuilds in the producer's per-batch back-pressure check with O(1) `file_last_row` lookups; this fix brought the producer from 35 acts/sec/layer back into the ~700 acts/sec/layer regime as activations accumulated.
- Throttled `data_indices.npy` flushes to once every 50 producer batches.
- Added consumer resume from checkpoint + optimizer + scheduler state so the run could survive multi-restart compute interruptions.
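For concreteness, a minimal sketch of the two `MemmapWriter` patches; the class here is a hypothetical simplification, not the upstream implementation:

```python
import numpy as np

class MemmapWriterSketch:
    """Illustrative block-write memmap writer with close-before-unlink."""

    def __init__(self, path, num_rows, dim, dtype=np.float16):
        self.path = path
        self.mm = np.lib.format.open_memmap(
            path, mode="w+", dtype=dtype, shape=(num_rows, dim)
        )
        self.row = 0

    def write_many(self, block):
        # One slice assignment per batch instead of one write per token.
        n = len(block)
        self.mm[self.row : self.row + n] = block
        self.row += n

    def close(self):
        # Flush and drop the mapping *before* any os.unlink(self.path);
        # otherwise the OS keeps the space allocated until the mapping dies.
        self.mm.flush()
        del self.mm
```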
## Citation

If you use this model, please cite the original paper:

```bibtex
@article{luo2026glp,
  title={Learning a Generative Meta-Model of LLM Activations},
  author={Grace Luo and Jiahai Feng and Trevor Darrell and Alec Radford and Jacob Steinhardt},
  journal={arXiv preprint arXiv:2602.06964},
  year={2026}
}
```