GLP for Qwen3-8B (multi-layer, layers 0-17)

A Generative Latent Prior (GLP) trained on residual-stream activations of Qwen/Qwen3-8B, conditioned on layer index over layers 0-17 of 36 total. Trained on FineWeb text with the upstream producer-consumer activation-caching pipeline.

This is the multi-layer variant from the paper "Learning a Generative Meta-Model of LLM Activations" (Luo et al., 2026, arXiv:2602.06964), adapted to Qwen3-8B and to the d6 denoiser size used in their main glp-llama8b-d6 release. Training used the open-source pipeline at g-luo/generative_latent_prior, with a few patches contributed in the course of training (see notes below).

Files

  • final.safetensors – main checkpoint, gradient step 240,000
  • checkpoints/step_*.safetensors – intermediate checkpoints
  • config.yaml – training config (denoiser size, layer set, etc.)
  • rep_statistics.pt – per-layer mean/var for the Normalizer (see the loading sketch after this list)
  • eval_fd.json – Frechet Distance evaluation result
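
Outside of load_glp, the statistics file can be inspected directly. A minimal loading sketch, assuming rep_statistics.pt holds per-layer mean/var tensors under "mean" and "var" keys (the key names are hypothetical; check the upstream Normalizer for the exact layout):

import torch

# Hypothetical layout: per-layer statistics consumed by the Normalizer.
stats = torch.load("rep_statistics.pt", map_location="cpu")

def normalize(acts, layer_idx, eps=1e-6):
    # Standardize raw activations with the cached per-layer statistics.
    mean = stats["mean"][layer_idx]  # assumed key, shape (4096,)
    var = stats["var"][layer_idx]    # assumed key, shape (4096,)
    return (acts - mean) / (var + eps).sqrt()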

Quickstart

from glp.denoiser import load_glp
model = load_glp("sudoers/glp-qwen3-8b-d6-multi", device="cuda:0", checkpoint="final")

# layer-conditioned sampling (must pass layer_idx for multi-layer models):
from glp import flow_matching
import torch
noise = torch.randn(1024, 1, 4096, device="cuda:0")
samples = flow_matching.sample(model, noise, num_timesteps=1000, layer_idx=17)
samples = model.normalizer.denormalize(samples, layer_idx=17)
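
The same API extends to a sweep over all modeled layers. A sketch reusing the objects above (this loop is illustrative, not part of the upstream examples):

# Sample a smaller batch per modeled layer and denormalize each one.
per_layer = {}
for layer in range(18):  # layers 0-17
    noise = torch.randn(256, 1, 4096, device="cuda:0")
    x = flow_matching.sample(model, noise, num_timesteps=1000, layer_idx=layer)
    per_layer[layer] = model.normalizer.denormalize(x, layer_idx=layer)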

Architecture

Source LLM: Qwen/Qwen3-8B
Layers modeled: 0-17 (multi-layer; conditioned via sinusoidal layer embedding)
d_input: 4096
d_model: 8192
d_mlp: 16384
Denoiser depth (n_layers): 6
multi_layer_n_layers: 18
Total params: ~3.5B
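
The layer conditioning above refers to a sinusoidal embedding of the layer index. A minimal sketch of the standard transformer formulation (names and dimensions are illustrative, not the upstream code):

import math
import torch

def sinusoidal_layer_embedding(layer_idx: int, dim: int = 8192) -> torch.Tensor:
    # Classic sin/cos embedding, applied to a layer index instead of a token position.
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    angles = layer_idx * freqs
    return torch.cat([torch.sin(angles), torch.cos(angles)])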

Training

Activations source: HuggingFaceFW/fineweb (sample-100BT shard)
Token positions: all positions per doc except BOS, max length 2048
Activations per layer: ~36M
Total samples (per-layer × n_layers): ~648M
Optimizer: AdamW, lr 5e-5
Schedule: cosine with 1% warmup
Batch size: 4096
Precision: bf16 mixed
Hardware: 2× H100 80GB (1 producer, 1 consumer)
Total gradient steps: 240,000
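
The optimizer and schedule rows are reproducible with a warmup-then-cosine lambda. A sketch under the numbers above (peak lr 5e-5; 1% of 240,000 steps = 2,400 warmup steps), assuming the model object from the Quickstart rather than the upstream trainer:

import math
import torch

TOTAL_STEPS, WARMUP_STEPS = 240_000, 2_400  # 1% warmup

def lr_lambda(step: int) -> float:
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS  # linear warmup to the peak lr
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay to 0

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)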

Evaluation

Frechet Distance on layer 17 between 50K real Qwen3-8B activations and 50K samples generated from the GLP (1000-timestep sampling, denormalized into the original activation space):

Checkpoint: final.safetensors (step 240,000)
FD: 22,698
Lower bound (real-vs-real): 4,925
Ratio: 4.6×

Note: absolute FD scales with activation norm and is not directly comparable across source LLMs. The ratio against the irreducible real-vs-real lower bound is the meaningful quantity. For context, the paper's glp-llama1b-d12-multi (the multi-layer Llama1B variant) reports a ratio of ~3.0×; ours is ~4.6×, plausibly due to the larger d_input (4096 vs 2048) and ~40% less per-layer training data than the paper's multi-layer setup.
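
The FD numbers follow the standard Frechet formula between Gaussians fitted to the real and generated activation sets. A sketch of that computation (not the exact eval script behind eval_fd.json):

import numpy as np
from scipy import linalg

def frechet_distance(x_real: np.ndarray, x_gen: np.ndarray) -> float:
    # ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^(1/2)),
    # with each set summarized by its empirical mean and covariance.
    mu_r, mu_g = x_real.mean(axis=0), x_gen.mean(axis=0)
    s_r = np.cov(x_real, rowvar=False)
    s_g = np.cov(x_gen, rowvar=False)
    covmean = linalg.sqrtm(s_r @ s_g).real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(s_r + s_g - 2.0 * covmean))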

Patches over the upstream pipeline

The model was trained against the upstream code with a few patches (all upstreamable):

  • Implemented glp_save.py (the producer; upstream had left it as a TODO).
  • Made MemmapWriter close memmaps before unlink so the OS reclaims disk space on rolling deletion.
  • Vectorized the producer's per-token MemmapWriter.write into a write_many block write.
  • Replaced the O(N) np.array(self.indices, …) rebuilds in the producer's per-batch back-pressure check with O(1) file_last_row lookups; this restored producer throughput from 35 acts/sec/layer to the ~700 acts/sec/layer regime as activations accumulated.
  • Throttled data_indices.npy flushes to once every 50 producer batches.
  • Added consumer resume of model, optimizer, and scheduler state so the run could survive repeated compute interruptions (see the sketch after this list).
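
The resume patch follows the usual PyTorch pattern of snapshotting all three state dicts together. A sketch with hypothetical file and variable names, not the patched code verbatim:

import torch

# Inside the training loop: save enough state to resume exactly.
torch.save({
    "step": step,
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "scheduler": scheduler.state_dict(),
}, "checkpoints/consumer_state.pt")  # hypothetical path

# On restart: restore all three before continuing.
state = torch.load("checkpoints/consumer_state.pt", map_location="cpu")
model.load_state_dict(state["model"])
optimizer.load_state_dict(state["optimizer"])
scheduler.load_state_dict(state["scheduler"])
step = state["step"]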

Citation

If you use this model, please cite the original paper:

@article{luo2026glp,
  title={Learning a Generative Meta-Model of LLM Activations},
  author={Grace Luo and Jiahai Feng and Trevor Darrell and Alec Radford and Jacob Steinhardt},
  journal={arXiv preprint arXiv:2602.06964},
  year={2026}
}