# Activation Avatars – Adapter Checkpoints
Read the full post here.
Small neural networks (~2M params) that map Qwen3-4B hidden-state activations into FLUX.2-Klein prompt embeddings, producing real-time avatar expressions that reflect the model's internal state during generation.
## Adapters
| Checkpoint | Architecture | LLM | Layers |
|---|---|---|---|
| `crossattn_instruct_diverse.pt` | CrossAttention (n_input=4, 2 decoder layers) | Qwen3-4B-Instruct | 9, 18, 27 (learned weight) |
| `xattn8tok_thinking.pt` | CrossAttention (n_input=8, 2 decoder layers) | Qwen3-4B-Thinking | 9, 18, 27 (learned weight) |
| `multitoken_v7_k32_L24.pt` | MultiToken (K=32) | Qwen3-4B-Thinking | 24 |
All adapters output 64 tokens of 7680-dim embeddings (Klein's prompt embedding space), except `multitoken_v7_k32_L24.pt`, which outputs 32 tokens.
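These expression tokens are appended to the base prompt embedding along the sequence dimension, so the shapes can be sanity-checked with plain tensors, no models required. A minimal sketch (the 256-token base sequence length here is illustrative, matching the `max_sequence_length=256` used in the usage example):

```python
import torch

# Stand-in tensors with the shapes described above (no models loaded).
base_embeds = torch.zeros(1, 256, 7680)  # [batch, seq, dim] from Klein's text encoder
expression = torch.zeros(64, 7680)       # adapter output: 64 tokens of 7680 dims

# Append the expression tokens along the sequence dimension.
prompt_embeds = torch.cat([base_embeds, expression.unsqueeze(0)], dim=1)
print(prompt_embeds.shape)  # torch.Size([1, 320, 7680])
```

For `multitoken_v7_k32_L24.pt` the expression tensor would be `[32, 7680]` instead, giving a 288-token combined sequence.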
Each adapter was trained with slightly different training data and self-description prompts, so they may produce different-looking avatars. It is worth trying them all and comparing: they may also respond differently to the `emotion_scale` parameter. Adapters trained on the Instruct model can also be used with the Thinking model and vice versa, since the underlying architecture is the same.
## Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from diffusers import Flux2KleinPipeline

from adapter import load_adapter

# Load adapter
adapter = load_adapter("adapters/xattn8tok_thinking.pt", device="cuda", dtype=torch.bfloat16)
print(adapter.metadata)
# {'model_type': 'cross_attention', 'hook_layers': [9, 18, 27], ...}

# Load LLM (Thinking or Instruct – adapters work with either)
model_name = "Qwen/Qwen3-4B-Thinking-2507"  # or "Qwen/Qwen3-4B-Instruct-2507"
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = AutoModelForCausalLM.from_pretrained(model_name, dtype=torch.bfloat16).cuda()

# Load Klein
klein = Flux2KleinPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-klein-4B", dtype=torch.bfloat16,
).to("cuda")

# Hook activations from the layers this adapter expects
activations = {}

def make_hook(layer_idx):
    def hook_fn(module, input, output):
        # Some layers return tuples; the hidden states are the first element
        hidden = output[0] if not isinstance(output, torch.Tensor) else output
        activations[layer_idx] = hidden[0, -1, :].detach()
    return hook_fn

handles = [llm.model.layers[i].register_forward_hook(make_hook(i))
           for i in adapter.hook_layers]

# Generate some tokens to build up activations
messages = [{"role": "user", "content": "AGHHH!!! I'm in terrible pain!! HELP ME!"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to("cuda")
with torch.no_grad():
    output = llm.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.7)

# The hooks now hold the last token's hidden state from the final forward pass;
# concatenate across the hooked layers and map to an expression embedding
act = torch.cat([activations[i] for i in adapter.hook_layers], dim=0)
expression = adapter(act, emotion_scale=6.0)  # [64, 7680]

# Encode a base character description
character = "portrait of a young human-like boy cyborg with blue eyes, soft lighting, digital art style"
with torch.no_grad():
    base_embeds, _ = klein.encode_prompt(
        prompt=character, device="cuda",
        num_images_per_prompt=1, max_sequence_length=256,
    )

# Combine base character + expression and render (match pipeline dtype/device)
expression = expression.to(device=base_embeds.device, dtype=base_embeds.dtype)
prompt_embeds = torch.cat([base_embeds, expression.unsqueeze(0)], dim=1)
image = klein(
    prompt_embeds=prompt_embeds,
    height=512, width=512,
    guidance_scale=1.0, num_inference_steps=4,
).images[0]
image.save("avatar.png")

# Cleanup
for h in handles:
    h.remove()
```
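The hook pattern above can be exercised on CPU without loading Qwen3-4B. A minimal sketch with a toy stack of linear layers standing in for the LLM (the layer indices and sizes here are arbitrary, and unlike the real decoder layers these modules return plain tensors rather than tuples):

```python
import torch
import torch.nn as nn

# Toy stand-in for the LLM: a small stack of layers with a [batch, seq, hidden] flow.
layers = nn.ModuleList([nn.Linear(16, 16) for _ in range(4)])

activations = {}

def make_hook(layer_idx):
    def hook_fn(module, inputs, output):
        # Keep only the last token's hidden state, as in the usage example.
        activations[layer_idx] = output[0, -1, :].detach()
    return hook_fn

hook_layers = [1, 3]  # stand-ins for adapter.hook_layers
handles = [layers[i].register_forward_hook(make_hook(i)) for i in hook_layers]

x = torch.randn(1, 5, 16)  # [batch, seq, hidden]
for layer in layers:
    x = layer(x)

# One hidden-state vector per hooked layer, concatenated into the adapter input.
act = torch.cat([activations[i] for i in hook_layers], dim=0)
print(act.shape)  # torch.Size([32])

for h in handles:
    h.remove()
```

After `h.remove()` the hooks no longer fire, which matters if you keep the model around for further generations with a different adapter.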