Mamba Hypernetwork Personalization v2
A Mamba-based hypernetwork trained with GRPO that injects persona-conditioned deltas into the attention layers of a target LLM.
Architecture
- Hypernetwork: Mamba SSM encoder + delta heads (LoRA-style)
- Target LLM: deltas injected via forward hooks on q_proj / v_proj across 8 layers (see the hook sketch after this list)
- Training: GRPO with combined reward (RM + CR + PL + DIV)
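
The delta injection can be illustrated with PyTorch forward hooks. This is a minimal sketch, assuming a LLaMA-style module layout (`model.model.layers[i].self_attn.q_proj` / `.v_proj`) and that the hypernetwork emits one low-rank (A, B) pair per hooked projection; the helper names (`make_delta_hook`, `inject_deltas`) and the `deltas` container are illustrative, not the repo's actual API.

```python
import torch

def make_delta_hook(A: torch.Tensor, B: torch.Tensor, scale: float = 0.003):
    """Return a forward hook that adds a scaled LoRA-style delta x @ A @ B to the projection output."""
    def hook(module, inputs, output):
        x = inputs[0]                      # (batch, seq, hidden)
        return output + scale * (x @ A @ B)
    return hook

def inject_deltas(model, deltas, layers=range(8), scale=0.003):
    """Attach hooks to q_proj / v_proj of the first 8 layers (module path assumed, LLaMA-style).

    `deltas[(i, name)]` is assumed to hold the (A, B) pair produced by the
    hypernetwork for layer i and projection `name`.
    """
    handles = []
    for i in layers:
        block = model.model.layers[i].self_attn
        for name in ("q_proj", "v_proj"):
            A, B = deltas[(i, name)]
            handles.append(
                getattr(block, name).register_forward_hook(make_delta_hook(A, B, scale))
            )
    return handles  # call handle.remove() on each to detach the persona
```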
Training Config
- LR: 1e-5 (cosine schedule)
- LAMBDA_GRPO: 0.2
- LAMBDA_KL: 0.08
- DELTA_SCALE: 0.003
- Epochs: 5 | Steps: 350
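
For reference, the values above collected into a single config object. A minimal sketch; the field names are assumptions, only the values come from this card.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    lr: float = 1e-5              # peak learning rate, cosine decay
    lambda_grpo: float = 0.2      # weight on the GRPO policy-gradient term
    lambda_kl: float = 0.08       # weight on the KL penalty to the base policy
    delta_scale: float = 0.003    # scale applied to injected LoRA-style deltas
    epochs: int = 5
    steps: int = 350
    lr_schedule: str = "cosine"
```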
Reward Weights
| Metric | Weight | Description |
|---|---|---|
| RM | +0.55 | Persona grounding |
| CR | +0.25 | Context relevance |
| PL | -0.30 | Persona leakage penalty |
| DIV | +0.10 | Response diversity |
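
The combined reward is a weighted sum of the four metrics using the weights above. A minimal sketch; the function and argument names are assumptions, and each metric is assumed to be a scalar score per response (PL enters with a negative weight, so higher leakage lowers the reward).

```python
REWARD_WEIGHTS = {"RM": 0.55, "CR": 0.25, "PL": -0.30, "DIV": 0.10}

def combined_reward(rm: float, cr: float, pl: float, div: float) -> float:
    scores = {"RM": rm, "CR": cr, "PL": pl, "DIV": div}
    return sum(REWARD_WEIGHTS[k] * scores[k] for k in REWARD_WEIGHTS)

# Example: strong grounding and relevance, mild leakage, moderate diversity
# combined_reward(0.9, 0.8, 0.2, 0.5)
#   = 0.55*0.9 + 0.25*0.8 - 0.30*0.2 + 0.10*0.5 = 0.685
```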
Checkpoint Info
- Saved at: epoch 5, step 350
- Date: 2026-05-13
Files
- `mamba_weights_only.pt`: model weights only (for inference)
- `ckpt_e5_s350.pt`: full checkpoint (for resuming training)
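
A minimal loading sketch, assuming standard `torch.save` / `torch.load` serialization; the key names inside the full checkpoint (`"model"`, `"optimizer"`, `"step"`) are assumptions.

```python
import torch

# Inference: weights only
state_dict = torch.load("mamba_weights_only.pt", map_location="cpu")
# hypernet.load_state_dict(state_dict)   # `hypernet` = the Mamba hypernetwork module

# Resuming training: full checkpoint with optimizer/scheduler state
ckpt = torch.load("ckpt_e5_s350.pt", map_location="cpu")
# hypernet.load_state_dict(ckpt["model"])
# optimizer.load_state_dict(ckpt["optimizer"])
# start_step = ckpt.get("step", 0)
```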