---
language:
- en
- vi
tags:
- mamba
- hypernetwork
- persona
- grpo
- personalization
license: mit
---

# Mamba Hypernetwork Personalization v2

A Mamba-based hypernetwork, trained with GRPO, that injects persona-conditioned weight deltas into the attention layers of a target LLM.

## Architecture
- **Hypernetwork:** Mamba SSM encoder + delta heads (LoRA-style)
- **Target LLM:** deltas injected via forward hooks on `q_proj` / `v_proj` (8 layers)
- **Training:** GRPO with a combined reward (RM + CR + PL + DIV)

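The hook-based injection listed above can be sketched roughly as follows. This is a minimal illustration, not the released implementation: `DeltaHook`, `attach_hooks`, and `ToyAttention` are hypothetical names, the low-rank factors `A` and `B` stand in for whatever the delta heads actually emit, and the sketch hooks every matching module because the card does not say which 8 layers are targeted.

```python
import torch
import torch.nn as nn


class DeltaHook:
    """Forward hook that adds a scaled low-rank update to a linear layer's output."""

    def __init__(self, scale: float = 0.003):
        self.scale = scale  # DELTA_SCALE from the training config
        self.A = None       # assumed shape: (rank, in_features)
        self.B = None       # assumed shape: (out_features, rank)

    def set_delta(self, A: torch.Tensor, B: torch.Tensor) -> None:
        # In the real system these would come from the hypernetwork (assumption).
        self.A, self.B = A, B

    def __call__(self, module, inputs, output):
        if self.A is None or self.B is None:
            return None  # returning None leaves the layer output unchanged
        x = inputs[0]
        delta = (x @ self.A.T) @ self.B.T  # LoRA-style update: B (A x)
        return output + self.scale * delta


def attach_hooks(model: nn.Module, targets=("q_proj", "v_proj")):
    """Register a DeltaHook on every nn.Linear whose name ends in a target."""
    hooks, handles = {}, []
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear) and name.split(".")[-1] in targets:
            hook = DeltaHook()
            handles.append(module.register_forward_hook(hook))
            hooks[name] = hook
    return hooks, handles


class ToyAttention(nn.Module):
    """Stand-in for one attention block, used only to exercise the hooks."""

    def __init__(self, dim: int = 16):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)
```

Calling `handle.remove()` on each returned handle restores the base model, so the target LLM's own weights are never modified in this sketch.
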
## Training Config
- LR: 1e-5 (cosine schedule)
- LAMBDA_GRPO: 0.2
- LAMBDA_KL: 0.08
- DELTA_SCALE: 0.003
- Epochs: 5 | Steps: 350

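For reference, the values above can be collected into a small config object, together with one plausible reading of "cosine schedule". The sketch assumes that 350 is the total number of optimizer steps, that there is no warmup, and that the learning rate decays to zero; none of these is stated in this card. `TrainConfig` and `cosine_lr` are illustrative names.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    lr: float = 1e-5            # peak learning rate
    lambda_grpo: float = 0.2    # LAMBDA_GRPO
    lambda_kl: float = 0.08     # LAMBDA_KL
    delta_scale: float = 0.003  # DELTA_SCALE
    epochs: int = 5
    total_steps: int = 350      # assumed to be the total step count


def cosine_lr(step: int, cfg: TrainConfig, min_lr: float = 0.0) -> float:
    """Cosine decay from cfg.lr at step 0 to min_lr at cfg.total_steps."""
    clamped = min(max(step, 0), cfg.total_steps)
    progress = clamped / cfg.total_steps
    return min_lr + 0.5 * (cfg.lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the learning rate is 1e-5 at step 0, about half of that at step 175, and 0 at step 350.
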
## Reward Weights
| Metric | Weight | Description |
|--------|--------|-------------|
| RM | +0.55 | Persona grounding |
| CR | +0.25 | Context relevance |
| PL | -0.30 | Persona leakage penalty |
| DIV | +0.10 | Response diversity |

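One way to read the table is as a linear weighted sum of per-response scores. The sketch below assumes exactly that, plus the group-relative normalization that GRPO commonly applies to rewards for responses sampled from the same prompt. The function names and the idea that each component score lies in [0, 1] are assumptions, not details confirmed by this card.

```python
import statistics
from typing import Mapping, Sequence

# Weights copied from the table above.
REWARD_WEIGHTS = {"RM": 0.55, "CR": 0.25, "PL": -0.30, "DIV": 0.10}


def combined_reward(scores: Mapping[str, float]) -> float:
    """Weighted sum of component scores (assumed linear combination)."""
    missing = REWARD_WEIGHTS.keys() - scores.keys()
    if missing:
        raise KeyError(f"missing reward components: {sorted(missing)}")
    return sum(weight * scores[name] for name, weight in REWARD_WEIGHTS.items())


def group_advantages(rewards: Sequence[float], eps: float = 1e-8) -> list:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


example = {"RM": 0.8, "CR": 0.6, "PL": 0.1, "DIV": 0.5}
print(round(combined_reward(example), 2))  # 0.61
```

Because PL carries a negative weight, a response that leaks more persona detail scores lower even when its grounding score is unchanged.
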
## Checkpoint Info
- Saved at: epoch 5, step 350
- Date: 2026-05-13

## Files
- `mamba_weights_only.pt`: model weights only (for inference)
- `ckpt_e5_s350.pt`: full checkpoint (for resuming training)

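The internal layout of the two files is not documented here, so a cautious first step is to load one on the CPU and list its top-level entries before wiring it into a model. The sketch below does only that; `load_checkpoint` and `summarize` are hypothetical helpers, and the assumption that each file holds a plain dictionary may not hold.

```python
import torch


def load_checkpoint(path: str):
    """Load a .pt file onto the CPU and return the deserialized object."""
    # weights_only=True restricts unpickling to tensors and plain containers.
    return torch.load(path, map_location="cpu", weights_only=True)


def summarize(obj, max_items: int = 10):
    """Return (key, description) pairs for the top level of a checkpoint."""
    if not isinstance(obj, dict):
        return [("<root>", type(obj).__name__)]
    rows = []
    for key, value in list(obj.items())[:max_items]:
        if isinstance(value, torch.Tensor):
            desc = f"tensor{tuple(value.shape)}"
        else:
            desc = type(value).__name__
        rows.append((str(key), desc))
    return rows
```

For example, `summarize(load_checkpoint("mamba_weights_only.pt"))` would show whether the file is a flat state dict or a nested structure. If loading fails with `weights_only=True`, the file contains objects beyond tensors and basic containers, which is more likely for the full training checkpoint.
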