---
license: cc-by-nc-4.0
base_model: Qwen/Qwen2.5-7B-Instruct
pipeline_tag: text-generation
library_name: peft
language:
- en
- zh
tags:
- hypernetwork
- hyper-lora
- lora
- role-play
- character-impersonation
- sft
- phase-tree
datasets:
- IAAR-Shanghai/phase_tree_data
---
# PHASE-Tree Hyper-LoRA SFT (anchor run)
**Variant:** Warm-start, lr=5e-6 (anchor run)
The **anchor** SFT run: hypernet warm-started from the PHASE-Tree pretrained
checkpoint and fine-tuned at a conservative learning rate of 5e-6 with label
smoothing 0.1 and NEFTune noise 5.0. This is the checkpoint reported in the
PHASE-Tree paper.
During development, six hyper-LoRA SFT cells were trained as an ablation grid
over initialisation (warm-start vs cold-start), learning rate (5e-6 vs 1e-5),
and trainable vs frozen hypernet output heads. Only this anchor cell is
bundled here; the other five are kept locally for reproducibility.
## What is a hypermod?
A **hypermod** (hyper-modulator) is a hypernetwork that, conditioned on a
character profile embedding, emits a low-rank LoRA delta `ΔW = AB` for each
target layer of the base model at inference time. The base model weights are
never updated; only the hypernet is trained. A single hypermod therefore
generalises across an open-ended set of personas without needing to store a
separate adapter per character.
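As a rough illustration of the idea above, the sketch below maps a profile embedding to a rank-8 delta for a single layer using NumPy. The single linear head per factor, the toy dimensions, and the function name `emit_lora_delta` are assumptions made for illustration only; the actual hypernet architecture is defined by `args.yaml` and the PHASE-Tree codebase, and any LoRA scaling factor is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM = 32   # profile-embedding size (illustrative, not the real value)
D_OUT = 64     # output features of the target projection (illustrative)
D_IN = 64      # input features of the target projection (illustrative)
RANK = 8       # LoRA rank, matching the r = 8 used by this checkpoint

# Hypothetical hypernet heads: one linear map per low-rank factor.
# Only parameters like these would be trained; the base weight stays frozen.
head_A = rng.normal(scale=0.02, size=(EMB_DIM, D_OUT * RANK))
head_B = rng.normal(scale=0.02, size=(EMB_DIM, RANK * D_IN))


def emit_lora_delta(profile_emb: np.ndarray) -> np.ndarray:
    """Return a low-rank delta A @ B generated from one profile embedding."""
    A = (profile_emb @ head_A).reshape(D_OUT, RANK)
    B = (profile_emb @ head_B).reshape(RANK, D_IN)
    return A @ B


W = rng.normal(size=(D_OUT, D_IN))       # frozen base weight (never updated)
profile_emb = rng.normal(size=EMB_DIM)   # stand-in for a character profile embedding
delta_W = emit_lora_delta(profile_emb)
W_adapted = W + delta_W                  # applied at inference; W itself is unchanged
```

Feeding a different profile embedding through the same heads yields a different `delta_W`, which is why one set of hypernet weights can serve many personas.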
## Files
| File | Purpose |
|------|---------|
| `hypermod.pt` | **Recommended checkpoint.** The anchor SFT step selected from per-step LLM-as-judge ratings (`character`, `semantic`) and Qwen3-Embedding-4B response-vs-reference cosine similarity. |
| `args.yaml` | Full training configuration; consumed by the loader to instantiate the hypernet architecture. |
| `adapter_config.json` | LoRA target-module stub (rank 8, alpha 16, `q_proj` + `v_proj`). |
| `timing_stats.json` | Wall-clock breakdown of the training run (training / validation / other overhead, in seconds). |
> Per-step snapshots (`checkpoints/it_5000` … `it_40000`) and the post-hoc
> evaluation artefacts (`eval_ckpt_judge_scores/`, `eval_ckpt_val_loss/`)
> generated during training are **not bundled** with this release. They can
> be regenerated by re-running `src/scripts/train_phase_tree_qwen_7b.sh`
> followed by the evaluation scripts under `src/scripts/`.
## How to load
```python
from huggingface_hub import snapshot_download
from hyper_llm_modulator.hyper_modulator import load_hypermod_checkpoint
ckpt_dir = snapshot_download("<your-hf-username>/PHASE-Tree-hyper-lora-anchor")
(
args, hypermod, base_model, tokenizer,
emb_model, emb_tokenizer, task_desc_format_fn, pooling_fn,
) = load_hypermod_checkpoint(f"{ckpt_dir}/hypermod.pt", device="cuda")
```
The loader reads `args.yaml` and `adapter_config.json` from the same directory
as `hypermod.pt` automatically. The full inference pipeline (profile →
embedding → per-layer LoRA → generation) lives in the PHASE-Tree codebase.
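Because the loader expects these files side by side, a quick pre-flight check can catch an incomplete download before any model weights are loaded. The helper below is a hypothetical convenience, not part of the PHASE-Tree codebase; it only looks for the file names listed in the Files table.

```python
from pathlib import Path

# File names the loader is described as reading from the checkpoint directory.
REQUIRED_FILES = ("hypermod.pt", "args.yaml", "adapter_config.json")


def missing_checkpoint_files(ckpt_dir: str) -> list[str]:
    """Return the required file names that are absent from `ckpt_dir`."""
    root = Path(ckpt_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

Calling `missing_checkpoint_files(ckpt_dir)` on the snapshot directory should return an empty list when the download is complete.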
## Training configuration
| Hyperparameter | Value |
|----------------|-------|
| Base model | `Qwen/Qwen2.5-7B-Instruct` |
| Task encoder | `Qwen/Qwen3-Embedding-4B` |
| Initialisation | Warm-start from `phase_tree_models/phase_tree_pretrained/hypermod.pt` |
| Target modules | `q_proj`, `v_proj` |
| LoRA rank `r` | 8 |
| LoRA alpha | 16 |
| LoRA dropout | 0.05 |
| Hypernet latent size | 1024 |
| Hypernet head input size | 2048 |
| Freeze hypernet heads | `false` |
| Optimizer steps | 40000 |
| Effective batch size | 8 (per-device 4 × grad-accum 2) |
| Learning rate | 5e-6 |
| Warmup fraction | 0.05 |
| Weight decay | 0.01 |
| Label smoothing | 0.1 |
| NEFTune noise α | 5.0 |
| Checkpoint cadence | every 5000 steps |
| Random seed | 42 |
The complete configuration (including dataset lists, sampler settings, and
fusion-module placeholders kept for loader compatibility) lives in `args.yaml`.
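A few derived quantities follow directly from the table and can be checked with simple arithmetic. The sketch below assumes a single device, assumes the warmup fraction is taken over the optimizer steps, and uses the conventional LoRA scaling of alpha divided by rank; whether the training code applies exactly these conventions is not confirmed by this card.

```python
optimizer_steps = 40_000
per_device_batch = 4
grad_accum = 2
warmup_fraction = 0.05
checkpoint_every = 5_000
lora_rank, lora_alpha = 8, 16

# Assumes one device; with N devices this product would be multiplied by N.
effective_batch = per_device_batch * grad_accum

# Assumes warmup is measured as a fraction of the optimizer steps.
warmup_steps = round(optimizer_steps * warmup_fraction)

# Sequences processed over the whole run at the effective batch size.
sequences_seen = optimizer_steps * effective_batch

# Steps at which a snapshot is written under a fixed checkpoint cadence.
snapshot_steps = list(range(checkpoint_every, optimizer_steps + 1, checkpoint_every))

# Conventional LoRA scaling factor alpha / r.
lora_scale = lora_alpha / lora_rank

print(effective_batch, warmup_steps, sequences_seen, len(snapshot_steps), lora_scale)
```

Under these assumptions the run has 2,000 warmup steps, processes 320,000 sequences, and produces eight snapshots from step 5,000 to step 40,000.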
## Training data
The hypermod is jointly fine-tuned on the *train* splits of the eight
PHASE-Tree character-dialogue datasets (RAIDEN, CharacterEval, HPD, SimsConv,
ChatHaruhi, Friends, StarTrek_TNG, TheOffice), `m6_phase_tree` profile variant.
Sampling follows the hierarchical `sqrt_size` strategy with 6 tasks × 2 points
per batch.
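One plausible reading of the `sqrt_size` strategy is a two-level draw: pick tasks with probability proportional to the square root of their size, then pick points within each chosen task. The sketch below illustrates that reading only. It treats each dataset as a task, samples with replacement, and uses made-up dataset sizes; none of these details are confirmed by this card, and the real sampler lives in the PHASE-Tree codebase.

```python
import math
import random

# Hypothetical example counts; the real dataset sizes are not stated in this card.
dataset_sizes = {
    "RAIDEN": 10_000, "CharacterEval": 4_000, "HPD": 900, "SimsConv": 2_500,
    "ChatHaruhi": 6_400, "Friends": 1_600, "StarTrek_TNG": 400, "TheOffice": 100,
}


def sqrt_size_weights(sizes: dict[str, int]) -> dict[str, float]:
    """Normalised sampling weights proportional to sqrt(dataset size)."""
    roots = {name: math.sqrt(n) for name, n in sizes.items()}
    total = sum(roots.values())
    return {name: r / total for name, r in roots.items()}


def sample_batch(sizes, n_tasks=6, n_points=2, rng=random):
    """Draw `n_tasks` datasets by sqrt-size weight, then `n_points` indices from each."""
    weights = sqrt_size_weights(sizes)
    names = list(weights)
    tasks = rng.choices(names, weights=[weights[n] for n in names], k=n_tasks)
    return [(task, rng.randrange(sizes[task])) for task in tasks for _ in range(n_points)]


batch = sample_batch(dataset_sizes, rng=random.Random(0))
```

Square-root weighting dampens the dominance of large corpora: in this toy setup a dataset 100 times larger than another is sampled only 10 times as often.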
## Evaluation
The released `hypermod.pt` was selected from per-step snapshots of the
training run by scoring predictions on a held-out evaluation set along
three axes:
- **`character` (1–5)**: profile-consistency rating by an LLM judge (see
`evaluation/persona_rubric.md` in the PHASE-Tree codebase for the rubric).
- **`semantic` (1–5)**: contextual-coherence rating by the same judge.
- **`embedding`**: cosine similarity of the predicted and reference response
embeddings computed with Qwen3-Embedding-4B.
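For reference, the `embedding` axis reduces to a standard cosine similarity between two vectors. The sketch below uses plain Python and toy 3-dimensional vectors as stand-ins for the real embeddings; how the evaluation scripts batch, pool, or normalise embeddings is not described in this card.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


# Toy stand-ins for the predicted and reference response embeddings.
predicted = [1.0, 2.0, 2.0]
reference = [2.0, 4.0, 4.0]
score = cosine_similarity(predicted, reference)
```

The score depends only on direction, so the two toy vectors above, which differ only in length, are treated as identical.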
The per-step intermediate snapshots and full evaluation artefacts produced
during model selection are not bundled (see the note above the loading
section); they can be regenerated from a re-training run via the scripts
under `src/scripts/`.
## Limitations
- Persona conditioning is mediated entirely by the profile embedding fed into
the task encoder; the model has no other persona-control surface.
- Generations may reproduce stylistic biases of the source corpora; intended
for research evaluation only.
- The checkpoint depends on the PHASE-Tree codebase for inference and is not a
drop-in `peft.PeftModel`: `adapter_config.json` describes only which layers
receive a generated LoRA, not directly loadable weights.
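To make the last point concrete, the sketch below parses a stub of the kind described in this card and lists the modules that would receive a generated LoRA. The JSON contents are an assumption reconstructed from the values stated here (rank 8, alpha 16, dropout 0.05, `q_proj` + `v_proj`) using PEFT's usual `LoraConfig` field names; the shipped `adapter_config.json` may contain additional fields.

```python
import json

# Assumed stub contents, reconstructed from the values stated in this card.
stub_text = """
{
  "peft_type": "LORA",
  "r": 8,
  "lora_alpha": 16,
  "lora_dropout": 0.05,
  "target_modules": ["q_proj", "v_proj"]
}
"""

config = json.loads(stub_text)

# The stub names the layers to adapt; the LoRA weights themselves are
# produced by the hypermod at inference time rather than stored on disk.
target_modules = sorted(config["target_modules"])
print(f"rank={config['r']}, targets={target_modules}")
```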