Slim model card: KL div method, hyperparams, infrastructure
README.md
CHANGED
@@ -10,101 +10,46 @@ tags:
 - personality-prediction
 - big-five
 - bayesian-grm
-datasets:
-- custom
 ---
 
 # Qwen 2.5 7B SoftLabel
 
-LoRA adapter
-
-| | |
-|---|---|
 | **Base model** | [Qwen 2.5 7B Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) |
-| **Teacher** | Bayesian GRM with IPIP-50 items |
-| **Test accuracy** | TBD |
-| **Teacher ceiling** | 51.28% |
-
-## Training Details
 
 ### Data
 
-- **Test set**: 3,125 episodes
-- Each episode contains ~10 IPIP-50 questionnaire items with soft probability targets over responses 1-5
 ### Hyperparameters
 
-| | |
-|---|---|
-| LoRA
-| Warmup steps | 100 |
-| Weight decay | 0.01 |
-| Optimizer | AdamW (fused) |
-| Max grad norm | 1.0 |
-| Precision | bf16 |
-| Per-GPU batch size | 2 |
-| Gradient accumulation | 4 |
-| Effective batch size | 64 (2 x 8 GPUs x 4) |
-| Max epochs | 3 (with early stopping) |
-| Early stopping patience | 5 evaluations |
 | Max sequence length | 4096 |
-| Seed | 3407 |
 
-### Results
-
-| | |
-|---|---|
-| Best checkpoint | Step 1000 |
 | Best eval loss (KL div) | 0.000756 |
-| Final train loss | 0.0008 |
-
-- **Hardware**: NVIDIA RTX A6000 48GB x 8
-- **Infrastructure**: University cluster (SLURM, 8x A6000)
-- **Attention**: SDPA (auto-dispatch to FlashAttention/math backend)
-- **Gradient checkpointing**: Enabled (use_reentrant=False)
-- **DDP**: 8-way data parallel (NCCL backend)
-
-## Usage
-
-```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
-from peft import PeftModel
-
-base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
-model = PeftModel.from_pretrained(base_model, "DavidL123/qwen-2.5-7b-SoftLabel")
-tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
-```
-
-## How It Works
-
-1. **Bayesian GRM Teacher**: A psychometric model estimates posterior probability distributions over 5-point Likert responses for each IPIP-50 personality item
-2. **Soft Labels**: Instead of hard labels (e.g., "the answer is 4"), the target is a full distribution (e.g., [0.02, 0.08, 0.15, 0.50, 0.25])
-3. **KL Divergence Loss**: The LLM is trained to minimize KL(teacher || student) over the 5 answer tokens, producing calibrated probability estimates
-4. **Evaluation**: At test time, the model's argmax prediction over the 5 tokens is compared to the teacher's argmax (teacher ceiling: 51.28%)
-
-## Citation
-
-```
-@misc{levy2026personality,
-  title={Personality Prediction via Soft-Label Fine-Tuning with Bayesian Psychometric Teachers},
-  author={David Levy},
-  year={2026},
-}
-```
+LoRA adapter fine-tuned with **KL divergence** against soft probability distributions from a Bayesian Graded Response Model (GRM) teacher for Big Five personality prediction.
+
+Unlike standard cross-entropy (hard labels), KL divergence training preserves the teacher's uncertainty, producing calibrated probability estimates over 5-point Likert responses.
+
+## Training
+
+| | |
+|---|---|
 | **Base model** | [Qwen 2.5 7B Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) |
+| **Loss** | KL Divergence (batchmean) |
+| **Precision** | bf16 |
+| **Infrastructure** | University cluster (SLURM) — 4x NVIDIA RTX A6000 48GB |
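The KL-divergence objective described above can be sketched as follows. This is a minimal illustration, assuming the student's logits have already been gathered at the five Likert answer tokens; `soft_label_kl_loss` and the tensor names are hypothetical, not taken from the actual training code.

```python
import torch
import torch.nn.functional as F

def soft_label_kl_loss(answer_logits: torch.Tensor,
                       teacher_probs: torch.Tensor) -> torch.Tensor:
    """KL(teacher || student) with batchmean reduction.

    answer_logits: (batch, 5) student logits at the tokens for "1".."5".
    teacher_probs: (batch, 5) soft targets from the Bayesian GRM teacher.
    """
    # F.kl_div expects log-probabilities as input and computes
    # target * (log(target) - input) pointwise, then reduces.
    student_log_probs = F.log_softmax(answer_logits, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

# Example: teacher distribution concentrated on response 4.
logits = torch.tensor([[0.1, 0.2, 0.5, 2.0, 1.0]])
teacher = torch.tensor([[0.02, 0.08, 0.15, 0.50, 0.25]])
loss = soft_label_kl_loss(logits, teacher)
```

With `reduction="batchmean"` the summed divergence is divided by the batch size, matching the "KL Divergence (batchmean)" loss named in the card.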
 
 ### Data
 
+- 11,250 train / 1,250 valid / 3,125 test episodes
+- Each episode: multi-turn IPIP-50 personality questionnaire with soft label targets over responses 1–5
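As an illustration of the episode format described here, a single questionnaire item with its soft target might look like the record below. The field names are assumptions for illustration, not the dataset's actual schema; the example distribution is the one quoted elsewhere in the card.

```python
# Hypothetical episode record; keys are illustrative, not the real schema.
episode = {
    "items": [
        {
            "text": "I am the life of the party.",  # an IPIP-50 item
            "soft_target": [0.02, 0.08, 0.15, 0.50, 0.25],  # P(response = 1..5)
        },
        # ...more questionnaire items in the same episode
    ],
}

# Each soft target is a full probability distribution over the five responses.
probs = episode["items"][0]["soft_target"]
```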
 ### Hyperparameters
 
+| | |
+|---|---|
+| LoRA r / alpha / dropout | 16 / 16 / 0.05 |
+| Target modules | q, k, v, o, gate, up, down proj |
+| Learning rate | 1.5e-4 (cosine schedule, 100 warmup steps) |
+| Effective batch size | 32 (2 per-GPU x 4 GPUs x 4 grad accum) |
+| Max epochs | 3 (early stopping, patience=5) |
+| Optimizer | AdamW fused (weight decay 0.01) |
 | Max sequence length | 4096 |
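The LoRA settings in the table can be expressed as a PEFT adapter config. This is a sketch assuming the standard PEFT/Transformers module names for Qwen 2.5's attention and MLP projections, not the card's actual training script.

```python
from peft import LoraConfig

# Sketch of the adapter config implied by the hyperparameter table above.
# Module names are an assumption based on Qwen 2.5's projection layers.
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```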
 
+### Results
+
+| | |
+|---|---|
 | Best eval loss (KL div) | 0.000756 |
+| Final train loss | 0.0008 |
+| Best checkpoint | Step 1000 |
+| Test accuracy | — |
+| Teacher ceiling | 51.28% |
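The accuracy metric implied by the teacher ceiling, the student's argmax over the five Likert responses compared against the teacher's argmax, can be sketched as below; the function name and tensor layout are assumptions for illustration.

```python
import torch

def argmax_agreement(student_probs: torch.Tensor,
                     teacher_probs: torch.Tensor) -> float:
    """Fraction of items where the student's most likely response
    (argmax over the 5 Likert options) matches the teacher's."""
    student_pred = student_probs.argmax(dim=-1)
    teacher_pred = teacher_probs.argmax(dim=-1)
    return (student_pred == teacher_pred).float().mean().item()
```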