DavidL123 committed
Commit 2c7b5d2 · verified · 1 Parent(s): db19b07

Slim model card: KL div method, hyperparams, infrastructure

Files changed (1):
  1. README.md +25 -80
README.md CHANGED
@@ -10,101 +10,46 @@ tags:
  - personality-prediction
  - big-five
  - bayesian-grm
- datasets:
- - custom
  ---

  # Qwen 2.5 7B SoftLabel

- LoRA adapter for Big Five personality prediction, fine-tuned with **KL divergence soft labels** from a Bayesian Graded Response Model (GRM) teacher.
+ LoRA adapter fine-tuned with **KL divergence** against soft probability distributions from a Bayesian Graded Response Model (GRM) teacher for Big Five personality prediction.

- ## Overview
+ Unlike standard cross-entropy (hard labels), KL divergence training preserves the teacher's uncertainty, producing calibrated probability estimates over 5-point Likert responses.

- This adapter was trained as part of a dissertation on LLM-based personality assessment. Instead of standard cross-entropy with hard labels (which produces overconfident predictions), the model is trained to match the full posterior probability distribution from a Bayesian psychometric teacher model over 5-point Likert scale responses.
+ ## Training

- | Property | Value |
- |----------|-------|
+ | | |
+ |---|---|
  | **Base model** | [Qwen 2.5 7B Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) |
- | **Parameters** | 7B |
- | **Task** | Personality prediction (Big Five traits) |
- | **Loss function** | KL Divergence (soft labels) |
- | **Teacher** | Bayesian GRM with IPIP-50 items |
- | **Test accuracy** | TBD |
- | **Teacher ceiling** | 51.28% |
-
- ## Training Details
+ | **Loss** | KL Divergence (batchmean) |
+ | **Precision** | bf16 |
+ | **Infrastructure** | University cluster (SLURM) — 4x NVIDIA RTX A6000 48GB |

  ### Data

- - **Training set**: 11,250 episodes (multi-turn personality questionnaire conversations)
- - **Validation set**: 1,250 episodes
- - **Test set**: 3,125 episodes
- - Each episode contains ~10 IPIP-50 questionnaire items with soft probability targets over responses 1-5
+ - 11,250 train / 1,250 valid / 3,125 test episodes
+ - Each episode: multi-turn IPIP-50 personality questionnaire with soft label targets over responses 1–5

  ### Hyperparameters

- | Hyperparameter | Value |
- |----------------|-------|
- | LoRA rank (r) | 16 |
- | LoRA alpha | 16 |
- | LoRA dropout | 0.05 |
- | Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
- | Learning rate | 1.5e-4 |
- | LR scheduler | Cosine |
- | Warmup steps | 100 |
- | Weight decay | 0.01 |
- | Optimizer | AdamW (fused) |
- | Max grad norm | 1.0 |
- | Precision | bf16 |
- | Per-GPU batch size | 2 |
- | Gradient accumulation | 4 |
- | Effective batch size | 64 (2 x 8 GPUs x 4) |
- | Max epochs | 3 (with early stopping) |
- | Early stopping patience | 5 evaluations |
+ | | |
+ |---|---|
+ | LoRA r / alpha / dropout | 16 / 16 / 0.05 |
+ | Target modules | q, k, v, o, gate, up, down proj |
+ | Learning rate | 1.5e-4 (cosine schedule, 100 warmup steps) |
+ | Effective batch size | 32 (2 per-GPU x 4 GPUs x 4 grad accum) |
+ | Max epochs | 3 (early stopping, patience=5) |
+ | Optimizer | AdamW fused (weight decay 0.01) |
  | Max sequence length | 4096 |
- | Seed | 3407 |

- ### Training Results
+ ### Results

- | Metric | Value |
- |--------|-------|
- | Best checkpoint | Step 1000 |
+ | | |
+ |---|---|
  | Best eval loss (KL div) | 0.000756 |
- | Final train loss | 0.000800 |
- | Training epochs completed | ~2.8 |
-
- ### Infrastructure
-
- - **Hardware**: NVIDIA RTX A6000 48GB x 8
- - **Infrastructure**: University cluster (SLURM, 8x A6000)
- - **Attention**: SDPA (auto-dispatch to FlashAttention/math backend)
- - **Gradient checkpointing**: Enabled (use_reentrant=False)
- - **DDP**: 8-way data parallel (NCCL backend)
-
- ## Usage
-
- ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer
- from peft import PeftModel
-
- base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
- model = PeftModel.from_pretrained(base_model, "DavidL123/qwen-2.5-7b-SoftLabel")
- tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
- ```
-
- ## How It Works
-
- 1. **Bayesian GRM Teacher**: A psychometric model estimates posterior probability distributions over 5-point Likert responses for each IPIP-50 personality item
- 2. **Soft Labels**: Instead of hard labels (e.g., "the answer is 4"), the target is a full distribution (e.g., [0.02, 0.08, 0.15, 0.50, 0.25])
- 3. **KL Divergence Loss**: The LLM is trained to minimize KL(teacher || student) over the 5 answer tokens, producing calibrated probability estimates
- 4. **Evaluation**: At test time, the model's argmax prediction over the 5 tokens is compared to the teacher's argmax (teacher ceiling: 51.28%)
-
- ## Citation
-
- ```
- @misc{levy2026personality,
- title={Personality Prediction via Soft-Label Fine-Tuning with Bayesian Psychometric Teachers},
- author={David Levy},
- year={2026},
- }
- ```
+ | Final train loss | 0.0008 |
+ | Best checkpoint | Step 1000 |
+ | Test accuracy | — |
+ | Teacher ceiling | 51.28% |
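
The slimmed card drops the old "How It Works" walkthrough, but the objective it named — KL(teacher || student) over the five Likert answer tokens, with `batchmean` reduction per the new training table — can be sketched in a few lines of PyTorch. This is a minimal illustration, not the actual training code: the helper name `soft_label_kl_loss` and the assumption that the student's logits have already been restricted to the five answer tokens are mine, not the repo's.

```python
import torch
import torch.nn.functional as F

def soft_label_kl_loss(answer_logits: torch.Tensor, teacher_probs: torch.Tensor) -> torch.Tensor:
    """Soft-label distillation loss as described in the card.

    answer_logits: (batch, 5) student logits restricted to the 5 Likert answer tokens.
    teacher_probs: (batch, 5) Bayesian GRM posterior over responses 1-5.
    """
    log_student = F.log_softmax(answer_logits, dim=-1)
    # F.kl_div(input=log_q, target=p) computes sum p * (log p - log q),
    # i.e. KL(teacher || student); 'batchmean' divides by the batch size.
    return F.kl_div(log_student, teacher_probs, reduction="batchmean")

# Example soft target quoted in the removed "How It Works" section:
# a full distribution instead of the hard label "4".
teacher = torch.tensor([[0.02, 0.08, 0.15, 0.50, 0.25]])
logits = torch.zeros(1, 5, requires_grad=True)  # student starts uniform
loss = soft_label_kl_loss(logits, teacher)
loss.backward()  # gradients flow only through the 5 answer-token logits
```

In the real training loop the five logits would presumably be gathered from the LM head at each answer position of the multi-turn episode before applying this loss; everything else here follows the card's description.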