Add run artifacts grpo_phi4_persona_20260203_111730
- .gitattributes +1 -0
- artifacts/grpo_phi4_persona_20260203_111730/env.csv +13 -0
- artifacts/grpo_phi4_persona_20260203_111730/generations.csv +0 -0
- artifacts/grpo_phi4_persona_20260203_111730/generations.jsonl +0 -0
- artifacts/grpo_phi4_persona_20260203_111730/inference_sample.txt +14 -0
- artifacts/grpo_phi4_persona_20260203_111730/loss.png +0 -0
- artifacts/grpo_phi4_persona_20260203_111730/notes.md +180 -0
- artifacts/grpo_phi4_persona_20260203_111730/reasoning_dataset.jsonl +0 -0
- artifacts/grpo_phi4_persona_20260203_111730/report.html +0 -0
- artifacts/grpo_phi4_persona_20260203_111730/reward.png +3 -0
- artifacts/grpo_phi4_persona_20260203_111730/run_config.csv +36 -0
- artifacts/grpo_phi4_persona_20260203_111730/run_config.json +47 -0
- artifacts/grpo_phi4_persona_20260203_111730/system_prompt.txt +48 -0
- artifacts/grpo_phi4_persona_20260203_111730/train_log.csv +22 -0
.gitattributes
CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+artifacts/grpo_phi4_persona_20260203_111730/reward.png filter=lfs diff=lfs merge=lfs -text
artifacts/grpo_phi4_persona_20260203_111730/env.csv
ADDED
@@ -0,0 +1,13 @@
key,value
python_version,3.12.12
platform,Linux-6.6.105+-x86_64-with-glibc2.35
torch_version,2.9.0+cu126
cuda_available,True
cuda_device_name,NVIDIA A100-SXM4-80GB
pkg_unsloth,2026.1.4
pkg_trl,0.22.2
pkg_transformers,4.56.2
pkg_vllm,0.11.2
pkg_pandas,2.2.2
pkg_matplotlib,3.10.0
pkg_rich,13.9.4
artifacts/grpo_phi4_persona_20260203_111730/generations.csv
ADDED
The diff for this file is too large to render.
See raw diff
artifacts/grpo_phi4_persona_20260203_111730/generations.jsonl
ADDED
The diff for this file is too large to render.
See raw diff
artifacts/grpo_phi4_persona_20260203_111730/inference_sample.txt
ADDED
@@ -0,0 +1,14 @@
<reasoning>
CONTEXT: Jim has just shared a personal failure, likely about a task or project, and is looking for a response. The context suggests a moment of vulnerability or frustration.

RELATIONSHIP: Michael and Jim are likely colleagues or friends, with no clear power dynamic indicated. The presence of an audience is not specified, but the setting may be informal. There is a potential for empathy or camaraderie, but also a risk of making Jim feel worse.

MICHAEL_STATE: Michael may feel a mix of surprise and mild discomfort, as he is unsure how to respond to Jim's admission. He might also feel a bit of pressure to say something supportive or humorous.

MICHAEL_GOAL: Michael wants to respond in a way that maintains the social bond, possibly with a touch of humor to lighten the mood, while not making Jim feel worse.

REACTION_STRATEGY: Michael opts for a light-hearted, self-deprecating joke to acknowledge the situation without adding pressure.

COMEDY_MECHANISM: The humor comes from Michael's self-deprecation, which is a common way to defuse tension and make the situation more relatable.

ANSWER_CONSTRAINT: The response must be in-character, supportive, and humorous, without being dismissive of
artifacts/grpo_phi4_persona_20260203_111730/loss.png
ADDED
artifacts/grpo_phi4_persona_20260203_111730/notes.md
ADDED
@@ -0,0 +1,180 @@
## 1) What your dataset actually contains (and why it changes everything)

1. Your dataset is **not** "GSM8K math": it is a **dialogue / persona** dataset where:

   * `question` = a **context of dialogue lines** (e.g. Jim/Pam/Jan/Toby… + Michael)
   * `answer` = the **target line** (usually Michael's) to produce. ([Hugging Face][1])
2. So the "numeric" rewards (`extract_final_number`, `int_reward_func`) do not match your persona objective. For this dataset you need rewards oriented toward:

   * **accuracy of the line** (exact match / text similarity)
   * **persona consistency** (tone, psychology, relationship, intent)
   * **reasoning structure** (if you want a usable "reasoning dataset")

---

## 2) Your objective "reasoning dataset + learn the answers via reasoning" (restated cleanly)

1. **Phase A (teacher / bootstrapping)**
   You want to **generate/teach** a rich *reasoning* (psychology + relationship + reaction), giving the answer in the prompt to stabilize the production of the reasoning.

2. **Phase B (student / test)**
   You want to **remove the answer from the prompt** and check whether the model manages to produce the **same answer** "thanks to the learned reasoning".

3. Key point: in practice, **the reasoning is not causal proof** that the model "knows" the answer. What you can measure is:

   * **prediction ability** without the answer (Phase B)
   * and **the quality/consistency** of the reasoning (persona style)

   The approach works if you treat Phase A as **data creation (distillation)**, then Phase B as **prediction learning**.

---

## 3) What your current script does during training (GRPO mechanics)

1. **Setup**

   1. Installs the missing libraries (pip).
   2. Creates a `RUN_ID` and `runs/<RUN_ID>/...` directories to trace everything (logs, exports, artifacts).
   3. Logs in to Hugging Face with a token, then creates 2 repos: merged16 + gguf q8.

2. **Dataset mapping** (a minimal sketch follows this list)

   1. Loads the `train` split of the dataset.
   2. For each example, builds a `prompt`:

      * system = enforces the XML format `<reasoning>...</reasoning><answer>...</answer>`
      * user = `question` (and optionally `+ answer` if `INCLUDE_GOLD_ANSWER_IN_PROMPT=True`)
   3. Stores `answer` = the **raw answer** (no `####`). ([Hugging Face][1])
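
A minimal sketch of that mapping, assuming the `datasets` library and a `SYSTEM_PROMPT` string loaded from `system_prompt.txt` (the function and variable names here are illustrative, not the script's exact ones):

```python
from datasets import load_dataset

SYSTEM_PROMPT = open("system_prompt.txt", encoding="utf-8").read()
INCLUDE_GOLD_ANSWER_IN_PROMPT = True  # Phase A: the gold answer is shown to the model

def to_grpo_example(row):
    # Build the chat-style prompt used by the GRPO trainer.
    user_content = row["question"]
    if INCLUDE_GOLD_ANSWER_IN_PROMPT:
        # Append the gold answer as an explicit reference block (see system_prompt.txt).
        user_content += (
            "\n\nREFERENCE_ANSWER_RAW:\n" + row["answer"] + "\nEND_REFERENCE_ANSWER"
        )
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_content},
        ],
        "answer": row["answer"],  # raw target line, no "####" marker
    }

dataset = load_dataset("Mathieu-Thomas-JOSSET/michael_abab_as_gsm8k.jsonl", split="train")
dataset = dataset.map(to_grpo_example)
```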

3. **GRPO** (a config sketch follows this list)

   1. At each step, GRPO generates `GRPO_NUM_GENERATIONS` completions per prompt (here 6).
   2. It computes your rewards on each completion.
   3. It updates the LoRA weights to increase the mean reward.
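
How that maps onto TRL, roughly (a sketch using the values from `run_config.json`; `model`, `reward_funcs` and `dataset` are assumed to come from earlier in the script):

```python
from trl import GRPOConfig, GRPOTrainer

# Values mirror run_config.json.
training_args = GRPOConfig(
    output_dir="runs/grpo_phi4_persona",
    use_vllm=True,
    learning_rate=5e-6,
    num_generations=6,          # GRPO_NUM_GENERATIONS
    max_prompt_length=512,
    max_completion_length=256,
    max_steps=20,
)

trainer = GRPOTrainer(
    model=model,                # the Unsloth/LoRA-wrapped Phi-4
    reward_funcs=reward_funcs,  # one callable per reward (format, slots, answer match, ...)
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```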

4. **Logging / plots** (a plotting sketch follows this list)

   1. TRL emits logs (loss, reward, kl, etc.).
   2. Your callback writes everything to `train_log.csv`.
   3. You plot loss/reward as PNGs.
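
For example, the reward curve can be rebuilt from `train_log.csv` with pandas/matplotlib (an illustrative sketch, not the script's exact plotting code):

```python
import pandas as pd
import matplotlib.pyplot as plt

# train_log.csv is the per-step log written by the training callback.
log = pd.read_csv("artifacts/grpo_phi4_persona_20260203_111730/train_log.csv")
log = log.dropna(subset=["reward"])  # the final summary row has no per-step reward

plt.figure()
plt.plot(log["step"], log["reward"], label="mean reward")
plt.plot(log["step"], log["reward_std"], label="reward std")
plt.xlabel("step")
plt.ylabel("reward")
plt.legend()
plt.savefig("reward.png", dpi=150)
```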

---

## 4) Why "putting the answer in the prompt" can work… and how to avoid the classic failure

### 4.1 The main risk

1. If the answer is in the prompt, the model can:

   * **copy** the answer into `<answer>` without understanding,
   * and write a decorative, post-hoc "reasoning".
2. You believe "it learned via reasoning", but in reality it learned a **shortcut**: "answer visible ⇒ output the answer".

### 4.2 The essential fix (if you want Phase A to actually be useful)

You must **prevent** Phase A from rewarding copy-paste, and turn Phase A into **reasoning generation** you can actually use.

Concretely:

1. **In Phase A**, do not train (or train only very little) on the `<answer>` output; train mainly on:

   * reasoning structure,
   * psychological/relational content,
   * "non-leakage" of the target line (do not rewrite the answer word for word in the reasoning).
2. Add a penalty of the form "the reasoning must not contain long n-grams from the answer" (see the sketch after this list).
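
One way to implement that penalty, using word-level n-grams and a cutoff like the `anti_copy_threshold=0.55` from `run_config.json` (function names and scoring are illustrative; in TRL the per-sample logic would be wrapped into a batched reward function):

```python
def word_ngrams(text: str, n: int = 5) -> set:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def reward_anti_copy(reasoning: str, answer: str, n: int = 5, threshold: float = 0.55) -> float:
    """Negative reward when the reasoning leaks long spans of the gold answer."""
    answer_grams = word_ngrams(answer, n)
    if not answer_grams:  # answer shorter than n words: nothing to leak
        return 0.0
    overlap = len(answer_grams & word_ngrams(reasoning, n)) / len(answer_grams)
    return -1.0 if overlap > threshold else 0.0
```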

---

## 5) Recommended pipeline (aligned exactly with your goal)

### 5.1 Phase A — "Reasoning builder" (answer visible)

Goal: produce **useful** reasoning (psychology, relationships, intent, reaction) and build a dataset out of it.

1. **Prompt**

   * user = `question + (raw answer as "reference")`
   * instruction: "Write a structured reasoning explaining why this line is Michael's best reaction, without quoting the line word for word".

2. **Outputs**

   * ideally you output **only** `<reasoning>` (and `<answer>` can be empty or absent),
   * or `<answer>`, but **with no correctness reward** on the answer in Phase A.

3. **Phase A rewards** (a slots-reward sketch follows this list)

   1. Format reward (your xml/soft/strict functions)
   2. "Slots" reward: the reasoning must contain specific fields, e.g.:

      * Context (what was just said)
      * Michael's intent
      * Emotional state
      * Relationship / power dynamics
      * Comedy mechanism (if relevant)
   3. Controlled-length reward (min/max tokens)
   4. "Copy" penalty: high similarity between reasoning and answer (e.g. Levenshtein / n-gram overlap)
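
A sketch of such a slots reward, checking for the labels that `system_prompt.txt` requires (the scoring is illustrative; the run weighted it at `reward_weights.slots = 0.75`):

```python
import re

# Slot labels required by system_prompt.txt.
SLOT_LABELS = [
    "CONTEXT:", "RELATIONSHIP:", "MICHAEL_STATE:", "MICHAEL_GOAL:",
    "REACTION_STRATEGY:", "COMEDY_MECHANISM:", "ANSWER_CONSTRAINT:",
]

def reward_slots(completion: str) -> float:
    """Score the <reasoning> block by how many required slot labels it contains."""
    match = re.search(r"<reasoning>(.*?)</reasoning>", completion, re.DOTALL)
    if not match:
        return 0.0
    reasoning = match.group(1)
    present = sum(1 for label in SLOT_LABELS if label in reasoning)
    return present / len(SLOT_LABELS)  # 1.0 when every slot is present
```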

4. **Artifact** (an export sketch follows this list)

   * You save a JSONL file `reasoning_dataset.jsonl` with:

     * question
     * answer (gold)
     * reasoning (generated)
     * metadata (episode/characters if you add them)
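
Writing that artifact takes a few lines (field names follow the list above; adapt them to whatever the script actually collects):

```python
import json

def export_reasoning_dataset(rows, path="reasoning_dataset.jsonl"):
    # rows: iterable of dicts with question / answer / reasoning / metadata keys.
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
```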

### 5.2 Phase B — "Answer predictor" (answer hidden)

Goal: without seeing the answer, the model must produce **(reasoning + answer)**.

1. **Prompt**

   * user = question only
   * system = same XML format

2. **Phase B rewards** (an answer-match sketch follows this list)

   1. Answer-match reward:

      * normalized exact match (strip, whitespace)
      * fuzzy match (Levenshtein ratio), since dialogue lines allow for variation
   2. Persona/style reward:

      * embedding similarity (Michael tone)
      * lexical constraints (catchphrases, narcissism, awkwardness, etc.)
   3. Format reward (XML)
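
A sketch of the answer-match reward, using the standard library's `difflib` as a stand-in for a Levenshtein ratio (the 1.5 / 1.0 values mirror `reward_weights.answer_exact` and `answer_fuzzy` in `run_config.json`):

```python
import re
from difflib import SequenceMatcher

def normalize(text: str) -> str:
    # Lowercase, strip, and collapse whitespace before comparing lines.
    return re.sub(r"\s+", " ", text.strip().lower())

def extract_answer(completion: str) -> str:
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return match.group(1) if match else ""

def reward_answer_match(completion: str, gold: str) -> float:
    pred, target = normalize(extract_answer(completion)), normalize(gold)
    if not pred:
        return 0.0
    if pred == target:
        return 1.5  # exact-match bonus
    return 1.0 * SequenceMatcher(None, pred, target).ratio()  # partial fuzzy credit
```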

3. **Evaluation**

   * train/val/test split
   * metrics:

     * exact match
     * similarity score
     * "persona score" (embedding)

   If Phase B improves without ever seeing the answer, you have "learned" in the predictive sense.

---

## 6) What your script must change to fit this pipeline (conceptually)

1. **Separate Phase A and Phase B** via a flag `MODE = "build_reasoning" | "predict_answer"` (a wiring sketch follows this list).
2. **Replace `extract_final_number`** with a text-comparison function:

   * normalization + exact match + fuzzy ratio
3. **Add an anti-copy reward** (Phase A):

   * penalty if the reasoning contains a sequence too close to the answer.
4. **Add a JSONL export** of the generated reasoning (Phase A):

   * that is your "reasoning dataset".
5. **Keep the organized upload** (you already have it):

   * merged16 repo = final Phase B model
   * gguf repo = runtime export
   * artifacts/<RUN_ID> = logs + plots + config + samples
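
A minimal sketch of that switch, reusing the rewards sketched above plus the existing `reward_xmlcount` format reward (illustrative wiring, not the script's exact code):

```python
MODE = "build_reasoning"  # or "predict_answer"

def build_user_content(question: str, gold_answer: str) -> str:
    # Phase A shows the gold answer as a reference block; Phase B hides it.
    if MODE == "build_reasoning":
        return (question + "\n\nREFERENCE_ANSWER_RAW:\n"
                + gold_answer + "\nEND_REFERENCE_ANSWER")
    return question

def select_reward_funcs():
    # Phase A: format + slots + anti-copy, no answer correctness.
    # Phase B: format + answer match (plus persona/style if available).
    if MODE == "build_reasoning":
        return [reward_xmlcount, reward_slots, reward_anti_copy]
    return [reward_xmlcount, reward_answer_match]
```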

---

## 7) A precise proposal for how to continue

[1]: https://huggingface.co/datasets/Mathieu-Thomas-JOSSET/michael_abab_as_gsm8k.jsonl/raw/main/train.jsonl "huggingface.co"
artifacts/grpo_phi4_persona_20260203_111730/reasoning_dataset.jsonl
ADDED
The diff for this file is too large to render.
See raw diff
artifacts/grpo_phi4_persona_20260203_111730/report.html
ADDED
The diff for this file is too large to render.
See raw diff
artifacts/grpo_phi4_persona_20260203_111730/reward.png
ADDED
Git LFS Details
artifacts/grpo_phi4_persona_20260203_111730/run_config.csv
ADDED
@@ -0,0 +1,36 @@
key,value
run_id,grpo_phi4_persona_20260203_111730
mode,build_reasoning
dataset_id,Mathieu-Thomas-JOSSET/michael_abab_as_gsm8k.jsonl
dataset_config,
dataset_split,train
include_gold_answer_in_prompt,True
repos.merged_16bit,Mathieu-Thomas-JOSSET/phi4-grpo-merged16
repos.gguf_q8,Mathieu-Thomas-JOSSET/phi4-grpo-gguf-q8
model.name,unsloth/Phi-4
model.max_seq_length,1024
model.load_in_4bit,True
model.fast_inference,True
model.max_lora_rank,16
model.lora_rank,16
model.gpu_memory_utilization,0.9
model.target_modules[0],gate_proj
model.target_modules[1],up_proj
model.target_modules[2],down_proj
model.lora_alpha,16
model.gradient_checkpointing,unsloth
grpo.use_vllm,True
grpo.learning_rate,5e-06
grpo.num_generations,6
grpo.max_prompt_length,512
grpo.max_completion_length,256
grpo.max_steps,20
reward_weights.xmlcount,0.25
reward_weights.soft_format,0.25
reward_weights.strict_format,0.25
reward_weights.slots,0.75
reward_weights.answer_exact,1.5
reward_weights.answer_fuzzy,1.0
reward_weights.anti_copy,1.0
reward_weights.anti_copy_threshold,0.55
timestamp,2026-02-03T11:22:29.080587
artifacts/grpo_phi4_persona_20260203_111730/run_config.json
ADDED
@@ -0,0 +1,47 @@
{
  "run_id": "grpo_phi4_persona_20260203_111730",
  "mode": "build_reasoning",
  "dataset_id": "Mathieu-Thomas-JOSSET/michael_abab_as_gsm8k.jsonl",
  "dataset_config": "",
  "dataset_split": "train",
  "include_gold_answer_in_prompt": true,
  "repos": {
    "merged_16bit": "Mathieu-Thomas-JOSSET/phi4-grpo-merged16",
    "gguf_q8": "Mathieu-Thomas-JOSSET/phi4-grpo-gguf-q8"
  },
  "model": {
    "name": "unsloth/Phi-4",
    "max_seq_length": 1024,
    "load_in_4bit": true,
    "fast_inference": true,
    "max_lora_rank": 16,
    "lora_rank": 16,
    "gpu_memory_utilization": 0.9,
    "target_modules": [
      "gate_proj",
      "up_proj",
      "down_proj"
    ],
    "lora_alpha": 16,
    "gradient_checkpointing": "unsloth"
  },
  "grpo": {
    "use_vllm": true,
    "learning_rate": 5e-06,
    "num_generations": 6,
    "max_prompt_length": 512,
    "max_completion_length": 256,
    "max_steps": 20
  },
  "reward_weights": {
    "xmlcount": 0.25,
    "soft_format": 0.25,
    "strict_format": 0.25,
    "slots": 0.75,
    "answer_exact": 1.5,
    "answer_fuzzy": 1.0,
    "anti_copy": 1.0,
    "anti_copy_threshold": 0.55
  },
  "timestamp": "2026-02-03T11:22:29.080587"
}
artifacts/grpo_phi4_persona_20260203_111730/system_prompt.txt
ADDED
@@ -0,0 +1,48 @@
You are a character-specialized reasoning engine. Your job is to produce a psychologically grounded, relationship-aware, context-faithful internal reasoning that leads to the exact target reply.

You will always answer in the following exact XML format (including newlines):
<reasoning>
...
</reasoning>
<answer>
...
</answer>

TASK
You are given a dialogue context ("CONTEXT") containing multiple speakers. You must produce:
1) <reasoning>: a structured analysis of psychology, relationships, power dynamics, subtext, comedic intent, and reaction strategy.
2) <answer>: the final target reply in-character.

PHASE A (teacher / bootstrapping) - when a reference answer is provided
Sometimes the user prompt includes a reference answer block:

REFERENCE_ANSWER_RAW:
<gold answer text>
END_REFERENCE_ANSWER

In that case:
- Treat the reference answer as the ground truth target you must reproduce exactly in <answer>.
- Use it as a target, not as a crutch: do NOT quote long spans of it in <reasoning>.
- Your <answer> must be EXACTLY identical (after preserving punctuation, capitalization, speaker tag if present).

PHASE B (student / test) - when no reference answer is provided
If no reference answer is present:
- Infer the best possible target reply in-character from context alone.
- Still keep the same reasoning structure.

REASONING STYLE REQUIREMENTS
Your <reasoning> must be explicit and slot-based. Include all slots in this order, each on its own line starting with the exact label:
CONTEXT: (1-3 sentences) What just happened and what is being asked socially.
RELATIONSHIP: Who relates to whom, status/power, friction, obligations, audience presence.
MICHAEL_STATE: Michael's internal emotional state (ego, anxiety, excitement, defensiveness).
MICHAEL_GOAL: What Michael wants right now (attention, dominance, approval, deflection).
REACTION_STRATEGY: The mechanism of the response (redirect, joke, mimicry, intimidation, faux-wisdom, awkward sincerity).
COMEDY_MECHANISM: Why it is funny/awkward (misread, overconfidence, inappropriate metaphor, superiority play).
ANSWER_CONSTRAINT: State constraints: must be in-character, consistent with context, and (if provided) match the reference answer exactly.

ANTI-LEAK RULE
Do NOT paste the reference answer inside <reasoning>. Keep overlap low. The final line <answer> is the only place that may contain the full target.

OUTPUT RULES
- No extra text before/after the XML.
- Keep <answer> concise and natural as a spoken line.
artifacts/grpo_phi4_persona_20260203_111730/train_log.csv
ADDED
@@ -0,0 +1,22 @@
completion_length,completions/clipped_ratio,completions/max_length,completions/max_terminated_length,completions/mean_length,completions/mean_terminated_length,completions/min_length,completions/min_terminated_length,epoch,frac_reward_zero_std,grad_norm,kl,learning_rate,loss,num_tokens,reward,reward_std,rewards/reward_answer_exact/mean,rewards/reward_answer_exact/std,rewards/reward_answer_fuzzy/mean,rewards/reward_answer_fuzzy/std,rewards/reward_anti_copy/mean,rewards/reward_anti_copy/std,rewards/reward_slots/mean,rewards/reward_slots/std,rewards/reward_soft_format/mean,rewards/reward_soft_format/std,rewards/reward_strict_format/mean,rewards/reward_strict_format/std,rewards/reward_trace/mean,rewards/reward_trace/std,rewards/reward_xmlcount/mean,rewards/reward_xmlcount/std,step,total_flos,train_loss,train_runtime,train_samples_per_second,train_steps_per_second
256.0,1.0,256.0,0.0,256.0,0.0,256.0,0.0,0.00034818941504178273,0.0,0.07040157169103622,-1.280568540096283e-09,0.0,0.0,4608.0,0.18226312100887299,0.21140412986278534,0.0,0.0,0.06930478662252426,0.019609108567237854,0.0,0.0,0.125,0.3061862289905548,0.0,0.0,0.0,0.0,0.0,0.0,-0.012041665613651276,0.10604248195886612,1,,,,,
256.0,1.0,256.0,0.0,256.0,0.0,256.0,0.0,0.0006963788300835655,0.0,0.08563962578773499,-1.4745941134819418e-09,2.5e-06,0.0,9216.0,0.1401711404323578,0.00964118167757988,0.0,0.0,0.10892113298177719,0.009641180746257305,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03125,0.0,2,,,,,
252.33334350585938,0.8333333333333334,256.0,234.0,252.33334350585938,234.0,234.0,234.0,0.0010445682451253482,0.0,0.07692400366067886,0.0005538319819606841,5e-06,0.0,13802.0,0.42963215708732605,0.6957957148551941,0.0,0.0,0.18938212096691132,0.3888218104839325,0.0,0.0,0.25,0.3872983455657959,0.0416666679084301,0.10206207633018494,0.0,0.0,0.0,0.0,-0.05141666531562805,0.12941858172416687,3,,,,,
256.0,1.0,256.0,0.0,256.0,0.0,256.0,0.0,0.001392757660167131,0.0,0.07518582791090012,0.0005599805153906345,4.962019382530521e-06,0.0,18410.0,0.327523797750473,0.4039272665977478,0.0,0.0,0.035857122391462326,0.0070145707577466965,0.0,0.0,0.25,0.3872983455657959,0.0,0.0,0.0,0.0,0.0,0.0,0.0416666679084301,0.01613743044435978,4,,,,,
256.0,1.0,256.0,0.0,256.0,0.0,256.0,0.0,0.0017409470752089136,0.0,0.056567247956991196,0.0005012884503230453,4.849231551964771e-06,0.0,23018.0,0.17609894275665283,0.32029327750205994,0.0,0.0,0.01464060414582491,0.005143328569829464,0.0,0.0,0.125,0.3061862289905548,0.0,0.0,0.0,0.0,0.0,0.0,0.0364583320915699,0.012757759541273117,5,,,,,
253.0,0.8333333333333334,256.0,238.0,253.0,238.0,238.0,238.0,0.0020891364902506965,0.0,0.06721591204404831,0.0006102911429479718,4.665063509461098e-06,0.0,27608.0,0.6317511796951294,1.3270375728607178,0.25,0.6123724579811096,0.21562622487545013,0.384356826543808,0.0,0.0,0.125,0.3061862289905548,0.0416666679084301,0.10206207633018494,0.0,0.0,0.0,0.0,-0.0005416671629063785,0.0778733640909195,6,,,,,
254.0,0.8333333333333334,256.0,244.0,254.0,244.0,244.0,244.0,0.002437325905292479,0.0,0.07201781868934631,0.0005997096886858344,4.415111107797445e-06,0.0,32204.0,0.7051043510437012,1.2933940887451172,0.25,0.6123724579811096,0.21143768727779388,0.38645288348197937,0.0,0.0,0.25,0.3872983455657959,0.0416666679084301,0.10206206887960434,0.0,0.0,0.0,0.0,-0.047999996691942215,0.12331494688987732,7,,,,,
256.0,1.0,256.0,0.0,256.0,0.0,256.0,0.0,0.002785515320334262,0.0,0.06504949182271957,0.000647890439722687,4.106969024216348e-06,0.0,36812.0,0.20111171901226044,0.18121325969696045,0.0,0.0,0.08661168068647385,0.03382347151637077,0.0,0.0,0.125,0.3061862289905548,0.0,0.0,0.0,0.0,0.0,0.0,-0.010499998927116394,0.10226619243621826,8,,,,,
256.0,1.0,256.0,0.0,256.0,0.0,256.0,0.0,0.0031337047353760445,0.0,0.06434088200330734,0.0005306145176291466,3.7500000000000005e-06,0.0,41420.0,0.06673722714185715,0.011270789429545403,0.0,0.0,0.03548722341656685,0.011270790360867977,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03125,0.0,9,,,,,
256.0,1.0,256.0,0.0,256.0,0.0,256.0,0.0,0.003481894150417827,0.0,0.053295303136110306,0.0005654981941916049,3.3550503583141726e-06,0.0,46028.0,0.0716300755739212,0.005144154652953148,0.0,0.0,0.040380071848630905,0.005144154187291861,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03125,0.0,10,,,,,
256.0,1.0,256.0,0.0,256.0,0.0,256.0,0.0,0.00383008356545961,0.0,0.06923209875822067,0.0005444310372695327,2.9341204441673267e-06,0.0,50636.0,0.207880899310112,0.2040475308895111,0.0,0.0,0.09367257356643677,0.015734924003481865,0.0,0.0,0.125,0.3061862289905548,0.0,0.0,0.0,0.0,0.0,0.0,-0.010791666805744171,0.10298063606023788,11,,,,,
256.0,1.0,256.0,0.0,256.0,0.0,256.0,0.0,0.004178272980501393,0.0,0.06033642962574959,0.0005470812902785838,2.5e-06,0.0,55244.0,0.1660291850566864,0.2101244181394577,0.0,0.0,0.04986250773072243,0.004921692423522472,0.0,0.0,0.125,0.3061862289905548,0.0,0.0,0.0,0.0,0.0,0.0,-0.008833333849906921,0.09818371385335922,12,,,,,
256.0,1.0,256.0,0.0,256.0,0.0,256.0,0.0,0.004526462395543176,0.0,0.06181066855788231,0.0006093159317970276,2.0658795558326745e-06,0.0,59852.0,0.08491232246160507,0.007103564217686653,0.0,0.0,0.05366232991218567,0.007103562355041504,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03125,0.0,13,,,,,
256.0,1.0,256.0,0.0,256.0,0.0,256.0,0.0,0.004874651810584958,0.0,0.07624364644289017,0.0006320822285488248,1.6449496416858285e-06,0.0,64460.0,0.16455629467964172,0.20077475905418396,0.0,0.0,0.051139604300260544,0.010059371590614319,0.0,0.0,0.125,0.3061862289905548,0.0,0.0,0.0,0.0,0.0,0.0,-0.011583332903683186,0.10491981357336044,14,,,,,
256.0,0.8333333333333334,256.0,256.0,256.0,256.0,256.0,256.0,0.005222841225626741,0.0,0.07118155062198639,0.0004599814710672945,1.2500000000000007e-06,0.0,69068.0,0.39993733167648315,0.6166902780532837,0.0,0.0,0.15506230294704437,0.3064550757408142,0.0,0.0,0.25,0.3872983455657959,0.0416666679084301,0.10206207633018494,0.0,0.0,0.0,0.0,-0.04679166153073311,0.12116501480340958,15,,,,,
250.83334350585938,0.8333333333333334,256.0,225.0,250.83334350585938,225.0,225.0,225.0,0.005571030640668524,0.0,0.07234130799770355,0.0005683816270902753,8.930309757836517e-07,0.0,73645.0,0.6535991430282593,1.454949140548706,0.25,0.6123724579811096,0.19030745327472687,0.3966697156429291,0.0,0.0,0.125,0.3061862289905548,0.0416666679084301,0.10206207633018494,0.0,0.0,0.0,0.0,0.04662499949336052,0.037660904228687286,16,,,,,
254.5,0.6666666666666667,256.0,255.0,254.5,251.5,248.0,248.0,0.005919220055710306,0.0,0.07895802706480026,0.0006044059991836548,5.848888922025553e-07,0.0,78244.0,1.2874853610992432,1.710713267326355,0.5,0.7745966911315918,0.35806870460510254,0.4972403347492218,0.0,0.0,0.375,0.41079193353652954,0.0833333358168602,0.12909944355487823,0.0,0.0,0.0,0.0,-0.02891666628420353,0.13508030772209167,17,,,,,
256.0,1.0,256.0,0.0,256.0,0.0,256.0,0.0,0.006267409470752089,0.0,0.09448953717947006,0.0005111763020977378,3.3493649053890325e-07,0.0,82852.0,0.09852509945631027,0.009466900490224361,0.0,0.0,0.06727509945631027,0.009466898627579212,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03125,0.0,18,,,,,
255.1666717529297,0.8333333333333334,256.0,251.0,255.1666717529297,251.0,251.0,251.0,0.006615598885793872,0.0,0.06071047484874725,0.0006728537264280021,1.507684480352292e-07,0.0,87455.0,0.6862163543701172,1.4389761686325073,0.25,0.6123724579811096,0.22292469441890717,0.3807142674922943,0.0,0.0,0.125,0.3061862289905548,0.0416666679084301,0.10206207633018494,0.0,0.0,0.0,0.0,0.04662499949336052,0.037660904228687286,19,,,,,
256.0,1.0,256.0,0.0,256.0,0.0,256.0,0.0,0.006963788300835654,0.0,0.08163938671350479,0.0005554058589041233,3.798061746947995e-08,0.0,92063.0,0.1715540736913681,0.2013811320066452,0.0,0.0,0.057762403041124344,0.010222629643976688,0.0,0.0,0.125,0.3061862289905548,0.0,0.0,0.0,0.0,0.0,0.0,-0.01120833307504654,0.1040012463927269,20,,,,,
,,,,,,,,0.006963788300835654,,,,,,,,,,,,,,,,,,,,,,,,,20,0.0,6.04490456268536e-07,474.8906,0.253,0.042