AhmedSohair/synthpai-training

AhmedSohair committed on May 1

Commit 2a662c9 (verified) · 1 Parent(s): 1c5673b

Add v6 results (46.7%), multi-task SFT literature review (6 papers), v7 research-backed design

Files changed (1)

TRAINING_LOG.md +163 -8

@@ -15,9 +15,10 @@
 - [v5 Training](#v5-training)
 - [PAN15 Cross-Domain Evaluation](#pan15-cross-domain-evaluation)
 - [v6 Training (in progress)](#v6-training-in-progress)
 - [Resume Support](#resume-support)
 - [Literature References](#literature-references)
-- [Future Directions (post-v6)](#future-directions-post-v6)
 ---
@@ -1290,19 +1291,172 @@ python evaluate_synthpai_v6.py
 - [x] Evaluation script written
 - [ ] Holistic traces being generated (~1,920 via GPT-4o, in progress)
 - [ ] Training
-- [ ] Evaluation on SynthPAI
 - [ ] Evaluation on PAN15
 - [ ] Evaluation on PANDORA (pending dataset access)
 - [ ] Results analysis
 ---
-## Future Directions (post-v6)
-### v7: GRPO Reinforcement Learning (planned after v6)
-If v6 produces a strong multi-attribute SFT checkpoint, the natural next step is GRPO refinement:
-- Start from v6 SFT checkpoint
 - Custom reward function that checks ALL 8 attributes simultaneously
 - Can reward profile coherence (age+occupation+relationship consistency bonus)
 - `think_format_reward` ensures `<analysis>` blocks are produced
@@ -1392,10 +1546,11 @@ The key v5 insight is: **reasoning traces help some attributes but the GPT-4o tr
 | `evaluate_pan15.py` | PAN15 cross-domain evaluation (age + gender on real Twitter) |
 | `generate_holistic_traces.py` | v6 holistic multi-attribute trace generation (GPT-4o) |
 | `train_synthpai_v6.py` | v6 training script (multi-attribute profiling) |
-| `evaluate_synthpai_v6.py` | v6 evaluation script (single-inference multi-attribute) |
 | `HOW_TO_RUN_LOCAL.md` | Local setup guide |
 | `TRAINING_LOG.md` | This file |
 ---
-*Last updated: 2026-04-30 (v6 design complete — holistic multi-attribute profiling, traces generating)*

 - [v5 Training](#v5-training)
 - [PAN15 Cross-Domain Evaluation](#pan15-cross-domain-evaluation)
 - [v6 Training (in progress)](#v6-training-in-progress)
+- [v7 Training (in progress)](#v7-training-in-progress)
 - [Resume Support](#resume-support)
 - [Literature References](#literature-references)
+- [Future Directions (post-v7)](#future-directions-post-v7)
 ---
 - [x] Evaluation script written
 - [ ] Holistic traces being generated (~1,920 via GPT-4o, in progress)
 - [ ] Training
+- [x] Evaluation on SynthPAI: **46.7% overall** — education/income/occupation improved, age collapsed
 - [ ] Evaluation on PAN15
 - [ ] Evaluation on PANDORA (pending dataset access)
+- [x] Results analysis: multi-attribute-only training (1,920 examples) provides too few examples to learn all 8 attributes simultaneously
+### v6 Evaluation Results
+| Attribute | V4 | V5 | **V6** | Δv5→v6 |
+|---|---|---|---|---|
+| Age | 40.0% | **50.0%** | 16.7% | -33.3pp 🔴 |
+| Sex | **63.3%** | **63.3%** | 53.3% | -10.0pp 🔴 |
+| City/Country | 36.7% | 40.0% | **43.3%** | +3.3pp ✅ |
+| Birth City/Country | 40.0% | **43.3%** | 23.3% | -20.0pp 🔴 |
+| Education | **63.3%** | 56.7% | **66.7%** | +10.0pp ✅🔥 |
+| Income Level | **56.7%** | 36.7% | **53.3%** | +16.7pp ✅🔥 |
+| Occupation | 63.3% | 66.7% | **70.0%** | +3.3pp ✅ |
+| Relationship Status | 40.0% | **46.7%** | 46.7% | 0pp |
+| **OVERALL** | **50.4%** | **50.4%** | **46.7%** | -3.8pp |
+**v6 Diagnosis**: Training on multi-attribute examples alone (1,920 of them) was insufficient. The model learned the easier attributes (education 66.7% and occupation 70.0%, both the best results so far) but collapsed on the harder ones (age 16.7%, birth_city 23.3%). Cross-attribute reasoning worked for strong-signal attributes but failed where per-attribute training volume was needed. Income recovered to 53.3% (from v5's 36.7%), which suggests the holistic traces fixed the "low" bias.
+**Key insight**: The multi-attribute format works better for attributes with strong cross-attribute correlations (education↔occupation↔income), but the model also needs enough per-attribute training signal from single-attribute examples to avoid collapsing on the harder attributes. This motivated v7's combined approach.
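The overall row and the list of regressions can be re-derived from the per-attribute rows. The following is a minimal sketch, assuming the overall score is the unweighted mean of the eight per-attribute accuracies (the log does not state how it is aggregated). The dictionary keys and function names are illustrative and are not taken from the evaluation scripts.

```python
# Per-attribute accuracies (%) copied from the v6 evaluation table.
# Key names are illustrative, not the evaluation script's field names.
V5 = {
    "age": 50.0, "sex": 63.3, "city_country": 40.0, "birth_city_country": 43.3,
    "education": 56.7, "income_level": 36.7, "occupation": 66.7,
    "relationship_status": 46.7,
}
V6 = {
    "age": 16.7, "sex": 53.3, "city_country": 43.3, "birth_city_country": 23.3,
    "education": 66.7, "income_level": 53.3, "occupation": 70.0,
    "relationship_status": 46.7,
}


def overall(scores):
    """Unweighted mean over attributes (assumed aggregation)."""
    return sum(scores.values()) / len(scores)


def regressions(old, new):
    """Attributes whose accuracy dropped from `old` to `new`, sorted by name."""
    return sorted(name for name in old if new[name] < old[name])


v5_overall = overall(V5)
v6_overall = overall(V6)
overall_delta = v6_overall - v5_overall
regressed = regressions(V5, V6)
```

Under this assumption the means round to the 50.4% and 46.7% shown in the table, and the three regressed attributes are the ones marked 🔴.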
+---
+## v7 Training (in progress)
+### v7 Motivation — Research-Backed Combined Approach
+v6 showed that multi-attribute holistic profiling improves education, income, and occupation but needs more data. v5 showed that single-attribute training on 46K examples provides a strong per-attribute signal. v7 combines both approaches, guided by the multi-task SFT literature.
+### Literature Review — Multi-Task SFT for LLMs
+| Paper | Key Finding | Application to v7 |
+|---|---|---|
+| **"Secret Recipe for SFT"** (2412.13337, Dec 2024) | Stacked (simultaneous) training matches or outperforms phased training on 7B models. 2× more sample efficient. | Mix single + multi attr simultaneously — no curriculum/phasing needed |
+| **"Format Consistency"** (2307.15504, Jul 2023) | Format inconsistency between training sources measurably hurts generalization. | Unify ALL examples to `<analysis>` format. System prompt routes task mode |
+| **"Order Matters for Imbalance"** (2312.06134, Dec 2023) | Upsample minority task to within 2-5× of majority task volume. | Upsample 1,920 multi-attr × 12 = ~23K (within 2× of 46K single-attr) |
+| **"DMT: How Abilities Are Affected"** (2310.05492, Oct 2023) | Data composition ratio insignificant; total data amount per task matters. 1.9K is below collapse threshold. | Upsampling ensures multi-attr has sufficient volume |
+| **Flan Collection** (2301.13688, Jan 2023) | Mixing CoT reasoning + direct prediction improves BOTH modes by 2-5%. Task-discriminative prompts help. | Single-attr (direct) + multi-attr (holistic CoT) mix. System prompt discriminates |
+| **RobustFT** (2412.14922, Dec 2024) | Volume-based noise handling > filtering for moderate noise. | Keep all traces (including weak-evidence ones). Handle bias through oversampling |
+### v7 Design
+| Aspect | v5 | v6 | **v7** |
+|---|---|---|---|
+| Single-attr examples | 46K | 0 | **~50K** (with oversampling) |
+| Multi-attr examples | 0 | 1,920 | **~23K** (12× upsampled) |
+| Total examples | 46K | 1,920 | **~73K** |
+| Output format | `<think>` + 1-attr JSON | `<analysis>` + 8-attr JSON | **`<analysis>` everywhere** (unified) |
+| Task routing | Single system prompt | Single system prompt | **Two system prompts** with explicit TASK prefix |
+| Traces kept | All (introduced bias) | All holistic | **All kept** — bias handled by oversampling |
+| Training strategy | Single task | Single task | **Stacked multi-task** (research-backed) |
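The stacked mixture in the table amounts to repeating the multi-attribute set 12 times and shuffling it together with the single-attribute set into one training stream. A minimal sketch of that idea follows; `build_mixture` and its arguments are hypothetical names, and the actual logic in `train_synthpai_v7.py` may differ.

```python
import random


def build_mixture(single_attr, multi_attr, multi_factor=12, seed=0):
    """Stack both tasks into one shuffled list (no phases or curriculum).

    `single_attr` and `multi_attr` are lists of training examples; the
    multi-attribute examples are repeated `multi_factor` times.
    """
    mixed = list(single_attr) + list(multi_attr) * multi_factor
    random.Random(seed).shuffle(mixed)  # fixed seed for a reproducible order
    return mixed


# Volume check using the counts from the table.
upsampled_multi = 1_920 * 12            # 23,040, reported as ~23K
ratio_vs_v5 = 46_000 / upsampled_multi  # just under 2x against 46K single-attr
```

Repeating examples keeps every holistic trace intact while raising that task's share of the optimizer steps.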
+### v7 Format — Unified `<analysis>` Tag
+**Single-attribute mode:**
+```
+System: "...TASK: Infer ONE specified personal attribute from a forum comment..."
+User: "Comment: [text]. Infer the author's age range..."
+Assistant:
+<analysis>
+The author references handling project deadlines and feeling like an impostor,
+suggesting a professional in their late 20s to mid-30s...
+</analysis>
+{"age": "26-35", "confidence": 4}
+```
+**Multi-attribute mode:**
+```
+System: "...TASK: Build a comprehensive profile by inferring ALL personal attributes..."
+User: "Comments: [comment1] [comment2]... Build a comprehensive profile..."
+Assistant:
+<analysis>
+The author's comments reveal a mid-career marketing professional based in Australia.
+The use of 'footy' and 'arvo' indicates Australian English. Strategic budget discussions
+suggest seniority and high income. 'The missus' implies married...
+</analysis>
+{
+  "age": "36-45",
+  "sex": "male",
+  "city_country": "Sydney, Australia",
+  ...all 8...
+}
+```
+Same `<analysis>` tag, same JSON structure — the system prompt TASK prefix tells the model which mode to use.
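Because both modes share one output shape, a single parser can serve both evaluations. The sketch below is an assumption about how such a parser might look, not the code in the evaluation scripts; `parse_response` is a hypothetical name.

```python
import json
import re

# Reasoning inside <analysis>...</analysis>, followed by one JSON object.
ANALYSIS_RE = re.compile(r"<analysis>(.*?)</analysis>\s*(\{.*\})", re.DOTALL)


def parse_response(text):
    """Return (analysis, prediction_dict), or None if the format is violated."""
    match = ANALYSIS_RE.search(text)
    if match is None:
        return None
    try:
        prediction = json.loads(match.group(2))
    except json.JSONDecodeError:
        return None
    return match.group(1).strip(), prediction
```

A single-attribute response yields a dictionary with one attribute key plus `confidence`; a multi-attribute response yields all eight keys. Returning `None` on malformed output lets the caller count format failures separately from wrong predictions.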
+### v7 Oversampling Rules
+| Class | Multiplier | Rationale |
+|---|---|---|
+| sex = male | 3× | Fix persistent female bias (v5: 21% male recall) |
+| age = 18-25 | 2× | Fix young age underprediction |
+| age = 65+ | 2× | Fix old age underprediction |
+| income = low | 2× | Balance against middle-heavy distribution |
+| income = very high | 2× | Rare class support |
+| education = high school | 2× | Fix Masters over-prediction |
+| education = diploma | 2× | Rare class support |
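The rules above can be applied as a lookup from (attribute, label) to a repeat count. This is a minimal sketch, assuming each single-attribute example is a dictionary with `attribute` and `label` keys and that label strings match the table exactly; both are assumptions about the data layout, not details taken from the training script.

```python
# Multipliers from the oversampling table; any other class keeps factor 1.
OVERSAMPLE = {
    ("sex", "male"): 3,
    ("age", "18-25"): 2,
    ("age", "65+"): 2,
    ("income", "low"): 2,
    ("income", "very high"): 2,
    ("education", "high school"): 2,
    ("education", "diploma"): 2,
}


def oversample(examples):
    """Repeat each example according to its (attribute, label) multiplier."""
    out = []
    for example in examples:
        factor = OVERSAMPLE.get((example["attribute"], example["label"]), 1)
        out.extend([example] * factor)
    return out
```

Applied to the roughly 46K v5 traces, multipliers of this kind would account for the growth to about 50K single-attribute examples.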
+### v7 Configuration
+| Parameter | Value |
+|---|---|
+| **Format** | Unified `<analysis>` + JSON (single-attr and multi-attr) |
+| **Single-attr examples** | ~50K (v5 traces, reformatted to `<analysis>`, oversampled) |
+| **Multi-attr examples** | ~23K (v6 holistic traces, 12× upsampled) |
+| **Total** | ~73K |
+| **Loss function** | DFT |
+| **NEFTune alpha** | 5.0 |
+| **LoRA rank** | 16, all-linear, RSLoRA, dropout=0.1 |
+| **Learning rate** | 1e-5, cosine schedule |
+| **Epochs** | 2 |
+| **Batch size** | 2 × 8 = 16 effective |
+| **Max length** | 4096 |
+| **Packing** | False |
+| **Hub model** | [AhmedSohair/synthpai-attribute-inference-7b-v7](https://huggingface.co/AhmedSohair/synthpai-attribute-inference-7b-v7) |
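The batch size, epoch count, and schedule in the table imply a rough step budget and learning-rate curve. The sketch below assumes exactly 73,000 examples, a single GPU, no warmup, and decay to zero; none of these are stated in the log, so treat the numbers as an approximation.

```python
import math


def total_steps(n_examples, per_device_batch=2, grad_accum=8, epochs=2):
    """Approximate optimizer steps: examples / effective batch, per epoch."""
    effective_batch = per_device_batch * grad_accum  # 2 x 8 = 16
    return math.ceil(n_examples / effective_batch) * epochs


def cosine_lr(step, total, peak=1e-5):
    """Cosine decay from `peak` to 0 over `total` steps (no warmup assumed)."""
    progress = min(step / total, 1.0)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))


steps = total_steps(73_000)
lr_start = cosine_lr(0, steps)
lr_mid = cosine_lr(steps // 2, steps)
lr_end = cosine_lr(steps, steps)
```

With these assumptions the run takes on the order of 9,000 optimizer steps, and the learning rate falls to half its peak by the end of the first epoch.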
+### v7 Run Commands
+**Train:**
+```bash
+hf jobs uv run "https://huggingface.co/AhmedSohair/synthpai-training/resolve/main/train_synthpai_v7.py" \
+  --namespace AhmedSohair --flavor l40sx1 --timeout 10h --secrets HF_TOKEN \
+  --with transformers --with trl --with torch --with datasets \
+  --with accelerate --with peft --with huggingface_hub
+```
+**Evaluate (multi-attr mode):**
+```bash
+python evaluate_synthpai_v6.py  # Reuse v6 eval (same multi-attr output format)
+```
+**Evaluate (single-attr mode):**
+```bash
+python evaluate_synthpai_v5.py  # Reuse v5 eval (same single-attr JSON; note v5 wrapped reasoning in <think>, v7 uses <analysis>)
+```
+### v7 Expected Results
+Based on the literature and v5/v6 analysis:
+- **Age**: ~50%+ (v5's single-attr signal prevents v6's collapse)
+- **Income**: ~53%+ (v6's holistic traces fix, maintained by multi-attr examples)
+- **Education**: ~63%+ (v6 got 66.7%, single-attr examples reinforce)
+- **Occupation**: ~67%+ (v6 got 70%, should be maintained or improved)
+- **Overall**: ~55%+ (combining best of v5 and v6)
+### v7 Status
+- [x] Literature review on multi-task SFT completed (6 papers)
+- [x] v7 design finalized (unified format, stacked, upsampled)
+- [x] Training script written and uploaded
+- [ ] Training
+- [ ] Evaluation (multi-attr mode)
+- [ ] Evaluation (single-attr mode)
+- [ ] PAN15 cross-domain evaluation
 - [ ] Results analysis
 ---
+## Future Directions (post-v7)
+### v8: GRPO Reinforcement Learning
+If v7 produces a strong combined SFT checkpoint, the natural next step is GRPO refinement:
+- Start from v7 SFT checkpoint (best of single + multi-attr)
 - Custom reward function that checks ALL 8 attributes simultaneously
 - Can reward profile coherence (age+occupation+relationship consistency bonus)
 - `think_format_reward` ensures `<analysis>` blocks are produced
 | `evaluate_pan15.py` | PAN15 cross-domain evaluation (age + gender on real Twitter) |
 | `generate_holistic_traces.py` | v6 holistic multi-attribute trace generation (GPT-4o) |
 | `train_synthpai_v6.py` | v6 training script (multi-attribute profiling) |
+| `train_synthpai_v7.py` | v7 training script (combined single+multi, research-backed) |
+| `evaluate_synthpai_v6.py` | v6/v7 evaluation script (single-inference multi-attribute) |
 | `HOW_TO_RUN_LOCAL.md` | Local setup guide |
 | `TRAINING_LOG.md` | This file |
 ---
+*Last updated: 2026-05-01 (v6 results, multi-task SFT literature review, v7 research-backed design)*