AhmedSohair committed on
Commit 2a662c9 · verified · 1 Parent(s): 1c5673b

Add v6 results (46.7%), multi-task SFT literature review (6 papers), v7 research-backed design

  1. TRAINING_LOG.md +163 -8
TRAINING_LOG.md CHANGED
@@ -15,9 +15,10 @@
  - [v5 Training](#v5-training)
  - [PAN15 Cross-Domain Evaluation](#pan15-cross-domain-evaluation)
  - [v6 Training (in progress)](#v6-training-in-progress)
  - [Resume Support](#resume-support)
  - [Literature References](#literature-references)
- - [Future Directions (post-v6)](#future-directions-post-v6)

  ---

@@ -1290,19 +1291,172 @@ python evaluate_synthpai_v6.py
  - [x] Evaluation script written
  - [ ] Holistic traces being generated (~1,920 via GPT-4o, in progress)
  - [ ] Training
- - [ ] Evaluation on SynthPAI
  - [ ] Evaluation on PAN15
  - [ ] Evaluation on PANDORA (pending dataset access)
  - [ ] Results analysis

  ---

- ## Future Directions (post-v6)

- ### v7: GRPO Reinforcement Learning (planned after v6)

- If v6 produces a strong multi-attribute SFT checkpoint, the natural next step is GRPO refinement:
- - Start from v6 SFT checkpoint
  - Custom reward function that checks ALL 8 attributes simultaneously
  - Can reward profile coherence (age+occupation+relationship consistency bonus)
  - `think_format_reward` ensures `<analysis>` blocks are produced
@@ -1392,10 +1546,11 @@ The key v5 insight is: **reasoning traces help some attributes but the GPT-4o tr
  | `evaluate_pan15.py` | PAN15 cross-domain evaluation (age + gender on real Twitter) |
  | `generate_holistic_traces.py` | v6 holistic multi-attribute trace generation (GPT-4o) |
  | `train_synthpai_v6.py` | v6 training script (multi-attribute profiling) |
- | `evaluate_synthpai_v6.py` | v6 evaluation script (single-inference multi-attribute) |
  | `HOW_TO_RUN_LOCAL.md` | Local setup guide |
  | `TRAINING_LOG.md` | This file |

  ---

- *Last updated: 2026-04-30 (v6 design complete; holistic multi-attribute profiling, traces generating)*
 
  - [v5 Training](#v5-training)
  - [PAN15 Cross-Domain Evaluation](#pan15-cross-domain-evaluation)
  - [v6 Training (in progress)](#v6-training-in-progress)
+ - [v7 Training (in progress)](#v7-training-in-progress)
  - [Resume Support](#resume-support)
  - [Literature References](#literature-references)
+ - [Future Directions (post-v7)](#future-directions-post-v7)

  ---

 
  - [x] Evaluation script written
  - [ ] Holistic traces being generated (~1,920 via GPT-4o, in progress)
  - [ ] Training
+ - [x] Evaluation on SynthPAI: **46.7% overall**; education/income/occupation improved, age collapsed
  - [ ] Evaluation on PAN15
  - [ ] Evaluation on PANDORA (pending dataset access)
+ - [x] Results analysis: multi-attr-only training (1,920 examples) is too small for 8 attributes simultaneously
+
+ ### v6 Evaluation Results
+
+ | Attribute | v4 | v5 | **v6** | Δv5→v6 |
+ |---|---|---|---|---|
+ | Age | 40.0% | **50.0%** | 16.7% | -33.3pp 🔴 |
+ | Sex | **63.3%** | **63.3%** | 53.3% | -10.0pp 🔴 |
+ | City/Country | 36.7% | 40.0% | **43.3%** | +3.3pp ✅ |
+ | Birth City/Country | 40.0% | **43.3%** | 23.3% | -20.0pp 🔴 |
+ | Education | **63.3%** | 56.7% | **66.7%** | +10.0pp ✅🔥 |
+ | Income Level | **56.7%** | 36.7% | **53.3%** | +16.7pp ✅🔥 |
+ | Occupation | 63.3% | 66.7% | **70.0%** | +3.3pp ✅ |
+ | Relationship Status | 40.0% | **46.7%** | 46.7% | 0pp |
+ | **OVERALL** | **50.4%** | **50.4%** | **46.7%** | -3.8pp |
+
+ **v6 Diagnosis**: Multi-attribute-only training with only 1,920 examples was insufficient. The model learned the easier attributes (education 66.7%, occupation 70.0%, both best ever) but collapsed on harder ones (age 16.7%, birth_city 23.3%). Cross-attribute reasoning worked for strong-signal attributes but failed where per-attribute training volume was needed. Income recovered to 53.3% (from v5's 36.7%): the holistic traces fixed the "low" bias.
+
+ **Key insight**: Multi-attribute format IS better for attributes with strong cross-attribute correlations (education↔occupation↔income). But the model needs sufficient per-attribute training signal from single-attribute examples to avoid collapsing on harder attributes. This motivated v7's combined approach.
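
The OVERALL row in the v6 results table appears to be the unweighted mean of the eight per-attribute accuracies. The sketch below reproduces that arithmetic. It assumes each attribute was scored on 30 examples, which is inferred from every percentage in the table being a multiple of 1/30 and is not stated in the log. The counts are back-computed from the table, and only `age`, `sex` and `city_country` are key names that appear in the log; the other five keys are hypothetical.

```python
# Correct-prediction counts back-computed from the results table, ASSUMING
# 30 scored examples per attribute (an inference, not a logged fact).
# Keys other than "age", "sex" and "city_country" are hypothetical names.
N_PER_ATTRIBUTE = 30

CORRECT = {
    "v5": {"age": 15, "sex": 19, "city_country": 12, "birth_city_country": 13,
           "education": 17, "income_level": 11, "occupation": 20,
           "relationship_status": 14},
    "v6": {"age": 5, "sex": 16, "city_country": 13, "birth_city_country": 7,
           "education": 20, "income_level": 16, "occupation": 21,
           "relationship_status": 14},
}


def accuracy(version, attribute):
    """Per-attribute accuracy in percent."""
    return 100.0 * CORRECT[version][attribute] / N_PER_ATTRIBUTE


def overall(version):
    """Unweighted mean accuracy across all attributes, in percent."""
    counts = CORRECT[version]
    return 100.0 * sum(counts.values()) / (N_PER_ATTRIBUTE * len(counts))


def delta_pp(attribute, old="v5", new="v6"):
    """Change in accuracy from `old` to `new`, in percentage points."""
    diff = CORRECT[new][attribute] - CORRECT[old][attribute]
    return 100.0 * diff / N_PER_ATTRIBUTE


if __name__ == "__main__":
    for version in CORRECT:
        print(version, round(overall(version), 1))
```

Under this assumption the income change is 5/30 of the evaluation set, which rounds to 16.7pp, and the overall change is 9/240, or 3.75pp.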
+
+ ---
+
+ ## v7 Training (in progress)
+
+ ### v7 Motivation: Research-Backed Combined Approach
+
+ v6 showed that multi-attribute holistic profiling improves education/income/occupation but needs more data. v5 showed that single-attribute training with 46K examples provides strong per-attribute signal. v7 combines both approaches, backed by multi-task SFT literature.
+
+ ### Literature Review: Multi-Task SFT for LLMs
+
+ | Paper | Key Finding | Application to v7 |
+ |---|---|---|
+ | **"Secret Recipe for SFT"** (2412.13337, Dec 2024) | Stacked (simultaneous) training matches or outperforms phased training on 7B models. 2× more sample efficient. | Mix single-attr and multi-attr examples simultaneously; no curriculum/phasing needed |
+ | **"Format Consistency"** (2307.15504, Jul 2023) | Format inconsistency between training sources measurably hurts generalization. | Unify ALL examples to `<analysis>` format. System prompt routes task mode |
+ | **"Order Matters for Imbalance"** (2312.06134, Dec 2023) | Upsample minority task to within 2-5× of majority task volume. | Upsample 1,920 multi-attr × 12 = ~23K (within 2× of 46K single-attr) |
+ | **"DMT: How Abilities Are Affected"** (2310.05492, Oct 2023) | Data composition ratio insignificant; total data amount per task matters. 1.9K is below collapse threshold. | Upsampling ensures multi-attr has sufficient volume |
+ | **Flan Collection** (2301.13688, Jan 2023) | Mixing CoT reasoning + direct prediction improves BOTH modes by 2-5%. Task-discriminative prompts help. | Single-attr (direct) + multi-attr (holistic CoT) mix. System prompt discriminates |
+ | **RobustFT** (2412.14922, Dec 2024) | Volume-based noise handling > filtering for moderate noise. | Keep all traces (including weak-evidence ones). Handle bias through oversampling |
+
+ ### v7 Design
+
+ | Aspect | v5 | v6 | **v7** |
+ |---|---|---|---|
+ | Single-attr examples | 46K | 0 | **~50K** (with oversampling) |
+ | Multi-attr examples | 0 | 1,920 | **~23K** (12× upsampled) |
+ | Total examples | 46K | 1,920 | **~73K** |
+ | Output format | `<think>` + 1-attr JSON | `<analysis>` + 8-attr JSON | **`<analysis>` everywhere** (unified) |
+ | Task routing | Single system prompt | Single system prompt | **Two system prompts** with explicit TASK prefix |
+ | Traces kept | All (introduced bias) | All holistic | **All kept**; bias handled by oversampling |
+ | Training strategy | Single task | Single task | **Stacked multi-task** (research-backed) |
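
The 12× upsampling figure in the design table can be checked with a few lines. This is a minimal sketch, assuming upsampling means repeating each multi-attribute example a fixed number of times and shuffling; the `upsample` helper and the placeholder records are hypothetical, not taken from `train_synthpai_v7.py`.

```python
import random


def upsample(examples, factor, seed=0):
    """Repeat every example `factor` times, then shuffle deterministically."""
    repeated = [example for example in examples for _ in range(factor)]
    random.Random(seed).shuffle(repeated)
    return repeated


# Placeholder records standing in for the 1,920 v6 holistic traces.
multi_attr = [{"id": i} for i in range(1920)]
upsampled = upsample(multi_attr, factor=12)

single_attr_count = 46_000  # v5 single-attribute volume quoted in the log
ratio = single_attr_count / len(upsampled)

print(len(upsampled))    # 23040
print(round(ratio, 2))   # 2.0
```

Against the ~50K oversampled single-attribute set, the same ratio is roughly 2.2, still inside the 2-5× band quoted in the literature table.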
+
+ ### v7 Format: Unified `<analysis>` Tag
+
+ **Single-attribute mode:**
+ ```
+ System: "...TASK: Infer ONE specified personal attribute from a forum comment..."
+ User: "Comment: [text]. Infer the author's age range..."
+ Assistant:
+ <analysis>
+ The author references handling project deadlines and feeling like an impostor,
+ suggesting a professional in their late 20s to mid-30s...
+ </analysis>
+ {"age": "26-35", "confidence": 4}
+ ```
+
+ **Multi-attribute mode:**
+ ```
+ System: "...TASK: Build a comprehensive profile by inferring ALL personal attributes..."
+ User: "Comments: [comment1] [comment2]... Build a comprehensive profile..."
+ Assistant:
+ <analysis>
+ The author's comments reveal a mid-career marketing professional based in Australia.
+ The use of 'footy' and 'arvo' indicates Australian English. Strategic budget discussions
+ suggest seniority and high income. 'The missus' implies married...
+ </analysis>
+ {
+ "age": "36-45",
+ "sex": "male",
+ "city_country": "Sydney, Australia",
+ ...all 8...
+ }
+ ```
+
+ Same `<analysis>` tag, same JSON structure; the TASK prefix in the system prompt tells the model which mode to use.
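
Because both modes share one output shape, a single parser can serve both evaluation paths. The following is a minimal sketch of such a parser, assuming the model emits exactly one `<analysis>` block followed by one JSON object; `parse_output` is a hypothetical helper and is not taken from the evaluation scripts named in this log.

```python
import json
import re

# One <analysis>...</analysis> block followed by a JSON object.
OUTPUT_RE = re.compile(r"<analysis>(.*?)</analysis>\s*(\{.*\})", re.DOTALL)


def parse_output(text):
    """Return (analysis_text, prediction_dict), or None if the format is violated."""
    match = OUTPUT_RE.search(text)
    if match is None:
        return None
    try:
        prediction = json.loads(match.group(2))
    except json.JSONDecodeError:
        return None
    return match.group(1).strip(), prediction


sample = (
    "<analysis>\n"
    "The author references handling project deadlines...\n"
    "</analysis>\n"
    '{"age": "26-35", "confidence": 4}'
)

parsed = parse_output(sample)
print(parsed[1]["age"])  # 26-35
```

Returning `None` on any format violation lets an evaluation loop count malformed generations separately from wrong predictions.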
+
+ ### v7 Oversampling Rules
+
+ | Class | Multiplier | Rationale |
+ |---|---|---|
+ | sex = male | 3× | Fix persistent female bias (v5: 21% male recall) |
+ | age = 18-25 | 2× | Fix young age underprediction |
+ | age = 65+ | 2× | Fix old age underprediction |
+ | income = low | 2× | Balance against middle-heavy distribution |
+ | income = very high | 2× | Rare class support |
+ | education = high school | 2× | Fix Masters over-prediction |
+ | education = diploma | 2× | Rare class support |
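
One way to apply the multipliers above to single-attribute examples is a lookup keyed by (attribute, label). This is a sketch under assumptions: the `attribute` and `label` field names are hypothetical, the label strings are copied from the table and may not match the dataset's exact spelling, and the log does not say how the rules apply to multi-attribute examples, so that case is left out.

```python
# Multipliers copied from the oversampling table; any class not listed
# keeps a multiplier of 1. Field names below are hypothetical.
OVERSAMPLE = {
    ("sex", "male"): 3,
    ("age", "18-25"): 2,
    ("age", "65+"): 2,
    ("income", "low"): 2,
    ("income", "very high"): 2,
    ("education", "high school"): 2,
    ("education", "diploma"): 2,
}


def oversample(examples):
    """Duplicate each single-attribute example according to its class multiplier."""
    result = []
    for example in examples:
        factor = OVERSAMPLE.get((example["attribute"], example["label"]), 1)
        result.extend([example] * factor)
    return result


demo = [
    {"attribute": "sex", "label": "male"},
    {"attribute": "sex", "label": "female"},
    {"attribute": "age", "label": "65+"},
]
balanced = oversample(demo)
print(len(balanced))  # 6
```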
+
+ ### v7 Configuration
+
+ | Parameter | Value |
+ |---|---|
+ | **Format** | Unified `<analysis>` + JSON (single-attr and multi-attr) |
+ | **Single-attr examples** | ~50K (v5 traces, reformatted to `<analysis>`, oversampled) |
+ | **Multi-attr examples** | ~23K (v6 holistic traces, 12× upsampled) |
+ | **Total** | ~73K |
+ | **Loss function** | DFT |
+ | **NEFTune alpha** | 5.0 |
+ | **LoRA rank** | 16, all-linear, RSLoRA, dropout=0.1 |
+ | **Learning rate** | 1e-5, cosine schedule |
+ | **Epochs** | 2 |
+ | **Batch size** | 2 × 8 = 16 effective |
+ | **Max length** | 4096 |
+ | **Packing** | False |
+ | **Hub model** | [AhmedSohair/synthpai-attribute-inference-7b-v7](https://huggingface.co/AhmedSohair/synthpai-attribute-inference-7b-v7) |
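
The batch-size and epoch rows imply a rough optimizer-step budget for the run. The sketch below works it through, assuming "2 × 8" means a per-device batch of 2 with 8 gradient-accumulation steps and that training uses a single GPU; both are readings of the table and the run command, not facts stated in the log. The total example count is the approximate ~73K figure.

```python
import math

per_device_batch = 2     # ASSUMPTION: first factor of "2 x 8"
grad_accum_steps = 8     # ASSUMPTION: second factor of "2 x 8"
num_devices = 1          # ASSUMPTION: one GPU (the job flavor is l40sx1)
epochs = 2
total_examples = 73_000  # approximate total from the configuration table

effective_batch = per_device_batch * grad_accum_steps * num_devices
steps_per_epoch = math.ceil(total_examples / effective_batch)
total_steps = steps_per_epoch * epochs

print(effective_batch)   # 16
print(steps_per_epoch)   # 4563
print(total_steps)       # 9126
```

A step count of this size is worth comparing against the 10h job timeout once a measured seconds-per-step figure is available.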
+
+ ### v7 Run Commands
+
+ **Train:**
+ ```bash
+ hf jobs uv run "https://huggingface.co/AhmedSohair/synthpai-training/resolve/main/train_synthpai_v7.py" \
+ --namespace AhmedSohair --flavor l40sx1 --timeout 10h --secrets HF_TOKEN \
+ --with transformers --with trl --with torch --with datasets \
+ --with accelerate --with peft --with huggingface_hub
+ ```
+
+ **Evaluate (multi-attr mode):**
+ ```bash
+ python evaluate_synthpai_v6.py # Reuse v6 eval (same multi-attr output format)
+ ```
+
+ **Evaluate (single-attr mode):**
+ ```bash
+ python evaluate_synthpai_v5.py # Reuse v5 eval (same single-attr output format)
+ ```
+
+ ### v7 Expected Results
+
+ Based on the literature and v5/v6 analysis:
+ - **Age**: ~50%+ (v5's single-attr signal prevents v6's collapse)
+ - **Income**: ~53%+ (v6's holistic traces fix, maintained by multi-attr examples)
+ - **Education**: ~63%+ (v6 got 66.7%, single-attr examples reinforce)
+ - **Occupation**: ~67%+ (v6 got 70%, should be maintained or improved)
+ - **Overall**: ~55%+ (combining best of v5 and v6)
+
+ ### v7 Status
+
+ - [x] Literature review on multi-task SFT completed (6 papers)
+ - [x] v7 design finalized (unified format, stacked, upsampled)
+ - [x] Training script written and uploaded
+ - [ ] Training
+ - [ ] Evaluation (multi-attr mode)
+ - [ ] Evaluation (single-attr mode)
+ - [ ] PAN15 cross-domain evaluation
  - [ ] Results analysis

  ---

+ ## Future Directions (post-v7)

+ ### v8: GRPO Reinforcement Learning

+ If v7 produces a strong combined SFT checkpoint, the natural next step is GRPO refinement:
+ - Start from v7 SFT checkpoint (best of single + multi-attr)
  - Custom reward function that checks ALL 8 attributes simultaneously
  - Can reward profile coherence (age+occupation+relationship consistency bonus)
  - `think_format_reward` ensures `<analysis>` blocks are produced
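
The format check and the all-attribute reward described in this list could look roughly like the sketch below. It is a plain-Python illustration under assumptions: `format_reward` is a hypothetical stand-in rather than the `think_format_reward` named above, the scoring rule (fraction of exact matches) is one possible choice, and only `age`, `sex` and `city_country` are key names from the log; the remaining five are hypothetical. The coherence bonus is not modeled.

```python
import json
import re

# Hypothetical attribute keys; only the first three appear in the log.
ATTRIBUTES = [
    "age", "sex", "city_country", "birth_city_country",
    "education", "income_level", "occupation", "relationship_status",
]

# Whole completion must be one <analysis> block followed by one JSON object.
FORMAT_RE = re.compile(
    r"^\s*<analysis>.+?</analysis>\s*(\{.*\})\s*$", re.DOTALL
)


def format_reward(completion):
    """1.0 if the completion has an <analysis> block followed by JSON, else 0.0."""
    return 1.0 if FORMAT_RE.match(completion) else 0.0


def attribute_reward(completion, gold):
    """Fraction of the 8 attributes predicted exactly (case-insensitive)."""
    match = FORMAT_RE.match(completion)
    if match is None:
        return 0.0
    try:
        predicted = json.loads(match.group(1))
    except json.JSONDecodeError:
        return 0.0
    hits = sum(
        str(predicted.get(name, "")).strip().lower()
        == str(gold[name]).strip().lower()
        for name in ATTRIBUTES
    )
    return hits / len(ATTRIBUTES)


gold_profile = {name: "x" for name in ATTRIBUTES}
guess = dict(gold_profile, age="wrong", sex="wrong")
completion = "<analysis>reasoning</analysis>\n" + json.dumps(guess)

print(format_reward(completion))                   # 1.0
print(attribute_reward(completion, gold_profile))  # 0.75
```

Keeping the two rewards separate makes it possible to weight format compliance and accuracy independently during GRPO.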
 
  | `evaluate_pan15.py` | PAN15 cross-domain evaluation (age + gender on real Twitter) |
  | `generate_holistic_traces.py` | v6 holistic multi-attribute trace generation (GPT-4o) |
  | `train_synthpai_v6.py` | v6 training script (multi-attribute profiling) |
+ | `train_synthpai_v7.py` | v7 training script (combined single+multi, research-backed) |
+ | `evaluate_synthpai_v6.py` | v6/v7 evaluation script (single-inference multi-attribute) |
  | `HOW_TO_RUN_LOCAL.md` | Local setup guide |
  | `TRAINING_LOG.md` | This file |

  ---

+ *Last updated: 2026-05-01 (v6 results, multi-task SFT literature review, v7 research-backed design)*