Add v6 results (46.7%), multi-task SFT literature review (6 papers), v7 research-backed design
- TRAINING_LOG.md +163 -8
TRAINING_LOG.md
CHANGED
|
@@ -15,9 +15,10 @@
|
|
| 15 |
- [v5 Training](#v5-training)
|
| 16 |
- [PAN15 Cross-Domain Evaluation](#pan15-cross-domain-evaluation)
|
| 17 |
- [v6 Training (in progress)](#v6-training-in-progress)
|
|
|
|
| 18 |
- [Resume Support](#resume-support)
|
| 19 |
- [Literature References](#literature-references)
|
| 20 |
-
- [Future Directions (post-
|
| 21 |
|
| 22 |
---
|
| 23 |
|
|
@@ -1290,19 +1291,172 @@ python evaluate_synthpai_v6.py
|
|
| 1290 |
- [x] Evaluation script written
|
| 1291 |
- [ ] Holistic traces being generated (~1,920 via GPT-4o, in progress)
|
| 1292 |
- [ ] Training
|
| 1293 |
-
- [
|
| 1294 |
- [ ] Evaluation on PAN15
|
| 1295 |
- [ ] Evaluation on PANDORA (pending dataset access)
|
| 1296 |
- [ ] Results analysis
|
| 1297 |
|
| 1298 |
---
|
| 1299 |
|
| 1300 |
-
## Future Directions (post-
|
| 1301 |
|
| 1302 |
-
###
|
| 1303 |
|
| 1304 |
-
If
|
| 1305 |
-
- Start from
|
| 1306 |
- Custom reward function that checks ALL 8 attributes simultaneously
|
| 1307 |
- Can reward profile coherence (age+occupation+relationship consistency bonus)
|
| 1308 |
- `think_format_reward` ensures `<analysis>` blocks are produced
|
|
@@ -1392,10 +1546,11 @@ The key v5 insight is: **reasoning traces help some attributes but the GPT-4o tr
|
|
| 1392 |
| `evaluate_pan15.py` | PAN15 cross-domain evaluation (age + gender on real Twitter) |
|
| 1393 |
| `generate_holistic_traces.py` | v6 holistic multi-attribute trace generation (GPT-4o) |
|
| 1394 |
| `train_synthpai_v6.py` | v6 training script (multi-attribute profiling) |
|
| 1395 |
-
| `
|
|
|
|
| 1396 |
| `HOW_TO_RUN_LOCAL.md` | Local setup guide |
|
| 1397 |
| `TRAINING_LOG.md` | This file |
|
| 1398 |
|
| 1399 |
---
|
| 1400 |
|
| 1401 |
-
*Last updated: 2026-
|
|
|
|
| 15 |
- [v5 Training](#v5-training)
|
| 16 |
- [PAN15 Cross-Domain Evaluation](#pan15-cross-domain-evaluation)
|
| 17 |
- [v6 Training (in progress)](#v6-training-in-progress)
|
| 18 |
+
- [v7 Training (in progress)](#v7-training-in-progress)
|
| 19 |
- [Resume Support](#resume-support)
|
| 20 |
- [Literature References](#literature-references)
|
| 21 |
+
- [Future Directions (post-v7)](#future-directions-post-v7)
|
| 22 |
|
| 23 |
---
|
| 24 |
|
|
|
|
| 1291 |
- [x] Evaluation script written
|
| 1292 |
- [ ] Holistic traces being generated (~1,920 via GPT-4o, in progress)
|
| 1293 |
- [ ] Training
|
| 1294 |
+
- [x] Evaluation on SynthPAI: **46.7% overall** — education/income/occupation improved, age collapsed
|
| 1295 |
- [ ] Evaluation on PAN15
|
| 1296 |
- [ ] Evaluation on PANDORA (pending dataset access)
|
| 1297 |
+
- [x] Results analysis: training on multi-attr examples only (1,920) gave too little data to learn all 8 attributes simultaneously
|
| 1298 |
+
|
| 1299 |
+
### v6 Evaluation Results
|
| 1300 |
+
|
| 1301 |
+
| Attribute | V4 | V5 | **V6** | Δv5→v6 |
|
| 1302 |
+
|---|---|---|---|---|
|
| 1303 |
+
| Age | 40.0% | **50.0%** | 16.7% | -33.3pp 🔴 |
|
| 1304 |
+
| Sex | **63.3%** | **63.3%** | 53.3% | -10.0pp 🔴 |
|
| 1305 |
+
| City/Country | 36.7% | 40.0% | **43.3%** | +3.3pp ✅ |
|
| 1306 |
+
| Birth City/Country | 40.0% | **43.3%** | 23.3% | -20.0pp 🔴 |
|
| 1307 |
+
| Education | **63.3%** | 56.7% | **66.7%** | +10.0pp ✅🔥 |
|
| 1308 |
+
| Income Level | **56.7%** | 36.7% | **53.3%** | +16.7pp ✅🔥 |
|
| 1309 |
+
| Occupation | 63.3% | 66.7% | **70.0%** | +3.3pp ✅ |
|
| 1310 |
+
| Relationship Status | 40.0% | **46.7%** | 46.7% | 0pp |
|
| 1311 |
+
| **OVERALL** | **50.4%** | **50.4%** | **46.7%** | -3.8pp |
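
The OVERALL row appears to be the unweighted mean of the eight per-attribute accuracies. The sketch below checks that reading against the table; the aggregation rule is an assumption, since the evaluation script itself is not shown here, and the numbers are copied from the rows above.

```python
from statistics import mean

# Per-attribute accuracies (%) copied from the table, in row order:
# age, sex, city/country, birth city/country, education, income,
# occupation, relationship status.
ACCURACY = {
    "v4": [40.0, 63.3, 36.7, 40.0, 63.3, 56.7, 63.3, 40.0],
    "v5": [50.0, 63.3, 40.0, 43.3, 56.7, 36.7, 66.7, 46.7],
    "v6": [16.7, 53.3, 43.3, 23.3, 66.7, 53.3, 70.0, 46.7],
}


def overall(version: str) -> float:
    """Unweighted mean over the eight attributes (assumed aggregation)."""
    return mean(ACCURACY[version])


for version in ACCURACY:
    print(f"{version}: {overall(version):.1f}%")
print(f"delta v5->v6: {overall('v6') - overall('v5'):+.1f}pp")
```

Under this assumption the script reproduces the 50.4%, 50.4%, and 46.7% overall figures and the -3.8pp change from v5 to v6.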
|
| 1312 |
+
|
| 1313 |
+
**v6 Diagnosis**: Training on multi-attribute examples alone, with just 1,920 of them, was insufficient. The model learned the easier attributes (education 66.7% and occupation 70.0%, both the best so far) but collapsed on harder ones (age 16.7%, birth_city 23.3%). Cross-attribute reasoning worked for strong-signal attributes but failed where per-attribute training volume was needed. Income recovered to 53.3% from v5's 36.7%, which suggests the holistic traces fixed the bias toward predicting "low".
|
| 1314 |
+
|
| 1315 |
+
**Key insight**: The multi-attribute format is better for attributes with strong cross-attribute correlations (education↔occupation↔income), but the model also needs enough per-attribute training signal from single-attribute examples to avoid collapsing on harder attributes. This motivated v7's combined approach.
|
| 1316 |
+
|
| 1317 |
+
---
|
| 1318 |
+
|
| 1319 |
+
## v7 Training (in progress)
|
| 1320 |
+
|
| 1321 |
+
### v7 Motivation — Research-Backed Combined Approach
|
| 1322 |
+
|
| 1323 |
+
v6 showed that multi-attribute holistic profiling improves education, income, and occupation over v5 but needs more data. v5 showed that single-attribute training on 46K examples provides a strong per-attribute signal. v7 combines both approaches, guided by the multi-task SFT literature.
|
| 1324 |
+
|
| 1325 |
+
### Literature Review — Multi-Task SFT for LLMs
|
| 1326 |
+
|
| 1327 |
+
| Paper | Key Finding | Application to v7 |
|
| 1328 |
+
|---|---|---|
|
| 1329 |
+
| **"Secret Recipe for SFT"** (2412.13337, Dec 2024) | Stacked (simultaneous) training matches or outperforms phased training on 7B models. 2× more sample efficient. | Mix single + multi attr simultaneously — no curriculum/phasing needed |
|
| 1330 |
+
| **"Format Consistency"** (2307.15504, Jul 2023) | Format inconsistency between training sources measurably hurts generalization. | Unify ALL examples to `<analysis>` format. System prompt routes task mode |
|
| 1331 |
+
| **"Order Matters for Imbalance"** (2312.06134, Dec 2023) | Upsample minority task to within 2-5× of majority task volume. | Upsample 1,920 multi-attr × 12 = ~23K (within 2× of 46K single-attr) |
|
| 1332 |
+
| **"DMT: How Abilities Are Affected"** (2310.05492, Oct 2023) | Data composition ratio insignificant; total data amount per task matters. 1.9K is below collapse threshold. | Upsampling ensures multi-attr has sufficient volume |
|
| 1333 |
+
| **Flan Collection** (2301.13688, Jan 2023) | Mixing CoT reasoning + direct prediction improves BOTH modes by 2-5%. Task-discriminative prompts help. | Single-attr (direct) + multi-attr (holistic CoT) mix. System prompt discriminates |
|
| 1334 |
+
| **RobustFT** (2412.14922, Dec 2024) | Volume-based noise handling > filtering for moderate noise. | Keep all traces (including weak-evidence ones). Handle bias through oversampling |
|
| 1335 |
+
|
| 1336 |
+
### v7 Design
|
| 1337 |
+
|
| 1338 |
+
| Aspect | v5 | v6 | **v7** |
|
| 1339 |
+
|---|---|---|---|
|
| 1340 |
+
| Single-attr examples | 46K | 0 | **~50K** (with oversampling) |
|
| 1341 |
+
| Multi-attr examples | 0 | 1,920 | **~23K** (12× upsampled) |
|
| 1342 |
+
| Total examples | 46K | 1,920 | **~73K** |
|
| 1343 |
+
| Output format | `<think>` + 1-attr JSON | `<analysis>` + 8-attr JSON | **`<analysis>` everywhere** (unified) |
|
| 1344 |
+
| Task routing | Single system prompt | Single system prompt | **Two system prompts** with explicit TASK prefix |
|
| 1345 |
+
| Traces kept | All (introduced bias) | All holistic | **All kept** — bias handled by oversampling |
|
| 1346 |
+
| Training strategy | Single task | Single task | **Stacked multi-task** (research-backed) |
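
As a rough illustration of the stacked mixture in the table, the sketch below repeats the multi-attribute examples 12× and shuffles them together with the single-attribute examples. The function name, the seed, and the placeholder records are assumptions for illustration; the real `train_synthpai_v7.py` may build its dataset differently.

```python
import random


def build_v7_mixture(single_attr, multi_attr, multi_upsample=12, seed=0):
    """Return one shuffled list mixing both task types (stacked training).

    multi_upsample=12 is the multiplier from the design table; the seed
    and this helper's name are illustrative assumptions.
    """
    mixture = list(single_attr) + list(multi_attr) * multi_upsample
    random.Random(seed).shuffle(mixture)
    return mixture


# Placeholder records standing in for the real formatted examples.
single = [{"task": "single", "id": i} for i in range(50_000)]
multi = [{"task": "multi", "id": i} for i in range(1_920)]

mixture = build_v7_mixture(single, multi)
n_multi = sum(1 for ex in mixture if ex["task"] == "multi")
print(len(mixture), n_multi, round(len(single) / n_multi, 2))
```

With these placeholder sizes the multi-attribute share is 1,920 × 12 = 23,040 examples, which matches the ~23K and ~73K totals in the table.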
|
| 1347 |
+
|
| 1348 |
+
### v7 Format — Unified `<analysis>` Tag
|
| 1349 |
+
|
| 1350 |
+
**Single-attribute mode:**
|
| 1351 |
+
```
|
| 1352 |
+
System: "...TASK: Infer ONE specified personal attribute from a forum comment..."
|
| 1353 |
+
User: "Comment: [text]. Infer the author's age range..."
|
| 1354 |
+
Assistant:
|
| 1355 |
+
<analysis>
|
| 1356 |
+
The author references handling project deadlines and feeling like an impostor,
|
| 1357 |
+
suggesting a professional in their late 20s to mid-30s...
|
| 1358 |
+
</analysis>
|
| 1359 |
+
{"age": "26-35", "confidence": 4}
|
| 1360 |
+
```
|
| 1361 |
+
|
| 1362 |
+
**Multi-attribute mode:**
|
| 1363 |
+
```
|
| 1364 |
+
System: "...TASK: Build a comprehensive profile by inferring ALL personal attributes..."
|
| 1365 |
+
User: "Comments: [comment1] [comment2]... Build a comprehensive profile..."
|
| 1366 |
+
Assistant:
|
| 1367 |
+
<analysis>
|
| 1368 |
+
The author's comments reveal a mid-career marketing professional based in Australia.
|
| 1369 |
+
The use of 'footy' and 'arvo' indicates Australian English. Strategic budget discussions
|
| 1370 |
+
suggest seniority and high income. 'The missus' implies married...
|
| 1371 |
+
</analysis>
|
| 1372 |
+
{
|
| 1373 |
+
"age": "36-45",
|
| 1374 |
+
"sex": "male",
|
| 1375 |
+
"city_country": "Sydney, Australia",
|
| 1376 |
+
...all 8...
|
| 1377 |
+
}
|
| 1378 |
+
```
|
| 1379 |
+
|
| 1380 |
+
Both modes use the same `<analysis>` tag followed by a JSON object; the TASK prefix in the system prompt tells the model which mode to use.
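
Because both modes share one output shape, a single parser can in principle handle either. The following is a minimal sketch under that assumption; `parse_output` is a hypothetical helper, not a function from the evaluation scripts, and the sample text is an abbreviated version of the single-attribute example above.

```python
import json
import re

# An <analysis> block followed by a JSON object, as in the examples above.
_OUTPUT_RE = re.compile(r"<analysis>(.*?)</analysis>\s*(\{.*\})", re.DOTALL)


def parse_output(text: str):
    """Split a completion into (analysis_text, predictions), or return None."""
    match = _OUTPUT_RE.search(text)
    if match is None:
        return None
    try:
        predictions = json.loads(match.group(2))
    except json.JSONDecodeError:
        return None
    return match.group(1).strip(), predictions


sample = (
    "<analysis>\nMentions project deadlines.\n</analysis>\n"
    '{"age": "26-35", "confidence": 4}'
)
print(parse_output(sample))
print(parse_output("free text without the expected tags"))
```

Returning `None` for malformed completions lets a caller count format failures separately from wrong predictions.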
|
| 1381 |
+
|
| 1382 |
+
### v7 Oversampling Rules
|
| 1383 |
+
|
| 1384 |
+
| Class | Multiplier | Rationale |
|
| 1385 |
+
|---|---|---|
|
| 1386 |
+
| sex = male | 3× | Fix persistent female bias (v5: 21% male recall) |
|
| 1387 |
+
| age = 18-25 | 2× | Fix young age underprediction |
|
| 1388 |
+
| age = 65+ | 2× | Fix old age underprediction |
|
| 1389 |
+
| income = low | 2× | Balance against middle-heavy distribution |
|
| 1390 |
+
| income = very high | 2× | Rare class support |
|
| 1391 |
+
| education = high school | 2× | Fix Masters over-prediction |
|
| 1392 |
+
| education = diploma | 2× | Rare class support |
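
One way to read the table is as a per-example repeat count keyed on (attribute, label). The sketch below applies it to single-attribute examples; the record fields `attribute` and `label` and the exact label strings are assumptions, since the dataset schema is not shown here.

```python
# (attribute, label) -> multiplier, copied from the table above.
OVERSAMPLE = {
    ("sex", "male"): 3,
    ("age", "18-25"): 2,
    ("age", "65+"): 2,
    ("income", "low"): 2,
    ("income", "very high"): 2,
    ("education", "high school"): 2,
    ("education", "diploma"): 2,
}


def oversample(examples):
    """Repeat each example by its class multiplier (default 1)."""
    out = []
    for ex in examples:
        copies = OVERSAMPLE.get((ex["attribute"], ex["label"]), 1)
        out.extend([ex] * copies)
    return out


# Toy records with hypothetical field names.
toy = [
    {"attribute": "sex", "label": "male"},
    {"attribute": "sex", "label": "female"},
    {"attribute": "age", "label": "18-25"},
    {"attribute": "age", "label": "26-35"},
]
print(len(oversample(toy)))
```

Each single-attribute example carries one label, so at most one rule applies to it; how multipliers would combine for multi-attribute examples is left open here.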
|
| 1393 |
+
|
| 1394 |
+
### v7 Configuration
|
| 1395 |
+
|
| 1396 |
+
| Parameter | Value |
|
| 1397 |
+
|---|---|
|
| 1398 |
+
| **Format** | Unified `<analysis>` + JSON (single-attr and multi-attr) |
|
| 1399 |
+
| **Single-attr examples** | ~50K (v5 traces, reformatted to `<analysis>`, oversampled) |
|
| 1400 |
+
| **Multi-attr examples** | ~23K (v6 holistic traces, 12× upsampled) |
|
| 1401 |
+
| **Total** | ~73K |
|
| 1402 |
+
| **Loss function** | DFT |
|
| 1403 |
+
| **NEFTune alpha** | 5.0 |
|
| 1404 |
+
| **LoRA rank** | 16, all-linear, RSLoRA, dropout=0.1 |
|
| 1405 |
+
| **Learning rate** | 1e-5, cosine schedule |
|
| 1406 |
+
| **Epochs** | 2 |
|
| 1407 |
+
| **Batch size** | 2 × 8 = 16 effective |
|
| 1408 |
+
| **Max length** | 4096 |
|
| 1409 |
+
| **Packing** | False |
|
| 1410 |
+
| **Hub model** | [AhmedSohair/synthpai-attribute-inference-7b-v7](https://huggingface.co/AhmedSohair/synthpai-attribute-inference-7b-v7) |
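
The batch size and epoch rows imply a rough optimizer step count. The sketch below works it out, assuming that "2 × 8" means a per-device batch of 2 with 8 gradient-accumulation steps, that training runs on a single GPU, and that the dataset has about 73,000 examples; none of these details are confirmed by the table, and `V7Config` is a hypothetical name.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class V7Config:
    per_device_batch_size: int = 2        # assumed meaning of the "2"
    gradient_accumulation_steps: int = 8  # assumed meaning of the "8"
    epochs: int = 2
    num_examples: int = 73_000            # "~73K" in the table, approximate
    num_devices: int = 1                  # assumption: single GPU

    @property
    def effective_batch_size(self) -> int:
        return (
            self.per_device_batch_size
            * self.gradient_accumulation_steps
            * self.num_devices
        )

    @property
    def total_optimizer_steps(self) -> int:
        steps_per_epoch = math.ceil(self.num_examples / self.effective_batch_size)
        return steps_per_epoch * self.epochs


cfg = V7Config()
print(cfg.effective_batch_size, cfg.total_optimizer_steps)
```

With packing disabled each example is one sequence, so the estimate is simply examples divided by effective batch size, times epochs; the trainer's actual count may differ slightly.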
|
| 1411 |
+
|
| 1412 |
+
### v7 Run Commands
|
| 1413 |
+
|
| 1414 |
+
**Train:**
|
| 1415 |
+
```bash
|
| 1416 |
+
hf jobs uv run "https://huggingface.co/AhmedSohair/synthpai-training/resolve/main/train_synthpai_v7.py" \
|
| 1417 |
+
--namespace AhmedSohair --flavor l40sx1 --timeout 10h --secrets HF_TOKEN \
|
| 1418 |
+
--with transformers --with trl --with torch --with datasets \
|
| 1419 |
+
--with accelerate --with peft --with huggingface_hub
|
| 1420 |
+
```
|
| 1421 |
+
|
| 1422 |
+
**Evaluate (multi-attr mode):**
|
| 1423 |
+
```bash
|
| 1424 |
+
python evaluate_synthpai_v6.py # Reuse v6 eval (same multi-attr output format)
|
| 1425 |
+
```
|
| 1426 |
+
|
| 1427 |
+
**Evaluate (single-attr mode):**
|
| 1428 |
+
```bash
|
| 1429 |
+
python evaluate_synthpai_v5.py # Reuse v5 eval (same single-attr output format)
|
| 1430 |
+
```
|
| 1431 |
+
|
| 1432 |
+
### v7 Expected Results
|
| 1433 |
+
|
| 1434 |
+
Based on the literature and v5/v6 analysis:
|
| 1435 |
+
- **Age**: ~50%+ (v5's single-attr signal prevents v6's collapse)
|
| 1436 |
+
- **Income**: ~53%+ (v6's holistic traces fix, maintained by multi-attr examples)
|
| 1437 |
+
- **Education**: ~63%+ (v6 got 66.7%, single-attr examples reinforce)
|
| 1438 |
+
- **Occupation**: ~67%+ (v6 got 70%, should be maintained or improved)
|
| 1439 |
+
- **Overall**: ~55%+ (combining best of v5 and v6)
|
| 1440 |
+
|
| 1441 |
+
### v7 Status
|
| 1442 |
+
|
| 1443 |
+
- [x] Literature review on multi-task SFT completed (6 papers)
|
| 1444 |
+
- [x] v7 design finalized (unified format, stacked, upsampled)
|
| 1445 |
+
- [x] Training script written and uploaded
|
| 1446 |
+
- [ ] Training
|
| 1447 |
+
- [ ] Evaluation (multi-attr mode)
|
| 1448 |
+
- [ ] Evaluation (single-attr mode)
|
| 1449 |
+
- [ ] PAN15 cross-domain evaluation
|
| 1450 |
- [ ] Results analysis
|
| 1451 |
|
| 1452 |
---
|
| 1453 |
|
| 1454 |
+
## Future Directions (post-v7)
|
| 1455 |
|
| 1456 |
+
### v8: GRPO Reinforcement Learning
|
| 1457 |
|
| 1458 |
+
If v7 produces a strong combined SFT checkpoint, the natural next step is GRPO refinement:
|
| 1459 |
+
- Start from v7 SFT checkpoint (best of single + multi-attr)
|
| 1460 |
- Custom reward function that checks ALL 8 attributes simultaneously
|
| 1461 |
- Can reward profile coherence (age+occupation+relationship consistency bonus)
|
| 1462 |
- `think_format_reward` ensures `<analysis>` blocks are produced
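
A reward along the lines of the bullets above could be sketched as follows. This is a standalone illustration, not the planned implementation: `profile_reward` and `format_weight` are hypothetical, the coherence bonus is omitted, and the function would need adapting to the reward interface of whichever RL library is used.

```python
import json
import re

_OUTPUT_RE = re.compile(r"<analysis>(.*?)</analysis>\s*(\{.*\})", re.DOTALL)


def profile_reward(completion: str, gold: dict, format_weight: float = 0.2) -> float:
    """Score one completion in [0, 1].

    format_weight is granted for a non-empty <analysis> block followed by
    parseable JSON; the remainder is the fraction of gold attributes matched.
    """
    match = _OUTPUT_RE.search(completion)
    if match is None or not match.group(1).strip():
        return 0.0
    try:
        predicted = json.loads(match.group(2))
    except json.JSONDecodeError:
        return 0.0
    if not gold:
        return format_weight
    correct = sum(
        1
        for key, value in gold.items()
        if str(predicted.get(key, "")).strip().lower() == str(value).strip().lower()
    )
    return format_weight + (1.0 - format_weight) * correct / len(gold)


completion = '<analysis>Uses "footy" and "arvo".</analysis>\n{"age": "36-45", "sex": "male"}'
print(profile_reward(completion, {"age": "36-45", "sex": "male"}))
print(profile_reward(completion, {"age": "26-35", "sex": "male"}))
```

Scoring every gold attribute in one pass is what distinguishes this from a per-attribute reward; partial credit keeps the signal dense when only some attributes are right.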
|
|
|
|
| 1546 |
| `evaluate_pan15.py` | PAN15 cross-domain evaluation (age + gender on real Twitter) |
|
| 1547 |
| `generate_holistic_traces.py` | v6 holistic multi-attribute trace generation (GPT-4o) |
|
| 1548 |
| `train_synthpai_v6.py` | v6 training script (multi-attribute profiling) |
|
| 1549 |
+
| `train_synthpai_v7.py` | v7 training script (combined single+multi, research-backed) |
|
| 1550 |
+
| `evaluate_synthpai_v6.py` | v6/v7 evaluation script (single-inference multi-attribute) |
|
| 1551 |
| `HOW_TO_RUN_LOCAL.md` | Local setup guide |
|
| 1552 |
| `TRAINING_LOG.md` | This file |
|
| 1553 |
|
| 1554 |
---
|
| 1555 |
|
| 1556 |
+
*Last updated: 2026-05-01 (v6 results, multi-task SFT literature review, v7 research-backed design)*
|