
# SynthPAI Attribute Inference Training

Fine-tune Qwen2.5-7B-Instruct on SynthPAI for personal attribute inference from online text.

## Quick Start

### Option A: HF Jobs (personal namespace)

```bash
pip install huggingface_hub
huggingface-cli login

# Download the training script
wget https://huggingface.co/AhmedSohair/synthpai-training/resolve/main/train_synthpai.py

# Launch via HF Jobs API (uses YOUR credits, not org credits)
huggingface-cli jobs run train_synthpai.py \
  --namespace AhmedSohair \
  --hardware a100-large \
  --timeout 6h \
  --dependencies transformers trl torch datasets trackio accelerate peft bitsandbytes huggingface_hub
```

### Option B: Any GPU machine (Colab, RunPod, etc.)

```bash
pip install transformers trl torch datasets trackio accelerate peft bitsandbytes huggingface_hub
huggingface-cli login

wget https://huggingface.co/AhmedSohair/synthpai-training/resolve/main/train_synthpai.py
python train_synthpai.py
```

**Requirements:** a GPU with 24 GB+ VRAM (L4, A10G, A100, etc.)

## Evaluate

```bash
wget https://huggingface.co/AhmedSohair/synthpai-training/resolve/main/evaluate_synthpai.py
python evaluate_synthpai.py --model AhmedSohair/synthpai-attribute-inference-7b --split test --mode per_comment
```
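The exact prompt lives in `evaluate_synthpai.py`; the sketch below only illustrates the shape of a per-comment query — chat-style messages asking the model to infer the attributes listed under Architecture. The wording and the `build_messages` helper are assumptions, not the script's actual prompt.

```python
# Attribute list from the model card; the prompt wording is hypothetical.
ATTRIBUTES = [
    "age", "sex", "city/country", "birth city/country",
    "education", "income level", "occupation", "relationship status",
]

def build_messages(comment: str) -> list[dict]:
    """Build a chat-style prompt for per-comment attribute inference."""
    attr_list = ", ".join(ATTRIBUTES)
    return [
        {"role": "system",
         "content": "You infer personal attributes of a text's author."},
        {"role": "user",
         "content": f"Comment:\n{comment}\n\nInfer the author's {attr_list}."},
    ]

messages = build_messages("Just moved here from Melbourne for my PhD.")
```

Messages in this form can be fed to the fine-tuned model via `tokenizer.apply_chat_template(...)` and a standard `generate` call.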

## Architecture

| Component | Details |
|---|---|
| Base model | Qwen/Qwen2.5-7B-Instruct |
| Method | SFT + LoRA (r=64, all-linear, RSLoRA) |
| Dataset | RobinSta/SynthPAI (7,823 comments, 300 authors) |
| Training examples | ~52K single-comment + ~5.7K multi-comment ≈ **58K total** (from 240 train authors) |
| Attributes | age, sex, city/country, birth city/country, education, income level, occupation, relationship status |
| Split | Author-level: 240 train / 30 val / 30 test (no author leakage) |
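The author-level split is the key anti-leakage step: all comments by one author land in exactly one split, so the model can never memorize an author's profile at train time and recall it at test time. A minimal sketch of such a split (field names like `"author"` are assumptions, not necessarily those used by `train_synthpai.py`):

```python
import random

def author_level_split(records, n_train=240, n_val=30, n_test=30, seed=0):
    """Partition records by author: every author's comments go to one split."""
    authors = sorted({r["author"] for r in records})
    assert len(authors) == n_train + n_val + n_test, "expects 300 authors"
    rng = random.Random(seed)
    rng.shuffle(authors)
    train_a = set(authors[:n_train])
    val_a = set(authors[n_train:n_train + n_val])
    splits = {"train": [], "val": [], "test": []}
    for r in records:
        if r["author"] in train_a:
            splits["train"].append(r)
        elif r["author"] in val_a:
            splits["val"].append(r)
        else:
            splits["test"].append(r)
    return splits
```

Shuffling authors (not comments) with a fixed seed keeps the 240/30/30 split reproducible while guaranteeing the three author sets are disjoint.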

## Hyperparameters

| Parameter | Value |
|---|---|
| Learning rate | 2e-4 |
| Effective batch size | 16 (2 per device × 8 grad accum) |
| Epochs | 3 |
| Max seq length | 1024 |
| Packing | True |
| Precision | bf16 |
| LoRA rank | 64 |
| LoRA alpha | 16 |
| Target modules | all-linear |
| RSLoRA | True |
| Scheduler | cosine |
| Warmup | 5% |
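The table above maps onto `peft`/`trl` configuration objects roughly as follows. This is a minimal sketch, not the actual contents of `train_synthpai.py`, and exact argument names (e.g. `max_seq_length`) can differ across `trl` versions:

```python
from peft import LoraConfig
from trl import SFTConfig

# LoRA settings: rank 64, alpha 16, RSLoRA scaling, all linear layers
peft_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules="all-linear",
    use_rslora=True,
    task_type="CAUSAL_LM",
)

# SFT settings matching the hyperparameter table
args = SFTConfig(
    output_dir="synthpai-attribute-inference-7b",
    learning_rate=2e-4,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,   # effective batch size 2 x 8 = 16
    num_train_epochs=3,
    max_seq_length=1024,
    packing=True,
    bf16=True,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
)
```

These two objects would then be handed to `trl.SFTTrainer` together with the base model and the formatted SynthPAI training set.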

## Baselines (FTI GPT-4 zero-shot on SynthPAI)

Source: AutoProfiler Table 5, FTI column (Staab et al. 2024)

| Attribute | FTI (GPT-4) |
|---|---|
| Age | 69.4% |
| Sex | 92.8% |
| Location | 80.0% |
| Birth place | 88.0% |
| Education | 73.0% |
| Income | 66.7% |
| Occupation | 73.9% |
| Relationship | 79.2% |
| **Average** | **77.9%** |
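As a sanity check, the reported average is the unweighted mean of the eight per-attribute numbers (77.875, which the table reports rounded to 77.9%):

```python
# Per-attribute FTI (GPT-4) accuracies from the baselines table
fti = {
    "Age": 69.4, "Sex": 92.8, "Location": 80.0, "Birth place": 88.0,
    "Education": 73.0, "Income": 66.7, "Occupation": 73.9, "Relationship": 79.2,
}
average = sum(fti.values()) / len(fti)  # 77.875, reported as 77.9
```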

## References

- Staab et al. (2023). *Beyond Memorization: Violating Privacy Via Inference with Large Language Models.* arXiv:2310.07298
- Yukhymenko et al. (2024). *A Synthetic Dataset for Personal Attribute Inference.* arXiv:2406.07217
- Zhang et al. (2025). *Automated Profile Inference with Language Model Agents.* arXiv:2505.12402

## License

This model was trained on SynthPAI, which is released under CC-BY-NC-SA-4.0; use of the model and its outputs must comply with that license (attribution, non-commercial, share-alike).
