
# SynthPAI Attribute Inference Training

Fine-tune Qwen2.5-7B-Instruct on SynthPAI for personal attribute inference from online text.

## Quick Start

### Option A: HF Jobs (personal namespace)

```bash
pip install huggingface_hub
huggingface-cli login

# Download the training script
wget https://huggingface.co/AhmedSohair/synthpai-training/resolve/main/train_synthpai.py

# Launch via HF Jobs API (uses YOUR credits, not org credits)
huggingface-cli jobs run train_synthpai.py \
  --namespace AhmedSohair \
  --hardware a100-large \
  --timeout 6h \
  --dependencies transformers trl torch datasets trackio accelerate peft bitsandbytes huggingface_hub
```

### Option B: Any GPU machine (Colab, RunPod, etc.)

```bash
pip install transformers trl torch datasets trackio accelerate peft bitsandbytes huggingface_hub
huggingface-cli login

wget https://huggingface.co/AhmedSohair/synthpai-training/resolve/main/train_synthpai.py
python train_synthpai.py
```

**Requirements:** a GPU with 24 GB+ VRAM (L4, A10G, A100, etc.)

## Evaluate

```bash
wget https://huggingface.co/AhmedSohair/synthpai-training/resolve/main/evaluate_synthpai.py
python evaluate_synthpai.py --model AhmedSohair/synthpai-attribute-inference-7b --split test --mode per_comment
```
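The exact prompt lives in `evaluate_synthpai.py`; the sketch below only illustrates the shape of a per-comment query — chat-style messages asking the model to infer the attributes listed under Architecture. The wording and the `build_messages` helper are assumptions, not the script's actual prompt.

```python
# Attribute list from the model card; the prompt wording is hypothetical.
ATTRIBUTES = [
    "age", "sex", "city/country", "birth city/country",
    "education", "income level", "occupation", "relationship status",
]

def build_messages(comment: str) -> list[dict]:
    """Build a chat-style prompt for per-comment attribute inference."""
    attr_list = ", ".join(ATTRIBUTES)
    return [
        {"role": "system",
         "content": "You infer personal attributes of a text's author."},
        {"role": "user",
         "content": f"Comment:\n{comment}\n\nInfer the author's {attr_list}."},
    ]

messages = build_messages("Just moved here from Melbourne for my PhD.")
```

Messages in this form can be fed to the fine-tuned model via `tokenizer.apply_chat_template(...)` and a standard `generate` call.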

## Architecture

| Component | Details |
|---|---|
| Base model | Qwen/Qwen2.5-7B-Instruct |
| Method | SFT + LoRA (r=64, all-linear, RSLoRA) |
| Dataset | RobinSta/SynthPAI (7,823 comments, 300 authors) |
| Training examples | ~52K single-comment + ~5.7K multi-comment ≈ **58K total** (from 240 train authors) |
| Attributes | age, sex, city/country, birth city/country, education, income level, occupation, relationship status |
| Split | Author-level: 240 train / 30 val / 30 test (no author leakage) |
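The author-level split is the key anti-leakage step: all comments by one author land in exactly one split, so the model can never memorize an author's profile at train time and recall it at test time. A minimal sketch of such a split (field names like `"author"` are assumptions, not necessarily those used by `train_synthpai.py`):

```python
import random

def author_level_split(records, n_train=240, n_val=30, n_test=30, seed=0):
    """Partition records by author: every author's comments go to one split."""
    authors = sorted({r["author"] for r in records})
    assert len(authors) == n_train + n_val + n_test, "expects 300 authors"
    rng = random.Random(seed)
    rng.shuffle(authors)
    train_a = set(authors[:n_train])
    val_a = set(authors[n_train:n_train + n_val])
    splits = {"train": [], "val": [], "test": []}
    for r in records:
        if r["author"] in train_a:
            splits["train"].append(r)
        elif r["author"] in val_a:
            splits["val"].append(r)
        else:
            splits["test"].append(r)
    return splits
```

Shuffling authors (not comments) with a fixed seed keeps the 240/30/30 split reproducible while guaranteeing the three author sets are disjoint.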

## Hyperparameters

| Parameter | Value |
|---|---|
| Learning rate | 2e-4 |
| Effective batch size | 16 (2 per device × 8 grad accum) |
| Epochs | 3 |
| Max seq length | 1024 |
| Packing | True |
| Precision | bf16 |
| LoRA rank | 64 |
| LoRA alpha | 16 |
| Target modules | all-linear |
| RSLoRA | True |
| Scheduler | cosine |
| Warmup | 5% |
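The table above maps onto `peft`/`trl` configuration objects roughly as follows. This is a minimal sketch, not the actual contents of `train_synthpai.py`, and exact argument names (e.g. `max_seq_length`) can differ across `trl` versions:

```python
from peft import LoraConfig
from trl import SFTConfig

# LoRA settings: rank 64, alpha 16, RSLoRA scaling, all linear layers
peft_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules="all-linear",
    use_rslora=True,
    task_type="CAUSAL_LM",
)

# SFT settings matching the hyperparameter table
args = SFTConfig(
    output_dir="synthpai-attribute-inference-7b",
    learning_rate=2e-4,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,   # effective batch size 2 x 8 = 16
    num_train_epochs=3,
    max_seq_length=1024,
    packing=True,
    bf16=True,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
)
```

These two objects would then be handed to `trl.SFTTrainer` together with the base model and the formatted SynthPAI training set.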

## Baselines (FTI GPT-4 zero-shot on SynthPAI)

Source: AutoProfiler Table 5, FTI column (Staab et al. 2024)

| Attribute | FTI (GPT-4) |
|---|---|
| Age | 69.4% |
| Sex | 92.8% |
| Location | 80.0% |
| Birth place | 88.0% |
| Education | 73.0% |
| Income | 66.7% |
| Occupation | 73.9% |
| Relationship | 79.2% |
| **Average** | **77.9%** |
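As a sanity check, the reported average is the unweighted mean of the eight per-attribute numbers (77.875, which the table reports rounded to 77.9%):

```python
# Per-attribute FTI (GPT-4) accuracies from the baselines table
fti = {
    "Age": 69.4, "Sex": 92.8, "Location": 80.0, "Birth place": 88.0,
    "Education": 73.0, "Income": 66.7, "Occupation": 73.9, "Relationship": 79.2,
}
average = sum(fti.values()) / len(fti)  # 77.875, reported as 77.9
```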

## References

- Staab et al. (2023). *Beyond Memorization: Violating Privacy Via Inference with Large Language Models.* arXiv:2310.07298
- Yukhymenko et al. (2024). *A Synthetic Dataset for Personal Attribute Inference.* arXiv:2406.07217
- Zhang et al. (2025). *Automated Profile Inference with Language Model Agents.* arXiv:2505.12402

## License

This model was trained on SynthPAI, which is released under CC-BY-NC-SA-4.0; use of the model and its outputs must comply with that license (attribution, non-commercial, share-alike).
