# GPT Family Relation: Reversal Curse Experiments
GPT causal language models trained on the `family_relation` dataset to study the **reversal curse**: the phenomenon where a model trained on "A is the parent of B" fails to infer "B is the child of A".
## Key Finding
Weight decay is the key driver for solving the reversal curse. With sufficient weight decay, models achieve high reversal accuracy, largely overcoming the reversal curse without any data augmentation.
## Results (nhead=8, d_model=768, L=12, 20 epochs)
| wd | train acc | test acc (reversal) |
|---|---|---|
| 0.0 | 98.25% | 19.65% |
| 1.0 | 99.97% | 90.43% |
| 3.0 | 99.92% | 94.05% |
| 5.0 | 99.98% | 97.35% |
| 6.0 | 100.00% | 95.07% |
| 7.0 | 99.98% | 88.17% |
| 8.0 | 100.00% | 99.12% |
- **Train acc**: accuracy on the bidirectional eval split (same direction as the training data).
- **Test acc (reversal)**: accuracy on the unidirectional eval split (reversed direction, never seen during training).
## Model Architecture
| Component | Value |
|---|---|
| Parameters | ~115M |
| Layers | 12 |
| Hidden dim (d_model) | 768 |
| Attention heads | 8 (head_dim=96) |
| FFN hidden | 3072 (4 × d_model) |
| Max seq len | 1024 |
| Vocab size | 32,768 (tiktoken BPE) |
| Activation | ReLU |
| Normalization | RMSNorm (learnable, pre-norm) |
| Positional encoding | RoPE |
| QK norm | RMSNorm (learnable) |
| Logit softcap | 15.0 |
| Embedding tying | No (untied) |
| Bias | None |
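The logit softcap of 15.0 listed above smoothly bounds the output logits before the softmax. A minimal sketch of tanh softcapping; the exact call site inside `model.py` is an assumption, not shown in this card:

```python
import math

def softcap(logit: float, cap: float = 15.0) -> float:
    # Smoothly bound a logit to (-cap, cap): cap * tanh(logit / cap).
    # Near zero this is approximately the identity; large magnitudes
    # saturate at +/-cap. In the model this runs element-wise over the
    # output logits, e.g. cap * torch.tanh(logits / cap) in PyTorch.
    return cap * math.tanh(logit / cap)
```

Because the function is monotonic, capping does not change the argmax token, but it limits how confident any single logit can become, which stabilizes training.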
## Training Details
| Setting | Value |
|---|---|
| Optimizer | AdamW (betas=0.9, 0.95) |
| Learning rate | 3e-4 |
| Schedule | Cosine decay with 1% warmup |
| Batch size | 64 |
| Dropout | 0.1 (attention + residual) |
| Epochs | 20 |
| Precision | FP32 weights, bf16 autocast forward |
| Gradient clipping | None |
| Weight decay | Applied to all parameters except RMSNorm weights |
| Data packing | Simple concatenation, fixed-size chunks |
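The weight-decay rule above (decay everything except RMSNorm weights) maps naturally onto AdamW parameter groups. A hedged sketch of how such a split might look; the actual grouping logic lives in the training code, which is not shown here:

```python
import torch

def build_optimizer(model: torch.nn.Module, lr: float = 3e-4, wd: float = 8.0):
    # Assumption: since the model is bias-free, the only 1-D parameters are
    # the RMSNorm gains, so splitting by dimensionality reproduces the rule
    # "decay everything except RMSNorm weights". The real training code may
    # group parameters differently (e.g. by name).
    decay = [p for p in model.parameters() if p.requires_grad and p.ndim >= 2]
    no_decay = [p for p in model.parameters() if p.requires_grad and p.ndim < 2]
    return torch.optim.AdamW(
        [{"params": decay, "weight_decay": wd},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=lr, betas=(0.9, 0.95),
    )
```

Exempting norm gains from decay is standard practice; decaying them would shrink the per-channel scales toward zero rather than regularizing the actual weight matrices.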
## Usage
```python
from huggingface_hub import hf_hub_download
import os, sys, torch

# Download model
model_path = hf_hub_download("kdkyum/gpt-family-relation", "h8_wd8.0/best_model.pt")
model_py_path = hf_hub_download("kdkyum/gpt-family-relation", "model.py")

# Load model
sys.path.insert(0, os.path.dirname(model_py_path))
from model import GPT, GPTConfig, load_model, load_tokenizer

config = GPTConfig(nhead=8, dropout=0.1)
model = load_model(model_path, config=config, device="cuda")  # or "cpu"

# Load tokenizer (tiktoken BPE)
enc = load_tokenizer()
bos_id = enc.encode_single_token("<|bos|>")
period_id = enc.encode_ordinary(".")[0]

# Generate
prompt = " Ryan Earl Garza mother"  # reversed query (child → parent)
ids = torch.tensor([[bos_id] + enc.encode_ordinary(prompt)], dtype=torch.long, device="cuda")
out = model.generate(ids, max_new_tokens=10)
new_ids = out[0, ids.shape[1]:].tolist()

# Stop at first period or bos
result = []
for t in new_ids:
    if t == period_id or t == bos_id:
        break
    result.append(t)
print(enc.decode(result))
```
## Files
```
model.py                # Self-contained GPT model (needs torch + tiktoken)
h8_wd{0.0,8.0}/
  best_model.pt         # Best checkpoint (by reversal test accuracy)
  latest_model.pt       # Final checkpoint (end of training)
```
## Dataset
Trained on `kdkyum/family_relation` (`lvl3_N1e+3` split): synthetically generated family-relation statements covering ~1000 families and 3 levels of depth.
```python
from huggingface_hub import hf_hub_download
import json

# Load training data
path = hf_hub_download("kdkyum/family_relation", "lvl3_N1e+3/train.json", repo_type="dataset")
with open(path) as f:
    train_data = json.load(f)["train"]

# train_data is a list of strings, e.g.:
# "Samuel Earl Garza and Dominique Earl Garza are the parents of Ryan Earl Garza."

# Load eval splits
bi_path = hf_hub_download("kdkyum/family_relation", "lvl3_N1e+3/eval_reverse_bi.json", repo_type="dataset")
uni_path = hf_hub_download("kdkyum/family_relation", "lvl3_N1e+3/eval_reverse_uni.json", repo_type="dataset")
with open(bi_path) as f:
    eval_bi = json.load(f)["reverse_bi"]  # bidirectional (same direction as training)
with open(uni_path) as f:
    eval_uni = json.load(f)["reverse_uni"]  # unidirectional (reversed, tests reversal curse)

# Each eval item has "prompt" and "answer" fields, e.g.:
# {"prompt": " Ryan Earl Garza mother", "answer": ["Dominique Earl Garza"]}
```
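Given eval items of the form above, reversal accuracy can be scored as exact match against any of the reference answers. A minimal sketch; the scoring code used for the reported numbers is not shown in this card, so treat the matching rule as an assumption:

```python
def exact_match_accuracy(predictions, eval_items):
    # predictions: decoded model outputs, one string per eval item.
    # eval_items: list of {"prompt": ..., "answer": [...]} dicts.
    # A prediction counts as correct if, after stripping surrounding
    # whitespace, it equals any of the reference answers.
    correct = 0
    for pred, item in zip(predictions, eval_items):
        if pred.strip() in {a.strip() for a in item["answer"]}:
            correct += 1
    return correct / len(eval_items)

items = [{"prompt": " Ryan Earl Garza mother", "answer": ["Dominique Earl Garza"]}]
print(exact_match_accuracy([" Dominique Earl Garza"], items))  # 1.0
```

The `answer` field is a list because some relations (e.g. "children") can have multiple valid completions; here any listed answer is accepted.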
## Citation
If you use these models, please cite:
```bibtex
@misc{gpt-family-relation,
  author = {kdkyum},
  title  = {GPT Family Relation: Solving the Reversal Curse with Weight Decay},
  year   = {2026},
  url    = {https://huggingface.co/kdkyum/gpt-family-relation}
}
```