---
datasets:
- tinyllms/game24-trajectories
- tinyllms/aime-1983-2023-trajectories
base_model:
- Qwen/Qwen2.5-7B-Instruct
tags:
- leave-one-out
- loo-domain-knowledge
- max_seq_length=16384
- lr=2e-5
- batch_size=1
- grad_accum=16
- epochs=1
- qlora
- quantize=4bit_nf4
- lora_rank=64
- lora_alpha=128
- lora_dropout=0.05
- completion_only_loss
- eval_size=0.1
- cosine_schedule
- warmup=0.05
- bf16
- ddp_workers=2
- ray_job=raysubmit_A55M5NnZckrXmfWN
---

# Qwen2.5-7B-Instruct SFT — LOO Domain Knowledge

Fine-tuned from **Qwen/Qwen2.5-7B-Instruct** using QLoRA (4-bit NF4 quantization + LoRA adapters, merged before upload).

This is the **SFT stage** of a leave-one-out (LOO) experiment: the model is trained on Game24 and AIME trajectories, deliberately **excluding domain knowledge (GPQA)** data. The held-out domain is later used to measure cross-domain transfer.

## Training Configuration

- **Learning rate:** 2e-5 (cosine schedule, 5% warmup)
- **Batch size:** 1 per device, gradient accumulation 16 (effective batch size 32 with 2 workers)
- **Epochs:** 1
- **Max sequence length:** 16384
- **Precision:** bf16
- **Weight decay:** 0.01

## QLoRA

- **Quantization:** 4-bit NF4 with double quantization
- **LoRA rank:** 64
- **LoRA alpha:** 128
- **LoRA dropout:** 0.05
- **Target modules:** q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj

## Loss

- **completion_only_loss:** prompt tokens are masked; loss is computed only on assistant completion tokens
- Dataset is converted from `messages` to `prompt`/`completion` format before training

## Datasets

Trained on two datasets (domain knowledge held out):

| Dataset | Domain |
|---------|--------|
| `tinyllms/game24-trajectories` | Game of 24 — arithmetic reasoning |
| `tinyllms/aime-1983-2023-trajectories` | AIME — competition math |

Examples exceeding `max_seq_len` are filtered out. A 10% holdout is used for evaluation (eval runs every 10 steps).

## Leave-One-Out Design

| Domain | Role |
|--------|------|
| Game24 | Train |
| AIME | Train |
| Domain Knowledge (GPQA) | **Held out** |

The GRPO stage follows using `tinyllms/qwen2.5-7b-instruct-grpo-loo-domain-knowledge`, trained on the same two datasets. Transfer is measured by evaluating on GPQA Diamond.

## Infrastructure

- **GPU:** 2x NVIDIA H100 80GB (DDP)
- **Framework:** TRL 0.29 + Ray Train
- **Tracking:** [Weights & Biases](https://wandb.ai/psr-labs/pocket-sheet-sft/runs/pzs50igz) (project: `pocket-sheet-sft`)
- **Ray Job ID:** raysubmit_A55M5NnZckrXmfWN