---
datasets:
- tinyllms/game24-trajectories
- tinyllms/aime-1983-2023-trajectories
base_model:
- Qwen/Qwen2.5-7B-Instruct
tags:
- leave-one-out
- loo-domain-knowledge
- max_seq_length=16384
- lr=2e-5
- batch_size=1
- grad_accum=16
- epochs=1
- qlora
- quantize=4bit_nf4
- lora_rank=64
- lora_alpha=128
- lora_dropout=0.05
- completion_only_loss
- eval_size=0.1
- cosine_schedule
- warmup=0.05
- bf16
- ddp_workers=2
- ray_job=raysubmit_A55M5NnZckrXmfWN
---

# Qwen2.5-7B-Instruct SFT: LOO Domain Knowledge

Fine-tuned from **Qwen/Qwen2.5-7B-Instruct** using QLoRA (4-bit NF4 quantization + LoRA adapters, merged before upload).

This is the **SFT stage** of a leave-one-out (LOO) experiment: the model is trained on Game24 and AIME trajectories, deliberately **excluding domain knowledge (GPQA)** data. The held-out domain is later used to measure cross-domain transfer.

## Training Configuration

- **Learning rate:** 2e-5 (cosine schedule, 5% warmup)
- **Batch size:** 1 per device, gradient accumulation 16 (effective batch size 32 with 2 workers)
- **Epochs:** 1
- **Max sequence length:** 16384
- **Precision:** bf16
- **Weight decay:** 0.01

## QLoRA

- **Quantization:** 4-bit NF4 with double quantization
- **LoRA rank:** 64
- **LoRA alpha:** 128
- **LoRA dropout:** 0.05
- **Target modules:** q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj

## Loss

- **completion_only_loss:** prompt tokens are masked; loss is computed only on assistant completion tokens
- Dataset is converted from `messages` to `prompt`/`completion` format before training

## Datasets

Trained on two datasets (domain knowledge held out):

| Dataset | Domain |
|---------|--------|
| `tinyllms/game24-trajectories` | Game of 24: arithmetic reasoning |
| `tinyllms/aime-1983-2023-trajectories` | AIME: competition math |

Examples exceeding `max_seq_len` are filtered out. A 10% holdout is used for evaluation (eval runs every 10 steps).

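The data preparation described above (converting `messages` to `prompt`/`completion`, dropping over-long examples, and holding out 10% for evaluation) can be sketched roughly as follows. This is a minimal illustration, not the actual training script: the helper names `to_prompt_completion`, `within_length`, and `split_holdout` are hypothetical, the split assumes the final message is the assistant turn, and the `count_tokens` callable stands in for whatever tokenizer and chat template the real pipeline uses.

```python
import random
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Message = Dict[str, str]


def to_prompt_completion(example: Dict[str, List[Message]]) -> Optional[Dict[str, List[Message]]]:
    """Split a `messages` conversation into prompt and completion parts.

    Assumption: the last message is the assistant turn to be learned;
    everything before it is treated as the (loss-masked) prompt.
    """
    messages = example["messages"]
    if not messages or messages[-1]["role"] != "assistant":
        return None  # nothing to train on
    return {"prompt": messages[:-1], "completion": messages[-1:]}


def within_length(
    example: Dict[str, List[Message]],
    count_tokens: Callable[[str], int],
    max_seq_len: int = 16384,
) -> bool:
    """Return True if prompt + completion fit within max_seq_len tokens.

    `count_tokens` is a placeholder; a real run would count tokens after
    applying the model's chat template.
    """
    all_messages = example["prompt"] + example["completion"]
    total = sum(count_tokens(m["content"]) for m in all_messages)
    return total <= max_seq_len


def split_holdout(
    examples: Sequence, eval_size: float = 0.1, seed: int = 0
) -> Tuple[list, list]:
    """Shuffle deterministically and hold out a fraction for evaluation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_eval = int(len(shuffled) * eval_size)
    return shuffled[n_eval:], shuffled[:n_eval]
```

With this shape, only the tokens in `completion` would contribute to the loss, which matches the `completion_only_loss` setting listed under Loss.
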
## Leave-One-Out Design

| Domain | Role |
|--------|------|
| Game24 | Train |
| AIME | Train |
| Domain Knowledge (GPQA) | **Held out** |

The GRPO stage follows using `tinyllms/qwen2.5-7b-instruct-grpo-loo-domain-knowledge`, trained on the same two datasets. Transfer is measured by evaluating on GPQA Diamond.

## Infrastructure

- **GPU:** 2x NVIDIA H100 80GB (DDP)
- **Framework:** TRL 0.29 + Ray Train
- **Tracking:** [Weights & Biases](https://wandb.ai/psr-labs/pocket-sheet-sft/runs/pzs50igz) (project: `pocket-sheet-sft`)
- **Ray Job ID:** raysubmit_A55M5NnZckrXmfWN
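
With two DDP workers, the effective batch size of 32 follows from the per-device batch size and gradient accumulation. The sketch below works through that arithmetic along with the standard LoRA scaling factor (`lora_alpha / lora_rank`). The `RunConfig` class is hypothetical and only mirrors the values listed in this card, and the dataset size passed to `optimizer_steps` is an illustrative number, not the actual training set size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical container mirroring the hyperparameters in this card."""

    per_device_batch_size: int = 1
    grad_accum_steps: int = 16
    num_workers: int = 2  # 2x H100, one DDP worker each
    lora_rank: int = 64
    lora_alpha: int = 128

    @property
    def effective_batch_size(self) -> int:
        # Examples consumed per optimizer step across all workers.
        return self.per_device_batch_size * self.grad_accum_steps * self.num_workers

    @property
    def lora_scaling(self) -> float:
        # Standard LoRA scales the adapter update by alpha / rank.
        return self.lora_alpha / self.lora_rank

    def optimizer_steps(self, num_train_examples: int) -> int:
        # Ceiling division; assumes a partial final batch still triggers a step.
        return -(-num_train_examples // self.effective_batch_size)


cfg = RunConfig()
print(cfg.effective_batch_size)  # 32
print(cfg.lora_scaling)          # 2.0
print(cfg.optimizer_steps(3200))  # 100 (illustrative dataset size)
```

Whether a trailing partial batch produces an optimizer step depends on the trainer's settings, so the step count here is an estimate.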