qwen2.5-7b-instruct-sft-loo-domain-knowledge / README.md

Create README.md

81a7c58 verified 3 months ago

2.52 kB

datasets:
  - tinyllms/game24-trajectories
  - tinyllms/aime-1983-2023-trajectories
base_model:
  - Qwen/Qwen2.5-7B-Instruct
tags:
  - leave-one-out
  - loo-domain-knowledge
  - max_seq_length=16384
  - lr=2e-5
  - batch_size=1
  - grad_accum=16
  - epochs=1
  - qlora
  - quantize=4bit_nf4
  - lora_rank=64
  - lora_alpha=128
  - lora_dropout=0.05
  - completion_only_loss
  - eval_size=0.1
  - cosine_schedule
  - warmup=0.05
  - bf16
  - ddp_workers=2
  - ray_job=raysubmit_A55M5NnZckrXmfWN

Qwen2.5-7B-Instruct SFT — LOO Domain Knowledge

Fine-tuned from Qwen/Qwen2.5-7B-Instruct using QLoRA (4-bit NF4 quantization + LoRA adapters, merged before upload).

This is the SFT stage of a leave-one-out (LOO) experiment: the model is trained on Game24 and AIME trajectories, deliberately excluding domain knowledge (GPQA) data. The held-out domain is later used to measure cross-domain transfer.

Training Configuration

Learning rate: 2e-5 (cosine schedule, 5% warmup)
Batch size: 1 per device, gradient accumulation 16 (effective batch size 32 with 2 workers)
Epochs: 1
Max sequence length: 16384
Precision: bf16
Weight decay: 0.01

QLoRA

Quantization: 4-bit NF4 with double quantization
LoRA rank: 64
LoRA alpha: 128
LoRA dropout: 0.05
Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj

Loss

completion_only_loss: prompt tokens are masked; loss is computed only on assistant completion tokens
Dataset is converted from messages to prompt/completion format before training

Datasets

Trained on two datasets (domain knowledge held out):

Dataset	Domain
`tinyllms/game24-trajectories`	Game of 24 — arithmetic reasoning
`tinyllms/aime-1983-2023-trajectories`	AIME — competition math

Examples exceeding max_seq_len are filtered out. A 10% holdout is used for evaluation (eval runs every 10 steps).

Leave-One-Out Design

Domain	Role
Game24	Train
AIME	Train
Domain Knowledge (GPQA)	Held out

The GRPO stage follows using tinyllms/qwen2.5-7b-instruct-grpo-loo-domain-knowledge, trained on the same two datasets. Transfer is measured by evaluating on GPQA Diamond.

Infrastructure

GPU: 2x NVIDIA H100 80GB (DDP)
Framework: TRL 0.29 + Ray Train
Tracking: Weights & Biases (project: pocket-sheet-sft)
Ray Job ID: raysubmit_A55M5NnZckrXmfWN