AI & ML interests
Hello y'all, this is just my private organization for storing all the files related to this hackathon.
The following is the plan for this hackathon.
Kaggle NVIDIA Nemotron Reasoning Challenge — Plan
Goal
Produce a LoRA adapter (rank ≤ 32) on Nemotron-3-Nano-30B that maximizes accuracy on Alice's Wonderland reasoning puzzles.
Setup
- Base model: Nemotron-3-Nano-30B
- Hardware: 1× H100 80GB
- Method: QLoRA (4-bit base + LoRA rank 32 on all linear layers). If the H100's VRAM allows, we may try plain LoRA instead of QLoRA.
- Framework: TRL v1.0 (supports SFT, DAPO-style GRPO out of the box)
- Output format: Model must place final answer inside \boxed{...}
Task Categories (6 types in train.csv)
- Bit manipulation (8-bit binary transforms)
- Text encryption (ciphers)
- Numeral system conversion (Roman numerals, etc.)
- Unit conversion (secret conversion factor)
- Modified gravitational constant (physics)
- Equation transformation rules
Phase 1 — Data Foundation (cornerstone of the whole pipeline)
Step 1a: Analyze train.csv
- Count rows per category
- Record difficulty patterns and answer format quirks per category
- Hold out train.csv entirely as validation set (closest proxy to hidden test distribution)
Step 1b: Build 6 Python puzzle generators
- One generator per category, with controllable difficulty knobs
- Each generator must output (prompt, ground_truth_answer) pairs
- Generate ~100K synthetic prompts total
- Match train.csv's category distribution
- Apply floor of ~5K per category (ensure coverage even for rare categories)
- Mix easy/medium/hard within each category; this should also follow the difficulty distribution of the given train.csv
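As a rough illustration of Step 1b, here is a minimal sketch of one generator (numeral system conversion) plus the per-category allocation with a floor. The prompt wording, the `DIFFICULTY_MAX` knob, and the function names are all assumptions made for this example; the real prompts should mirror the phrasing found in train.csv.

```python
import random

ROMAN = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"), (90, "XC"),
         (50, "L"), (40, "XL"), (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]

def to_roman(n: int) -> str:
    """Standard subtractive Roman numeral for 1 <= n <= 3999."""
    out = []
    for value, symbol in ROMAN:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)

# Hypothetical difficulty knob: a larger range gives longer numerals.
DIFFICULTY_MAX = {"easy": 20, "medium": 400, "hard": 3999}

def gen_roman(rng: random.Random, difficulty: str):
    """Return one (prompt, ground_truth_answer) pair."""
    n = rng.randint(1, DIFFICULTY_MAX[difficulty])
    # Assumed prompt wording, not the competition's actual phrasing.
    prompt = f"Convert {n} to Roman numerals. Put the final answer inside \\boxed{{}}."
    return prompt, to_roman(n)

def allocate(category_counts: dict, total: int, floor: int) -> dict:
    """Split `total` prompts proportionally to train.csv counts,
    then raise any category below `floor` up to `floor`."""
    n_rows = sum(category_counts.values())
    return {cat: max(floor, round(total * count / n_rows))
            for cat, count in category_counts.items()}
```

Note that applying the floor can push the grand total slightly above `total`; for a ~100K target with a ~5K floor that overshoot is probably acceptable.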
Step 1c: Distill reasoning traces via API (only paid step in the whole pipeline)
- Use SOTA model (Claude / GPT-4 / etc.) to generate reasoning traces for each prompt
- Format: reasoning steps + \boxed{answer}
- Budget estimate: ~$200–500
Step 1d: Filter by correctness
- Extract \boxed{} from each generated trace
- Keep only rows where extracted answer matches ground truth (exact string or numerical tolerance)
- Expected yield: 70–90% → **~80K clean SFT rows**
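The correctness filter in Step 1d could look roughly like the sketch below. The row schema (`trace`, `answer`, `numeric` keys) and the tolerance value are assumptions for illustration. Numerical tolerance is only applied when a row is flagged as numeric, because otherwise a bit string such as `00000101` would wrongly match `101`.

```python
import math
from typing import Optional

def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, handling nested braces."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth, chars = 1, []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # braces never closed

def answers_match(pred: Optional[str], truth: str,
                  numeric: bool = False, rel_tol: float = 1e-4) -> bool:
    """Exact string match, or float comparison within rel_tol when numeric=True."""
    if pred is None:
        return False
    if pred.strip() == truth.strip():
        return True
    if not numeric:
        return False
    try:
        return math.isclose(float(pred), float(truth), rel_tol=rel_tol)
    except ValueError:
        return False

def filter_traces(rows):
    """Keep rows whose boxed answer matches the ground truth.
    rows: iterable of dicts with 'trace', 'answer', optional 'numeric' (assumed schema)."""
    return [r for r in rows
            if answers_match(extract_boxed(r["trace"]), r["answer"],
                             numeric=r.get("numeric", False))]
```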
Step 1e: Train/val split
- Validation: use train.csv (9,500 rows) — NOT in training data
- Training: ~80K filtered synthetic (prompt, trace, answer) rows
Phase 2 — N × (SFT + DAPO) with checkpoint tracking
SFT: train as usual, with cross-entropy loss.
Reward function (used by DAPO every round):
Extract \boxed{} from rollout → 1 if matches ground truth (exact or numerical tolerance), 0 otherwise.
DAPO config (important — differs from vanilla GRPO):
- Decoupled clipping (Clip-Higher): epsilon_low < epsilon_high
- No KL penalty (β=0)
- Token-level policy gradient loss
- Dynamic sampling (filter degenerate batches)
- Group size: 8–16 (or 2–4 if compute-tight)
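To make the DAPO settings above concrete, here is a small pure-Python sketch of the loss for a single prompt's group of rollouts. It is only meant to show how the pieces fit together (group-normalized advantages, asymmetric clipping, token-level averaging, no KL term); the actual training would go through the framework's trainer. The default `eps_low` / `eps_high` values and the use of the population standard deviation are assumptions for this example.

```python
import math
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Advantage of each rollout relative to its own group: (r - mean) / std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def has_signal(rewards):
    """Dynamic sampling: a group where every rollout got the same reward
    has zero advantage everywhere, so it is skipped."""
    return len(set(rewards)) > 1

def dapo_loss(logp_new, logp_old, rewards, eps_low=0.2, eps_high=0.28):
    """logp_new / logp_old: one list of per-token log-probs per rollout."""
    advantages = group_advantages(rewards)
    total, n_tokens = 0.0, 0
    for new, old, adv in zip(logp_new, logp_old, advantages):
        for lp_new, lp_old in zip(new, old):
            ratio = math.exp(lp_new - lp_old)
            # Clip-Higher: the upper bound is looser than the lower bound.
            clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
            total += min(ratio * adv, clipped * adv)
            n_tokens += 1
    # Token-level loss: normalize by all tokens in the group, so long
    # rollouts are not down-weighted. No KL penalty is added (beta = 0).
    return -total / n_tokens
```

With binary rewards, `has_signal` is equivalent to requiring at least one correct and one incorrect rollout in the group.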
Round 1
Round 1 SFT
- Train on Phase 1 dataset (~80K rows)
- Evaluate on train.csv validation → save score + LoRA checkpoint
Round 1 DAPO
- Prompt pool: ~30K–50K prompts
- ~40% overlap with SFT prompts (tests generalization); open question: what difficulty distribution should these prompts follow?
- ~60% fresh synthetic prompts; open question: what difficulty distribution should these prompts follow?
- Only prompts + ground truth needed — rollouts generated live
- Evaluate on train.csv → save score + LoRA checkpoint
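One simple way to assemble the Round 1 DAPO prompt pool with the 40/60 split is sketched below. The function name and arguments are hypothetical; each item can be any object, for example a (prompt, ground_truth_answer) pair.

```python
import random

def build_dapo_pool(sft_prompts, fresh_prompts, pool_size, overlap_frac=0.4, seed=0):
    """Sample overlap_frac of the pool from prompts already seen in SFT
    and the rest from fresh synthetic prompts, then shuffle."""
    rng = random.Random(seed)
    n_overlap = round(pool_size * overlap_frac)
    n_fresh = pool_size - n_overlap
    pool = rng.sample(list(sft_prompts), n_overlap) + rng.sample(list(fresh_prompts), n_fresh)
    rng.shuffle(pool)
    return pool
```

Fixing the seed keeps the pool reproducible across reruns, which makes score comparisons between checkpoints easier to trust.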
Round 2+
Self-distillation (replaces API distillation from Phase 1):
- Use previous round's DAPO model to generate traces locally
- No more API cost — everything runs on H100
Round N SFT
- Prompt pool for self-distillation: ~150K prompts
- ~70K: reuse previous round's SFT prompts (regenerate with stronger model); open question: what difficulty distribution should these prompts follow?
- ~80K: fresh synthetic prompts for diversity; open question: what difficulty distribution should these prompts follow?
- Generate 4–8 candidate traces per prompt (temperature ~0.7)
- Filter: keep traces whose \boxed{} matches ground truth
- Expected yield: 85–95% (model is much stronger now); open question: how many problems should we target to reach this yield?
- Keep 1–2 best traces per prompt (shortest correct, or diverse)
- Mix in ~25K rows from previous round's SFT data (prevents drift / mode collapse); open question: what difficulty distribution should these rows follow?
- Final training set: ~175K rows
- Evaluate on train.csv → save score + LoRA checkpoint
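The "keep 1–2 best traces per prompt" step could be sketched as follows, using the "shortest correct" rule. The candidate schema (`prompt_id`, `trace`, `correct`) is an assumption, and length is measured in characters here as a stand-in for token count.

```python
from collections import defaultdict

def select_traces(candidates, max_per_prompt=2):
    """Keep up to max_per_prompt of the shortest correct traces for each prompt.
    candidates: list of dicts with 'prompt_id', 'trace', 'correct' (assumed schema)."""
    by_prompt = defaultdict(list)
    for cand in candidates:
        if cand["correct"]:
            by_prompt[cand["prompt_id"]].append(cand)
    selected = []
    for traces in by_prompt.values():
        # Drop exact duplicate traces before ranking by length.
        unique = {t["trace"]: t for t in traces}.values()
        selected.extend(sorted(unique, key=lambda t: len(t["trace"]))[:max_per_prompt])
    return selected
```

Prompts with no correct candidate simply drop out, which is where the expected 85–95% yield comes from.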
Round N DAPO
- Prompt pool: ~30K–50K, skewed harder; open question: what difficulty distribution should these prompts follow?
- Goldilocks filter: keep prompts where current SFT model succeeds 20–80% of the time
- Drop prompts the model aces (no signal)
- Drop prompts the model always fails (no signal)
- DAPO's Dynamic Sampling helps here but manual curation is better
- As rounds progress, difficulty floor rises
- Evaluate → save score + LoRA checkpoint
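The Goldilocks filter reduces to a success-rate threshold per prompt. A minimal sketch, assuming the success rate has already been estimated from a handful of rollouts of the current SFT model (function names are hypothetical):

```python
def success_rate(rewards):
    """Fraction of rollouts for one prompt that earned reward 1."""
    return sum(rewards) / len(rewards)

def goldilocks_filter(rates, low=0.2, high=0.8):
    """Keep prompt ids whose success rate lies in [low, high].
    rates: dict mapping prompt_id -> success rate in [0, 1]."""
    return [pid for pid, rate in rates.items() if low <= rate <= high]
```

With 8 rollouts per prompt, this keeps prompts solved between 2 and 6 times; raising `low` in later rounds is one way to raise the difficulty floor.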
Stopping Criterion
Stop iterating when validation score plateaus for 2 consecutive rounds.
Realistic target: 2–3 full rounds within the 2-month window.
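One possible reading of "plateaus for 2 consecutive rounds" is that neither of the last two rounds beats the best score seen before them. The sketch below encodes that interpretation; `min_delta` is an added assumption for ignoring improvements smaller than the noise level.

```python
def should_stop(round_scores, patience=2, min_delta=0.0):
    """round_scores: best validation score of each completed round, oldest first.
    Returns True when the last `patience` rounds all failed to beat
    the best earlier score by more than min_delta."""
    if len(round_scores) <= patience:
        return False
    best_before = max(round_scores[:-patience])
    return all(score <= best_before + min_delta
               for score in round_scores[-patience:])
```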
Submission Strategy
- Save LoRA checkpoint + validation score after every SFT and DAPO stage
- Final submission: best-scoring checkpoint across all rounds (not necessarily the latest — models can regress)
- Package as submission.zip containing adapter_config.json + adapter weights
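A small sketch of the checkpoint selection and packaging steps, using only the standard library. The log schema is hypothetical, and placing the files at the root of the archive is an assumption; check the competition's submission rules for the exact layout it expects.

```python
import zipfile
from pathlib import Path

def best_checkpoint(log):
    """Pick the highest-scoring entry, not necessarily the latest.
    log: list of dicts like {"stage": "round1_sft", "score": 0.71, "path": "..."}."""
    return max(log, key=lambda entry: entry["score"])

def package_submission(adapter_dir, out_path="submission.zip"):
    """Zip every file in adapter_dir (adapter_config.json + weights)."""
    adapter_dir = Path(adapter_dir)
    config = adapter_dir / "adapter_config.json"
    if not config.exists():
        raise FileNotFoundError(f"missing {config}")
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(adapter_dir.iterdir()):
            if path.is_file():
                zf.write(path, arcname=path.name)  # stored at the archive root
    return out_path
```

Write the zip outside `adapter_dir` so that a rerun does not pick up the previous archive.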
What Gets Generated Where (Cost Summary)
| Stage | Prompts | Final Rows | Source | Cost |
|---|---|---|---|---|
| Round 1 SFT | 100K | ~80K | API distillation + filter | $200–500 (API) |
| Round 1 DAPO | 30K–50K | N/A (live rollouts) | Python generators | GPU only |
| Round 2+ SFT | 150K | ~175K | Self-distillation + filter + previous-round mix | GPU only |
| Round 2+ DAPO | 30K–50K | N/A (live rollouts) | Python generators (Goldilocks-filtered) | GPU only |
Key insight: the API is used exactly once (the Phase 1 distillation that feeds Round 1 SFT). Everything after that uses Python scripts for prompts and the previous round's model for traces.
Hyperparameters Quick Reference
| Setting | Value |
|---|---|
| Base model | Nemotron-3-Nano-30B (4-bit via bitsandbytes) |
| LoRA rank | 32 (competition max) |
| LoRA alpha | 64 |
| LoRA target modules | All linear layers (q, k, v, o, gate, up, down) |
| Compute precision | bf16 |
| Optimizer | AdamW 8-bit |
| Learning rate (SFT) | 1e-4 to 2e-4 |
| Learning rate (DAPO) | 1e-6 |
| Gradient checkpointing | ON |
| Sequence length | 4096–8192 |
| Effective batch size | 16–64 (via gradient accumulation) |
Risk Mitigation Checklist
- All generator outputs verified against ground truth before training
- train.csv held out as validation — NEVER used for training
- \boxed{} format consistently enforced in all training traces
- Per-category validation accuracy tracked (catch weak categories early)
- Checkpoint saved after every stage
- Best-checkpoint tracker maintained across all rounds
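For the per-category tracking item in the checklist, something as small as the following may be enough. The input format and the `threshold` for flagging a weak category are assumptions.

```python
from collections import defaultdict

def per_category_accuracy(records):
    """records: iterable of (category, is_correct) pairs from one validation run."""
    hits, totals = defaultdict(int), defaultdict(int)
    for category, is_correct in records:
        totals[category] += 1
        hits[category] += int(bool(is_correct))
    return {category: hits[category] / totals[category] for category in totals}

def weak_categories(accuracy, threshold=0.5):
    """Categories whose accuracy falls below threshold, sorted by name."""
    return sorted(c for c, acc in accuracy.items() if acc < threshold)
```

Logging this dictionary next to each saved checkpoint makes it easy to see whether an overall gain hides a regression in one category.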