AI & ML interests

Hello y'all, this is just my private organization for storing all the files related to this hackathon.

Organization Card

The following is the plan for this hackathon.

Kaggle NVIDIA Nemotron Reasoning Challenge — Plan

Goal

Produce a LoRA adapter (rank ≤ 32) on Nemotron-3-Nano-30B that maximizes accuracy on Alice's Wonderland reasoning puzzles.

Setup

  • Base model: Nemotron-3-Nano-30B
  • Hardware: 1× H100 80GB
  • Method: QLoRA (4-bit base + LoRA rank 32 on all linear layers). If the H100's VRAM allows, we could try plain LoRA instead of QLoRA.
  • Framework: TRL v1.0 (supports SFT, DAPO-style GRPO out of the box)
  • Output format: Model must place final answer inside \boxed{...}
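To make the QLoRA versus plain LoRA question concrete, here is a rough back-of-the-envelope sketch of weight memory only. It assumes a nominal 30B parameters (taken from the model name) and ignores activations, LoRA optimizer state, KV cache, and quantization overhead, so the real headroom will be smaller.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the frozen base weights only, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 30e9  # assumption: nominal parameter count implied by the model name
H100_GB = 80     # from the Setup list

for label, bits in [("4-bit (QLoRA base)", 4), ("bf16 (plain LoRA base)", 16)]:
    gb = weight_memory_gb(N_PARAMS, bits)
    print(f"{label}: ~{gb:.0f} GB weights, ~{H100_GB - gb:.0f} GB left on one H100")
```

Under these assumptions the 4-bit base takes about 15 GB and the bf16 base about 60 GB, so plain LoRA would leave roughly 20 GB for everything else at sequence lengths of 4096–8192.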

Task Categories (6 types in train.csv)

  1. Bit manipulation (8-bit binary transforms)
  2. Text encryption (ciphers)
  3. Numeral system conversion (Roman numerals, etc.)
  4. Unit conversion (secret conversion factor)
  5. Modified gravitational constant (physics)
  6. Equation transformation rules

Phase 1 — Data Foundation (cornerstone of the whole pipeline)

Step 1a: Analyze train.csv

  • Count rows per category
  • Record difficulty patterns and answer format quirks per category
  • Hold out train.csv entirely as validation set (closest proxy to hidden test distribution)

Step 1b: Build 6 Python puzzle generators

  • One generator per category, with controllable difficulty knobs
  • Each generator must output (prompt, ground_truth_answer) pairs
  • Generate ~100K synthetic prompts total
    • Match train.csv's category distribution
    • Apply floor of ~5K per category (ensure coverage even for rare categories)
    • Mix easy/medium/hard within each category; this should also follow the difficulty distribution of the given train.csv
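A minimal sketch of what one generator and the per-category allocation could look like. The prompt wording, the difficulty thresholds in MAX_VALUE, and the function names are all hypothetical; the real prompts should mimic the phrasing found in train.csv.

```python
import random

ROMAN = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"), (90, "XC"),
         (50, "L"), (40, "XL"), (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]

# Hypothetical difficulty knob: larger ranges produce longer numerals.
MAX_VALUE = {"easy": 20, "medium": 400, "hard": 3999}

def to_roman(n: int) -> str:
    """Convert 1..3999 to a Roman numeral (this is the ground-truth oracle)."""
    out = []
    for value, symbol in ROMAN:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)

def roman_puzzle(rng: random.Random, difficulty: str) -> tuple[str, str]:
    """Return one (prompt, ground_truth_answer) pair."""
    n = rng.randint(1, MAX_VALUE[difficulty])
    # Assumption: placeholder wording, not the competition's actual phrasing.
    prompt = f"Convert {n} to a Roman numeral. Put the final answer in \\boxed{{}}."
    return prompt, to_roman(n)

def allocate(train_counts: dict[str, int], total: int = 100_000,
             floor: int = 5_000) -> dict[str, int]:
    """Match train.csv's category shares, then raise rare categories to the floor."""
    n_train = sum(train_counts.values())
    return {cat: max(floor, round(total * c / n_train))
            for cat, c in train_counts.items()}
```

Because the floor is applied after the proportional split, the grand total can land somewhat above 100K when some categories are rare, which is consistent with the "~100K" target being approximate.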

Step 1c: Distill reasoning traces via API (only paid step in the whole pipeline)

  • Use SOTA model (Claude / GPT-4 / etc.) to generate reasoning traces for each prompt
  • Format: reasoning steps + \boxed{answer}
  • Budget estimate: ~$200–500

Step 1d: Filter by correctness

  • Extract \boxed{} from each generated trace
  • Keep only rows where extracted answer matches ground truth (exact string or numerical tolerance)
  • Expected yield: 70–90% of the ~100K prompts → ~80K clean SFT rows
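The filter above could be sketched as follows. The brace-matching extractor takes the last \boxed{...} in a trace; the 1% relative tolerance and the row keys "trace" and "answer" are assumptions, and the tolerance should be aligned with whatever the competition's scorer actually uses.

```python
import math

def extract_boxed(text: str):
    """Return the contents of the last \\boxed{...} in text, or None."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i = begin = start + len("\\boxed{")
    depth = 1
    while i < len(text):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[begin:i].strip()
        i += 1
    return None  # braces never closed

def matches(pred, truth: str, rel_tol: float = 1e-2) -> bool:
    """Exact string match first, then numeric match within rel_tol (assumed 1%)."""
    if pred is None:
        return False
    if pred.strip() == truth.strip():
        return True
    try:
        return math.isclose(float(pred), float(truth), rel_tol=rel_tol)
    except ValueError:
        return False  # at least one side is not a number

def filter_rows(rows):
    """Keep rows whose trace ends in a boxed answer equal to the ground truth."""
    return [r for r in rows if matches(extract_boxed(r["trace"]), r["answer"])]
```

The same extract-and-compare logic can serve as the 0/1 reward during DAPO, which keeps the SFT filter and the RL reward consistent with each other.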

Step 1e: Train/val split

  • Validation: use train.csv (9,500 rows) — NOT in training data
  • Training: ~80K filtered synthetic (prompt, trace, answer) rows

Phase 2 — N × (SFT + DAPO) with checkpoint tracking

SFT: Standard supervised fine-tuning with cross-entropy loss.

Reward function (used by DAPO every round): Extract \boxed{} from rollout → 1 if matches ground truth (exact or numerical tolerance), 0 otherwise.

DAPO config (important — differs from vanilla GRPO):

  • Decoupled clipping (Clip-Higher): epsilon_low < epsilon_high
  • No KL penalty (β=0)
  • Token-level policy gradient loss
  • Dynamic sampling (filter degenerate batches)
  • Group size: 8–16 (or 2–4 if compute-tight)
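To illustrate how these settings differ from vanilla GRPO, here is a framework-free sketch of the per-group computation in plain Python. It is not TRL's implementation. The clip values 0.2 and 0.28 are the defaults reported in the DAPO paper rather than values chosen in this plan, and using the population standard deviation is an assumption.

```python
import statistics

def group_advantages(rewards):
    """Group-normalised advantages, or None if all rollouts scored the same."""
    if len(set(rewards)) <= 1:
        return None  # dynamic sampling: degenerate group carries no signal
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / std for r in rewards]

def clipped_term(ratio, adv, eps_low=0.2, eps_high=0.28):
    """Clip-Higher surrogate for one token: asymmetric bounds on the ratio."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * adv, clipped * adv)

def dapo_loss(token_ratios, advantages, eps_low=0.2, eps_high=0.28):
    """token_ratios[i] lists the per-token probability ratios of rollout i.

    The sum is divided by the total token count of the group (token-level
    loss), so long rollouts are not down-weighted as in a per-sequence mean.
    There is no KL term, matching beta = 0.
    """
    total, n_tokens = 0.0, 0
    for ratios, adv in zip(token_ratios, advantages):
        for ratio in ratios:
            total += clipped_term(ratio, adv, eps_low, eps_high)
            n_tokens += 1
    return -total / n_tokens
```

With a 0/1 reward, a group where every rollout is right or every rollout is wrong returns None from group_advantages and would simply be skipped and resampled.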

Round 1

Round 1 SFT

  • Train on Phase 1 dataset (~80K rows)
  • Evaluate on train.csv validation → save score + LoRA checkpoint

Round 1 DAPO

  • Prompt pool: ~30K–50K prompts
    • ~40% overlap with SFT prompts (tests generalization); open question: what should the difficulty distribution of these prompts be?
    • ~60% fresh synthetic prompts; open question: what should the difficulty distribution of these prompts be?
  • Only prompts + ground truth needed — rollouts generated live
  • Evaluate on train.csv → save score + LoRA checkpoint

Round 2+

Self-distillation (replaces API distillation from Phase 1):

  • Use previous round's DAPO model to generate traces locally
  • No more API cost — everything runs on H100

Round N SFT

  • Prompt pool for self-distillation: ~150K prompts
    • ~70K: reuse previous round's SFT prompts (regenerate with stronger model); open question: what should the difficulty distribution of these prompts be?
    • ~80K: fresh synthetic prompts for diversity; open question: what should the difficulty distribution of these prompts be?
  • Generate 4–8 candidate traces per prompt (temperature ~0.7)
  • Filter: keep traces whose \boxed{} matches ground truth
  • Expected yield: 85–95% (model is much stronger now); open question: how many problems should we target to reach this yield?
  • Keep 1–2 best traces per prompt (shortest correct, or diverse)
  • Mix in ~25K rows from previous round's SFT data (prevents drift / mode collapse); open question: what should the difficulty distribution of these rows be?
  • Final training set: ~175K rows
  • Evaluate on train.csv → save score + LoRA checkpoint
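The "keep 1–2 best traces per prompt" step could look like this under the shortest-correct rule. The is_correct callable is a hypothetical stand-in for whatever \boxed{} checker the pipeline uses, and measuring length in characters rather than tokens is an assumption.

```python
def select_traces(candidates, truth, is_correct, keep=2):
    """Pick up to `keep` shortest correct traces for one prompt.

    candidates: trace strings sampled for the prompt (4-8 in the plan)
    is_correct: callable (trace, truth) -> bool, supplied by the caller
    """
    correct = [t for t in candidates if is_correct(t, truth)]
    unique = list(dict.fromkeys(correct))  # drop exact duplicates, keep order
    return sorted(unique, key=len)[:keep]
```

A prompt whose candidates are all wrong yields an empty list and drops out of the SFT set, which is where the 85–95% yield estimate applies.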

Round N DAPO

  • Prompt pool: ~30K–50K, skewed harder; open question: what should the difficulty distribution of these prompts be?
  • Goldilocks filter: keep prompts where current SFT model succeeds 20–80% of the time
    • Drop prompts the model aces (no signal)
    • Drop prompts the model always fails (no signal)
    • DAPO's Dynamic Sampling helps here but manual curation is better
  • As rounds progress, difficulty floor rises
  • Evaluate → save score + LoRA checkpoint
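The Goldilocks filter reduces to a success-rate band check. In this sketch the success counts are assumed to come from a separate pass that samples n_rollouts completions per prompt with the current SFT model and scores them with the 0/1 reward.

```python
def goldilocks_filter(success_counts, n_rollouts, low=0.2, high=0.8):
    """Keep prompts whose empirical success rate lies in [low, high].

    success_counts: dict mapping prompt -> number of correct rollouts
    n_rollouts: rollouts sampled per prompt (same for every prompt)
    """
    return [p for p, k in success_counts.items()
            if low <= k / n_rollouts <= high]
```

With 8 rollouts per prompt this keeps prompts solved 2 to 6 times; estimates from so few samples are noisy, so prompts near the band edges may be misclassified.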

Stopping Criterion

Stop iterating when validation score plateaus for 2 consecutive rounds.

Realistic target: 2–3 full rounds within the 2-month window.
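One possible reading of the plateau rule, sketched below: stop once the last two rounds both fail to beat the best score seen before them. The min_delta margin is an added assumption for absorbing evaluation noise and defaults to zero.

```python
def should_stop(round_scores, patience=2, min_delta=0.0):
    """round_scores: best validation score of each completed round, in order."""
    if len(round_scores) <= patience:
        return False  # not enough history to call a plateau
    best_before = max(round_scores[:-patience])
    return all(s <= best_before + min_delta for s in round_scores[-patience:])
```

For example, scores of 0.60, 0.70, 0.70, 0.69 trigger a stop after the fourth round, since neither of the last two rounds improved on 0.70.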


Submission Strategy

  • Save LoRA checkpoint + validation score after every SFT and DAPO stage
  • Final submission: best-scoring checkpoint across all rounds (not necessarily the latest — models can regress)
  • Package as submission.zip containing adapter_config.json + adapter weights
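Selecting and packaging the best checkpoint could be sketched as below. The adapter_model.* filename pattern follows PEFT's default naming and the flat layout at the zip root is a guess; both should be checked against the competition's submission instructions.

```python
import zipfile
from pathlib import Path

def best_checkpoint(scores):
    """scores maps checkpoint directory -> validation score; return the best key."""
    return max(scores, key=scores.get)

def package_submission(checkpoint_dir, out_path="submission.zip"):
    """Zip adapter_config.json plus the adapter weights from one checkpoint."""
    ckpt = Path(checkpoint_dir)
    config = ckpt / "adapter_config.json"
    # Assumption: weights use PEFT's default name, e.g. adapter_model.safetensors.
    weights = sorted(ckpt.glob("adapter_model.*"))
    if not config.exists() or not weights:
        raise FileNotFoundError(f"incomplete adapter checkpoint: {ckpt}")
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for f in [config, *weights]:
            zf.write(f, arcname=f.name)  # assumption: files sit at the zip root
    return Path(out_path)
```

Keying the score dictionary by checkpoint directory means the tracker updated after every SFT and DAPO stage can be passed straight to best_checkpoint.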

What Gets Generated Where (Cost Summary)

Stage | Prompts | Final Rows | Source | Cost
Round 1 SFT | 100K | ~80K | API distillation + filter | $200–500 (API)
Round 1 DAPO | 30K–50K | N/A (live rollouts) | Python generators | GPU only
Round 2+ SFT | 150K | ~175K | Self-distillation + filter + previous-round mix | GPU only
Round 2+ DAPO | 30K–50K | N/A (live rollouts) | Python generators (Goldilocks-filtered) | GPU only

Key insight: The API is used exactly once (the Phase 1 distillation that feeds Round 1 SFT). Everything after that uses Python scripts for prompts and the previous round's model for traces.

Hyperparameters Quick Reference

Setting | Value
Base model | Nemotron-3-Nano-30B (4-bit via bitsandbytes)
LoRA rank | 32 (competition max)
LoRA alpha | 64
LoRA target modules | All linear layers (q, k, v, o, gate, up, down)
Compute precision | bf16
Optimizer | AdamW 8-bit
Learning rate (SFT) | 1e-4 to 2e-4
Learning rate (DAPO) | 1e-6
Gradient checkpointing | ON
Sequence length | 4096–8192
Effective batch size | 16–64 (via gradient accumulation)

Risk Mitigation Checklist

  • All generator outputs verified against ground truth before training
  • train.csv held out as validation — NEVER used for training
  • \boxed{} format consistently enforced in all training traces
  • Per-category validation accuracy tracked (catch weak categories early)
  • Checkpoint saved after every stage
  • Best-checkpoint tracker maintained across all rounds
