NVIDIA Nemotron Model Reasoning Challenge

https://www.kaggle.com/competitions/nvidia-nemotron-model-reasoning-challenge

Hello y'all, this is my private organization for storing all the files related to this hackathon.

The plan for this hackathon follows.

Kaggle NVIDIA Nemotron Reasoning Challenge — Plan

Goal

Produce a LoRA adapter (rank ≤ 32) on Nemotron-3-Nano-30B that maximizes accuracy on Alice's Wonderland reasoning puzzles.

Setup

Base model: Nemotron-3-Nano-30B
Hardware: 1× H100 80GB
Method: QLoRA (4-bit base + LoRA rank 32 on all linear layers). If the H100's VRAM allows, we can try plain LoRA on an unquantized base instead of QLoRA.
Framework: TRL v1.0 (supports SFT and DAPO-style GRPO out of the box)
Output format: Model must place final answer inside \boxed{...}
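
On the QLoRA-versus-plain-LoRA question in the Method line, a rough weights-only estimate may help. The sketch below assumes a nominal 30B parameter count (taken from the model name, not measured) and ignores activations, LoRA gradients, optimizer state, and quantization overhead, so treat it as a lower bound.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 30e9  # assumption: nominal "30B" from the model name

qlora_gb = weight_memory_gb(N_PARAMS, 4)   # 4-bit quantized base
bf16_gb = weight_memory_gb(N_PARAMS, 16)   # unquantized bf16 base

print(f"4-bit base: ~{qlora_gb:.0f} GB, bf16 base: ~{bf16_gb:.0f} GB of 80 GB")
```

Under these assumptions a bf16 base would leave roughly 20 GB for everything else, which is likely tight at 4096–8192 token sequences, so QLoRA looks like the safer default.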

Task Categories (6 types in train.csv)

Bit manipulation (8-bit binary transforms)
Text encryption (ciphers)
Numeral system conversion (Roman numerals, etc.)
Unit conversion (secret conversion factor)
Modified gravitational constant (physics)
Equation transformation rules

Phase 1 — Data Foundation (cornerstone of the whole pipeline)

Step 1a: Analyze train.csv

Count rows per category
Record difficulty patterns and answer format quirks per category
Hold out train.csv entirely as validation set (closest proxy to hidden test distribution)

Step 1b: Build 6 Python puzzle generators

One generator per category, with controllable difficulty knobs
Each generator must output (prompt, ground_truth_answer) pairs
Generate ~100K synthetic prompts total
- Match train.csv's category distribution
- Apply floor of ~5K per category (ensure coverage even for rare categories)
- Mix easy/medium/hard within each category; this should also follow the difficulty distribution of the given train.csv
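
A minimal sketch of what one generator and the quota rule above could look like. The prompt wording, the `MAX_VALUE` difficulty knob, and the function names are all hypothetical; the real generators should mimic the phrasing and quirks found in train.csv.

```python
import random

ROMAN = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"), (90, "XC"),
         (50, "L"), (40, "XL"), (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]

# Hypothetical difficulty knob: harder puzzles draw from a larger range.
MAX_VALUE = {"easy": 50, "medium": 500, "hard": 3999}

def to_roman(n: int) -> str:
    """Standard subtractive Roman numeral for 1 <= n <= 3999."""
    out = []
    for value, symbol in ROMAN:
        count, n = divmod(n, value)
        out.append(symbol * count)
    return "".join(out)

def generate_roman_puzzle(rng: random.Random, difficulty: str):
    """Return one (prompt, ground_truth_answer) pair."""
    n = rng.randint(1, MAX_VALUE[difficulty])
    prompt = f"Convert {n} to a Roman numeral. Put the final answer in \\boxed{{}}."
    return prompt, to_roman(n)

def allocate_quotas(train_counts: dict, total: int, floor: int) -> dict:
    """Proportional share of `total` per category, raised to `floor` where needed."""
    n_rows = sum(train_counts.values())
    return {cat: max(floor, round(total * c / n_rows)) for cat, c in train_counts.items()}
```

Because the floor only raises quotas, the sum can end up somewhat above `total`; for a ~100K target that overshoot is probably acceptable.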

Step 1c: Distill reasoning traces via API (only paid step in the whole pipeline)

Use SOTA model (Claude / GPT-4 / etc.) to generate reasoning traces for each prompt
Format: reasoning steps + \boxed{answer}
Budget estimate: ~$200–500

Step 1d: Filter by correctness

Extract \boxed{} from each generated trace
Keep only rows where extracted answer matches ground truth (exact string or numerical tolerance)
Expected yield: 70–90% of the ~100K prompts → **~80K clean SFT rows**
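
The filter in this step could be sketched as follows. The brace-matching extractor, the `rel_tol` value, and the `trace` / `answer` row keys are assumptions made for illustration; the competition's official scorer may compare answers differently.

```python
import math
from typing import Optional

def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, handling nested braces."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth, chars = 1, []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # unbalanced braces

def answers_match(pred: str, truth: str, rel_tol: float = 1e-4) -> bool:
    """Exact string match first, then numeric comparison within rel_tol (assumed value)."""
    if pred.strip() == truth.strip():
        return True
    try:
        return math.isclose(float(pred), float(truth), rel_tol=rel_tol)
    except ValueError:
        return False  # at least one side is not a number

def filter_traces(rows):
    """Keep rows whose trace ends in a boxed answer matching the ground truth."""
    kept = []
    for row in rows:
        pred = extract_boxed(row["trace"])
        if pred is not None and answers_match(pred, row["answer"]):
            kept.append(row)
    return kept
```

Taking the last `\boxed{}` rather than the first is a deliberate choice here, since reasoning traces sometimes box intermediate results before the final answer.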

Step 1e: Train/val split

Validation: use train.csv (9,500 rows) — NOT in training data
Training: ~80K filtered synthetic (prompt, trace, answer) rows

Phase 2 — N × (SFT + DAPO) with checkpoint tracking

SFT: standard supervised fine-tuning with cross-entropy loss, nothing unusual.

Reward function (used by DAPO every round): Extract \boxed{} from rollout → 1 if matches ground truth (exact or numerical tolerance), 0 otherwise.

DAPO config (important — differs from vanilla GRPO):

Decoupled clipping (Clip-Higher): epsilon_low < epsilon_high
No KL penalty (β=0)
Token-level policy gradient loss
Dynamic sampling (filter degenerate batches)
Group size: 8–16 (or 2–4 if compute-tight)
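
To make the config items above concrete, here is a framework-free sketch of the loss they describe. It is an illustration, not TRL's implementation, and the default `eps_low=0.2` / `eps_high=0.28` are the values reported in the DAPO paper, not values fixed by this plan.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages, or None for a degenerate group.

    Dynamic sampling: a group whose rollouts all got the same reward
    (all correct or all wrong) yields zero advantage, so it is dropped.
    """
    if len(set(rewards)) == 1:
        return None
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def dapo_loss(ratios, advantages, eps_low=0.2, eps_high=0.28):
    """ratios[i][t] = pi_new / pi_old for token t of rollout i; advantages[i] is per rollout."""
    total, n_tokens = 0.0, 0
    for rollout_ratios, adv in zip(ratios, advantages):
        for r in rollout_ratios:
            # Decoupled clipping: the upper bound is looser than the lower bound.
            clipped = min(max(r, 1.0 - eps_low), 1.0 + eps_high)
            total += min(r * adv, clipped * adv)
            n_tokens += 1
    # Token-level mean over every token in the group; no KL term (beta = 0).
    return -total / n_tokens
```

Dividing by the total token count, rather than averaging per rollout first, is what makes the loss token-level: long rollouts contribute in proportion to their length.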

Round 1

Round 1 SFT

Train on Phase 1 dataset (~80K rows)
Evaluate on train.csv validation → save score + LoRA checkpoint

Round 1 DAPO

Prompt pool: ~30K–50K prompts
- ~40% overlap with SFT prompts (tests generalization) - still wondering what the difficulty distribution of these problems should be
- ~60% fresh synthetic prompts - still wondering what the difficulty distribution of these problems should be
Only prompts + ground truth needed — rollouts generated live
Evaluate on train.csv → save score + LoRA checkpoint

Round 2+

Self-distillation (replaces API distillation from Phase 1):

Use previous round's DAPO model to generate traces locally
No more API cost — everything runs on H100

Round N SFT

Prompt pool for self-distillation: ~150K prompts
- ~70K: reuse previous round's SFT prompts (regenerate with stronger model) - still wondering what the difficulty distribution of these problems should be
- ~80K: fresh synthetic prompts for diversity - still wondering what the difficulty distribution of these problems should be
Generate 4–8 candidate traces per prompt (temperature ~0.7)
Filter: keep traces whose \boxed{} matches ground truth
Expected yield: 85–95% (model is much stronger now) - still wondering how many problems we should target to get here
Keep 1–2 best traces per prompt (shortest correct, or diverse)
Mix in ~25K rows from previous round's SFT data (prevents drift / mode collapse) - still wondering what the difficulty distribution of these problems should be
Final training set: ~175K rows
Evaluate on train.csv → save score + LoRA checkpoint
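
The "keep 1–2 best traces per prompt" rule above could be sketched like this, using the shortest-correct option. The `(trace, is_correct)` input shape and the `max_keep` parameter are hypothetical, and correctness is assumed to have been checked beforehand against the ground truth.

```python
def select_traces(candidates, max_keep=2):
    """candidates: list of (trace, is_correct) pairs for ONE prompt.

    Returns up to max_keep distinct correct traces, shortest first.
    """
    # dict.fromkeys removes duplicate traces while preserving order
    correct = dict.fromkeys(trace for trace, ok in candidates if ok)
    return sorted(correct, key=len)[:max_keep]
```

Preferring short traces biases the next round toward concise reasoning; selecting for diversity instead would need a similarity measure, which this sketch leaves out.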

Round N DAPO

Prompt pool: ~30K–50K, skewed harder - still wondering what the difficulty distribution of these problems should be
Goldilocks filter: keep prompts where current SFT model succeeds 20–80% of the time
- Drop prompts the model aces (no signal)
- Drop prompts the model always fails (no signal)
- DAPO's Dynamic Sampling helps here but manual curation is better
As rounds progress, difficulty floor rises
Evaluate → save score + LoRA checkpoint
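
A minimal sketch of the Goldilocks filter described above, assuming each candidate prompt has already been rolled out `n_rollouts` times with the current SFT model. The input mapping and function name are hypothetical.

```python
def goldilocks_filter(success_counts, n_rollouts, low=0.2, high=0.8):
    """success_counts: prompt_id -> number of correct rollouts out of n_rollouts.

    Keeps prompts whose empirical success rate lies in [low, high].
    """
    return [pid for pid, k in success_counts.items() if low <= k / n_rollouts <= high]
```

With 8 rollouts per prompt, the 20–80% band corresponds to 2 to 6 correct rollouts. Fewer rollouts make the estimate noisier, so a prompt near the edge of the band may flip in or out between runs.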

Stopping Criterion

Stop iterating when validation score plateaus for 2 consecutive rounds.

Realistic target: 2–3 full rounds within the 2-month window.
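
One possible reading of the stopping criterion, sketched below: stop once none of the last two rounds improves on the best score seen before them. The `min_delta` threshold is an added assumption to absorb evaluation noise, not part of the plan.

```python
def should_stop(round_scores, patience=2, min_delta=0.0):
    """round_scores: validation score per completed round, oldest first."""
    if len(round_scores) <= patience:
        return False  # not enough history to call a plateau
    best_before = max(round_scores[:-patience])
    return all(s <= best_before + min_delta for s in round_scores[-patience:])
```

Under this reading, at least three rounds must finish before the rule can fire, so within the realistic 2–3 round budget it serves mostly as a sanity check.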

Submission Strategy

Save LoRA checkpoint + validation score after every SFT and DAPO stage
Final submission: best-scoring checkpoint across all rounds (not necessarily the latest — models can regress)
Package as submission.zip containing adapter_config.json + adapter weights
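
The selection and packaging steps above might look like the sketch below. The weights filename `adapter_model.safetensors` is the name PEFT normally writes and is assumed here; check the competition rules for the exact files required. The `history` tuple layout is hypothetical.

```python
import zipfile
from pathlib import Path

def best_checkpoint(history):
    """history: list of (stage_name, val_score, adapter_dir). Returns the top-scoring entry."""
    return max(history, key=lambda entry: entry[1])

def package_submission(adapter_dir, out_path="submission.zip"):
    """Zip the adapter config and weights at the archive root."""
    adapter_dir = Path(adapter_dir)
    files = [
        adapter_dir / "adapter_config.json",
        adapter_dir / "adapter_model.safetensors",  # assumed PEFT default name
    ]
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for f in files:
            zf.write(f, arcname=f.name)  # raises FileNotFoundError if a file is missing
    return Path(out_path)
```

Writing with `arcname=f.name` keeps both files at the top level of the zip instead of nesting them under the checkpoint directory.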

What Gets Generated Where (Cost Summary)

Stage	Prompts	Final Rows	Source	Cost
Round 1 SFT	100K	~80K	API distillation + filter	$200–500 (API)
Round 1 DAPO	30K–50K	N/A (live rollouts)	Python generators	GPU only
Round 2+ SFT	150K	~175K	Self-distillation + filter + previous-round mix	GPU only
Round 2+ DAPO	30K–50K	N/A (live rollouts)	Python generators (Goldilocks-filtered)	GPU only

Key insight: the API is used exactly once (Phase 1 distillation, which feeds Round 1 SFT). Everything after that uses Python scripts for prompts and the previous round's model for traces.

Hyperparameters Quick Reference

Setting	Value
Base model	Nemotron-3-Nano-30B (4-bit via bitsandbytes)
LoRA rank	32 (competition max)
LoRA alpha	64
LoRA target modules	All linear layers (q, k, v, o, gate, up, down)
Compute precision	bf16
Optimizer	AdamW 8-bit
Learning rate (SFT)	1e-4 to 2e-4
Learning rate (DAPO)	1e-6
Gradient checkpointing	ON
Sequence length	4096–8192
Effective batch size	16–64 (via gradient accumulation)
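
As a quick check on the batch-size row, the relation between effective batch size and gradient accumulation can be written out. The per-device batch of 2 is an illustrative assumption; the 80K row count is the Round 1 SFT set from this plan.

```python
import math

def grad_accum_steps(effective_batch: int, per_device_batch: int, n_gpus: int = 1) -> int:
    """Accumulation steps needed so that per_device_batch * n_gpus * steps == effective_batch."""
    micro = per_device_batch * n_gpus
    if effective_batch % micro:
        raise ValueError("effective batch must be a multiple of per_device_batch * n_gpus")
    return effective_batch // micro

def optimizer_steps_per_epoch(n_rows: int, effective_batch: int) -> int:
    return math.ceil(n_rows / effective_batch)

print(grad_accum_steps(32, per_device_batch=2))   # 16
print(optimizer_steps_per_epoch(80_000, 32))      # 2500
```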

Risk Mitigation Checklist

All generator outputs verified against ground truth before training
train.csv held out as validation — NEVER used for training
\boxed{} format consistently enforced in all training traces
Per-category validation accuracy tracked (catch weak categories early)
Checkpoint saved after every stage
Best-checkpoint tracker maintained across all rounds
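
The per-category tracking item in this checklist could be implemented along these lines. The sketch assumes each validation row has already been labeled with its category and scored; the `(category, is_correct)` input shape is hypothetical.

```python
from collections import defaultdict

def per_category_accuracy(results):
    """results: iterable of (category, is_correct) pairs from one validation run."""
    totals, hits = defaultdict(int), defaultdict(int)
    for category, ok in results:
        totals[category] += 1
        hits[category] += int(ok)
    return {category: hits[category] / totals[category] for category in totals}
```

Logging this dictionary next to the overall score at every checkpoint makes it easier to see when a round improves the average while one category regresses.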
