# DeepSeek-V2-Lite-OpenMath-SFT

Supervised fine-tune of deepseek-ai/DeepSeek-V2-Lite (15.7B total / 2.4B active parameters; 64 routed experts with top-6 routing plus 2 shared experts; MLA attention) on the train_1M split of nvidia/OpenMathInstruct-2.
## Training setup
- Stack: torchtitan 0.2.2 + DeepEP + NVSHMEM IBGDA + PyTorch 2.11
- Hardware: 2 × 8 × NVIDIA H200 (16 GPUs)
- Parallelism: FSDP across 16 ranks + Expert Parallel (EP=16) cross-node, expert dispatch via DeepEP/IBGDA
- Precision: bfloat16, AdamW, loss-only torch.compile

Hyperparameters:

- Learning rate: 2e-5, cosine decay, 5% warmup (93 steps)
- Effective batch size: 128 (16 GPUs × 8 local × 1 grad-accum)
- Sequence length: 2048, packed
- Steps: 1864 (one epoch over train_1M post-packing; see the token-accounting sketch after this list)
- Wall-clock time: ~44 minutes
- Steady-state MFU: ~20% with cross-node EP=16 and DeepEP/IBGDA dispatch
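As a rough sanity check on the numbers above (back-of-the-envelope only, not taken from training logs), the batch, sequence-length, and step counts imply the following token budget:

```python
# Back-of-the-envelope accounting from the hyperparameters above (illustrative only).
gpus = 16
local_batch = 8          # sequences per GPU
grad_accum = 1
seq_len = 2048           # packed sequence length
steps = 1864
warmup_frac = 0.05

effective_batch = gpus * local_batch * grad_accum   # 128 packed sequences per step
tokens_per_step = effective_batch * seq_len         # 262,144 tokens per step
total_tokens = tokens_per_step * steps              # ~489M tokens for the one-epoch run
warmup_steps = round(warmup_frac * steps)           # ~93 warmup steps

print(f"{effective_batch=}, {tokens_per_step=:,}, {total_tokens=:,}, {warmup_steps=}")
```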
## Prompt template
The model was fine-tuned with the following template, with loss masking applied to the prompt portion (only the response contributes to the SFT loss):
```
Solve the following math problem step by step:
{problem}
Solution:
{response}<|end_of_sentence|>
```
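For reference, prompt-side loss masking is commonly implemented by setting the prompt tokens' label ids to -100 so that cross-entropy ignores them and only response tokens contribute to the loss. The sketch below is illustrative rather than the exact training code; `PROMPT_TEMPLATE` and `build_example` are hypothetical names, and `tokenizer` is assumed to be the model's Hugging Face tokenizer loaded elsewhere.

```python
# Illustrative sketch of prompt-side loss masking (not the exact training code).
PROMPT_TEMPLATE = "Solve the following math problem step by step:\n{problem}\nSolution:\n"

def build_example(tokenizer, problem: str, response: str):
    prompt_ids = tokenizer(PROMPT_TEMPLATE.format(problem=problem),
                           add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response, add_special_tokens=False)["input_ids"]
    response_ids = response_ids + [tokenizer.eos_token_id]  # <|end_of_sentence|>

    input_ids = prompt_ids + response_ids
    # -100 is ignored by PyTorch's cross-entropy, so only the response is trained on.
    labels = [-100] * len(prompt_ids) + response_ids
    return {"input_ids": input_ids, "labels": labels}
```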
For best results at inference time, use the same template structure.
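A minimal generation sketch with the Hugging Face transformers library, assuming the repo id shown on this card; the example problem and generation settings are illustrative:

```python
# Minimal inference sketch (settings are illustrative, not prescriptive).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "aandraca/openmath-deepseek-v2-lite"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,   # matches the training precision
    device_map="auto",
    trust_remote_code=True,       # DeepSeek-V2 checkpoints ship custom modeling code
)

problem = "What is the sum of the first 100 positive integers?"
# Reproduce the SFT template up to (but not including) the response.
prompt = f"Solve the following math problem step by step:\n{problem}\nSolution:\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```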
## Loss curve
| Step | Loss |
|---|---|
| 1 | 0.650 |
| 150 | 0.387 |
| 750 | 0.320 |
| 1500 | 0.313 |
| 1864 (final) | ~0.32 |
## Limitations
- Trained for math-instruction-following specifically; not a general chat model.
- Inherits any limitations of the DeepSeek-V2-Lite base.
- No RLHF or preference tuning beyond the SFT phase.