DeepSeek-V2-Lite-OpenMath-SFT

A supervised fine-tune of deepseek-ai/DeepSeek-V2-Lite (15.7B total / 2.4B active parameters; 64 routed experts with top-6 routing plus 2 shared experts; MLA attention) on the train_1M split of nvidia/OpenMathInstruct-2.
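
For orientation, here is a minimal, hypothetical PyTorch sketch of the routing scheme named above: each token is dispatched to its top-6 of 64 routed experts, while the 2 shared experts see every token. It illustrates the mechanism only and is not DeepSeek's implementation; in particular, renormalizing the gate weights over the selected experts is one common choice, assumed here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy sizes for illustration; not the real DeepSeek-V2-Lite config.
NUM_ROUTED, TOP_K, NUM_SHARED, HIDDEN = 64, 6, 2, 32

def moe_forward(x, router, routed_experts, shared_experts):
    # Router scores every token against all routed experts.
    probs = F.softmax(router(x), dim=-1)            # (tokens, NUM_ROUTED)
    topk_w, topk_idx = probs.topk(TOP_K, dim=-1)    # each token keeps its top-6
    topk_w = topk_w / topk_w.sum(-1, keepdim=True)  # renormalization (assumed)

    out = torch.zeros_like(x)
    for e, expert in enumerate(routed_experts):
        for k in range(TOP_K):
            mask = topk_idx[:, k] == e              # tokens routed to expert e in slot k
            if mask.any():
                out[mask] += topk_w[mask, k].unsqueeze(-1) * expert(x[mask])

    for expert in shared_experts:                   # shared experts see every token
        out = out + expert(x)
    return out

# Toy usage: linear layers stand in for the experts' MLPs.
router = nn.Linear(HIDDEN, NUM_ROUTED)
routed = nn.ModuleList(nn.Linear(HIDDEN, HIDDEN) for _ in range(NUM_ROUTED))
shared = nn.ModuleList(nn.Linear(HIDDEN, HIDDEN) for _ in range(NUM_SHARED))
print(moe_forward(torch.randn(5, HIDDEN), router, routed, shared).shape)  # torch.Size([5, 32])
```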

Training setup

  • Stack: torchtitan 0.2.2 + DeepEP + NVSHMEM IBGDA + PyTorch 2.11
  • Hardware: 2 nodes × 8 NVIDIA H200 GPUs (16 GPUs total)
  • Parallelism: FSDP across all 16 ranks plus cross-node expert parallelism (EP=16), with expert dispatch via DeepEP/IBGDA
  • Precision: bfloat16, AdamW, loss-only torch.compile
  • Hyperparameters:
    • Learning rate: 2e-5 peak, cosine decay, 5% warmup (93 steps); see the schedule sketch after this list
    • Effective batch size: 128 sequences (16 GPUs × 8 per-GPU × 1 grad-accum)
    • Sequence length: 2048, packed
    • Steps: 1864 (one epoch over train_1M post-packing)
    • Wall-clock: ~44 minutes
  • Steady-state MFU: ~20% with cross-node EP=16 and DeepEP/IBGDA dispatch
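
As a concrete reading of the learning-rate schedule above, here is a minimal sketch of linear warmup plus cosine decay under the stated settings (peak 2e-5, 93 warmup steps = 5% of 1864 total steps). The decay floor of 0 is an assumption; the card does not state a minimum LR.

```python
import math

PEAK_LR, TOTAL_STEPS = 2e-5, 1864
WARMUP_STEPS = int(0.05 * TOTAL_STEPS)  # 93 steps, matching the card

def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to 0 (assumed floor)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1 + math.cos(math.pi * progress))
```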

Prompt template

The model was fine-tuned with the following template, with loss masking applied to the prompt portion so that only the response contributes to the SFT loss (a masking sketch follows the template):

Solve the following math problem step by step:

{problem}

Solution:
{response}<|end_of_sentence|>
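
A minimal sketch of the prompt masking described above, assuming a Hugging Face-style tokenizer and the usual convention that label -100 is ignored by PyTorch's cross-entropy loss; tokenizer details (e.g., whether a BOS token should be prepended) are left out and are assumptions:

```python
IGNORE_INDEX = -100  # ignored by PyTorch's CrossEntropyLoss

def build_example(tokenizer, problem: str, response: str):
    prompt = f"Solve the following math problem step by step:\n\n{problem}\n\nSolution:\n"
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response, add_special_tokens=False)["input_ids"]
    response_ids = response_ids + [tokenizer.eos_token_id]

    input_ids = prompt_ids + response_ids
    # Mask the prompt so only response tokens contribute to the SFT loss.
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return {"input_ids": input_ids, "labels": labels}
```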

For best results at inference time, use the same template structure.
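
For example, a minimal generation call with transformers might look like the following. DeepSeek-V2 models typically require trust_remote_code=True, and the generation parameters here are illustrative, not tuned:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aandraca/openmath-deepseek-v2-lite"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

problem = "What is the sum of the first 100 positive integers?"
prompt = f"Solve the following math problem step by step:\n\n{problem}\n\nSolution:\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```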

Loss curve

Step           Loss
1              0.650
150            0.387
750            0.320
1500           0.313
1864 (final)   ~0.32

Limitations

  • Trained for math-instruction-following specifically; not a general chat model.
  • Inherits any limitations of the DeepSeek-V2-Lite base.
  • No RLHF or preference tuning beyond the SFT phase.