DeepSeek-V2-Lite-OpenMath-SFT

A supervised fine-tune of deepseek-ai/DeepSeek-V2-Lite (15.7B total / 2.4B active parameters; 64 routed experts with top-6 routing plus 2 shared experts; MLA attention) on the train_1M split of nvidia/OpenMathInstruct-2.
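
For orientation, here is a minimal, hypothetical PyTorch sketch of the routing scheme named above: each token is dispatched to its top-6 of 64 routed experts, while the 2 shared experts see every token. It illustrates the mechanism only and is not DeepSeek's implementation; in particular, renormalizing the gate weights over the selected experts is one common choice, assumed here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy sizes for illustration; not the real DeepSeek-V2-Lite config.
NUM_ROUTED, TOP_K, NUM_SHARED, HIDDEN = 64, 6, 2, 32

def moe_forward(x, router, routed_experts, shared_experts):
    # Router scores every token against all routed experts.
    probs = F.softmax(router(x), dim=-1)            # (tokens, NUM_ROUTED)
    topk_w, topk_idx = probs.topk(TOP_K, dim=-1)    # each token keeps its top-6
    topk_w = topk_w / topk_w.sum(-1, keepdim=True)  # renormalization (assumed)

    out = torch.zeros_like(x)
    for e, expert in enumerate(routed_experts):
        for k in range(TOP_K):
            mask = topk_idx[:, k] == e              # tokens routed to expert e in slot k
            if mask.any():
                out[mask] += topk_w[mask, k].unsqueeze(-1) * expert(x[mask])

    for expert in shared_experts:                   # shared experts see every token
        out = out + expert(x)
    return out

# Toy usage: linear layers stand in for the experts' MLPs.
router = nn.Linear(HIDDEN, NUM_ROUTED)
routed = nn.ModuleList(nn.Linear(HIDDEN, HIDDEN) for _ in range(NUM_ROUTED))
shared = nn.ModuleList(nn.Linear(HIDDEN, HIDDEN) for _ in range(NUM_SHARED))
print(moe_forward(torch.randn(5, HIDDEN), router, routed, shared).shape)  # torch.Size([5, 32])
```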

Training setup

  • Stack: torchtitan 0.2.2 + DeepEP + NVSHMEM IBGDA + PyTorch 2.11
  • Hardware: 2 nodes × 8 NVIDIA H200 GPUs (16 GPUs total)
  • Parallelism: FSDP across all 16 ranks plus cross-node expert parallelism (EP=16), with expert dispatch via DeepEP/IBGDA
  • Precision: bfloat16, AdamW, loss-only torch.compile
  • Hyperparameters:
    • Learning rate: 2e-5 peak, cosine decay, 5% warmup (93 steps); see the schedule sketch after this list
    • Effective batch size: 128 sequences (16 GPUs × 8 per-GPU × 1 grad-accum)
    • Sequence length: 2048, packed
    • Steps: 1864 (one epoch over train_1M post-packing)
    • Wall-clock: ~44 minutes
  • Steady-state MFU: ~20% with cross-node EP=16 and DeepEP/IBGDA dispatch
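
As a concrete reading of the learning-rate schedule above, here is a minimal sketch of linear warmup plus cosine decay under the stated settings (peak 2e-5, 93 warmup steps = 5% of 1864 total steps). The decay floor of 0 is an assumption; the card does not state a minimum LR.

```python
import math

PEAK_LR, TOTAL_STEPS = 2e-5, 1864
WARMUP_STEPS = int(0.05 * TOTAL_STEPS)  # 93 steps, matching the card

def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to 0 (assumed floor)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1 + math.cos(math.pi * progress))
```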

Prompt template

The model was fine-tuned with the following template, with loss masking applied to the prompt portion so that only the response contributes to the SFT loss (a masking sketch follows the template):

Solve the following math problem step by step:

{problem}

Solution:
{response}<|end_of_sentence|>
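
A minimal sketch of the prompt masking described above, assuming a Hugging Face-style tokenizer and the usual convention that label -100 is ignored by PyTorch's cross-entropy loss; tokenizer details (e.g., whether a BOS token should be prepended) are left out and are assumptions:

```python
IGNORE_INDEX = -100  # ignored by PyTorch's CrossEntropyLoss

def build_example(tokenizer, problem: str, response: str):
    prompt = f"Solve the following math problem step by step:\n\n{problem}\n\nSolution:\n"
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response, add_special_tokens=False)["input_ids"]
    response_ids = response_ids + [tokenizer.eos_token_id]

    input_ids = prompt_ids + response_ids
    # Mask the prompt so only response tokens contribute to the SFT loss.
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return {"input_ids": input_ids, "labels": labels}
```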

For best results at inference time, use the same template structure.
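
For example, a minimal generation call with transformers might look like the following. DeepSeek-V2 models typically require trust_remote_code=True, and the generation parameters here are illustrative, not tuned:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aandraca/openmath-deepseek-v2-lite"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

problem = "What is the sum of the first 100 positive integers?"
prompt = f"Solve the following math problem step by step:\n\n{problem}\n\nSolution:\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```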

Loss curve

Step           Loss
1              0.650
150            0.387
750            0.320
1500           0.313
1864 (final)   ~0.32

Limitations

  • Trained for math-instruction-following specifically; not a general chat model.
  • Inherits any limitations of the DeepSeek-V2-Lite base.
  • No RLHF or preference tuning beyond the SFT phase.