---
license: apache-2.0
datasets:
  - nvidia/Nemotron-Post-Training-Dataset-v2
  - HuggingFaceTB/smol-smoltalk
  - allenai/tulu-3-sft-mixture
language:
  - en
---

# Imu1-Midtrain

A small language model trained entirely on consumer GPUs, with performance competitive for its scale.

Trained on 2B tokens of publicly available post-training datasets using the NorMuon optimizer with Cautious Weight Decay, Polar Express Newton-Schulz coefficients, and a WSD (warmup-stable-decay) scheduler.
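A WSD schedule holds the learning rate flat between a linear warmup and a final cooldown. Below is a minimal sketch; the peak learning rate, warmup length, and decay length are illustrative assumptions, not values from the training code, though the decay branch follows the inverse-sqrt rule described for the decay stage.

```python
import math

def wsd_lr(step, total_steps=50_000, peak_lr=3e-3,
           warmup_steps=1_000, decay_steps=10_000):
    """Warmup-Stable-Decay (WSD) learning-rate schedule sketch.

    Linear warmup to peak_lr, a flat plateau, then inverse-sqrt decay
    over the final decay_steps. Hyperparameter values are illustrative.
    """
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # warmup: ramp linearly from 0 to peak_lr
        return peak_lr * step / warmup_steps
    if step < decay_start:
        # stable stage: constant learning rate
        return peak_lr
    # decay stage: learning rate falls as 1/sqrt(t) from cooldown start
    return peak_lr / math.sqrt(step - decay_start + 1)
```

Keeping the plateau flat is what makes the cooldown checkpoint-friendly: the decay stage can be re-run (or extended) from the last stable-stage checkpoint without re-doing the bulk of training.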

Custom library for training: https://github.com/thepowerfuldeez/sample_efficient_gpt

## Training phases

- Stable stage: batch size 50, context length 1024, gradient accumulation 8, i.e. ~409k tokens per update step.
- Decay stage: batch size 40, context length 1024, gradient accumulation 10, with inverse-sqrt learning-rate decay.
- EMA during cooldown, computed post hoc due to memory limits; checkpoint frequency was increased for this.
- Total micro steps: 50,000.
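Computing the EMA post hoc amounts to folding the saved cooldown checkpoints together in order, instead of updating a shadow copy online. A minimal sketch over plain parameter dicts (checkpoint format and the decay constant are assumptions; real checkpoints would hold tensors, not floats):

```python
def posthoc_ema(checkpoints, decay=0.999):
    """Fold a list of checkpoints (oldest first) into an EMA.

    Each checkpoint is a dict of parameter name -> value; plain floats
    stand in for tensors here. The decay value is illustrative.
    """
    # seed the average with the oldest checkpoint
    ema = dict(checkpoints[0])
    for ckpt in checkpoints[1:]:
        for name, value in ckpt.items():
            # standard EMA update, applied once per saved checkpoint
            ema[name] = decay * ema[name] + (1 - decay) * value
    return ema
```

This is why checkpoint frequency was increased during the cooldown: the post-hoc EMA only sees parameters at checkpoint boundaries, so more checkpoints give a closer approximation to an online EMA.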

## Evals

| Benchmark | Score |
|---|---|
| ARC-Easy | 0.4402 |
| ARC-Challenge | 0.3490 |
| MMLU | 0.3350 |
| GSM8K | 0.0129 |
| HumanEval | 0.1037 |
| ChatCORE metric | 0.1231 |

## Inference

A custom fork of transformers is required:

```bash
uv pip install "git+https://github.com/thepowerfuldeez/transformers.git@imu1"
```

A custom fork of vLLM is required:

```bash
uv pip install "git+https://github.com/thepowerfuldeez/vllm.git@imu1"
```

The chat template uses custom special tokens such as `<bos>`, `<user_start>`, `<user_end>`, `<assistant_start>`, and `<assistant_end>`.
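Assuming those tokens are laid out in the usual alternating-turn style (the canonical template ships with the tokenizer and is applied via `apply_chat_template`, so treat the exact ordering and whitespace here as an assumption), a rendered prompt would look like:

```python
def render_prompt(messages):
    """Render chat messages with the model's special tokens.

    Sketch only: the authoritative template is in the tokenizer config
    and should be applied with tokenizer.apply_chat_template.
    """
    parts = ["<bos>"]
    for m in messages:
        if m["role"] == "user":
            parts.append(f"<user_start>{m['content']}<user_end>")
        else:
            parts.append(f"<assistant_start>{m['content']}<assistant_end>")
    # leave the final assistant turn open so the model completes it
    parts.append("<assistant_start>")
    return "".join(parts)

prompt = render_prompt([{"role": "user", "content": "Hi!"}])
# "<bos><user_start>Hi!<user_end><assistant_start>"
```

Leaving the trailing `<assistant_start>` open is the generation prompt: the model's continuation up to `<assistant_end>` is the reply.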