smolLM3-500M

A base pretrained causal language model (~500M parameters) trained from scratch on ~19.8B tokens of mixed web and reference text.

This model was trained as a small, under-1B alternative to larger base models (e.g. Qwen, LLaMA-derived checkpoints), mainly for my own experimentation and learning.

It is not instruction-tuned and not RLHF’d — this is a raw pretrained base model.

Source code can be found here:

https://github.com/Xv-yn/smolLM3_500M

Model details

This model uses the standard smolLM3 architecture and tokenizer; only the configuration (e.g. depth/width) was adjusted to target ~500M parameters.

Architecture: smolLM3 causal language model

Parameters: ~500M (via modified model configuration)

Context length: 4096 tokens

Tokenizer: smolLM3 tokenizer (unchanged)

Precision during training: bfloat16

Attention implementation: PyTorch SDPA
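
Assuming the checkpoint is published in the standard Hugging Face format, it can be loaded with the same settings used during training (bfloat16 weights, SDPA attention). This is an illustrative sketch, not a tested snippet; the local path is a placeholder:

```python
# Sketch: load the checkpoint with the training-time settings
# (bfloat16 weights, PyTorch SDPA attention). Path is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "./smolLM3_500M"  # assumed local checkpoint directory

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    torch_dtype=torch.bfloat16,   # matches the training precision
    attn_implementation="sdpa",   # matches the training attention backend
)

# Base-model completion, not chat: the model just continues the text.
ids = tokenizer("The capital of France is", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Because this is a raw base model, prompt it with text to continue rather than instructions to follow.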

Intended use

  • Research and experimentation
  • Fine-tuning / instruction-tuning
  • Studying pretraining behavior at sub-1B scale

Limitations

  • Not instruction-tuned
  • Not aligned for chat or safety
  • No formal benchmarks included
  • Trained for experimentation, not deployment

Training data

The model was trained on a mixture of publicly available datasets, streamed and packed into fixed-length token blocks.

Source mix (approximate)

[
  {"name":"fineweb_edu","weight":0.88,"token_cap":"17.6B"},
  {"name":"dclm","weight":0.07,"token_cap":"1.4B"},
  {"name":"stackexchange","weight":0.03,"token_cap":"0.6B"},
  {"name":"wiki","weight":0.01,"token_cap":"0.2B"}
]
  • FineWeb-Edu makes up the bulk of the data for general language coverage.

  • DCLM adds higher-quality filtered web text.

  • StackExchange and Wikipedia provide more factual / structured content.

  • Each source has a token cap to prevent any single dataset from dominating.
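
The caps are just the mixture weights applied to the ~20B-token budget passed to the packing script. A quick arithmetic check:

```python
# Quick check: each token cap equals its mixture weight times the
# ~20B-token budget (--total_tokens in the packing command).
TOTAL_BUDGET = 20_000_000_000

weights = {
    "fineweb_edu": 0.88,
    "dclm": 0.07,
    "stackexchange": 0.03,
    "wiki": 0.01,
}

caps = {name: round(w * TOTAL_BUDGET) for name, w in weights.items()}
print(caps)
# The caps sum to 19.8B -- matching the "total tokens seen" figure --
# because the weights sum to 0.99 rather than 1.0.
```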

Note: One dataset used during early experimentation is not listed here because it required gated Hugging Face access. It was intentionally excluded so the data pipeline can be reproduced without authentication.

export SOURCES='[
  {"name":"fineweb_edu","dataset":"HuggingFaceFW/fineweb-edu","config":null,"text_col":"text","weight":0.88,"token_cap":17600000000},
  {"name":"dclm","dataset":"mlfoundations/dclm-baseline-1.0","config":null,"text_col":"text","weight":0.07,"token_cap":1400000000},
  {"name":"stackexchange","dataset":"allenai/dolmino-mix-1124","config":"stackexchange","text_col":"text","weight":0.03,"token_cap":600000000},
  {"name":"wiki","dataset":"allenai/dolmino-mix-1124","config":"wiki","text_col":"text","weight":0.01,"token_cap":200000000}
]'
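
Since SOURCES is plain JSON, it can be sanity-checked before building shards. A minimal validation sketch (in the real pipeline the string would come from the SOURCES environment variable):

```python
# Sketch: validate the SOURCES JSON before running the packing script.
import json

sources_json = """[
  {"name":"fineweb_edu","dataset":"HuggingFaceFW/fineweb-edu","config":null,"text_col":"text","weight":0.88,"token_cap":17600000000},
  {"name":"dclm","dataset":"mlfoundations/dclm-baseline-1.0","config":null,"text_col":"text","weight":0.07,"token_cap":1400000000},
  {"name":"stackexchange","dataset":"allenai/dolmino-mix-1124","config":"stackexchange","text_col":"text","weight":0.03,"token_cap":600000000},
  {"name":"wiki","dataset":"allenai/dolmino-mix-1124","config":"wiki","text_col":"text","weight":0.01,"token_cap":200000000}
]"""

sources = json.loads(sources_json)
required = {"name", "dataset", "config", "text_col", "weight", "token_cap"}
for src in sources:
    missing = required - src.keys()
    assert not missing, f"{src['name']}: missing fields {missing}"

total_cap = sum(s["token_cap"] for s in sources)
print(f"{len(sources)} sources, {total_cap / 1e9:.1f}B tokens capped")
```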

Data processing & packing

  • Text is streamed directly from Hugging Face (streaming=True)
  • Documents are tokenized without special tokens
  • An EOS token is appended between documents
  • Tokens are concatenated into a rolling buffer
  • Fixed-length blocks of exactly 4096 tokens are sliced
  • Each block becomes one training example
  • Blocks are written to Parquet shards for efficient streaming during training

This keeps training simple and avoids dynamic padding.
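
The steps above can be sketched in a few lines. This toy version uses a character-level stand-in for the tokenizer and a short block length so the mechanics are visible; the real pipeline uses the smolLM3 tokenizer, 4096-token blocks, and Parquet output:

```python
# Toy sketch of fixed-length packing: tokenize, append EOS between
# documents, accumulate into a rolling buffer, slice exact-length blocks.
EOS = 0
SEQ_LEN = 8  # the real pipeline uses 4096

def toy_tokenize(doc):
    # stand-in for the real tokenizer: one token id per character
    return [ord(c) for c in doc]

def pack(docs, seq_len=SEQ_LEN):
    buffer, blocks = [], []
    for doc in docs:
        buffer.extend(toy_tokenize(doc))
        buffer.append(EOS)                 # EOS separates documents
        while len(buffer) >= seq_len:
            blocks.append(buffer[:seq_len])  # slice one fixed-length block
            buffer = buffer[seq_len:]
    return blocks  # any leftover tail in the buffer is not emitted here

blocks = pack(["hello world", "hi"])
# every block is exactly seq_len tokens, so no padding is ever needed
```

Because every block is full, attention masks and padding logic drop out of the training loop entirely.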

python build_packed_shards.py \
  --model_dir . \
  --out_dir ./packed/seq4096 \
  --seq_len 4096 \
  --total_tokens 20000000000 \
  --blocks_per_shard 8192 \
  --shuffle_buffer 100000 \
  --seed 1234 \
  --sources_json "$SOURCES"

Training setup

Hardware

4 × NVIDIA RTX 6000 GPUs

The batch sizing and launch command below are configured around this hardware.

Training configuration

Sequence length: 4096

Micro batch size: 4 (per GPU)

Gradient accumulation: 1

Effective batch: 4 × 4 × 4096 = 65,536 tokens per step

Optimizer: AdamW

Learning rate: 1e-4

Weight decay: 0.01

Scheduler: Cosine with warmup

Warmup steps: 2000

Total training steps: 610,352

Total tokens seen: ~19.8B

The training loop is step-based rather than epoch-based, since the dataset is streamed from Parquet shards and does not have a fixed number of batches.
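
"Step-based" here just means the loop counts optimizer steps against a fixed target instead of passes over the data. A minimal sketch with a stand-in batch stream (the real loop lives in pretrain_smollm3.py):

```python
# Sketch: a step-based loop over an effectively unbounded batch stream.
# Training stops at num_train_steps, never at an epoch boundary.
from itertools import count, islice

def fake_stream():
    # stand-in for batches streamed from Parquet shards
    for i in count():
        yield [i] * 8  # one "batch" of 8 token blocks

def train(num_train_steps, batch_stream):
    tokens_seen = 0
    step = 0
    for step, batch in enumerate(islice(batch_stream, num_train_steps), 1):
        # forward / backward / optimizer.step() would go here,
        # plus periodic logging (log_every) and checkpoints (save_every)
        tokens_seen += len(batch) * 4096  # blocks in batch * seq_len
    return step, tokens_seen

steps, tokens = train(5, fake_stream())
```

The same structure works whether the stream is finite or infinite, which is why no epoch count appears anywhere in the configuration.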

accelerate launch --num_processes 4 pretrain_smollm3.py \
  --model_dir . \
  --packed_dir ./packed/seq4096 \
  --output_dir ./runs/full_20B_1epoch_4gpu \
  --seq_len 4096 \
  --micro_batch_size 4 \
  --grad_accum_steps 1 \
  --learning_rate 1e-4 \
  --weight_decay 0.01 \
  --num_train_steps 610352 \
  --warmup_steps 2000 \
  --log_every 10 \
  --save_every 50000 \
  --num_workers 0 \
  --mixed_precision bf16 \
  --attn_impl sdpa

Why this model exists

I trained this model because I wanted:

  • something under 1B parameters
  • a from-scratch pretrained model (not distilled or continued from Qwen / LLaMA)
  • a setup that was simple to reason about and reproduce
  • a checkpoint that’s cheap enough to experiment with locally or on modest hardware

There are many strong large models available already; this was more about understanding and controlling the full pretraining pipeline end-to-end.
