smolLM3-500M
A base pretrained causal language model (~500M parameters) trained from scratch on ~19.8B tokens of mixed web and reference text.
This model was trained as a small, under-1B alternative to larger base models (e.g. Qwen, LLaMA-derived checkpoints), mainly for my own experimentation and learning.
It is not instruction-tuned and not RLHF’d; this is a raw pretrained base model.
Source code can be found here:
https://github.com/Xv-yn/smolLM3_500M
Model details
This model uses the standard smolLM3 architecture and tokenizer; only the configuration (e.g. depth/width) was adjusted to target ~500M parameters.
Architecture: smolLM3 causal language model
Parameters: ~500M (via modified model configuration)
Context length: 4096 tokens
Tokenizer: smolLM3 tokenizer (unchanged)
Precision during training: bfloat16
Attention implementation: PyTorch SDPA
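Since only depth/width were adjusted to hit the ~500M target, a rough parameter-count formula is handy for sanity-checking configurations. The sketch below assumes a LLaMA-style decoder with tied embeddings, grouped-query attention, and a SwiGLU MLP; these are my assumptions for illustration, not the exact smolLM3 layout (the real config lives in the repo).

```python
def approx_param_count(vocab, d_model, n_layers, d_ff, n_heads, n_kv_heads):
    """Rough parameter count for a LLaMA-style decoder (illustrative only:
    tied embeddings, grouped-query attention, SwiGLU MLP, RMSNorm)."""
    head_dim = d_model // n_heads
    embeddings = vocab * d_model                       # tied in/out embedding
    attn = (d_model * d_model                          # Q projection
            + 2 * n_kv_heads * head_dim * d_model      # K and V (GQA)
            + d_model * d_model)                       # output projection
    mlp = 3 * d_model * d_ff                           # gate, up, down
    norms = 2 * d_model                                # two RMSNorms per layer
    return embeddings + n_layers * (attn + mlp + norms) + d_model  # + final norm
```

Sweeping `d_model` and `n_layers` through a function like this is one way to land near a ~500M-parameter budget before committing to a run.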
Intended use
- Research and experimentation
- Fine-tuning / instruction-tuning
- Studying pretraining behavior at sub-1B scale
Limitations
- Not instruction-tuned
- Not aligned for chat or safety
- No formal benchmarks included
- Trained for experimentation, not deployment
Training data
The model was trained on a mixture of publicly available datasets, streamed and packed into fixed-length token blocks.
Source mix (approximate)
[
  {"name": "fineweb_edu", "weight": 0.88, "token_cap": "17.6B"},
  {"name": "dclm", "weight": 0.07, "token_cap": "1.4B"},
  {"name": "stackexchange", "weight": 0.03, "token_cap": "0.6B"},
  {"name": "wiki", "weight": 0.01, "token_cap": "0.2B"}
]
FineWeb-Edu makes up the bulk of the data for general language coverage.
DCLM adds higher-quality filtered web text.
StackExchange and Wikipedia provide more factual / structured content.
Each source has a token cap to prevent any single dataset from dominating.
Note: One dataset used during early experimentation is not listed here because it required gated Hugging Face access. It was intentionally excluded so the data pipeline can be reproduced without authentication.
export SOURCES='[
{"name":"fineweb_edu","dataset":"HuggingFaceFW/fineweb-edu","config":null,"text_col":"text","weight":0.88,"token_cap":17600000000},
{"name":"dclm","dataset":"mlfoundations/dclm-baseline-1.0","config":null,"text_col":"text","weight":0.07,"token_cap":1400000000},
{"name":"stackexchange","dataset":"allenai/dolmino-mix-1124","config":"stackexchange","text_col":"text","weight":0.03,"token_cap":600000000},
{"name":"wiki","dataset":"allenai/dolmino-mix-1124","config":"wiki","text_col":"text","weight":0.01,"token_cap":200000000}
]'
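The weight-plus-cap behavior can be sketched as a simple sampler (a toy illustration, not the actual pipeline code): each draw picks a source in proportion to its weight, and a source leaves the pool once its token cap is exhausted.

```python
import random

def pick_source(sources, rng):
    """Sample a source proportionally to weight, skipping capped sources.

    Each source is a dict with "weight", "token_cap", and a running
    "tokens_seen" counter (field names are illustrative).
    """
    active = [s for s in sources if s["tokens_seen"] < s["token_cap"]]
    if not active:
        return None  # every source has hit its cap
    r = rng.random() * sum(s["weight"] for s in active)
    for s in active:
        r -= s["weight"]
        if r <= 0:
            return s
    return active[-1]  # guard against floating-point rounding
```

Drawing repeatedly and crediting `tokens_seen` guarantees no source ever exceeds its cap, which is exactly the "prevent any single dataset from dominating" property described above.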
Data processing & packing
- Text is streamed directly from Hugging Face (streaming=True)
- Documents are tokenized without special tokens
- An EOS token is appended between documents
- Tokens are concatenated into a rolling buffer
- Fixed-length blocks of exactly 4096 tokens are sliced
- Each block becomes one training example
- Blocks are written to Parquet shards for efficient streaming during training
This keeps training simple and avoids dynamic padding.
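The packing steps above amount to only a few lines of code. Here is a toy sketch with a stand-in tokenizer (the real pipeline uses the smolLM3 tokenizer and writes the blocks to Parquet shards):

```python
EOS_ID = 0  # placeholder; the real EOS id comes from the tokenizer

def toy_tokenize(doc):
    # Stand-in for the smolLM3 tokenizer: one id per whitespace-separated
    # word, added without special tokens, as in the real pipeline.
    return [hash(w) % 50_000 + 1 for w in doc.split()]

def pack_blocks(docs, seq_len=4096):
    """Yield fixed-length token blocks: tokenize each document, append an
    EOS token between documents, concatenate into a rolling buffer, and
    slice off blocks of exactly seq_len tokens."""
    buffer = []
    for doc in docs:
        buffer.extend(toy_tokenize(doc))
        buffer.append(EOS_ID)
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]
    # any final partial block is dropped, so no padding is ever needed
```

Because every yielded block is exactly `seq_len` tokens, each one maps directly to a training example with no dynamic padding.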
python build_packed_shards.py \
--model_dir . \
--out_dir ./packed/seq4096 \
--seq_len 4096 \
--total_tokens 20000000000 \
--blocks_per_shard 8192 \
--shuffle_buffer 100000 \
--seed 1234 \
--sources_json "$SOURCES"
Training setup
Hardware
4 × NVIDIA RTX 6000 GPUs
The training command and batch sizing below are configured around this hardware.
Training configuration
Sequence length: 4096
Micro batch size: 4 (per GPU)
Gradient accumulation: 1
Effective batch: 4 GPUs × 4 sequences × 4096 tokens per step
Optimizer: AdamW
Learning rate: 1e-4
Weight decay: 0.01
Scheduler: Cosine with warmup
Warmup steps: 2000
Total training steps: 610,352
Total tokens seen: ~19.8B
The training loop is step-based rather than epoch-based, since the dataset is streamed from Parquet shards and does not have a fixed number of batches.
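The schedule described above (linear warmup to 1e-4, then cosine decay over the remaining steps) can be written down directly. One assumption in this sketch: the learning rate decays to zero at the final step, whereas the actual script may use a nonzero minimum LR.

```python
import math

def lr_at(step, base_lr=1e-4, warmup_steps=2000, total_steps=610_352):
    """Cosine-with-warmup learning-rate schedule (sketch)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps  # linear warmup from 0
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))  # cosine decay
```

With a step-based loop, the schedule depends only on the step counter, so it needs no notion of epochs or dataset length.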
accelerate launch --num_processes 4 pretrain_smollm3.py \
--model_dir . \
--packed_dir ./packed/seq4096 \
--output_dir ./runs/full_20B_1epoch_4gpu \
--seq_len 4096 \
--micro_batch_size 4 \
--grad_accum_steps 1 \
--learning_rate 1e-4 \
--weight_decay 0.01 \
--num_train_steps 610352 \
--warmup_steps 2000 \
--log_every 10 \
--save_every 50000 \
--num_workers 0 \
--mixed_precision bf16 \
--attn_impl sdpa
Why this model exists
I trained this model because I wanted:
- something under 1B parameters
- a from-scratch pretrained model (not distilled or continued from Qwen / LLaMA)
- a setup that was simple to reason about and reproduce
- a checkpoint that’s cheap enough to experiment with locally or on modest hardware
There are many strong large models available already; this was more about understanding and controlling the full pretraining pipeline end-to-end.