FreqFormer-32M: Trying to Build a Reasoning Model With 32 Million Parameters

So I tried to train a reasoning model from scratch. A tiny one with only 32 million parameters. Half of that is just the embedding table. The actual "brain" of the model is about 16 million parameters. For context, GPT-2 Small is 124M and people already call that small.

Spoiler: it didn't work. But I learned a lot, and I think the failure itself is interesting enough to share.

What Is This?

FreqFormer is a custom transformer architecture I built. The main twist is frequency-gated attention: each attention head has a small 1D convolution that acts as a learnable frequency filter on the query stream before the dot-product attention. The idea was to let different heads specialize on different temporal patterns (local vs. long-range dependencies). The conv kernel sizes are [1, 3, 7, 15] across 4 frequency groups.
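The mechanism is easier to see in code. Here is a minimal, self-contained sketch of the idea; the class name, the causal-padding trick, and the way heads are assigned to frequency groups are my illustrative choices, not the exact freqformer/model.py implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FreqGatedAttention(nn.Module):
    """Sketch of frequency-gated attention: heads are split into frequency
    groups, and each group's query stream is filtered by a depthwise causal
    1D conv (a learnable frequency filter) before dot-product attention."""

    def __init__(self, dim=512, num_heads=8, kernels=(1, 3, 7, 15)):
        super().__init__()
        assert num_heads % len(kernels) == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        group_ch = (num_heads // len(kernels)) * self.head_dim
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        # One depthwise conv per frequency group; the kernel size sets the
        # temporal window the group's queries are smoothed over.
        self.filters = nn.ModuleList(
            nn.Conv1d(group_ch, group_ch, kernel_size=k,
                      padding=k - 1, groups=group_ch)  # left-pad, trim right
            for k in kernels
        )

    def forward(self, x):
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Filter each group's query channels along the time axis. With
        # padding=k-1 and trimming to T, position t only sees positions <= t.
        groups = q.transpose(1, 2).chunk(len(self.filters), dim=1)
        q = torch.cat([f(g)[..., :T] for f, g in zip(self.filters, groups)],
                      dim=1).transpose(1, 2)
        q, k, v = (t.reshape(B, T, self.num_heads, self.head_dim)
                    .transpose(1, 2) for t in (q, k, v))
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(y.transpose(1, 2).reshape(B, T, D))
```

With 8 heads and 4 kernels, each frequency group gets 2 heads: the k=1 group sees only the current position, while the k=15 group can smooth its queries over a 15-token window.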

The model was trained in three stages:

  1. Pretraining on 10B tokens of general text (Dolma mix)
  2. SFT (supervised fine-tuning) on chat data with <think> reasoning traces
  3. DPO (direct preference optimization) on chosen/rejected pairs

Everything was trained on 8x GPUs with DDP, using torch.compile and Flash Attention.

The Numbers

Stage      Params   Data            Steps    Final Metric
Pretrain   33.2M    ~12.3B tokens    7,827   val PPL 27.0
SFT        33.2M    ~36.2B tokens   23,008   eval loss 2.74
DPO        33.2M    ~12.8B tokens    6,124   eval loss 0.49, acc 76%

Hardware and Runtime Snapshot

Stage      GPUs           Throughput (tokens/sec)   VRAM on gpu0
Pretrain   8x RTX 5090    ~1.61M                    ~31,448 MB
SFT        8x RTX 5090    ~1.55M                    ~31,442 MB
DPO        8x RTX 5090    ~1.19M                    ~29,630 MB

Parameter Breakdown

This is the uncomfortable part. Out of 33.2M total parameters:

  • Embedding + LM head: 16.4M (49.3%). This is almost half the model, and it is mostly a lookup table mapping token IDs to vectors and back.
  • Transformer blocks: 16.8M (50.7%). This is the actual "thinking" part.

So when I say "32 million parameter reasoning model," what I really mean is "16 million parameters trying to reason, plus a dictionary." The embedding table has a 32K vocabulary (Mistral BPE tokenizer), and each token gets a 512-dimensional vector. That alone is 32003 × 512 = 16.4M parameters, and there's nothing you can do about it. You need the vocabulary.

Architecture Details

model_dim:        512
num_layers:       8
num_heads:        8
head_dim:         64
mlp_ratio:        2.0 (so MLP hidden = 1024)
freq_groups:      4
freq_kernels:     [1, 3, 7, 15]
context:          up to 16K tokens (trained), 8K RoPE base
vocab:            32,003 (Mistral BPE + ChatML special tokens)
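These numbers make the parameter breakdown easy to sanity-check. A back-of-envelope count (assuming the embedding and LM head share one tied 32,003 × 512 matrix, and ignoring norms and the tiny freq-conv kernels; both are assumptions on my part):

```python
# Back-of-envelope parameter count from the config above. Assumes a tied
# embedding / LM head and a plain up+down MLP; norms and the small
# freq-conv kernels are ignored.
vocab, dim, layers, mlp_hidden = 32_003, 512, 8, 1_024

embed = vocab * dim                      # tied token embedding / LM head
attn_per_layer = 4 * dim * dim           # Q, K, V, O projections
mlp_per_layer = 2 * dim * mlp_hidden     # up + down projections
blocks = layers * (attn_per_layer + mlp_per_layer)

print(f"embedding: {embed / 1e6:.1f}M")             # 16.4M
print(f"blocks:    {blocks / 1e6:.1f}M")            # 16.8M
print(f"total:     {(embed + blocks) / 1e6:.1f}M")  # 33.2M
```

This lands within rounding of the reported 16.4M / 16.8M / 33.2M split, which suggests the embedding and LM head really are tied.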

What Went Wrong

The model can't follow instructions

Here's what happens when you ask it "What is the capital of France?":

<think>
Okay, the user is asking about the capital of France. This seems like
a straightforward and harmless request.

Hmm, I recall my guidelines: must refuse harmful requests while
offering constructive alternatives. The phrasing "capital" suggests
that it's "indefinitely," so I should redirect toward positive topics.
</think>

It thinks asking about the capital of France is a harmful request. It learned the shape of reasoning (think tags, hedging, considering the user's intent) but not the substance. It's cargo-culting the reasoning format without actually reasoning.

It can't do basic math

User: What is 15 * 23?

<think>
First, I need to recall what 15 is. It's a big number, so it should
be a straightforward number. Then there are 23 more than 15, which
means 15 + 23 = 38.

Now, the question is asking for 15 * 23. That would be 38 * 15.
Let me calculate that: 38*15 is 500, plus 38*23 is 976, minus 500
is 476. So the answer is 476.
</think>

The correct answer is 345. The model doesn't know what multiplication is. It's just stringing together number-shaped words.

Multi-turn is completely broken

I tested with a simple pattern: I say a country, you say the capital. After 4 correct examples in the chat history (Turkey→Ankara, France→Paris, Japan→Tokyo, Brazil→Brasília), when I asked "Germany," the model responded with "Northern Africa is a city in the United States."

It can't track conversation context at all. With 16M parameters of actual compute, there's simply not enough capacity to maintain coherent state across turns.

DPO made the numbers look good but didn't help

The DPO training looked great on paper: loss dropped from 0.69 (random) to 0.49, accuracy went from 50% to 76%. The model learned to prefer "chosen" over "rejected" responses in the training data. But in practice, the generations are equally incoherent. DPO can align preferences, but it can't create capabilities that aren't there.

What It Can Do (Sort Of)

To be fair, the model did learn something from 12 billion tokens of pretraining:

  • It generates grammatically correct English most of the time
  • It uses the <think> tags in roughly the right places
  • It knows the general vibe of being an AI assistant (hedging, offering alternatives, being polite)
  • It can occasionally get simple factual questions right if you're lucky
  • The DPO model at least tries to give direct answers more often than the SFT model

But "occasionally correct by accident" is not really a capability.

Dataset

All training data comes from Allen AI (Ai2), one of the few organizations doing genuinely open-source AI research. They do not just release model weights; they also release the data, the training code, and the evaluation harness. The datasets below are all openly licensed under ODC-BY, which means you can actually use them for real work. More labs should operate like this.

Pre-tokenized versions of all three stages are available at cturan/ultrasparse_tokenized if you want to skip the tokenization step and jump straight to training.

Pretraining Data

allenai/dolma3_dolmino_mix-10B-1025: Pretraining source corpus. I tokenized it with my tokenizer. In my logs, pretraining processed 12,310,806,528 tokens.

SFT Data

allenai/Dolci-Think-SFT-32B: SFT source dataset from the OLMo-3 32B line. In my logs, SFT processed 36,188,454,912 tokens.

DPO Data

allenai/Dolci-Think-DPO-32B: DPO source dataset from the OLMo-3 32B line. In my logs, DPO processed 12,842,958,848 tokens.

All three datasets were originally created for the OLMo 3 project. I just borrowed them for my tiny-model experiment.

Citation:

@misc{olmo2025olmo3,
    title={OLMo 3},
    author={Team OLMo and Allyson Ettinger and Amanda Bertsch and others},
    year={2025},
    eprint={2512.13961},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2512.13961},
}

Training Details

Pretraining

torchrun --nproc_per_node=8 -m freqformer.train \
    --preset small --distributed ddp \
    --data_dir dolma_mix_10b \
    --batch_size 3 --seq_len 16384 --grad_accum_steps 4 \
    --optimizer muon --lr 0.03 --lr_schedule cosine \
    --warmup_steps 20 --num_epochs 1

Throughput: ~1.6M tokens/sec across 8 GPUs. Final val PPL: 27.0.

SFT

torchrun --nproc_per_node=8 -m freqformer.sft_train \
    --preset small --distributed ddp \
    --data_dir sft \
    --pretrain_checkpoint checkpoints/pretrain/step_0007827.pt \
    --batch_size 3 --seq_len 16384 --grad_accum_steps 4 \
    --optimizer muon --lr 0.07 --lr_schedule cosine \
    --warmup_steps 200 --num_epochs 1

Throughput: 1.55M tokens/sec. Final eval loss: 2.74. The previously reported assistant token ratio (6-7%) was a logging bug, not the real data distribution.

DPO

torchrun --nproc_per_node=8 -m freqformer.dpo_train \
    --preset small --distributed ddp \
    --sft_checkpoint checkpoints/sft/sft_step_0023008.pt \
    --data_dir dpo --beta 0.3 \
    --optimizer splus --lr 2e-6 --lr_schedule linear \
    --warmup_steps 200 --batch_size 2 --seq_len 16384 \
    --grad_accum_steps 4 --num_epochs 2

Throughput: ~1.19M tokens/sec. Final eval loss: 0.49, accuracy: 76%, margin: 0.93.
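As a sanity check, the per-stage token counts reported in the Dataset section fall out exactly from the batch geometry of these three commands. (The ×2 for DPO, counting both the chosen and the rejected sequence of each pair, is my inference, but it makes the arithmetic match to the token.)

```python
# Tokens per optimizer step = micro_batch x seq_len x grad_accum x GPUs,
# times the number of logged optimizer steps.
pretrain = 3 * 16_384 * 4 * 8 * 7_827
sft      = 3 * 16_384 * 4 * 8 * 23_008
dpo      = 2 * 16_384 * 4 * 8 * 2 * 6_124   # x2: chosen + rejected per pair

assert pretrain == 12_310_806_528  # matches the pretraining log
assert sft      == 36_188_454_912  # matches the SFT log
assert dpo      == 12_842_958_848  # matches the DPO log
print("all three stage token counts match the logs")
```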

Why Did It Fail?

Honestly, I think the answer is simple: 16 million parameters is not enough to reason.

Language modeling at this scale can learn syntax, common phrases, and rough topic associations. But reasoning, even the fake "let me think step by step" kind, requires the model to maintain and manipulate internal state in ways that demand much more capacity.

Some specific issues:

  1. Embedding bottleneck: Half the parameters are in the embedding table. With a 32K vocabulary, you can't avoid this. Smaller vocabularies would mean longer sequences and worse tokenization. It's a tax you pay regardless of model size, and it hurts small models disproportionately.

  2. Only 8 layers deep: Reasoning requires depth. Each layer is a "step" of computation. 8 layers means 8 steps of processing between input and output. Complex reasoning chains need more steps than that.

  3. 512-dim representations: Each token is represented as a 512-dimensional vector. That's the entire "working memory" the model has per position. For comparison, GPT-2 Small uses 768, and even that struggles with reasoning.

  4. The training data had <think> traces: The model learned to produce thinking tokens, but the reasoning in those traces requires world knowledge and logical capability that 16M parameters simply can't store.
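Point 1 is easy to quantify. Holding the 32K vocabulary fixed and scaling width and depth (the 768- and 1024-dim configs below are hypothetical scale-ups, not trained models), the embedding share of the parameter budget falls off quickly:

```python
# Embedding share of the parameter budget as the model scales, with the
# 32K vocabulary held fixed. Per-layer cost is roughly 8 * dim^2
# (4 attention projections + a 2x-ratio MLP).
vocab = 32_003
for dim, layers in [(512, 8), (768, 12), (1024, 24)]:
    embed = vocab * dim
    blocks = layers * 8 * dim * dim
    print(f"dim={dim:4d} layers={layers:2d} embedding share "
          f"{embed / (embed + blocks):.0%}")
# 512/8 -> 49%, 768/12 -> 30%, 1024/24 -> 14%
```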

What I'd Do Differently

  • Bigger model, obviously. Even 150-300M would be a completely different story. The embedding tax becomes proportionally smaller, and you get enough depth and width for basic reasoning.
  • Smaller vocabulary. A 4K or 8K BPE vocabulary would cut the embedding cost dramatically, though at the cost of longer sequences.
  • More pretraining data. 12B tokens for a 33M model is already heavy training by Chinchilla standards (roughly 370 tokens per parameter, versus the Chinchilla-optimal ~20), but tiny models keep improving well past that ratio, so more data would still help.
  • Skip the <think> traces for tiny models. The reasoning format adds overhead without benefit at this scale. Simple instruction-following without chain-of-thought would be more achievable.
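On the vocabulary point: at dim 512 the embedding cost scales linearly with vocabulary size, so the hypothetical smaller vocabularies would free 12-14M parameters that could be reinvested in transformer blocks:

```python
# Embedding cost at dim=512 for smaller (hypothetical) BPE vocabularies.
for vocab in (32_003, 8_000, 4_000):
    print(f"vocab {vocab:6,d}: {vocab * 512 / 1e6:.1f}M embedding params")
# 32,003 -> 16.4M, 8,000 -> 4.1M, 4,000 -> 2.0M
```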

Files

freqformer/
├── model.py              # FreqFormer architecture (FreqGated attention + parallel MLP)
├── config.py             # All configuration dataclasses
├── train.py              # Pretraining loop (DDP, torch.compile, mmap data)
├── sft_train.py          # SFT training loop
├── dpo_train.py          # DPO training loop
├── generate.py           # Inference / chat / text generation
├── api.py                # OpenAI-compatible API server
├── data.py               # Mmap dataset loader for pretraining
├── sft_data.py           # SFT data pipeline (ChatML template + label masks)
├── dpo_data.py           # DPO data pipeline (chosen/rejected pairs)
├── optimizer.py          # Muon / AdamW / SPlus optimizers + Engram handling
├── engram.py             # Engram embedding system (CPU-offloaded learned embeddings)
├── tokenize_data.py      # HuggingFace datasets → binary token files
└── tokenizer/            # Mistral BPE 32K + ChatML special tokens

How to Run

pip install -r requirements.txt

# Chat with the model (it will disappoint you)
python -m freqformer.generate \
    --checkpoint checkpoints/dpo_step_0006124.pt \
    --chat --temperature 0.4 --max_tokens 2048

# Or start the OpenAI-compatible API
python -m freqformer.api \
    --checkpoint checkpoints/dpo_step_0006124.pt \
    --port 8000

License

The code and model weights are released under MIT. The training datasets are licensed under ODC-BY by Allen AI, in accordance with Ai2's Responsible Use Guidelines.


Built with PyTorch 2.8, trained on 8x GPUs. Total training time across all stages: roughly 8-10 hours. Total tokens processed: ~61 billion.
