---
language:
- en
license: apache-2.0
tags:
- causal-lm
- reasoning
- thought-experiments
- chain-of-thought
- sft
- dpo
- alignment
- small-language-model
- custom-architecture
base_model: tensorfiend/DotLM-165M
datasets:
- tensorfiend/SimpleThoughts
pipeline_tag: text-generation
library_name: transformers
---

# DotLM
|
|
DotLM is a minimal 165M-parameter transformer, trained from scratch entirely on the
[SimpleThoughts](https://huggingface.co/datasets/tensorfiend/SimpleThoughts) dataset. It uses explicit `<think>...</think>`
chain-of-thought traces to reason through intuitive physics, logic, causal inference, and other everyday phenomena before producing an
answer.
|
|
| ## Model Details |
|
|
| ### Architecture |
|
|
| | Parameter | Value | |
| |---|---| |
| | Parameters | ~165M | |
| | Layers | 24 | |
| | Model dimension | 768 | |
| | FFN hidden dim | 2048 (SwiGLU) | |
| | Attention heads | 6 | |
| | KV heads (GQA) | 2 | |
| | Head dimension | 128 | |
| | Context length | 4096 tokens | |
| | Vocabulary size | 16,384 (BPE) | |
| | Positional encoding | RoPE (θ = 10,000) | |
| | Normalization | RMSNorm (ε = 1e-6) | |
| | Tied embeddings | Yes | |
|
|
**Key design choices:** Grouped-Query Attention (GQA) with a 3:1 query-to-KV head ratio for a smaller KV cache, SwiGLU activations, pre-norm
architecture, and bf16 mixed-precision training throughout.
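
As a rough illustration of why GQA matters even at this scale, the sketch below estimates the per-sequence KV-cache footprint from the numbers in the table above (24 layers, head dim 128, 4,096-token context, bf16). Only the head counts come from the table; the rest is standard KV-cache arithmetic, not a measurement of the actual implementation.

```python
# Rough per-sequence KV-cache size, derived from the architecture table.
layers = 24
head_dim = 128
ctx = 4096
bytes_per_val = 2  # bf16

def kv_cache_bytes(n_kv_heads: int) -> int:
    # Keys + values: 2 tensors of shape (layers, ctx, n_kv_heads, head_dim).
    return 2 * layers * ctx * n_kv_heads * head_dim * bytes_per_val

mha = kv_cache_bytes(6)  # hypothetical full multi-head attention (6 KV heads)
gqa = kv_cache_bytes(2)  # GQA as configured (2 KV heads)
print(f"MHA: {mha / 2**20:.0f} MiB, GQA: {gqa / 2**20:.0f} MiB")
# At a 3:1 ratio, the KV cache shrinks to one third of the full-MHA size.
```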
|
|
| ### Training Pipeline |
|
|
| The model was trained sequentially across four stages using the [DotLM framework](https://github.com/shanmukh05/DotLM): |
|
|
| | Stage | Dataset | Samples | Objective | |
| |---|---|---|---| |
| | Pretraining | SimpleThoughts/pretrain | 352,214 | Next-token prediction | |
| | SFT | SimpleThoughts/sft | 25,788 | ChatML instruction following | |
| | Alignment | SimpleThoughts/alignment | 7,172 | Reference-free DPO (SimPO-style) | |
| | Reasoning | SimpleThoughts/reasoning | 6,300 | Chain-of-thought with `<think>` traces | |
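
The alignment stage's objective, reference-free DPO in the SimPO style, scores responses by length-normalized log-likelihood and requires no frozen reference model. The sketch below is an illustrative implementation of that published loss form, not the repository's actual training code; the `beta` and `gamma` values are placeholders.

```python
import torch
import torch.nn.functional as F

def simpo_loss(chosen_logps: torch.Tensor,
               rejected_logps: torch.Tensor,
               chosen_lens: torch.Tensor,
               rejected_lens: torch.Tensor,
               beta: float = 2.0,
               gamma: float = 0.5) -> torch.Tensor:
    """SimPO-style reference-free preference loss.

    chosen_logps / rejected_logps: summed token log-probs of each response.
    chosen_lens / rejected_lens: response lengths, for length normalization.
    """
    # Length-normalized implicit rewards -- no reference model needed.
    r_chosen = beta * chosen_logps / chosen_lens
    r_rejected = beta * rejected_logps / rejected_lens
    # Push the chosen reward above the rejected one by a target margin gamma.
    return -F.logsigmoid(r_chosen - r_rejected - gamma).mean()
```

Because the reward is a per-token average rather than a sum, longer responses are not automatically favored, which is the main practical difference from vanilla DPO.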
|
|
| ### Special Tokens |
|
|
| | Token | Purpose | |
| |---|---| |
| | `<\|im_start\|>` | Start of turn (BOS) | |
| | `<\|im_end\|>` | End of turn | |
| | `<think>` | Begin reasoning trace | |
| | `</think>` | End reasoning trace | |
| | `<endoftext>` | End of sequence (EOS) | |
| | `<pad>` | Padding | |
|
|
| ## Usage |
|
|
| ```python |
| import torch |
| from transformers import AutoTokenizer, AutoModelForCausalLM |
| |
| repo_id = "tensorfiend/DotLM-165M" |
| device = "cuda" if torch.cuda.is_available() else "cpu" |
| |
| tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True) |
| model = AutoModelForCausalLM.from_pretrained( |
| repo_id, |
| trust_remote_code=True, |
| torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32, |
| ).to(device) |
| |
| user_query = "If a ball is placed inside a box and the box is sealed, where is the ball?" |
| |
| prompt = f"<|im_start|>user\n{user_query}<|im_end|>\n<|im_start|>assistant\n<think>" |
| |
| inputs = tokenizer(prompt, return_tensors="pt").to(device) |
| |
| outputs = model.generate( |
| **inputs, |
| max_new_tokens=512, |
| temperature=0.7, |
| top_k=50, |
| do_sample=True, |
| eos_token_id=tokenizer.eos_token_id, |
| ) |
| |
| print(tokenizer.decode(outputs[0], skip_special_tokens=False)) |
| ``` |
|
|
| ### Prompt Format |
|
|
| DotLM uses the ChatML format with an explicit reasoning prefix: |
|
|
| ``` |
| <|im_start|>user |
| {your question}<|im_end|> |
| <|im_start|>assistant |
| <think> |
| {model reasons here} |
| </think> |
| {final answer} |
| ``` |
|
|
| ## Performance & Limitations |
|
|
- **Scale:** At 165M parameters, DotLM is a research-scale model. It is not competitive with large-scale LLMs on general benchmarks.
- **Domain:** The model is specialized on thought experiments — intuitive physics, causal reasoning, spatial reasoning, theory of mind, and
  related domains. It may underperform on unrelated topics.
- **Reasoning quality:** The chain-of-thought traces are coherent on in-distribution thought experiments but may hallucinate or ramble on
  out-of-distribution inputs.
- **Context:** Maximum context length is 4,096 tokens.
- **Safety:** No RLHF safety training was applied. Not suitable for deployment in user-facing products without additional safety measures.
|
|
| ## Training Details |
|
|
Check out the blog post for full training details: [DotLM - An end-to-end trained 165M model](https://www.tensorwrites.com/) (coming soon)
|
|
## Related Resources
|
|
| - Dataset: [SimpleThoughts](https://huggingface.co/datasets/tensorfiend/SimpleThoughts) |
| - Training code: [DotLM](https://github.com/shanmukh05/DotLM) (coming soon) |
|
|
| ## Citation |
|
|
```bibtex
@misc{dotlm2026,
  author = {Shanmukh},
  title = {DotLM-165M: A Minimal Reasoning Language Model Trained on Thought Experiments},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/tensorfiend/DotLM-165M}
}
```
|
|
| ## License |
|
|
This model is released under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).