---
language:
- en
license: apache-2.0
tags:
- causal-lm
- reasoning
- thought-experiments
- chain-of-thought
- sft
- dpo
- alignment
- small-language-model
- custom-architecture
base_model: tensorfiend/DotLM-165M
datasets:
- tensorfiend/SimpleThoughts
pipeline_tag: text-generation
library_name: transformers
---
# DotLM
DotLM is a minimal 165M-parameter transformer trained from scratch entirely on the
[SimpleThoughts](https://huggingface.co/datasets/tensorfiend/SimpleThoughts) dataset. It uses explicit `<think>...</think>`
chain-of-thought traces to reason through intuitive physics, logic, causal inference, and other everyday phenomena before producing an
answer.
## Model Details
### Architecture
| Parameter | Value |
|---|---|
| Parameters | ~165M |
| Layers | 24 |
| Model dimension | 768 |
| FFN hidden dim | 2048 (SwiGLU) |
| Attention heads | 6 |
| KV heads (GQA) | 2 |
| Head dimension | 128 |
| Context length | 4096 tokens |
| Vocabulary size | 16,384 (BPE) |
| Positional encoding | RoPE (θ = 10,000) |
| Normalization | RMSNorm (ε = 1e-6) |
| Tied embeddings | Yes |
**Key design choices:** Grouped-Query Attention (GQA) with 3:1 head ratio for efficient KV memory, SwiGLU activations, pre-norm
architecture, and bf16 mixed-precision training throughout.
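As a sanity check, the quoted ~165M figure can be reproduced from the table above. The sketch below is a rough back-of-the-envelope count derived only from the listed dimensions (tied embeddings counted once, three SwiGLU projections, GQA with 6 query / 2 KV heads); the small RMSNorm parameters are omitted:

```python
# Rough parameter count from the architecture table (norm params omitted; negligible).
d_model, n_layers, vocab = 768, 24, 16_384
ffn_hidden = 2048
n_heads, n_kv_heads, head_dim = 6, 2, 128

embed = vocab * d_model                        # tied input/output embedding: counted once
q_proj = d_model * n_heads * head_dim          # 768 -> 768
kv_proj = 2 * d_model * n_kv_heads * head_dim  # K and V: 768 -> 256 each (GQA saving)
o_proj = n_heads * head_dim * d_model          # 768 -> 768
attn = q_proj + kv_proj + o_proj

ffn = 3 * d_model * ffn_hidden                 # SwiGLU: gate, up, and down projections

total = embed + n_layers * (attn + ffn)
print(f"{total / 1e6:.1f}M parameters")        # ≈ 163.6M, consistent with the quoted ~165M
```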
### Training Pipeline
The model was trained sequentially across four stages using the [DotLM framework](https://github.com/shanmukh05/DotLM):
| Stage | Dataset | Samples | Objective |
|---|---|---|---|
| Pretraining | SimpleThoughts/pretrain | 352,214 | Next-token prediction |
| SFT | SimpleThoughts/sft | 25,788 | ChatML instruction following |
| Alignment | SimpleThoughts/alignment | 7,172 | Reference-free DPO (SimPO-style) |
| Reasoning | SimpleThoughts/reasoning | 6,300 | Chain-of-thought with `<think>` traces |
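The alignment stage's reference-free DPO objective can be sketched as follows. This is an illustrative implementation of the SimPO-style loss, not the framework's actual training code; the hyperparameter values `beta` and `gamma` are placeholders. The key idea is that the implicit reward is the length-normalized log-likelihood of a response under the policy, so no reference model is needed:

```python
import math

def simpo_loss(logp_chosen, len_chosen, logp_rejected, len_rejected,
               beta=2.0, gamma=0.5):
    """SimPO-style reference-free preference loss (illustrative sketch).

    logp_*: summed token log-probabilities of each response under the policy.
    len_*:  number of tokens in each response.
    The implicit reward is the length-normalized log-likelihood, which removes
    the reference model that standard DPO requires.
    """
    reward_chosen = beta * logp_chosen / len_chosen
    reward_rejected = beta * logp_rejected / len_rejected
    margin = reward_chosen - reward_rejected - gamma  # gamma: target reward margin
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# A chosen response with higher per-token likelihood yields a small loss.
loss = simpo_loss(logp_chosen=-12.0, len_chosen=10,
                  logp_rejected=-40.0, len_rejected=16)
```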
### Special Tokens
| Token | Purpose |
|---|---|
| `<\|im_start\|>` | Start of turn (BOS) |
| `<\|im_end\|>` | End of turn |
| `<think>` | Begin reasoning trace |
| `</think>` | End reasoning trace |
| `<endoftext>` | End of sequence (EOS) |
| `<pad>` | Padding |
## Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "tensorfiend/DotLM-165M"
device = "cuda" if torch.cuda.is_available() else "cpu"

# The custom architecture requires trust_remote_code=True.
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
).to(device)

# ChatML prompt; the trailing <think> primes the model to emit its reasoning trace.
user_query = "If a ball is placed inside a box and the box is sealed, where is the ball?"
prompt = f"<|im_start|>user\n{user_query}<|im_end|>\n<|im_start|>assistant\n<think>"

inputs = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_k=50,
    do_sample=True,
    eos_token_id=tokenizer.eos_token_id,
)
# skip_special_tokens=False keeps the <think>...</think> trace visible in the output.
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```
### Prompt Format
DotLM uses the ChatML format with an explicit reasoning prefix:
```
<|im_start|>user
{your question}<|im_end|>
<|im_start|>assistant
<think>
{model reasons here}
</think>
{final answer}
```
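Because the reasoning trace is part of the generated text, downstream code typically splits it from the final answer. A minimal sketch of such a parser (the helper name is illustrative, not part of the library):

```python
import re

def split_reasoning(generated: str) -> tuple[str, str]:
    """Split a DotLM completion into (reasoning trace, final answer).

    Expects the ChatML format shown above; strips turn-delimiter tokens
    from the answer. Returns an empty trace if no <think> block is found.
    """
    match = re.search(r"<think>(.*?)</think>", generated, flags=re.DOTALL)
    trace = match.group(1).strip() if match else ""
    # Everything after </think> (or the whole string, if absent) is the answer.
    answer = generated.split("</think>", 1)[-1]
    answer = answer.replace("<|im_end|>", "").replace("<endoftext>", "").strip()
    return trace, answer

trace, answer = split_reasoning(
    "<think>\nThe box is sealed, so the ball stays inside.\n</think>\n"
    "The ball is inside the box.<|im_end|>"
)
```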
## Performance & Limitations
- Scale: At 165M parameters, DotLM is a research-scale model. It is not competitive with large-scale LLMs on general benchmarks.
- Domain: The model is specialized on thought experiments — intuitive physics, causal reasoning, spatial reasoning, theory of mind, and
related domains. It may underperform on unrelated topics.
- Reasoning quality: The chain-of-thought traces are coherent on in-distribution thought experiments but may hallucinate or ramble on
out-of-distribution inputs.
- Context: Maximum context length is 4,096 tokens.
- Safety: No RLHF safety training was applied. Not suitable for deployment in user-facing products without additional safety measures.
## Training Details
Check out the blog post for full training details: [DotLM - An end-to-end trained 165M model](https://www.tensorwrites.com/) (coming soon)
## Related Resources
- Dataset: [SimpleThoughts](https://huggingface.co/datasets/tensorfiend/SimpleThoughts)
- Training code: [DotLM](https://github.com/shanmukh05/DotLM) (coming soon)
## Citation
```bibtex
@misc{dotlm2026,
  author    = {Shanmukh},
  title     = {DotLM-165M: A Minimal Reasoning Language Model Trained on Thought Experiments},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/tensorfiend/DotLM-165M}
}
```
## License
This model is released under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).