---
language:
- en
license: apache-2.0
tags:
- causal-lm
- reasoning
- thought-experiments
- chain-of-thought
- sft
- dpo
- alignment
- small-language-model
- custom-architecture
base_model: tensorfiend/DotLM-165M
datasets:
- tensorfiend/SimpleThoughts
pipeline_tag: text-generation
library_name: transformers
---

# DotLM

DotLM is a minimal 165M-parameter transformer trained from scratch entirely on the
[SimpleThoughts](https://huggingface.co/datasets/tensorfiend/SimpleThoughts) dataset. It uses explicit `<think>...</think>`
chain-of-thought traces to reason through intuitive physics, logic, causal inference, and other everyday phenomena before producing an
answer.

## Model Details

### Architecture

| Parameter | Value |
|---|---|
| Parameters | ~165M |
| Layers | 24 |
| Model dimension | 768 |
| FFN hidden dim | 2048 (SwiGLU) |
| Attention heads | 6 |
| KV heads (GQA) | 2 |
| Head dimension | 128 |
| Context length | 4096 tokens |
| Vocabulary size | 16,384 (BPE) |
| Positional encoding | RoPE (θ = 10,000) |
| Normalization | RMSNorm (ε = 1e-6) |
| Tied embeddings | Yes |

**Key design choices:** Grouped-Query Attention (GQA) with 3:1 head ratio for efficient KV memory, SwiGLU activations, pre-norm
architecture, and bf16 mixed-precision training throughout.
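As a sanity check, the quoted ~165M parameter count can be roughly reproduced from the table above. This is a back-of-envelope sketch that ignores RMSNorm weights and assumes bias-free projections with tied embeddings, so the exact total depends on implementation details:

```python
# Rough parameter count from the architecture table (norms and biases omitted).
d_model, n_layers, ffn_dim = 768, 24, 2048
n_heads, n_kv_heads, head_dim = 6, 2, 128
vocab = 16_384

embed = vocab * d_model                        # input embedding, tied with LM head
attn = (d_model * n_heads * head_dim           # Q projection
        + 2 * d_model * n_kv_heads * head_dim  # K and V projections (GQA)
        + n_heads * head_dim * d_model)        # output projection
ffn = 3 * d_model * ffn_dim                    # SwiGLU: gate, up, and down projections
total = embed + n_layers * (attn + ffn)
print(f"{total / 1e6:.1f}M")                   # ≈ 163.6M, consistent with the quoted ~165M
```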

### Training Pipeline

The model was trained sequentially across four stages using the [DotLM framework](https://github.com/shanmukh05/DotLM):

| Stage | Dataset | Samples | Objective |
|---|---|---|---|
| Pretraining | SimpleThoughts/pretrain | 352,214 | Next-token prediction |
| SFT | SimpleThoughts/sft | 25,788 | ChatML instruction following |
| Alignment | SimpleThoughts/alignment | 7,172 | Reference-free DPO (SimPO-style) |
| Reasoning | SimpleThoughts/reasoning | 6,300 | Chain-of-thought with `<think>` traces |
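The alignment stage uses a reference-free, SimPO-style objective: instead of comparing against a frozen reference model as in standard DPO, it contrasts length-normalized policy log-probabilities of chosen and rejected responses with a target margin. A minimal sketch of such a loss (the function name, `beta`, and `gamma` values here are illustrative assumptions, not the training repo's actual hyperparameters):

```python
import torch
import torch.nn.functional as F

def simpo_loss(logp_chosen, logp_rejected, len_chosen, len_rejected,
               beta=2.0, gamma=0.5):
    """Reference-free preference loss (SimPO-style sketch).

    logp_* are the summed token log-probs of each response under the policy;
    len_* are the response lengths used for length normalization.
    """
    reward_chosen = beta * logp_chosen / len_chosen      # avg log-prob, scaled
    reward_rejected = beta * logp_rejected / len_rejected
    # Push chosen above rejected by at least the margin gamma.
    return -F.logsigmoid(reward_chosen - reward_rejected - gamma).mean()
```

Because no reference model is involved, this stage needs only a single forward pass per response, which keeps the alignment step cheap at this model scale.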

### Special Tokens

| Token | Purpose |
|---|---|
| `<\|im_start\|>` | Start of turn (BOS) |
| `<\|im_end\|>` | End of turn |
| `<think>` | Begin reasoning trace |
| `</think>` | End reasoning trace |
| `<endoftext>` | End of sequence (EOS) |
| `<pad>` | Padding |

## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "tensorfiend/DotLM-165M"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
).to(device)

user_query = "If a ball is placed inside a box and the box is sealed, where is the ball?"

prompt = f"<|im_start|>user\n{user_query}<|im_end|>\n<|im_start|>assistant\n<think>"

inputs = tokenizer(prompt, return_tensors="pt").to(device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_k=50,
    do_sample=True,
    eos_token_id=tokenizer.eos_token_id,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```

### Prompt Format

DotLM uses the ChatML format with an explicit reasoning prefix:

```
<|im_start|>user
{your question}<|im_end|>
<|im_start|>assistant
<think>
{model reasons here}
</think>
{final answer}
```
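Since the model emits its reasoning trace before the answer, downstream code typically splits the generation on the `</think>` delimiter. A hypothetical helper (not part of the DotLM repo) might look like:

```python
def split_reasoning(text: str) -> tuple[str, str]:
    """Split generated text into (reasoning trace, final answer) at </think>."""
    thought, sep, answer = text.partition("</think>")
    if not sep:  # no closing tag emitted: treat the whole output as the answer
        return "", text.strip()
    return thought.replace("<think>", "").strip(), answer.strip()
```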

## Performance & Limitations

- Scale: At 165M parameters, DotLM is a research-scale model. It is not competitive with large-scale LLMs on general benchmarks.
- Domain: The model is specialized on thought experiments — intuitive physics, causal reasoning, spatial reasoning, theory of mind, and
related domains. It may underperform on unrelated topics.
- Reasoning quality: The chain-of-thought traces are coherent on in-distribution thought experiments but may hallucinate or ramble on
out-of-distribution inputs.
- Context: Maximum context length is 4,096 tokens.
- Safety: No RLHF safety training was applied. Not suitable for deployment in user-facing products without additional safety measures.

## Training Details

Check out the blog post for full training details: [DotLM - An end-to-end trained 165M model](https://www.tensorwrites.com/) (coming soon)

## Related Resources

- Dataset: [SimpleThoughts](https://huggingface.co/datasets/tensorfiend/SimpleThoughts)
- Training code: [DotLM](https://github.com/shanmukh05/DotLM) (coming soon)

## Citation

```bibtex
@misc{dotlm2026,
  author    = {Shanmukh},
  title     = {DotLM-165M: A Minimal Reasoning Language Model Trained on Thought Experiments},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/tensorfiend/DotLM-165M}
}
```

## License

This model is released under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).