---
language:
- en
license: apache-2.0
tags:
- causal-lm
- reasoning
- thought-experiments
- chain-of-thought
- sft
- dpo
- alignment
- small-language-model
- custom-architecture
base_model: tensorfiend/DotLM-165M
datasets:
- tensorfiend/SimpleThoughts
pipeline_tag: text-generation
library_name: transformers
---

# DotLM
|
|
DotLM is a minimal 165M-parameter transformer, trained from scratch entirely on the
[SimpleThoughts](https://huggingface.co/datasets/tensorfiend/SimpleThoughts) dataset. It uses explicit `<think>...</think>`
chain-of-thought traces to reason through intuitive physics, logic, causal inference, and other everyday phenomena before producing an
answer.
|
|
| ## Model Details |
|
|
| ### Architecture |
|
|
| | Parameter | Value | |
| |---|---| |
| | Parameters | ~165M | |
| | Layers | 24 | |
| | Model dimension | 768 | |
| | FFN hidden dim | 2048 (SwiGLU) | |
| | Attention heads | 6 | |
| | KV heads (GQA) | 2 | |
| | Head dimension | 128 | |
| | Context length | 4096 tokens | |
| | Vocabulary size | 16,384 (BPE) | |
| | Positional encoding | RoPE (θ = 10,000) | |
| | Normalization | RMSNorm (ε = 1e-6) | |
| | Tied embeddings | Yes | |
|
|
**Key design choices:** Grouped-Query Attention (GQA) with a 3:1 query-to-KV head ratio for a smaller KV cache, SwiGLU activations, pre-norm
architecture, and bf16 mixed-precision training throughout.
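
As a rough illustration of why GQA matters even at this scale, the sketch below estimates the per-sequence KV-cache footprint from the numbers in the table above (24 layers, head dim 128, 4,096-token context, bf16). Only the head counts come from the table; the rest is standard KV-cache arithmetic, not a measurement of the actual implementation.

```python
# Rough per-sequence KV-cache size, derived from the architecture table.
layers = 24
head_dim = 128
ctx = 4096
bytes_per_val = 2  # bf16

def kv_cache_bytes(n_kv_heads: int) -> int:
    # Keys + values: 2 tensors of shape (layers, ctx, n_kv_heads, head_dim).
    return 2 * layers * ctx * n_kv_heads * head_dim * bytes_per_val

mha = kv_cache_bytes(6)  # hypothetical full multi-head attention (6 KV heads)
gqa = kv_cache_bytes(2)  # GQA as configured (2 KV heads)
print(f"MHA: {mha / 2**20:.0f} MiB, GQA: {gqa / 2**20:.0f} MiB")
# At a 3:1 ratio, the KV cache shrinks to one third of the full-MHA size.
```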
|
|
| ### Training Pipeline |
|
|
| The model was trained sequentially across four stages using the [DotLM framework](https://github.com/shanmukh05/DotLM): |
|
|
| | Stage | Dataset | Samples | Objective | |
| |---|---|---|---| |
| | Pretraining | SimpleThoughts/pretrain | 352,214 | Next-token prediction | |
| | SFT | SimpleThoughts/sft | 25,788 | ChatML instruction following | |
| | Alignment | SimpleThoughts/alignment | 7,172 | Reference-free DPO (SimPO-style) | |
| | Reasoning | SimpleThoughts/reasoning | 6,300 | Chain-of-thought with `<think>` traces | |
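
The alignment stage's objective, reference-free DPO in the SimPO style, scores responses by length-normalized log-likelihood and requires no frozen reference model. The sketch below is an illustrative implementation of that published loss form, not the repository's actual training code; the `beta` and `gamma` values are placeholders.

```python
import torch
import torch.nn.functional as F

def simpo_loss(chosen_logps: torch.Tensor,
               rejected_logps: torch.Tensor,
               chosen_lens: torch.Tensor,
               rejected_lens: torch.Tensor,
               beta: float = 2.0,
               gamma: float = 0.5) -> torch.Tensor:
    """SimPO-style reference-free preference loss.

    chosen_logps / rejected_logps: summed token log-probs of each response.
    chosen_lens / rejected_lens: response lengths, for length normalization.
    """
    # Length-normalized implicit rewards -- no reference model needed.
    r_chosen = beta * chosen_logps / chosen_lens
    r_rejected = beta * rejected_logps / rejected_lens
    # Push the chosen reward above the rejected one by a target margin gamma.
    return -F.logsigmoid(r_chosen - r_rejected - gamma).mean()
```

Because the reward is a per-token average rather than a sum, longer responses are not automatically favored, which is the main practical difference from vanilla DPO.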
|
|
| ### Special Tokens |
|
|
| | Token | Purpose | |
| |---|---| |
| | `<\|im_start\|>` | Start of turn (BOS) | |
| | `<\|im_end\|>` | End of turn | |
| | `<think>` | Begin reasoning trace | |
| | `</think>` | End reasoning trace | |
| | `<endoftext>` | End of sequence (EOS) | |
| | `<pad>` | Padding | |
|
|
| ## Usage |
|
|
| ```python |
| import torch |
| from transformers import AutoTokenizer, AutoModelForCausalLM |
| |
| repo_id = "tensorfiend/DotLM-165M" |
| device = "cuda" if torch.cuda.is_available() else "cpu" |
| |
| tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True) |
| model = AutoModelForCausalLM.from_pretrained( |
| repo_id, |
| trust_remote_code=True, |
| torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32, |
| ).to(device) |
| |
| user_query = "If a ball is placed inside a box and the box is sealed, where is the ball?" |
| |
| prompt = f"<|im_start|>user\n{user_query}<|im_end|>\n<|im_start|>assistant\n<think>" |
| |
| inputs = tokenizer(prompt, return_tensors="pt").to(device) |
| |
| outputs = model.generate( |
| **inputs, |
| max_new_tokens=512, |
| temperature=0.7, |
| top_k=50, |
| do_sample=True, |
| eos_token_id=tokenizer.eos_token_id, |
| ) |
| |
| print(tokenizer.decode(outputs[0], skip_special_tokens=False)) |
| ``` |
|
|
| ### Prompt Format |
|
|
| DotLM uses the ChatML format with an explicit reasoning prefix: |
|
|
| ``` |
| <|im_start|>user |
| {your question}<|im_end|> |
| <|im_start|>assistant |
| <think> |
| {model reasons here} |
| </think> |
| {final answer} |
| ``` |
|
|
| ## Performance & Limitations |
|
|
- **Scale:** At 165M parameters, DotLM is a research-scale model. It is not competitive with large-scale LLMs on general benchmarks.
- **Domain:** The model is specialized on thought experiments — intuitive physics, causal reasoning, spatial reasoning, theory of mind, and
  related domains. It may underperform on unrelated topics.
- **Reasoning quality:** The chain-of-thought traces are coherent on in-distribution thought experiments but may hallucinate or ramble on
  out-of-distribution inputs.
- **Context:** Maximum context length is 4,096 tokens.
- **Safety:** No RLHF safety training was applied. Not suitable for deployment in user-facing products without additional safety measures.
|
|
| ## Training Details |
|
|
Check out the blog post for full training details: [DotLM - An end-to-end trained 165M model](https://www.tensorwrites.com/) (coming soon)
|
|
## Related Resources
|
|
| - Dataset: [SimpleThoughts](https://huggingface.co/datasets/tensorfiend/SimpleThoughts) |
| - Training code: [DotLM](https://github.com/shanmukh05/DotLM) (coming soon) |
|
|
| ## Citation |
|
|
```bibtex
@misc{dotlm2026,
  author = {Shanmukh},
  title = {DotLM-165M: A Minimal Reasoning Language Model Trained on Thought Experiments},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/tensorfiend/DotLM-165M}
}
```
|
|
| ## License |
|
|
This model is released under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).