	---
	license: mit
	datasets:
	- wikimedia/wikipedia
	- AxionLab-official/ThinkSet-PTBR
	language:
	- pt
	pipeline_tag: text-generation
	library_name: transformers
	---
	# 🧠 NanoThink-5M

	> A 5M-parameter language model trained from scratch on Portuguese text and a "thinking" dataset to simulate structured reasoning.

	---

	## 🚀 Overview

	NanoThink-5M is an ultra-lightweight (~5M parameters) transformer model designed to explore the limits of reasoning behavior in small-scale neural networks.

	Built entirely from scratch, it runs efficiently on CPU and focuses on generating structured reasoning outputs in Portuguese.

	---

	## 💡 Key Idea

	> How far can a tiny model go in simulating reasoning?

	NanoThink-5M does not truly reason — instead, it learns to imitate reasoning patterns through structured training.

	---

	## 🧠 Capabilities

	* Generates step-by-step reasoning (`<THINK>`)
	* Produces structured answers (`<ANSWER>`)
	* Handles simple arithmetic and logic patterns
	* Fully CPU-compatible

	---

	## ⚙️ Model Details

	* Architecture: Causal Transformer (GPT-style)
	* Parameters: ~5M
	* Layers: 4
	* Heads: 4
	* Embedding size: 128
	* Context length: 256 tokens
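
As a rough sanity check on the ~5M figure, the sketch below counts parameters for a standard GPT-style layout using the hyperparameters listed above. The vocabulary size is not stated in this card, so `vocab_size=32_000` is a placeholder assumption. The 4x MLP expansion, biases on every linear layer, learned position embeddings, and tied input/output embeddings are also assumptions; the actual NanoThink implementation may differ.

```python
def gpt_param_count(vocab_size, d_model, n_layers, context_len, tie_embeddings=True):
    """Estimate the parameter count of a GPT-style decoder (assumed standard layout)."""
    token_emb = vocab_size * d_model
    pos_emb = context_len * d_model               # learned position embeddings (assumption)
    attn = 4 * (d_model * d_model + d_model)      # Q, K, V and output projections, with biases
    mlp = 2 * (d_model * 4 * d_model) + 4 * d_model + d_model  # 4x expansion (assumption)
    norms = 2 * (2 * d_model)                     # two LayerNorms per block (weight + bias)
    block = attn + mlp + norms
    final_norm = 2 * d_model
    lm_head = 0 if tie_embeddings else vocab_size * d_model
    return token_emb + pos_emb + n_layers * block + final_norm + lm_head


# vocab_size=32_000 is a hypothetical value, not taken from the model card.
total = gpt_param_count(vocab_size=32_000, d_model=128, n_layers=4, context_len=256)
print(f"{total / 1e6:.2f}M parameters")
```

Under these assumptions the four transformer blocks account for under 1M parameters, so most of the budget sits in the token embedding table. The number of heads does not change the count, since the heads split the same projection matrices.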

	---

	## 🏗️ Training Pipeline

	### 1. Tokenizer

	Custom tokenizer trained from scratch.

	### 2. Pretraining

	* Portuguese text corpus
	* Language modeling objective
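
The card does not spell out the language modeling objective. The usual choice for a causal model is next-token prediction, where each position is trained to predict the token that follows it. A minimal pure-Python sketch, assuming that objective (the token IDs and probabilities are made up for illustration):

```python
import math


def next_token_pairs(token_ids):
    """Pair each token with the one that follows it (targets are inputs shifted by one)."""
    return list(zip(token_ids[:-1], token_ids[1:]))


def average_nll(probs_per_position, targets):
    """Mean negative log-likelihood of the target token at each position."""
    losses = [-math.log(probs[t]) for probs, t in zip(probs_per_position, targets)]
    return sum(losses) / len(losses)


# Hypothetical token IDs for a 4-token sequence.
ids = [5, 9, 2, 7]
pairs = next_token_pairs(ids)

# Hypothetical predictions over a 10-token vocabulary: probability 0.5 on the
# correct next token, the remaining 0.5 spread evenly over the other 9 tokens.
probs = []
for _, target in pairs:
    dist = [0.5 / 9] * 10
    dist[target] = 0.5
    probs.append(dist)

loss = average_nll(probs, [target for _, target in pairs])
print(pairs)
print(round(loss, 4))
```

With probability 0.5 on every correct token, the loss equals ln 2; training pushes it lower by raising the probability of the observed next token.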

	### 3. Fine-tuning

	* Synthetic reasoning dataset
	* Tasks include:
	  * Arithmetic
	  * Logical comparisons
	  * Multi-step problems

	Structured format:

	```text
	<USER> ... </USER>
	<THINK> ... </THINK>
	<ANSWER> ... </ANSWER>
	<END>
	```
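
A training sample in this format could be assembled as sketched below. The helper names and the question template are illustrative assumptions, not taken from the ThinkSet-PTBR dataset. The closing tags are written as `</USER>`, `</THINK>`, and `</ANSWER>`, matching the example output and usage code in this card.

```python
def format_sample(question, reasoning, answer):
    """Wrap one example in the USER / THINK / ANSWER structure."""
    return (
        f"<USER>\n{question}\n</USER>\n"
        f"<THINK>\n{reasoning}\n</THINK>\n"
        f"<ANSWER>\n{answer}\n</ANSWER>\n"
        "<END>"
    )


def make_addition_sample(name, start, gained):
    """Build a synthetic one-step addition problem (hypothetical template)."""
    total = start + gained
    question = f"{name} tem {start} maçãs e ganhou {gained}, quantas ele tem agora?"
    reasoning = f"{start} + {gained} = {total}"
    answer = f"{name} tem {total} maçãs."
    return format_sample(question, reasoning, answer)


sample = make_addition_sample("João", 3, 2)
print(sample)
```

Generating the reasoning line from the same numbers as the answer keeps the two consistent within each synthetic sample.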

	---

	## 📊 Example

	Input:

	```text
	João tem 3 maçãs e ganhou 2, quantas ele tem agora?
	```

	Output:

	```text
	<THINK>
	3 + 2 = 5
	</THINK>
	<ANSWER>
	João tem 5 maçãs.
	</ANSWER>
	```
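
Output in this shape can be split into its reasoning and answer parts with a small parser. This is a sketch: `parse_output` is a hypothetical helper, and it returns `None` for any part whose tags are missing or unclosed.

```python
import re


def parse_output(text):
    """Extract the <THINK> and <ANSWER> contents from generated text."""
    def grab(tag):
        # DOTALL lets "." match the newlines inside a block.
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        return match.group(1).strip() if match else None

    return {"think": grab("THINK"), "answer": grab("ANSWER")}


generated = "<THINK>\n3 + 2 = 5\n</THINK>\n<ANSWER>\nJoão tem 5 maçãs.\n</ANSWER>"
parsed = parse_output(generated)
print(parsed)
```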

	---

	## ⚠️ Limitations

	* Not reliable for precise mathematical reasoning
	* May generate inconsistent intermediate steps
	* Reasoning is simulated, not grounded

	> This model demonstrates the appearance of reasoning, not true reasoning.

	---

	## 🧪 Research Insight

	NanoThink-5M highlights an important phenomenon:

	> Small models can learn to look intelligent before being intelligent.

	This reinforces the distinction between:

	* Simulated reasoning
	* Actual reasoning

	---

	## 💻 Usage

	```python
	import torch
	from tokenizers import Tokenizer
	from model import NanoThink
	from safetensors.torch import load_file

	MODEL_PATH = "model.safetensors"
	TOKENIZER_PATH = "tokenizer.json"


	tokenizer = Tokenizer.from_file(TOKENIZER_PATH)

	model = NanoThink(vocab_size=tokenizer.get_vocab_size())
	model.load_state_dict(load_file(MODEL_PATH))
	model.eval()

	history = ""

	while True:
	    user_input = input("You: ")

	    if user_input.lower() in ["get out", "exit", "quit"]:
	        break

	    prompt = history + f"\n<USER>\n{user_input}\n</USER>\n"

	    input_ids = torch.tensor([tokenizer.encode(prompt).ids])

	    output_ids = []

	    # Sample up to 120 new tokens, one at a time, without tracking gradients.
	    with torch.no_grad():
	        for _ in range(120):
	            # Feed only the last 256 tokens, the context length listed in Model Details.
	            logits = model(input_ids[:, -256:])
	            next_token = torch.multinomial(torch.softmax(logits[0, -1], dim=-1), 1).item()

	            input_ids = torch.cat([input_ids, torch.tensor([[next_token]])], dim=1)
	            output_ids.append(next_token)

	            text = tokenizer.decode(output_ids)

	            # Stop as soon as the answer block is closed.
	            if "</ANSWER>" in text:
	                break

	    output = tokenizer.decode(output_ids)

	    # Keep only the text between the answer tags, if present.
	    if "<ANSWER>" in output:
	        output = output.split("<ANSWER>")[1].split("</ANSWER>")[0]

	    print("\n💬 Answer:")
	    print(output.strip())
	    print("\n" + "-"*50 + "\n")

	    # Carry the exchange forward so later turns see earlier ones.
	    history += f"\n<USER>\n{user_input}\n</USER>\n<ANSWER>\n{output.strip()}\n</ANSWER>\n"
	```
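
The loop above samples from the unmodified softmax distribution. One common way to trade diversity for consistency is to divide the logits by a temperature before the softmax. This is a general decoding technique, not something the card says NanoThink uses. A minimal pure-Python sketch with made-up logits:

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                        # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical next-token logits
default = softmax_with_temperature(logits, temperature=1.0)
sharper = softmax_with_temperature(logits, temperature=0.5)

# The most likely token gains probability mass as the temperature drops.
print(round(default[0], 3), round(sharper[0], 3))
```

In the PyTorch loop, the equivalent change would be `torch.softmax(logits[0, -1] / temperature, dim=-1)`, with `temperature` as a new, hypothetical setting.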

	---

	## 🔮 Future Work

	* Scaling to 10M–50M parameters
	* Improving dataset quality and training techniques
	* Enhancing reasoning consistency
	* Multilingual support


	---

	## 🤝 Contributions

	This is an experimental project; contributions and ideas are welcome.

	---

	## 📜 License

	MIT

	---

	## 🧠 Author

	AxionLab Co.

	Independent research project exploring the limits of small language models.

	---

	## ⭐ Final Thought

	> Intelligence can be mimicked at small scale — but not yet achieved.

	NanoThink-5M is a step toward understanding that boundary.