karti06k/pycoder-300m-slm

	# PyCoder-300M: The "Cargo Cult" Coder 🚀

	A 300M-parameter Python coding model built entirely from scratch on Kaggle TPUs.

	| Checkpoint | HumanEval Score | What It Knows |
	|------------|-----------------|---------------|
	| Stage 1 (Base Model) | 0.00% | Perfect Python syntax, zero logic |
	| Stage 2 (Instruct Model) | 0.00% | Perfect instruction formatting, beautiful reasoning blocks... still zero logic |

	---

	## 📖 The Story

	I spent three weeks building this model from absolute scratch—not fine-tuning a pretrained model, but building the entire pipeline: custom tokenizer, custom architecture, distributed TPU training, everything.

	Stage 1: I trained it on 8 billion tokens of Python code scraped from GitHub. It learned perfect formatting—pristine indentation, beautiful docstrings, proper type hints. But the logic? Complete hallucination. It would see `def two_sum(nums, target):` and confidently return `len(nums)`. Every. Single. Time.

	It learned Python the way a parrot learns language: perfect mimicry, zero comprehension.

	Stage 2: I thought, "Maybe it just needs to learn how to think!" So, I instruction-tuned it on 59k high-quality reasoning problems generated by Qwen2.5-Coder-32B. I taught it a strict format: read the instruction, write out the reasoning, then write the code.

	The Result: It learned the exact format! It now outputs a beautiful `# REASONING:` block where it confidently hallucinates absolute nonsense, followed by flawlessly indented code that completely fails the unit tests.

	HumanEval score: still 0.00%. This is the brutal reality of the data scaling wall: 8 billion tokens is nowhere near enough. To get a 300M-parameter model to actually reason zero-shot, you likely need trillions of high-quality tokens.
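To make the 0.00% concrete, here is a minimal sketch of a HumanEval-style functional check. It is not the evaluation code from the notebooks: the helper `passes`, both candidate strings, and the test cases are made up for illustration. A real harness also sandboxes execution and enforces timeouts, which this toy version skips.

```python
def passes(candidate_src, entry_point, test_cases):
    """Return True only if the candidate defines entry_point and passes every case."""
    namespace = {}
    try:
        exec(candidate_src, namespace)  # toy only: never exec untrusted code unsandboxed
        fn = namespace[entry_point]
        return all(fn(*args) == expected for args, expected in test_cases)
    except Exception:
        return False


# Well-formatted but logically wrong, like the behaviour described above.
cargo_cult = '''
def two_sum(nums, target):
    """Return the indices of the two numbers that add up to target."""
    return len(nums)
'''

# A hypothetical correct solution, for contrast.
correct = '''
def two_sum(nums, target):
    """Return the indices of the two numbers that add up to target."""
    seen = {}
    for i, n in enumerate(nums):
        if target - n in seen:
            return [seen[target - n], i]
        seen[n] = i
    return []
'''

tests = [(([2, 7, 11, 15], 9), [0, 1]), (([3, 2, 4], 6), [1, 2])]

results = [passes(src, "two_sum", tests) for src in (cargo_cult, correct)]
pass_at_1 = sum(results) / len(results)  # fraction of candidates that pass
print(results, pass_at_1)
```

The check only looks at behaviour, so clean indentation and docstrings earn nothing if the returned value is wrong.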

	But as an engineering project? It was a massive success. With the help of AI (Gemini and Claude), I tried to build this from the ground up using what I know about Transformers, and the distributed XLA pipeline works perfectly.

	---

	## 🏗️ Technical Architecture

	Model Specifications:
	- Parameters: 300M (24 layers × 1024 hidden dim)
	- Attention: MLA (Multi-Head Latent Attention) with QK-Norm
	- Positional Encoding: RoPE (Rotary Position Embeddings)
	- FFN: SwiGLU activation
	- Context Length: 4096 tokens
	- Tokenizer: Custom 32k BPE trained purely on Python
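As a rough sanity check on the 300M figure, the parameter count can be estimated from the listed dimensions. This is a back-of-the-envelope sketch, not the real architecture: it assumes plain multi-head attention (four `d_model × d_model` projections) as a stand-in for MLA, a SwiGLU hidden size of about `8/3 × d_model`, a vocabulary of exactly 32,000, and tied input/output embeddings. None of these values are stated in this README.

```python
def estimate_params(n_layers, d_model, d_ff, vocab_size, tie_embeddings=True):
    """Rough parameter count for a decoder-only transformer (assumed layout)."""
    attn = 4 * d_model * d_model   # Q, K, V, O projections (plain MHA stand-in for MLA)
    ffn = 3 * d_model * d_ff       # SwiGLU: gate, up, and down projections
    norms = 2 * d_model            # two RMSNorm gain vectors per block
    blocks = n_layers * (attn + ffn + norms)
    embed = vocab_size * d_model
    head = 0 if tie_embeddings else vocab_size * d_model
    final_norm = d_model
    return blocks, blocks + embed + head + final_norm


d_model = 1024
d_ff = int(8 * d_model / 3)        # assumed SwiGLU width, not from the README
blocks, total = estimate_params(24, d_model, d_ff, vocab_size=32_000)
print(f"transformer blocks: {blocks / 1e6:.0f}M, with embeddings: {total / 1e6:.0f}M")
```

Under these assumptions the transformer blocks alone land close to 300M, with the embedding table adding roughly another 33M on top. The real number depends on the MLA projection sizes and the actual FFN width.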

	Training Infrastructure:
	- Hardware: 8× TPU v5e cores (Kaggle free tier)
	- Optimizers: Muon (for weight matrices) + AdamW (for biases/norms)
	- Precision: bfloat16 with FP32-safe RMSNorm
	- Stage 1: 127k steps on 8B tokens → PPL 3.0
	- Stage 2: 690 steps (3 epochs) on 59k instructions → PPL 5.2
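A few derived quantities follow directly from the numbers above. The sketch below assumes that the reported perplexity is `exp` of the mean per-token cross-entropy in nats and that "59k" means roughly 59,000 examples; both are assumptions, and the function name is made up for illustration.

```python
import math


def ppl_to_loss(ppl):
    """Mean cross-entropy (nats per token), assuming PPL = exp(loss)."""
    return math.log(ppl)


stage1_loss = ppl_to_loss(3.0)
stage2_loss = ppl_to_loss(5.2)

# Stage 1: average tokens consumed per optimizer step.
tokens_per_step = 8e9 / 127_000

# Stage 2: 690 steps over 3 epochs of about 59,000 examples.
steps_per_epoch = 690 // 3
examples_per_step = 59_000 / steps_per_epoch

print(f"Stage 1 loss ~ {stage1_loss:.3f} nats, Stage 2 loss ~ {stage2_loss:.3f} nats")
print(f"Stage 1 tokens per step ~ {tokens_per_step:,.0f}")
print(f"Stage 2 steps per epoch = {steps_per_epoch}, examples per step ~ {examples_per_step:.1f}")
```

The Stage 2 figure works out to about 256 examples per step, which would be consistent with a global batch size of 256, though the README does not state the batch size.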

	Key Design Choices:
	- Prompt format: `# INSTRUCTION:` and `# REASONING:` markers, written as Python comments to suit the Python-only BPE tokenizer
	- Masked loss: the loss is computed only on the model's reasoning and code tokens, not on the user instruction
	- 10x lower learning rate for Stage 2 than for Stage 1 (to limit catastrophic forgetting)
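The prompt format and loss masking can be sketched as follows. This is an illustration, not the code from the Stage 2 notebook: the exact template (spacing, newlines), the `build_example` helper, and the byte-level `toy_encode` stand-in for the real BPE tokenizer are all assumptions. The value `-100` is the default `ignore_index` of PyTorch's cross-entropy loss, so label positions set to it contribute nothing to the loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def toy_encode(text):
    """Stand-in tokenizer (UTF-8 bytes); the real model uses its custom BPE."""
    return list(text.encode("utf-8"))


def build_example(instruction, response):
    """Return (input_ids, labels) with the instruction part masked out of the loss."""
    prompt = f"# INSTRUCTION: {instruction}\n# REASONING:"  # hypothetical template
    prompt_ids = toy_encode(prompt)
    response_ids = toy_encode(" " + response)
    input_ids = prompt_ids + response_ids
    # Prompt positions get IGNORE_INDEX; only reasoning + code tokens are supervised.
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return input_ids, labels


input_ids, labels = build_example(
    "Add two numbers.",
    "Sum the inputs.\ndef add(a, b):\n    return a + b",
)
n_masked = sum(1 for t in labels if t == IGNORE_INDEX)
print(f"{len(input_ids)} tokens, {n_masked} masked, {len(labels) - n_masked} supervised")
```

The usual one-position shift between inputs and next-token targets is left to the training loop here.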

	---

	## 📂 What's In This Repo

	| File | Size | Description |
	|------|------|-------------|
	| `checkpoint before instruction traning.pt` | 3.05 GB | Stage 1 Base Model - 127k steps, 0% HumanEval |
	| `after instruction trannig.pt` | 1.4 GB | Stage 2 Instruct Model - after instruction tuning on 59k examples |
	| `tokenizer_100_python.json` | 2.21 MB | Custom BPE tokenizer (required!) |
	| `vocab.json` / `merges.txt` | 748 KB | Tokenizer vocab files |
	| `stage1.75 and hman eval after that all code.ipynb` | 58.8 KB | Stage 1 training loop + HumanEval code |
	| `training of 59k instruction set traning stage 2.ipynb` | 72.8 KB | Stage 2 instruction tuning code |

	---

	## 💻 How to Load and Use

	### ⚠️ Important Note
	This is a custom architecture built from scratch. You cannot use `transformers.AutoModel.from_pretrained()`. You must load the model class from the Jupyter notebook.

	### Step 1: Load the Tokenizer
	```python
	from tokenizers import Tokenizer

	# Load the custom tokenizer
	tokenizer = Tokenizer.from_file("tokenizer_100_python.json")

	# Test it
	text = "def fibonacci(n):"
	encoded = tokenizer.encode(text)
	print(f"Tokens: {encoded.ids}")
	print(f"Decoded: {tokenizer.decode(encoded.ids)}")
	```