| # PyCoder-300M: The "Cargo Cult" Coder 🚀 |
|
|
| **A 300M-parameter Python coding model built entirely from scratch on Kaggle TPUs** |
|
|
| | Checkpoint | HumanEval Score | What It Knows | |
| |------------|----------------|---------------| |
| | Stage 1 (Base Model) | 0.00% | Perfect Python syntax, zero logic | |
| | Stage 2 (Instruct Model) | 0.00% | Perfect instruction formatting, beautiful reasoning blocks... still zero logic | |
|
|
| --- |
|
|
| ## 📖 The Story |
|
|
| I spent three weeks building this model from absolute scratch—not fine-tuning a pretrained model, but building the entire pipeline: custom tokenizer, custom architecture, distributed TPU training, everything. |
|
|
| **Stage 1:** I trained it on 8 billion tokens of Python code scraped from GitHub. It learned perfect formatting—pristine indentation, beautiful docstrings, proper type hints. But the logic? Complete hallucination. It would see `def two_sum(nums, target):` and confidently return `len(nums)`. Every. Single. Time. |
|
|
| It learned Python the way a parrot learns language: perfect mimicry, zero comprehension. |
|
|
| **Stage 2:** I thought, *"Maybe it just needs to learn how to think!"* So, I instruction-tuned it on 59k high-quality reasoning problems generated by Qwen2.5-Coder-32B. I taught it a strict format: read the instruction, write out the reasoning, then write the code. |
|
|
| **The Result:** It learned the exact format! It now outputs a beautiful `# REASONING:` block where it confidently hallucinates absolute nonsense, followed by flawlessly indented code that completely fails the unit tests. |
|
|
| **HumanEval score: Still 0.00%.** This is the brutal reality of the Data Scaling Wall. To get a 300M-parameter model to actually *reason* zero-shot, you need far more than 8 billion tokens, likely on the order of trillions of high-quality ones. |
|
|
| But as an engineering project? It was a massive success. With the help of AI (Gemini and Claude), I tried to build this from the ground up using what I know about Transformers, and the distributed XLA pipeline works perfectly. |
|
|
| --- |
|
|
| ## 🏗️ Technical Architecture |
|
|
| **Model Specifications:** |
| - **Parameters:** 300M (24 layers × 1024 hidden dim) |
| - **Attention:** MLA (Multi-Head Latent Attention) with QK-Norm |
| - **Positional Encoding:** RoPE (Rotary Position Embeddings) |
| - **FFN:** SwiGLU activation |
| - **Context Length:** 4096 tokens |
| - **Tokenizer:** Custom 32k BPE trained purely on Python |
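
As a sanity check on the "300M" figure, here is a rough back-of-the-envelope parameter count. It is only a sketch under assumptions that are not stated in this README: standard multi-head attention with four `d × d` projections (MLA uses low-rank latent projections and would come out smaller), a SwiGLU hidden width of about `8/3 × d`, a 32,000-entry vocabulary, and tied input/output embeddings. The function name `estimate_params` is made up for illustration.

```python
# Rough parameter estimate for a 24-layer, 1024-wide decoder.
# ASSUMPTIONS (not from the README): plain multi-head attention instead of MLA,
# SwiGLU hidden width of about 8/3 * d_model, vocabulary of 32,000 entries,
# and input/output embeddings that share weights.

def estimate_params(n_layers=24, d_model=1024, vocab_size=32_000, ffn_hidden=None):
    if ffn_hidden is None:
        ffn_hidden = round(8 * d_model / 3)   # hypothetical SwiGLU width
    attention = 4 * d_model * d_model         # Q, K, V and output projections
    swiglu = 3 * d_model * ffn_hidden         # gate, up and down projections
    norms = 2 * d_model                       # two RMSNorm weight vectors per layer
    per_layer = attention + swiglu + norms
    embeddings = vocab_size * d_model         # counted once (tied weights assumed)
    final_norm = d_model
    return {
        "per_layer": per_layer,
        "all_layers": n_layers * per_layer,
        "embeddings": embeddings,
        "total": n_layers * per_layer + embeddings + final_norm,
    }

estimate = estimate_params()
for name, count in estimate.items():
    print(f"{name:>10}: {count / 1e6:.1f}M")
```

Under these assumptions the 24 transformer blocks alone land close to 300M, with the embedding table adding a few tens of millions on top. The real model's count depends on its actual MLA and FFN dimensions, which live in the notebooks.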
|
|
| **Training Infrastructure:** |
| - **Hardware:** 8× TPU v5e cores (Kaggle free tier) |
| - **Optimizers:** Muon (for weight matrices) + AdamW (for biases/norms) |
| - **Precision:** bfloat16 with FP32-safe RMSNorm |
| - **Stage 1:** 127k steps on 8B tokens → PPL 3.0 |
| - **Stage 2:** 690 steps (3 epochs) on 59k instructions → PPL 5.2 |
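
The figures above can be cross-checked with a little arithmetic. The sketch below assumes the reported perplexity is the exponential of the mean per-token cross-entropy (in nats), that every step consumes the same amount of data, and that the rounded numbers in the list are close to the real ones. The helper names are illustrative only.

```python
import math

def loss_from_perplexity(ppl):
    """Per-token cross-entropy in nats, assuming perplexity = exp(loss)."""
    return math.log(ppl)

def per_step(total, steps):
    """Average amount of data consumed per optimizer step."""
    return total / steps

# Perplexity -> loss (the two stages use different data, so they are not
# directly comparable with each other).
stage1_loss = loss_from_perplexity(3.0)   # about 1.10 nats per token
stage2_loss = loss_from_perplexity(5.2)   # about 1.65 nats per token

# Stage 1: 8B tokens over 127k steps, with a 4096-token context.
s1_tokens = per_step(8e9, 127_000)        # roughly 63k tokens per step
s1_sequences = s1_tokens / 4096           # roughly 15 full-context sequences

# Stage 2: 59k examples x 3 epochs over 690 steps.
s2_examples = per_step(59_000 * 3, 690)   # roughly 257 examples per step

print(f"Stage 1 loss ~ {stage1_loss:.2f}, Stage 2 loss ~ {stage2_loss:.2f}")
print(f"Stage 1: ~{s1_tokens:,.0f} tokens/step (~{s1_sequences:.1f} sequences)")
print(f"Stage 2: ~{s2_examples:.0f} examples/step")
```

These are implied averages, not the configured batch sizes; the exact values are set in the training notebooks.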
|
|
| **Key Design Choices:** |
| - Used `# INSTRUCTION:` and `# REASONING:` markers, which are ordinary Python comments and so fit a tokenizer trained only on Python |
| - Masked loss: the loss is computed only on the model's reasoning and code tokens, not on the user's instruction |
| - 10× lower learning rate for Stage 2 than for Stage 1 (to limit catastrophic forgetting) |
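
To make the prompt format and the masked loss concrete, here is a minimal sketch. The exact template layout (where newlines go, whether `# REASONING:` belongs to the prompt or the response) is an assumption, and `format_example`, `build_labels`, and `masked_mean_loss` are hypothetical helpers, not the functions used in the notebooks. The one detail taken from a real library is `-100`, the default `ignore_index` of PyTorch's `cross_entropy`.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.functional.cross_entropy

def format_example(instruction, reasoning, code):
    """Hypothetical template; the real layout is defined in the Stage 2 notebook."""
    prompt = f"# INSTRUCTION: {instruction}\n"
    response = f"# REASONING: {reasoning}\n{code}\n"
    return prompt, response

def build_labels(prompt_ids, response_ids):
    """Concatenate prompt and response ids; hide prompt positions from the loss."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

def masked_mean_loss(token_losses, labels):
    """Average per-token losses over positions whose label is not ignored."""
    kept = [loss for loss, label in zip(token_losses, labels) if label != IGNORE_INDEX]
    return sum(kept) / len(kept) if kept else 0.0

# Toy ids stand in for tokenizer output so the sketch runs without any files.
prompt_ids = [11, 12, 13]      # encoded "# INSTRUCTION: ..." part
response_ids = [21, 22]        # encoded reasoning + code part
input_ids, labels = build_labels(prompt_ids, response_ids)
print(labels)                  # [-100, -100, -100, 21, 22]

# Only the last two positions contribute: (1.0 + 3.0) / 2
loss = masked_mean_loss([9.0, 9.0, 9.0, 1.0, 3.0], labels)
print(loss)                    # 2.0
```

With the real tokenizer, the id lists would come from `tokenizer.encode(prompt).ids` and `tokenizer.encode(response).ids`. The usual one-position shift for next-token prediction is left to the training loop.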
|
|
| --- |
|
|
| ## 📂 What's In This Repo |
|
|
| | File | Size | Description | |
| |------|------|-------------| |
| | `checkpoint before instruction traning.pt` | 3.05 GB | **Stage 1 Base Model** - 127k steps, 0% HumanEval | |
| | `after instruction trannig.pt` | 1.4 GB | **Stage 2 Instruct Model** - after instruction tuning on 59k examples | |
| | `tokenizer_100_python.json` | 2.21 MB | Custom BPE tokenizer (required!) | |
| | `vocab.json` / `merges.txt` | 748 KB | Tokenizer vocab files | |
| | `stage1.75 and hman eval after that all code.ipynb` | 58.8 KB | Stage 1 training loop + HumanEval code | |
| | `training of 59k instruction set traning stage 2.ipynb` | 72.8 KB | Stage 2 instruction tuning code | |
|
|
| --- |
|
|
| ## 💻 How to Load and Use |
|
|
| ### ⚠️ Important Note |
| This is a **custom architecture** built from scratch. You **cannot** use `transformers.AutoModel.from_pretrained()`. You must first define the model class by running its definition from the Jupyter notebooks in this repo, then load the checkpoint weights into it. |
|
|
| ### Step 1: Load the Tokenizer |
| ```python |
| from tokenizers import Tokenizer |
| |
| # Load the custom tokenizer |
| tokenizer = Tokenizer.from_file("tokenizer_100_python.json") |
| |
| # Test it |
| text = "def fibonacci(n):" |
| encoded = tokenizer.encode(text) |
| print(f"Tokens: {encoded.ids}") |
| print(f"Decoded: {tokenizer.decode(encoded.ids)}") |