
PyCoder-300M: The "Cargo Cult" Coder πŸš€

A 300M parameter Python coding model built entirely from scratch on Kaggle TPUs

| Checkpoint | HumanEval Score | What It Knows |
| --- | --- | --- |
| Stage 1 (Base Model) | 0.00% | Perfect Python syntax, zero logic |
| Stage 2 (Instruct Model) | 0.00% | Perfect instruction formatting, beautiful reasoning blocks... still zero logic |

πŸ“– The Story

I spent three weeks building this model from absolute scratchβ€”not fine-tuning a pretrained model, but building the entire pipeline: custom tokenizer, custom architecture, distributed TPU training, everything.

Stage 1: I trained it on 8 billion tokens of Python code scraped from GitHub. It learned perfect formattingβ€”pristine indentation, beautiful docstrings, proper type hints. But the logic? Complete hallucination. It would see def two_sum(nums, target): and confidently return len(nums). Every. Single. Time.

It learned Python the way a parrot learns language: perfect mimicry, zero comprehension.

Stage 2: I thought, "Maybe it just needs to learn how to think!" So, I instruction-tuned it on 59k high-quality reasoning problems generated by Qwen2.5-Coder-32B. I taught it a strict format: read the instruction, write out the reasoning, then write the code.

The Result: It learned the exact format! It now outputs a beautiful # REASONING: block where it confidently hallucinates absolute nonsense, followed by flawlessly indented code that completely fails the unit tests.

HumanEval score: Still 0.00%. This is the brutal reality of the Data Scaling Wall: to get a 300M-parameter model to actually reason zero-shot, you need trillions of high-quality tokens.

But as an engineering project? It was a massive success. With help from AI assistants (Gemini and Claude), I built the whole thing from the ground up using what I know about Transformers, and the distributed XLA pipeline works perfectly.


πŸ—οΈ Technical Architecture

Model Specifications:

  • Parameters: 300M (24 layers Γ— 1024 hidden dim)
  • Attention: MLA (Multi-Head Latent Attention) with QK-Norm
  • Positional Encoding: RoPE (Rotary Position Embeddings)
  • FFN: SwiGLU activation
  • Context Length: 4096 tokens
  • Tokenizer: Custom 32k BPE trained purely on Python
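
For reference, the spec list above translates to roughly this kind of config object. This is a sketch only: the class and field names are mine, not the ones used in the training notebooks.

```python
from dataclasses import dataclass

@dataclass
class PyCoderConfig:
    # Hypothetical field names; values copied from the spec list above.
    n_layers: int = 24          # transformer blocks
    d_model: int = 1024         # hidden dimension
    vocab_size: int = 32_000    # custom Python-only BPE vocabulary
    max_seq_len: int = 4096     # context length
    rope: bool = True           # rotary position embeddings (RoPE)
    attention: str = "MLA"      # multi-head latent attention with QK-Norm
    ffn: str = "SwiGLU"         # gated feed-forward activation
```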

Training Infrastructure:

  • Hardware: 8Γ— TPU v5e cores (Kaggle free tier)
  • Optimizers: Muon (for weight matrices) + AdamW (for biases/norms); see the parameter-grouping sketch after this list
  • Precision: bfloat16 with FP32-safe RMSNorm
  • Stage 1: 127k steps on 8B tokens β†’ PPL 3.0
  • Stage 2: 690 steps (3 epochs) on 59k instructions β†’ PPL 5.2
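
The optimizer split is easy to get wrong, so here is a minimal sketch of the parameter grouping, assuming the usual convention of routing by tensor rank. The Muon implementation itself lives in the training notebooks and is not shown here, and the learning rates below are placeholders, not the values used in training.

```python
import torch

def split_param_groups(model: torch.nn.Module):
    # Weight matrices (ndim >= 2) go to Muon; biases and norm scales
    # (ndim < 2) go to AdamW, matching the split described above.
    matrix_params = [p for p in model.parameters() if p.requires_grad and p.ndim >= 2]
    scalar_params = [p for p in model.parameters() if p.requires_grad and p.ndim < 2]
    return matrix_params, scalar_params

# Hypothetical usage (the Muon class comes from the training notebooks):
# matrix_params, scalar_params = split_param_groups(model)
# muon_opt  = Muon(matrix_params, lr=2e-2)               # placeholder lr
# adamw_opt = torch.optim.AdamW(scalar_params, lr=3e-4)  # placeholder lr
```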

Key Design Choices:

  • Used the # INSTRUCTION: and # REASONING: prompt format (comment-style markers tokenize cleanly with the Python-only BPE tokenizer)
  • Masked loss: trained only on the model's reasoning + code, never on the user instruction (see the sketch after this list)
  • 10x lower learning rate for Stage 2 vs Stage 1 (prevents catastrophic forgetting)
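
To make the first two bullets concrete, here is a sketch of how an instruction-tuning example could be assembled: the prompt uses the comment-style markers, and every prompt token is masked out of the loss. The helper name, exact whitespace, and the -100 ignore index are my assumptions, not code lifted from the notebooks.

```python
IGNORE_INDEX = -100  # conventional ignore_index for cross-entropy loss (assumed here)

def build_training_example(tokenizer, instruction: str, reasoning: str, code: str):
    # Comment-style markers tokenize cleanly with a Python-only BPE tokenizer.
    prompt = f"# INSTRUCTION:\n{instruction}\n\n# REASONING:\n"
    target = f"{reasoning}\n\n{code}"

    prompt_ids = tokenizer.encode(prompt).ids
    target_ids = tokenizer.encode(target).ids

    input_ids = prompt_ids + target_ids
    # Masked loss: instruction tokens are ignored, so gradients only flow
    # through the model's reasoning block and the code that follows it.
    labels = [IGNORE_INDEX] * len(prompt_ids) + target_ids
    return input_ids, labels
```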

πŸ“‚ What's In This Repo

| File | Size | Description |
| --- | --- | --- |
| checkpoint before instruction traning.pt | 3.05 GB | Stage 1 base model: 127k steps, 0% HumanEval |
| after instruction trannig.pt | 1.4 GB | Stage 2 instruct model: after instruction tuning on 59k examples |
| tokenizer_100_python.json | 2.21 MB | Custom BPE tokenizer (required!) |
| vocab.json / merges.txt | 748 KB | Tokenizer vocab files |
| stage1.75 and hman eval after that all code.ipynb | 58.8 KB | Stage 1 training loop + HumanEval evaluation code |
| training of 59k instruction set traning stage 2.ipynb | 72.8 KB | Stage 2 instruction tuning code |

πŸ’» How to Load and Use

⚠️ Important Note

This is a custom architecture built from scratch. You cannot use transformers.AutoModel.from_pretrained(). You must copy the model class definition from the training notebooks in this repo and load the checkpoint weights into it yourself (see Step 2 below).

Step 1: Load the Tokenizer

```python
from tokenizers import Tokenizer

# Load the custom tokenizer
tokenizer = Tokenizer.from_file("tokenizer_100_python.json")

# Test it
text = "def fibonacci(n):"
encoded = tokenizer.encode(text)
print(f"Tokens: {encoded.ids}")
print(f"Decoded: {tokenizer.decode(encoded.ids)}")
```