+---
+license: mit
+base_model: none
+tags:
+  - geometry-os
+  - pixelgpt
+  - assembly
+  - code-generation
+  - custom-code
+---
+# PixelGPT -- Geometry OS Assembly Generator
+A bilingual (English + GeOS assembly) GPT-2 model that generates Geometry OS bytecode assembly from natural language descriptions.
+## Model Architecture
+| Version | Params | Layers | Heads | Embedding dim | Context (tokens) | Loss | Notes |
+|---------|--------|--------|-------|------|---------|------|-------|
+| V5 | 13.7M | 6 | 8 | 384 | 1024 | 0.374 | Fixed ByteLevel tokenizer |
+| V6 | 29.3M | 8 | 8 | 512 | 1024 | 0.907 | Golden dataset; training stopped early (epoch 2 of 40) |
+| V8 | 29.3M | 8 | 8 | 512 | 1024 | 0.076 | Overfit to training surface form |
+| V9 | ~4M | 4 | 8 | 256 | 1024 | 0.019 | Compact retrain |
+## Tokenizer
+The bilingual V4 tokenizer splits its ID space into three tiers:
+- **0-267**: Atomic opcodes, registers, structural tokens (BOS/EOS/NL/COMMA/COLON)
+- **270-351**: Character-level literals (digits, letters, symbols)
+- **356+**: BPE text tokens (prose, comments, descriptions)
+Round-trip fidelity (encode, then decode): 89.6% exact match, 94.5% preserved, 99.6% on the code portion.
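The three-tier layout above can be illustrated with a small lookup that maps a token ID to its tier. This is a minimal sketch and not part of the PixelGPT codebase: the function name `token_tier` and the tier labels are hypothetical, and IDs that fall between the listed ranges (268-269 and 352-355) are reported as `"unassigned"` because the list does not say what they hold.

```python
from collections import Counter
from typing import Iterable, Literal

Tier = Literal["atomic", "char", "bpe", "unassigned"]

# Boundaries copied from the tier list in this README.
ATOMIC_MAX = 267               # 0-267: opcodes, registers, structural tokens
CHAR_MIN, CHAR_MAX = 270, 351  # 270-351: character-level literals
BPE_MIN = 356                  # 356+: BPE text tokens


def token_tier(token_id: int) -> Tier:
    """Return the tier a token ID falls into (hypothetical helper)."""
    if token_id < 0:
        raise ValueError(f"token IDs are non-negative, got {token_id}")
    if token_id <= ATOMIC_MAX:
        return "atomic"
    if CHAR_MIN <= token_id <= CHAR_MAX:
        return "char"
    if token_id >= BPE_MIN:
        return "bpe"
    # Assumption: the gaps between the documented ranges are unused.
    return "unassigned"


def tier_histogram(token_ids: Iterable[int]) -> Counter:
    """Count how many tokens of a sequence fall into each tier."""
    return Counter(token_tier(t) for t in token_ids)


if __name__ == "__main__":
    # Arbitrary IDs chosen only to exercise each branch.
    print(tier_histogram([0, 12, 267, 270, 300, 356, 4000]))
```

A histogram like this is one quick way to check whether an encoded sample is mostly code (atomic and character tiers) or mostly prose (BPE tier).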
+## Usage
+```python
+import sys
+import torch
+# Make the local modules (in the current directory) importable
+sys.path.insert(0, ".")
+from bilingual_tokenizer import BilingualTokenizer
+from train_opcode_llm import OpcodeGPT, generate_asm
+# Load tokenizer
+tokenizer = BilingualTokenizer.load("bilingual_tokenizer_v4/")
+# Load model (V8 example)
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+checkpoint = torch.load("bilingual_llm_v8_ckpt.pt", map_location=device)
+m_args = checkpoint["args"]
+model = OpcodeGPT(
+    vocab_size=m_args["vocab_size"],
+    n_embd=m_args["embd"],
+    n_head=m_args["heads"],
+    n_layer=m_args["layers"],
+    block_size=m_args["context_len"]
+).to(device)
+model.load_state_dict(checkpoint["model"])
+model.eval()
+# Generate from a natural language prompt, written as an assembly comment (leading ';')
+prompt = "; Draw a red circle at the center"
+asm = generate_asm(model, tokenizer, prompt, device, max_tokens=256, temperature=0.7)
+print(asm)
+```
+## Training Data
+Trained on 5,211 annotated GeOS assembly programs (211 real + ~5,000 synthetic). Each program includes a natural-language description as a comment. The tokenized dataset, `bilingual_dataset.npz`, contains 7,824 samples (2.91M tokens) built with a context length of 1024 and a stride of 512.
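A context of 1024 with a stride of 512 means consecutive samples overlap by half a window. The sketch below shows one common way to produce such windows from a token sequence. It is an illustration under assumptions, not the project's data pipeline: the function name `sliding_windows`, the `pad_id` value, and the choice to pad the final short window (rather than drop it) are all hypothetical, and the README does not say whether windows are cut per program or across a concatenated stream.

```python
from typing import List, Sequence


def sliding_windows(
    tokens: Sequence[int],
    context: int = 1024,
    stride: int = 512,
    pad_id: int = 0,  # hypothetical padding ID, not taken from the tokenizer
) -> List[List[int]]:
    """Split a token sequence into fixed-length, overlapping windows."""
    if context <= 0 or stride <= 0:
        raise ValueError("context and stride must be positive")

    windows: List[List[int]] = []
    start = 0
    while True:
        chunk = list(tokens[start:start + context])
        if not chunk:
            break  # empty input: nothing to emit
        # Assumption: pad the last, shorter window up to the context length.
        chunk += [pad_id] * (context - len(chunk))
        windows.append(chunk)
        if start + context >= len(tokens):
            break  # this window already reached the end of the sequence
        start += stride
    return windows


if __name__ == "__main__":
    demo = sliding_windows(list(range(2000)))
    # Windows start at offsets 0, 512, and 1024.
    print(len(demo), [w[0] for w in demo])
```

With a stride of half the context, every token away from the sequence edges appears in two samples, which is why the sample count can exceed the token count divided by 1024.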
+## Geometry OS
+PixelGPT targets the Geometry OS bytecode VM -- a pixel-native operating system with 150+ opcodes, 32 registers, and a 256x256 RGB framebuffer. Programs are written in a custom assembly language and assembled to bytecode.
+Repo: https://github.com/tdw419/geometry_os
+## Status
+**Research preview.** The model generates syntactically plausible assembly but does not yet produce consistently working programs. Active development focuses on training data quality and constrained decoding.
+## Developed by
+Built with the [Hermes Agent](https://github.com/nousresearch/hermes-agent) autonomous development pipeline on an NVIDIA RTX 5090.