| --- |
| license: mit |
| base_model: none |
| tags: |
| - geometry-os |
| - pixelgpt |
| - assembly |
| - code-generation |
| - custom-code |
| --- |
| |
| # PixelGPT -- Geometry OS Assembly Generator |
|
|
| A bilingual (English + GeOS assembly) GPT-2 model that generates Geometry OS bytecode assembly from natural language descriptions. |
|
|
| ## Model Architecture |
|
|
| | Version | Params | Layers | Heads | Embedding dim | Context (tokens) | Loss | Notes | |
| |---------|--------|--------|-------|------|---------|------|-------| |
| | V5 | 13.7M | 6 | 8 | 384 | 1024 | 0.374 | Fixed ByteLevel tokenizer | |
| | V6 | 29.3M | 8 | 8 | 512 | 1024 | 0.907 | Golden dataset, killed early (ep 2/40) | |
| | V8 | 29.3M | 8 | 8 | 512 | 1024 | 0.076 | Overfit to training surface form | |
| | V9 | ~4M | 4 | 8 | 256 | 1024 | 0.019 | Compact retrain | |
|
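As a rough cross-check on the parameter column, the sketch below counts weights for a standard GPT-2 stack with tied input/output embeddings. It assumes `OpcodeGPT` follows that layout, which this card does not confirm, and the function name `gpt2_param_count` is hypothetical. The vocabulary size is not stated here, so it is left as an input.

```python
def gpt2_param_count(n_layer: int, n_embd: int, block_size: int, vocab_size: int) -> int:
    """Weight count for a standard GPT-2 stack with tied token/output embeddings.

    Assumption: the model uses the usual GPT-2 block (fused QKV, 4x MLP,
    two LayerNorms per block, learned position embeddings).
    """
    d = n_embd
    embeddings = vocab_size * d + block_size * d    # token table + position table
    attention = (3 * d * d + 3 * d) + (d * d + d)   # QKV projection + output projection
    mlp = (4 * d * d + 4 * d) + (4 * d * d + d)     # expand to 4d, project back to d
    layer_norms = 2 * (2 * d)                       # two LayerNorms, scale + bias each
    final_norm = 2 * d                              # LayerNorm after the last block
    return embeddings + n_layer * (attention + mlp + layer_norms) + final_norm


# Shape taken from the V5 row. vocab_size=8000 is a placeholder,
# not the model's real vocabulary size.
print(gpt2_param_count(n_layer=6, n_embd=384, block_size=1024, vocab_size=8000))
```

The head count does not appear in the formula: splitting the embedding across more heads changes how attention is computed, not how many weights it has.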
|
| ## Tokenizer |
|
|
| Bilingual V4 tokenizer with a three-tier ID space: |
| - **0-267**: Atomic opcodes, registers, structural tokens (BOS/EOS/NL/COMMA/COLON) |
| - **270-351**: Character-level literals (digits, letters, symbols) |
| - **356+**: BPE text tokens (prose, comments, descriptions) |
|
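The ID ranges above can be turned into a small lookup, which is handy when inspecting generated sequences. This is a minimal sketch: the function name `token_tier` and the tier labels are hypothetical, and IDs that fall between the listed ranges are reported as unassigned because this card does not say what they hold.

```python
def token_tier(token_id: int) -> str:
    """Map a token ID to the tier it belongs to, using the ranges listed above."""
    if token_id < 0:
        raise ValueError(f"token IDs are non-negative, got {token_id}")
    if token_id <= 267:
        return "atomic"      # opcodes, registers, structural tokens
    if 270 <= token_id <= 351:
        return "char"        # character-level literals
    if token_id >= 356:
        return "bpe"         # BPE text tokens
    return "unassigned"      # 268-269 and 352-355 are not described in this card


# Example: tally the tiers in a sequence of IDs (the IDs here are arbitrary).
ids = [0, 12, 270, 300, 356, 4000]
counts = {}
for i in ids:
    counts[token_tier(i)] = counts.get(token_tier(i), 0) + 1
print(counts)
```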
|
| Round-trip fidelity: 89.6% exact, 94.5% preserved, 99.6% code-portion. |
|
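An exact round-trip rate like the first figure can be measured as the share of inputs that survive encode-then-decode unchanged. The sketch below takes the encoder and decoder as plain callables, since the method names on `BilingualTokenizer` are not given in this card. The toy codec is purely illustrative, and the "preserved" and "code-portion" metrics are not reproduced because their definitions are not stated.

```python
from typing import Callable, Iterable, List


def exact_round_trip_rate(
    texts: Iterable[str],
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
) -> float:
    """Fraction of texts for which decode(encode(text)) == text."""
    texts = list(texts)
    if not texts:
        raise ValueError("need at least one text to measure round-trip fidelity")
    exact = sum(1 for t in texts if decode(encode(t)) == t)
    return exact / len(texts)


# Toy codec (hypothetical): collapses runs of whitespace, so it is lossy
# for any input containing double spaces.
def toy_encode(s: str) -> List[int]:
    return [ord(c) for c in " ".join(s.split())]


def toy_decode(ids: List[int]) -> str:
    return "".join(chr(i) for i in ids)


rate = exact_round_trip_rate(["a b", "a  b"], toy_encode, toy_decode)
print(rate)
```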
|
| ## Usage |
|
|
| ```python |
| import sys |
| |
| import torch |
| |
| # Make modules in the current working directory importable |
| sys.path.insert(0, ".") |
| from bilingual_tokenizer import BilingualTokenizer |
| from train_opcode_llm import OpcodeGPT, generate_asm |
| |
| # Load tokenizer |
| tokenizer = BilingualTokenizer.load("bilingual_tokenizer_v4/") |
| |
| # Load model (V8 example) |
| device = torch.device("cuda" if torch.cuda.is_available() else "cpu") |
| checkpoint = torch.load("bilingual_llm_v8_ckpt.pt", map_location=device) |
| m_args = checkpoint["args"] |
| |
| model = OpcodeGPT( |
|     vocab_size=m_args["vocab_size"], |
|     n_embd=m_args["embd"], |
|     n_head=m_args["heads"], |
|     n_layer=m_args["layers"], |
|     block_size=m_args["context_len"], |
| ).to(device) |
| model.load_state_dict(checkpoint["model"]) |
| model.eval() |
| |
| # Generate from natural language prompt |
| asm = generate_asm(model, tokenizer, "; Draw a red circle at the center", device, max_tokens=256, temperature=0.7) |
| print(asm) |
| ``` |
|
|
| ## Training Data |
|
|
| Trained on 5,211 annotated GeOS assembly programs (211 real + ~5,000 synthetic). Each program includes a natural language description comment. Dataset: `bilingual_dataset.npz` (7,824 samples, 2.91M tokens, context=1024, stride=512). |
|
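The `context=1024, stride=512` setting describes overlapping windows: each sample spans 1024 tokens and consecutive samples start 512 tokens apart. The sketch below shows one way to compute window offsets. How `bilingual_dataset.npz` was actually built is not documented here, so the function name `window_starts` and the tail handling (one extra window aligned to the end of the sequence) are assumptions.

```python
from typing import List


def window_starts(n_tokens: int, context: int = 1024, stride: int = 512) -> List[int]:
    """Start offsets of overlapping windows over a sequence of n_tokens tokens."""
    if n_tokens <= 0:
        return []
    if n_tokens <= context:
        return [0]  # a single (possibly short) window covers everything
    starts = list(range(0, n_tokens - context + 1, stride))
    if starts[-1] + context < n_tokens:
        # Assumption: cover the leftover tail with a window aligned to the end.
        starts.append(n_tokens - context)
    return starts


# Example: a 2,100-token sequence yields four windows of 1,024 tokens each.
tokens = list(range(2100))
windows = [tokens[s:s + 1024] for s in window_starts(len(tokens))]
print(len(windows), [w[0] for w in windows])
```

With a stride of half the context, every token away from the sequence edges appears in two windows, so the number of samples is roughly twice what non-overlapping chunks would give.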
|
| ## Geometry OS |
|
|
| PixelGPT targets the Geometry OS bytecode VM -- a pixel-native operating system with 150+ opcodes, 32 registers, and a 256x256 RGB framebuffer. Programs are written in a custom assembly language and assembled to bytecode. |
|
|
| Repo: https://github.com/tdw419/geometry_os |
| |
| ## Status |
| |
| **Research preview.** The model generates syntactically plausible assembly but does not yet produce consistently working programs. Active development focuses on training data quality and constrained decoding. |
| |
| ## Developed by |
| |
| Built with the [Hermes Agent](https://github.com/nousresearch/hermes-agent) autonomous development pipeline on an NVIDIA RTX 5090. |
| |