Upload README.md with huggingface_hub
README.md
ADDED
@@ -0,0 +1,82 @@
---
license: mit
base_model: none
tags:
- geometry-os
- pixelgpt
- assembly
- code-generation
- custom-code
---

# PixelGPT -- Geometry OS Assembly Generator

A bilingual (English + GeOS assembly) GPT-2-style model that generates Geometry OS bytecode assembly from natural-language descriptions.

## Model Architecture

| Version | Params | Layers | Heads | Embedding dim | Context length | Loss | Notes |
|---------|--------|--------|-------|---------------|----------------|------|-------|
| V5 | 13.7M | 6 | 8 | 384 | 1024 | 0.374 | Fixed ByteLevel tokenizer |
| V6 | 29.3M | 8 | 8 | 512 | 1024 | 0.907 | Golden dataset, killed early (ep 2/40) |
| V8 | 29.3M | 8 | 8 | 512 | 1024 | 0.076 | Overfit to training surface form |
| V9 | ~4M | 4 | 8 | 256 | 1024 | 0.019 | Compact retrain |

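As a rough sanity check on the Params column, the sketch below counts the weights in the transformer blocks of a standard GPT-2 layout (fused QKV attention, a 4x MLP, two LayerNorms per block, all with biases). It assumes `OpcodeGPT` follows that layout, which this README does not confirm. `block_params` is an illustrative helper, and token and position embeddings are excluded because the vocabulary size is not stated here.

```python
def block_params(n_layer: int, n_embd: int) -> int:
    """Weights in n_layer GPT-2-style transformer blocks (embeddings excluded)."""
    d = n_embd
    attn = (3 * d * d + 3 * d) + (d * d + d)      # fused QKV + output projection
    mlp = (4 * d * d + 4 * d) + (4 * d * d + d)   # expand to 4d, project back to d
    norms = 2 * (2 * d)                           # two LayerNorms (scale + bias)
    return n_layer * (attn + mlp + norms)


# Layer/width pairs taken from the table above
for version, n_layer, n_embd in [("V5", 6, 384), ("V6/V8", 8, 512), ("V9", 4, 256)]:
    print(f"{version}: {block_params(n_layer, n_embd) / 1e6:.2f}M in blocks")
```

Under these assumptions, whatever separates the block count from each reported total would come from the embedding tables and the final LayerNorm.
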
## Tokenizer

Bilingual V4 tokenizer with a three-tier ID space:
- **0-267**: Atomic opcodes, registers, and structural tokens (BOS/EOS/NL/COMMA/COLON)
- **270-351**: Character-level literals (digits, letters, symbols)
- **356+**: BPE text tokens (prose, comments, descriptions)

Round-trip fidelity: 89.6% exact, 94.5% preserved, 99.6% code-portion.

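The three ID ranges above can be expressed as a small lookup. This is a minimal sketch, not part of `BilingualTokenizer`: `token_tier` and its tier labels are hypothetical names, and treating the gaps between the ranges (268-269 and 352-355) as unassigned is an assumption, since the README does not say what they hold.

```python
def token_tier(token_id: int) -> str:
    """Map a token ID to its tier, using the ranges listed above."""
    if 0 <= token_id <= 267:
        return "atomic"      # opcodes, registers, structural tokens
    if 270 <= token_id <= 351:
        return "char"        # character-level literals
    if token_id >= 356:
        return "bpe"         # BPE text tokens
    return "unassigned"      # assumption: gaps and negative IDs are unused


print([token_tier(i) for i in (0, 267, 268, 300, 352, 356, 5000)])
```
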
## Usage

```python
import sys

import torch

# Make modules in the current directory importable
sys.path.insert(0, ".")
from bilingual_tokenizer import BilingualTokenizer
from train_opcode_llm import OpcodeGPT, generate_asm

# Load tokenizer
tokenizer = BilingualTokenizer.load("bilingual_tokenizer_v4/")

# Load model (V8 example)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
checkpoint = torch.load("bilingual_llm_v8_ckpt.pt", map_location=device)
m_args = checkpoint["args"]  # hyperparameters saved with the checkpoint

model = OpcodeGPT(
    vocab_size=m_args["vocab_size"],
    n_embd=m_args["embd"],
    n_head=m_args["heads"],
    n_layer=m_args["layers"],
    block_size=m_args["context_len"],
).to(device)
model.load_state_dict(checkpoint["model"])
model.eval()

# Generate from natural language prompt
asm = generate_asm(model, tokenizer, "; Draw a red circle at the center", device, max_tokens=256, temperature=0.7)
print(asm)
```

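The `temperature=0.7` argument above controls how sharply sampling favors high-scoring tokens. The sketch below shows the usual temperature-scaled softmax in plain Python. It is an illustration of the general technique, not the code inside `generate_asm`, and `softmax_with_temperature` and `sample_token` are hypothetical helper names.

```python
import math
import random


def softmax_with_temperature(logits, temperature=0.7):
    """Convert raw logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature=0.7, rng=random):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, temperature=1.0))
print(softmax_with_temperature(logits, temperature=0.7))  # sharper: more mass on index 0
print(sample_token(logits, temperature=0.7))
```

Values below 1.0 concentrate probability on the top candidates, which tends to suit structured output such as assembly.
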
## Training Data

Trained on 5,211 annotated GeOS assembly programs (211 real + ~5,000 synthetic). Each program includes a natural-language description comment. The tokenized dataset, `bilingual_dataset.npz`, contains 7,824 samples (2.91M tokens) built with context=1024 and stride=512.

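The context and stride settings suggest overlapping windows over token sequences. The sketch below shows one common way to cut such windows. It is an assumption about the preprocessing, not the project's actual script: `sliding_windows` is a hypothetical helper, and whether windows are taken per program or over a concatenated stream, and how short tails are padded, is not stated in this README.

```python
def sliding_windows(tokens, context=1024, stride=512):
    """Split a token list into overlapping windows of at most `context` tokens."""
    if not tokens:
        return []
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + context])
        if start + context >= len(tokens):   # this window reached the end
            break
        start += stride
    return windows


program = list(range(2000))                  # stand-in for one tokenized program
chunks = sliding_windows(program)
print(len(chunks), [len(c) for c in chunks])
```

With stride equal to half the context, consecutive windows share 512 tokens, so most tokens appear in two training samples.
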
## Geometry OS

PixelGPT targets the Geometry OS bytecode VM -- a pixel-native operating system with 150+ opcodes, 32 registers, and a 256x256 RGB framebuffer. Programs are written in a custom assembly language and assembled to bytecode.

Repo: https://github.com/tdw419/geometry_os

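For a sense of scale, the arithmetic below works out the size of a 256x256 RGB framebuffer and the position of its center pixel. The memory layout is an assumption made purely for illustration (row-major, three packed bytes per pixel); the actual Geometry OS layout is defined in the repo and may differ, and `pixel_offset` is a hypothetical helper.

```python
WIDTH, HEIGHT = 256, 256      # framebuffer size stated above
BYTES_PER_PIXEL = 3           # assumption: packed 8-bit R, G, B


def pixel_offset(x: int, y: int) -> int:
    """Byte offset of pixel (x, y) in a row-major, packed-RGB buffer (assumed layout)."""
    if not (0 <= x < WIDTH and 0 <= y < HEIGHT):
        raise ValueError("pixel out of bounds")
    return (y * WIDTH + x) * BYTES_PER_PIXEL


print(WIDTH * HEIGHT)                      # 65536 pixels
print(WIDTH * HEIGHT * BYTES_PER_PIXEL)    # 196608 bytes under the assumed layout
print(pixel_offset(WIDTH // 2, HEIGHT // 2))
```
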
## Status

**Research preview.** The model generates syntactically plausible assembly but does not yet produce consistently working programs. Active development focuses on training data quality and constrained decoding.

## Developed by

Built with the [Hermes Agent](https://github.com/nousresearch/hermes-agent) autonomous development pipeline on an NVIDIA RTX 5090.