	---
	license: mit
	base_model: none
	tags:
	- geometry-os
	- pixelgpt
	- assembly
	- code-generation
	- custom-code
	---

	# PixelGPT -- Geometry OS Assembly Generator

	A bilingual (English + GeOS assembly) GPT-2-style model that generates Geometry OS bytecode assembly from natural-language descriptions.

	## Model Architecture

	| Version | Params | Layers | Heads | Embed dim | Context (tokens) | Loss | Notes |
	|---------|--------|--------|-------|-----------|------------------|------|-------|
	| V5 | 13.7M | 6 | 8 | 384 | 1024 | 0.374 | Fixed ByteLevel tokenizer |
	| V6 | 29.3M | 8 | 8 | 512 | 1024 | 0.907 | Golden dataset, killed early (ep 2/40) |
	| V8 | 29.3M | 8 | 8 | 512 | 1024 | 0.076 | Overfit to training surface form |
	| V9 | ~4M | 4 | 8 | 256 | 1024 | 0.019 | Compact retrain |
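The Params column can be sanity-checked against the layer, width, and context columns. The sketch below counts parameters for a standard GPT-2 layout with tied input/output embeddings. It is an assumption that `OpcodeGPT` follows this layout, and the helper name `gpt2_param_count` is hypothetical. The vocabulary size is not stated in this card, so the check uses the published GPT-2 small configuration instead of a PixelGPT row.

```python
def gpt2_param_count(vocab_size, n_embd, n_layer, block_size):
    """Parameter count for a standard GPT-2 layout with tied embeddings.

    Assumption: biases on all linear layers, two LayerNorms per block,
    one final LayerNorm, and an output head that shares the token embedding.
    """
    # Per block: attention (4*d^2 + 4*d), MLP (8*d^2 + 5*d), two LayerNorms (4*d)
    per_layer = 12 * n_embd ** 2 + 13 * n_embd
    # Token embedding plus learned position embedding
    embeddings = vocab_size * n_embd + block_size * n_embd
    # Final LayerNorm (weight and bias)
    final_ln = 2 * n_embd
    return n_layer * per_layer + embeddings + final_ln


# GPT-2 small: 50257-token vocabulary, width 768, 12 layers, context 1024
print(gpt2_param_count(50257, 768, 12, 1024))  # 124439808
```

To check a PixelGPT row, pass the `vocab_size` stored in the checkpoint's `args` together with the Layers, Embed dim, and Context values from the table.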

	## Tokenizer

	Bilingual V4 tokenizer with three-tier ID space:
	- 0-267: Atomic opcodes, registers, structural tokens (BOS/EOS/NL/COMMA/COLON)
	- 270-351: Character-level literals (digits, letters, symbols)
	- 356+: BPE text tokens (prose, comments, descriptions)

	Round-trip fidelity: 89.6% exact, 94.5% preserved, 99.6% code-portion.
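The three ID ranges above can be expressed as a small lookup. This is a minimal sketch, not part of `BilingualTokenizer`: the function name `token_tier` and the tier labels are hypothetical, and the IDs the card does not describe (268-269 and 352-355) are treated here as unassigned.

```python
def token_tier(token_id):
    """Map a token ID to its tier, using the ranges listed in this card."""
    if token_id < 0:
        raise ValueError("token IDs are non-negative")
    if token_id <= 267:
        return "atomic"      # opcodes, registers, structural tokens
    if 270 <= token_id <= 351:
        return "char"        # character-level literals
    if token_id >= 356:
        return "bpe"         # BPE text tokens
    return "unassigned"      # 268-269 and 352-355: not described in the card


for tid in (0, 267, 269, 300, 355, 356):
    print(tid, token_tier(tid))
```

A check like this can help when debugging generations, for example to see whether the model drifts from atomic code tokens into BPE prose tokens mid-instruction.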

	## Usage

	```python
	import sys

	import torch

	# Make local modules in the current directory importable
	sys.path.insert(0, ".")
	from bilingual_tokenizer import BilingualTokenizer
	from train_opcode_llm import OpcodeGPT, generate_asm

	# Load tokenizer
	tokenizer = BilingualTokenizer.load("bilingual_tokenizer_v4/")

	# Load model (V8 example); hyperparameters are read from the checkpoint's "args"
	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
	checkpoint = torch.load("bilingual_llm_v8_ckpt.pt", map_location=device)
	m_args = checkpoint["args"]

	model = OpcodeGPT(
	    vocab_size=m_args["vocab_size"],
	    n_embd=m_args["embd"],
	    n_head=m_args["heads"],
	    n_layer=m_args["layers"],
	    block_size=m_args["context_len"],
	).to(device)
	model.load_state_dict(checkpoint["model"])
	model.eval()  # inference mode: disables dropout

	# Generate from natural language prompt (written as an assembly comment)
	asm = generate_asm(
	    model,
	    tokenizer,
	    "; Draw a red circle at the center",
	    device,
	    max_tokens=256,
	    temperature=0.7,
	)
	print(asm)
	```
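The internals of `generate_asm` are not shown in this card. As a general illustration of what a `temperature` argument usually does in sampling, the sketch below divides the logits by the temperature before applying softmax. The helper name `softmax_with_temperature` is hypothetical, and whether `generate_asm` works exactly this way is an assumption.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities, sharpening them when temperature < 1."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
neutral = softmax_with_temperature(logits, 1.0)
sharper = softmax_with_temperature(logits, 0.7)
# The most likely token gains probability mass at the lower temperature
print(sharper[0] > neutral[0])  # True
```

Under this reading, a value below 1.0 such as 0.7 makes generation more conservative, which tends to suit assembly output where a single wrong token can break a program.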

	## Training Data

	Trained on 5,211 annotated GeOS assembly programs (211 real + ~5,000 synthetic). Each program includes a natural language description comment. Dataset: `bilingual_dataset.npz` (7,824 samples, 2.91M tokens, context=1024, stride=512).
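The `context=1024, stride=512` figures describe overlapping windows over tokenized programs. Below is a minimal sketch of one way such windows could be cut from a single program. The function name `sliding_windows`, the padding of the final window, and the `pad_id` value are assumptions; the script that built `bilingual_dataset.npz` may differ.

```python
def sliding_windows(tokens, context=1024, stride=512, pad_id=0):
    """Split a token list into fixed-length windows that overlap by context - stride."""
    windows = []
    start = 0
    while True:
        chunk = tokens[start:start + context]
        # Pad the last (short) window so every sample has the same length
        chunk = chunk + [pad_id] * (context - len(chunk))
        windows.append(chunk)
        if start + context >= len(tokens):
            break  # this window already reaches the end of the program
        start += stride
    return windows


short_program = list(range(1, 601))   # 600 tokens: fits in one window
long_program = list(range(1, 2049))   # 2048 tokens: windows start at 0, 512, 1024
print(len(sliding_windows(short_program)), len(sliding_windows(long_program)))  # 1 3
```

With a stride of half the context, each token in a long program appears in up to two windows, so the number of samples exceeds the number of programs.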

	## Geometry OS

	PixelGPT targets the Geometry OS bytecode VM, a pixel-native operating system with 150+ opcodes, 32 registers, and a 256x256 RGB framebuffer. Programs are written in a custom assembly language and assembled to bytecode.

	Repo: https://github.com/tdw419/geometry_os

	## Status

	Research preview. The model generates syntactically plausible assembly but does not yet produce consistently working programs. Active development focuses on training data quality and constrained decoding.
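Constrained decoding generally means restricting each generation step to tokens that are valid in the current grammatical position. The sketch below shows the core idea with a greedy pick over an allowed set. It is a hypothetical illustration, not the project's implementation: the function name `constrained_argmax` and the example token IDs are invented for the example.

```python
def constrained_argmax(logits, allowed_ids):
    """Return the highest-scoring token ID among those the grammar allows."""
    best_id = None
    best_score = float("-inf")
    for token_id in allowed_ids:
        # Ignore IDs outside the vocabulary
        if 0 <= token_id < len(logits) and logits[token_id] > best_score:
            best_id = token_id
            best_score = logits[token_id]
    if best_id is None:
        raise ValueError("no allowed token is inside the vocabulary")
    return best_id


# Token 1 has the highest raw score, but only tokens 2 and 3 are allowed here
logits = [0.1, 5.0, 2.0, 3.0]
print(constrained_argmax(logits, {2, 3}))  # 3
```

In an assembler setting, the allowed set at each step would come from the instruction grammar, for example only register tokens after an opcode that expects a register operand.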

	## Developed by

	Built with the [Hermes Agent](https://github.com/nousresearch/hermes-agent) autonomous development pipeline on an NVIDIA RTX 5090.