tdw419 committed · Commit 8690148 · verified · 1 Parent(s): 0d61344

Upload README.md with huggingface_hub

  1. README.md +82 -0
README.md ADDED
@@ -0,0 +1,82 @@
+ ---
+ license: mit
+ base_model: none
+ tags:
+ - geometry-os
+ - pixelgpt
+ - assembly
+ - code-generation
+ - custom-code
+ ---
+
+ # PixelGPT -- Geometry OS Assembly Generator
+
+ A bilingual (English + GeOS assembly) GPT-2 model that generates Geometry OS bytecode assembly from natural-language descriptions.
+
+ ## Model Architecture
+
+ | Version | Params | Layers | Heads | Embedding dim | Context (tokens) | Loss | Notes |
+ |---------|--------|--------|-------|---------------|------------------|------|-------|
+ | V5 | 13.7M | 6 | 8 | 384 | 1024 | 0.374 | Fixed ByteLevel tokenizer |
+ | V6 | 29.3M | 8 | 8 | 512 | 1024 | 0.907 | Golden dataset, stopped early (epoch 2/40) |
+ | V8 | 29.3M | 8 | 8 | 512 | 1024 | 0.076 | Overfit to training surface form |
+ | V9 | ~4M | 4 | 8 | 256 | 1024 | 0.019 | Compact retrain |
+
+ ## Tokenizer
+
+ Bilingual V4 tokenizer with a three-tier token ID space:
+ - **0-267**: Atomic opcodes, registers, structural tokens (BOS/EOS/NL/COMMA/COLON)
+ - **270-351**: Character-level literals (digits, letters, symbols)
+ - **356+**: BPE text tokens (prose, comments, descriptions)
+
+ Round-trip fidelity: 89.6% exact, 94.5% preserved, 99.6% code-portion.
+
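The three ID ranges listed above can be expressed as a small lookup. This is only an illustrative sketch: `token_tier` and the tier names are hypothetical and are not part of `bilingual_tokenizer`. IDs that fall between the listed ranges (268-269 and 352-355) are reported as unassigned, since the README does not say what they hold.

```python
# Hypothetical helper: the ID ranges come from the README, the names do not.
TIERS = [
    (0, 267, "atomic"),   # opcodes, registers, structural tokens
    (270, 351, "char"),   # character-level literals
]
BPE_START = 356  # BPE text tokens start here and extend upward


def token_tier(token_id: int) -> str:
    """Return the tier name for a token ID, or "unassigned" for gaps."""
    if token_id < 0:
        raise ValueError("token IDs are non-negative")
    for low, high, name in TIERS:
        if low <= token_id <= high:
            return name
    if token_id >= BPE_START:
        return "bpe"
    return "unassigned"
```

A check like this could be used to measure how much of a generated sequence is code-tier versus prose-tier tokens, for example when inspecting samples by hand.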
+ ## Usage
+
+ ```python
+ import sys
+
+ import torch
+
+ # Make the repo-local modules importable when running from the repo root
+ sys.path.insert(0, ".")
+ from bilingual_tokenizer import BilingualTokenizer
+ from train_opcode_llm import OpcodeGPT, generate_asm
+
+ # Load tokenizer
+ tokenizer = BilingualTokenizer.load("bilingual_tokenizer_v4/")
+
+ # Load model (V8 example)
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ checkpoint = torch.load("bilingual_llm_v8_ckpt.pt", map_location=device)
+ m_args = checkpoint["args"]  # hyperparameters saved alongside the weights
+
+ model = OpcodeGPT(
+     vocab_size=m_args["vocab_size"],
+     n_embd=m_args["embd"],
+     n_head=m_args["heads"],
+     n_layer=m_args["layers"],
+     block_size=m_args["context_len"],
+ ).to(device)
+ model.load_state_dict(checkpoint["model"])
+ model.eval()
+
+ # Generate from natural language prompt
+ asm = generate_asm(
+     model,
+     tokenizer,
+     "; Draw a red circle at the center",
+     device,
+     max_tokens=256,
+     temperature=0.7,
+ )
+ print(asm)
+ ```
+
+ ## Training Data
+
+ Trained on 5,211 annotated GeOS assembly programs (211 real + ~5,000 synthetic). Each program includes a natural-language description comment. Dataset: `bilingual_dataset.npz` (7,824 samples, 2.91M tokens; windows of 1024 tokens taken with a stride of 512).
+
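The context and stride figures above suggest overlapping windows over tokenized programs. The sketch below shows one common way to cut such windows; `sliding_windows` is a hypothetical helper, not the function used to build `bilingual_dataset.npz`, and how the real pipeline handles short programs, padding, or program boundaries is not stated in the README.

```python
def sliding_windows(tokens, context=1024, stride=512):
    """Split a token list into windows of at most `context` tokens.

    Consecutive windows start `stride` tokens apart, so with
    stride < context they overlap. The final window may be shorter
    than `context`; padding is left to the caller.
    """
    if context <= 0 or stride <= 0:
        raise ValueError("context and stride must be positive")
    if not tokens:
        return []
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + context])
        # Stop once the current window reaches the end of the sequence.
        if start + context >= len(tokens):
            break
        start += stride
    return windows
```

With a stride of half the context length, every token away from the edges appears in two windows, which is one reason the sample count can exceed the number of source programs.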
+ ## Geometry OS
+
+ PixelGPT targets the Geometry OS bytecode VM -- a pixel-native operating system with 150+ opcodes, 32 registers, and a 256x256 RGB framebuffer. Programs are written in a custom assembly language and assembled to bytecode.
+
+ Repo: https://github.com/tdw419/geometry_os
+
+ ## Status
+
+ **Research preview.** The model generates syntactically plausible assembly but does not yet produce consistently working programs. Active development focuses on training data quality and constrained decoding.
+
+ ## Developed by
+
+ Built with the [Hermes Agent](https://github.com/nousresearch/hermes-agent) autonomous development pipeline on an NVIDIA RTX 5090.