# BitPixelLM
BitPixelLM is a small autoregressive language model trained to generate 32×32 pixel art from short text prompts. It was built entirely from scratch as a research and learning project, with no pre-trained weights or external image encoders.
Example prompts:

- `a red pixel art sword`
- `a blue pixel art knight`
- `a green pixel art dragon`
## Model Architecture

BitPixelLM is a text-conditioned autoregressive decoder based on BitNet b1.58. Instead of full-precision weights, the decoder uses ternary weights {−1, 0, +1}, so each decoder weight carries ~1.58 bits (log₂ 3) rather than 32, making the model extremely compact.
| Component | Details |
|---|---|
| Parameters | 7,382,274 (~7.4M total) |
| Decoder layers | 6 (BitNet b1.58 — ternary weights) |
| Text encoder layers | 3 (standard FP32 transformer) |
| Model dimension | 256 |
| Attention heads | 8 |
| Feed-forward dim | 512 |
| Weight format | ~75% ternary (1.58 bits/weight), ~25% FP32 |
| Output | 32×32 RGB pixel art (256-color palette) |
Key design choices:
- BitLinear b1.58: decoder attention and feed-forward weights are quantized to {−1, 0, +1} during the forward pass via `round(clamp(W / mean(|W|), −1, 1))`, the b1.58 absmean scheme (sketched after this list). Embeddings, norms, and the text encoder remain FP32.
- RMSNorm instead of LayerNorm (pre-norm architecture).
- SwiGLU activation in feed-forward blocks.
- 2D positional encoding: the 32×32 pixel grid is encoded with separate row/column sinusoidal embeddings, injected at each decoder layer.
- Cross-attention: the decoder attends to text encoder outputs at every layer.
- Palette tokenization: each pixel is one discrete token from a learned 256-color palette. Generation is autoregressive over 1,024 pixel tokens (32×32) per image.
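For concreteness, here is a minimal sketch of the absmean quantization a BitLinear layer performs. The class below is illustrative only, not the project's actual `BitLinear` implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinear(nn.Linear):
    """Linear layer with ternary {-1, 0, +1} weights on the forward pass.

    Absmean quantization (BitNet b1.58): scale weights by their mean
    absolute value, round, clip to [-1, 1]. A straight-through estimator
    lets gradients flow to the underlying FP32 weights during training.
    """

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gamma = self.weight.abs().mean().clamp(min=1e-5)   # absmean scale
        w_q = (self.weight / gamma).round().clamp(-1, 1)   # ternary values
        # Straight-through estimator: the forward pass uses the quantized
        # (and rescaled) weights, the backward pass sees the FP32 weights.
        w = self.weight + (w_q * gamma - self.weight).detach()
        return F.linear(x, w, self.bias)
```

A drop-in like this replaces `nn.Linear` in the decoder's attention and feed-forward projections; at inference the ternary weights can be packed at ~1.58 bits each, which is where the weight-format figure in the table above comes from.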
## Training

The model was trained on a fully synthetic, procedurally generated dataset of labeled 32×32 pixel art sprites.
| Dataset | Samples | Categories | Vocabulary |
|---|---|---|---|
| v3 (current) | 23,648 | 199 | 222 words |
Categories include characters, creatures, items, terrain, food, vehicles, buildings, nature, UI icons, furniture, and instruments.
Each image has a short descriptive label (e.g. `a red pixel art sword`) used as the text conditioning signal.
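A hypothetical generator in that spirit (the real dataset script is not shown; `COLORS` and `OBJECTS` are just a slice of the actual vocabulary):

```python
import random

COLORS = ["red", "blue", "green", "yellow", "orange", "purple", "gold"]
OBJECTS = ["sword", "shield", "knight", "dragon", "castle", "potion"]

def make_label() -> str:
    # Matches the training label pattern: "a <color> pixel art <object>"
    return f"a {random.choice(COLORS)} pixel art {random.choice(OBJECTS)}"
```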
Training configuration:
| Setting | Value |
|---|---|
| Epochs | 60 |
| Batch size | 32 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, wd=0.01) |
| Learning rate | 5×10⁻⁴ with cosine annealing + 500-step warmup |
| Hardware | NVIDIA RTX 3080 (10 GB VRAM) |
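In PyTorch terms, the optimizer and schedule above correspond roughly to the following sketch; `model` and `steps_per_epoch` are assumed to exist, and the project's actual training script may differ:

```python
import torch

optimizer = torch.optim.AdamW(
    model.parameters(), lr=5e-4, betas=(0.9, 0.95), weight_decay=0.01
)

total_steps = 60 * steps_per_epoch  # steps_per_epoch depends on dataset size

# 500 linear warmup steps, then cosine annealing for the remainder.
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1e-3, total_iters=500
)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_steps - 500
)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[500]
)
```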
Results (v3 dataset, best at epoch 32):
| Metric | Value |
|---|---|
| Best validation loss | 0.4015 |
| Perplexity | ~1.49 |
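Perplexity here is the exponentiated validation loss: exp(0.4015) ≈ 1.49, i.e. per pixel the model is effectively choosing among roughly 1.5 equally likely palette tokens.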
## Usage

### Requirements

```
torch
numpy
Pillow
```
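Install with `pip install torch numpy Pillow`; any reasonably recent versions should work.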
### Load and generate

```python
import json, torch
from PIL import Image
from model.tokenizer import PaletteTokenizer
from model.text_encoder import TextTokenizer, TextEncoder
from model.bit_pixel_decoder import BitPixelLMDecoder, BitPixelLM

# Load tokenizers
palette_tok = PaletteTokenizer(palette_path="palette_256.npy")
with open("vocab.json") as f:
    vocab = json.load(f)
text_tok = TextTokenizer(vocab)

# Build model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
text_encoder = TextEncoder(
    vocab_size=text_tok.vocab_size, d_model=256, nhead=8,
    num_layers=3, dim_feedforward=512, max_seq_len=32,
)
pixel_decoder = BitPixelLMDecoder(
    vocab_size=palette_tok.vocab_size, d_model=256, nhead=8,
    num_layers=6, dim_feedforward=512, img_size=32,
)
model = BitPixelLM(text_encoder, pixel_decoder).to(device)

# Load weights
ckpt = torch.load("best.pt", map_location=device, weights_only=False)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()

# Generate
prompt = "a red pixel art sword"
text_tokens = text_tok.encode(prompt).unsqueeze(0).to(device)
with torch.no_grad():
    generated = model.generate(
        text_tokens,
        sos_token=palette_tok.sos_token,
        eos_token=palette_tok.eos_token,
        temperature=0.8,
        top_k=40,
        top_p=0.9,
    )

# Decode to image
img_array = palette_tok.decode_tokens(generated[0].cpu().tolist())
img = Image.fromarray(img_array, "RGB").resize((256, 256), Image.NEAREST)
img.save("output.png")
```
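The `temperature`, `top_k`, and `top_p` values above are reasonable defaults: lowering the temperature (or tightening top-k/top-p) makes sprites more repeatable and on-pattern, while raising them adds variety at the cost of coherence.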
## Vocabulary

The model knows 222 words covering colors, object types, and descriptors. Prompts that follow the training label pattern (`a <color> pixel art <object>`) work best; words outside the vocabulary are silently mapped to `<unk>` (a quick pre-check is sketched after the word list below).
Sample supported words:
red, blue, green, yellow, orange, purple, gold, dark, teal, silver ·
sword, shield, bow, axe, staff, wand, armour ·
knight, wizard, archer, dragon, goblin, skeleton, ghost, vampire ·
castle, tree, flower, mushroom, chest, potion, gem, key, crown, ship, horse, and more.
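Because unknown words degrade output silently, it can help to pre-check a prompt against `vocab.json` (a small sketch, assuming the file maps words to token ids as in the usage example above):

```python
import json

with open("vocab.json") as f:
    vocab = json.load(f)

prompt = "a crimson pixel art sword"
unknown = [w for w in prompt.lower().split() if w not in vocab]
if unknown:
    print(f"Words that will map to <unk>: {unknown}")  # e.g. ['crimson']
```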
## Limitations

- Outputs are 32×32 pixels. Upscale with `Image.NEAREST` to preserve the pixel art look.
- Trained entirely on procedurally generated synthetic data; the model has no knowledge of real-world artwork.
- Generation quality is best for prompts close to training label patterns.
- Color fidelity is bounded by the 256-color learned palette.
## Citations

```bibtex
@article{wang2023bitnet,
  title   = {BitNet: Scaling 1-bit Transformers for Large Language Models},
  author  = {Wang, Hongyu and others},
  journal = {arXiv preprint arXiv:2310.11453},
  year    = {2023}
}

@article{ma2024bitnet158,
  title   = {The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits},
  author  = {Ma, Shuming and others},
  journal = {arXiv preprint arXiv:2402.17764},
  year    = {2024}
}
```
## License
MIT