---
language:
  - en
license: mit
tags:
  - pixel-art
  - image-generation
  - bitnet
  - ternary
  - autoregressive
  - text-to-image
pipeline_tag: text-to-image
---

# BitPixelLM

BitPixelLM is a small autoregressive language model trained to generate 32×32 pixel art from short text prompts. It was built entirely from scratch as a research and learning project, with no pre-trained weights or external image encoders.

Example prompts: `a red pixel art sword` · `a blue pixel art knight` · `a green pixel art dragon`


## Model Architecture

BitPixelLM is a text-conditioned autoregressive decoder based on BitNet b1.58. Instead of full-precision weights, the decoder uses ternary weights {−1, 0, +1}, making it extremely memory-efficient: about 1.58 bits per decoder weight instead of 32.

| Component | Details |
|---|---|
| Parameters | 7,382,274 (~7.4M total) |
| Decoder layers | 6 (BitNet b1.58, ternary weights) |
| Text encoder layers | 3 (standard FP32 transformer) |
| Model dimension | 256 |
| Attention heads | 8 |
| Feed-forward dim | 512 |
| Weight format | ~75% ternary (1.58 bits/weight), ~25% FP32 |
| Output | 32×32 RGB pixel art (256-color palette) |

Key design choices:

- BitLinear b1.58: decoder attention and feed-forward weights are quantized to {−1, 0, +1} during the forward pass via absmean quantization, round(clip(W / mean(|W|), −1, 1)), following BitNet b1.58. Embeddings, norms, and the text encoder remain FP32. A sketch of this layer follows the list.
- RMSNorm instead of LayerNorm (pre-norm architecture).
- SwiGLU activation in feed-forward blocks.
- 2D positional encoding: the 32×32 pixel grid is encoded with separate row/column sinusoidal embeddings, injected at each decoder layer.
- Cross-attention: the decoder attends to text encoder outputs at every layer.
- Palette tokenization: each pixel is one discrete token from a learned 256-color palette. Generation is autoregressive over 1,024 pixel tokens (32×32) per image.
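
For concreteness, here is a minimal sketch of the absmean ternary quantization described above. The class name `BitLinear` and the straight-through-estimator detail are illustrative assumptions rather than this repository's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinear(nn.Linear):
    """Sketch of a BitNet b1.58-style linear layer (illustrative, not this repo's code)."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        gamma = w.abs().mean().clamp(min=1e-5)      # absmean scale
        w_q = (w / gamma).round().clamp(-1, 1)      # ternary {-1, 0, +1}
        # Straight-through estimator: quantized weights drive the forward pass,
        # while gradients flow to the full-precision master weights.
        w_eff = w + (w_q * gamma - w).detach()
        return F.linear(x, w_eff, self.bias)
```

At inference time only the ternary values plus one FP32 scale per weight tensor need to be stored, which is where the ~1.58 bits/weight figure in the table above comes from.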

## Training

The model was trained on a fully synthetic, procedurally generated dataset of labeled 32×32 pixel art sprites.

| Dataset | Samples | Categories | Vocabulary |
|---|---|---|---|
| v3 (current) | 23,648 | 199 | 222 words |

Categories include characters, creatures, items, terrain, food, vehicles, buildings, nature, UI icons, furniture, and instruments. Each image has a short descriptive label (e.g. a red pixel art sword) used as the text conditioning signal.

Training configuration:

| Setting | Value |
|---|---|
| Epochs | 60 |
| Batch size | 32 |
| Optimizer | AdamW (β₁ = 0.9, β₂ = 0.95, weight decay 0.01) |
| Learning rate | 5×10⁻⁴, cosine annealing with 500-step warmup |
| Hardware | NVIDIA RTX 3080 (10 GB VRAM) |
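
The card does not spell out the scheduler implementation. A minimal sketch of one common way to realize "5×10⁻⁴ with cosine annealing + 500-step warmup" in PyTorch, assuming linear warmup and estimating the total step count from the table above (the exact train/val split is not stated):

```python
import math
import torch

warmup_steps = 500
total_steps = 60 * (23_648 // 32)   # epochs x batches/epoch; assumes the full dataset each epoch

def lr_lambda(step: int) -> float:
    # Linear warmup, then cosine decay toward zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

params = [torch.nn.Parameter(torch.zeros(1))]   # stand-in for model.parameters()
optimizer = torch.optim.AdamW(params, lr=5e-4, betas=(0.9, 0.95), weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```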

Results (v3 dataset, best at epoch 32):

| Metric | Value |
|---|---|
| Best validation loss | 0.4015 |
| Perplexity | ~1.49 |
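
The perplexity figure follows directly from the loss: perplexity = exp(cross-entropy), and exp(0.4015) ≈ 1.49, i.e. at each pixel the model is effectively choosing among about 1.5 equally likely palette tokens.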

## Usage

### Requirements

```
torch
numpy
Pillow
```

### Load and generate

```python
import json, torch
from PIL import Image
from model.tokenizer import PaletteTokenizer
from model.text_encoder import TextTokenizer, TextEncoder
from model.bit_pixel_decoder import BitPixelLMDecoder, BitPixelLM

# Load tokenizers
palette_tok = PaletteTokenizer(palette_path="palette_256.npy")
with open("vocab.json") as f:
    vocab = json.load(f)
text_tok = TextTokenizer(vocab)

# Build model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
text_encoder = TextEncoder(
    vocab_size=text_tok.vocab_size, d_model=256, nhead=8,
    num_layers=3, dim_feedforward=512, max_seq_len=32,
)
pixel_decoder = BitPixelLMDecoder(
    vocab_size=palette_tok.vocab_size, d_model=256, nhead=8,
    num_layers=6, dim_feedforward=512, img_size=32,
)
model = BitPixelLM(text_encoder, pixel_decoder).to(device)

# Load weights
ckpt = torch.load("best.pt", map_location=device, weights_only=False)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()

# Generate
prompt = "a red pixel art sword"
text_tokens = text_tok.encode(prompt).unsqueeze(0).to(device)
with torch.no_grad():
    generated = model.generate(
        text_tokens,
        sos_token=palette_tok.sos_token,
        eos_token=palette_tok.eos_token,
        temperature=0.8,
        top_k=40,
        top_p=0.9,
    )

# Decode to image
img_array = palette_tok.decode_tokens(generated[0].cpu().tolist())
img = Image.fromarray(img_array, "RGB").resize((256, 256), Image.NEAREST)
img.save("output.png")
```
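
The `temperature`, `top_k`, and `top_p` arguments are the usual sampling knobs. As a reference for what they do, here is a generic, self-contained sketch of how the three filters combine on a single logits vector; `sample_token` is a hypothetical helper for illustration, not the internals of this repo's `generate`:

```python
import torch

def sample_token(logits: torch.Tensor, temperature: float = 0.8,
                 top_k: int = 40, top_p: float = 0.9) -> int:
    """Temperature + top-k + top-p (nucleus) sampling over one 1-D logits vector."""
    logits = logits / temperature
    # Top-k: keep only the k highest-scoring tokens.
    if top_k > 0:
        k = min(top_k, logits.size(-1))
        kth_best = torch.topk(logits, k).values[-1]
        logits = logits.masked_fill(logits < kth_best, float("-inf"))
    # Top-p: keep the smallest prefix of tokens whose cumulative probability
    # reaches p (tokens whose preceding mass already exceeds p are dropped).
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    exclusive_cum = torch.cumsum(sorted_probs, dim=-1) - sorted_probs
    probs[sorted_idx[exclusive_cum > top_p]] = 0.0
    probs = probs / probs.sum()
    return int(torch.multinomial(probs, 1).item())
```

Lower temperature and smaller `top_k` push generations toward the most typical sprite for a prompt; the defaults above match the example call.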

## Vocabulary

The model knows 222 words covering colors, object types, and descriptors. Prompts that follow the training label pattern (`a <color> pixel art <object>`) work best. Words outside the vocabulary are silently mapped to `<unk>`.

Sample supported words: red, blue, green, yellow, orange, purple, gold, dark, teal, silver · sword, shield, bow, axe, staff, wand, armour · knight, wizard, archer, dragon, goblin, skeleton, ghost, vampire · castle, tree, flower, mushroom, chest, potion, gem, key, crown, ship, horse, and more.
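
Because prompts work best in the training pattern, batch prompt construction is plain string templating. A small sketch, assuming `vocab.json` maps words to token ids as the usage code above suggests:

```python
import json

with open("vocab.json") as f:
    vocab = json.load(f)   # assumed word -> id mapping

# Prompts following the training label pattern "a <color> pixel art <object>".
colors = ["red", "blue", "green", "gold"]
objects = ["sword", "shield", "dragon", "castle"]
prompts = [f"a {c} pixel art {o}" for c in colors for o in objects]

# Any word missing from the vocabulary would be encoded as <unk>.
oov = [w for p in prompts for w in p.split() if w not in vocab]
assert not oov, f"out-of-vocabulary words: {oov}"
```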


## Limitations

- Outputs are 32×32 pixels. Upscale with `Image.NEAREST` to preserve the pixel art look.
- Trained entirely on procedurally generated synthetic data; the model has no knowledge of real-world artwork.
- Generation quality is best for prompts close to the training label patterns.
- Color fidelity is bounded by the learned 256-color palette.

## Citations

```bibtex
@article{wang2023bitnet,
  title={BitNet: Scaling 1-bit Transformers for Large Language Models},
  author={Wang, Hongyu and others},
  journal={arXiv preprint arXiv:2310.11453},
  year={2023}
}

@article{ma2024bitnet158,
  title={The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits},
  author={Ma, Shuming and others},
  journal={arXiv preprint arXiv:2402.17764},
  year={2024}
}
```

## License

MIT