---
language:
- en
license: mit
tags:
- pixel-art
- image-generation
- bitnet
- ternary
- autoregressive
- text-to-image
pipeline_tag: text-to-image
---

# BitPixelLM

BitPixelLM is a small autoregressive language model trained to generate **32×32 pixel art** from short text prompts. It was built entirely from scratch as a research and learning project, with no pre-trained weights or external image encoders.

> **Example prompts:** `a red pixel art sword` · `a blue pixel art knight` · `a green pixel art dragon`

---

## Model Architecture

BitPixelLM is a text-conditioned autoregressive decoder based on [BitNet b1.58](https://arxiv.org/abs/2402.17764). Instead of full-precision weights, the decoder uses **ternary weights {−1, 0, +1}**, making it extremely parameter-efficient.

| Component | Details |
|---|---|
| Parameters | 7,382,274 (~7.4M total) |
| Decoder layers | 6 (BitNet b1.58, ternary weights) |
| Text encoder layers | 3 (standard FP32 transformer) |
| Model dimension | 256 |
| Attention heads | 8 |
| Feed-forward dim | 512 |
| Weight format | ~75% ternary (1.58 bits/weight), ~25% FP32 |
| Output | 32×32 RGB pixel art (256-color palette) |

**Key design choices:**

- **BitLinear b1.58**: decoder attention and feed-forward weights are quantized to {−1, 0, +1} during the forward pass via `sign(W / mean(|W|))`. Embeddings, norms, and the text encoder remain FP32. (A quantization sketch follows this list.)
- **RMSNorm** instead of LayerNorm (pre-norm architecture).
- **SwiGLU** activation in feed-forward blocks.
- **2D positional encoding**: the 32×32 pixel grid is encoded with separate row/column sinusoidal embeddings, injected at each decoder layer (see the positional-encoding sketch below).
- **Cross-attention**: the decoder attends to text encoder outputs at every layer.
- **Palette tokenization**: each pixel is one discrete token from a learned 256-color palette. Generation is autoregressive over 1,024 pixel tokens (32×32) per image (see the tokenization sketch below).
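The repository's `BitLinear` code is not reproduced on this card, so the following is a minimal sketch of b1.58-style ternary quantization. It assumes the absmean round-clip recipe from the referenced paper (which, unlike a pure `sign`, also produces zeros) and a standard straight-through estimator; the actual layer in this repo may differ in both details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinear(nn.Linear):
    """Linear layer with ternary {-1, 0, +1} weights (b1.58-style absmean
    quantization, assumed). FP32 master weights are kept for the optimizer;
    a straight-through estimator passes gradients through the rounding."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        # Absmean scale: gamma = mean(|W|); clamp guards against division by zero.
        gamma = w.abs().mean().clamp(min=1e-5)
        # Round-clip to the ternary set, then rescale back by gamma.
        w_q = (w / gamma).round().clamp(-1, 1) * gamma
        # Straight-through estimator: forward uses w_q, backward sees identity.
        w_q = w + (w_q - w).detach()
        return F.linear(x, w_q, self.bias)
```

Swapping `nn.Linear` for such a layer in the decoder's attention and feed-forward projections, while leaving embeddings, norms, and the text encoder in FP32, would roughly reproduce the ~75%/25% weight split reported above.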
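Likewise, a minimal sketch of the separate row/column sinusoidal embeddings for the 32×32 grid. The even split of `d_model` between the row and column halves, the concatenation, and the raster (row-major) position ordering are illustration assumptions, not details confirmed by this repo.

```python
import math
import torch

def sinusoidal_table(length: int, dim: int) -> torch.Tensor:
    """Standard 1D sinusoidal embedding table of shape (length, dim)."""
    pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / dim))
    table = torch.zeros(length, dim)
    table[:, 0::2] = torch.sin(pos * div)
    table[:, 1::2] = torch.cos(pos * div)
    return table

def grid_positional_encoding(img_size: int = 32, d_model: int = 256) -> torch.Tensor:
    """Concatenate a row embedding and a column embedding for each of the
    img_size * img_size pixel positions -> (img_size**2, d_model)."""
    half = d_model // 2
    rows = sinusoidal_table(img_size, half)                # (32, 128)
    cols = sinusoidal_table(img_size, half)                # (32, 128)
    # Pixel (r, c) gets [row_embedding(r) ; col_embedding(c)].
    row_part = rows.unsqueeze(1).expand(-1, img_size, -1)  # (32, 32, 128)
    col_part = cols.unsqueeze(0).expand(img_size, -1, -1)  # (32, 32, 128)
    # Flatten in raster order so position index r * 32 + c matches
    # the autoregressive generation order over 1,024 pixel tokens.
    return torch.cat([row_part, col_part], dim=-1).reshape(img_size * img_size, d_model)
```

Per the design notes above, a table like this would be added to the decoder's hidden states at every layer rather than only at the input.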
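Finally, a minimal sketch of palette tokenization via nearest palette color, using the `palette_256.npy` file shipped with the model. The real `PaletteTokenizer` also manages the `sos`/`eos` special tokens referenced in the Usage example; those are omitted here.

```python
import numpy as np

# Learned 256-color palette, shape (256, 3) RGB (assumed layout).
palette = np.load("palette_256.npy").astype(np.float32)

def encode_image(img: np.ndarray) -> np.ndarray:
    """(32, 32, 3) uint8 RGB -> (1024,) palette-index tokens, row-major."""
    flat = img.reshape(-1, 3).astype(np.float32)                  # (1024, 3)
    # Squared Euclidean distance from every pixel to every palette color.
    dist = ((flat[:, None, :] - palette[None, :, :]) ** 2).sum(-1)  # (1024, 256)
    return dist.argmin(axis=1)

def decode_tokens(tokens: np.ndarray) -> np.ndarray:
    """(1024,) palette-index tokens -> (32, 32, 3) uint8 RGB image."""
    return palette[tokens].reshape(32, 32, 3).astype(np.uint8)
```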
---

## Training

The model was trained on a fully synthetic, procedurally generated dataset of labeled 32×32 pixel art sprites.

| Dataset | Samples | Categories | Vocabulary |
|---|---|---|---|
| v3 (current) | 23,648 | 199 | 222 words |

Categories include characters, creatures, items, terrain, food, vehicles, buildings, nature, UI icons, furniture, and instruments. Each image has a short descriptive label (e.g. `a red pixel art sword`) used as the text conditioning signal.

**Training configuration:**

| Setting | Value |
|---|---|
| Epochs | 60 |
| Batch size | 32 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, wd=0.01) |
| Learning rate | 5×10⁻⁴ with cosine annealing + 500-step warmup |
| Hardware | NVIDIA RTX 3080 (10 GB VRAM) |

**Results (v3 dataset, best at epoch 32):**

| Metric | Value |
|---|---|
| Best validation loss | 0.4015 |
| Perplexity | ~1.49 |

---

## Usage

### Requirements

```
torch
numpy
Pillow
```

### Load and generate

```python
import json, torch
from PIL import Image

from model.tokenizer import PaletteTokenizer
from model.text_encoder import TextTokenizer, TextEncoder
from model.bit_pixel_decoder import BitPixelLMDecoder, BitPixelLM

# Load tokenizers
palette_tok = PaletteTokenizer(palette_path="palette_256.npy")
with open("vocab.json") as f:
    vocab = json.load(f)
text_tok = TextTokenizer(vocab)

# Build model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
text_encoder = TextEncoder(
    vocab_size=text_tok.vocab_size,
    d_model=256,
    nhead=8,
    num_layers=3,
    dim_feedforward=512,
    max_seq_len=32,
)
pixel_decoder = BitPixelLMDecoder(
    vocab_size=palette_tok.vocab_size,
    d_model=256,
    nhead=8,
    num_layers=6,
    dim_feedforward=512,
    img_size=32,
)
model = BitPixelLM(text_encoder, pixel_decoder).to(device)

# Load weights
ckpt = torch.load("best.pt", map_location=device, weights_only=False)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()

# Generate
prompt = "a red pixel art sword"
text_tokens = text_tok.encode(prompt).unsqueeze(0).to(device)
with torch.no_grad():
    generated = model.generate(
        text_tokens,
        sos_token=palette_tok.sos_token,
        eos_token=palette_tok.eos_token,
        temperature=0.8,
        top_k=40,
        top_p=0.9,
    )

# Decode to image
img_array = palette_tok.decode_tokens(generated[0].cpu().tolist())
img = Image.fromarray(img_array, "RGB").resize((256, 256), Image.NEAREST)
img.save("output.png")
```

### Vocabulary

The model knows 222 words covering colors, object types, and descriptors. Prompts that follow the training label pattern (`a <color> pixel art <object>`) work best. Words outside the vocabulary are silently mapped to `<unk>`.

Sample supported words: `red`, `blue`, `green`, `yellow`, `orange`, `purple`, `gold`, `dark`, `teal`, `silver` · `sword`, `shield`, `bow`, `axe`, `staff`, `wand`, `armour` · `knight`, `wizard`, `archer`, `dragon`, `goblin`, `skeleton`, `ghost`, `vampire` · `castle`, `tree`, `flower`, `mushroom`, `chest`, `potion`, `gem`, `key`, `crown`, `ship`, `horse`, and more.

---

## Limitations

- Outputs are 32×32 pixels. Upscale with `Image.NEAREST` to preserve the pixel art look.
- Trained entirely on procedurally generated synthetic data; the model has no knowledge of real-world artwork.
- Generation quality is best for prompts close to the training label patterns.
- Color fidelity is bounded by the learned 256-color palette.

---

## Citations

```bibtex
@article{wang2023bitnet,
  title={BitNet: Scaling 1-bit Transformers for Large Language Models},
  author={Wang, Hongyu and others},
  journal={arXiv preprint arXiv:2310.11453},
  year={2023}
}

@article{ma2024bitnet158,
  title={The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits},
  author={Ma, Shuming and others},
  journal={arXiv preprint arXiv:2402.17764},
  year={2024}
}
```

---

## License

MIT