---
language:
- en
license: mit
tags:
- pixel-art
- image-generation
- bitnet
- ternary
- autoregressive
- text-to-image
pipeline_tag: text-to-image
---
|
|
|
| # BitPixelLM
|
|
|
| BitPixelLM is a small autoregressive language model trained to generate **32×32 pixel art** from short text prompts.
|
| It was built entirely from scratch as a research and learning project, with no pre-trained weights or external image encoders.
|
|
|
| > **Example prompts:** `a red pixel art sword` · `a blue pixel art knight` · `a green pixel art dragon`
|
|
|
| ---
|
|
|
| ## Model Architecture
|
|
|
| BitPixelLM is a text-conditioned autoregressive decoder based on [BitNet b1.58](https://arxiv.org/abs/2402.17764).
|
Instead of full-precision weights, the decoder uses **ternary weights {−1, 0, +1}**, making it extremely memory-efficient: each quantized weight carries at most log₂3 ≈ 1.58 bits of information instead of 32.
|
|
|
| Component | Details |
|---|---|
| Parameters | 7,382,274 (~7.4M total) |
| Decoder layers | 6 (BitNet b1.58, ternary weights) |
| Text encoder layers | 3 (standard FP32 transformer) |
| Model dimension | 256 |
| Attention heads | 8 |
| Feed-forward dim | 512 |
| Weight format | ~75% ternary (1.58 bits/weight), ~25% FP32 |
| Output | 32×32 RGB pixel art (256-color palette) |
|
|
|
| **Key design choices:**
|
- **BitLinear b1.58**: decoder attention and feed-forward weights are quantized to {−1, 0, +1} during the forward pass via the absmean rule `round(clip(W / mean(|W|), −1, 1))`. Embeddings, norms, and the text encoder remain FP32. A minimal sketch follows this list.
|
| - **RMSNorm** instead of LayerNorm (pre-norm architecture).
|
| - **SwiGLU** activation in feed-forward blocks.
|
| - **2D positional encoding**: the 32×32 pixel grid is encoded with separate row/column sinusoidal embeddings, injected at each decoder layer.
|
| - **Cross-attention**: the decoder attends to text encoder outputs at every layer.
|
| - **Palette tokenization**: each pixel is one discrete token from a learned 256-color palette. Generation is autoregressive over 1,024 pixel tokens (32×32) per image.
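The repository's `BitLinear` code is not reproduced here; the following is a minimal PyTorch sketch of the absmean quantization described above, using a straight-through estimator so gradients reach the latent FP32 weights. Class and variable names are illustrative, and the BitNet b1.58 paper's 8-bit activation quantization is omitted for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinear(nn.Module):
    """Ternary {-1, 0, +1} linear layer in the style of BitNet b1.58."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.kaiming_uniform_(self.weight, a=5 ** 0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        gamma = w.abs().mean().clamp(min=1e-5)      # absmean scale
        w_q = torch.round(w / gamma).clamp(-1, 1)   # ternary values
        # Straight-through estimator: the forward pass uses the quantized
        # weights (rescaled by gamma), the backward pass treats the
        # quantization as identity so the FP32 latents keep learning.
        w_ste = w + (w_q * gamma - w).detach()
        return F.linear(x, w_ste)
```

Rescaling by `gamma` preserves the overall weight magnitude, which keeps training stable; at inference the ternary values can be packed at ~1.58 bits each, which is where the memory savings in the table above come from.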
|
|
|
| ---
|
|
|
| ## Training
|
|
|
The model was trained on a fully synthetic, procedurally generated dataset of labeled 32×32 pixel art sprites.
|
|
|
| Dataset | Samples | Categories | Vocabulary |
|---|---|---|---|
| v3 (current) | 23,648 | 199 | 222 words |
|
|
|
| Categories include characters, creatures, items, terrain, food, vehicles, buildings, nature, UI icons, furniture, and instruments.
|
| Each image has a short descriptive label (e.g. `a red pixel art sword`) used as the text conditioning signal.
|
|
|
| **Training configuration:**
|
|
|
| Setting | Value |
|---|---|
| Epochs | 60 |
| Batch size | 32 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, wd=0.01) |
| Learning rate | 5×10⁻⁴ with cosine annealing + 500-step warmup |
| Hardware | NVIDIA RTX 3080 (10 GB VRAM) |
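A rough PyTorch sketch of this setup; the stand-in `model` and the `total_steps` estimate (epochs × batches per epoch, before any train/val split) are assumptions, not values from the repo:

```python
import torch

model = torch.nn.Linear(8, 8)        # stand-in for the BitPixelLM model
total_steps = 60 * (23_648 // 32)    # assumed: epochs x batches per epoch
warmup_steps = 500

optimizer = torch.optim.AdamW(
    model.parameters(), lr=5e-4, betas=(0.9, 0.95), weight_decay=0.01
)
# Linear warmup over the first 500 steps, then cosine annealing.
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[
        torch.optim.lr_scheduler.LinearLR(
            optimizer, start_factor=1e-3, total_iters=warmup_steps
        ),
        torch.optim.lr_scheduler.CosineAnnealingLR(
            optimizer, T_max=total_steps - warmup_steps
        ),
    ],
    milestones=[warmup_steps],
)
# Call scheduler.step() once per optimizer step during training.
```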
|
|
|
| **Results (v3 dataset, best at epoch 32):**
|
|
|
| Metric | Value |
|---|---|
| Best validation loss | 0.4015 |
| Perplexity | ~1.49 (= exp(0.4015)) |
|
|
|
| ---
|
|
|
| ## Usage
|
|
|
| ### Requirements
|
|
|
```
torch
numpy
Pillow
```
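Install them with pip, for example:

```
pip install torch numpy Pillow
```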
|
|
|
| ### Load and generate
|
|
|
```python
import json

import torch
from PIL import Image

from model.tokenizer import PaletteTokenizer
from model.text_encoder import TextTokenizer, TextEncoder
from model.bit_pixel_decoder import BitPixelLMDecoder, BitPixelLM

# Load tokenizers
palette_tok = PaletteTokenizer(palette_path="palette_256.npy")
with open("vocab.json") as f:
    vocab = json.load(f)
text_tok = TextTokenizer(vocab)

# Build model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
text_encoder = TextEncoder(
    vocab_size=text_tok.vocab_size, d_model=256, nhead=8,
    num_layers=3, dim_feedforward=512, max_seq_len=32,
)
pixel_decoder = BitPixelLMDecoder(
    vocab_size=palette_tok.vocab_size, d_model=256, nhead=8,
    num_layers=6, dim_feedforward=512, img_size=32,
)
model = BitPixelLM(text_encoder, pixel_decoder).to(device)

# Load weights
ckpt = torch.load("best.pt", map_location=device, weights_only=False)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()

# Generate
prompt = "a red pixel art sword"
text_tokens = text_tok.encode(prompt).unsqueeze(0).to(device)
with torch.no_grad():
    generated = model.generate(
        text_tokens,
        sos_token=palette_tok.sos_token,
        eos_token=palette_tok.eos_token,
        temperature=0.8,
        top_k=40,
        top_p=0.9,
    )

# Decode to image
img_array = palette_tok.decode_tokens(generated[0].cpu().tolist())
img = Image.fromarray(img_array, "RGB").resize((256, 256), Image.NEAREST)
img.save("output.png")
```
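For intuition about what `PaletteTokenizer` is doing: palette tokenization amounts to nearest-color lookup, where encoding maps each RGB pixel to the index of its closest palette entry and decoding is a plain table lookup. A minimal NumPy sketch under that assumption (not the repo's implementation; the real tokenizer also handles the `sos`/`eos` tokens used above):

```python
import numpy as np

palette = np.load("palette_256.npy")  # assumed: (256, 3) uint8 RGB palette

def encode_image(img: np.ndarray) -> np.ndarray:
    """Map a (32, 32, 3) RGB image to 1024 palette-index tokens."""
    pixels = img.reshape(-1, 3).astype(np.int32)                    # (1024, 3)
    pal = palette.astype(np.int32)                                  # (256, 3)
    dists = ((pixels[:, None, :] - pal[None, :, :]) ** 2).sum(-1)   # (1024, 256)
    return dists.argmin(axis=1)                                     # token ids

def decode_tokens(tokens: np.ndarray) -> np.ndarray:
    """Inverse lookup: token ids back to a (32, 32, 3) RGB image."""
    return palette[tokens].reshape(32, 32, 3)
```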
|
|
|
| ### Vocabulary
|
|
|
| The model knows 222 words covering colors, object types, and descriptors. Prompts that follow the training label pattern (`a <color> pixel art <object>`) work best. Words outside the vocabulary are silently mapped to `<unk>`.
|
|
|
Sample supported words:
`red`, `blue`, `green`, `yellow`, `orange`, `purple`, `gold`, `dark`, `teal`, `silver` ·
`sword`, `shield`, `bow`, `axe`, `staff`, `wand`, `armour` ·
`knight`, `wizard`, `archer`, `dragon`, `goblin`, `skeleton`, `ghost`, `vampire` ·
`castle`, `tree`, `flower`, `mushroom`, `chest`, `potion`, `gem`, `key`, `crown`, `ship`, `horse`, and more.
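Since out-of-vocabulary words degrade quality silently, it can help to check a prompt against `vocab.json` before generating. A small helper (hypothetical, not part of the repo), assuming `vocab.json` is a flat word-to-id mapping:

```python
import json

with open("vocab.json") as f:
    vocab = json.load(f)  # assumed layout: {"word": token_id, ...}

def unknown_words(prompt: str) -> list[str]:
    """Words in `prompt` that the tokenizer would map to <unk>."""
    return [w for w in prompt.lower().split() if w not in vocab]

print(unknown_words("a crimson pixel art sword"))  # -> ["crimson"] if OOV
```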
|
|
|
| ---
|
|
|
| ## Limitations
|
|
|
| - Outputs are 32×32 pixels. Upscale with `Image.NEAREST` to preserve the pixel art look.
|
- Trained entirely on procedurally generated synthetic data; the model has seen no real-world artwork.
|
| - Generation quality is best for prompts close to training label patterns.
|
| - Color fidelity is bounded by the 256-color learned palette.
|
|
|
| ---
|
|
|
| ## Citations
|
|
|
```bibtex
@article{wang2023bitnet,
  title={BitNet: Scaling 1-bit Transformers for Large Language Models},
  author={Wang, Hongyu and others},
  journal={arXiv preprint arXiv:2310.11453},
  year={2023}
}

@article{ma2024bitnet158,
  title={The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits},
  author={Ma, Shuming and others},
  journal={arXiv preprint arXiv:2402.17764},
  year={2024}
}
```
|
|
|
| ---
|
|
|
| ## License
|
|
|
| MIT
|
|
|