---
language:
- en
license: mit
tags:
- pixel-art
- image-generation
- bitnet
- ternary
- autoregressive
- text-to-image
pipeline_tag: text-to-image
---
# BitPixelLM
BitPixelLM is a small autoregressive language model trained to generate **32×32 pixel art** from short text prompts.
It was built entirely from scratch as a research and learning project, with no pre-trained weights or external image encoders.
> **Example prompts:** `a red pixel art sword` · `a blue pixel art knight` · `a green pixel art dragon`
---
## Model Architecture
BitPixelLM is a text-conditioned autoregressive decoder based on [BitNet b1.58](https://arxiv.org/abs/2402.17764).
Instead of full-precision weights, the decoder uses **ternary weights {−1, 0, +1}**, making it extremely parameter-efficient.
| Component | Details |
|---|---|
| Parameters | 7,382,274 (~7.4M total) |
| Decoder layers | 6 (BitNet b1.58 — ternary weights) |
| Text encoder layers | 3 (standard FP32 transformer) |
| Model dimension | 256 |
| Attention heads | 8 |
| Feed-forward dim | 512 |
| Weight format | ~75% ternary (1.58 bits/weight), ~25% FP32 |
| Output | 32×32 RGB pixel art (256-color palette) |
**Key design choices:**
- **BitLinear b1.58**: decoder attention and feed-forward weights are quantized to {−1, 0, +1} during the forward pass using absmean quantization, `round(clamp(W / mean(|W|), −1, 1))`, the BitNet b1.58 rule (see the sketch after this list). Embeddings, norms, and the text encoder remain FP32.
- **RMSNorm** instead of LayerNorm (pre-norm architecture).
- **SwiGLU** activation in feed-forward blocks.
- **2D positional encoding**: the 32×32 pixel grid is encoded with separate row/column sinusoidal embeddings, injected at each decoder layer (also sketched below).
- **Cross-attention**: the decoder attends to text encoder outputs at every layer.
- **Palette tokenization**: each pixel is one discrete token from a learned 256-color palette. Generation is autoregressive over 1,024 pixel tokens (32×32) per image.
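To make these choices concrete, here is a minimal sketch of a b1.58-style `BitLinear` layer. It is an illustration under standard BitNet b1.58 assumptions, not the repository's exact code; activation quantization and per-group scaling are omitted.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinear(nn.Module):
    """b1.58-style linear layer: latent FP32 weights, ternary forward pass.
    Minimal sketch; the repository's implementation may differ in details."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gamma = self.weight.abs().mean().clamp(min=1e-5)   # absmean scale
        w_q = (self.weight / gamma).round().clamp(-1, 1)   # ternary {-1, 0, +1}
        # Straight-through estimator: forward uses the quantized weights,
        # gradients flow to the latent FP32 weights.
        w = self.weight + (w_q * gamma - self.weight).detach()
        return F.linear(x, w)

# Quick shape check
y = BitLinear(256, 512)(torch.randn(4, 256))  # -> (4, 512)
```
The 2D positional encoding can be sketched the same way; the split of channels into row and column halves is an assumption about the layout, not a confirmed detail of the model:
```python
import math
import torch

def pos_encoding_2d(h: int = 32, w: int = 32, d_model: int = 256) -> torch.Tensor:
    """Hypothetical layout: first d/2 channels encode the row, last d/2 the column."""
    def sinusoid(n: int, d: int) -> torch.Tensor:
        pos = torch.arange(n, dtype=torch.float32).unsqueeze(1)
        freq = torch.exp(torch.arange(0, d, 2, dtype=torch.float32) * (-math.log(10000.0) / d))
        pe = torch.zeros(n, d)
        pe[:, 0::2] = torch.sin(pos * freq)
        pe[:, 1::2] = torch.cos(pos * freq)
        return pe

    row = sinusoid(h, d_model // 2).unsqueeze(1).expand(h, w, -1)  # (h, w, d/2)
    col = sinusoid(w, d_model // 2).unsqueeze(0).expand(h, w, -1)  # (h, w, d/2)
    return torch.cat([row, col], dim=-1).reshape(h * w, d_model)   # one vector per pixel token
```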
---
## Training
The model was trained on a fully synthetic, procedurally generated dataset of labeled 32×32 pixel art sprites.
| Dataset | Samples | Categories | Vocabulary |
|---|---|---|---|
| v3 (current) | 23,648 | 199 | 222 words |
Categories include characters, creatures, items, terrain, food, vehicles, buildings, nature, UI icons, furniture, and instruments.
Each image has a short descriptive label (e.g. `a red pixel art sword`) used as the text conditioning signal.
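For illustration, a hypothetical generator for labels in this pattern (the color and object lists here are a small subset of the actual 222-word vocabulary):
```python
import random

# Hypothetical sketch of the label template; the real dataset generator
# is not part of this card.
COLORS = ["red", "blue", "green", "yellow", "purple", "gold"]
OBJECTS = ["sword", "shield", "knight", "dragon", "castle", "potion"]

def make_label() -> str:
    return f"a {random.choice(COLORS)} pixel art {random.choice(OBJECTS)}"

print(make_label())  # e.g. "a red pixel art sword"
```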
**Training configuration:**
| Setting | Value |
|---|---|
| Epochs | 60 |
| Batch size | 32 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, wd=0.01) |
| Learning rate | 5×10⁻⁴ with cosine annealing + 500-step warmup |
| Hardware | NVIDIA RTX 3080 (10 GB VRAM) |
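A hedged sketch of this setup in PyTorch; the step counts are derived from the table and dataset size, and the stand-in `model` should be replaced by the BitPixelLM instance built in the Usage section below:
```python
import math
import torch

model = torch.nn.Linear(8, 8)  # stand-in; substitute the BitPixelLM from "Usage" below

optimizer = torch.optim.AdamW(
    model.parameters(), lr=5e-4, betas=(0.9, 0.95), weight_decay=0.01
)

# 23,648 samples / batch 32 ≈ 739 steps per epoch, 60 epochs (illustrative totals).
warmup_steps, total_steps = 500, 60 * (23_648 // 32)

def lr_lambda(step: int) -> float:
    if step < warmup_steps:
        return (step + 1) / warmup_steps                       # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))          # cosine annealing

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```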
**Results (v3 dataset, best at epoch 32):**
| Metric | Value |
|---|---|
| Best validation loss | 0.4015 |
| Perplexity | ~1.49 |
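Perplexity is the exponentiated per-token cross-entropy, so the two numbers are consistent:
```python
import math
print(math.exp(0.4015))  # ≈ 1.494, the reported ~1.49 perplexity
```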
---
## Usage
### Requirements
```
torch
numpy
Pillow
```
### Load and generate
```python
import json, torch
from PIL import Image

from model.tokenizer import PaletteTokenizer
from model.text_encoder import TextTokenizer, TextEncoder
from model.bit_pixel_decoder import BitPixelLMDecoder, BitPixelLM

# Load tokenizers
palette_tok = PaletteTokenizer(palette_path="palette_256.npy")
with open("vocab.json") as f:
    vocab = json.load(f)
text_tok = TextTokenizer(vocab)

# Build model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
text_encoder = TextEncoder(
    vocab_size=text_tok.vocab_size, d_model=256, nhead=8,
    num_layers=3, dim_feedforward=512, max_seq_len=32,
)
pixel_decoder = BitPixelLMDecoder(
    vocab_size=palette_tok.vocab_size, d_model=256, nhead=8,
    num_layers=6, dim_feedforward=512, img_size=32,
)
model = BitPixelLM(text_encoder, pixel_decoder).to(device)

# Load weights
ckpt = torch.load("best.pt", map_location=device, weights_only=False)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()

# Generate 1,024 pixel tokens autoregressively
prompt = "a red pixel art sword"
text_tokens = text_tok.encode(prompt).unsqueeze(0).to(device)
with torch.no_grad():
    generated = model.generate(
        text_tokens,
        sos_token=palette_tok.sos_token,
        eos_token=palette_tok.eos_token,
        temperature=0.8,
        top_k=40,
        top_p=0.9,
    )

# Decode tokens to a 32×32 RGB array, then upscale with nearest-neighbour
img_array = palette_tok.decode_tokens(generated[0].cpu().tolist())
img = Image.fromarray(img_array, "RGB").resize((256, 256), Image.NEAREST)
img.save("output.png")
```
### Vocabulary
The model knows 222 words covering colors, object types, and descriptors. Prompts that follow the training label pattern (`a <color> pixel art <object>`) work best. Words outside the vocabulary are silently mapped to `<unk>`.
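Because unknown words are dropped to `<unk>` without warning, it can help to pre-check prompts against `vocab.json`. A minimal sketch, assuming the vocabulary keys are the supported words:
```python
# Hypothetical helper: list words the model would map to <unk>.
# Assumes `vocab` is the dict loaded from vocab.json (see the snippet above).
def out_of_vocab(prompt: str, vocab: dict) -> list:
    return [w for w in prompt.lower().split() if w not in vocab]

print(out_of_vocab("a crimson pixel art sword", vocab))  # ['crimson']
```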
Sample supported words:
`red`, `blue`, `green`, `yellow`, `orange`, `purple`, `gold`, `dark`, `teal`, `silver` ·
`sword`, `shield`, `bow`, `axe`, `staff`, `wand`, `armour` ·
`knight`, `wizard`, `archer`, `dragon`, `goblin`, `skeleton`, `ghost`, `vampire` ·
`castle`, `tree`, `flower`, `mushroom`, `chest`, `potion`, `gem`, `key`, `crown`, `ship`, `horse`, and more.
---
## Limitations
- Outputs are 32×32 pixels. Upscale with `Image.NEAREST` to preserve the pixel art look.
- Trained entirely on procedurally generated synthetic data; the model has no knowledge of real-world artwork.
- Generation quality is best for prompts close to training label patterns.
- Color fidelity is bounded by the 256-color learned palette.
---
## Citations
```bibtex
@article{wang2023bitnet,
  title={BitNet: Scaling 1-bit Transformers for Large Language Models},
  author={Wang, Hongyu and others},
  journal={arXiv preprint arXiv:2310.11453},
  year={2023}
}

@article{ma2024bitnet158,
  title={The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits},
  author={Ma, Shuming and others},
  journal={arXiv preprint arXiv:2402.17764},
  year={2024}
}
```
---
## License
MIT