---
language:
- en
license: mit
tags:
- pixel-art
- image-generation
- bitnet
- ternary
- autoregressive
- text-to-image
pipeline_tag: text-to-image
---
# BitPixelLM
BitPixelLM is a small autoregressive language model trained to generate **32×32 pixel art** from short text prompts.
It was built entirely from scratch as a research and learning project, with no pre-trained weights or external image encoders.
> **Example prompts:** `a red pixel art sword` · `a blue pixel art knight` · `a green pixel art dragon`
---
## Model Architecture
BitPixelLM is a text-conditioned autoregressive decoder based on [BitNet b1.58](https://arxiv.org/abs/2402.17764).
Instead of full-precision weights, the decoder uses **ternary weights {−1, 0, +1}**, making it extremely parameter-efficient.
| Component | Details |
|---|---|
| Parameters | 7,382,274 (~7.4M total) |
| Decoder layers | 6 (BitNet b1.58 — ternary weights) |
| Text encoder layers | 3 (standard FP32 transformer) |
| Model dimension | 256 |
| Attention heads | 8 |
| Feed-forward dim | 512 |
| Weight format | ~75% ternary (1.58 bits/weight), ~25% FP32 |
| Output | 32×32 RGB pixel art (256-color palette) |
**Key design choices:**
- **BitLinear b1.58**: decoder attention and feed-forward weights are quantized to {−1, 0, +1} during the forward pass using the absmean scheme, `round(clamp(W / mean(|W|), -1, 1))`. Embeddings, norms, and the text encoder remain FP32.
- **RMSNorm** instead of LayerNorm (pre-norm architecture).
- **SwiGLU** activation in feed-forward blocks.
- **2D positional encoding**: the 32×32 pixel grid is encoded with separate row/column sinusoidal embeddings, injected at each decoder layer.
- **Cross-attention**: the decoder attends to text encoder outputs at every layer.
- **Palette tokenization**: each pixel is one discrete token from a learned 256-color palette. Generation is autoregressive over 1,024 pixel tokens (32×32) per image.
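The forward-pass quantization can be sketched in PyTorch as follows. This is a minimal illustration of the b1.58 absmean scheme, not the layer from this repo: the class name and the straight-through-estimator detail are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinear(nn.Linear):
    """Sketch of a b1.58 linear layer: weights are quantized to
    {-1, 0, +1} on the forward pass (absmean scheme), while FP32
    master weights are kept for the optimizer via a
    straight-through estimator."""
    def forward(self, x):
        w = self.weight
        gamma = w.abs().mean().clamp(min=1e-5)            # absmean scale
        w_q = torch.round(torch.clamp(w / gamma, -1, 1))  # ternary values
        # straight-through: gradients flow to the FP32 master weights
        w_eff = w + (w_q * gamma - w).detach()
        return F.linear(x, w_eff, self.bias)
```

At inference time the ternary values plus one FP scale per tensor are all that is needed, which is where the ~1.58 bits/weight figure comes from (log2 of 3 states).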
---
## Training
The model was trained on a fully synthetic dataset of procedurally generated, labeled 32×32 pixel art sprites.
| Dataset | Samples | Categories | Vocabulary |
|---|---|---|---|
| v3 (current) | 23,648 | 199 | 222 words |
Categories include characters, creatures, items, terrain, food, vehicles, buildings, nature, UI icons, furniture, and instruments.
Each image has a short descriptive label (e.g. `a red pixel art sword`) used as the text conditioning signal.
**Training configuration:**
| Setting | Value |
|---|---|
| Epochs | 60 |
| Batch size | 32 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, wd=0.01) |
| Learning rate | 5×10⁻⁴ with cosine annealing + 500-step warmup |
| Hardware | NVIDIA RTX 3080 (10 GB VRAM) |
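The optimizer and schedule in the table can be reproduced roughly as below. How the warmup and cosine phases are composed in the actual training script is an assumption; this sketch uses a single `LambdaLR`.

```python
import math
import torch

def make_optimizer_and_scheduler(model, total_steps, warmup_steps=500,
                                 peak_lr=5e-4):
    """AdamW with betas (0.9, 0.95) and weight decay 0.01, linear
    warmup for `warmup_steps`, then cosine annealing to zero."""
    opt = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                            betas=(0.9, 0.95), weight_decay=0.01)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)            # linear warmup
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched
```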
**Results (v3 dataset, best at epoch 32):**
| Metric | Value |
|---|---|
| Best validation loss | 0.4015 |
| Perplexity | ~1.49 |
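The perplexity follows directly from the validation cross-entropy loss, ppl = e^loss:

```python
import math

best_val_loss = 0.4015
perplexity = math.exp(best_val_loss)
print(round(perplexity, 2))  # 1.49
```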
---
## Usage
### Requirements
```
torch
numpy
Pillow
```
### Load and generate
```python
import json, torch
from PIL import Image
from model.tokenizer import PaletteTokenizer
from model.text_encoder import TextTokenizer, TextEncoder
from model.bit_pixel_decoder import BitPixelLMDecoder, BitPixelLM
# Load tokenizers
palette_tok = PaletteTokenizer(palette_path="palette_256.npy")
with open("vocab.json") as f:
vocab = json.load(f)
text_tok = TextTokenizer(vocab)
# Build model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
text_encoder = TextEncoder(
vocab_size=text_tok.vocab_size, d_model=256, nhead=8,
num_layers=3, dim_feedforward=512, max_seq_len=32,
)
pixel_decoder = BitPixelLMDecoder(
vocab_size=palette_tok.vocab_size, d_model=256, nhead=8,
num_layers=6, dim_feedforward=512, img_size=32,
)
model = BitPixelLM(text_encoder, pixel_decoder).to(device)
# Load weights
ckpt = torch.load("best.pt", map_location=device, weights_only=False)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()
# Generate
prompt = "a red pixel art sword"
text_tokens = text_tok.encode(prompt).unsqueeze(0).to(device)
with torch.no_grad():
generated = model.generate(
text_tokens,
sos_token=palette_tok.sos_token,
eos_token=palette_tok.eos_token,
temperature=0.8,
top_k=40,
top_p=0.9,
)
# Decode to image
img_array = palette_tok.decode_tokens(generated[0].cpu().tolist())
img = Image.fromarray(img_array, "RGB").resize((256, 256), Image.NEAREST)
img.save("output.png")
```
### Vocabulary
The model knows 222 words covering colors, object types, and descriptors. Prompts that follow the training label pattern (`a <color> pixel art <object>`) work best. Words outside the vocabulary are silently mapped to `<unk>`.
Sample supported words:
`red`, `blue`, `green`, `yellow`, `orange`, `purple`, `gold`, `dark`, `teal`, `silver` ·
`sword`, `shield`, `bow`, `axe`, `staff`, `wand`, `armour` ·
`knight`, `wizard`, `archer`, `dragon`, `goblin`, `skeleton`, `ghost`, `vampire` ·
`castle`, `tree`, `flower`, `mushroom`, `chest`, `potion`, `gem`, `key`, `crown`, `ship`, `horse`, and more.
---
## Limitations
- Outputs are 32×32 pixels. Upscale with `Image.NEAREST` to preserve the pixel art look.
- Trained entirely on procedurally generated synthetic data, so the model has no knowledge of real-world artwork.
- Generation quality is best for prompts close to training label patterns.
- Color fidelity is bounded by the 256-color learned palette.
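The palette bound on color fidelity can be made concrete with a nearest-color mapping. This is an illustrative sketch; the repo's `PaletteTokenizer` may map colors differently:

```python
import numpy as np

def to_palette_tokens(img, palette):
    """img: (H, W, 3) uint8; palette: (N, 3) uint8.
    Each pixel becomes the index of its nearest palette color
    (squared Euclidean distance in RGB space)."""
    flat = img.reshape(-1, 1, 3).astype(np.int32)
    dists = ((flat - palette[None].astype(np.int32)) ** 2).sum(-1)
    return dists.argmin(-1).reshape(img.shape[:2])
```

Any input color is snapped to one of the 256 palette entries, so shades the palette does not contain cannot be represented in the output.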
---
## Citations
```bibtex
@article{wang2023bitnet,
title={BitNet: Scaling 1-bit Transformers for Large Language Models},
author={Wang, Hongyu and others},
journal={arXiv:2310.11453},
year={2023}
}
@article{ma2024bitnet158,
title={The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits},
author={Ma, Shuming and others},
journal={arXiv:2402.17764},
year={2024}
}
```
---
## License
MIT