---
language:
- en
license: mit
tags:
- pixel-art
- image-generation
- bitnet
- ternary
- autoregressive
- text-to-image
pipeline_tag: text-to-image
---

# BitPixelLM

BitPixelLM is a small autoregressive language model trained to generate **32×32 pixel art** from short text prompts. It was built entirely from scratch as a research and learning project, with no pre-trained weights or external image encoders.

> **Example prompts:** `a red pixel art sword` · `a blue pixel art knight` · `a green pixel art dragon`

---

## Model Architecture

BitPixelLM is a text-conditioned autoregressive decoder based on [BitNet b1.58](https://arxiv.org/abs/2402.17764). Instead of full-precision weights, the decoder uses **ternary weights {−1, 0, +1}**, making it extremely parameter-efficient.

| Component | Details |
|---|---|
| Parameters | 7,382,274 (~7.4M total) |
| Decoder layers | 6 (BitNet b1.58, ternary weights) |
| Text encoder layers | 3 (standard FP32 transformer) |
| Model dimension | 256 |
| Attention heads | 8 |
| Feed-forward dim | 512 |
| Weight format | ~75% ternary (1.58 bits/weight), ~25% FP32 |
| Output | 32×32 RGB pixel art (256-color palette) |

**Key design choices:**

- **BitLinear b1.58**: decoder attention and feed-forward weights are quantized to {−1, 0, +1} during the forward pass via `sign(W / mean(|W|))`. Embeddings, norms, and the text encoder remain FP32. (A quantization sketch follows this list.)
- **RMSNorm** instead of LayerNorm (pre-norm architecture).
- **SwiGLU** activation in feed-forward blocks.
- **2D positional encoding**: the 32×32 pixel grid is encoded with separate row/column sinusoidal embeddings, injected at each decoder layer (see the positional-encoding sketch below).
- **Cross-attention**: the decoder attends to text encoder outputs at every layer.
- **Palette tokenization**: each pixel is one discrete token from a learned 256-color palette. Generation is autoregressive over 1,024 pixel tokens (32×32) per image (see the tokenization sketch below).
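The repository's `BitLinear` code is not reproduced on this card, so the following is a minimal sketch of b1.58-style ternary quantization. It assumes the absmean round-clip recipe from the referenced paper (which, unlike a pure `sign`, also produces zeros) and a standard straight-through estimator; the actual layer in this repo may differ in both details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinear(nn.Linear):
    """Linear layer with ternary {-1, 0, +1} weights (b1.58-style absmean
    quantization, assumed). FP32 master weights are kept for the optimizer;
    a straight-through estimator passes gradients through the rounding."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        # Absmean scale: gamma = mean(|W|); clamp guards against division by zero.
        gamma = w.abs().mean().clamp(min=1e-5)
        # Round-clip to the ternary set, then rescale back by gamma.
        w_q = (w / gamma).round().clamp(-1, 1) * gamma
        # Straight-through estimator: forward uses w_q, backward sees identity.
        w_q = w + (w_q - w).detach()
        return F.linear(x, w_q, self.bias)
```

Swapping `nn.Linear` for such a layer in the decoder's attention and feed-forward projections, while leaving embeddings, norms, and the text encoder in FP32, would roughly reproduce the ~75%/25% weight split reported above.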
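Likewise, a minimal sketch of the separate row/column sinusoidal embeddings for the 32×32 grid. The even split of `d_model` between the row and column halves, the concatenation, and the raster (row-major) position ordering are illustration assumptions, not details confirmed by this repo.

```python
import math
import torch

def sinusoidal_table(length: int, dim: int) -> torch.Tensor:
    """Standard 1D sinusoidal embedding table of shape (length, dim)."""
    pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / dim))
    table = torch.zeros(length, dim)
    table[:, 0::2] = torch.sin(pos * div)
    table[:, 1::2] = torch.cos(pos * div)
    return table

def grid_positional_encoding(img_size: int = 32, d_model: int = 256) -> torch.Tensor:
    """Concatenate a row embedding and a column embedding for each of the
    img_size * img_size pixel positions -> (img_size**2, d_model)."""
    half = d_model // 2
    rows = sinusoidal_table(img_size, half)                # (32, 128)
    cols = sinusoidal_table(img_size, half)                # (32, 128)
    # Pixel (r, c) gets [row_embedding(r) ; col_embedding(c)].
    row_part = rows.unsqueeze(1).expand(-1, img_size, -1)  # (32, 32, 128)
    col_part = cols.unsqueeze(0).expand(img_size, -1, -1)  # (32, 32, 128)
    # Flatten in raster order so position index r * 32 + c matches
    # the autoregressive generation order over 1,024 pixel tokens.
    return torch.cat([row_part, col_part], dim=-1).reshape(img_size * img_size, d_model)
```

Per the design notes above, a table like this would be added to the decoder's hidden states at every layer rather than only at the input.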
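Finally, a minimal sketch of palette tokenization via nearest palette color, using the `palette_256.npy` file shipped with the model. The real `PaletteTokenizer` also manages the `sos`/`eos` special tokens referenced in the Usage example; those are omitted here.

```python
import numpy as np

# Learned 256-color palette, shape (256, 3) RGB (assumed layout).
palette = np.load("palette_256.npy").astype(np.float32)

def encode_image(img: np.ndarray) -> np.ndarray:
    """(32, 32, 3) uint8 RGB -> (1024,) palette-index tokens, row-major."""
    flat = img.reshape(-1, 3).astype(np.float32)                  # (1024, 3)
    # Squared Euclidean distance from every pixel to every palette color.
    dist = ((flat[:, None, :] - palette[None, :, :]) ** 2).sum(-1)  # (1024, 256)
    return dist.argmin(axis=1)

def decode_tokens(tokens: np.ndarray) -> np.ndarray:
    """(1024,) palette-index tokens -> (32, 32, 3) uint8 RGB image."""
    return palette[tokens].reshape(32, 32, 3).astype(np.uint8)
```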
---

## Training

The model was trained on a fully synthetic, procedurally generated dataset of labeled 32×32 pixel art sprites.

| Dataset | Samples | Categories | Vocabulary |
|---|---|---|---|
| v3 (current) | 23,648 | 199 | 222 words |

Categories include characters, creatures, items, terrain, food, vehicles, buildings, nature, UI icons, furniture, and instruments. Each image has a short descriptive label (e.g. `a red pixel art sword`) used as the text conditioning signal.

**Training configuration:**

| Setting | Value |
|---|---|
| Epochs | 60 |
| Batch size | 32 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, wd=0.01) |
| Learning rate | 5×10⁻⁴ with cosine annealing + 500-step warmup |
| Hardware | NVIDIA RTX 3080 (10 GB VRAM) |

**Results (v3 dataset, best at epoch 32):**

| Metric | Value |
|---|---|
| Best validation loss | 0.4015 |
| Perplexity | ~1.49 |

---

## Usage

### Requirements

```
torch
numpy
Pillow
```

### Load and generate

```python
import json, torch
from PIL import Image

from model.tokenizer import PaletteTokenizer
from model.text_encoder import TextTokenizer, TextEncoder
from model.bit_pixel_decoder import BitPixelLMDecoder, BitPixelLM

# Load tokenizers
palette_tok = PaletteTokenizer(palette_path="palette_256.npy")
with open("vocab.json") as f:
    vocab = json.load(f)
text_tok = TextTokenizer(vocab)

# Build model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
text_encoder = TextEncoder(
    vocab_size=text_tok.vocab_size,
    d_model=256,
    nhead=8,
    num_layers=3,
    dim_feedforward=512,
    max_seq_len=32,
)
pixel_decoder = BitPixelLMDecoder(
    vocab_size=palette_tok.vocab_size,
    d_model=256,
    nhead=8,
    num_layers=6,
    dim_feedforward=512,
    img_size=32,
)
model = BitPixelLM(text_encoder, pixel_decoder).to(device)

# Load weights
ckpt = torch.load("best.pt", map_location=device, weights_only=False)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()

# Generate
prompt = "a red pixel art sword"
text_tokens = text_tok.encode(prompt).unsqueeze(0).to(device)
with torch.no_grad():
    generated = model.generate(
        text_tokens,
        sos_token=palette_tok.sos_token,
        eos_token=palette_tok.eos_token,
        temperature=0.8,
        top_k=40,
        top_p=0.9,
    )

# Decode to image
img_array = palette_tok.decode_tokens(generated[0].cpu().tolist())
img = Image.fromarray(img_array, "RGB").resize((256, 256), Image.NEAREST)
img.save("output.png")
```

### Vocabulary

The model knows 222 words covering colors, object types, and descriptors. Prompts that follow the training label pattern (`a <color> pixel art <object>`) work best. Words outside the vocabulary are silently mapped to `<unk>`.

Sample supported words: `red`, `blue`, `green`, `yellow`, `orange`, `purple`, `gold`, `dark`, `teal`, `silver` · `sword`, `shield`, `bow`, `axe`, `staff`, `wand`, `armour` · `knight`, `wizard`, `archer`, `dragon`, `goblin`, `skeleton`, `ghost`, `vampire` · `castle`, `tree`, `flower`, `mushroom`, `chest`, `potion`, `gem`, `key`, `crown`, `ship`, `horse`, and more.

---

## Limitations

- Outputs are 32×32 pixels. Upscale with `Image.NEAREST` to preserve the pixel art look.
- Trained entirely on procedurally generated synthetic data; the model has no knowledge of real-world artwork.
- Generation quality is best for prompts close to the training label patterns.
- Color fidelity is bounded by the learned 256-color palette.

---

## Citations

```bibtex
@article{wang2023bitnet,
  title={BitNet: Scaling 1-bit Transformers for Large Language Models},
  author={Wang, Hongyu and others},
  journal={arXiv preprint arXiv:2310.11453},
  year={2023}
}

@article{ma2024bitnet158,
  title={The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits},
  author={Ma, Shuming and others},
  journal={arXiv preprint arXiv:2402.17764},
  year={2024}
}
```

---

## License

MIT