---

language:
- en
license: mit
tags:
- pixel-art
- image-generation
- bitnet
- ternary
- autoregressive
- text-to-image
pipeline_tag: text-to-image
---


# BitPixelLM

BitPixelLM is a small autoregressive language model trained to generate **32×32 pixel art** from short text prompts.
It was built entirely from scratch as a research and learning project, with no pre-trained weights or external image encoders.

> **Example prompts:** `a red pixel art sword` · `a blue pixel art knight` · `a green pixel art dragon`

---

## Model Architecture

BitPixelLM is a text-conditioned autoregressive decoder based on [BitNet b1.58](https://arxiv.org/abs/2402.17764).
Instead of full-precision weights, the decoder uses **ternary weights {−1, 0, +1}**, which cuts the effective storage cost to ~1.58 bits per weight and makes the model extremely memory-efficient.

| Component | Details |
|---|---|
| Parameters | 7,382,274 (~7.4M total) |
| Decoder layers | 6 (BitNet b1.58 — ternary weights) |
| Text encoder layers | 3 (standard FP32 transformer) |
| Model dimension | 256 |
| Attention heads | 8 |
| Feed-forward dim | 512 |
| Weight format | ~75% ternary (1.58 bits/weight), ~25% FP32 |
| Output | 32×32 RGB pixel art (256-color palette) |

**Key design choices:**
- **BitLinear b1.58**: decoder attention and feed-forward weights are quantized to {−1, 0, +1} during the forward pass via absmean quantization, `round(clamp(W / mean(|W|), −1, 1))`. Embeddings, norms, and the text encoder remain FP32.
- **RMSNorm** instead of LayerNorm (pre-norm architecture).
- **SwiGLU** activation in feed-forward blocks.
- **2D positional encoding**: the 32×32 pixel grid is encoded with separate row/column sinusoidal embeddings, injected at each decoder layer.
- **Cross-attention**: the decoder attends to text encoder outputs at every layer.
- **Palette tokenization**: each pixel is one discrete token from a learned 256-color palette. Generation is autoregressive over 1,024 pixel tokens (32×32) per image.
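
The absmean ternary quantization used by the decoder (scale each weight by the tensor's mean absolute value, then snap to {−1, 0, +1}) can be sketched in a few lines. This is a simplified pure-Python stand-in operating on a flat list, not the repo's `BitLinear` layer, and it omits activation quantization and the straight-through gradient estimator used in training:

```python
def ternary_quantize(weights, eps=1e-5):
    """Absmean quantization to {-1, 0, +1}, in the style of BitNet b1.58 (sketch)."""
    gamma = sum(abs(w) for w in weights) / len(weights)  # mean absolute weight
    clip = lambda x: max(-1.0, min(1.0, x))
    return [float(round(clip(w / (gamma + eps)))) for w in weights]

print(ternary_quantize([0.9, -0.05, -1.2, 0.3, 0.0, -0.4]))
# -> [1.0, 0.0, -1.0, 1.0, 0.0, -1.0]
```

Note that small weights round to 0, which is what gives b1.58 its third state (and its sparsity) compared with pure 1-bit {−1, +1} quantization.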

---

## Training

The model was trained on a fully synthetic, procedurally generated dataset of labeled 32×32 pixel art sprites.

| Dataset | Samples | Categories | Vocabulary |
|---|---|---|---|
| v3 (current) | 23,648 | 199 | 222 words |

Categories include characters, creatures, items, terrain, food, vehicles, buildings, nature, UI icons, furniture, and instruments.
Each image has a short descriptive label (e.g. `a red pixel art sword`) used as the text conditioning signal.

**Training configuration:**

| Setting | Value |
|---|---|
| Epochs | 60 |
| Batch size | 32 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, wd=0.01) |
| Learning rate | 5×10⁻⁴ with cosine annealing + 500-step warmup |
| Hardware | NVIDIA RTX 3080 (10 GB VRAM) |
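
The warmup-plus-cosine schedule above can be sketched as a step-dependent learning-rate function. This is a stdlib illustration of the schedule's shape; the actual training loop may use a built-in scheduler, and `total_steps` and `min_lr` here are assumed values, not taken from the repo:

```python
import math

def lr_at_step(step, total_steps, base_lr=5e-4, warmup_steps=500, min_lr=0.0):
    """Linear warmup for `warmup_steps`, then cosine annealing to min_lr (sketch)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps  # linear ramp up to base_lr
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_at_step(499, 10_000))   # end of warmup: full base LR
print(lr_at_step(10_000, 10_000))  # end of training: annealed to min_lr
```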

**Results (v3 dataset, best at epoch 32):**

| Metric | Value |
|---|---|
| Best validation loss | 0.4015 |
| Perplexity | ~1.49 |

---

## Usage

### Requirements

```
torch
numpy
Pillow
```

### Load and generate

```python
import json

import torch
from PIL import Image

from model.tokenizer import PaletteTokenizer
from model.text_encoder import TextTokenizer, TextEncoder
from model.bit_pixel_decoder import BitPixelLMDecoder, BitPixelLM

# Load tokenizers
palette_tok = PaletteTokenizer(palette_path="palette_256.npy")
with open("vocab.json") as f:
    vocab = json.load(f)
text_tok = TextTokenizer(vocab)

# Build model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
text_encoder = TextEncoder(
    vocab_size=text_tok.vocab_size, d_model=256, nhead=8,
    num_layers=3, dim_feedforward=512, max_seq_len=32,
)
pixel_decoder = BitPixelLMDecoder(
    vocab_size=palette_tok.vocab_size, d_model=256, nhead=8,
    num_layers=6, dim_feedforward=512, img_size=32,
)
model = BitPixelLM(text_encoder, pixel_decoder).to(device)

# Load weights
ckpt = torch.load("best.pt", map_location=device, weights_only=False)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()

# Generate
prompt = "a red pixel art sword"
text_tokens = text_tok.encode(prompt).unsqueeze(0).to(device)
with torch.no_grad():
    generated = model.generate(
        text_tokens,
        sos_token=palette_tok.sos_token,
        eos_token=palette_tok.eos_token,
        temperature=0.8,
        top_k=40,
        top_p=0.9,
    )

# Decode to image
img_array = palette_tok.decode_tokens(generated[0].cpu().tolist())
img = Image.fromarray(img_array, "RGB").resize((256, 256), Image.NEAREST)
img.save("output.png")
```
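
The `decode_tokens` call above maps palette indices back to RGB values; encoding an image goes the other way, assigning each pixel the index of its nearest palette color. A minimal sketch using a hypothetical 3-color palette for illustration (the real `PaletteTokenizer` uses the learned 256-color palette and its exact matching rule may differ):

```python
def nearest_palette_index(rgb, palette):
    """Index of the palette color nearest to rgb (squared Euclidean distance)."""
    dist2 = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(palette)), key=lambda i: dist2(rgb, palette[i]))

palette = [(0, 0, 0), (255, 0, 0), (0, 0, 255)]  # toy palette, not the model's
print(nearest_palette_index((250, 10, 10), palette))  # nearest to red -> 1
```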

### Vocabulary

The model knows 222 words covering colors, object types, and descriptors. Prompts that follow the training label pattern (`a <color> pixel art <object>`) work best. Words outside the vocabulary are silently mapped to `<unk>`.

Sample supported words:
`red`, `blue`, `green`, `yellow`, `orange`, `purple`, `gold`, `dark`, `teal`, `silver` ·
`sword`, `shield`, `bow`, `axe`, `staff`, `wand`, `armour` ·
`knight`, `wizard`, `archer`, `dragon`, `goblin`, `skeleton`, `ghost`, `vampire` ·
`castle`, `tree`, `flower`, `mushroom`, `chest`, `potion`, `gem`, `key`, `crown`, `ship`, `horse`, and more.
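
The `<unk>` fallback described above can be sketched as a simple whitespace-split vocabulary lookup. The vocab below is a toy example for illustration; the real `TextTokenizer` loads the 222-word `vocab.json` and may differ in details such as casing and special tokens:

```python
def encode_prompt(prompt, vocab, unk="<unk>"):
    """Whitespace-split encoding; out-of-vocabulary words map to <unk> (sketch)."""
    return [vocab.get(word, vocab[unk]) for word in prompt.lower().split()]

vocab = {"<unk>": 0, "a": 1, "red": 2, "pixel": 3, "art": 4, "sword": 5}
print(encode_prompt("a red pixel art lightsaber", vocab))
# "lightsaber" is out of vocabulary -> [1, 2, 3, 4, 0]
```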

---

## Limitations

- Outputs are 32×32 pixels. Upscale with `Image.NEAREST` to preserve the pixel art look.
- Trained entirely on procedurally generated synthetic data; it has no knowledge of real-world artwork.
- Generation quality is best for prompts close to training label patterns.
- Color fidelity is bounded by the 256-color learned palette.

---

## Citations

```bibtex
@article{wang2023bitnet,
  title={BitNet: Scaling 1-bit Transformers for Large Language Models},
  author={Wang, Hongyu and others},
  journal={arXiv preprint arXiv:2310.11453},
  year={2023}
}

@article{ma2024bitnet158,
  title={The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits},
  author={Ma, Shuming and others},
  journal={arXiv preprint arXiv:2402.17764},
  year={2024}
}
```

---

## License

MIT