---
language:
- en
license: mit
tags:
- pixel-art
- image-generation
- bitnet
- ternary
- autoregressive
- text-to-image
pipeline_tag: text-to-image
---
# BitPixelLM
BitPixelLM is a small autoregressive language model trained to generate **32×32 pixel art** from short text prompts.
It was built entirely from scratch as a research and learning project, with no pre-trained weights or external image encoders.
> **Example prompts:** `a red pixel art sword` · `a blue pixel art knight` · `a green pixel art dragon`
---
## Model Architecture
BitPixelLM is a text-conditioned autoregressive decoder based on [BitNet b1.58](https://arxiv.org/abs/2402.17764).
Instead of full-precision weights, the decoder uses **ternary weights {−1, 0, +1}**, making it extremely parameter-efficient.
| Component | Details |
|---|---|
| Parameters | 7,382,274 (~7.4M total) |
| Decoder layers | 6 (BitNet b1.58 — ternary weights) |
| Text encoder layers | 3 (standard FP32 transformer) |
| Model dimension | 256 |
| Attention heads | 8 |
| Feed-forward dim | 512 |
| Weight format | ~75% ternary (1.58 bits/weight), ~25% FP32 |
| Output | 32×32 RGB pixel art (256-color palette) |
**Key design choices:**
- **BitLinear b1.58**: decoder attention and feed-forward weights are quantized to {−1, 0, +1} during the forward pass using the absmean scheme, `round(clamp(W / mean(|W|), -1, 1))`. Embeddings, norms, and the text encoder remain FP32.
- **RMSNorm** instead of LayerNorm (pre-norm architecture).
- **SwiGLU** activation in feed-forward blocks.
- **2D positional encoding**: the 32×32 pixel grid is encoded with separate row/column sinusoidal embeddings, injected at each decoder layer.
- **Cross-attention**: the decoder attends to text encoder outputs at every layer.
- **Palette tokenization**: each pixel is one discrete token from a learned 256-color palette. Generation is autoregressive over 1,024 pixel tokens (32×32) per image.
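The forward-pass quantization can be sketched in PyTorch as follows. This is a minimal illustration of the b1.58 absmean scheme, not the layer from this repo: the class name and the straight-through-estimator detail are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinear(nn.Linear):
    """Sketch of a b1.58 linear layer: weights are quantized to
    {-1, 0, +1} on the forward pass (absmean scheme), while FP32
    master weights are kept for the optimizer via a
    straight-through estimator."""
    def forward(self, x):
        w = self.weight
        gamma = w.abs().mean().clamp(min=1e-5)            # absmean scale
        w_q = torch.round(torch.clamp(w / gamma, -1, 1))  # ternary values
        # straight-through: gradients flow to the FP32 master weights
        w_eff = w + (w_q * gamma - w).detach()
        return F.linear(x, w_eff, self.bias)
```

At inference time the ternary values plus one FP scale per tensor are all that is needed, which is where the ~1.58 bits/weight figure comes from (log2 of 3 states).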
---
## Training
The model was trained on a fully synthetic dataset of procedurally generated, labeled 32×32 pixel art sprites.
| Dataset | Samples | Categories | Vocabulary |
|---|---|---|---|
| v3 (current) | 23,648 | 199 | 222 words |
Categories include characters, creatures, items, terrain, food, vehicles, buildings, nature, UI icons, furniture, and instruments.
Each image has a short descriptive label (e.g. `a red pixel art sword`) used as the text conditioning signal.
**Training configuration:**
| Setting | Value |
|---|---|
| Epochs | 60 |
| Batch size | 32 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, wd=0.01) |
| Learning rate | 5×10⁻⁴ with cosine annealing + 500-step warmup |
| Hardware | NVIDIA RTX 3080 (10 GB VRAM) |
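The optimizer and schedule in the table can be reproduced roughly as below. How the warmup and cosine phases are composed in the actual training script is an assumption; this sketch uses a single `LambdaLR`.

```python
import math
import torch

def make_optimizer_and_scheduler(model, total_steps, warmup_steps=500,
                                 peak_lr=5e-4):
    """AdamW with betas (0.9, 0.95) and weight decay 0.01, linear
    warmup for `warmup_steps`, then cosine annealing to zero."""
    opt = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                            betas=(0.9, 0.95), weight_decay=0.01)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)            # linear warmup
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched
```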
**Results (v3 dataset, best at epoch 32):**
| Metric | Value |
|---|---|
| Best validation loss | 0.4015 |
| Perplexity | ~1.49 |
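The perplexity follows directly from the validation cross-entropy loss, ppl = e^loss:

```python
import math

best_val_loss = 0.4015
perplexity = math.exp(best_val_loss)
print(round(perplexity, 2))  # 1.49
```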
---
## Usage
### Requirements
```
torch
numpy
Pillow
```
### Load and generate
```python
import json, torch
from PIL import Image
from model.tokenizer import PaletteTokenizer
from model.text_encoder import TextTokenizer, TextEncoder
from model.bit_pixel_decoder import BitPixelLMDecoder, BitPixelLM
# Load tokenizers
palette_tok = PaletteTokenizer(palette_path="palette_256.npy")
with open("vocab.json") as f:
vocab = json.load(f)
text_tok = TextTokenizer(vocab)
# Build model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
text_encoder = TextEncoder(
vocab_size=text_tok.vocab_size, d_model=256, nhead=8,
num_layers=3, dim_feedforward=512, max_seq_len=32,
)
pixel_decoder = BitPixelLMDecoder(
vocab_size=palette_tok.vocab_size, d_model=256, nhead=8,
num_layers=6, dim_feedforward=512, img_size=32,
)
model = BitPixelLM(text_encoder, pixel_decoder).to(device)
# Load weights
ckpt = torch.load("best.pt", map_location=device, weights_only=False)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()
# Generate
prompt = "a red pixel art sword"
text_tokens = text_tok.encode(prompt).unsqueeze(0).to(device)
with torch.no_grad():
generated = model.generate(
text_tokens,
sos_token=palette_tok.sos_token,
eos_token=palette_tok.eos_token,
temperature=0.8,
top_k=40,
top_p=0.9,
)
# Decode to image
img_array = palette_tok.decode_tokens(generated[0].cpu().tolist())
img = Image.fromarray(img_array, "RGB").resize((256, 256), Image.NEAREST)
img.save("output.png")
```
### Vocabulary
The model knows 222 words covering colors, object types, and descriptors. Prompts that follow the training label pattern (`a <color> pixel art <object>`) work best. Words outside the vocabulary are silently mapped to `<unk>`.
Sample supported words:
`red`, `blue`, `green`, `yellow`, `orange`, `purple`, `gold`, `dark`, `teal`, `silver` ·
`sword`, `shield`, `bow`, `axe`, `staff`, `wand`, `armour` ·
`knight`, `wizard`, `archer`, `dragon`, `goblin`, `skeleton`, `ghost`, `vampire` ·
`castle`, `tree`, `flower`, `mushroom`, `chest`, `potion`, `gem`, `key`, `crown`, `ship`, `horse`, and more.
---
## Limitations
- Outputs are 32×32 pixels. Upscale with `Image.NEAREST` to preserve the pixel art look.
- Trained entirely on procedurally generated synthetic data, so the model has no knowledge of real-world artwork.
- Generation quality is best for prompts close to training label patterns.
- Color fidelity is bounded by the 256-color learned palette.
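The palette bound on color fidelity can be made concrete with a nearest-color mapping. This is an illustrative sketch; the repo's `PaletteTokenizer` may map colors differently:

```python
import numpy as np

def to_palette_tokens(img, palette):
    """img: (H, W, 3) uint8; palette: (N, 3) uint8.
    Each pixel becomes the index of its nearest palette color
    (squared Euclidean distance in RGB space)."""
    flat = img.reshape(-1, 1, 3).astype(np.int32)
    dists = ((flat - palette[None].astype(np.int32)) ** 2).sum(-1)
    return dists.argmin(-1).reshape(img.shape[:2])
```

Any input color is snapped to one of the 256 palette entries, so shades the palette does not contain cannot be represented in the output.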
---
## Citations
```bibtex
@article{wang2023bitnet,
title={BitNet: Scaling 1-bit Transformers for Large Language Models},
author={Wang, Hongyu and others},
journal={arXiv:2310.11453},
year={2023}
}
@article{ma2024bitnet158,
title={The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits},
author={Ma, Shuming and others},
journal={arXiv:2402.17764},
year={2024}
}
```
---
## License
MIT