---
license: apache-2.0
datasets:
- TopAI-1/Image-Dataset
language:
- en
pipeline_tag: text-to-image
library_name: transformers
tags:
- art
---

# Pixel-1: From-Scratch Text-to-Image Generator 🎨
Pixel-1 is a lightweight, experimental text-to-image model built and trained entirely from scratch. Unlike many modern generators that rely on massive pre-trained diffusion backbones, Pixel-1 explores the potential of a compact architecture to understand and render complex semantic prompts.
## 🏆 The Achievement
Pixel-1 was designed to prove that even a small model can achieve high logical alignment with user prompts. It successfully renders complex concepts like **window bars**, **fence shadows**, and **specific color contrasts**, features usually reserved for much larger models.
### Key Features:
* **Built from Scratch:** The Generator architecture (Upsampling, Residual Blocks, and Projections) was designed and trained without pre-trained image weights.
* **High Prompt Adherence:** Exceptional ability to "listen" to complex instructions (e.g., "Window with metal bars and fence shadow").
* **Efficient Architecture:** Optimized for fast inference and training on consumer-grade GPUs (like Kaggle's T4).
* **Latent Understanding:** Uses a CLIP-based text encoder to bridge the gap between human language and pixel space.
---
## 🏗️ Architecture
The model uses a series of Transposed Convolutional layers combined with Residual Blocks to upsample a latent text vector into a 128x128 image.
* **Encoder:** CLIP (`openai/clip-vit-large-patch14`)
* **Decoder:** Custom CNN-based Generator with Skip Connections
* **Loss Function:** L1/MSE transition
* **Resolution:** 128x128 (v1)
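The exact layer configuration lives in the repository's remote code, but a decoder of this general shape (a linear projection of the 768-dimensional CLIP embedding to a small feature map, then repeated 2x upsampling with transposed convolutions interleaved with residual blocks, up to a 128x128 RGB image) can be sketched as follows. All layer widths and block counts here are illustrative assumptions, not the actual Pixel-1 weights:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    # Simple residual block: two 3x3 convs with an identity skip connection.
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))

class ToyGenerator(nn.Module):
    # Projects a pooled CLIP text embedding (768-d) to a 4x4 feature map,
    # then upsamples 4 -> 8 -> 16 -> 32 -> 64 -> 128 with ConvTranspose2d.
    def __init__(self, emb_dim=768, base=256):
        super().__init__()
        self.base = base
        self.proj = nn.Linear(emb_dim, base * 4 * 4)
        stages, ch = [], base
        for _ in range(5):  # five 2x upsampling stages: 4 -> 128
            stages += [
                nn.ConvTranspose2d(ch, ch // 2, 4, stride=2, padding=1),
                nn.ReLU(),
                ResBlock(ch // 2),
            ]
            ch //= 2
        self.stages = nn.Sequential(*stages)
        self.to_rgb = nn.Conv2d(ch, 3, 3, padding=1)

    def forward(self, emb):
        x = self.proj(emb).view(-1, self.base, 4, 4)
        x = self.stages(x)
        return torch.tanh(self.to_rgb(x))  # outputs in [-1, 1]

gen = ToyGenerator()
img = gen(torch.randn(1, 768))
print(img.shape)  # torch.Size([1, 3, 128, 128])
```

The `tanh` output range matches the `(out + 1.0) / 2.0` denormalization used in the inference snippet below.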
---
## 🖼️ Samples & Prompting
Pixel-1 shines when given high-contrast, descriptive prompts.
**Recommended Prompting Style:**
> *"Window with metal bars and fence shadow, high contrast, vivid colors, detailed structure"*
**Observations:**
While the current version (v1) produces stylistic, slightly "painterly" or "pixelated" results, its spatial reasoning is remarkably accurate, correctly placing shadows and structural elements according to the text.
---
## 🛠️ How to use
```python
import torch
import matplotlib.pyplot as plt
import numpy as np
import os
import shutil
from transformers import AutoTokenizer, CLIPTextModel, AutoModel

def generate_fixed_from_hub(prompt, model_id="TopAI-1/Pixel-1"):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"🚀 Working on {device}...")

    # 1. Clear the local cache so the latest revision is fetched from the Hub
    cache_path = os.path.expanduser(f"~/.cache/huggingface/hub/models--{model_id.replace('/', '--')}")
    if os.path.exists(cache_path):
        print("🧹 Clearing old cache to fetch your latest fixes...")
        shutil.rmtree(cache_path)

    # 2. Load the CLIP tokenizer and text encoder
    tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-large-patch14")
    text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to(device)

    # 3. Load the model code and weights directly from the Hub.
    #    Thanks to the auto_map entry in config.json, transformers
    #    locates the custom model class automatically.
    print("📥 Downloading architecture and weights directly from Hub...")
    model = AutoModel.from_pretrained(
        model_id,
        trust_remote_code=True,
        force_download=True
    ).to(device)
    model.eval()
    print("✅ Model loaded successfully!")

    # 4. Generation
    print(f"🎨 Generating: {prompt}")
    inputs = tokenizer(prompt, padding="max_length", max_length=77, truncation=True, return_tensors="pt").to(device)
    with torch.no_grad():
        emb = text_encoder(inputs.input_ids).pooler_output
        out = model(emb)

    # 5. Display: map the output from [-1, 1] to [0, 1] and show it
    img = (out.squeeze(0).cpu().permute(1, 2, 0).numpy() + 1.0) / 2.0
    plt.figure(figsize=(8, 8))
    plt.imshow(np.clip(img, 0, 1))
    plt.axis('off')
    plt.show()

# Run
generate_fixed_from_hub("Window with metal bars and fence shadow")
```
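If you need the result as a file rather than a matplotlib window (for example on a headless server), the same `[-1, 1]` output tensor can be converted and saved with Pillow. The helper name below is mine, and the `dummy` tensor merely stands in for the real generator output:

```python
import numpy as np
import torch
from PIL import Image

def tensor_to_pil(out):
    # out: tensor of shape (1, 3, H, W) with values in [-1, 1]
    img = (out.squeeze(0).cpu().permute(1, 2, 0).numpy() + 1.0) / 2.0
    arr = (np.clip(img, 0.0, 1.0) * 255).astype(np.uint8)
    return Image.fromarray(arr)

# Example with a dummy tensor in place of the real model output:
dummy = torch.rand(1, 3, 128, 128) * 2 - 1
tensor_to_pil(dummy).save("pixel1_sample.png")
```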