---
license: apache-2.0
datasets:
- TopAI-1/Image-Dataset
language:
- en
pipeline_tag: text-to-image
library_name: transformers
tags:
- art
---
# Pixel-1: From-Scratch Text-to-Image Generator 🎨

Pixel-1 is a lightweight, experimental text-to-image model built and trained entirely from scratch. Unlike many modern generators that rely on massive pre-trained diffusion backbones, Pixel-1 explores the potential of a compact architecture to understand and render complex semantic prompts.

## 🏆 The Achievement

Pixel-1 was designed to prove that even a small model can achieve high logical alignment with user prompts. It successfully renders complex concepts such as **window bars**, **fence shadows**, and **specific color contrasts**, features usually reserved for much larger models.

### Key Features

* **Built from Scratch:** The generator architecture (upsampling, residual blocks, and projections) was designed and trained without pre-trained image weights.
* **High Prompt Adherence:** Exceptional ability to "listen" to complex instructions (e.g., "Window with metal bars and fence shadow").
* **Efficient Architecture:** Optimized for fast inference and training on consumer-grade GPUs (such as Kaggle's T4).
* **Latent Understanding:** Uses a CLIP-based text encoder to bridge the gap between human language and pixel space.

---

## 🏗️ Architecture

The model uses a series of transposed convolutional layers combined with residual blocks to upsample a latent text vector into a 128×128 image.

* **Encoder:** CLIP (openai/clip-vit-large-patch14)
* **Decoder:** Custom CNN-based generator with skip connections
* **Loss Function:** L1-to-MSE transition during training
* **Resolution:** 128×128 (v1)

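This card does not publish the exact layer configuration, so the following is only a minimal PyTorch sketch of a decoder in this spirit: a linear projection of the 768-dim CLIP text embedding to a 4×4 feature map, five transposed-convolution upsampling stages with residual blocks (4 px → 128 px), and a tanh head producing images in [-1, 1]. The class names, channel widths, block count, and the `blended_loss` schedule are all assumptions for illustration, not the released weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with an additive skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return F.relu(x + self.body(x))

class PixelGenerator(nn.Module):
    """Hypothetical decoder: project a CLIP text embedding to a 4x4 map,
    then upsample 4 -> 8 -> 16 -> 32 -> 64 -> 128 with transposed convs."""
    def __init__(self, embed_dim=768, base=512):
        super().__init__()
        self.base = base
        self.proj = nn.Linear(embed_dim, base * 4 * 4)
        stages, ch = [], base
        for _ in range(5):  # five 2x upsampling stages: 4 px -> 128 px
            stages += [
                nn.ConvTranspose2d(ch, ch // 2, 4, stride=2, padding=1),
                nn.BatchNorm2d(ch // 2),
                nn.ReLU(inplace=True),
                ResidualBlock(ch // 2),
            ]
            ch //= 2
        self.up = nn.Sequential(*stages)
        # tanh keeps outputs in [-1, 1], matching the (x + 1) / 2 display step
        self.to_rgb = nn.Sequential(nn.Conv2d(ch, 3, 3, padding=1), nn.Tanh())

    def forward(self, emb):
        x = self.proj(emb).view(-1, self.base, 4, 4)
        return self.to_rgb(self.up(x))

def blended_loss(pred, target, step, switch_at=10_000):
    """One assumed reading of the 'L1-to-MSE transition': the weight shifts
    linearly from pure L1 at step 0 to pure MSE by `switch_at`."""
    alpha = min(step / switch_at, 1.0)
    return (1 - alpha) * F.l1_loss(pred, target) + alpha * F.mse_loss(pred, target)
```

Feeding a `(batch, 768)` embedding tensor to `PixelGenerator()` yields a `(batch, 3, 128, 128)` image tensor in [-1, 1], which is the range the display code in the usage section expects.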
---

## 🖼️ Samples & Prompting

Pixel-1 shines when given high-contrast, descriptive prompts.

**Recommended Prompting Style:**
> *"Window with metal bars and fence shadow, high contrast, vivid colors, detailed structure"*

**Observations:**
While the current version (v1) produces stylistic, slightly "painterly" or "pixelated" results, its spatial reasoning is remarkably accurate, correctly placing shadows and structural elements according to the text.

---

## 🛠️ How to use

```python
import os
import shutil

import matplotlib.pyplot as plt
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer, CLIPTextModel

def generate_fixed_from_hub(prompt, model_id="TopAI-1/Pixel-1"):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"🚀 Working on {device}...")

    # 1. Clear the local cache to make sure the newest files are pulled from the Hub
    cache_path = os.path.expanduser(f"~/.cache/huggingface/hub/models--{model_id.replace('/', '--')}")
    if os.path.exists(cache_path):
        print("🧹 Clearing old cache to fetch your latest fixes...")
        shutil.rmtree(cache_path)

    # 2. Load the CLIP tokenizer and text encoder
    tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-large-patch14")
    text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to(device)

    # 3. Load the model's architecture and weights directly from the Hub.
    #    Thanks to the auto_map entry in config.json, transformers locates
    #    the custom classes on its own.
    print("📥 Downloading architecture and weights directly from Hub...")
    model = AutoModel.from_pretrained(
        model_id,
        trust_remote_code=True,
        force_download=True
    ).to(device)

    model.eval()
    print("✅ Model loaded successfully!")

    # 4. Generate
    print(f"🎨 Generating: {prompt}")
    inputs = tokenizer(prompt, padding="max_length", max_length=77,
                       truncation=True, return_tensors="pt").to(device)

    with torch.no_grad():
        emb = text_encoder(inputs.input_ids).pooler_output
        out = model(emb)

    # 5. Display: map the output from [-1, 1] to [0, 1] before plotting
    img = (out.squeeze(0).cpu().permute(1, 2, 0).numpy() + 1.0) / 2.0
    plt.figure(figsize=(8, 8))
    plt.imshow(np.clip(img, 0, 1))
    plt.axis('off')
    plt.show()

# Run
generate_fixed_from_hub("Window with metal bars and fence shadow")
```