---
license: apache-2.0
datasets:
- TopAI-1/Image-Dataset
language:
- en
pipeline_tag: text-to-image
library_name: transformers
tags:
- art
---

![image](https://cdn-uploads.huggingface.co/production/uploads/6883b03536f0c50bc30bda75/M3WTacD5y08KCgTHaCD0e.png)

# Pixel-1: From-Scratch Text-to-Image Generator 🎨

Pixel-1 is a lightweight, experimental text-to-image model built and trained entirely from scratch. Unlike many modern generators that rely on massive pre-trained diffusion backbones, Pixel-1 explores the potential of a compact architecture to understand and render complex semantic prompts.

## 🚀 The Achievement
Pixel-1 was designed to prove that even a small model can achieve high logical alignment with user prompts. It successfully renders complex concepts like **window bars**, **fence shadows**, and **specific color contrasts**, features usually reserved for much larger models.

### Key Features:
* **Built from Scratch:** The Generator architecture (Upsampling, Residual Blocks, and Projections) was designed and trained without pre-trained image weights.
* **High Prompt Adherence:** Exceptional ability to "listen" to complex instructions (e.g., "Window with metal bars and fence shadow").
* **Efficient Architecture:** Optimized for fast inference and training on consumer-grade GPUs (like Kaggle's T4).
* **Latent Understanding:** Uses a CLIP-based text encoder to bridge the gap between human language and pixel space.
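
The last point can be illustrated with a short sketch of how the text side works: the CLIP text encoder collapses a prompt into a single pooled vector that conditions the generator. The shapes below are for `openai/clip-vit-large-patch14` (768-dimensional pooled output); this is an illustration, not part of the Pixel-1 pipeline code.

```python
import torch
from transformers import AutoTokenizer, CLIPTextModel

# Same text encoder the model card names
tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

inputs = tokenizer(
    "Window with metal bars and fence shadow",
    padding="max_length", max_length=77, truncation=True, return_tensors="pt",
)
with torch.no_grad():
    # pooler_output: one vector per prompt, taken at the EOS token
    emb = text_encoder(inputs.input_ids).pooler_output

print(emb.shape)  # torch.Size([1, 768])
```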

---

## ๐Ÿ—๏ธ Architecture
The model uses a series of Transposed Convolutional layers combined with Residual Blocks to upsample a latent text vector into a 128x128 image.

* **Encoder:** CLIP (OpenAI/clip-vit-large-patch14)
* **Decoder:** Custom CNN-based Generator with Skip Connections
* **Loss Function:** L1/MSE transition
* **Resolution:** 128x128 (v1)
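
The card does not publish exact layer widths, so the following is only a minimal sketch of the pattern described above (channel counts and stage count are assumptions): a pooled text vector is projected to a small feature map, then doubled in resolution by transposed convolutions interleaved with residual blocks until it reaches 128x128.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convs with an identity skip connection."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))

class TinyGenerator(nn.Module):
    """Projects a 768-d text vector to a 4x4 map, then doubles the
    resolution five times: 4 -> 8 -> 16 -> 32 -> 64 -> 128."""
    def __init__(self, emb_dim=768):
        super().__init__()
        self.proj = nn.Linear(emb_dim, 256 * 4 * 4)
        blocks, ch = [], 256
        for _ in range(5):  # five 2x upsampling stages
            blocks += [
                nn.ConvTranspose2d(ch, ch // 2, 4, stride=2, padding=1),
                nn.ReLU(),
                ResidualBlock(ch // 2),
            ]
            ch //= 2
        self.blocks = nn.Sequential(*blocks)
        self.to_rgb = nn.Conv2d(ch, 3, 3, padding=1)

    def forward(self, emb):
        x = self.proj(emb).view(-1, 256, 4, 4)
        return torch.tanh(self.to_rgb(self.blocks(x)))  # values in [-1, 1]

out = TinyGenerator()(torch.randn(1, 768))
print(tuple(out.shape))  # (1, 3, 128, 128)
```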



---

## 🖼️ Samples & Prompting
Pixel-1 shines when given high-contrast, descriptive prompts. 

**Recommended Prompting Style:**
> *"Window with metal bars and fence shadow, high contrast, vivid colors, detailed structure"*

**Observations:**
While the current version (v1) produces stylized, slightly "painterly" or "pixelated" results, its spatial reasoning is remarkably accurate: shadows and structural elements are placed according to the text.

---

## 🛠️ How to use
```python
import torch
import matplotlib.pyplot as plt
import numpy as np
from transformers import AutoTokenizer, CLIPTextModel, AutoModel

def generate_from_hub(prompt, model_id="TopAI-1/Pixel-1"):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"🚀 Working on {device}...")

    # 1. Load the CLIP tokenizer and text encoder
    tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-large-patch14")
    text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to(device)

    # 2. Load the model and config automatically from the Hub.
    # Thanks to the auto_map entry in config.json, transformers resolves
    # the custom classes on its own.
    print("📥 Downloading architecture and weights directly from the Hub...")
    model = AutoModel.from_pretrained(
        model_id,
        trust_remote_code=True,
    ).to(device)

    model.eval()
    print("✅ Model loaded successfully!")

    # 3. Encode the prompt and generate
    print(f"🎨 Generating: {prompt}")
    inputs = tokenizer(prompt, padding="max_length", max_length=77, truncation=True, return_tensors="pt").to(device)

    with torch.no_grad():
        emb = text_encoder(inputs.input_ids).pooler_output
        out = model(emb)

    # 4. Display: map the output from [-1, 1] to [0, 1] and plot
    img = (out.squeeze(0).cpu().permute(1, 2, 0).numpy() + 1.0) / 2.0
    plt.figure(figsize=(8, 8))
    plt.imshow(np.clip(img, 0, 1))
    plt.axis('off')
    plt.show()

# Run
generate_from_hub("Window with metal bars and fence shadow")
```
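
To save the result to disk instead of displaying it, the output tensor can be converted the same way and written with Pillow. This helper assumes the model returns a single image tensor of shape `(1, 3, H, W)` with values in `[-1, 1]`, as in the snippet above; `tensor_to_pil` is a name introduced here, not part of the repository.

```python
import numpy as np
from PIL import Image

def tensor_to_pil(out):
    """Convert a (1, 3, H, W) tensor in [-1, 1] to a PIL image."""
    img = (out.squeeze(0).cpu().permute(1, 2, 0).numpy() + 1.0) / 2.0
    img = (np.clip(img, 0, 1) * 255).astype("uint8")
    return Image.fromarray(img)

# Example (after obtaining `out` as in the snippet above):
# tensor_to_pil(out).save("pixel1_sample.png")
```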