---
license: apache-2.0
datasets:
- TopAI-1/Image-Dataset
language:
- en
pipeline_tag: text-to-image
library_name: transformers
tags:
- art
---
# Pixel-1: From-Scratch Text-to-Image Generator 🎨

Pixel-1 is a lightweight, experimental text-to-image model built and trained entirely from scratch. Unlike many modern generators that rely on massive pre-trained diffusion backbones, Pixel-1 explores the potential of a compact architecture to understand and render complex semantic prompts.

## 🏆 The Achievement

Pixel-1 was designed to prove that even a small model can achieve high logical alignment with user prompts. It successfully renders complex concepts such as **window bars**, **fence shadows**, and **specific color contrasts**, features usually reserved for much larger models.

### Key Features

* **Built from Scratch:** The generator architecture (upsampling, residual blocks, and projections) was designed and trained without pre-trained image weights.
* **High Prompt Adherence:** Exceptional ability to "listen" to complex instructions (e.g., "Window with metal bars and fence shadow").
* **Efficient Architecture:** Optimized for fast inference and training on consumer-grade GPUs (such as Kaggle's T4).
* **Latent Understanding:** Uses a CLIP-based text encoder to bridge the gap between human language and pixel space.

---

## 🏗️ Architecture

The model uses a series of transposed convolutional layers combined with residual blocks to upsample a latent text vector into a 128×128 image.

* **Encoder:** CLIP (openai/clip-vit-large-patch14)
* **Decoder:** Custom CNN-based generator with skip connections
* **Loss Function:** L1-to-MSE transition during training
* **Resolution:** 128×128 (v1)

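This card does not publish the exact layer configuration, so the following is only a minimal PyTorch sketch of a decoder in this spirit: a linear projection of the 768-dim CLIP text embedding to a 4×4 feature map, five transposed-convolution upsampling stages with residual blocks (4 px → 128 px), and a tanh head producing images in [-1, 1]. The class names, channel widths, block count, and the `blended_loss` schedule are all assumptions for illustration, not the released weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with an additive skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return F.relu(x + self.body(x))

class PixelGenerator(nn.Module):
    """Hypothetical decoder: project a CLIP text embedding to a 4x4 map,
    then upsample 4 -> 8 -> 16 -> 32 -> 64 -> 128 with transposed convs."""
    def __init__(self, embed_dim=768, base=512):
        super().__init__()
        self.base = base
        self.proj = nn.Linear(embed_dim, base * 4 * 4)
        stages, ch = [], base
        for _ in range(5):  # five 2x upsampling stages: 4 px -> 128 px
            stages += [
                nn.ConvTranspose2d(ch, ch // 2, 4, stride=2, padding=1),
                nn.BatchNorm2d(ch // 2),
                nn.ReLU(inplace=True),
                ResidualBlock(ch // 2),
            ]
            ch //= 2
        self.up = nn.Sequential(*stages)
        # tanh keeps outputs in [-1, 1], matching the (x + 1) / 2 display step
        self.to_rgb = nn.Sequential(nn.Conv2d(ch, 3, 3, padding=1), nn.Tanh())

    def forward(self, emb):
        x = self.proj(emb).view(-1, self.base, 4, 4)
        return self.to_rgb(self.up(x))

def blended_loss(pred, target, step, switch_at=10_000):
    """One assumed reading of the 'L1-to-MSE transition': the weight shifts
    linearly from pure L1 at step 0 to pure MSE by `switch_at`."""
    alpha = min(step / switch_at, 1.0)
    return (1 - alpha) * F.l1_loss(pred, target) + alpha * F.mse_loss(pred, target)
```

Feeding a `(batch, 768)` embedding tensor to `PixelGenerator()` yields a `(batch, 3, 128, 128)` image tensor in [-1, 1], which is the range the display code in the usage section expects.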
---

## 🖼️ Samples & Prompting

Pixel-1 shines when given high-contrast, descriptive prompts.

**Recommended Prompting Style:**
> *"Window with metal bars and fence shadow, high contrast, vivid colors, detailed structure"*

**Observations:**
While the current version (v1) produces stylistic, slightly "painterly" or "pixelated" results, its spatial reasoning is remarkably accurate, correctly placing shadows and structural elements according to the text.

---

## 🛠️ How to use

```python
import os
import shutil

import matplotlib.pyplot as plt
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer, CLIPTextModel

def generate_fixed_from_hub(prompt, model_id="TopAI-1/Pixel-1"):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"🚀 Working on {device}...")

    # 1. Clear the local cache to make sure the newest files are pulled from the Hub
    cache_path = os.path.expanduser(f"~/.cache/huggingface/hub/models--{model_id.replace('/', '--')}")
    if os.path.exists(cache_path):
        print("🧹 Clearing old cache to fetch your latest fixes...")
        shutil.rmtree(cache_path)

    # 2. Load the CLIP tokenizer and text encoder
    tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-large-patch14")
    text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to(device)

    # 3. Load the model's architecture and weights directly from the Hub.
    #    Thanks to the auto_map entry in config.json, transformers locates
    #    the custom classes on its own.
    print("📥 Downloading architecture and weights directly from Hub...")
    model = AutoModel.from_pretrained(
        model_id,
        trust_remote_code=True,
        force_download=True
    ).to(device)

    model.eval()
    print("✅ Model loaded successfully!")

    # 4. Generate
    print(f"🎨 Generating: {prompt}")
    inputs = tokenizer(prompt, padding="max_length", max_length=77,
                       truncation=True, return_tensors="pt").to(device)

    with torch.no_grad():
        emb = text_encoder(inputs.input_ids).pooler_output
        out = model(emb)

    # 5. Display: map the output from [-1, 1] to [0, 1] before plotting
    img = (out.squeeze(0).cpu().permute(1, 2, 0).numpy() + 1.0) / 2.0
    plt.figure(figsize=(8, 8))
    plt.imshow(np.clip(img, 0, 1))
    plt.axis('off')
    plt.show()

# Run
generate_fixed_from_hub("Window with metal bars and fence shadow")
```