Create app.py
Multimodal AI Image Studio Guide
Welcome to the Multimodal AI Image Studio! This tool provides an integrated platform for generating, comparing, and analyzing AI-generated images. Whether you're an artist, researcher, or just exploring AI, this interface gives you everything you need to work with AI-generated images and text.
How to Use
Step 1: Upload Reference Image
Drag and drop your image into the uploader, then click "Upload Image & Generate Caption".
The system generates a descriptive caption for the uploaded image.
Step 2: Generate Images from Caption
After generating a caption, you can use it to generate new images with:
SD-Turbo (for realistic images)
DreamShaper (for artistic, stylized creations)
Optionally, add a custom prompt enhancer (e.g., "with a futuristic city background") to refine the generated images.
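The prompt sent to the image model can be pictured as the caption, the optional enhancer, and a set of style keywords joined with commas. The sketch below illustrates that assembly in plain Python; `build_prompt` and its `style_keywords` parameter are hypothetical names chosen for illustration, not functions exposed by the tool.

```python
def build_prompt(caption, enhancer="", style_keywords=""):
    """Join the non-empty prompt parts with ', ' (illustrative helper)."""
    parts = [caption, enhancer, style_keywords]
    # Drop missing or blank parts so no stray commas end up in the prompt.
    cleaned = [p.strip() for p in parts if p and p.strip()]
    return ", ".join(cleaned)


if __name__ == "__main__":
    print(build_prompt("a dog on a beach", "with a futuristic city background"))
```

Skipping empty parts means an unused enhancer leaves the caption unchanged rather than adding a trailing comma.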
Step 3: Compare Image Metrics
Once you've generated multiple images, you can compute metrics to compare their similarity:
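CLIP: Cosine similarity between the CLIP embeddings of two images; higher values mean the images are more alike in content.
LPIPS: Perceptual distance between two images; lower values mean the images look more alike.
BERTScore: F1 score comparing the captions of two images for textual similarity; higher values mean closer captions.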
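Embedding-based similarity of this kind is commonly computed as the cosine of the angle between two vectors, and three images give three pairs to compare. The sketch below shows both ideas in plain Python; the three-element vectors are made-up stand-ins for real embeddings, and `cosine_similarity` and `all_pairs` are illustrative names rather than part of the tool.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def all_pairs(labels):
    """Every unordered pair of labels, in input order."""
    return list(combinations(labels, 2))


if __name__ == "__main__":
    # Hypothetical embeddings, chosen only to make the arithmetic easy to follow.
    embeddings = {
        "Reference": [1.0, 0.0, 0.0],
        "SD-Turbo": [0.8, 0.6, 0.0],
        "DreamShaper": [0.0, 1.0, 0.0],
    }
    for left, right in all_pairs(list(embeddings)):
        sim = cosine_similarity(embeddings[left], embeddings[right])
        print(f"{left} vs {right}: {sim:.3f}")
```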
Step 4: NLP Analysis of Captions
You can analyze the captions of your images for:
Sentiment Analysis: Get a sense of the emotional tone of the caption (positive or negative, with a confidence score).
Named Entity Recognition: Identify key entities mentioned in the captions (such as people, places, or organizations).
Topic Classification: Classify the caption into categories like "people," "nature," or "food."
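Topic classification of this kind typically yields a list of candidate labels with one score each. As a rough sketch, the scores could be filtered and ranked as below; `top_topics`, the `min_score` cut-off, and the example numbers are all assumptions made for illustration, not output from the tool.

```python
def top_topics(result, min_score=0.2):
    """Return (label, score) pairs scoring at least min_score, best first."""
    pairs = zip(result["labels"], result["scores"])
    kept = [(label, score) for label, score in pairs if score >= min_score]
    return sorted(kept, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up classifier output for a caption such as "a deer in a forest".
    example = {
        "labels": ["nature", "animals", "people", "objects", "food"],
        "scores": [0.55, 0.30, 0.08, 0.05, 0.02],
    }
    for label, score in top_topics(example):
        print(f"{label}: {score:.2f}")
```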
Step 5: Ask Questions with VQA
Once a reference image is uploaded, you can ask questions about it, such as "What color is the sky?" or "What is the animal in the image?"
The system will answer based on the content of the image.
@@ -0,0 +1,574 @@
# **Purpose**

# =====================================================
# Multimodal AI Image Studio
# =====================================================
# Purpose:
# This script provides a unified interface for generating,
# comparing, and analyzing AI-generated images.
#
# Key Features:
# 1. Upload a reference image and automatically generate captions.
# 2. Enhance prompts to generate images using:
#    - SD-Turbo (Stability AI)
#    - DreamShaper (Artistic style model)
# 3. Compute pairwise metrics between images:
#    - CLIP similarity
#    - LPIPS perceptual similarity
#    - BERTScore textual similarity
# 4. NLP analysis of captions:
#    - Sentiment analysis
#    - Named entity recognition
#    - Topic classification
# 5. Visual Question Answering (VQA) on the reference image.
#
# Requirements:
# - Python >= 3.9
# - GPU recommended for faster image generation
#
# Usage:
# 1. Install dependencies (see requirements.txt)
# 2. Run this script
# 3. Access the Gradio web interface for interactive exploration


# **Section One**

# ==============================
# SECTION 1
# ==============================
# Install
# Notebook-style "!pip" lines are not valid Python inside app.py, so the
# commands are kept as comments; install these packages before running.
# pip install -qq git+https://github.com/openai/CLIP.git
# pip install -qq lpips
# pip install -qq bert-score
# pip install -qq transformers accelerate
# pip install -qq diffusers gradio


# Libraries
import torch
import gradio as gr
from PIL import Image
from diffusers import DiffusionPipeline
from transformers import pipeline, BlipProcessor, BlipForQuestionAnswering
import lpips
import clip
from bert_score import score
import torchvision.transforms as T

device = "cuda" if torch.cuda.is_available() else "cpu"

def free_gpu_cache():
    """Release unused cached GPU memory held by PyTorch."""
    if device == "cuda":
        torch.cuda.empty_cache()

# ==============================
# MODELS
# ==============================
gen_pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/sdxl-turbo",
    torch_dtype=torch.float16 if device == "cuda" else torch.float32
).to(device)

dreamshaper_pipe = DiffusionPipeline.from_pretrained(
    "Lykon/dreamshaper-7",
    torch_dtype=torch.float16 if device == "cuda" else torch.float32
).to(device)

captioner = pipeline(
    "image-to-text",
    model="Salesforce/blip-image-captioning-large",
    device=0 if device == "cuda" else -1,
)
# generate_kwargs={"max_new_tokens":256, "num_beams":5, "temperature":0.7})

sentiment_model = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=0 if device == "cuda" else -1
)
ner_model = pipeline(
    "ner",
    model="dbmdz/bert-large-cased-finetuned-conll03-english",
    aggregation_strategy="simple",
    device=0 if device == "cuda" else -1
)
topic_model = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli",
    device=0 if device == "cuda" else -1
)

vqa_processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
vqa_model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base").to("cpu")

clip_model, clip_preprocess = clip.load("ViT-B/32", device=device)
lpips_model = lpips.LPIPS(net='alex').to(device)
lpips_transform = T.Compose([T.ToTensor(), T.Resize((256, 256))])

style_map = {
    "Photorealistic": "photorealistic, ultra-detailed, 8k, cinematic lighting",
    "Real Life": "natural lighting, true-to-life colors, DSLR",
    "Documentary": "documentary handheld muted colors",
    "iPhone Camera": "iPhone photo natural HDR",
    "Street Photography": "candid street ambient shadows",
    "Cinematic": "cinematic lighting dramatic depth",
    "Anime": "anime cel shaded vibrant",
    "Watercolor": "watercolor soft wash art",
    "Macro": "macro lens shallow DOF",
    "Cyberpunk": "neon cyberpunk futuristic",
}

# **Section Two**

# ==============================
# SECTION 2: FUNCTIONS
# ==============================
def _generate_with_pipe(pipe, name, base_caption, enhancer, negative, seed, style, images):
    """Build the prompt, run one diffusion pipeline, and append the result to images."""
    images = images or []
    base_caption = base_caption or ""
    enhancer = enhancer or ""

    # Prompt = caption + optional enhancer + style keywords, comma-separated.
    final_prompt = f"{base_caption}, {enhancer}".strip(", ")
    final_prompt = f"{final_prompt}, {style_map.get(style, '')}".strip(", ")

    try:
        seed = int(seed)
    except (TypeError, ValueError):
        seed = 42  # fall back to a fixed seed when the input is not an integer

    generator = torch.Generator(device="cpu").manual_seed(seed)

    try:
        with torch.no_grad():
            out = pipe(prompt=final_prompt, negative_prompt=negative, generator=generator)
        img = out.images[0]
    except Exception as e:
        print(f"{name} failed:", e)
        img = None

    if img is not None:
        images.append(img)

    free_gpu_cache()
    return img, images

def generate_image_with_enhancer(base_caption, enhancer, negative, seed, style, images):
    """Generate one image with the SD-Turbo pipeline (gen_pipe)."""
    return _generate_with_pipe(
        gen_pipe, "SD Turbo", base_caption, enhancer, negative, seed, style, images
    )

def generate_dreamshaper_with_enhancer(base_caption, enhancer, negative, seed, style, images):
    """Generate one image with the DreamShaper pipeline (dreamshaper_pipe)."""
    return _generate_with_pipe(
        dreamshaper_pipe, "DreamShaper", base_caption, enhancer, negative, seed, style, images
    )

def caption_for_image(img):
    """Return the BLIP caption for img, or a fallback message on failure."""
    try:
        out = captioner(img)
        return out[0]["generated_text"]
    except Exception:
        return "Caption failed."

def answer_vqa(question, image):
    # Explicit None check: the image may arrive as a NumPy array,
    # whose truth value is ambiguous.
    if image is None or not question or not question.strip():
        return "Provide image + question."
    try:
        inputs_raw = vqa_processor(images=image, text=question, return_tensors="pt")
        inputs = {k: v.to("cpu") for k, v in inputs_raw.items()}
        with torch.no_grad():
            # BLIP VQA decodes its answer token by token, so call generate()
            # rather than a plain forward pass.
            out = vqa_model.generate(**inputs)
        return vqa_processor.decode(out[0], skip_special_tokens=True)
    except Exception:
        return "VQA failed."

def compute_metrics(images, captions, i1, i2):
    """Return (CLIP similarity, LPIPS distance, BERTScore F1) for one image pair."""
    # Force RGB so RGBA or grayscale uploads have the three channels LPIPS expects.
    img1 = images[i1].convert("RGB")
    img2 = images[i2].convert("RGB")
    cap1 = captions[i1]
    cap2 = captions[i2]

    # CLIP: inputs must live on the same device as clip_model.
    t1 = clip_preprocess(img1).unsqueeze(0).to(device)
    t2 = clip_preprocess(img2).unsqueeze(0).to(device)
    with torch.no_grad():
        f1 = clip_model.encode_image(t1)
        f2 = clip_model.encode_image(t2)
    clip_sim = float(torch.cosine_similarity(f1, f2))

    # LPIPS: rescale [0, 1] tensors to [-1, 1] and move them to the model's device.
    x1 = (lpips_transform(img1).unsqueeze(0) * 2 - 1).to(device)
    x2 = (lpips_transform(img2).unsqueeze(0) * 2 - 1).to(device)
    with torch.no_grad():
        lp = float(lpips_model(x1, x2))

    # BERTScore
    if cap1 and cap2:
        _, _, F = score([cap1], [cap2], lang="en", verbose=False)
        bert_f1 = float(F.mean())
    else:
        bert_f1 = 0.0

    return clip_sim, lp, bert_f1

# **Section Three**

# ==============================
# Section Three
# ==============================

# 1
# ---------------- Build Gradio UI with Custom Look ----------------
def build_ui_with_custom_ui():
    with gr.Blocks(title="Multimodal AI Image Studio") as demo:

        # ---------------- CSS Styling ----------------
        gr.HTML("""
        <style>
        .heading-orange h2, .heading-orange h3 { color: #ff5500 !important; }
        .orange-btn button {
            background-color: #ff5500 !important;
            color: white !important;
            border-radius: 6px !important;
            height: 36px !important;
            font-weight: bold;
        }
        .teal-btn button {
            background-color: #008080 !important;
            color: white !important;
            border-radius: 6px !important;
            height: 40px !important;
            font-weight: bold;
        }

        /* Horizontal thin spinner */
        .loading-line {
            height: 4px;
            background: linear-gradient(90deg, #008080 0%, #00cccc 50%, #008080 100%);
            background-size: 200% 100%;
            animation: loading 1s linear infinite;
        }
        @keyframes loading {
            0% { background-position: 200% 0; }
            100% { background-position: -200% 0; }
        }

        /* Match enhancer box to upload button */
        .enhancer-box textarea {
            width: 100% !important;
            height: 36px !important;
            box-sizing: border-box;
            font-size: 14px;
        }

        /* Equal-height styling for Step-1 columns */
        .equal-height-row {
            display: flex;
            align-items: stretch;
        }
        .equal-height-row > .gr-column {
            display: flex;
            flex-direction: column;
        }

        /* Target Gradio image container */
        .stretch-img .gr-image-container {
            flex-grow: 1;
            display: flex;
        }

        .stretch-img .gr-image-container img {
            width: 100% !important;
            height: 100% !important;
            object-fit: contain; /* or cover */
        }

        </style>
        """)

        # ---------------- Heading ----------------
        gr.Markdown(
            "## Multimodal AI Image Studio: An Integrated Comparative Perspective",
            elem_classes="heading-orange"
        )

        # ---------------- States ----------------
        images_state = gr.State([])
        captions_state = gr.State([])

        # ---------------- Step 1: Upload Reference Image ----------------
        gr.Markdown("### Upload Reference Image", elem_classes="heading-orange")

        with gr.Row(elem_classes="equal-height-row"):
            with gr.Column(scale=1):
                upload_input = gr.Image(label="Drag & Drop Image", type="pil")
                upload_btn = gr.Button(
                    "Upload Image & Generate Caption",
                    elem_classes="orange-btn"
                )

            with gr.Column(scale=1):
                upload_preview = gr.Image(
                    label="Uploaded Image",
                    interactive=False, elem_classes="stretch-img"
                )

        enhancer_box = gr.Textbox(
            label="Add Prompt Enhancer (Optional)",
            placeholder="Example: 'at night with neon lights', 'wearing a red jacket', etc.",
            elem_classes="enhancer-box"
        )

        caption_out = gr.Markdown(label="Generated Caption")

        # ---------------- Robust Captioning ----------------
        def upload_and_generate_caption_ui(img, images_state, captions_state):
            if img is None:
                return None, "No image uploaded.", [], []

            images = [img]
            try:
                output = captioner(img)
                caption = (
                    output[0]["generated_text"]
                    if len(output) > 0 and "generated_text" in output[0]
                    else "Caption failed."
                )
            except Exception as e:
                print("Captioning error:", e)
                caption = "Caption failed."

            captions = [caption]
            return img, caption, images, captions

        upload_btn.click(
            upload_and_generate_caption_ui,
            inputs=[upload_input, images_state, captions_state],
            outputs=[upload_preview, caption_out, images_state, captions_state]
        )

        # ---------------- Step 2: Generate SD-Turbo & DreamShaper ----------------
        gr.Markdown("### Generate Images from Caption", elem_classes="heading-orange")

        with gr.Row():
            with gr.Column(scale=1, min_width=300):
                sd_btn = gr.Button(
                    "Generate SD-Turbo Image",
                    elem_classes="orange-btn"
                )
                sd_preview = gr.Image(
                    label="SD-Turbo Image",
                    interactive=False
                )

            with gr.Column(scale=1, min_width=300):
                ds_btn = gr.Button(
                    "Generate DreamShaper Image",
                    elem_classes="orange-btn"
                )
                ds_preview = gr.Image(
                    label="DreamShaper Image",
                    interactive=False
                )

        # Slot layout assumed by the metrics and NLP steps:
        # 0 = reference, 1 = SD-Turbo, 2 = DreamShaper (upload, then SD-Turbo, then DreamShaper).
        def generate_sd_from_caption_ui(caption, enhancer, images_state, captions_state):
            images = list(images_state or [])
            captions = list(captions_state or [])
            # Generate into a scratch list, then write the result into slot 1 so that
            # repeated clicks replace the SD-Turbo entry instead of appending a new one.
            img, _ = generate_image_with_enhancer(
                caption,
                enhancer=enhancer,
                negative="",
                seed=42,
                style="Photorealistic",
                images=[]
            )
            if img is None:
                return None, images, captions

            images[1:2] = [img]
            captions[1:2] = [caption_for_image(img)]
            return img, images, captions

        def generate_ds_from_caption_ui(caption, enhancer, images_state, captions_state):
            images = list(images_state or [])
            captions = list(captions_state or [])
            # Same approach as above, writing into slot 2 for DreamShaper.
            img, _ = generate_dreamshaper_with_enhancer(
                caption,
                enhancer=enhancer,
                negative="",
                seed=123,
                style="Photorealistic",
                images=[]
            )
            if img is None:
                return None, images, captions

            images[2:3] = [img]
            captions[2:3] = [caption_for_image(img)]
            return img, images, captions

        sd_btn.click(
            generate_sd_from_caption_ui,
            inputs=[caption_out, enhancer_box, images_state, captions_state],
            outputs=[sd_preview, images_state, captions_state]
        )

        ds_btn.click(
            generate_ds_from_caption_ui,
            inputs=[caption_out, enhancer_box, images_state, captions_state],
            outputs=[ds_preview, images_state, captions_state]
        )

        # ---------------- Step 3: Compute Pairwise Metrics ----------------
        gr.Markdown("### Compute Pairwise Metrics", elem_classes="heading-orange")

        metrics_btn = gr.Button(
            "Compute Metrics for All Pairs",
            elem_classes="teal-btn"
        )

        with gr.Row():
            metrics_A = gr.Markdown()
            metrics_B = gr.Markdown()
            metrics_C = gr.Markdown()

        def format_metrics(title, values):
            # Render one (CLIP, LPIPS, BERTScore) tuple as labelled Markdown lines
            # instead of printing the raw tuple.
            clip_sim, lp, bert_f1 = values
            return (
                f"**{title}**\n\n"
                f"CLIP similarity: {clip_sim:.3f}\n\n"
                f"LPIPS distance: {lp:.3f}\n\n"
                f"BERTScore F1: {bert_f1:.3f}"
            )

        def compute_metrics_all_pairs_ui(images, captions):
            yield (
                "<div class='loading-line'></div>",
                "<div class='loading-line'></div>",
                "<div class='loading-line'></div>"
            )

            if len(images) < 3 or len(captions) < 3:
                msg = "All three images and captions are required to compute metrics."
                yield msg, msg, msg
            else:
                A = compute_metrics(images, captions, 0, 1)
                B = compute_metrics(images, captions, 0, 2)
                C = compute_metrics(images, captions, 1, 2)
                yield (
                    format_metrics("Reference ↔ SD-Turbo", A),
                    format_metrics("Reference ↔ DreamShaper", B),
                    format_metrics("SD-Turbo ↔ DreamShaper", C)
                )

        metrics_btn.click(
            compute_metrics_all_pairs_ui,
            inputs=[images_state, captions_state],
            outputs=[metrics_A, metrics_B, metrics_C]
        )

        # ---------------- Step 4: NLP Analysis ----------------
        gr.Markdown("### NLP Analysis of Captions", elem_classes="heading-orange")

        nlp_btn = gr.Button(
            "Analyze Captions",
            elem_classes="teal-btn"
        )

        nlp_out = gr.HTML()

        def analyze_caption_pipeline_ui(captions):
            yield "<div class='loading-line'></div>"

            if len(captions) < 3:
                yield "<b>All three captions are required for NLP analysis.</b>"
            else:
                labels = ["Reference Image", "SD-Turbo", "DreamShaper"]
                blocks = []

                for label, caption in zip(labels, captions):
                    sentiment = "<br>".join(
                        [f"{s['label']}: {s['score']:.2f}"
                         for s in sentiment_model(caption)]
                    )

                    ents = (
                        "<br>".join(
                            [f"{e['entity_group']}: {e['word']}"
                             for e in ner_model(caption)]
                        ) or "None"
                    )

                    topics_data = topic_model(
                        caption,
                        candidate_labels=[
                            "people", "animals", "objects", "food", "nature"
                        ]
                    )

                    topics = "<br>".join(
                        [f"{l}: {sc:.2f}"
                         for l, sc in zip(
                             topics_data["labels"],
                             topics_data["scores"]
                         )]
                    )

                    block = f"""
                    <div style='flex:1;padding:10px;min-width:250px;'>
                        <h3><u>{label}</u></h3>
                        <b>Sentiment</b><br>{sentiment}<br><br>
                        <b>Entities</b><br>{ents}<br><br>
                        <b>Topics</b><br>{topics}
                    </div>
                    """
                    blocks.append(block)

                yield (
                    "<div style='display:flex; gap:20px; justify-content:space-between;'>"
                    + "".join(blocks) +
                    "</div>"
                )

        nlp_btn.click(
            analyze_caption_pipeline_ui,
            inputs=[captions_state],
            outputs=[nlp_out]
        )

        # ---------------- Step 5: Visual Question Answering ----------------
        gr.Markdown("### Visual Question Answering (VQA)", elem_classes="heading-orange")

        with gr.Row():
            with gr.Column(scale=1):
                vqa_input = gr.Textbox(
                    label="Enter a question about the reference image"
                )
                vqa_btn = gr.Button(
                    "Get Answer",
                    elem_classes="teal-btn"
                )

            with gr.Column(scale=1):
                vqa_out = gr.Markdown(label="VQA Output")

        def answer_vqa_ui(question, image):
            yield "<div class='loading-line'></div>"
            ans = answer_vqa(question, image)
            yield ans

        vqa_btn.click(
            answer_vqa_ui,
            inputs=[vqa_input, upload_preview],
            outputs=[vqa_out]
        )

    return demo


# ---------------- Launch ----------------
demo = build_ui_with_custom_ui()
demo.launch()
