Image to LongCLIP
The vision tagger splits the reference image into 4 equal-sized squares, tags each of them, and then uses these vision embeddings to generate an equivalent LongCLIP text embedding.
The vision model scans the reference image for booru tags. Using these tags and their 2D coordinates, it then builds the LongCLIP-compatible text embedding consumed by the UNet model.
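The quadrant split described above can be sketched as follows. This is a minimal illustration, not the repo's actual implementation; the function name and the center-crop strategy (used here so the 4 pieces come out as equal squares) are assumptions.

```python
from PIL import Image

def split_into_quadrants(image: Image.Image) -> list[Image.Image]:
    """Split an image into 4 equal-sized squares (a 2x2 grid).

    Hypothetical helper: the actual repo may crop or resize differently.
    """
    # Center-crop to a square so all four quadrants are equal squares.
    side = min(image.size)
    left = (image.width - side) // 2
    top = (image.height - side) // 2
    square = image.crop((left, top, left + side, top + side))

    half = side // 2
    # Order: top-left, top-right, bottom-left, bottom-right.
    return [
        square.crop((x, y, x + half, y + half))
        for y in (0, half)
        for x in (0, half)
    ]
```

Each of the four crops would then be tagged independently, and the tag's quadrant position supplies the 2D coordinate mentioned above.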
(Left: Image to LongCLIP, Right: LongCLIP)
Difference from Klein
The Flux.2 Klein repo takes a screenshot of a caption and converts it into text embeddings.
This LongCLIP repo takes an image, generates the sequence of vision embeddings, and then converts them to text embeddings.
The former relies on a loose form of optical character recognition (OCR), while the latter uses (booru) vision recognition.
Inference
import torch
from PIL import Image

# VisionEmbedder, CLIP2DModel and SdxsPipeline are provided by this repo.
embedder = VisionEmbedder()
model = CLIP2DModel.load_model(
    'AiArtLab/sdxs-1b',
    resume='path/to/this/repo/model.safetensor'
).to('cuda')
pipeline = SdxsPipeline.from_pretrained(
    'AiArtLab/sdxs-1b',
    torch_dtype=torch.float16
).to('cuda')

with torch.no_grad():
    # We need a reference image and a random caption.
    # The caption is only used to determine the token length and is then discarded.
    reference = Image.open(reference_image_path)
    vision_embeds = model.encode_image(embedder, reference)
    _, mask = model.tokenize(pipeline.tokenizer, caption, vision_embeds)
    text_embeddings = model.forward(vision_embeds, mask)
    image = pipeline(
        caption,
        negative_prompt='bad quality, low resolution',
        text_embeddings=text_embeddings
    ).images[0]
Datasets
- pixiv rank