Image to LongCLIP

The vision tagger splits the reference image into 4 equal-sized squares and tags each one; these vision embeddings are then used to generate an equivalent LongCLIP text embedding.
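The quadrant split can be sketched as follows. This is a minimal illustration with Pillow; the function name and return order are assumptions, not the repo's actual API.

```python
# Sketch: split a reference image into 4 equal-sized quadrants, as the
# vision tagger does before tagging. Names here are illustrative only.
from PIL import Image


def split_into_quadrants(image: Image.Image) -> list[Image.Image]:
    """Return the four equal quadrants of `image` (TL, TR, BL, BR)."""
    w, h = image.size
    hw, hh = w // 2, h // 2
    boxes = [
        (0, 0, hw, hh),    # top-left
        (hw, 0, w, hh),    # top-right
        (0, hh, hw, h),    # bottom-left
        (hw, hh, w, h),    # bottom-right
    ]
    return [image.crop(box) for box in boxes]
```

Each quadrant is then tagged independently, which is what gives the tags their coarse 2D position.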

The vision model scans the reference image for booru tags. Then, using these tags and their 2D coordinates, it builds the LongCLIP-compatible text embedding that is fed to the UNet model.
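The kind of record this produces might look like the sketch below: a booru tag paired with a normalized 2D position, ordered into a prompt-like sequence before embedding. The class, field names, and the top-to-bottom ordering are assumptions for illustration; the repo's internal representation may differ.

```python
# Sketch: (tag, coordinate) records per tagged region; illustrative only.
from dataclasses import dataclass


@dataclass
class TaggedRegion:
    tag: str   # booru tag, e.g. "blue_hair"
    x: float   # normalized 2D position of the region, in [0, 1]
    y: float


def to_prompt_tokens(regions: list[TaggedRegion]) -> list[str]:
    """Order tags top-to-bottom, then left-to-right, before embedding."""
    return [r.tag for r in sorted(regions, key=lambda r: (r.y, r.x))]
```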

(Left: Image to LongCLIP, Right: LongCLIP)

Difference from Klein

The Flux.2 Klein repo takes a screenshot of a caption and converts it into text embeddings.

This LongCLIP repo takes an image, generates a sequence of vision embeddings, and then converts them into text embeddings.

The former is based on a loose form of optical character recognition (OCR), while the latter uses (booru) vision recognition.

Inference

import torch
from PIL import Image

# VisionEmbedder, CLIP2DModel, and SdxsPipeline are provided by this repo.
embedder = VisionEmbedder()
model = CLIP2DModel.load_model(
    'AiArtLab/sdxs-1b',
    resume='path/to/this/repo/model.safetensor'
).to('cuda')
pipeline = SdxsPipeline.from_pretrained(
    'AiArtLab/sdxs-1b',
    torch_dtype=torch.float16
).to('cuda')

with torch.no_grad():
    # We need a reference image and an arbitrary caption.
    # The caption is only used to determine the token length and is then discarded.
    reference = Image.open(reference_image_path)
    vision_embeds = model.encode_image(embedder, reference)
    _, mask = model.tokenize(pipeline.tokenizer, caption, vision_embeds)
    text_embeddings = model.forward(vision_embeds, mask)

image = pipeline(
    caption,
    negative_prompt='bad quality, low resolution',
    text_embeddings=text_embeddings
).images[0]

Datasets

  • pixiv rank
Model size: 73.6M params (F32, safetensors)