You Only Forward Once: An Efficient Compositional Judging Paradigm
Paper: 2511.16600
For more information, including the model architecture, implementation details, and experimental results, please refer to our paper.
Required dependencies:

transformers>=4.57.0
torch==2.5.1
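Assuming a CUDA environment, the pinned dependencies can be installed with pip. The `flash-attn` package is an additional assumption here: it is not listed above, but the loading example below selects `attn_implementation="flash_attention_2"`, which requires it.

```shell
# Install the pinned dependencies from this card
pip install "transformers>=4.57.0" torch==2.5.1

# Assumed extra: FlashAttention 2 backend used by the loading example
pip install flash-attn --no-build-isolation
```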
from transformers import AutoModel

model_path = "Accio-Lab/yofo-Qwen3-VL-2B-Instruct"

# trust_remote_code is required: compute_score is defined in the
# repository's custom modeling code, not in transformers itself.
yofo = AutoModel.from_pretrained(
    model_path,
    torch_dtype="bfloat16",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
)
yofo.eval()   # inference mode: disable dropout etc.
yofo.cuda()   # move the model to the GPU
Now you can use the model's compute_score function to evaluate how well an image satisfies a given set of requirements. The function accepts a list of input pairs, where each pair consists of an image and a corresponding list of textual requirements. For each input pair, it returns a list of relevance scores, one per requirement, indicating the model's confidence that the image meets that requirement.
data = [
{
"image": "../../datasets/laion-reranker/images/605257.jpg",
"requirements": [
"The item has a visible pattern.",
"The item has long sleeves.",
"The item has an A-line silhouette.",
"The item's primary color is red."
],
},
{
"image": "../../datasets/laion-reranker/images/780764.jpg",
"requirements": [
"The item is a dress.",
"The item has long sleeves.",
"The item features lace-up details.",
"The item is black."
],
},
]
scores = yofo.compute_score(data, batch_size=2, num_workers=2)
# [[0.890625, 0.9765625, 0.7578125, 5.424022674560547e-06], [0.9921875, 0.90234375, 0.0091552734375, 1.0]]
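The returned scores are per-requirement confidences between 0 and 1. A minimal sketch of how they might be post-processed, assuming the output shown above; the 0.5 threshold and the summary logic are illustrative choices, not part of the model's API:

```python
# Scores as returned by compute_score above: one list per image,
# one confidence per requirement.
scores = [
    [0.890625, 0.9765625, 0.7578125, 5.424022674560547e-06],
    [0.9921875, 0.90234375, 0.0091552734375, 1.0],
]

THRESHOLD = 0.5  # assumed cutoff; tune for your application

for i, image_scores in enumerate(scores):
    # Turn each confidence into a pass/fail decision.
    decisions = [s >= THRESHOLD for s in image_scores]
    satisfied = sum(decisions)
    print(f"image {i}: {satisfied}/{len(decisions)} requirements satisfied")
```

With the example output above, both images satisfy three of their four requirements: the first image fails "The item's primary color is red" (score ~5e-6), and the second fails "The item features lace-up details" (score ~0.009).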