You Only Forward Once: An Efficient Compositional Judging Paradigm
Paper: 2511.16600
For more information, including the model architecture, implementation details, and experimental results, please refer to our paper.
Required dependencies:

transformers>=4.57.0
torch==2.5.1
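Assuming a CUDA environment, the pinned dependencies can be installed with pip. The `flash-attn` package is an additional assumption here: it is not listed above, but the loading example below selects `attn_implementation="flash_attention_2"`, which requires it.

```shell
# Install the pinned dependencies from this card
pip install "transformers>=4.57.0" torch==2.5.1

# Assumed extra: FlashAttention 2 backend used by the loading example
pip install flash-attn --no-build-isolation
```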
from transformers import AutoModel

model_path = "Accio-Lab/yofo-Qwen3-VL-2B-Instruct"

# trust_remote_code is required: compute_score is defined in the
# repository's custom modeling code, not in transformers itself.
yofo = AutoModel.from_pretrained(
    model_path,
    torch_dtype="bfloat16",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
)
yofo.eval()   # inference mode: disable dropout etc.
yofo.cuda()   # move the model to the GPU
Now you can use the model's compute_score function to evaluate how well an image satisfies a given set of requirements. The function accepts a list of input pairs, where each pair consists of an image and a corresponding list of textual requirements. For each input pair, it returns a list of relevance scores, one per requirement, indicating the model's confidence that the image meets that requirement.
data = [
{
"image": "../../datasets/laion-reranker/images/605257.jpg",
"requirements": [
"The item has a visible pattern.",
"The item has long sleeves.",
"The item has an A-line silhouette.",
"The item's primary color is red."
],
},
{
"image": "../../datasets/laion-reranker/images/780764.jpg",
"requirements": [
"The item is a dress.",
"The item has long sleeves.",
"The item features lace-up details.",
"The item is black."
],
},
]
scores = yofo.compute_score(data, batch_size=2, num_workers=2)
# [[0.890625, 0.9765625, 0.7578125, 5.424022674560547e-06], [0.9921875, 0.90234375, 0.0091552734375, 1.0]]
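The returned scores are per-requirement confidences between 0 and 1. A minimal sketch of how they might be post-processed, assuming the output shown above; the 0.5 threshold and the summary logic are illustrative choices, not part of the model's API:

```python
# Scores as returned by compute_score above: one list per image,
# one confidence per requirement.
scores = [
    [0.890625, 0.9765625, 0.7578125, 5.424022674560547e-06],
    [0.9921875, 0.90234375, 0.0091552734375, 1.0],
]

THRESHOLD = 0.5  # assumed cutoff; tune for your application

for i, image_scores in enumerate(scores):
    # Turn each confidence into a pass/fail decision.
    decisions = [s >= THRESHOLD for s in image_scores]
    satisfied = sum(decisions)
    print(f"image {i}: {satisfied}/{len(decisions)} requirements satisfied")
```

With the example output above, both images satisfy three of their four requirements: the first image fails "The item's primary color is red" (score ~5e-6), and the second fails "The item features lace-up details" (score ~0.009).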