YOLO-World-V2.1 Update Blog
Contributors: Tianheng Cheng, Haokun Lin, and Yixiao Ge.
Date: 2025.02.05
Note: Yixiao Ge is the project leader.
Summary
Hey guys, long time no see. Recently, we've made a series of updates to YOLO-World, including improvements to the pre-trained models, and we've also fully released the training code for YOLO-World Image Prompts. We will continue to optimize YOLO-World in the future.
Technical Details
We made a number of detailed optimizations to YOLO-World and updated the training for all model sizes from S to X. Specifically, we implemented the following technical updates:
1. Fixed Padding
In previous versions, we trained the model with a fixed vocabulary size of 80. For samples with fewer than 80 words, we padded the vocabulary with empty strings (" ") to reach this size and enable batched training. Typically, one would add masks to ignore the padding's influence. However, our experiments revealed that *not* masking the padding brought some "significant" benefits, including:
(1) Overall accuracy improvement (approximately 2~3 LVIS AP)
(2) Better detection capability for open-vocabulary categories
This is perhaps counter-intuitive, but we have kept this setting to enhance the model's recognition of open-vocabulary objects. According to our analysis, the padding embeddings can serve as background embeddings in the text-to-image module (T-CSPLayer), strengthening the representation of open-vocabulary features.
However, this introduced a significant issue: at inference time, users typically need to add extra padding (" ") to the vocabulary to obtain reasonable results, and adding padding can sometimes introduce uncertainty. We therefore set out to address this issue in this version.
In previous versions, padding affected both the T-CSPLayer used for text-to-image fusion and the classification loss. In this version, we add masks so that the classification loss ignores the padding, which alleviates the aforementioned issues and brings further performance improvements.
Currently, users still need to consider padding (" ") in the input vocabulary. We will thoroughly optimize this in our upcoming work.
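To make the padded-vocabulary scheme above concrete, here is a minimal sketch (the names and helper function are ours, not the actual YOLO-World code) of padding a vocabulary to the fixed size of 80 and building the mask that lets the classification loss ignore the padding entries, while the T-CSPLayer fusion still sees all 80 embeddings:

```python
# Minimal sketch of the padded-vocabulary scheme described above.
# Names and shapes are illustrative, not the actual YOLO-World code.

VOCAB_SIZE = 80  # fixed vocabulary size used during training

def pad_vocabulary(words):
    """Pad a word list to VOCAB_SIZE with " " entries and return the
    padded list plus a mask (True = real word, False = padding)."""
    num_real = len(words)
    padded = words + [" "] * (VOCAB_SIZE - num_real)
    mask = [i < num_real for i in range(VOCAB_SIZE)]
    return padded, mask

words, mask = pad_vocabulary(["person", "dog", "car"])
# The text-to-image fusion uses all 80 embeddings (padding acts as
# background), while the classification loss keeps only masked-in
# entries, e.g.: kept = [l for l, m in zip(logits, mask) if m]
```

In the new version, the mask is applied only to the classification loss; the fusion path intentionally keeps the unmasked padding embeddings.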
2. Optimized Data Pipeline
In previous versions, we used named-entity extraction to annotate image-text data such as CC3M. This approach introduced considerable noise and produced sparse image annotations, leading to low image utilization. To address this, we employed RAM++ for image tagging and combined the RAM++ tags with the extracted named entities to form the annotation vocabulary. We then used the YOLO-World-X-v2 model to annotate the images, producing a subset of 250k samples.
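The vocabulary-merging step can be sketched as follows; this is an illustrative helper of our own (the actual pipeline code and the RAM++ / entity-extraction calls are not shown here):

```python
# Illustrative sketch of building a per-image annotation vocabulary by
# merging RAM++ tags with named entities extracted from the caption.
# The function name and inputs are hypothetical stand-ins.

def build_vocabulary(ram_tags, named_entities):
    """Union of RAM++ tags and caption entities, deduplicated
    case-insensitively while preserving first-seen order."""
    vocab, seen = [], set()
    for word in ram_tags + named_entities:
        key = word.strip().lower()
        if key and key not in seen:
            seen.add(key)
            vocab.append(word.strip())
    return vocab

vocab = build_vocabulary(["dog", "grass", "Frisbee"],
                         ["dog", "frisbee", "park"])
# → ["dog", "grass", "Frisbee", "park"]
```

The merged vocabulary then serves as the prompt list when pseudo-labeling each image with the detector.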
3. YOLO-World-Image: Image Prompts
Image Prompt: We've noticed that many users are very interested in using image prompts with YOLO-World. Previously, we only provided a preview version, so in this update we give a detailed introduction to the Image Prompt model and its training process.
Image Prompt Adapter: YOLO-World uses CLIP-Text as the text encoder to encode text prompts into text embeddings. Since CLIP's pre-training has aligned the text and visual encoders, it naturally follows that we can directly use CLIP's visual encoder to encode image prompts into corresponding image embeddings, replacing text embeddings to achieve object detection with image prompts. After obtaining the image embeddings, all subsequent steps remain identical to those in YOLO-World with text embeddings, including the text-to-image T-CSPLayer.
Prompt Adapter: While this approach is feasible, its actual performance is relatively mediocre. This is because CLIP's vision-text alignment only exists at the contrastive level, making direct substitution ineffective. To this end, we introduced a simple adapter, consisting of a straightforward MLP, to further align the image prompt embeddings with the text embeddings, as shown in the figure below.
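A minimal NumPy sketch of such an adapter is shown below. The layer sizes, initialization, and the 512-dimensional embedding size are illustrative assumptions (512 matches CLIP ViT-B variants), not the released implementation:

```python
import numpy as np

# Sketch of the image-prompt adapter: a small MLP mapping CLIP image
# embeddings toward the text-embedding space. Sizes are illustrative.

rng = np.random.default_rng(0)
EMBED_DIM = 512  # assumed CLIP embedding size

class MLPAdapter:
    def __init__(self, dim=EMBED_DIM, hidden=1024):
        self.w1 = rng.normal(0.0, 0.02, (dim, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 0.02, (hidden, dim))
        self.b2 = np.zeros(dim)

    def __call__(self, x):
        h = np.maximum(x @ self.w1 + self.b1, 0.0)          # ReLU
        out = h @ self.w2 + self.b2
        # L2-normalize, as CLIP embeddings are typically unit-norm
        return out / np.linalg.norm(out, axis=-1, keepdims=True)

adapter = MLPAdapter()
image_embeds = rng.normal(size=(3, EMBED_DIM))  # stand-in for CLIP outputs
aligned = adapter(image_embeds)  # drop-in replacement for text embeddings
```

The adapter's output is consumed exactly where text embeddings would be, so no other part of the network changes.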
Training: Taking the COCO dataset as an example, for each category present in an image, we randomly select a query bbox and crop out the corresponding image region. We then use the CLIP image encoder with an MLP adapter to extract the corresponding image embeddings.
Subsequently, the image embeddings corresponding to the different category query boxes replace the text embeddings in the forward computation. For categories not present in the image, we continue to use their text embeddings, which helps align image prompts with text prompts. Objects are matched to their respective query boxes by category (one of the ground-truth bboxes of the same category is sampled as the query bbox), and the loss is then computed to optimize the adapter's parameters.
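The per-image prompt construction described above can be sketched as follows; `build_prompts`, `text_embed`, and `image_embed` are hypothetical names standing in for the CLIP text encoder and the crop-then-encode path:

```python
import random

# Sketch of per-image prompt construction during training: for each
# category present in the image, one ground-truth box is sampled as the
# query box; absent categories keep their text embeddings.

def build_prompts(gt_boxes_by_cat, all_categories, text_embed, image_embed):
    """gt_boxes_by_cat: {category: [bbox, ...]} for this image.
    text_embed / image_embed: callables returning a prompt embedding
    (hypothetical stand-ins for CLIP text / cropped-image encoding)."""
    prompts = {}
    for cat in all_categories:
        boxes = gt_boxes_by_cat.get(cat)
        if boxes:
            query_box = random.choice(boxes)  # one GT box per category
            prompts[cat] = image_embed(query_box)
        else:
            prompts[cat] = text_embed(cat)    # keep the text prompt
    return prompts
```

Mixing image and text prompts in the same forward pass is what encourages the adapter output to land in the same space as the text embeddings.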
Evaluation: For each category in the training set, we randomly selected 32 object bounding boxes (with area > 100×100 pixels) and extracted their CLIP image embeddings. We then averaged the embeddings per category; the averaged embedding, after passing through the adapter, replaces the text embedding for forward inference. Finally, we tested under COCO's default evaluation protocol.
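The evaluation-time prompt construction reduces to filtering large boxes, capping the sample count, and averaging; a small sketch (names and exact thresholds mirror the text above but the code itself is illustrative):

```python
import numpy as np

# Sketch of evaluation-time prompt construction: average the CLIP image
# embeddings of up to 32 large (> 100x100 px) boxes per category; the
# mean then passes through the adapter in place of the text embedding.

MIN_AREA, MAX_SAMPLES = 100 * 100, 32

def category_prompt(boxes, embeddings):
    """boxes: (N, 4) in xyxy format; embeddings: (N, D) CLIP image
    embeddings. Keeps large boxes, takes at most MAX_SAMPLES, and
    returns their mean embedding."""
    boxes = np.asarray(boxes, dtype=float)
    embeddings = np.asarray(embeddings, dtype=float)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = np.flatnonzero(areas > MIN_AREA)[:MAX_SAMPLES]
    return embeddings[keep].mean(axis=0)
```

Averaging over multiple exemplars makes the prompt less sensitive to any single crop's appearance.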
Zero-shot Evaluation Results for Pre-trained Models
We evaluate all YOLO-World-V2.1 models on LVIS, LVIS-mini, and COCO in a zero-shot manner.
| Model | Resolution | LVIS AP | LVIS APr | LVIS APc | LVIS APf | LVIS-mini AP | LVIS-mini APr | LVIS-mini APc | LVIS-mini APf | COCO AP | COCO AP50 | COCO AP75 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| YOLO-World-S | 640 | 18.5 (+1.2) | 12.6 | 15.8 | 24.1 | 23.6 (+0.9) | 16.4 | 21.5 | 26.6 | 36.6 | 51.0 | 39.7 |
| YOLO-World-S | 1280 | 19.7 (+0.9) | 13.5 | 16.3 | 26.3 | 25.5 (+1.4) | 19.1 | 22.6 | 29.3 | 38.2 | 54.2 | 41.6 |
| YOLO-World-M | 640 | 24.1 (+0.6) | 16.9 | 21.1 | 30.6 | 30.6 (+0.6) | 19.7 | 29.0 | 34.1 | 43.0 | 58.6 | 46.7 |
| YOLO-World-M | 1280 | 26.0 (+0.7) | 19.9 | 22.5 | 32.7 | 32.7 (+1.1) | 24.4 | 30.2 | 36.4 | 43.8 | 60.3 | 47.7 |
| YOLO-World-L | 640 | 26.8 (+0.7) | 19.8 | 23.6 | 33.4 | 33.8 (+0.9) | 24.5 | 32.3 | 36.8 | 44.9 | 60.4 | 48.9 |
| YOLO-World-L | 800 | 28.3 | 22.5 | 24.4 | 35.1 | 35.2 | 27.8 | 32.6 | 38.8 | 47.4 | 63.3 | 51.8 |
| YOLO-World-L | 1280 | 28.7 (+1.1) | 22.9 | 24.9 | 35.4 | 35.5 (+1.2) | 24.4 | 34.0 | 38.8 | 46.0 | 62.5 | 50.0 |
| YOLO-World-X | 640 | 28.6 (+0.2) | 22.0 | 25.6 | 34.9 | 35.8 (+0.4) | 31.0 | 33.7 | 38.5 | 46.7 | 62.5 | 51.0 |

YOLO-World-X-1280 is coming soon.
Model Card
| Model | Resolution | Training | Data | Model Weights |
|---|---|---|---|---|
| YOLO-World-S | 640 | PT (100e) | O365v1+GoldG+CC-LiteV2 | 🤗 HuggingFace |
| YOLO-World-S | 1280 | CPT (40e) | O365v1+GoldG+CC-LiteV2 | 🤗 HuggingFace |
| YOLO-World-M | 640 | PT (100e) | O365v1+GoldG+CC-LiteV2 | 🤗 HuggingFace |
| YOLO-World-M | 1280 | CPT (40e) | O365v1+GoldG+CC-LiteV2 | 🤗 HuggingFace |
| YOLO-World-L | 640 | PT (100e) | O365v1+GoldG+CC-LiteV2 | 🤗 HuggingFace |
| YOLO-World-L | 800 / 1280 | CPT (40e) | O365v1+GoldG+CC-LiteV2 | 🤗 HuggingFace |
| YOLO-World-X | 640 | PT (100e) | O365v1+GoldG+CC-LiteV2 | 🤗 HuggingFace |
Notes:
- PT: pre-training; CPT: continuing pre-training
- CC-LiteV2: the newly annotated CC3M subset, containing 250k images.