Training Qwen3 VL to label bounding boxes: synthetic data, environment, and training analysis

Community Article Published February 9, 2026

TLDR

Motivation

Training vision models (especially small ones) to properly perceive their visual environment is becoming key to expanding the capabilities and versatility of models and to applying them to real-world use cases. While huge models have a broad spectrum of grounding capabilities, smaller ones tend to be less reliable; yet they can be highly specialized, run for a fraction of the cost, and even run locally on embedded systems.

One of the first “visual grounded” models I saw was in the Molmo paper, where the model was able to point at some part of the image to recognize a specific location:

image-1

Recent advancements in small VL models allow models to perform bounding box prediction and captioning of specific entities in an image. These capabilities are mandatory for real-life use cases (process automation, robotics, …) with some key features:

The model needs to detect all / most of the occurrences of an object

The model needs to be as precise and as robust as possible for the bounding box area and coverage

It needs to be able to label / caption the objects

One test case I had was window detection on architectural images of facades: it requires the model to understand the architecture / housing domain and to handle many occurrences of different shapes.

I tested these capabilities with large Qwen VL models and they worked well, yet small Qwen models had a tendency to miscount the number of occurrences (either fewer or far more) and to miss the target area.

image-2

I decided to further train Qwen 3 VL Instruct with RL on this task, using a reusable bounding box RL environment. I will share in this blog how I created the data, designed the environment, and trained this model.

Base model and Qwen behavior

I used Qwen 3 VL 2B Instruct as the base model for my training.

Qwen 3 VL is trained on bounding box extraction and captioning in the instruct format. A few things to be aware of: Qwen assumes the image is of size 1000 × 1000, so this must be taken into account when creating / labeling data and when displaying the results on an image.

The prompt given in the documentation is the following and is the one I used for training and testing:

locate every instance that belongs to the following categories: CATEGORY. For each window, report bbox coordinates, in JSON format like this: {"bbox_2d": [x1, y1, x2, y2], "label": CATEGORY}

The output is then supposed to have the following format:

```json
[
	{"bbox_2d": [448, 621, 486, 755], "label": "window"},
	{"bbox_2d": [511, 621, 548, 755], "label": "window"},
	{"bbox_2d": [574, 621, 609, 755], "label": "window"},
	{"bbox_2d": [511, 388, 547, 527], "label": "window"},
	{"bbox_2d": [448, 388, 486, 527], "label": "window"},
	{"bbox_2d": [574, 388, 609, 527], "label": "window"},
	{"bbox_2d": [768, 429, 804, 567], "label": "window"},
	{"bbox_2d": [768, 638, 804, 775], "label": "window"},
	{"bbox_2d": [701, 638, 736, 775], "label": "window"},
	{"bbox_2d": [324, 685, 362, 818], "label": "window"},
	{"bbox_2d": [194, 685, 230, 818], "label": "window"},
	{"bbox_2d": [698, 419, 736, 567], "label": "window"}
]
```
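Since Qwen emits coordinates in this 1000 × 1000 reference frame, a small post-processing step is needed before drawing boxes on a real image. Here is a minimal sketch of such a step (the function name `parse_qwen_bboxes` and the regex-based fence stripping are my own, not part of the official tooling):

```python
import json
import re

def parse_qwen_bboxes(output_text, image_width, image_height):
    """Extract bbox_2d entries from a Qwen-style response and rescale
    them from Qwen's 1000 x 1000 reference frame to real pixel coords."""
    # Grab the JSON array, skipping an optional ```json ... ``` fence.
    match = re.search(r"\[.*\]", output_text, re.DOTALL)
    if match is None:
        return []
    boxes = json.loads(match.group(0))
    scaled = []
    for box in boxes:
        x1, y1, x2, y2 = box["bbox_2d"]
        scaled.append({
            "bbox_2d": [
                x1 * image_width / 1000,
                y1 * image_height / 1000,
                x2 * image_width / 1000,
                y2 * image_height / 1000,
            ],
            "label": box["label"],
        })
    return scaled
```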

Infinite synthetic data generation

When working on this project, I started by trying to find public datasets of architecture or CAD images with labels. However, there are very few, and they are not open source. I then tried to find images on the internet to label myself, but there were several issues:

There are very few elevation datasets that are easily collectible.

Quality is often very bad, and resolution too.

Labeling is really time-consuming, even with a convenient application, and it does not reach 100% precision.

In parallel with this project, I was working on 3D applications in Three.js, and I realized there is a way to get infinite, auto-labeled, high-quality data. This applies to this project, but also to quite a lot of VLM tuning tasks.

Core ideas

Each generation is driven by a set of randomized variables that ensure no two samples are identical:

  • Volumetric variance: The script decides on a number of volumes / building parts (usually 2 or 3) and assigns each a random width (W) and height (H).

  • Architectural style (Palettes): A random selection from a PALETTES array determines the hex codes for the walls, accent trims, and window glass.

  • Roof morphology: A probability check (40/60 split) determines if a section receives a Gable (triangular) or Flat (modern slab) roof.

  • Procedural fenestration / grid logic: Windows are not placed randomly; they are calculated based on the wall's dimensions to ensure they look structurally "sane".

  • Window geometry: A style toggle switches between standard rectangles and arched tops (using absarc geometry).

  • Orthographic projection: The use of an OrthographicCamera removes perspective distortion, ensuring the architectural elevation remains flat and measurable, which is standard for technical datasets.

Generation and export flow

To build a sample pair (PNG + JSON), the code follows this specific sequence:

  • Scene reset: The previous building group is purged from the Three.js scene to free up memory and clear the canvas.
  • Constrained build: The script iterates through the volumes, drawing the walls first. It then "stamps" windows and doors based on the calculated grid. Every window mesh created is stored into an array for tracking.
  • Bounding box: The camera automatically calculates the bounding box of the entire building and adjusts its zoom to ensure the structure is perfectly centered and scaled.
  • Coordinate mapping: For every object in the window array, the script calculates its 3D world position. It uses the .project(camera) method to translate that 3D position into 2D screen coordinates. These are normalized to a 1000 × 1000 scale to match Qwen's object detection format.
  • Synchronized export: The renderer.domElement is converted to a DataURL to create the PNG. The mapped coordinates are stringified into a json file. A shared exportId (timestamp) is applied to both filenames to ensure the image and its labels stay paired.
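The coordinate-mapping step can be sketched in Python (the Three.js code does the same math in JavaScript; the helper names here are mine). `.project(camera)` returns normalized device coordinates in [-1, 1] with y pointing up, while Qwen's frame has its origin at the top left:

```python
def ndc_to_qwen(ndc_x, ndc_y, scale=1000):
    """Map normalized device coordinates ([-1, 1], y up) to Qwen's
    top-left-origin 1000 x 1000 frame."""
    x = (ndc_x + 1) / 2 * scale
    y = (1 - ndc_y) / 2 * scale  # flip y: NDC y points up, image y points down
    return round(x), round(y)

def corners_to_bbox(projected_corners):
    """Collapse a window mesh's projected corner points into one
    [x1, y1, x2, y2] bbox_2d entry."""
    xs = [p[0] for p in projected_corners]
    ys = [p[1] for p in projected_corners]
    return [min(xs), min(ys), max(xs), max(ys)]
```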

Because the labels are generated by the same math that generates the image, the bounding boxes have 100% precision. This is the primary advantage of synthetic data over manual human labeling.

I really think this method of creating infinite synthetic data from a 3D environment in Three.js with a procedural / constrained approach can help build RL environments, and could be used in an online setup for training and improvement of models.

Here is what the app looks like:

image

It produces a PNG of the elevation view of the house and a JSON file with the bounding boxes, like this:

[{"bbox_2d":[355,660,380,771],"label":"window"},{"bbox_2d":[445,660,469,771],"label":"window"},{"bbox_2d":[355,465,380,576],"label":"window"},{"bbox_2d":[400,465,425,576],"label":"window"},{"bbox_2d":[445,465,469,576],"label":"window"},{"bbox_2d":[533,690,558,801],"label":"window"},{"bbox_2d":[577,690,602,801],"label":"window"},{"bbox_2d":[621,690,645,801],"label":"window"},{"bbox_2d":[533,525,558,636],"label":"window"},{"bbox_2d":[577,525,602,636],"label":"window"},{"bbox_2d":[621,525,645,636],"label":"window"}]

I created (and manually checked) a synthetic dataset: https://huggingface.co/datasets/UlrickBL/elevation-dataset-synthetic-v2

And also a manually annotated hard dataset : https://huggingface.co/datasets/UlrickBL/elevation-dataset

Find the code here : https://github.com/UlrickBL/segmentation

Reusable environment and reward design

Now that we have enough data to start training a model, we need to design the rest of the environment for the RL training.

Since I am doing bounding box detection, I used IoU with Hungarian matching, as bounding boxes can be predicted in any order.

This part of the blog explains the two reward functions I tested for object detection with Hungarian matching:

  • strict IoU reward
  • smooth geometry and IoU reward

I will use the following notation:

  • Predictions: P = {p_1, ..., p_n}

  • Ground truth: G = {g_1, ..., g_m}

Each object p has a bounding box bbox(p) and a class label label(p).

We call IoU(p, g) the Intersection over Union metric, which lies in [0, 1].
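As a concrete reference, a minimal IoU implementation for [x1, y1, x2, y2] boxes might look like this (a standard sketch, not necessarily the exact code from my environment):

```python
def iou(a, b):
    """Intersection over Union of two [x1, y1, x2, y2] boxes, in [0, 1]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0
```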

Strict IoU reward

The first reward I tried was a strict, evaluation-style reward that measures the quality of the best-matched predictions.

For each prediction p_i and ground truth g_j:

cost(i, j) = -[ IoU(p_i, g_j) + 0.5 * label_match(p_i, g_j) ]

Hungarian matching finds the assignment π* that minimizes the total cost (equivalently, maximizes IoU + label correctness), so that even if the predictions are not in the same order as the ground truth, the score is taken over the best permutation.

For each matched pair (p_i, g_{π*(i)}):

score_i = 0.7 * IoU(p_i, g_{π*(i)}) + 0.3 * label_match(p_i, g_{π*(i)})

The final reward is the mean score over matched pairs only.
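A sketch of this strict reward, using scipy's `linear_sum_assignment` for the Hungarian matching (this mirrors the formulas above but is not necessarily the exact environment code; the IoU helper is inlined so the snippet is self-contained):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Intersection over Union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def strict_iou_reward(preds, gts):
    """Mean of 0.7 * IoU + 0.3 * label_match over Hungarian-matched
    pairs only (no penalty for extra or missing boxes)."""
    if not preds or not gts:
        return 0.0
    cost = np.zeros((len(preds), len(gts)))
    for i, p in enumerate(preds):
        for j, g in enumerate(gts):
            match = 1.0 if p["label"] == g["label"] else 0.0
            cost[i, j] = -(iou(p["bbox_2d"], g["bbox_2d"]) + 0.5 * match)
    rows, cols = linear_sum_assignment(cost)  # minimizes total cost
    scores = []
    for i, j in zip(rows, cols):
        match = 1.0 if preds[i]["label"] == gts[j]["label"] else 0.0
        scores.append(0.7 * iou(preds[i]["bbox_2d"], gts[j]["bbox_2d"])
                      + 0.3 * match)
    return float(np.mean(scores))
```

Note how a non-overlapping box with the right label still scores 0.3: the label term carries signal, but the geometry term is silent whenever IoU = 0.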

A few things to note:

  • Uses only matched pairs
  • No penalty for extra predicted boxes
  • No penalty for missing ground-truth objects
  • Zero geometry signal when IoU = 0

Because it does not penalize extra or missing boxes, and gives no signal when IoU = 0, my first runs received little advantage signal and the exploration phase was painful. That's why I switched to another reward.

Smooth geometry and IoU reward

This reward provides a non-zero signal even when boxes do not overlap and penalizes hallucinated or missing objects.

It is composed of a distance hint: h(p, g) = exp( -d(p, g) / 200 )

This is combined into the geometry score: geom(p, g) = max( IoU(p, g), 0.1 * h(p, g) )

This ensures warmer / colder feedback even when IoU = 0.

The pairwise score: S(p, g) = 0.8 * geom(p, g) + 0.2 * label_score(p, g)

is used to calculate the cost matrix: cost(i, j) = 1 - S(p_i, g_j)

Hungarian matching finds the assignment π* minimizing total cost (just as for the previous reward).

The final reward is normalized by scene size (to penalize extra or missing boxes):

Reward = (1 / max(n, m)) * sum_i S(p_i, g_{π*(i)})
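A self-contained sketch of this smooth reward (assuming d(p, g) is the distance between box centers, which is my reading of the distance hint; scipy's `linear_sum_assignment` again does the Hungarian matching):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Intersection over Union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def smooth_reward(preds, gts):
    """geom = max(IoU, 0.1 * exp(-d / 200)), S = 0.8 * geom + 0.2 * label,
    Hungarian-matched and normalized by max(n, m)."""
    n, m = len(preds), len(gts)
    if n == 0 or m == 0:
        return 0.0

    def center(b):
        return ((b[0] + b[2]) / 2, (b[1] + b[3]) / 2)

    S = np.zeros((n, m))
    for i, p in enumerate(preds):
        for j, g in enumerate(gts):
            pb, gb = p["bbox_2d"], g["bbox_2d"]
            (px, py), (gx, gy) = center(pb), center(gb)
            d = ((px - gx) ** 2 + (py - gy) ** 2) ** 0.5
            geom = max(iou(pb, gb), 0.1 * float(np.exp(-d / 200)))
            label = 1.0 if p["label"] == g["label"] else 0.0
            S[i, j] = 0.8 * geom + 0.2 * label
    rows, cols = linear_sum_assignment(1.0 - S)  # minimize cost = 1 - S
    # Normalizing by max(n, m) penalizes both extra and missing boxes.
    return float(S[rows, cols].sum() / max(n, m))
```

For example, a perfect prediction scores 1.0, while duplicating that same box halves the reward because the denominator becomes max(2, 1) = 2.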

This reward:

  • Penalizes extra (hallucinated) predictions
  • Penalizes missed ground-truth objects
  • Provides smooth gradients when boxes are close but do not overlap

Comparison

| Aspect | reward_matching | reward_geometry_smooth_hungarian |
| --- | --- | --- |
| Primary use | Evaluation | Training / RL |
| Geometry signal | IoU only | IoU + distance hint |
| Signal when IoU = 0 | No | Yes |
| Penalizes extra predictions | No | Yes |
| Penalizes missed ground truth | No | Yes |
| Normalization | Mean over matches | Global (max(n, m)) |
| Early training behavior | Sparse, unstable | Smooth, guided |

I went with the smooth one for my successful runs.

You can find the RL environment on the Prime Intellect hub; it is usable with the verifiers library and easily extensible to any kind of bbox-with-captioning use case: https://app.primeintellect.ai/dashboard/environments/ulrick-bl/object-detection-vl

Successful run

Now, I will describe my final successful run. As explained before, I started from Qwen/Qwen3-VL-2B-Instruct with a LoRA adapter of rank 16 and alpha 16 on the q, k, v, up, and down linear layers. Training was done with GRPO.

The training parameters were the following:

  • learning rate: 1e-4
  • temperature: 1.0
  • scheduler: linear warmup and cosine decay
  • warmup: 30 steps
  • beta: 0.02 (a small KL weight, to let the model diverge from the base and specialize on the task)
  • batch size of 3 with gradient accumulation of 10 (effective batch size of 30)
  • number of rollouts: 15
  • rewards: a parsable-output reward (weight 0.2) and the smooth geometry and IoU reward (weight 0.8)
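For reference, the adapter setup could be expressed with peft roughly as follows (a hypothetical sketch: the exact `target_modules` names depend on the Qwen3-VL implementation, so treat them as assumptions):

```python
from peft import LoraConfig

# Hypothetical sketch of the run configuration described above.
# The target module names are assumptions based on common Qwen naming.
lora_config = LoraConfig(
    r=16,                # LoRA rank
    lora_alpha=16,       # LoRA alpha
    target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

effective_batch_size = 3 * 10  # batch size x gradient accumulation = 30
reward_weights = {"parsable": 0.2, "smooth_geometry_iou": 0.8}
```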

Training went smoothly, with the reward going from 0.2 to a plateau of 0.8 after 80 steps and no outliers around the rolling mean reward, indicating a well-chosen effective batch size:

image-3

The standard deviation of the rollouts stayed between 0.05 and 0.1, highlighting that the number of rollouts and the temperature allowed sufficient diversity in completions, and therefore a meaningful advantage and a non-null loss:

image-4

image-5

Entropy decreased from 0.35 to 0.25 once the policy found the "correct" path (leading to the reward plateau):

image-8

The model started to diverge from the base Instruct model, especially when finding the correct path, reaching a KL of 0.2 with some spikes:

image-6

During the exploration phase, the minimum completion length decreased, then started to increase again, probably showing the impact of penalizing extra / missing boxes:

image-7

You can find the code for the run in this repo : https://github.com/UlrickBL/rl_bbox_training

Results

Here are some before / after examples of improvements on out-of-distribution (OOD) images:

Real low-quality, low-resolution example

Before :

image-15

After :

image-9

Miscount, different type of window

Before :

image-10

After :

image-11

Very small image, asymmetry

Before :

image-12

After :

image-13

A hard case:

image-14
