File size: 2,164 Bytes
e0543ce
 
30d5421
e0543ce
 
cdbf51d
 
30d5421
e0543ce
 
30d5421
e0543ce
30d5421
e0543ce
30d5421
e0543ce
 
 
 
30d5421
e0543ce
30d5421
 
e0543ce
30d5421
 
e0543ce
30d5421
 
cdbf51d
30d5421
e0543ce
 
 
 
 
 
 
 
 
 
 
 
30d5421
 
 
 
 
 
 
 
 
 
 
cdbf51d
bf9b139
 
cdbf51d
 
 
 
 
 
 
bf9b139
cdbf51d
bf9b139
e0543ce
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
---
library_name: transformers
pipeline_tag: image-text-to-text
tags:
  - vision-language-model
  - image-to-text
  - bounding-box-detection
  - synlayers
---

# SynLayers Bbox-Caption Model

This repository hosts the **Stage 1 bbox-caption model** for SynLayers.

Given an input image, the model predicts:

- a whole-image caption
- bounding boxes for visible objects or layers

This repository is only for the Stage 1 detector. The full SynLayers system has two stages:

1. bbox + whole-caption prediction from this repo
2. layer decomposition into transparent RGBA outputs using the Stage 2 checkpoints

For the complete demo, please use our public Space:
[SynLayers/synlayers](https://huggingface.co/spaces/SynLayers/synlayers)

For the Stage 2 decomposition checkpoints and runtime assets, please see:
[SynLayers/synlayers](https://huggingface.co/SynLayers/synlayers)

This repo is not intended to be loaded as a generic `DiffusionPipeline(prompt)` model. If you only want the Stage 1 model, you can load it with `transformers`:

```python
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

model = Qwen3VLForConditionalGeneration.from_pretrained(
    "SynLayers/Bbox-caption-8b",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("SynLayers/Bbox-caption-8b")
```

This repository also includes lightweight inference helpers under `demo/infer/`. To run whole-caption and bbox inference on a folder of images:

```bash
python demo/infer/run_caption_bbox_infer.py \
  --model SynLayers/Bbox-caption-8b \
  --data-dir /path/to/images \
  --output outputs/caption_bbox_infer.jsonl \
  --vis-dir outputs/bbox_vis
```

For more details, please check our paper:
[https://arxiv.org/abs/2605.15167](https://arxiv.org/abs/2605.15167)

If you find our work useful, please consider citing:

```bibtex
@article{wu2026does,
  title={Does Synthetic Layered Design Data Benefit Layered Design Decomposition?},
  author={Wu, Kam Man and Yang, Haolin and Chen, Qingyu and Tang, Yihu and Chen, Jingye and Chen, Qifeng},
  journal={arXiv preprint arXiv:2605.15167},
  year={2026}
}
```

Thanks for trying SynLayers.