---
library_name: diffusers
---

# Florence-2 Image Annotator

A custom [Modular Diffusers](https://huggingface.co/docs/diffusers/modular_diffusers/overview) block that uses [Florence-2](https://huggingface.co/docs/transformers/model_doc/florence2) for image annotation tasks like segmentation, object detection, and captioning.

## Usage

### Basic Usage

```python
import torch
from diffusers import ModularPipeline
from diffusers.utils import load_image

# Load the block
image_annotator = ModularPipeline.from_pretrained(
    "diffusers/Florence2-image-Annotator",
    trust_remote_code=True
)
image_annotator.load_components(torch_dtype=torch.bfloat16)
image_annotator.to("cuda")

# Load an image
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg")
image = image.resize((1024, 1024))

# Generate a segmentation mask
output = image_annotator(
    image=image,
    annotation_task="<REFERRING_EXPRESSION_SEGMENTATION>",
    annotation_prompt="the car",
    annotation_output_type="mask_image",
)
output.mask_image[0].save("car-mask.png")
```

### Compose with Inpainting Pipeline

```python
import torch
from diffusers import ModularPipeline

# Load the annotator
image_annotator = ModularPipeline.from_pretrained(
    "diffusers/Florence2-image-Annotator",
    trust_remote_code=True
)

# Get an inpainting workflow and insert the annotator
# repo_id = ..  # any pipeline that supports inpainting (SDXL, Flux, Qwen, etc.)
inpaint_blocks = ModularPipeline.from_pretrained(repo_id).blocks.get_workflow("inpainting")
inpaint_blocks.sub_blocks.insert("image_annotator", image_annotator.blocks, 0)

# Initialize the combined pipeline
pipe = inpaint_blocks.init_pipeline()
pipe.load_components(torch_dtype=torch.float16, device="cuda")

# Inpaint with automatic mask generation
# (assumes `prompt` and `image` are already defined)
output = pipe(
    prompt=prompt,
    image=image,
    annotation_task="<REFERRING_EXPRESSION_SEGMENTATION>",
    annotation_prompt="the car",
    annotation_output_type="mask_image",
    num_inference_steps=30,
    output="images"
)
output[0].save("inpainted-car.png")
```

## Supported Tasks

| Task | Description |
|------|-------------|
| `<OD>` | Object detection |
| `<REFERRING_EXPRESSION_SEGMENTATION>` | Segment specific objects based on text |
| `<CAPTION>` | Generate image caption |
| `<DETAILED_CAPTION>` | Generate detailed caption |
| `<MORE_DETAILED_CAPTION>` | Generate very detailed caption |
| `<DENSE_REGION_CAPTION>` | Caption different regions |
| `<CAPTION_TO_PHRASE_GROUNDING>` | Ground phrases to regions |
| `<OPEN_VOCABULARY_DETECTION>` | Detect objects from open vocabulary |
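Tasks are selected by passing the corresponding token string as `annotation_task`. If you find the raw tokens unwieldy, a small lookup table can keep call sites readable. The mapping below is a hypothetical convenience, not part of the block's API:

```python
# Hypothetical helper: map short names to the Florence-2 task tokens
# listed in the table above. Not part of the block's API.
FLORENCE2_TASK_TOKENS = {
    "object_detection": "<OD>",
    "segmentation": "<REFERRING_EXPRESSION_SEGMENTATION>",
    "caption": "<CAPTION>",
    "detailed_caption": "<DETAILED_CAPTION>",
    "more_detailed_caption": "<MORE_DETAILED_CAPTION>",
    "dense_region_caption": "<DENSE_REGION_CAPTION>",
    "phrase_grounding": "<CAPTION_TO_PHRASE_GROUNDING>",
    "open_vocabulary_detection": "<OPEN_VOCABULARY_DETECTION>",
}

def task_token(name: str) -> str:
    """Return the Florence-2 task token for a short task name."""
    return FLORENCE2_TASK_TOKENS[name]
```

You could then write `annotation_task=task_token("segmentation")` instead of spelling out the token.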

## Output Types

| Type | Description |
|------|-------------|
| `mask_image` | Black and white mask image |
| `mask_overlay` | Mask overlaid on original image |
| `bounding_box` | Bounding boxes drawn on image |

## Inputs

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `image` | `PIL.Image` | Yes | - | Image to annotate |
| `annotation_task` | `str` | No | `<REFERRING_EXPRESSION_SEGMENTATION>` | Task to perform |
| `annotation_prompt` | `str` | Yes | - | Text prompt for the task |
| `annotation_output_type` | `str` | No | `mask_image` | Output format |

## Outputs

| Parameter | Type | Description |
|-----------|------|-------------|
| `mask_image` | `PIL.Image` | Generated mask (when output type is `mask_image`) |
| `image` | `PIL.Image` | Annotated image (when output type is `mask_overlay` or `bounding_box`) |
| `annotations` | `dict` | Raw annotation predictions |
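Which field holds the visual result depends on the requested `annotation_output_type`. A small hypothetical helper, mirroring the table above, can pick the right attribute name:

```python
# Hypothetical helper mirroring the Outputs table: given the requested
# annotation_output_type, return the name of the output field that
# holds the resulting PIL image.
def result_field(annotation_output_type: str) -> str:
    if annotation_output_type == "mask_image":
        return "mask_image"
    if annotation_output_type in ("mask_overlay", "bounding_box"):
        return "image"
    raise ValueError(f"unknown output type: {annotation_output_type!r}")
```

For example, after a `mask_overlay` run you would read `getattr(output, result_field("mask_overlay"))` to retrieve the overlaid image.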

## Components

This block uses the following components from [florence-community/Florence-2-base-ft](https://huggingface.co/florence-community/Florence-2-base-ft):

- `image_annotator`: `Florence2ForConditionalGeneration`
- `image_annotator_processor`: `AutoProcessor`

## Learn More

- [Building Custom Blocks Guide](https://huggingface.co/docs/diffusers/modular_diffusers/custom_blocks)
- [Modular Diffusers Overview](https://huggingface.co/docs/diffusers/modular_diffusers/overview)
- [Modular Diffusers Custom Blocks Collection](https://huggingface.co/collections/diffusers/modular-diffusers-custom-blocks)