OzzyGT (HF Staff) committed
Commit 6ef09aa · Parent(s): 4c1b191
Files changed (2):
  1. README.md +273 -0
  2. block.py +4 -0
README.md CHANGED
@@ -12,6 +12,48 @@ The node can be used with the default installation of Mellon using the `Dynamic
 
 ## Using it with code
 
+### Captioning
+
+```python
+import torch
+
+from diffusers.modular_pipelines import ModularPipeline
+from diffusers.utils import load_image
+
+
+pipe = ModularPipeline.from_pretrained("OzzyGT/florence-2-block", trust_remote_code=True)
+pipe.load_components(torch_dtype=torch.float16)
+pipe.to("cuda")
+
+image = load_image(
+    "https://huggingface.co/datasets/OzzyGT/diffusers-examples/resolve/main/florence-2/white_board_people.png"
+)
+
+annotation_task = "<CAPTION>"  # can also be <DETAILED_CAPTION> or <MORE_DETAILED_CAPTION>
+annotation_prompt = ""
+
+output = pipe(image=image, annotation_task=annotation_task, annotation_prompt=annotation_prompt).annotations[0]
+print(output)
+```
+
+#### Caption
+
+```
+A man and a woman writing on a white board.
+```
+
+#### Detailed Caption
+
+```
+In this image we can see a man and a woman holding markers in their hands. We can also see a board with some text on it.
+```
+
+#### More Detailed Caption
+
+```
+A man and a woman are standing in front of a whiteboard. The woman is writing on a black marker. The man is wearing a blue shirt. The whiteboard has writing on it. The writing on the whiteboard is black. The people are looking at each other. There is writing in black marker on the board. There are drawings on whiteboard behind the people.
+```
+
 ### Object Detection
 
 ```python
@@ -44,3 +86,234 @@ output.save("output.png")
 | Input | Output |
 | ------------------------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------------------------------------------------------- |
 | ![Input](https://huggingface.co/datasets/OzzyGT/diffusers-examples/resolve/main/florence-2/white_board_people.png) | ![Output](https://huggingface.co/datasets/OzzyGT/diffusers-examples/resolve/main/florence-2/object_detection.png) |
+
+### Dense Region Caption
+
+```python
+import torch
+
+from diffusers.modular_pipelines import ModularPipeline
+from diffusers.utils import load_image
+
+
+pipe = ModularPipeline.from_pretrained("OzzyGT/florence-2-block", trust_remote_code=True)
+pipe.load_components(torch_dtype=torch.float16)
+pipe.to("cuda")
+
+image = load_image(
+    "https://huggingface.co/datasets/OzzyGT/diffusers-examples/resolve/main/florence-2/white_board_people.png"
+)
+
+annotation_task = "<DENSE_REGION_CAPTION>"
+annotation_prompt = ""
+
+output = pipe(
+    image=image,
+    annotation_task=annotation_task,
+    annotation_prompt=annotation_prompt,
+    annotation_output_type="bounding_box",
+).images[0]
+output.save("output.png")
+```
+
+| Input | Output |
+| ------------------------------------------------------------------------------------------------------------------ | --------------------------------------------------------------------------------------------------------------------- |
+| ![Input](https://huggingface.co/datasets/OzzyGT/diffusers-examples/resolve/main/florence-2/white_board_people.png) | ![Output](https://huggingface.co/datasets/OzzyGT/diffusers-examples/resolve/main/florence-2/dense_region_caption.png) |
+
+### Region Proposal
+
+```python
+import torch
+
+from diffusers.modular_pipelines import ModularPipeline
+from diffusers.utils import load_image
+
+
+pipe = ModularPipeline.from_pretrained("OzzyGT/florence-2-block", trust_remote_code=True)
+pipe.load_components(torch_dtype=torch.float16)
+pipe.to("cuda")
+
+image = load_image(
+    "https://huggingface.co/datasets/OzzyGT/diffusers-examples/resolve/main/florence-2/white_board_people.png"
+)
+
+annotation_task = "<REGION_PROPOSAL>"
+annotation_prompt = ""
+
+output = pipe(
+    image=image,
+    annotation_task=annotation_task,
+    annotation_prompt=annotation_prompt,
+    annotation_output_type="bounding_box",
+).images[0]
+output.save("output.png")
+```
+
+| Input | Output |
+| ------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------------------------- |
+| ![Input](https://huggingface.co/datasets/OzzyGT/diffusers-examples/resolve/main/florence-2/white_board_people.png) | ![Output](https://huggingface.co/datasets/OzzyGT/diffusers-examples/resolve/main/florence-2/region_proposal.png) |
+
+### Phrase Grounding
+
+```python
+import torch
+
+from diffusers.modular_pipelines import ModularPipeline
+from diffusers.utils import load_image
+
+
+pipe = ModularPipeline.from_pretrained("OzzyGT/florence-2-block", trust_remote_code=True)
+pipe.load_components(torch_dtype=torch.float16)
+pipe.to("cuda")
+
+image = load_image(
+    "https://huggingface.co/datasets/OzzyGT/diffusers-examples/resolve/main/florence-2/white_board_people.png"
+)
+
+annotation_task = "<CAPTION_TO_PHRASE_GROUNDING>"
+annotation_prompt = "man"
+
+output = pipe(
+    image=image,
+    annotation_task=annotation_task,
+    annotation_prompt=annotation_prompt,
+    annotation_output_type="bounding_box",  # can also use `mask_image` and `mask_overlay`
+).images[0]
+output.save("output.png")
+```
+
+| Input | Bounding Box | Mask Image | Mask Overlay |
+| ------------------------------------------------------------------------------------------------------------------ | --------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------ |
+| ![Input](https://huggingface.co/datasets/OzzyGT/diffusers-examples/resolve/main/florence-2/white_board_people.png) | ![Bounding Box](https://huggingface.co/datasets/OzzyGT/diffusers-examples/resolve/main/florence-2/phrase_grounding_bbox.png) | ![Mask Image](https://huggingface.co/datasets/OzzyGT/diffusers-examples/resolve/main/florence-2/phrase_grounding_mask.png) | ![Mask Overlay](https://huggingface.co/datasets/OzzyGT/diffusers-examples/resolve/main/florence-2/phrase_grounding_overlay.png) |
+
+### Referring Expression Segmentation
+
+```python
+import torch
+
+from diffusers.modular_pipelines import ModularPipeline
+from diffusers.utils import load_image
+
+
+pipe = ModularPipeline.from_pretrained("OzzyGT/florence-2-block", trust_remote_code=True)
+pipe.load_components(torch_dtype=torch.float16)
+pipe.to("cuda")
+
+image = load_image(
+    "https://huggingface.co/datasets/OzzyGT/diffusers-examples/resolve/main/florence-2/white_board_people.png"
+)
+
+annotation_task = "<REFERRING_EXPRESSION_SEGMENTATION>"
+annotation_prompt = "man"
+
+output = pipe(
+    image=image,
+    annotation_task=annotation_task,
+    annotation_prompt=annotation_prompt,
+    annotation_output_type="mask_image",  # can also use `mask_overlay`
+).images[0]
+output.save("output.png")
+```
+
+| Input | Mask Image | Mask Overlay |
+| ------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------- |
+| ![Input](https://huggingface.co/datasets/OzzyGT/diffusers-examples/resolve/main/florence-2/white_board_people.png) | ![Mask Image](https://huggingface.co/datasets/OzzyGT/diffusers-examples/resolve/main/florence-2/ref_exp_seg_mask.png) | ![Mask Overlay](https://huggingface.co/datasets/OzzyGT/diffusers-examples/resolve/main/florence-2/ref_exp_seg_overlay.png) |
+
+### Open Vocabulary Detection
+
+```python
+import torch
+
+from diffusers.modular_pipelines import ModularPipeline
+from diffusers.utils import load_image
+
+
+pipe = ModularPipeline.from_pretrained("OzzyGT/florence-2-block", trust_remote_code=True)
+pipe.load_components(torch_dtype=torch.float16)
+pipe.to("cuda")
+
+image = load_image(
+    "https://huggingface.co/datasets/OzzyGT/diffusers-examples/resolve/main/florence-2/white_board_people.png"
+)
+
+annotation_task = "<OPEN_VOCABULARY_DETECTION>"
+annotation_prompt = "man with a beard"
+
+output = pipe(
+    image=image,
+    annotation_task=annotation_task,
+    annotation_prompt=annotation_prompt,
+    annotation_output_type="bounding_box",
+).images[0]
+output.save("output.png")
+```
+
+| Input | Output |
+| ------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------------------------- |
+| ![Input](https://huggingface.co/datasets/OzzyGT/diffusers-examples/resolve/main/florence-2/white_board_people.png) | ![Output](https://huggingface.co/datasets/OzzyGT/diffusers-examples/resolve/main/florence-2/open_vocabulary.png) |
+
+### OCR
+
+```python
+import torch
+
+from diffusers.modular_pipelines import ModularPipeline
+from diffusers.utils import load_image
+
+
+pipe = ModularPipeline.from_pretrained("OzzyGT/florence-2-block", trust_remote_code=True)
+pipe.load_components(torch_dtype=torch.float16)
+pipe.to("cuda")
+
+image = load_image(
+    "https://huggingface.co/datasets/OzzyGT/diffusers-examples/resolve/main/florence-2/white_board_people.png"
+)
+
+annotation_task = "<OCR>"
+annotation_prompt = ""
+
+output = pipe(
+    image=image,
+    annotation_task=annotation_task,
+    annotation_prompt=annotation_prompt,
+    annotation_output_type="bounding_box",
+).annotations[0]
+print(output)
+```
+
+```
+The Diffuser's library byHugging Face makes it easyfor developers to run imagegeneration and influenceusing state-of-the-astdiffusion models withjust a few lines of codehuman eou
+```
+
+### OCR with region
+
+```python
+import torch
+
+from diffusers.modular_pipelines import ModularPipeline
+from diffusers.utils import load_image
+
+
+pipe = ModularPipeline.from_pretrained("OzzyGT/florence-2-block", trust_remote_code=True)
+pipe.load_components(torch_dtype=torch.float16)
+pipe.to("cuda")
+
+image = load_image(
+    "https://huggingface.co/datasets/OzzyGT/diffusers-examples/resolve/main/florence-2/white_board_people.png"
+)
+
+annotation_task = "<OCR_WITH_REGION>"
+annotation_prompt = ""
+
+output = pipe(
+    image=image,
+    annotation_task=annotation_task,
+    annotation_prompt=annotation_prompt,
+    annotation_output_type="bounding_box",
+).images[0]
+output.save("output.png")
+```
+
+| Input | Output |
+| ------------------------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------------------------------------------------- |
+| ![Input](https://huggingface.co/datasets/OzzyGT/diffusers-examples/resolve/main/florence-2/white_board_people.png) | ![Output](https://huggingface.co/datasets/OzzyGT/diffusers-examples/resolve/main/florence-2/ocr_region.png) |
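Every section added to the README repeats the same pipeline setup and call shape; only the task token, prompt, and output type change. A small wrapper makes that shared pattern explicit. This is a hypothetical helper, not part of the block's API; the argument names simply mirror the keyword arguments used in the examples above.

```python
def run_annotation(pipe, image, task, prompt="", output_type="bounding_box"):
    """Run one Florence-2 annotation task on an already-loaded pipeline.

    `task` is a Florence-2 task token such as "<OCR>" or
    "<DENSE_REGION_CAPTION>"; `output_type` matches the
    `annotation_output_type` values used in the examples above.
    """
    return pipe(
        image=image,
        annotation_task=task,
        annotation_prompt=prompt,
        annotation_output_type=output_type,
    )

# With `pipe` and `image` loaded as in the examples above, e.g.:
# run_annotation(pipe, image, "<DENSE_REGION_CAPTION>").images[0].save("output.png")
# run_annotation(pipe, image, "<CAPTION_TO_PHRASE_GROUNDING>", prompt="man",
#                output_type="mask_overlay").images[0].save("output.png")
```

Loading the model once and reusing it this way avoids repeating the `from_pretrained` / `load_components` setup for each task.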
block.py CHANGED
@@ -270,6 +270,10 @@ class Florence2ImageAnnotatorBlock(ModularPipelineBlocks):
         # Standard axis-aligned boxes
         bboxes = _annotation.get("bboxes", [])
         labels = _annotation.get("labels", [])
+
+        if len(labels) == 0:
+            labels = _annotation.get("bboxes_labels", [])
+
         for i, bbox in enumerate(bboxes):
             flat = np.array(bbox).flatten().tolist()
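The four added lines in `block.py` handle tasks whose annotations store box labels under `bboxes_labels` instead of `labels` (Florence-2's open vocabulary detection output uses that key). A standalone sketch of the fallback, with hand-written annotation dicts for illustration; `extract_boxes` is a hypothetical helper that mirrors the changed lines, not the block's actual method:

```python
import numpy as np


def extract_boxes(annotation):
    """Return (flat_bbox, label) pairs, falling back to the alternate
    label key some Florence-2 tasks emit (mirrors the block.py change)."""
    bboxes = annotation.get("bboxes", [])
    labels = annotation.get("labels", [])
    if len(labels) == 0:
        labels = annotation.get("bboxes_labels", [])
    results = []
    for i, bbox in enumerate(bboxes):
        flat = np.array(bbox).flatten().tolist()
        label = labels[i] if i < len(labels) else ""
        results.append((flat, label))
    return results


# Object-detection style annotation: labels under "labels"
print(extract_boxes({"bboxes": [[0, 0, 10, 10]], "labels": ["person"]}))
# Open-vocabulary style annotation: labels under "bboxes_labels";
# without the fallback, every box here would get an empty label
print(extract_boxes({"bboxes": [[0, 0, 10, 10]], "bboxes_labels": ["man with a beard"]}))
```

Without the fallback, tasks that emit `bboxes_labels` would draw their bounding boxes with empty label strings.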