tiiuae/Falcon-Perception-300M

@@ -1,5 +1,5 @@
 ---
-pipeline_tag: mask-generation
 library_name: transformers
 tags:
 - falcon
@@ -11,33 +11,35 @@ license: apache-2.0
 <img src="main_fig.jpg" width="480" alt="Falcon Perception"/>
-> ![NOTE] This is the smaller version (300M parameters) and only support detection task.
-## Falcon Perception
-Falcon Perception is a 0.3B parameter early-fusion vision-language model for open-vocabulary grounding detection. Given an image and a natural language query, it returns zero, one, or many matching instances with accurate bounding boxes.
-The model is built around a simple interface. Image patches and text tokens are processed together in a single Transformer using a hybrid attention mask: image tokens build bidirectional visual context, while text and task tokens decode causally conditioned on the image. For each instance, the model generates a short structured sequence of task tokens in a fixed order, `<|coord|>` then `<|size|>` then `<|seg|>`. The `<|seg|>` token acts as a mask query whose hidden state is projected and dotted with upsampled image features, producing a full-resolution binary mask without autoregressive mask generation.
 ### Links
-- Code and inference engine: `https://github.com/tiiuae/Falcon-Perception`
 - Tech report: arXiv link coming soon
 - PBench dataset: `tiiuae/PBench`
-- OCR model: `tiiuae/Falcon-OCR`
 ## Quickstart
 ### Installation
 ```bash
-pip install "torch>=2.5" transformers pillow einops pycocotools
 ```
 This model requires PyTorch 2.5 or newer for FlexAttention. The first call can be slower because `torch.compile` may build optimized kernels.
-### Run open-vocabulary segmentation
 ```python
 import torch
@@ -45,7 +47,7 @@ from PIL import Image
 from transformers import AutoModelForCausalLM
 model = AutoModelForCausalLM.from_pretrained(
-    "tiiuae/falcon-perception-300m",
     trust_remote_code=True,
     device_map={"": "cuda:0"},
 )
@@ -57,18 +59,31 @@ for p in preds:
     print(p["xy"], p["hw"])
 ```
-### Decode masks
 ```python
-import numpy as np
-from pycocotools import mask as mask_utils
 for p in preds:
-    rle = p["mask_rle"]
-    # pycocotools expects bytes for counts
-    m = {"size": rle["size"], "counts": rle["counts"].encode("utf-8")}
-    mask = mask_utils.decode(m).astype(bool)  # H x W
-    print(mask.shape, mask.sum())
 ```
 ## API
@@ -79,6 +94,7 @@ for p in preds:
 |---|---|---|---|
 | `images` | `PIL.Image` or `list` | required | Single image or list of images |
 | `queries` | `str` or `list[str]` | required | Query string(s), one per image |
 | `max_new_tokens` | `int` | `2048` | Maximum decoding steps |
 | `min_dimension` | `int` | `256` | Minimum image side after resize |
 | `max_dimension` | `int` | `1024` | Maximum image side after resize |
@@ -86,23 +102,26 @@ for p in preds:
 **Returns:** `list[list[dict]]`, one list per image.
-Each prediction dict contains:
 ```python
 {
-  "xy": {"x": float, "y": float},                    # center in normalized coordinates (0 to 1)
-  "hw": {"h": float, "w": float},                    # size in normalized coordinates (0 to 1)
-  "mask_rle": {"counts": str, "size": [H, W]},       # COCO RLE at original resolution
 }
 ```
 ## What the model is for
-Falcon Perception is designed for dense grounding regimes where the main difficulty is localization under open vocabulary. That includes:
 - Natural language driven object selection in images
-- Promptable instance segmentation for downstream pipelines
 - Crowded scenes where the number of instances is large and variable
 It is not intended as a general-purpose vision-language assistant for open-ended reasoning, long-form generation, or multi-step VQA.
@@ -112,24 +131,24 @@ The architecture follows a single-stack early-fusion recipe:
 - One dense Transformer backbone processes image patches and text tokens in a shared space from the first layer
 - Hybrid attention masking: bidirectional among image tokens, causal for text and task tokens conditioned on the image
-- Chain-of-Perception decoding: `<|coord|>` then `<|size|>` then `<|seg|>` per instance
 - Specialized heads for coordinates and size, with geometry conditioning via Fourier features
-- Parallel mask decoding: each `<|seg|>` token becomes a mask query and produces a full-resolution mask via dot product with upsampled image features
-## Evaluation summary
-From the technical report:
-- SA-Co (open-vocabulary segmentation): 68.0 Macro F1 compared to 62.3 for SAM 3, with the main remaining gap being presence calibration (Average MCC 0.64 compared to 0.82 for SAM 3)
-- PBench: a diagnostic benchmark that breaks down performance by capability (attributes, OCR-guided disambiguation, spatial constraints, relations) and includes a dense long-context crowded split
-Full tables, setup details, and ablations are in the report.
 ## Limitations
-- Presence calibration remains a key limitation for autoregressive dense interfaces. False positives are more likely on hard negatives than in DETR like segmentation models.
 - OCR-driven prompts depend on text size and image resolution. Small text and degraded scans are challenging.
 - Dense scenes benefit strongly from high resolution inputs. Low resolution can be sufficient to recognize that a concept is present, but insufficient to localize each instance precisely.
 ## Citation

 ---
+pipeline_tag: object-detection
 library_name: transformers
 tags:
 - falcon
 <img src="main_fig.jpg" width="480" alt="Falcon Perception"/>
+> [!NOTE]
+> This is the **300M parameter** variant of Falcon Perception. It supports **detection only** (bounding boxes). For the full model with segmentation masks, see [`tiiuae/Falcon-Perception`](https://huggingface.co/tiiuae/Falcon-Perception).
+## Falcon Perception 300M
+Falcon Perception 300M is a 0.3B parameter early-fusion vision-language model for open-vocabulary grounding detection. Given an image and a natural language query, it returns zero, one, or many matching instances with accurate bounding boxes.
+The model is built around a simple interface. Image patches and text tokens are processed together in a single Transformer using a hybrid attention mask: image tokens build bidirectional visual context, while text and task tokens decode causally conditioned on the image. For each detected instance, the model generates a short structured sequence of task tokens: `<|coord|>` then `<|size|>`, producing a center point and bounding box size in normalized coordinates.
 ### Links
+- Full model (with segmentation): [`tiiuae/Falcon-Perception`](https://huggingface.co/tiiuae/Falcon-Perception)
+- Code and inference engine: [`github.com/tiiuae/Falcon-Perception`](https://github.com/tiiuae/Falcon-Perception)
 - Tech report: arXiv link coming soon
 - PBench dataset: `tiiuae/PBench`
+- OCR model: [`tiiuae/Falcon-OCR`](https://huggingface.co/tiiuae/Falcon-OCR)
 ## Quickstart
 ### Installation
 ```bash
+pip install "torch>=2.5" transformers pillow einops
 ```
 This model requires PyTorch 2.5 or newer for FlexAttention. The first call can be slower because `torch.compile` may build optimized kernels.
+### Run open-vocabulary detection
 ```python
 import torch
 from transformers import AutoModelForCausalLM
 model = AutoModelForCausalLM.from_pretrained(
+    "tiiuae/Falcon-Perception-300M",
     trust_remote_code=True,
     device_map={"": "cuda:0"},
 )
     print(p["xy"], p["hw"])
 ```
+Each prediction is a dict with normalized bounding box coordinates:
 ```python
+{
+  "xy": {"x": float, "y": float},  # center in normalized coordinates (0 to 1)
+  "hw": {"h": float, "w": float},  # size in normalized coordinates (0 to 1)
+}
+```
+### Visualize detections
+```python
+from PIL import ImageDraw
+draw = ImageDraw.Draw(image)
+W, H = image.size
 for p in preds:
+    cx, cy = p["xy"]["x"] * W, p["xy"]["y"] * H
+    bw, bh = p["hw"]["w"] * W, p["hw"]["h"] * H
+    x0, y0 = cx - bw / 2, cy - bh / 2
+    x1, y1 = cx + bw / 2, cy + bh / 2
+    draw.rectangle([x0, y0, x1, y1], outline="lime", width=2)
+image.save("output.jpg")
 ```
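For pipelines that need pixel boxes rather than a drawn image, the conversion used in the drawing loop can be factored into a small helper. This is a minimal sketch, not part of the model's API: the name `to_pixel_box` and the clamping to image bounds are assumptions added for illustration. It only relies on the `xy` / `hw` dict layout shown above.

```python
def to_pixel_box(pred, width, height):
    """Convert one prediction dict to a clamped (x0, y0, x1, y1) pixel box.

    `to_pixel_box` is a hypothetical helper, not provided by the model code.
    """
    # Scale the normalized center and size to pixel units.
    cx, cy = pred["xy"]["x"] * width, pred["xy"]["y"] * height
    bw, bh = pred["hw"]["w"] * width, pred["hw"]["h"] * height

    # Center/size -> corners, clamped so boxes never leave the image.
    x0 = min(max(cx - bw / 2, 0.0), float(width))
    y0 = min(max(cy - bh / 2, 0.0), float(height))
    x1 = min(max(cx + bw / 2, 0.0), float(width))
    y1 = min(max(cy + bh / 2, 0.0), float(height))
    return x0, y0, x1, y1


# Hand-written example prediction (not real model output).
example = {"xy": {"x": 0.5, "y": 0.5}, "hw": {"h": 0.2, "w": 0.4}}
box = to_pixel_box(example, width=200, height=100)
print(box)
```

Clamping is a design choice for downstream consumers that reject out-of-bounds boxes; drop it if the raw extent matters.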
 ## API
 |---|---|---|---|
 | `images` | `PIL.Image` or `list` | required | Single image or list of images |
 | `queries` | `str` or `list[str]` | required | Query string(s), one per image |
+| `task` | `str` or `None` | `None` | Task type. When left as `None`, it resolves to `"detection"` on this model, which is the only supported value. |
 | `max_new_tokens` | `int` | `2048` | Maximum decoding steps |
 | `min_dimension` | `int` | `256` | Minimum image side after resize |
 | `max_dimension` | `int` | `1024` | Maximum image side after resize |
 **Returns:** `list[list[dict]]`, one list per image.
+Each detection dict contains:
 ```python
 {
+  "xy": {"x": float, "y": float},  # center in normalized coordinates (0 to 1)
+  "hw": {"h": float, "w": float},  # size in normalized coordinates (0 to 1)
 }
 ```
+> [!NOTE]
+> Requesting `task="segmentation"` on this model will raise a `ValueError`. Use the full [`tiiuae/Falcon-Perception`](https://huggingface.co/tiiuae/Falcon-Perception) model for segmentation masks.
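The task check that produces this error can be exercised in isolation. The sketch below mirrors the logic added to `modeling_falcon_perception.py` in this change, under the assumption that `do_segmentation` is a plain boolean config flag; the standalone function name `resolve_task` is hypothetical and does not exist in the model code.

```python
def resolve_task(task, do_segmentation):
    """Resolve the effective task for a model with or without segmentation heads.

    Hypothetical standalone version of the check in the modeling file.
    """
    # No explicit task: fall back to the richest task the model supports.
    if task is None:
        task = "segmentation" if do_segmentation else "detection"
    # Explicit segmentation request on a detection-only export is an error.
    if task == "segmentation" and not do_segmentation:
        raise ValueError(
            "Task 'segmentation' requires a model with segmentation heads, "
            "but this model was exported with do_segmentation=False. "
            "Use task='detection' instead."
        )
    return task


# A detection-only export such as this 300M variant.
print(resolve_task(None, do_segmentation=False))

try:
    resolve_task("segmentation", do_segmentation=False)
except ValueError as err:
    print("rejected:", err)
```

Callers that share code between the two model variants can leave `task` unset and let the model pick, or catch `ValueError` and retry with `task="detection"`.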
 ## What the model is for
+Falcon Perception 300M is designed for open-vocabulary object detection where the main difficulty is localization under free-form text queries. Use cases include:
 - Natural language driven object selection in images
+- Lightweight bounding-box detection for downstream pipelines
 - Crowded scenes where the number of instances is large and variable
+- Edge or resource-constrained deployments where the full model is too large
 It is not intended as a general-purpose vision-language assistant for open-ended reasoning, long-form generation, or multi-step VQA.
 - One dense Transformer backbone processes image patches and text tokens in a shared space from the first layer
 - Hybrid attention masking: bidirectional among image tokens, causal for text and task tokens conditioned on the image
+- Chain-of-Perception decoding: `<|coord|>` then `<|size|>` per instance
 - Specialized heads for coordinates and size, with geometry conditioning via Fourier features
+## Comparison with the full model
+| | **Falcon-Perception** | **Falcon-Perception-300M** |
+|---|---|---|
+| Parameters | ~7B | ~0.3B |
+| Tasks | Detection + Segmentation | Detection only |
+| Output | Bounding boxes + pixel masks | Bounding boxes |
+| Token sequence | `<\|coord\|>` `<\|size\|>` `<\|seg\|>` | `<\|coord\|>` `<\|size\|>` |
 ## Limitations
+- Presence calibration remains a key limitation for autoregressive dense interfaces. False positives are more likely on hard negatives than in DETR-like detection models.
 - OCR-driven prompts depend on text size and image resolution. Small text and degraded scans are challenging.
 - Dense scenes benefit strongly from high resolution inputs. Low resolution can be sufficient to recognize that a concept is present, but insufficient to localize each instance precisely.
+- This variant does **not** produce segmentation masks. Use the full model if pixel-level masks are needed.
 ## Citation

modeling_falcon_perception.py CHANGED

@@ -620,6 +620,7 @@ class FalconPerceptionForSegmentation(PreTrainedModel):
         self,
         images,
         queries,
         max_new_tokens: int = 2048,
         temperature: float = 0.0,
         top_k: int | None = None,
@@ -630,11 +631,13 @@ class FalconPerceptionForSegmentation(PreTrainedModel):
         segm_threshold: float = 0.5,
     ) -> list[list[dict]]:
         """
-        Segment objects in images matching the given queries.
         Args:
             images: Single PIL Image (or path/URL) or list of them.
             queries: Single query string or list of query strings (one per image).
             max_new_tokens: Maximum generation steps.
             temperature: Sampling temperature (0.0 = greedy).
             top_k: Top-k sampling (None = disabled).
@@ -645,14 +648,25 @@ class FalconPerceptionForSegmentation(PreTrainedModel):
             segm_threshold: Sigmoid threshold for binary mask.
         Returns:
-            List (per image) of lists (per detection) of dicts::
-                {
-                    "xy": {"x": float, "y": float},
-                    "hw": {"h": float, "w": float},
-                    "mask_rle": {"counts": str, "size": [H, W]},
-                }
         """
         self._ensure_device_buffers()
         if compile:
             self.compile_model()
@@ -716,9 +730,12 @@ class FalconPerceptionForSegmentation(PreTrainedModel):
             coord_xy=coord_xy, size_hw=size_hw_t,
         )
-        hr_img_features = self.upsample_img_features(
-            h_BSD, tokens, batch_inputs["pixel_values"], batch_inputs["pixel_mask"],
-        )
         aux_output_B = [[] for _ in range(B)]
         stop_ids = torch.tensor(stop_token_ids).to(device)
@@ -774,13 +791,14 @@ class FalconPerceptionForSegmentation(PreTrainedModel):
             for i, b in enumerate(sample_w_size.tolist()):
                 aux_output_B[b].append(size_preds[i])
-            # Decode segmentation
-            sample_w_segm = torch.where(tokens_B1 == self.config.seg_token_id)[0]
-            segm_tokens = h_BSD[sample_w_segm, -1, :]
-            segm_tokens = self.proj_segm(segm_tokens)
-            segm_masks = torch.einsum("kdhw,kd->khw", hr_img_features[sample_w_segm], segm_tokens)
-            for i, b in enumerate(sample_w_segm):
-                aux_output_B[b].append(segm_masks[i])
             # Next step
             logits_BSV, h_BSD = self.forward(
@@ -791,12 +809,13 @@ class FalconPerceptionForSegmentation(PreTrainedModel):
             hit_stop_B = torch.isin(tokens_B1, stop_ids).any(dim=-1)
             should_stop_B = should_stop_B.logical_or(hit_stop_B)
-        # Post-process: convert aux outputs to structured results with RLE masks
         pixel_mask_batch = batch_inputs["pixel_mask"][:, 0]  # (B, H, W)
         results = []
         for b in range(B):
             dets = self._postprocess_aux(
                 aux_output_B[b], pixel_mask_batch[b], original_sizes[b], segm_threshold,
             )
             results.append(dets)
@@ -875,11 +894,29 @@ class FalconPerceptionForSegmentation(PreTrainedModel):
         orig_hw: tuple[int, int],
         threshold: float,
         nms_iou_threshold: float = 0.6,
     ) -> list[dict]:
-        """Convert raw aux outputs into structured detections with RLE masks."""
         orig_h, orig_w = orig_hw
-        # Find active image region from pixel mask
         nonzero = torch.nonzero(pixel_mask_hw, as_tuple=False)
         if len(nonzero) > 0:
             min_h, min_w = nonzero.min(dim=0)[0]
@@ -890,30 +927,26 @@ class FalconPerceptionForSegmentation(PreTrainedModel):
             min_h = min_w = 0
             act_h = act_w = None
-        # Group into triplets: coord, size, mask — build binary masks first
         candidates = []
-        step = 3  # coord, size, mask
-        for i in range(0, len(aux_list), step):
-            if i + 2 >= len(aux_list):
-                break
-            xy = aux_list[i]
-            hw = aux_list[i + 1]
-            mask_logits = aux_list[i + 2]
-            if not isinstance(mask_logits, torch.Tensor):
-                continue
-            # Crop to active region
-            if act_h is not None and act_w is not None:
-                mask_logits = mask_logits[min_h:min_h + act_h, min_w:min_w + act_w]
-            # Resize to original image size
-            mask_logits = mask_logits.unsqueeze(0).unsqueeze(0).float()
-            mask_logits = F.interpolate(mask_logits, size=(orig_h, orig_w), mode="bilinear", align_corners=False)
-            mask_logits = mask_logits.squeeze(0).squeeze(0)
-            # Threshold
-            binary_mask = (torch.sigmoid(mask_logits) > threshold).bool()
-            candidates.append({"xy": xy, "hw": hw, "binary_mask": binary_mask})
         if not candidates:
             return []

         self,
         images,
         queries,
+        task: str | None = None,
         max_new_tokens: int = 2048,
         temperature: float = 0.0,
         top_k: int | None = None,
         segm_threshold: float = 0.5,
     ) -> list[list[dict]]:
         """
+        Detect (and optionally segment) objects in images matching the given queries.
         Args:
             images: Single PIL Image (or path/URL) or list of them.
             queries: Single query string or list of query strings (one per image).
+            task: ``"segmentation"`` or ``"detection"``. Defaults to ``"segmentation"``
+                when the model supports it, ``"detection"`` otherwise.
             max_new_tokens: Maximum generation steps.
             temperature: Sampling temperature (0.0 = greedy).
             top_k: Top-k sampling (None = disabled).
             segm_threshold: Sigmoid threshold for binary mask.
         Returns:
+            List (per image) of lists (per detection) of dicts.
+            For segmentation::
+                {"xy": {"x": float, "y": float}, "hw": {"h": float, "w": float},
+                 "mask_rle": {"counts": str, "size": [H, W]}}
+            For detection::
+                {"xy": {"x": float, "y": float}, "hw": {"h": float, "w": float}}
         """
+        if task is None:
+            task = "segmentation" if self.config.do_segmentation else "detection"
+        if task == "segmentation" and not self.config.do_segmentation:
+            raise ValueError(
+                "Task 'segmentation' requires a model with segmentation heads, "
+                "but this model was exported with do_segmentation=False. "
+                "Use task='detection' instead."
+            )
+        do_segm = task == "segmentation"
         self._ensure_device_buffers()
         if compile:
             self.compile_model()
             coord_xy=coord_xy, size_hw=size_hw_t,
         )
+        if do_segm:
+            hr_img_features = self.upsample_img_features(
+                h_BSD, tokens, batch_inputs["pixel_values"], batch_inputs["pixel_mask"],
+            )
+        else:
+            hr_img_features = None
         aux_output_B = [[] for _ in range(B)]
         stop_ids = torch.tensor(stop_token_ids).to(device)
             for i, b in enumerate(sample_w_size.tolist()):
                 aux_output_B[b].append(size_preds[i])
+            # Decode segmentation (only when model has segmentation heads)
+            if do_segm:
+                sample_w_segm = torch.where(tokens_B1 == self.config.seg_token_id)[0]
+                segm_tokens = h_BSD[sample_w_segm, -1, :]
+                segm_tokens = self.proj_segm(segm_tokens)
+                segm_masks = torch.einsum("kdhw,kd->khw", hr_img_features[sample_w_segm], segm_tokens)
+                for i, b in enumerate(sample_w_segm):
+                    aux_output_B[b].append(segm_masks[i])
             # Next step
             logits_BSV, h_BSD = self.forward(
             hit_stop_B = torch.isin(tokens_B1, stop_ids).any(dim=-1)
             should_stop_B = should_stop_B.logical_or(hit_stop_B)
+        # Post-process: convert aux outputs to structured results
         pixel_mask_batch = batch_inputs["pixel_mask"][:, 0]  # (B, H, W)
         results = []
         for b in range(B):
             dets = self._postprocess_aux(
                 aux_output_B[b], pixel_mask_batch[b], original_sizes[b], segm_threshold,
+                task=task,
             )
             results.append(dets)
         orig_hw: tuple[int, int],
         threshold: float,
         nms_iou_threshold: float = 0.6,
+        task: str = "segmentation",
     ) -> list[dict]:
+        """Convert raw aux outputs into structured detections.
+        For segmentation, returns dicts with ``xy``, ``hw``, and ``mask_rle``.
+        For detection, returns dicts with ``xy`` and ``hw`` only.
+        """
         orig_h, orig_w = orig_hw
+        if task == "detection":
+            # Detection-only: aux_list is interleaved coord/size dicts
+            detections = []
+            xy = None
+            for item in aux_list:
+                if isinstance(item, dict):
+                    if "x" in item or "y" in item:
+                        xy = item
+                    elif ("h" in item or "w" in item) and xy is not None:
+                        detections.append({"xy": xy, "hw": item})
+                        xy = None
+            return detections
+        # Segmentation: find active image region from pixel mask
         nonzero = torch.nonzero(pixel_mask_hw, as_tuple=False)
         if len(nonzero) > 0:
             min_h, min_w = nonzero.min(dim=0)[0]
             min_h = min_w = 0
             act_h = act_w = None
+        # Group into triplets: coord, size, mask
         candidates = []
+        xy = hw = None
+        for item in aux_list:
+            if isinstance(item, dict):
+                if "x" in item or "y" in item:
+                    xy = item
+                    hw = None
+                elif "h" in item or "w" in item:
+                    hw = item
+            elif isinstance(item, torch.Tensor) and xy is not None and hw is not None:
+                mask_logits = item
+                if act_h is not None and act_w is not None:
+                    mask_logits = mask_logits[min_h:min_h + act_h, min_w:min_w + act_w]
+                mask_logits = mask_logits.unsqueeze(0).unsqueeze(0).float()
+                mask_logits = F.interpolate(mask_logits, size=(orig_h, orig_w), mode="bilinear", align_corners=False)
+                mask_logits = mask_logits.squeeze(0).squeeze(0)
+                binary_mask = (torch.sigmoid(mask_logits) > threshold).bool()
+                candidates.append({"xy": xy, "hw": hw, "binary_mask": binary_mask})
+                xy = hw = None
         if not candidates:
             return []
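
The detection branch of `_postprocess_aux` above pairs each coordinate dict with the size dict that follows it. The standalone sketch below reproduces that pairing on plain Python dicts so its handling of malformed sequences is easy to see. The function name `pair_detections` and the sample `aux` list are assumptions made for illustration, not part of the model code.

```python
def pair_detections(aux_list):
    """Pair interleaved coord/size dicts into detection dicts.

    Hypothetical standalone version of the detection branch shown in the diff.
    """
    detections = []
    xy = None  # most recent coord dict that has not been paired yet
    for item in aux_list:
        if not isinstance(item, dict):
            continue  # ignore anything that is not a coord or size dict
        if "x" in item or "y" in item:
            xy = item  # a newer coord replaces an unpaired older one
        elif ("h" in item or "w" in item) and xy is not None:
            detections.append({"xy": xy, "hw": item})
            xy = None  # each coord is consumed by exactly one size
    return detections


# Hand-written sequence with one orphan size and one orphan coord.
aux = [
    {"x": 0.1, "y": 0.2}, {"h": 0.3, "w": 0.4},  # well-formed pair
    {"h": 0.5, "w": 0.5},                        # size without coord: skipped
    {"x": 0.6, "y": 0.7},                        # coord without size: replaced
    {"x": 0.8, "y": 0.9}, {"h": 0.1, "w": 0.1},  # well-formed pair
]
paired = pair_detections(aux)
print(paired)
```

Unlike the removed fixed-stride loop (`step = 3`), this stateful scan does not assume every instance emitted a complete token group, so a truncated generation drops only the incomplete instance.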