yasserDahou committed on
Commit eb1d63a · verified · 1 Parent(s): 1f8827e

Update README.md

Files changed (1): README.md +87 -59

README.md CHANGED
@@ -1,3 +1,5 @@
 
 
1
  ---
2
  license: apache-2.0
3
  pipeline_tag: mask-generation
@@ -9,112 +11,138 @@ tags:
9
  - open-vocabulary
10
  ---
11
 
12
- <img src="main_fig.jpg" width="480" alt="Falcon Perception"/>
13
 
14
 
15
- Falcon Perception is a **dense early-fusion vision-language model** for **open-vocabulary segmentation**. Given an image and a natural-language query, it segments **all matching objects** and returns pixel-accurate masks.
16
 
17
- The model **jointly processes image patches and text tokens** in a single transformer, then autoregressively predicts **`<|coord|>`**, **`<|size|>`**, and **`<|seg|>`** tokens for each detected object. Each `<|seg|>` token acts as a **mask query**: its hidden state is projected and dot-producted against upsampled image features to produce a binary mask (i.e. no autoregressive polygon generation needed).
18
 
19
- ## Installation
20
 
21
  ```bash
22
- pip install transformers torch einops pycocotools
23
  ```
24
 
25
- Requires **PyTorch 2.5+** (FlexAttention).
26
 
27
- ## Quick Start
28
 
29
  ```python
30
  import torch
31
- from transformers import AutoModelForCausalLM
32
  from PIL import Image
 
33
 
34
  model = AutoModelForCausalLM.from_pretrained(
      "tiiuae/falcon-perception",
      trust_remote_code=True,
-     dtype=torch.bfloat16,
-     device_map="cuda",
  )
40
 
41
  image = Image.open("photo.jpg")
42
- results = model.generate(image, "cat")
43
 
44
- for pred in results[0]:
-     print(pred["xy"])        # {"x": 0.35, "y": 0.42}
-     print(pred["hw"])        # {"h": 0.15, "w": 0.12}
-     print(pred["mask_rle"])  # {"counts": "...", "size": [H, W]}
48
  ```
49
 
50
- > The first `generate()` call is slower (~15-20 s) because `torch.compile` builds optimized kernels. Subsequent calls run in ~1-2 s.
51
 
 
52
 
53
  ### `model.generate(images, queries, **kwargs)`
54
 
55
  | Parameter | Type | Default | Description |
56
  |---|---|---|---|
57
- | `images` | `PIL.Image` or `list` | required | Single image or list of images (PIL, path, or URL) |
58
  | `queries` | `str` or `list[str]` | required | Query string(s), one per image |
59
- | `max_new_tokens` | `int` | `2048` | Maximum generation steps |
60
  | `min_dimension` | `int` | `256` | Minimum image side after resize |
61
  | `max_dimension` | `int` | `1024` | Maximum image side after resize |
62
- | `compile` | `bool` | `True` | Auto torch.compile on first call |
63
- | `segm_threshold` | `float` | `0.5` | Sigmoid threshold for binary masks |
 
64
 
65
- **Returns:** `list[list[dict]]` — one list per image, each containing detection dicts:
66
 
67
  ```python
68
  {
69
- "xy": {"x": float, "y": float}, # center (normalized 0-1)
70
- "hw": {"h": float, "w": float}, # size (normalized 0-1)
71
- "mask_rle": {"counts": str, "size": [H, W]}, # COCO RLE at original resolution
72
  }
73
  ```
74
 
 
75
 
76
- ## Visualizing Masks
77
 
78
- ```python
79
- import numpy as np
80
- from pycocotools import mask as mask_utils
81
- from PIL import Image, ImageDraw
82
-
83
- def overlay_masks(image, detections, alpha=0.55):
-     """Overlay RLE masks on an image with colored fills and black borders."""
-     overlay = image.convert("RGBA").copy()
-     colors = [
-         (255, 60, 60), (60, 220, 60), (50, 120, 255),
-         (255, 200, 40), (220, 60, 220), (60, 220, 220),
-     ]
-     for i, det in enumerate(detections):
-         m = mask_utils.decode(det["mask_rle"]).astype(bool)
-         r, g, b = colors[i % len(colors)]
-         fill = np.zeros((*m.shape, 4), dtype=np.uint8)
-         fill[m] = [r, g, b, int(255 * alpha)]
-         overlay = Image.alpha_composite(overlay, Image.fromarray(fill))
-         # black border around mask
-         border = np.zeros((*m.shape, 4), dtype=np.uint8)
-         ky = m[1:, :] != m[:-1, :]
-         kx = m[:, 1:] != m[:, :-1]
-         edge = np.zeros_like(m)
-         edge[1:, :] |= ky; edge[:-1, :] |= ky
-         edge[:, 1:] |= kx; edge[:, :-1] |= kx
-         border[edge] = [0, 0, 0, 200]
-         overlay = Image.alpha_composite(overlay, Image.fromarray(border))
-     return overlay
106
 
107
- image = Image.open("photo.jpg")
108
- results = model.generate(image, "cat")
109
- overlay_masks(image, results[0]).save("output.png")
110
- ```
111
 
112
- ## Performance
113
 
114
- ### PBench
118
 
119
  ## Citation
120
 
1
+ <img src="main_fig.jpg" width="480" alt="Falcon Perception"/>
2
+
3
  ---
4
  license: apache-2.0
5
  pipeline_tag: mask-generation
 
11
  - open-vocabulary
12
  ---
13
 
 
14
 
15
 
16
+ ## Falcon Perception
17
+
18
+ Falcon Perception is a 0.6B-parameter early-fusion vision-language model for open-vocabulary grounding and instance segmentation. Given an image and a natural-language query, it returns zero, one, or many matching instances with pixel-accurate masks.
19
+
20
+ The model is built around a simple interface. Image patches and text tokens are processed together in a single Transformer using a hybrid attention mask: image tokens build bidirectional visual context, while text and task tokens decode causally conditioned on the image. For each instance, the model generates a short structured sequence of task tokens in a fixed order, `<|coord|>` then `<|size|>` then `<|seg|>`. The `<|seg|>` token acts as a mask query whose hidden state is projected and dotted with upsampled image features, producing a full-resolution binary mask without autoregressive mask generation.
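The `<|seg|>`-as-mask-query step can be sketched in a few lines. This is an illustrative toy (the shapes, the random projection, and the threshold are made up for the sketch, not the model's actual weights or layer names):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy shapes: D = hidden size, (H, W) = upsampled image-feature grid.
D, H, W = 256, 64, 64

seg_hidden = rng.standard_normal(D)           # hidden state of one <|seg|> token
image_feats = rng.standard_normal((D, H, W))  # upsampled image features

proj = rng.standard_normal((D, D))            # stand-in for the learned projection
query = proj @ seg_hidden                     # project the <|seg|> hidden state

# Dot product of the query against every spatial feature gives mask logits;
# thresholding them yields the binary mask, with no autoregressive decoding.
logits = np.einsum("d,dhw->hw", query, image_feats)
mask = logits > 0  # sigmoid(logits) > 0.5 is equivalent to logits > 0

print(mask.shape)  # (64, 64)
```

Because the mask is read off in one dot product per `<|seg|>` token, decoding cost does not grow with mask resolution the way polygon-token generation would.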
21
+
22
+
23
+ ### Links
24
+
25
+ - Code and inference engine: `https://github.com/tiiuae/Falcon-Perception`
26
+ - Tech report: arXiv link coming soon
27
+ - PBench dataset: `tiiuae/PBench`
28
+ - OCR model: `tiiuae/Falcon-OCR`
29
 
30
+ ## Quickstart
31
 
32
+ ### Installation
33
 
34
  ```bash
35
+ pip install "torch>=2.5" transformers pillow einops pycocotools
36
  ```
37
 
38
+ This model requires PyTorch 2.5 or newer for FlexAttention. The first `generate()` call is slower (roughly 15-20 s) while `torch.compile` builds optimized kernels; subsequent calls run in about 1-2 s.
39
 
40
+ ### Run open-vocabulary segmentation
41
 
42
  ```python
43
  import torch
 
44
  from PIL import Image
45
+ from transformers import AutoModelForCausalLM
46
 
47
  model = AutoModelForCausalLM.from_pretrained(
      "tiiuae/falcon-perception",
      trust_remote_code=True,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
  )
53
 
54
  image = Image.open("photo.jpg")
55
+ preds = model.generate(image, "cat")[0]
56
 
57
+ for p in preds:
58
+     print(p["xy"], p["hw"])
 
 
59
  ```
60
 
61
+ ### Decode masks
62
+
63
+ ```python
64
+ import numpy as np
65
+ from pycocotools import mask as mask_utils
66
+
67
+ for p in preds:
+     rle = p["mask_rle"]
+     # pycocotools expects bytes for counts
+     m = {"size": rle["size"], "counts": rle["counts"].encode("utf-8")}
+     mask = mask_utils.decode(m).astype(bool)  # H x W
+     print(mask.shape, mask.sum())
73
+ ```
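If a downstream pipeline needs pixel-space boxes rather than masks, one simple option (an illustrative helper, not part of the model API) is the tight bounding box of each decoded mask:

```python
import numpy as np

def mask_to_box(mask: np.ndarray):
    """Return (x0, y0, x1, y1) of the tight box around a boolean H x W mask,
    or None if the mask is empty."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

demo = np.zeros((8, 8), dtype=bool)
demo[2:5, 3:7] = True
print(mask_to_box(demo))  # (3, 2, 6, 4)
```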
74
 
75
+ ## API
76
 
77
  ### `model.generate(images, queries, **kwargs)`
78
 
79
  | Parameter | Type | Default | Description |
80
  |---|---|---|---|
81
+ | `images` | `PIL.Image` or `list` | required | Single image or list of images |
82
  | `queries` | `str` or `list[str]` | required | Query string(s), one per image |
83
+ | `max_new_tokens` | `int` | `2048` | Maximum decoding steps |
84
  | `min_dimension` | `int` | `256` | Minimum image side after resize |
85
  | `max_dimension` | `int` | `1024` | Maximum image side after resize |
86
+ | `compile` | `bool` | `True` | Run `torch.compile` on first call |
87
+
88
+ **Returns:** `list[list[dict]]`, one list per image.
89
 
90
+ Each prediction dict contains:
91
 
92
  ```python
93
  {
94
+     "xy": {"x": float, "y": float},               # center in normalized coordinates (0 to 1)
+     "hw": {"h": float, "w": float},               # size in normalized coordinates (0 to 1)
+     "mask_rle": {"counts": str, "size": [H, W]},  # COCO RLE at original resolution
97
  }
98
  ```
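Because `xy` and `hw` are normalized by image size, converting a prediction to an absolute pixel box is straightforward. A minimal helper (illustrative, not part of the model API):

```python
def to_pixel_box(pred, width, height):
    """Convert normalized center/size to an absolute (x0, y0, x1, y1) box."""
    cx, cy = pred["xy"]["x"] * width, pred["xy"]["y"] * height
    w, h = pred["hw"]["w"] * width, pred["hw"]["h"] * height
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

# A centered box covering half the width and a quarter of the height.
pred = {"xy": {"x": 0.5, "y": 0.5}, "hw": {"h": 0.25, "w": 0.5}}
print(to_pixel_box(pred, 640, 480))  # (160.0, 180.0, 480.0, 300.0)
```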
99
 
100
+ ## What the model is for
101
 
102
+ Falcon Perception is designed for dense grounding settings where the main difficulty is localizing open-vocabulary concepts. That includes:
103
 
104
+ - Natural language driven object selection in images
105
+ - Promptable instance segmentation for downstream pipelines
106
+ - Crowded scenes where the number of instances is large and variable
107
 
108
+ It is not intended as a general-purpose vision-language assistant for open-ended reasoning, long-form generation, or multi-step VQA.
109
+
110
+ ## Model details (high level)
 
111
 
112
+ The architecture follows a single-stack early-fusion recipe:
113
 
114
+ - One dense Transformer backbone processes image patches and text tokens in a shared space from the first layer
115
+ - Hybrid attention masking: bidirectional among image tokens, causal for text and task tokens conditioned on the image
116
+ - Chain-of-Perception decoding: `<|coord|>` then `<|size|>` then `<|seg|>` per instance
117
+ - Specialized heads for coordinates and size, with geometry conditioning via Fourier features
118
+ - Parallel mask decoding: each `<|seg|>` token becomes a mask query and produces a full-resolution mask via dot product with upsampled image features
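The hybrid attention mask in the first two bullets can be visualized with a toy boolean matrix (token counts and layout here are made up for illustration): image tokens attend to each other bidirectionally, while text and task tokens attend causally to earlier text plus all image tokens.

```python
import numpy as np

n_img, n_txt = 4, 3          # toy token counts: image tokens first, then text
n = n_img + n_txt
allowed = np.zeros((n, n), dtype=bool)

# Image tokens: full bidirectional attention among themselves.
allowed[:n_img, :n_img] = True

# Text/task tokens: causal attention, conditioned on all image tokens.
for i in range(n_img, n):
    allowed[i, :n_img] = True          # see every image token
    allowed[i, n_img:i + 1] = True     # plus previous text tokens and itself

print(allowed.astype(int))
```

A mask of this shape is what FlexAttention-style APIs consume as a per-pair "is attention allowed" predicate.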
119
 
120
+ ## Evaluation summary
121
 
122
+ From the technical report:
123
 
124
+ - SA-Co (open-vocabulary segmentation): 68.0 Macro F1 compared to 62.3 for SAM 3, with the main remaining gap being presence calibration (Average MCC 0.64 compared to 0.82 for SAM 3)
125
+ - PBench: a diagnostic benchmark that breaks down performance by capability (attributes, OCR-guided disambiguation, spatial constraints, relations) and includes a dense long-context crowded split
126
+
127
+ Full tables, setup details, and ablations are in the report.
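For reference, the Matthews correlation coefficient (MCC) used above for presence calibration is computed from a presence/absence confusion matrix. A minimal sketch:

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient for binary presence prediction."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Perfect presence predictions give MCC = 1.0; chance-level gives 0.0.
print(mcc(tp=50, fp=0, tn=50, fn=0))   # 1.0
print(mcc(tp=25, fp=25, tn=25, fn=25)) # 0.0
```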
128
+
129
+ ## Limitations
130
+
131
+ - Presence calibration remains a key limitation for autoregressive dense interfaces. False positives are more likely on hard negatives than in DETR-like segmentation models.
132
+ - OCR-driven prompts depend on text size and image resolution. Small text and degraded scans are challenging.
133
+ - Dense scenes benefit strongly from high resolution inputs. Low resolution can be sufficient to recognize that a concept is present, but insufficient to localize each instance precisely.
134
 
135
  ## Citation
136
 
137
+ If you use Falcon Perception, please cite:
138
+
139
+ ```bibtex
140
+ @misc{falconperception2026,
+     title        = {Falcon Perception},
+     author       = {TII Falcon Vision Team},
+     year         = {2026},
+     howpublished = {arXiv preprint, link forthcoming},
+     note         = {Code: https://github.com/tiiuae/Falcon-Perception},
+ }
147
+ ```
148
+