---
pipeline_tag: mask-generation
library_name: transformers
tags:
- falcon
- segmentation
- vision-language
- open-vocabulary
license: apache-2.0
---
<img src="main_fig.jpg" width="480" alt="Falcon Perception"/>
## Falcon Perception
Falcon Perception is a 0.6B parameter early-fusion vision-language model for open-vocabulary grounding and instance segmentation. Given an image and a natural language query, it returns zero, one, or many matching instances with pixel-accurate masks.
The model is built around a simple interface. Image patches and text tokens are processed together in a single Transformer using a hybrid attention mask: image tokens build bidirectional visual context, while text and task tokens decode causally conditioned on the image. For each instance, the model generates a short structured sequence of task tokens in a fixed order, `<|coord|>` then `<|size|>` then `<|seg|>`. The `<|seg|>` token acts as a mask query whose hidden state is projected and dotted with upsampled image features, producing a full-resolution binary mask without autoregressive mask generation.
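To make the mask-decoding step concrete, the sketch below mimics it in isolation. This is a minimal illustration, not the model's actual code: the projection, feature shapes, and variable names are hypothetical stand-ins for whatever the released implementation defines.

```python
import torch

# Hypothetical shapes: d = hidden size, (H, W) = upsampled feature resolution.
d, H, W = 1024, 768, 768
seg_hidden = torch.randn(d)         # hidden state of one <|seg|> token
proj = torch.nn.Linear(d, d)        # stand-in for the mask-query projection
image_feats = torch.randn(d, H, W)  # stand-in for upsampled image features

query = proj(seg_hidden)                                # (d,)
logits = torch.einsum("d,dhw->hw", query, image_feats)  # one dot product per pixel
mask = logits > 0                                       # full-resolution binary mask
```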
### Links
- Code and inference engine: `https://github.com/tiiuae/Falcon-Perception`
- Tech report: `https://arxiv.org/abs/2603.27365`
- PBench dataset: `tiiuae/PBench`
- OCR model: `tiiuae/Falcon-OCR`
## Quickstart
### Installation
```bash
pip install "torch>=2.5" transformers pillow einops pycocotools
```
This model requires PyTorch 2.5 or newer for FlexAttention. The first call can be slower because `torch.compile` may build optimized kernels.
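A quick way to confirm the installed PyTorch meets this requirement:

```python
import torch
print(torch.__version__)  # should be 2.5.0 or newer for FlexAttention
```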
### Run open-vocabulary segmentation
```python
from PIL import Image
from transformers import AutoModelForCausalLM

# trust_remote_code is required: the model ships its own modeling code
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-perception",
    trust_remote_code=True,
    device_map={"": "cuda:0"},
)

image = Image.open("photo.jpg")
preds = model.generate(image, "cat")[0]  # predictions for the first (and only) image
for p in preds:
    print(p["xy"], p["hw"])
```
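Since the model can return zero instances when the queried concept is absent, downstream code should handle empty results; a minimal guard (the query string here is only illustrative):

```python
preds = model.generate(image, "zebra")[0]
if not preds:
    print("no matching instances for this query")
```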
### Decode masks
```python
from pycocotools import mask as mask_utils

for p in preds:
    rle = p["mask_rle"]
    # pycocotools expects bytes for counts, but the model returns a str
    m = {"size": rle["size"], "counts": rle["counts"].encode("utf-8")}
    mask = mask_utils.decode(m).astype(bool)  # H x W boolean array at original resolution
    print(mask.shape, mask.sum())
```
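To sanity-check decoded masks visually, they can be composited over the input image. A minimal sketch using PIL and NumPy; the tint color, alpha, and output filename are arbitrary choices:

```python
import numpy as np
from PIL import Image
from pycocotools import mask as mask_utils

image = Image.open("photo.jpg").convert("RGB")
overlay = np.asarray(image, dtype=np.float32).copy()
for p in preds:
    rle = {"size": p["mask_rle"]["size"],
           "counts": p["mask_rle"]["counts"].encode("utf-8")}
    m = mask_utils.decode(rle).astype(bool)
    overlay[m] = 0.5 * overlay[m] + 0.5 * np.array([255.0, 0.0, 0.0])  # red tint
Image.fromarray(overlay.astype(np.uint8)).save("overlay.png")
```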
## API
### `model.generate(images, queries, **kwargs)`
| Parameter | Type | Default | Description |
|---|---|---|---|
| `images` | `PIL.Image` or `list` | required | Single image or list of images |
| `queries` | `str` or `list[str]` | required | Query string(s), one per image |
| `max_new_tokens` | `int` | `2048` | Maximum decoding steps |
| `min_dimension` | `int` | `256` | Minimum image side after resize |
| `max_dimension` | `int` | `1024` | Maximum image side after resize |
| `compile` | `bool` | `True` | Run `torch.compile` on first call |
**Returns:** `list[list[dict]]`, one list per image.
Each prediction dict contains:
```python
{
    "xy": {"x": float, "y": float},               # center in normalized coordinates (0 to 1)
    "hw": {"h": float, "w": float},               # size in normalized coordinates (0 to 1)
    "mask_rle": {"counts": str, "size": [H, W]},  # COCO RLE at original resolution
}
```
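Because `xy` is the instance center and `hw` the box size, both normalized to the (0, 1) range, converting to pixel-space corners follows directly from those definitions. A sketch (variable names illustrative):

```python
img_w, img_h = image.size  # PIL reports (width, height)
for p in preds:
    cx, cy = p["xy"]["x"], p["xy"]["y"]
    h, w = p["hw"]["h"], p["hw"]["w"]
    x0, y0 = (cx - w / 2) * img_w, (cy - h / 2) * img_h
    x1, y1 = (cx + w / 2) * img_w, (cy + h / 2) * img_h
    print(f"box: ({x0:.1f}, {y0:.1f}) -> ({x1:.1f}, {y1:.1f})")
```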
## What the model is for
Falcon Perception is designed for dense grounding settings where the main difficulty is localizing instances under an open vocabulary. That includes:
- Natural-language-driven object selection in images
- Promptable instance segmentation for downstream pipelines
- Crowded scenes where the number of instances is large and variable
It is not intended as a general-purpose vision-language assistant for open-ended reasoning, long-form generation, or multi-step VQA.
## Model details (high level)
The architecture follows a single-stack early-fusion recipe:
- One dense Transformer backbone processes image patches and text tokens in a shared space from the first layer
- Hybrid attention masking: bidirectional among image tokens, causal for text and task tokens conditioned on the image (see the sketch after this list)
- Chain-of-Perception decoding: `<|coord|>` then `<|size|>` then `<|seg|>` per instance
- Specialized heads for coordinates and size, with geometry conditioning via Fourier features
- Parallel mask decoding: each `<|seg|>` token becomes a mask query and produces a full-resolution mask via dot product with upsampled image features
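
As a concrete picture of the hybrid attention mask, the sketch below builds the boolean attention pattern for a toy sequence of image tokens followed by text/task tokens. The layout (image tokens first) and the sizes are assumptions for illustration only:

```python
import torch

n_img, n_txt = 4, 3  # toy sizes; real sequences are much longer
n = n_img + n_txt
allowed = torch.zeros(n, n, dtype=torch.bool)
allowed[:n_img, :n_img] = True         # image tokens: full bidirectional attention
allowed[n_img:, :n_img] = True         # text/task tokens see every image token
allowed[n_img:, n_img:] = torch.tril(  # text/task tokens: causal among themselves
    torch.ones(n_txt, n_txt, dtype=torch.bool)
)
print(allowed.int())
```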
## Evaluation summary
From the technical report:
- SA-Co (open-vocabulary segmentation): 68.0 Macro F1 compared to 62.3 for SAM 3, with the main remaining gap being presence calibration (Average MCC 0.64 compared to 0.82 for SAM 3)
- PBench: a diagnostic benchmark that breaks down performance by capability (attributes, OCR-guided disambiguation, spatial constraints, relations) and includes a dense long-context crowded split
Full tables, setup details, and ablations are in the report.
## Limitations
- Presence calibration remains a key limitation of autoregressive dense interfaces: false positives on hard negatives are more likely than with DETR-like segmentation models.
- OCR-driven prompts depend on text size and image resolution. Small text and degraded scans are challenging.
- Dense scenes benefit strongly from high resolution inputs. Low resolution can be sufficient to recognize that a concept is present, but insufficient to localize each instance precisely.
## Citation
If you use Falcon Perception, please cite:
```bibtex
@article{bevli2026falcon,
  title   = {Falcon Perception},
  author  = {Bevli, Aviraj and Chaybouti, Sofian and Dahou, Yasser and Hacid, Hakim and Huynh, Ngoc Dung and Le Khac, Phuc H. and Narayan, Sanath and Para, Wamiq Reyaz and Singh, Ankit},
  journal = {arXiv preprint arXiv:2603.27365},
  year    = {2026},
  url     = {https://arxiv.org/abs/2603.27365}
}
``` |