---
pipeline_tag: object-detection
library_name: transformers
tags:
- falcon
- detection
- vision-language
- open-vocabulary
license: apache-2.0
---

<img src="main_fig.jpg" width="480" alt="Falcon Perception"/>

> [!NOTE]
> This is the **300M parameter** variant of Falcon Perception. It supports **detection only** (bounding boxes). For the full model with segmentation masks, see [`tiiuae/Falcon-Perception`](https://huggingface.co/tiiuae/Falcon-Perception).

## Falcon Perception 300M

Falcon Perception 300M is a 0.3B-parameter early-fusion vision-language model for open-vocabulary detection and grounding. Given an image and a natural-language query, it returns zero, one, or many matching instances as bounding boxes.

The model is built around a simple interface. Image patches and text tokens are processed together in a single Transformer using a hybrid attention mask: image tokens build bidirectional visual context, while text and task tokens decode causally conditioned on the image. For each detected instance, the model generates a short structured sequence of task tokens: `<|coord|>` then `<|size|>`, producing a center point and bounding box size in normalized coordinates.
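
As a rough illustration of the hybrid attention mask described above, the sketch below builds a boolean mask in plain Python. The function name `hybrid_attention_mask` and the token layout (image tokens first, then text and task tokens) are assumptions made for the example, not the model's actual implementation.

```python
def hybrid_attention_mask(is_image):
    """Return mask[i][j] == True when token i may attend to token j.

    is_image: list of bools, one per token in sequence order
    (hypothetical layout; the real tokenization is not shown here).
    """
    n = len(is_image)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            causal = j <= i                           # standard causal rule
            both_image = is_image[i] and is_image[j]  # bidirectional among image tokens
            mask[i][j] = causal or both_image
    return mask


# Assumed layout: 3 image tokens followed by 2 text/task tokens.
mask = hybrid_attention_mask([True, True, True, False, False])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Image rows see every image token regardless of position, while text and task rows see the full image plus only the text and task tokens that precede them.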


### Links

- Full model (with segmentation): [`tiiuae/Falcon-Perception`](https://huggingface.co/tiiuae/Falcon-Perception)
- Code and inference engine: [`github.com/tiiuae/Falcon-Perception`](https://github.com/tiiuae/Falcon-Perception)
- Tech report: [arXiv:2603.27365](https://arxiv.org/abs/2603.27365)
- PBench dataset: `tiiuae/PBench`
- OCR model: [`tiiuae/Falcon-OCR`](https://huggingface.co/tiiuae/Falcon-OCR)

## Quickstart

### Installation

```bash
pip install "torch>=2.5" transformers pillow einops
```

This model requires PyTorch 2.5 or newer for FlexAttention. The first call can be slower because `torch.compile` builds optimized kernels when `compile=True` (the default).

### Run open-vocabulary detection

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/Falcon-Perception-300M",
    trust_remote_code=True,
    device_map={"": "cuda:0"},
)

image = Image.open("photo.jpg").convert("RGB")  # ensure a 3-channel RGB image
preds = model.generate(image, "cat")[0]  # detections for the first (only) image

for p in preds:
    print(p["xy"], p["hw"])
```

Each prediction is a dict with normalized bounding box coordinates:

```python
{
  "xy": {"x": float, "y": float},  # center in normalized coordinates (0 to 1)
  "hw": {"h": float, "w": float},  # size in normalized coordinates (0 to 1)
}
```

### Visualize detections

```python
from PIL import ImageDraw

draw = ImageDraw.Draw(image)
W, H = image.size

for p in preds:
    cx, cy = p["xy"]["x"] * W, p["xy"]["y"] * H
    bw, bh = p["hw"]["w"] * W, p["hw"]["h"] * H
    x0, y0 = cx - bw / 2, cy - bh / 2
    x1, y1 = cx + bw / 2, cy + bh / 2
    draw.rectangle([x0, y0, x1, y1], outline="lime", width=2)

image.save("output.jpg")
```

## API

### `model.generate(images, queries, **kwargs)`

| Parameter | Type | Default | Description |
|---|---|---|---|
| `images` | `PIL.Image` or `list` | required | Single image or list of images |
| `queries` | `str` or `list[str]` | required | Query string(s), one per image |
| `task` | `str` | `"detection"` | Task type. Only `"detection"` is supported by this model. |
| `max_new_tokens` | `int` | `2048` | Maximum decoding steps |
| `min_dimension` | `int` | `256` | Minimum image side after resize |
| `max_dimension` | `int` | `1024` | Maximum image side after resize |
| `compile` | `bool` | `True` | Run `torch.compile` on first call |

**Returns:** `list[list[dict]]`, one list per image.

Each detection dict contains:

```python
{
  "xy": {"x": float, "y": float},  # center in normalized coordinates (0 to 1)
  "hw": {"h": float, "w": float},  # size in normalized coordinates (0 to 1)
}
```
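
Many downstream tools expect pixel-space corner boxes rather than a normalized center and size. The helper below is a minimal sketch of that conversion; `to_pixel_xyxy` is a hypothetical name and is not part of the model's API, and clamping to the image bounds is a choice made for this example.

```python
def to_pixel_xyxy(pred, width, height):
    """Convert one detection dict to (x0, y0, x1, y1) in pixels."""
    cx = pred["xy"]["x"] * width   # box center, pixels
    cy = pred["xy"]["y"] * height
    bw = pred["hw"]["w"] * width   # box size, pixels
    bh = pred["hw"]["h"] * height
    # Clamp the corners so the box never leaves the image.
    x0 = max(0.0, cx - bw / 2)
    y0 = max(0.0, cy - bh / 2)
    x1 = min(float(width), cx + bw / 2)
    y1 = min(float(height), cy + bh / 2)
    return x0, y0, x1, y1


# Hand-written example detection, not real model output.
example = {"xy": {"x": 0.5, "y": 0.5}, "hw": {"h": 0.5, "w": 0.25}}
box = to_pixel_xyxy(example, width=640, height=480)
print(box)  # (240.0, 120.0, 400.0, 360.0)
```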

> [!NOTE]
> Requesting `task="segmentation"` on this model will raise a `ValueError`. Use the full [`tiiuae/Falcon-Perception`](https://huggingface.co/tiiuae/Falcon-Perception) model for segmentation masks.
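
To get a feel for how `min_dimension` and `max_dimension` might interact, the sketch below computes an aspect-preserving target size. This is an assumption about the resize policy, written for illustration only: `target_size` is a hypothetical helper, and the model's actual preprocessing may differ (for example, by rounding to a multiple of the patch size).

```python
def target_size(width, height, min_dimension=256, max_dimension=1024):
    """Return an aspect-preserving (width, height) within the given side limits."""
    scale = 1.0
    if min(width, height) < min_dimension:
        scale = min_dimension / min(width, height)  # upscale small images
    if max(width, height) * scale > max_dimension:
        scale = max_dimension / max(width, height)  # upper bound wins on conflict
    return max(1, round(width * scale)), max(1, round(height * scale))


print(target_size(4000, 3000))  # large image is scaled down
print(target_size(640, 480))    # already within limits, unchanged
print(target_size(200, 100))    # small image is scaled up
```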

## What the model is for

Falcon Perception 300M is designed for open-vocabulary object detection where the main difficulty is localization under free-form text queries. Use cases include:

- Natural-language-driven object selection in images
- Lightweight bounding-box detection for downstream pipelines
- Crowded scenes where the number of instances is large and variable
- Edge or resource-constrained deployments where the full model is too large

It is not intended as a general-purpose vision-language assistant for open-ended reasoning, long-form generation, or multi-step VQA.

## Model details (high level)

The architecture follows a single-stack early-fusion recipe:

- One dense Transformer backbone processes image patches and text tokens in a shared space from the first layer
- Hybrid attention masking: bidirectional among image tokens, causal for text and task tokens conditioned on the image
- Chain-of-Perception decoding: `<|coord|>` then `<|size|>` per instance
- Specialized heads for coordinates and size, with geometry conditioning via Fourier features
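
The Fourier-feature conditioning mentioned in the last item can be pictured with a generic sinusoidal encoding of a normalized coordinate. The sketch below is a textbook formulation, not the model's code: the function name `fourier_features`, the number of bands, and the power-of-two frequency schedule are all assumptions.

```python
import math


def fourier_features(value, num_bands=4):
    """Encode a scalar in [0, 1] as sin/cos pairs at increasing frequencies."""
    feats = []
    for k in range(num_bands):
        angle = (2 ** k) * math.pi * value  # assumed frequency schedule: 2^k * pi
        feats.append(math.sin(angle))
        feats.append(math.cos(angle))
    return feats


# Encode a normalized x-coordinate of 0.25 into 8 features.
feats = fourier_features(0.25, num_bands=4)
print(len(feats), [round(f, 3) for f in feats])
```

Encodings like this let a network distinguish nearby coordinate values more easily than a single raw scalar would.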

## Comparison with the full model

| | **Falcon-Perception** | **Falcon-Perception-300M** |
|---|---|---|
| Parameters | ~7B | ~0.3B |
| Tasks | Detection + Segmentation | Detection only |
| Output | Bounding boxes + pixel masks | Bounding boxes |
| Token sequence | `<\|coord\|>` `<\|size\|>` `<\|seg\|>` | `<\|coord\|>` `<\|size\|>` |

## Limitations

- Presence calibration remains a key limitation for autoregressive dense interfaces. False positives on hard negatives are more likely than with DETR-like detection models.
- OCR-driven prompts depend on text size and image resolution. Small text and degraded scans are challenging.
- Dense scenes benefit strongly from high-resolution inputs. Low resolution can be enough to recognize that a concept is present, but not to localize each instance precisely.
- This variant does **not** produce segmentation masks. Use the full model if pixel-level masks are needed.

## Citation

If you use Falcon Perception, please cite:

```bibtex
@article{bevli2026falcon,
  title   = {Falcon Perception},
  author  = {Bevli, Aviraj and Chaybouti, Sofian and Dahou, Yasser and Hacid, Hakim and Huynh, Ngoc Dung and Le Khac, Phuc H. and Narayan, Sanath and Para, Wamiq Reyaz and Singh, Ankit},
  journal = {arXiv preprint arXiv:2603.27365},
  year    = {2026},
  url     = {https://arxiv.org/abs/2603.27365}
}
```