	---
	pipeline_tag: object-detection
	library_name: transformers
	tags:
	- falcon
	- detection
	- vision-language
	- open-vocabulary
	license: apache-2.0
	---

	<img src="main_fig.jpg" width="480" alt="Falcon Perception"/>

	> [!NOTE]
	> This is the 300M-parameter variant of Falcon Perception. It supports detection only (bounding boxes). For the full model with segmentation masks, see [`tiiuae/Falcon-Perception`](https://huggingface.co/tiiuae/Falcon-Perception).

	## Falcon Perception 300M

	Falcon Perception 300M is a 0.3B-parameter early-fusion vision-language model for open-vocabulary grounding detection. Given an image and a natural-language query, it returns zero, one, or many matching instances with accurate bounding boxes.

	The model is built around a simple interface. Image patches and text tokens are processed together in a single Transformer using a hybrid attention mask: image tokens attend to one another bidirectionally to build visual context, while text and task tokens are decoded causally, conditioned on the image. For each detected instance, the model generates a short structured sequence of task tokens, `<|coord|>` then `<|size|>`, which yields a center point and a bounding-box size in normalized coordinates.
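
The hybrid mask described above can be illustrated with a small, dependency-free sketch. It is not the model's implementation: the function name `hybrid_attention_mask` and the token layout (all image tokens first, then text and task tokens) are assumptions made for illustration only.

```python
def hybrid_attention_mask(num_image_tokens: int, num_text_tokens: int) -> list[list[bool]]:
    """Return mask[q][k] == True when query position q may attend to key position k.

    Assumed layout (illustrative only): image tokens first, then text/task tokens.
    """
    total = num_image_tokens + num_text_tokens
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if q < num_image_tokens:
                # Image queries attend to every image token (bidirectional), never to text.
                mask[q][k] = k < num_image_tokens
            else:
                # Text/task queries attend to the whole image plus earlier text (causal).
                mask[q][k] = k <= q
    return mask


# Print the mask for 3 image tokens followed by 2 text tokens (1 = may attend).
mask = hybrid_attention_mask(num_image_tokens=3, num_text_tokens=2)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

In this sketch the image rows form a fully connected block, while each text row sees the full image block and only the text positions up to and including itself.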


	### Links

	- Full model (with segmentation): [`tiiuae/Falcon-Perception`](https://huggingface.co/tiiuae/Falcon-Perception)
	- Code and inference engine: [`github.com/tiiuae/Falcon-Perception`](https://github.com/tiiuae/Falcon-Perception)
	- Tech report: [arXiv:2603.27365](https://arxiv.org/abs/2603.27365)
	- PBench dataset: `tiiuae/PBench`
	- OCR model: [`tiiuae/Falcon-OCR`](https://huggingface.co/tiiuae/Falcon-OCR)

	## Quickstart

	### Installation

	```bash
	pip install "torch>=2.5" transformers pillow einops
	```

	This model requires PyTorch 2.5 or newer for FlexAttention. The first call can be slower because `torch.compile` may build optimized kernels.
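
To fail early with a clear message rather than at model load, the installed PyTorch version can be checked first. This is a minimal sketch using only the standard library; the helper names `parse_major_minor` and `torch_meets_minimum` are hypothetical, and the check compares only the major and minor version numbers.

```python
import re
from importlib import metadata


def parse_major_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from a version string such as '2.5.1+cu121'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def torch_meets_minimum(minimum: tuple[int, int] = (2, 5)) -> bool:
    """Return True if the installed torch distribution is at least `minimum`."""
    try:
        installed = metadata.version("torch")
    except metadata.PackageNotFoundError:
        # torch is not installed at all.
        return False
    return parse_major_minor(installed) >= minimum


if not torch_meets_minimum():
    print("PyTorch 2.5 or newer is required for this model.")
```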

	### Run open-vocabulary detection

	```python
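	import torch
	from PIL import Image
	from transformers import AutoModelForCausalLM

	# Load the checkpoint; trust_remote_code is needed for the custom model code.
	model = AutoModelForCausalLM.from_pretrained(
	    "tiiuae/Falcon-Perception-300M",
	    trust_remote_code=True,
	    device_map={"": "cuda:0"},
	)

	image = Image.open("photo.jpg")
	# generate() returns one list of detections per image; take the first.
	preds = model.generate(image, "cat")[0]

	# Print the normalized center and size of each detection.
	for p in preds:
	    print(p["xy"], p["hw"])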
	```

	Each prediction is a dict with normalized bounding box coordinates:

	```python
	{
	    "xy": {"x": float, "y": float},  # center in normalized coordinates (0 to 1)
	    "hw": {"h": float, "w": float},  # size in normalized coordinates (0 to 1)
	}
	```

	### Visualize detections

	```python
	from PIL import ImageDraw

	draw = ImageDraw.Draw(image)
	W, H = image.size

	for p in preds:
	    # Scale the normalized center and size to pixels.
	    cx, cy = p["xy"]["x"] * W, p["xy"]["y"] * H
	    bw, bh = p["hw"]["w"] * W, p["hw"]["h"] * H
	    # Convert center/size to corner coordinates.
	    x0, y0 = cx - bw / 2, cy - bh / 2
	    x1, y1 = cx + bw / 2, cy + bh / 2
	    draw.rectangle([x0, y0, x1, y1], outline="lime", width=2)

	image.save("output.jpg")
	```

	## API

	### `model.generate(images, queries, **kwargs)`

	| Parameter | Type | Default | Description |
	|---|---|---|---|
	| `images` | `PIL.Image` or `list` | required | Single image or list of images |
	| `queries` | `str` or `list[str]` | required | Query string(s), one per image |
	| `task` | `str` | `"detection"` | Task type. Only `"detection"` is supported by this model. |
	| `max_new_tokens` | `int` | `2048` | Maximum decoding steps |
	| `min_dimension` | `int` | `256` | Minimum image side after resize |
	| `max_dimension` | `int` | `1024` | Maximum image side after resize |
	| `compile` | `bool` | `True` | Run `torch.compile` on first call |

	Returns: `list[list[dict]]`, one list per image.

	Each detection dict contains:

	```python
	{
	    "xy": {"x": float, "y": float},  # center in normalized coordinates (0 to 1)
	    "hw": {"h": float, "w": float},  # size in normalized coordinates (0 to 1)
	}
	```
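
Downstream code often needs pixel-space corner boxes rather than normalized center and size values. The following is a minimal sketch of that conversion for the batched return value. The helper names `to_pixel_xyxy` and `batch_to_pixel_xyxy` are hypothetical and not part of the model's API, and clamping to the image bounds is an assumption about what a downstream consumer wants.

```python
def to_pixel_xyxy(pred: dict, width: int, height: int) -> tuple[float, float, float, float]:
    """Convert one detection dict to (x0, y0, x1, y1) in pixels, clamped to the image."""
    cx = pred["xy"]["x"] * width
    cy = pred["xy"]["y"] * height
    bw = pred["hw"]["w"] * width
    bh = pred["hw"]["h"] * height
    # Clamp each corner so boxes that spill past the border stay inside the image.
    x0 = min(max(cx - bw / 2, 0.0), float(width))
    y0 = min(max(cy - bh / 2, 0.0), float(height))
    x1 = min(max(cx + bw / 2, 0.0), float(width))
    y1 = min(max(cy + bh / 2, 0.0), float(height))
    return x0, y0, x1, y1


def batch_to_pixel_xyxy(results: list, sizes: list) -> list:
    """Apply to_pixel_xyxy to a list[list[dict]] result, given one (width, height) per image."""
    if len(results) != len(sizes):
        raise ValueError("expected one (width, height) pair per image")
    return [
        [to_pixel_xyxy(pred, width, height) for pred in preds]
        for preds, (width, height) in zip(results, sizes)
    ]


# Hand-written example in the documented return shape: one detection, then none.
example_results = [[{"xy": {"x": 0.5, "y": 0.5}, "hw": {"h": 0.5, "w": 0.25}}], []]
example_boxes = batch_to_pixel_xyxy(example_results, [(640, 480), (100, 100)])
print(example_boxes)
```

With real outputs, `sizes` would typically be `[img.size for img in images]`, since `PIL.Image.size` is a `(width, height)` tuple.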

	> [!NOTE]
	> Requesting `task="segmentation"` on this model will raise a `ValueError`. Use the full [`tiiuae/Falcon-Perception`](https://huggingface.co/tiiuae/Falcon-Perception) model for segmentation masks.
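
Because `queries` takes one query per image, looking for several concepts in the same image means pairing that image with each query. A minimal sketch of that bookkeeping follows; `expand_queries` and `group_by_query` are hypothetical helpers, not part of the model's API, and the example uses stand-in values so it runs without loading the model.

```python
def expand_queries(image, queries):
    """Repeat one image so it lines up one-to-one with several queries."""
    queries = list(queries)
    if not queries:
        raise ValueError("at least one query is required")
    return [image] * len(queries), queries


def group_by_query(queries, results):
    """Map each query string to the detections returned for it."""
    if len(queries) != len(results):
        raise ValueError("expected one result list per query")
    grouped = {}
    for query, detections in zip(queries, results):
        # Merge detections if the same query string appears more than once.
        grouped.setdefault(query, []).extend(detections)
    return grouped


# Stand-in values so the sketch runs without the model.
image = "photo.jpg"  # placeholder for a PIL.Image
images, queries = expand_queries(image, ["cat", "dog"])
# With the real model this would be: results = model.generate(images, queries)
results = [[{"xy": {"x": 0.5, "y": 0.5}, "hw": {"h": 0.2, "w": 0.1}}], []]
by_query = group_by_query(queries, results)
print({query: len(dets) for query, dets in by_query.items()})
```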

	## What the model is for

	Falcon Perception 300M is designed for open-vocabulary object detection where the main difficulty is localization under free-form text queries. Use cases include:

	- Natural language driven object selection in images
	- Lightweight bounding-box detection for downstream pipelines
	- Crowded scenes where the number of instances is large and variable
	- Edge or resource-constrained deployments where the full model is too large

	It is not intended as a general-purpose vision-language assistant for open-ended reasoning, long-form generation, or multi-step VQA.

	## Model details (high level)

	The architecture follows a single-stack early-fusion recipe:

	- One dense Transformer backbone processes image patches and text tokens in a shared space from the first layer
	- Hybrid attention masking: bidirectional among image tokens, causal for text and task tokens conditioned on the image
	- Chain-of-Perception decoding: `<|coord|>` then `<|size|>` per instance
	- Specialized heads for coordinates and size, with geometry conditioning via Fourier features
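
As a rough illustration of the Fourier-feature idea in the last bullet, a normalized coordinate can be expanded into sine and cosine values at several frequencies before being fed back to a network. This sketch is not the model's code: the function names, the power-of-two frequency schedule, and the number of frequency bands are all assumptions chosen for clarity.

```python
import math


def fourier_features(value: float, num_frequencies: int = 4) -> list[float]:
    """Encode a scalar in [0, 1] as [sin(2^k * pi * v), cos(2^k * pi * v)] for k = 0..n-1.

    The frequency schedule here is an illustrative assumption.
    """
    features = []
    for k in range(num_frequencies):
        angle = (2.0 ** k) * math.pi * value
        features.append(math.sin(angle))
        features.append(math.cos(angle))
    return features


def encode_center(x: float, y: float, num_frequencies: int = 4) -> list[float]:
    """Concatenate the features of a normalized center point (x, y)."""
    return fourier_features(x, num_frequencies) + fourier_features(y, num_frequencies)


# Each coordinate contributes 2 * num_frequencies values.
print(len(encode_center(0.25, 0.75)))
```

Higher frequency bands change quickly with small coordinate shifts, which is the usual motivation for this kind of encoding when precise localization matters.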

	## Comparison with the full model

	| | Falcon-Perception | Falcon-Perception-300M |
	|---|---|---|
	| Parameters | ~7B | ~0.3B |
	| Tasks | Detection + Segmentation | Detection only |
	| Output | Bounding boxes + pixel masks | Bounding boxes |
	| Token sequence | `<\|coord\|>` `<\|size\|>` `<\|seg\|>` | `<\|coord\|>` `<\|size\|>` |

	## Limitations

	- Presence calibration remains a key limitation for autoregressive dense interfaces. False positives on hard negatives are more likely than with DETR-like detection models.
	- OCR-driven prompts depend on text size and image resolution. Small text and degraded scans are challenging.
	- Dense scenes benefit strongly from high resolution inputs. Low resolution can be sufficient to recognize that a concept is present, but insufficient to localize each instance precisely.
	- This variant does not produce segmentation masks. Use the full model if pixel-level masks are needed.

	## Citation

	If you use Falcon Perception, please cite:

	```bibtex
	@article{bevli2026falcon,
	  title = {Falcon Perception},
	  author = {Bevli, Aviraj and Chaybouti, Sofian and Dahou, Yasser and Hacid, Hakim and Huynh, Ngoc Dung and Le Khac, Phuc H. and Narayan, Sanath and Para, Wamiq Reyaz and Singh, Ankit},
	  journal = {arXiv preprint arXiv:2603.27365},
	  year = {2026},
	  url = {https://arxiv.org/abs/2603.27365}
	}
	```