---
pipeline_tag: object-detection
library_name: transformers
tags:
- falcon
- detection
- vision-language
- open-vocabulary
license: apache-2.0
---
<img src="main_fig.jpg" width="480" alt="Falcon Perception"/>

> [!NOTE]
> This is the **300M parameter** variant of Falcon Perception. It supports **detection only** (bounding boxes). For the full model with segmentation masks, see [`tiiuae/Falcon-Perception`](https://huggingface.co/tiiuae/Falcon-Perception).
## Falcon Perception 300M

Falcon Perception 300M is a 0.3B-parameter early-fusion vision-language model for open-vocabulary grounding detection. Given an image and a natural-language query, it returns zero, one, or many matching instances, each with a bounding box.

The model is built around a simple interface. Image patches and text tokens are processed together in a single Transformer using a hybrid attention mask: image tokens build bidirectional visual context, while text and task tokens decode causally, conditioned on the image. For each detected instance, the model generates a short structured sequence of task tokens, `<|coord|>` then `<|size|>`, which yields a center point and a bounding-box size in normalized coordinates.
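
The hybrid masking rule described above can be illustrated with a small, dependency-free sketch. It assumes that image tokens come before text and task tokens in the sequence, and the function name `hybrid_attention_mask` is hypothetical; the model's actual mask is built inside its custom code and may differ in detail.

```python
def hybrid_attention_mask(is_image):
    """Return mask[q][k] == True when query position q may attend to key position k.

    Illustrative only: assumes image tokens precede text/task tokens.
    """
    n = len(is_image)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            causal = k <= q  # every token sees itself and earlier positions
            both_image = is_image[q] and is_image[k]  # image tokens also see later image tokens
            mask[q][k] = causal or both_image
    return mask


# Toy sequence: 3 image patches followed by 2 text/task tokens.
mask = hybrid_attention_mask([True, True, True, False, False])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In FlexAttention the same rule would typically be written as a `mask_mod(b, h, q_idx, kv_idx)` function rather than a dense matrix; the dense form is used here only because it is easy to inspect.
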
### Links

- Full model (with segmentation): [`tiiuae/Falcon-Perception`](https://huggingface.co/tiiuae/Falcon-Perception)
- Code and inference engine: [`github.com/tiiuae/Falcon-Perception`](https://github.com/tiiuae/Falcon-Perception)
- Tech report: [arXiv:2603.27365](https://arxiv.org/abs/2603.27365)
- PBench dataset: `tiiuae/PBench`
- OCR model: [`tiiuae/Falcon-OCR`](https://huggingface.co/tiiuae/Falcon-OCR)
## Quickstart

### Installation

```bash
pip install "torch>=2.5" transformers pillow einops
```

This model requires PyTorch 2.5 or newer for FlexAttention. The first call can be slower because `torch.compile` may build optimized kernels.
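
To fail early on an older install, the version requirement above can be checked before loading the model. The helper below is a minimal sketch; `meets_min_version` is a hypothetical name, and it compares only the leading `major.minor` part of the version string.

```python
import re


def meets_min_version(version: str, minimum: tuple = (2, 5)) -> bool:
    """Return True if a version string such as '2.5.1+cu121' is at least `minimum`."""
    # Read only the leading major.minor; suffixes like '+cu121' or '.dev2024' are ignored.
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= minimum
```

With PyTorch installed, `meets_min_version(torch.__version__)` returning `False` indicates that the environment needs an upgrade before this model can run.
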
### Run open-vocabulary detection

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM

# trust_remote_code is needed because the model ships its own modeling code.
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/Falcon-Perception-300M",
    trust_remote_code=True,
    device_map={"": "cuda:0"},
)

image = Image.open("photo.jpg")
# generate returns one list of detections per image; take the list for this image.
preds = model.generate(image, "cat")[0]

for p in preds:
    print(p["xy"], p["hw"])
```
Each prediction is a dict with normalized bounding box coordinates:

```python
{
    "xy": {"x": float, "y": float},  # center in normalized coordinates (0 to 1)
    "hw": {"h": float, "w": float},  # size in normalized coordinates (0 to 1)
}
```
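
Downstream code often expects pixel corner coordinates rather than a normalized center and size. The sketch below converts one prediction dict of the shape shown above; `to_pixel_box` is a hypothetical helper, and clipping boxes to the image bounds is an assumption made here, not documented model behavior.

```python
def to_pixel_box(pred: dict, width: int, height: int) -> tuple:
    """Convert a normalized center/size prediction to pixel corners (x0, y0, x1, y1)."""
    cx, cy = pred["xy"]["x"] * width, pred["xy"]["y"] * height
    bw, bh = pred["hw"]["w"] * width, pred["hw"]["h"] * height
    # Assumption: boxes that extend past the image edge are clipped to it.
    x0 = max(0.0, cx - bw / 2)
    y0 = max(0.0, cy - bh / 2)
    x1 = min(float(width), cx + bw / 2)
    y1 = min(float(height), cy + bh / 2)
    return (x0, y0, x1, y1)


example = {"xy": {"x": 0.5, "y": 0.5}, "hw": {"h": 0.5, "w": 0.25}}
box = to_pixel_box(example, width=640, height=480)
print(box)  # (240.0, 120.0, 400.0, 360.0)
```
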
### Visualize detections

```python
from PIL import ImageDraw

# Continues from the detection example: reuses `image` and `preds`.
draw = ImageDraw.Draw(image)
W, H = image.size

for p in preds:
    # Scale the normalized center and size to pixels.
    cx, cy = p["xy"]["x"] * W, p["xy"]["y"] * H
    bw, bh = p["hw"]["w"] * W, p["hw"]["h"] * H
    # Convert center/size to top-left and bottom-right corners.
    x0, y0 = cx - bw / 2, cy - bh / 2
    x1, y1 = cx + bw / 2, cy + bh / 2
    draw.rectangle([x0, y0, x1, y1], outline="lime", width=2)

image.save("output.jpg")
```
## API

### `model.generate(images, queries, **kwargs)`

| Parameter | Type | Default | Description |
|---|---|---|---|
| `images` | `PIL.Image` or `list` | required | Single image or list of images |
| `queries` | `str` or `list[str]` | required | Query string(s), one per image |
| `task` | `str` | `"detection"` | Task type. Only `"detection"` is supported by this model. |
| `max_new_tokens` | `int` | `2048` | Maximum decoding steps |
| `min_dimension` | `int` | `256` | Minimum image side after resize |
| `max_dimension` | `int` | `1024` | Maximum image side after resize |
| `compile` | `bool` | `True` | Run `torch.compile` on first call |

**Returns:** `list[list[dict]]`, one list per image.

Each detection dict contains:

```python
{
    "xy": {"x": float, "y": float},  # center in normalized coordinates (0 to 1)
    "hw": {"h": float, "w": float},  # size in normalized coordinates (0 to 1)
}
```

> [!NOTE]
> Requesting `task="segmentation"` on this model will raise a `ValueError`. Use the full [`tiiuae/Falcon-Perception`](https://huggingface.co/tiiuae/Falcon-Perception) model for segmentation masks.
## What the model is for

Falcon Perception 300M is designed for open-vocabulary object detection where the main difficulty is localization under free-form text queries. Use cases include:

- Natural language driven object selection in images
- Lightweight bounding-box detection for downstream pipelines
- Crowded scenes where the number of instances is large and variable
- Edge or resource-constrained deployments where the full model is too large

It is not intended as a general-purpose vision-language assistant for open-ended reasoning, long-form generation, or multi-step VQA.
## Model details (high level)

The architecture follows a single-stack early-fusion recipe:

- One dense Transformer backbone processes image patches and text tokens in a shared space from the first layer
- Hybrid attention masking: bidirectional among image tokens, causal for text and task tokens conditioned on the image
- Chain-of-Perception decoding: `<|coord|>` then `<|size|>` per instance
- Specialized heads for coordinates and size, with geometry conditioning via Fourier features
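
The card does not spell out how the Fourier features in the last bullet are computed. As a rough illustration of the general technique, the sketch below encodes a normalized coordinate as sine and cosine pairs at doubling frequencies. The frequency schedule, the number of bands, and the names `fourier_features` and `encode_center` are all assumptions, not the model's actual implementation.

```python
import math


def fourier_features(value, num_bands=4):
    """Encode a scalar in [0, 1] as [sin, cos] pairs at frequencies 2**k * pi.

    The frequency schedule is an assumed, commonly used choice.
    """
    features = []
    for k in range(num_bands):
        angle = (2.0 ** k) * math.pi * value
        features.append(math.sin(angle))
        features.append(math.cos(angle))
    return features


def encode_center(x, y, num_bands=4):
    """Concatenate the encodings of a normalized center point (x, y)."""
    return fourier_features(x, num_bands) + fourier_features(y, num_bands)


embedding = encode_center(0.25, 0.75)
print(len(embedding))  # 2 coordinates * 4 bands * (sin, cos) = 16 values
```

Encodings of this kind give nearby coordinates similar vectors at low frequencies while still separating them at high frequencies, which is the usual motivation for feeding them to a network instead of raw scalars.
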
## Comparison with the full model

| | **Falcon-Perception** | **Falcon-Perception-300M** |
|---|---|---|
| Parameters | ~7B | ~0.3B |
| Tasks | Detection + Segmentation | Detection only |
| Output | Bounding boxes + pixel masks | Bounding boxes |
| Token sequence | `<\|coord\|>` `<\|size\|>` `<\|seg\|>` | `<\|coord\|>` `<\|size\|>` |
## Limitations

- Presence calibration remains a key limitation for autoregressive dense interfaces. False positives are more likely on hard negatives than in DETR-like detection models.
- OCR-driven prompts depend on text size and image resolution. Small text and degraded scans are challenging.
- Dense scenes benefit strongly from high resolution inputs. Low resolution can be sufficient to recognize that a concept is present, but insufficient to localize each instance precisely.
- This variant does **not** produce segmentation masks. Use the full model if pixel-level masks are needed.
## Citation

If you use Falcon Perception, please cite:

```bibtex
@article{bevli2026falcon,
  title   = {Falcon Perception},
  author  = {Bevli, Aviraj and Chaybouti, Sofian and Dahou, Yasser and Hacid, Hakim and Huynh, Ngoc Dung and Le Khac, Phuc H. and Narayan, Sanath and Para, Wamiq Reyaz and Singh, Ankit},
  journal = {arXiv preprint arXiv:2603.27365},
  year    = {2026},
  url     = {https://arxiv.org/abs/2603.27365}
}
```