yasserDahou committed on
Commit eb1d63a · verified · 1 Parent(s): 1f8827e

Update README.md

Files changed (1): README.md +87 -59

README.md CHANGED
@@ -1,3 +1,5 @@
 
 
1
  ---
2
  license: apache-2.0
3
  pipeline_tag: mask-generation
@@ -9,112 +11,138 @@ tags:
9
  - open-vocabulary
10
  ---
11
 
12
- <img src="main_fig.jpg" width="480" alt="Falcon Perception"/>
13
 
14
 
15
- Falcon Perception is a **dense early-fusion vision-language model** for **open-vocabulary segmentation**. Given an image and a natural-language query, it segments **all matching objects** and returns pixel-accurate masks.
16
 
17
- The model **jointly processes image patches and text tokens** in a single transformer, then autoregressively predicts **`<|coord|>`**, **`<|size|>`**, and **`<|seg|>`** tokens for each detected object. Each `<|seg|>` token acts as a **mask query**: its hidden state is projected and dot-producted against upsampled image features to produce a binary mask (i.e. no autoregressive polygon generation needed).
18
 
19
- ## Installation
20
 
21
  ```bash
22
- pip install transformers torch einops pycocotools
23
  ```
24
 
25
- Requires **PyTorch 2.5+** (FlexAttention).
26
 
27
- ## Quick Start
28
 
29
  ```python
30
  import torch
31
- from transformers import AutoModelForCausalLM
32
  from PIL import Image
 
33
 
34
  model = AutoModelForCausalLM.from_pretrained(
      "tiiuae/falcon-perception",
      trust_remote_code=True,
-     dtype=torch.bfloat16,
-     device_map="cuda",
  )
40
 
41
  image = Image.open("photo.jpg")
42
- results = model.generate(image, "cat")
43
 
44
- for pred in results[0]:
-     print(pred["xy"])        # {"x": 0.35, "y": 0.42}
-     print(pred["hw"])        # {"h": 0.15, "w": 0.12}
-     print(pred["mask_rle"])  # {"counts": "...", "size": [H, W]}
48
  ```
49
 
50
- > The first `generate()` call is slower (~15-20 s) because `torch.compile` builds optimized kernels. Subsequent calls run in ~1-2 s.
51
 
 
52
 
53
  ### `model.generate(images, queries, **kwargs)`
54
 
55
  | Parameter | Type | Default | Description |
56
  |---|---|---|---|
57
- | `images` | `PIL.Image` or `list` | required | Single image or list of images (PIL, path, or URL) |
58
  | `queries` | `str` or `list[str]` | required | Query string(s), one per image |
59
- | `max_new_tokens` | `int` | `2048` | Maximum generation steps |
60
  | `min_dimension` | `int` | `256` | Minimum image side after resize |
61
  | `max_dimension` | `int` | `1024` | Maximum image side after resize |
62
- | `compile` | `bool` | `True` | Auto torch.compile on first call |
63
- | `segm_threshold` | `float` | `0.5` | Sigmoid threshold for binary masks |
 
64
 
65
- **Returns:** `list[list[dict]]` — one list per image, each containing detection dicts:
66
 
67
  ```python
68
  {
69
- "xy": {"x": float, "y": float}, # center (normalized 0-1)
70
- "hw": {"h": float, "w": float}, # size (normalized 0-1)
71
- "mask_rle": {"counts": str, "size": [H, W]}, # COCO RLE at original resolution
72
  }
73
  ```
74
 
 
75
 
76
- ## Visualizing Masks
77
 
78
- ```python
79
- import numpy as np
80
- from pycocotools import mask as mask_utils
81
- from PIL import Image, ImageDraw
82
-
83
- def overlay_masks(image, detections, alpha=0.55):
-     """Overlay RLE masks on an image with colored fills and black borders."""
-     overlay = image.convert("RGBA").copy()
-     colors = [
-         (255, 60, 60), (60, 220, 60), (50, 120, 255),
-         (255, 200, 40), (220, 60, 220), (60, 220, 220),
-     ]
-     for i, det in enumerate(detections):
-         m = mask_utils.decode(det["mask_rle"]).astype(bool)
-         r, g, b = colors[i % len(colors)]
-         fill = np.zeros((*m.shape, 4), dtype=np.uint8)
-         fill[m] = [r, g, b, int(255 * alpha)]
-         overlay = Image.alpha_composite(overlay, Image.fromarray(fill))
-         # black border around mask
-         border = np.zeros((*m.shape, 4), dtype=np.uint8)
-         ky = m[1:, :] != m[:-1, :]
-         kx = m[:, 1:] != m[:, :-1]
-         edge = np.zeros_like(m)
-         edge[1:, :] |= ky; edge[:-1, :] |= ky
-         edge[:, 1:] |= kx; edge[:, :-1] |= kx
-         border[edge] = [0, 0, 0, 200]
-         overlay = Image.alpha_composite(overlay, Image.fromarray(border))
-     return overlay
106
 
107
- image = Image.open("photo.jpg")
108
- results = model.generate(image, "cat")
109
- overlay_masks(image, results[0]).save("output.png")
110
- ```
111
 
112
- ## Performance
113
 
114
- ### PBench
118
 
119
  ## Citation
120
 
1
+ <img src="main_fig.jpg" width="480" alt="Falcon Perception"/>
2
+
3
  ---
4
  license: apache-2.0
5
  pipeline_tag: mask-generation
 
11
  - open-vocabulary
12
  ---
13
 
 
14
 
15
 
16
+ ## Falcon Perception
17
+
18
+ Falcon Perception is a 0.6B-parameter early-fusion vision-language model for open-vocabulary grounding and instance segmentation. Given an image and a natural-language query, it returns zero, one, or many matching instances with pixel-accurate masks.
19
+
20
+ The model is built around a simple interface. Image patches and text tokens are processed together in a single Transformer using a hybrid attention mask: image tokens build bidirectional visual context, while text and task tokens decode causally conditioned on the image. For each instance, the model generates a short structured sequence of task tokens in a fixed order, `<|coord|>` then `<|size|>` then `<|seg|>`. The `<|seg|>` token acts as a mask query whose hidden state is projected and dotted with upsampled image features, producing a full-resolution binary mask without autoregressive mask generation.
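The `<|seg|>`-as-mask-query step can be sketched in a few lines. This is an illustrative toy (the shapes, the random projection, and the threshold are made up for the sketch, not the model's actual weights or layer names):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy shapes: D = hidden size, (H, W) = upsampled image-feature grid.
D, H, W = 256, 64, 64

seg_hidden = rng.standard_normal(D)           # hidden state of one <|seg|> token
image_feats = rng.standard_normal((D, H, W))  # upsampled image features

proj = rng.standard_normal((D, D))            # stand-in for the learned projection
query = proj @ seg_hidden                     # project the <|seg|> hidden state

# Dot product of the query against every spatial feature gives mask logits;
# thresholding them yields the binary mask, with no autoregressive decoding.
logits = np.einsum("d,dhw->hw", query, image_feats)
mask = logits > 0  # sigmoid(logits) > 0.5 is equivalent to logits > 0

print(mask.shape)  # (64, 64)
```

Because the mask is read off in one dot product per `<|seg|>` token, decoding cost does not grow with mask resolution the way polygon-token generation would.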
21
+
22
+
23
+ ### Links
24
+
25
+ - Code and inference engine: `https://github.com/tiiuae/Falcon-Perception`
26
+ - Tech report: arXiv link coming soon
27
+ - PBench dataset: `tiiuae/PBench`
28
+ - OCR model: `tiiuae/Falcon-OCR`
29
 
30
+ ## Quickstart
31
 
32
+ ### Installation
33
 
34
  ```bash
35
+ pip install "torch>=2.5" transformers pillow einops pycocotools
36
  ```
37
 
38
+ This model requires PyTorch 2.5 or newer for FlexAttention. The first `generate()` call is slower (roughly 15-20 s) while `torch.compile` builds optimized kernels; subsequent calls run in about 1-2 s.
39
 
40
+ ### Run open-vocabulary segmentation
41
 
42
  ```python
43
  import torch
 
44
  from PIL import Image
45
+ from transformers import AutoModelForCausalLM
46
 
47
  model = AutoModelForCausalLM.from_pretrained(
      "tiiuae/falcon-perception",
      trust_remote_code=True,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
  )
53
 
54
  image = Image.open("photo.jpg")
55
+ preds = model.generate(image, "cat")[0]
56
 
57
+ for p in preds:
58
+     print(p["xy"], p["hw"])
 
 
59
  ```
60
 
61
+ ### Decode masks
62
+
63
+ ```python
64
+ import numpy as np
65
+ from pycocotools import mask as mask_utils
66
+
67
+ for p in preds:
+     rle = p["mask_rle"]
+     # pycocotools expects bytes for counts
+     m = {"size": rle["size"], "counts": rle["counts"].encode("utf-8")}
+     mask = mask_utils.decode(m).astype(bool)  # H x W
+     print(mask.shape, mask.sum())
73
+ ```
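If a downstream pipeline needs pixel-space boxes rather than masks, one simple option (an illustrative helper, not part of the model API) is the tight bounding box of each decoded mask:

```python
import numpy as np

def mask_to_box(mask: np.ndarray):
    """Return (x0, y0, x1, y1) of the tight box around a boolean H x W mask,
    or None if the mask is empty."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

demo = np.zeros((8, 8), dtype=bool)
demo[2:5, 3:7] = True
print(mask_to_box(demo))  # (3, 2, 6, 4)
```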
74
 
75
+ ## API
76
 
77
  ### `model.generate(images, queries, **kwargs)`
78
 
79
  | Parameter | Type | Default | Description |
80
  |---|---|---|---|
81
+ | `images` | `PIL.Image` or `list` | required | Single image or list of images |
82
  | `queries` | `str` or `list[str]` | required | Query string(s), one per image |
83
+ | `max_new_tokens` | `int` | `2048` | Maximum decoding steps |
84
  | `min_dimension` | `int` | `256` | Minimum image side after resize |
85
  | `max_dimension` | `int` | `1024` | Maximum image side after resize |
86
+ | `compile` | `bool` | `True` | Run `torch.compile` on first call |
87
+
88
+ **Returns:** `list[list[dict]]`, one list per image.
89
 
90
+ Each prediction dict contains:
91
 
92
  ```python
93
  {
94
+     "xy": {"x": float, "y": float},               # center in normalized coordinates (0 to 1)
+     "hw": {"h": float, "w": float},               # size in normalized coordinates (0 to 1)
+     "mask_rle": {"counts": str, "size": [H, W]},  # COCO RLE at original resolution
97
  }
98
  ```
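Because `xy` and `hw` are normalized by image size, converting a prediction to an absolute pixel box is straightforward. A minimal helper (illustrative, not part of the model API):

```python
def to_pixel_box(pred, width, height):
    """Convert normalized center/size to an absolute (x0, y0, x1, y1) box."""
    cx, cy = pred["xy"]["x"] * width, pred["xy"]["y"] * height
    w, h = pred["hw"]["w"] * width, pred["hw"]["h"] * height
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

# A centered box covering half the width and a quarter of the height.
pred = {"xy": {"x": 0.5, "y": 0.5}, "hw": {"h": 0.25, "w": 0.5}}
print(to_pixel_box(pred, 640, 480))  # (160.0, 180.0, 480.0, 300.0)
```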
99
 
100
+ ## What the model is for
101
 
102
+ Falcon Perception is designed for dense grounding settings where the main difficulty is localizing open-vocabulary concepts. That includes:
103
 
104
+ - Natural language driven object selection in images
105
+ - Promptable instance segmentation for downstream pipelines
106
+ - Crowded scenes where the number of instances is large and variable
107
 
108
+ It is not intended as a general-purpose vision-language assistant for open-ended reasoning, long-form generation, or multi-step VQA.
109
+
110
+ ## Model details (high level)
 
111
 
112
+ The architecture follows a single-stack early-fusion recipe:
113
 
114
+ - One dense Transformer backbone processes image patches and text tokens in a shared space from the first layer
115
+ - Hybrid attention masking: bidirectional among image tokens, causal for text and task tokens conditioned on the image
116
+ - Chain-of-Perception decoding: `<|coord|>` then `<|size|>` then `<|seg|>` per instance
117
+ - Specialized heads for coordinates and size, with geometry conditioning via Fourier features
118
+ - Parallel mask decoding: each `<|seg|>` token becomes a mask query and produces a full-resolution mask via dot product with upsampled image features
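The hybrid attention mask in the first two bullets can be visualized with a toy boolean matrix (token counts and layout here are made up for illustration): image tokens attend to each other bidirectionally, while text and task tokens attend causally to earlier text plus all image tokens.

```python
import numpy as np

n_img, n_txt = 4, 3          # toy token counts: image tokens first, then text
n = n_img + n_txt
allowed = np.zeros((n, n), dtype=bool)

# Image tokens: full bidirectional attention among themselves.
allowed[:n_img, :n_img] = True

# Text/task tokens: causal attention, conditioned on all image tokens.
for i in range(n_img, n):
    allowed[i, :n_img] = True          # see every image token
    allowed[i, n_img:i + 1] = True     # plus previous text tokens and itself

print(allowed.astype(int))
```

A mask of this shape is what FlexAttention-style APIs consume as a per-pair "is attention allowed" predicate.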
119
 
120
+ ## Evaluation summary
121
 
122
+ From the technical report:
123
 
124
+ - SA-Co (open-vocabulary segmentation): 68.0 Macro F1 compared to 62.3 for SAM 3, with the main remaining gap being presence calibration (Average MCC 0.64 compared to 0.82 for SAM 3)
125
+ - PBench: a diagnostic benchmark that breaks down performance by capability (attributes, OCR-guided disambiguation, spatial constraints, relations) and includes a dense long-context crowded split
126
+
127
+ Full tables, setup details, and ablations are in the report.
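For reference, the Matthews correlation coefficient (MCC) used above for presence calibration is computed from a presence/absence confusion matrix. A minimal sketch:

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient for binary presence prediction."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Perfect presence predictions give MCC = 1.0; chance-level gives 0.0.
print(mcc(tp=50, fp=0, tn=50, fn=0))   # 1.0
print(mcc(tp=25, fp=25, tn=25, fn=25)) # 0.0
```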
128
+
129
+ ## Limitations
130
+
131
+ - Presence calibration remains a key limitation for autoregressive dense interfaces. False positives are more likely on hard negatives than in DETR-like segmentation models.
132
+ - OCR-driven prompts depend on text size and image resolution. Small text and degraded scans are challenging.
133
+ - Dense scenes benefit strongly from high resolution inputs. Low resolution can be sufficient to recognize that a concept is present, but insufficient to localize each instance precisely.
134
 
135
  ## Citation
136
 
137
+ If you use Falcon Perception, please cite:
138
+
139
+ ```bibtex
140
+ @misc{falconperception2026,
+     title        = {Falcon Perception},
+     author       = {TII Falcon Vision Team},
+     year         = {2026},
+     howpublished = {arXiv preprint, link forthcoming},
+     note         = {Code: https://github.com/tiiuae/Falcon-Perception},
+ }
147
+ ```
148
+