Update README.md

README.md CHANGED

@@ -16,7 +16,7 @@ tags:
 
 # MolmoPoint-8B
 MolmoPoint-8B is a fully-open VLM developed by the Allen Institute for AI (Ai2) that supports image, video, and multi-image understanding and grounding.
-It has
+It has a new pointing mechanism that improves image pointing, video pointing, and video tracking; see our technical report for details.
 
 Note that the Hugging Face MolmoPoint model does not support training; see our GitHub repo for the training code.
 
@@ -42,7 +42,7 @@ We recommend running MolmoPoint with `logits_processor=model.build_logit_process
 to enforce that point tokens are generated in a valid way.
 
 In MolmoPoint, instead of coordinates, points are generated as a series of special
-tokens,
+tokens; decoding the tokens back into points requires some additional
 metadata from the preprocessor.
 The metadata is returned by the preprocessor when the `return_pointing_metadata` flag is set.
 Then `model.extract_image_points` and `model.extract_video_points` do the decoding; they
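The exact special-token vocabulary is defined by the model, but the decode step described above can be illustrated with a minimal sketch. The bin tokens `<x_i>`/`<y_i>`, the `num_bins` value, and the helper name below are all invented for illustration, not MolmoPoint's actual scheme:

```python
def decode_point_tokens(tokens, image_width, image_height, num_bins=100):
    """Map pairs of hypothetical coordinate-bin tokens back to pixel coordinates."""
    points = []
    coords = []
    for tok in tokens:
        if tok.startswith("<x_") or tok.startswith("<y_"):
            coords.append(int(tok[3:-1]))  # strip "<x_" / "<y_" and ">"
        if len(coords) == 2:
            x_bin, y_bin = coords
            # Bin indices are normalized to [0, 1], then scaled by the image
            # size — the kind of information the preprocessor metadata carries.
            points.append((x_bin / num_bins * image_width,
                           y_bin / num_bins * image_height))
            coords = []
    return points

print(decode_point_tokens(["<x_50>", "<y_25>"], 640, 480))  # → [(320.0, 120.0)]
```

This is only meant to show why the image sizes from the metadata are needed: the generated tokens alone carry normalized positions, not pixels.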
@@ -111,6 +111,9 @@ points = model.extract_image_points(
     metadata["subpatch_mapping"],
     metadata["image_sizes"]
 )
+
+# points is a list of [object_id, image_num, x, y] entries
+# For multiple images, `image_num` is the index of the image the point belongs to
 print(points)
 ```
 
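A small, self-contained sketch of working with that `[object_id, image_num, x, y]` layout in a multi-image prompt — the sample values and the helper name are ours, only the list layout comes from the README:

```python
# Assumed output shape: a list of [object_id, image_num, x, y].
points = [
    [0, 0, 120.5, 88.0],   # object 0 in the first image
    [1, 0, 300.0, 40.2],   # object 1 in the first image
    [0, 1, 64.0, 210.0],   # object 0 again, in the second image
]

def points_in_image(points, image_num):
    """Keep only the points that fall in the given image of a multi-image prompt."""
    return [p for p in points if p[1] == image_num]

print(points_in_image(points, 0))  # both points from the first image
```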
@@ -156,6 +159,9 @@ points = model.extract_video_points(
     metadata["timestamps"],
     metadata["video_size"]
 )
+
+# points is a list of [object_id, image_num, x, y] entries
+# For tracking, object_id uniquely identifies objects that may appear in multiple frames
 print(points)
 ```
 
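For tracking, the flat point list can be regrouped into per-object trajectories by keying on `object_id`. A minimal sketch, assuming the `[object_id, image_num, x, y]` layout above with `image_num` acting as a frame index (the sample data and helper name are ours):

```python
from collections import defaultdict

# Assumed output shape: [object_id, frame_num, x, y] per point, where the same
# object_id reappears across frames for a tracked object.
points = [
    [0, 0, 10.0, 10.0],
    [0, 1, 12.0, 11.0],
    [1, 1, 50.0, 60.0],
]

def build_tracks(points):
    """Group points by object_id into per-object trajectories of (frame, x, y)."""
    tracks = defaultdict(list)
    for object_id, frame, x, y in points:
        tracks[object_id].append((frame, x, y))
    return dict(tracks)

print(build_tracks(points))
```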