chrisc36 committed · verified
Commit: 2c280e9 · Parent: 74a5036

Update README.md

Files changed (1): README.md (+8 -2)
README.md CHANGED
@@ -16,7 +16,7 @@ tags:
 
 # MolmoPoint-8B
 MolmoPoint-8B is a fully-open VLM developed by the Allen Institute for AI (Ai2) that supports image, video, and multi-image understanding and grounding.
-It has novel pointing mechansim that improves image pointing, video pointing, and video tracking, see our technical report for details.
+It has a new pointing mechanism that improves image pointing, video pointing, and video tracking; see our technical report for details.
 
 Note the Hugging Face MolmoPoint model does not support training; see our GitHub repo for the training code.
 
@@ -42,7 +42,7 @@ We recommend running MolmoPoint with `logits_processor=model.build_logit_process
 to enforce that point tokens are generated in a valid way.
 
 In MolmoPoint, instead of coordinates, points will be generated as a series of special
-tokens, to decode the tokens back into points requires some additional
+tokens; decoding the tokens back into points requires some additional
 metadata from the preprocessor.
 The metadata is returned by the preprocessor using the `return_pointing_metadata` flag.
 Then `model.extract_image_points` and `model.extract_video_points` do the decoding; they
@@ -111,6 +111,9 @@ points = model.extract_image_points(
     metadata["subpatch_mapping"],
     metadata["image_sizes"]
 )
+
+# points is a list of [object_id, image_num, x, y] entries
+# For multiple images, `image_num` is the index of the image the point is in
 print(points)
 ```
 
@@ -156,6 +159,9 @@ points = model.extract_video_points(
     metadata["timestamps"],
     metadata["video_size"]
 )
+
+# points is a list of [object_id, image_num, x, y] entries
+# For tracking, `object_id` uniquely identifies objects that might appear in multiple frames.
 print(points)
 ```
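The decoding flow the diff describes — special point tokens mapped back to pixel coordinates via preprocessor metadata — can be sketched as a toy example. Everything here is hypothetical: `extract_image_points_toy`, `POINT_TOKEN_OFFSET`, and the shapes of `subpatch_mapping` and `image_sizes` are illustrative assumptions, not the real MolmoPoint API.

```python
# Toy sketch of point-token decoding (NOT the real MolmoPoint API).
# Assumption: token IDs at or above POINT_TOKEN_OFFSET are point tokens,
# and each point token's index looks up a relative location in the metadata.

POINT_TOKEN_OFFSET = 100_000  # assumed cutoff between text and point tokens

def extract_image_points_toy(token_ids, subpatch_mapping, image_sizes):
    """Decode point tokens into [object_id, image_num, x, y] entries.

    subpatch_mapping: point index -> (image_num, (rel_x, rel_y)) with rel in [0, 1]
    image_sizes: image_num -> (width, height) in pixels
    """
    points = []
    object_id = 0
    for tok in token_ids:
        if tok < POINT_TOKEN_OFFSET:
            continue  # ordinary text token, skip
        image_num, (rel_x, rel_y) = subpatch_mapping[tok - POINT_TOKEN_OFFSET]
        w, h = image_sizes[image_num]
        points.append([object_id, image_num, rel_x * w, rel_y * h])
        object_id += 1
    return points

# Example: two point tokens interleaved with ordinary text tokens
mapping = {0: (0, (0.25, 0.5)), 1: (1, (0.75, 0.1))}
sizes = {0: (640, 480), 1: (320, 240)}
print(extract_image_points_toy([17, 100_000, 42, 100_001], mapping, sizes))
# [[0, 0, 160.0, 240.0], [1, 1, 240.0, 24.0]]
```

The real extractors additionally take `subpatch_mapping`/`image_sizes` (or `timestamps`/`video_size` for video) from the preprocessor's `return_pointing_metadata` output, which is why the README stresses passing that flag.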