---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen3-8B
- google/siglip-so400m-patch14-384
pipeline_tag: image-text-to-text
tags:
- multimodal
- olmo
- molmo
- molmo2
- molmo_point
---

# MolmoPoint-8B
MolmoPoint-8B is a fully open VLM developed by the Allen Institute for AI (Ai2) that supports image, video, and multi-image understanding and grounding.
It has a novel pointing mechanism that improves image pointing, video pointing, and video tracking; see our technical report for details.

Note that the Hugging Face MolmoPoint model does not support training; see our GitHub repo for the training code.

Quick links:
- 💬 [Code](https://github.com/allenai/molmo2)
- 📂 [All Models](https://huggingface.co/collections/allenai/molmo_point)
- 📃 [Paper](https://allenai.org/papers/molmo_point)
- 📝 [Blog](https://allenai.org/blog/molmo_point)

30
+ ## Quick Start
31
+
32
+ ### Setup Conda Environment
33
+ ```
34
+ conda create --name transformers4571 python=3.11
35
+ conda activate transformers4571
36
+ pip install transformers==4.57.1
37
+ pip install torch pillow einops torchvision accelerate decord2
38
+ ```

## Inference
We recommend running MolmoPoint with `logits_processor=model.build_logit_processor_from_inputs(model_inputs)`
to enforce that point tokens are generated in a valid way.

In MolmoPoint, points are generated as a series of special tokens instead of coordinates,
so decoding the tokens back into points requires some additional metadata from the preprocessor.
The metadata is returned by the preprocessor when the `return_pointing_metadata` flag is set.
Then `model.extract_image_points` and `model.extract_video_points` do the decoding; they
return a list of ({image_id|timestamps}, object_id, pixel_x, pixel_y) output points.

### Image Pointing Example:

```python
from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image
import requests
import torch

checkpoint_dir = "allenai/MolmoPoint-8B"  # or path to a converted HF checkpoint

model = AutoModelForImageTextToText.from_pretrained(
    checkpoint_dir,
    trust_remote_code=True,
    dtype="auto",
    device_map="auto",
)

processor = AutoProcessor.from_pretrained(
    checkpoint_dir,
    trust_remote_code=True,
    padding_side="left",
)

image_messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Point to the eyes"},
            {"type": "image", "image": Image.open(requests.get(
                "https://picsum.photos/id/237/536/354", stream=True
            ).raw)},
        ]
    }
]

inputs = processor.apply_chat_template(
    image_messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    padding=True,
    return_pointing_metadata=True,
)
metadata = inputs.pop("metadata")
inputs = {k: v.to("cuda") for k, v in inputs.items()}

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    output = model.generate(
        **inputs,
        logits_processor=model.build_logit_processor_from_inputs(inputs),
        max_new_tokens=200,
    )

generated_tokens = output[:, inputs["input_ids"].size(1):]
generated_text = processor.post_process_image_text_to_text(
    generated_tokens, skip_special_tokens=False, clean_up_tokenization_spaces=False
)[0]
points = model.extract_image_points(
    generated_text,
    metadata["token_pooling"],
    metadata["subpatch_mapping"],
    metadata["image_sizes"],
)
print(points)
```

### Video Pointing Example:
```python
video_path = "https://storage.googleapis.com/oe-training-public/demo_videos/many_penguins.mp4"
video_messages = [
    {
        "role": "user",
        "content": [
            dict(type="text", text="Point to the penguins"),
            dict(type="video", video=video_path),
        ]
    }
]

inputs = processor.apply_chat_template(
    video_messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    padding=True,
    return_pointing_metadata=True,
)
metadata = inputs.pop("metadata")
inputs = {k: v.to("cuda") for k, v in inputs.items()}

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    output = model.generate(
        **inputs,
        logits_processor=model.build_logit_processor_from_inputs(inputs),
        max_new_tokens=200,
    )

generated_tokens = output[:, inputs["input_ids"].size(1):]
generated_text = processor.post_process_image_text_to_text(
    generated_tokens, skip_special_tokens=False, clean_up_tokenization_spaces=False
)[0]
points = model.extract_video_points(
    generated_text,
    metadata["token_pooling"],
    metadata["subpatch_mapping"],
    metadata["timestamps"],
    metadata["video_size"],
)
print(points)
```

## License and Use

This model is licensed under Apache 2.0. It is intended for research and educational use in accordance with Ai2’s Responsible Use Guidelines. This model is trained on third-party datasets that are subject to academic and non-commercial research use only. Please review the sources to determine whether this model is appropriate for your use case.