---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen3-8B
- google/siglip-so400m-patch14-384
pipeline_tag: image-text-to-text
tags:
- multimodal
- olmo
- molmo
- molmo2
- molmo_point
---

# MolmoPoint-8B

MolmoPoint-Img-8B is a fully open VLM developed by the Allen Institute for AI (Ai2) that is specialized for GUI pointing.
As a specialized model, it supports only single-image input with instruction-like queries, and outputs a single point.
See MolmoPoint-8B for a generalist model.
MolmoPoint-Img-8B points using grounding tokens instead of text coordinates; see our paper for details.

Note that the Hugging Face MolmoPoint model does not support training; see our GitHub repo for the training code.

Quick links:
- 💬 [Code](https://github.com/allenai/molmo2)
- 📂 [All Models](https://huggingface.co/collections/allenai/molmo_point)
- 📃 [Paper](https://allenai.org/papers/molmo_point)
- 📝 [Blog](https://allenai.org/blog/molmo_point)

32
+ ## Quick Start
33
+
34
+ ### Setup Conda Environment
35
+ ```
36
+ conda create --name transformers4571 python=3.11
37
+ conda activate transformers4571
38
+ pip install transformers==4.57.1
39
+ pip install torch pillow einops torchvision accelerate decord2
40
+ ```

## Inference
We recommend running MolmoPoint with `logits_processor=model.build_logit_processor_from_inputs(model_inputs)`
to ensure that point tokens are generated in a valid format.

In MolmoPoint, points are generated as a series of special tokens instead of
coordinates; decoding these tokens back into points requires some additional
metadata from the preprocessor.
The metadata is returned by the preprocessor when the `return_pointing_metadata` flag is set.
Then call `model.extract_image_points` to do the decoding; it returns a list of (image_id, object_id, pixel_x, pixel_y) output points.

Note that this model is trained only on single-image GUI screenshot input.

### Image Pointing Example

```python
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

checkpoint_dir = "allenai/MolmoPoint-Img-8B"  # or path to a converted HF checkpoint

model = AutoModelForImageTextToText.from_pretrained(
    checkpoint_dir,
    trust_remote_code=True,
    dtype="auto",
    device_map="auto",
)

processor = AutoProcessor.from_pretrained(
    checkpoint_dir,
    trust_remote_code=True,
    padding_side="left",
)

image_messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "open microsoft edge"},
            {"type": "image", "image": "https://assets.techrepublic.com/uploads/2020/08/windows-10-start-menu.jpg"},
        ],
    }
]

inputs = processor.apply_chat_template(
    image_messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    padding=True,
    return_pointing_metadata=True,
)
metadata = inputs.pop("metadata")
inputs = {k: v.to("cuda") for k, v in inputs.items()}

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    output = model.generate(
        **inputs,
        logits_processor=model.build_logit_processor_from_inputs(inputs),
        max_new_tokens=200,
    )

generated_tokens = output[:, inputs["input_ids"].size(1):]
generated_text = processor.post_process_image_text_to_text(
    generated_tokens, skip_special_tokens=False, clean_up_tokenization_spaces=False
)[0]
points = model.extract_image_points(
    generated_text,
    metadata["token_pooling"],
    metadata["subpatch_mapping"],
    metadata["image_sizes"],
)

print(points)
# points as a list of [object_id, image_num, x, y]
# expected: [[1, 0, np.float64(250.42718446601944), np.float64(274.73276923076924)]]
```


## License and Use

This model is licensed under Apache 2.0. It is intended for research and educational use in accordance with Ai2's Responsible Use Guidelines. This model is trained on third-party datasets that are subject to academic and non-commercial research use only. Please review the sources to determine if this model is appropriate for your use case.