---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen3-8B
- google/siglip-so400m-patch14-384
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- multimodal
- olmo
- molmo
- molmo2
---

# MolmoPoint-8B
MolmoPoint-8B is a fully open VLM developed by the Allen Institute for AI (Ai2) that supports image, video, and multi-image understanding and grounding.
It has a novel pointing mechanism that improves image pointing, video pointing, and video tracking; see our technical report for details.

Note that the Hugging Face MolmoPoint model does not support training; see our GitHub repo for the training code.

Quick links:
- 💬 [Code](https://github.com/allenai/molmo2)
- 📂 [All Models](https://huggingface.co/collections/allenai/molmo_point)
- 📃 [Paper](https://allenai.org/papers/molmo_point)
- 📝 [Blog](https://allenai.org/blog/molmo_point)

## Quick Start

### Setup Conda Environment
```
conda create --name transformers4571 python=3.11
conda activate transformers4571
pip install transformers==4.57.1
pip install torch pillow einops torchvision accelerate decord2
```

## Inference
We recommend running MolmoPoint with `logits_processor=model.build_logit_processor_from_inputs(model_inputs)`
to enforce that point tokens are generated in a valid way.

In MolmoPoint, points are generated as a series of special tokens instead of coordinates, and decoding those tokens back into points requires some additional metadata. The metadata is returned by the preprocessor via `return_pointing_metadata`.

Then `model.extract_image_points` and `model.extract_video_points` do the decoding; they
return a list of ({image_id|timestamps}, object_id, pixel_x, pixel_y) output points.

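As a minimal sketch of working with the decoded output, assuming the tuple format above (the `points` list here is hypothetical example data, not real model output), the points can be grouped per object:

```python
from collections import defaultdict

# Hypothetical decoded output in the documented format:
# (image_id, object_id, pixel_x, pixel_y)
points = [
    (0, 0, 120.5, 88.0),
    (0, 0, 340.2, 92.7),
    (0, 1, 56.0, 210.3),
]

# Group the pixel locations by object_id so each pointed-at
# object can be drawn or inspected separately.
by_object = defaultdict(list)
for image_id, object_id, x, y in points:
    by_object[object_id].append((image_id, x, y))

for object_id, locs in sorted(by_object.items()):
    print(f"object {object_id}: {len(locs)} point(s)")
```

For video outputs the first tuple element is a timestamp rather than an image id, but the same grouping applies.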
### Image Pointing Example:

...

```
points = model.extract_video_points(
    ...
)
print(points)
```

## License and Use

This model is licensed under Apache 2.0. It is intended for research and educational use in accordance with Ai2's Responsible Use Guidelines. This model is trained on third-party datasets that are subject to academic and non-commercial research use only. Please review the sources to determine whether this model is appropriate for your use case.