Libraries

How to use SeanScripts/Molmo-72B-0924-nf4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="SeanScripts/Molmo-72B-0924-nf4", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("SeanScripts/Molmo-72B-0924-nf4", trust_remote_code=True, dtype="auto")

Local Apps

vLLM

How to use SeanScripts/Molmo-72B-0924-nf4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "SeanScripts/Molmo-72B-0924-nf4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "SeanScripts/Molmo-72B-0924-nf4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

SGLang

How to use SeanScripts/Molmo-72B-0924-nf4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "SeanScripts/Molmo-72B-0924-nf4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "SeanScripts/Molmo-72B-0924-nf4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "SeanScripts/Molmo-72B-0924-nf4" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use SeanScripts/Molmo-72B-0924-nf4 with Docker Model Runner:
```
docker model run hf.co/SeanScripts/Molmo-72B-0924-nf4
```

This is allenai/Molmo-72B-0924 quantized to NF4 with double quantization, using BitsAndBytes.

The vision backbone modules were not quantized to NF4. They are stored in FP16, but for now they need to be run in FP32 because of a layer norm precision loss issue. They should also be offloaded to CPU, or you'll run out of memory on 48 GB of VRAM.

This model just barely fits in 48 GB of VRAM (tested on 2 x 3090, where it generates about 6 tok/s). It probably can't handle a very long maximum sequence length, but at least it works.
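
To see why the fit is so tight, here is a rough back-of-the-envelope estimate of the quantized weight footprint. It is a sketch under stated assumptions, not a measurement: the block sizes (64 weights per block, 256 blocks per second-level group) are the defaults described in the QLoRA paper, and the 72e9 parameter count for the language model is an approximation. The function names are illustrative.

```python
def nf4_bits_per_param(block_size: int = 64, second_level_block: int = 256) -> float:
    """Approximate storage cost per weight for NF4 with double quantization.

    Assumed layout: 4 bits per weight, one 8-bit quantized scale per block of
    weights, and one 32-bit constant per group of second-level blocks.
    """
    return 4 + 8 / block_size + 32 / (block_size * second_level_block)


def weight_footprint_gib(num_params: float, bits_per_param: float) -> float:
    """Convert a parameter count and a per-parameter bit cost into GiB."""
    return num_params * bits_per_param / 8 / 2**30


# Assumption: roughly 72e9 quantized language-model parameters.
ASSUMED_LM_PARAMS = 72e9

bits = nf4_bits_per_param()
footprint = weight_footprint_gib(ASSUMED_LM_PARAMS, bits)
print(f"{bits:.3f} bits per parameter")
print(f"{footprint:.1f} GiB for the quantized weights alone")
```

Under these assumptions the quantized weights alone take roughly 35 GiB. The unquantized embeddings, the KV cache, activations, and CUDA overhead all have to share whatever is left of the 48 GB.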

With 2 cards of 24 GB VRAM each, it only works with a very specific device map. On a single card with 48 GB VRAM, I imagine it works much more smoothly.

Example usage for image captioning with 2 x 24 GB VRAM GPUs:

import torch
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig, StopStringCriteria
from PIL import Image
import time

# For 2 x 24 GB. If using 1 x 48 GB or more (lucky you), you can just use device_map="auto"
device_map = {
    "model.vision_backbone": "cpu", # Seems to be required to not run out of memory at 48 GB
    "model.transformer.wte": 0,
    "model.transformer.ln_f": 0,
    "model.transformer.ff_out": 1,
}
# For 2 x 24 GB, this works for *only* 38 or 39. Any higher or lower and it'll either only work for 1 token of output or fail completely.
switch_point = 38 # layer index to switch to second GPU
device_map |= {f"model.transformer.blocks.{i}": 0 for i in range(0, switch_point)}
device_map |= {f"model.transformer.blocks.{i}": 1 for i in range(switch_point, 80)}

model_name = "SeanScripts/Molmo-72B-0924-nf4"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    use_safetensors=True,
    device_map=device_map,
    trust_remote_code=True, # Required for Molmo at the moment.
)
model.model.vision_backbone.float() # vision backbone needs to be in FP32 for this

processor = AutoProcessor.from_pretrained(
    model_name,
    trust_remote_code=True, # Required for Molmo at the moment.
)

torch.cuda.empty_cache()

image = Image.open("test.png")
inputs = processor.process(images=image, text="Caption this image.")
inputs = {k: v.to("cuda:0").unsqueeze(0) for k,v in inputs.items()}
prompt_tokens = inputs["input_ids"].size(1)
print("Prompt tokens:", prompt_tokens)

t0 = time.time()
output = model.generate_from_batch(
    inputs,
    generation_config=GenerationConfig(
        max_new_tokens=256,
    ),
    stopping_criteria=[StopStringCriteria(tokenizer=processor.tokenizer, stop_strings=["<|endoftext|>"])],
    tokenizer=processor.tokenizer,
)
t1 = time.time()
total_time = t1 - t0
generated_tokens = output.size(1) - prompt_tokens
tokens_per_second = generated_tokens / total_time  # throughput (tokens per second), not time per token
print(f"Generated {generated_tokens} tokens in {total_time:.3f} s ({tokens_per_second:.3f} tok/s)")

response = processor.tokenizer.decode(output[0, prompt_tokens:], skip_special_tokens=True)
print(response)

torch.cuda.empty_cache()
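
Since the split point matters so much on 2 x 24 GB, it can help to check the device map before loading any weights. Below is a minimal sketch that rebuilds the same map as the example above and counts how many modules land on each device. The helper names `build_device_map` and `summarize` are hypothetical and not part of Transformers or Accelerate. The 80-layer default mirrors the `range(switch_point, 80)` in the example.

```python
from collections import Counter


def build_device_map(switch_point: int, num_layers: int = 80) -> dict:
    """Rebuild the example's device map for a given switch point."""
    if not 0 < switch_point < num_layers:
        raise ValueError("switch_point must fall strictly between 0 and num_layers")
    device_map = {
        "model.vision_backbone": "cpu",  # kept on CPU, as in the example
        "model.transformer.wte": 0,
        "model.transformer.ln_f": 0,
        "model.transformer.ff_out": 1,
    }
    # Blocks before the switch point go to GPU 0, the rest to GPU 1.
    for i in range(num_layers):
        device_map[f"model.transformer.blocks.{i}"] = 0 if i < switch_point else 1
    return device_map


def summarize(device_map: dict) -> Counter:
    """Count how many top-level modules are assigned to each device."""
    return Counter(device_map.values())


device_map = build_device_map(38)
summary = summarize(device_map)
for device, count in summary.items():
    print(f"device {device}: {count} modules")
```

With `switch_point = 38`, GPU 0 holds 40 entries (38 blocks plus `wte` and `ln_f`) and GPU 1 holds 43 (42 blocks plus `ff_out`). This only counts modules; it says nothing about actual memory use, which still has to be tested on the hardware.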

Safetensors
Model size: 76B params
Tensor type: F16, F32

Model tree for SeanScripts/Molmo-72B-0924-nf4
Base model: Qwen/Qwen2-72B
Finetuned: allenai/Molmo-72B-0924
Quantized (5): this model