Quantized with NF4 double quantization from allenai/Molmo-72B-0924 using BitsAndBytes.
The vision backbone modules were not quantized to NF4. They are stored in FP16, but for now they need to be run in FP32 because of a layer norm precision loss issue. They should also be offloaded to the CPU, or you'll run out of memory with 48 GB of VRAM.
This model just barely fits in 48 GB (tested on 2 x 3090, where it gets about 6 tok/s). With so little memory headroom, the usable max sequence length is probably not very high, but at least it works.
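To give a rough sense of why the fit is so tight, here is a back-of-the-envelope estimate of the quantized weight size. It is only a sketch under stated assumptions: a nominal 72 billion quantized parameters, a 64-weight quantization block, and a 256-block group for the second-level (double) quantization. These block sizes are the usual NF4 defaults, but they are not confirmed for this checkpoint. The helper names are hypothetical.

```python
def nf4_bits_per_param(block_size=64, second_level_block=256, double_quant=True):
    """Approximate storage cost per weight for NF4, in bits.

    Each weight takes 4 bits, plus a share of the per-block scaling constant.
    Block sizes are assumed defaults, not values read from this checkpoint.
    """
    if double_quant:
        # Scaling constants are themselves quantized to 8 bits, with one
        # FP32 constant shared by every `second_level_block` blocks.
        overhead = 8 / block_size + 32 / (block_size * second_level_block)
    else:
        # One FP32 scaling constant per block of weights.
        overhead = 32 / block_size
    return 4 + overhead


def weight_gigabytes(num_params, bits_per_param):
    """Convert a parameter count and a per-parameter bit cost to GB (1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NOMINAL_PARAMS = 72e9  # assumption: nominal size taken from the model name

bits = nf4_bits_per_param()
print(f"{bits:.3f} bits/param, about {weight_gigabytes(NOMINAL_PARAMS, bits):.1f} GB of weights")
```

This counts only the quantized weights. Anything left in higher precision, the activations, and the KV cache all come on top of that, which is consistent with the model only barely fitting in 48 GB.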
For 2 cards with 24 GB VRAM, this requires a very specific device map to work. For single cards with 48 GB VRAM, I imagine it works much more smoothly.
Example usage for image captioning with 2 x 24 GB VRAM GPUs:
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig, StopStringCriteria
from PIL import Image
import time
# For 2 x 24 GB. If using 1 x 48 GB or more (lucky you), you can just use device_map="auto"
device_map = {
    "model.vision_backbone": "cpu",  # Seems to be required to not run out of memory at 48 GB
    "model.transformer.wte": 0,
    "model.transformer.ln_f": 0,
    "model.transformer.ff_out": 1,
}
# For 2 x 24 GB, this works for *only* 38 or 39. Any higher or lower and it'll either only work for 1 token of output or fail completely.
switch_point = 38 # layer index to switch to second GPU
device_map |= {f"model.transformer.blocks.{i}": 0 for i in range(0, switch_point)}
device_map |= {f"model.transformer.blocks.{i}": 1 for i in range(switch_point, 80)}
model_name = "SeanScripts/Molmo-72B-0924-nf4"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    use_safetensors=True,
    device_map=device_map,
    trust_remote_code=True,  # Required for Molmo at the moment.
)
model.model.vision_backbone.float()  # vision backbone needs to be in FP32 for this
processor = AutoProcessor.from_pretrained(
    model_name,
    trust_remote_code=True,  # Required for Molmo at the moment.
)
torch.cuda.empty_cache()
image = Image.open("test.png")
inputs = processor.process(images=image, text="Caption this image.")
# Move each input tensor to the first GPU and add a batch dimension of size 1
inputs = {k: v.to("cuda:0").unsqueeze(0) for k, v in inputs.items()}
prompt_tokens = inputs["input_ids"].size(1)
print("Prompt tokens:", prompt_tokens)
t0 = time.time()
output = model.generate_from_batch(
    inputs,
    generation_config=GenerationConfig(
        max_new_tokens=256,
    ),
    stopping_criteria=[StopStringCriteria(tokenizer=processor.tokenizer, stop_strings=["<|endoftext|>"])],
    tokenizer=processor.tokenizer,
)
t1 = time.time()
total_time = t1 - t0
generated_tokens = output.size(1) - prompt_tokens
tokens_per_second = generated_tokens / total_time  # throughput, not time per token
print(f"Generated {generated_tokens} tokens in {total_time:.3f} s ({tokens_per_second:.3f} tok/s)")
response = processor.tokenizer.decode(output[0, prompt_tokens:], skip_special_tokens=True)
print(response)
torch.cuda.empty_cache()
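Since the split between the two GPUs is so sensitive to the switch point, it can help to build the device map with a small function and check how the modules are spread before loading anything. The sketch below reproduces the same mapping as the example above using only the standard library. The function name and the validation are my own additions, not part of the model's code.

```python
from collections import Counter


def build_two_gpu_device_map(switch_point=38, num_blocks=80):
    """Build the 2 x 24 GB device map, splitting transformer blocks at `switch_point`.

    Blocks with index below `switch_point` go to GPU 0, the rest to GPU 1.
    """
    if not 0 < switch_point < num_blocks:
        raise ValueError("switch_point must leave at least one block on each GPU")
    device_map = {
        "model.vision_backbone": "cpu",  # offloaded, as in the example above
        "model.transformer.wte": 0,
        "model.transformer.ln_f": 0,
        "model.transformer.ff_out": 1,
    }
    for i in range(num_blocks):
        device_map[f"model.transformer.blocks.{i}"] = 0 if i < switch_point else 1
    return device_map


device_map = build_two_gpu_device_map(switch_point=38)
# Count how many top-level entries land on each device (0, 1, or "cpu")
print(Counter(device_map.values()))
```

The resulting dictionary can be passed as `device_map` to `from_pretrained` exactly as in the example. Trying 39 instead of 38 then only means changing one argument.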