Running on MacOS - issues with getting projector built.

by FiditeNemini - opened Jul 1, 2024

Hey folks, I've been trying to run this model on macOS and am running into some difficulties.

The shards load OK and the chat template works fine, but I'm running into issues when the vision tower loads. To get the initial loading working, I had to tweak the code you supplied to get around the flash_attn issue, as follows:

import os
from unittest.mock import patch
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.dynamic_module_utils import get_imports
from PIL import Image
import warnings

# enable debug logging while troubleshooting (warnings are left on)
transformers.logging.set_verbosity_debug()
# transformers.logging.disable_progress_bar()
# warnings.filterwarnings('ignore')

def fixed_get_imports(model_name: str | os.PathLike) -> list[str]:
    """Workaround for running on macOS, where flash_attn is not available."""
    imports = get_imports(model_name)
    # Drop flash_attn only for the modeling file, and only if it is listed
    # (list.remove raises ValueError when the entry is absent).
    if str(model_name).endswith("/modeling_llava_qwen2.py") and "flash_attn" in imports:
        imports.remove("flash_attn")
    return imports

# set device
torch.set_default_device('cpu')  # CUDA is not available on macOS

model_name = 'cognitivecomputations/dolphin-vision-72b'

with patch("transformers.dynamic_module_utils.get_imports", fixed_get_imports):
    # create model
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(
        model_name,
        trust_remote_code=True)

# text prompt
prompt = 'Describe this image in detail'

messages = [
    {"role": "user", "content": f'<image>\n{prompt}'}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

print(text)

text_chunks = [tokenizer(chunk).input_ids for chunk in text.split('<image>')]
input_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1], dtype=torch.long).unsqueeze(0)

# image, sample images can be found in images folder
image = Image.open('/Downloads/test.jpeg')
image_tensor = model.process_images([image], model.config).to(dtype=model.dtype)

# generate
output_ids = model.generate(
    input_ids,
    images=image_tensor,
    max_new_tokens=2048,
    use_cache=True)[0]

print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())

Everything works until it loads the google/siglip model (google/siglip-so400m-patch14-384/model.safetensors), at which point I get this error while debugging:

- This IS expected if you are initializing SigLipVisionModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing SigLipVisionModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of SigLipVisionModel were initialized from the model checkpoint at /Users/willdee/Documents/Projects/llama.cpp/models/siglip-so400m-patch14-384.
If your task is similar to the task the model of the checkpoint was trained on, you can already use SigLipVisionModel for predictions without further training.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token.As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.

Would appreciate it if you could perhaps advise on what I can try to get around this.

Thanks much!
Will
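
A note on the last lines of the log above: they appear to be warnings from generation about a missing attention mask and pad token, not a traceback. One thing that may be worth trying is to build the mask explicitly alongside the input ids. The sketch below is only an illustration under stated assumptions: `build_inputs` and `toy_encode` are hypothetical helpers, the `-200` placeholder is taken from the script in the post, and the sketch generalizes the script's single `<image>` split to any number of placeholders. It uses plain Python lists so it runs without torch.

```python
IMAGE_TOKEN_INDEX = -200  # placeholder id used in the script from the post


def build_inputs(text, encode, image_token_index=IMAGE_TOKEN_INDEX):
    """Split text on '<image>', encode each chunk, and join with the placeholder.

    encode is any callable mapping a string to a list of token ids
    (in the posted script: lambda s: tokenizer(s).input_ids).
    Returns (input_ids, attention_mask) as plain Python lists.
    """
    chunks = [encode(chunk) for chunk in text.split("<image>")]
    input_ids = list(chunks[0])
    for chunk in chunks[1:]:
        input_ids.append(image_token_index)
        input_ids.extend(chunk)
    # A single, unpadded sequence: every position is real input.
    attention_mask = [1] * len(input_ids)
    return input_ids, attention_mask


def toy_encode(s):
    """Hypothetical stand-in tokenizer: one id per character."""
    return [ord(c) for c in s]


ids, mask = build_inputs("ab<image>\ncd", toy_encode)
print(ids)   # [97, 98, -200, 10, 99, 100]
print(mask)  # [1, 1, 1, 1, 1, 1]
```

In the posted script, both lists would be wrapped with `torch.tensor(..., dtype=torch.long).unsqueeze(0)`, and the mask passed to `model.generate` as `attention_mask=...` together with `pad_token_id=tokenizer.eos_token_id`. Whether this model's remote-code `generate` forwards `attention_mask` is an assumption that should be checked against its modeling file.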
