Instructions for using AIM-ZJU/HawkLlama_8b with libraries and local apps.

Libraries

How to use AIM-ZJU/HawkLlama_8b with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="AIM-ZJU/HawkLlama_8b")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("AIM-ZJU/HawkLlama_8b")
model = AutoModelForMultimodalLM.from_pretrained("AIM-ZJU/HawkLlama_8b")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use AIM-ZJU/HawkLlama_8b with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "AIM-ZJU/HawkLlama_8b"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AIM-ZJU/HawkLlama_8b",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
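
The same request can be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_payload` and `chat` are hypothetical, and the response layout (`choices[0].message.content`) is assumed to follow the usual OpenAI-compatible chat completions schema.

```python
import json
import urllib.request


def build_chat_payload(model: str, prompt: str, image_url: str) -> dict:
    """Build the same JSON body as the curl example: one user turn with text and an image."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def chat(base_url: str, payload: dict, timeout: float = 120.0) -> str:
    """POST the payload to <base_url>/v1/chat/completions and return the reply text.

    Assumes an OpenAI-compatible response body; adjust if the server differs.
    """
    request = urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


payload = build_chat_payload(
    model="AIM-ZJU/HawkLlama_8b",
    prompt="Describe this image in one sentence.",
    image_url="https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
)
```

With the server running, `chat("http://localhost:8000", payload)` should return the model's reply. The same helper should work against any other OpenAI-compatible endpoint by changing the base URL and port.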

Use Docker

docker model run hf.co/AIM-ZJU/HawkLlama_8b

SGLang

How to use AIM-ZJU/HawkLlama_8b with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "AIM-ZJU/HawkLlama_8b" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AIM-ZJU/HawkLlama_8b",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "AIM-ZJU/HawkLlama_8b" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AIM-ZJU/HawkLlama_8b",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

HawkLlama

🤗Huggingface | 🗂️Github | 📖Technical Report

Zhejiang University, China

This is the official implementation of HawkLlama, an open-source multimodal large language model designed for real-world vision and language understanding applications. Our model features the following highlights.

HawkLlama-8B is constructed utilizing:
- Llama3-8B, the latest open-source large language model, trained on over 15 trillion tokens.
- SigLIP, an enhancement over CLIP employing sigmoid loss, which achieves superior performance in image recognition.
- An efficient vision-language connector designed to capture high-resolution details without increasing the number of visual tokens, which reduces the training overhead associated with high-resolution images.

For model training, we use the LLaVA-Pretrain dataset for pretraining and a mixed dataset, curated for instruction tuning, that contains both multimodal and language-only data for supervised fine-tuning.
HawkLlama-8B is developed on the NeMo framework, which supports 3D parallelism and leaves room to scale in future extensions.
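
To illustrate why a connector that keeps the visual token count fixed can reduce training cost, the sketch below counts patch tokens for a plain ViT-style encoder and for a tiling scheme that encodes each high-resolution crop separately. The image sizes (336, 672) and patch size (14) are illustrative assumptions, not HawkLlama's actual configuration, and the function names are hypothetical; the connector itself is described in the technical report, not here.

```python
def vit_token_count(image_size: int, patch_size: int) -> int:
    """Number of patch tokens a ViT-style encoder emits for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


def tiled_token_count(image_size: int, tile_size: int, patch_size: int) -> int:
    """Tokens produced when the image is cut into square tiles that are encoded one by one."""
    tiles_per_side = -(-image_size // tile_size)  # ceiling division
    return tiles_per_side * tiles_per_side * vit_token_count(tile_size, patch_size)


# Illustrative numbers only (assumed, not taken from the HawkLlama configuration).
base_tokens = vit_token_count(image_size=336, patch_size=14)
tiled_tokens = tiled_token_count(image_size=672, tile_size=336, patch_size=14)
print(f"single 336px view: {base_tokens} tokens")
print(f"672px image as 336px tiles: {tiled_tokens} tokens")
```

Under these assumptions, doubling the resolution through tiling quadruples the number of visual tokens the language model has to process, which is the overhead a fixed-token connector aims to avoid.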

Our model is open-source and reproducible. Please check our technical report for more details.

Setup
Model Weights
Inference
Evaluation
Demo

Setup

Create an environment and activate it.

conda create -n hawkllama python=3.10 -y
conda activate hawkllama

Clone and install this repo.

git clone https://github.com/aim-uofa/VLModel.git
cd VLModel
pip install -e .
pip install -e third_party/VLMEvalKit
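
After installation, a quick check can confirm that the packages imported by the inference example are visible in the active environment. This is a minimal sketch; the helper name `missing_packages` is hypothetical, and the package list is taken from the imports in the inference example rather than from an official requirements file.

```python
import importlib.util


def missing_packages(names: list[str]) -> list[str]:
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Top-level packages imported by the inference example (assumed to be the minimum needed).
REQUIRED = ["torch", "PIL", "HawkLlama"]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```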

Model Weights

Please refer to our HuggingFace repository to download the pretrained model weights.
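
One way to fetch the weights ahead of time is `snapshot_download` from the `huggingface_hub` package. The sketch below assumes that package is installed; the helper name `download_weights` and the target directory `checkpoints/HawkLlama_8b` are hypothetical choices, not paths required by this repository.

```python
from pathlib import Path

REPO_ID = "AIM-ZJU/HawkLlama_8b"


def download_weights(local_dir: str = "checkpoints/HawkLlama_8b") -> Path:
    """Download every file of the model repository into local_dir and return that path."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = Path(local_dir)
    target.mkdir(parents=True, exist_ok=True)
    return Path(snapshot_download(repo_id=REPO_ID, local_dir=str(target)))
```

Calling `download_weights()` downloads the full repository, so it needs network access and enough disk space for an 8B-parameter checkpoint. Passing the repository ID directly to `from_pretrained`, as in the inference example, downloads the files into the Hugging Face cache instead.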

Inference

We provide example code for inference.

import torch
from PIL import Image
from HawkLlama.model import LlavaNextProcessor, LlavaNextForConditionalGeneration
from HawkLlama.utils.conversation import conv_llava_llama_3, DEFAULT_IMAGE_TOKEN

# Load the processor and the model (bfloat16 weights), then move the model to the GPU.
processor = LlavaNextProcessor.from_pretrained("AIM-ZJU/HawkLlama_8b")

model = LlavaNextForConditionalGeneration.from_pretrained("AIM-ZJU/HawkLlama_8b", torch_dtype=torch.bfloat16, low_cpu_mem_usage=True)
model.to("cuda:0")

# Load the input image as RGB.
image_file = "assets/coin.png"
image = Image.open(image_file).convert('RGB')

# Prepend the image placeholder token to the question.
prompt = "what coin is that?"
prompt = DEFAULT_IMAGE_TOKEN + "\n" + prompt

# Build the chat prompt: one user turn followed by an empty assistant turn.
conversation = conv_llava_llama_3.copy()
user_role_ind = 0
bot_role_ind = 1
conversation.append_message(conversation.roles[user_role_ind], prompt)
conversation.append_message(conversation.roles[bot_role_ind], "")
prompt = conversation.get_prompt()

# Tokenize the prompt, preprocess the image, and cast the pixels to the model's dtype.
inputs = processor(prompt, image, return_tensors="pt").to("cuda:0")
inputs['pixel_values'] = inputs['pixel_values'].to(torch.bfloat16)

# Greedy decoding (do_sample=False) with up to 2048 new tokens.
output = model.generate(**inputs, eos_token_id=processor.tokenizer.eos_token_id, max_new_tokens=2048, do_sample=False, use_cache=True)

# Decode the full output sequence to text.
print(processor.decode(output[0], skip_special_tokens=True))
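
The conversation steps above assemble a single prompt string. As a rough illustration of what that string may look like, the sketch below builds a prompt in the standard Llama 3 Instruct chat format. It is an assumption that `conv_llava_llama_3` follows this format without a system message and that `DEFAULT_IMAGE_TOKEN` is `<image>`; check `HawkLlama/utils/conversation.py` for the template actually used. The function name `build_llama3_prompt` is hypothetical.

```python
IMAGE_TOKEN = "<image>"  # assumed value of DEFAULT_IMAGE_TOKEN


def build_llama3_prompt(user_text: str, with_image: bool = True) -> str:
    """Format one user turn plus an open assistant turn in Llama 3 Instruct style."""
    content = f"{IMAGE_TOKEN}\n{user_text}" if with_image else user_text
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{content}<|eot_id|>"
        # The assistant header is left open so generation continues from here.
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


example_prompt = build_llama3_prompt("what coin is that?")
print(example_prompt)
```

Leaving the assistant turn empty, as the inference example does with `append_message(..., "")`, is what makes the model generate the answer rather than continue the user's text.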

Evaluation

The evaluation code is adapted from the VLMEvalKit codebase.

# Single GPU
python third_party/VLMEvalKit/run.py --data MMBench_DEV_EN MMMU_DEV_VAL SEEDBench_IMG --model hawkllama_llama3_vlm --verbose
# Multiple GPUs (8 processes on one node)
torchrun --nproc-per-node=8 third_party/VLMEvalKit/run.py --data MMBench_DEV_EN MMMU_DEV_VAL SEEDBench_IMG --model hawkllama_llama3_vlm --verbose

The results are shown below:

Benchmark	HawkLlama-8B (ours)	LLaVA-Llama3-v1.1	LLaVA-Next
MMMU val	37.8	36.8	36.9
SEEDBench img	71.0	70.1	70.0
MMBench-EN dev	70.6	70.4	68.0
MMBench-CN dev	64.4	64.2	60.6
CCBench	33.9	31.6	24.7
AI2D test	65.6	70.0	67.1
ScienceQA test	76.1	72.9	70.4
HallusionBench	41.0	47.7	35.2
MMStar	43.0	45.1	38.1
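
To read the table at a glance, the short sketch below tallies which model has the highest score on each benchmark. The scores are copied from the table above; the first results column is labeled `HawkLlama-8B` here, and the variable names are hypothetical.

```python
from collections import Counter

MODELS = ("HawkLlama-8B", "LLaVA-Llama3-v1.1", "LLaVA-Next")

# Scores copied from the results table, in the same column order as MODELS.
RESULTS = {
    "MMMU val": (37.8, 36.8, 36.9),
    "SEEDBench img": (71.0, 70.1, 70.0),
    "MMBench-EN dev": (70.6, 70.4, 68.0),
    "MMBench-CN dev": (64.4, 64.2, 60.6),
    "CCBench": (33.9, 31.6, 24.7),
    "AI2D test": (65.6, 70.0, 67.1),
    "ScienceQA test": (76.1, 72.9, 70.4),
    "HallusionBench": (41.0, 47.7, 35.2),
    "MMStar": (43.0, 45.1, 38.1),
}

# For each benchmark, pick the model whose score is highest.
leaders = {
    benchmark: MODELS[max(range(len(scores)), key=scores.__getitem__)]
    for benchmark, scores in RESULTS.items()
}
wins = Counter(leaders.values())

for benchmark, model in leaders.items():
    print(f"{benchmark}: {model}")
print(dict(wins))
```

On these numbers the first column leads on six of the nine benchmarks, while LLaVA-Llama3-v1.1 leads on AI2D, HallusionBench, and MMStar.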

Demo

Feel free to try our demo!

Acknowledgements

We express our appreciation to the following projects for their outstanding contributions in academia and code development: LLaVA, NeMo, VLMEvalKit and xtuner.

License

HawkLlama is released under the Apache 2.0 license.

Safetensors

Model size: 8B params
Tensor type: BF16
