Instructions to use Xkev/Llama-3.2V-11B-cot with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use Xkev/Llama-3.2V-11B-cot with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="Xkev/Llama-3.2V-11B-cot")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("Xkev/Llama-3.2V-11B-cot")
model = AutoModelForMultimodalLM.from_pretrained("Xkev/Llama-3.2V-11B-cot")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
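inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# max_new_tokens=40 gives only a short reply; raise it to see the full step-by-step output
outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, skipping the prompt tokens
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))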

Local Apps

vLLM

How to use Xkev/Llama-3.2V-11B-cot with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Xkev/Llama-3.2V-11B-cot"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Xkev/Llama-3.2V-11B-cot",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
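
The same request can also be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_payload` and `post_chat` are hypothetical and not part of vLLM, and the network call is left commented out because it assumes the server started above is listening on localhost:8000.

```python
import json
import urllib.request


def build_chat_payload(model: str, prompt: str, image_url: str) -> dict:
    """Build an OpenAI-style chat request body with one text part and one image part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def post_chat(base_url: str, payload: dict) -> dict:
    """POST the payload to <base_url>/v1/chat/completions and return the parsed JSON."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Same model, prompt, and image URL as the curl example above.
payload = build_chat_payload(
    "Xkev/Llama-3.2V-11B-cot",
    "Describe this image in one sentence.",
    "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
)

# Assumes the vLLM server is running locally on port 8000:
# reply = post_chat("http://localhost:8000", payload)
# print(reply["choices"][0]["message"]["content"])
```

Pointing `post_chat` at `http://localhost:30000` should work the same way for the SGLang server, since both expose the same OpenAI-compatible route.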

Use Docker

docker model run hf.co/Xkev/Llama-3.2V-11B-cot

SGLang

How to use Xkev/Llama-3.2V-11B-cot with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Xkev/Llama-3.2V-11B-cot" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Xkev/Llama-3.2V-11B-cot",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Xkev/Llama-3.2V-11B-cot" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Xkev/Llama-3.2V-11B-cot",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use Xkev/Llama-3.2V-11B-cot with Docker Model Runner:
```
docker model run hf.co/Xkev/Llama-3.2V-11B-cot
```

Model Card for Llama-3.2V-11B-cot

Llama-3.2V-11B-cot is a visual language model capable of spontaneous, systematic reasoning.

The model was proposed in LLaVA-CoT: Let Vision Language Models Reason Step-by-Step.

Our model is built upon meta-llama/Llama-3.2-11B-Vision-Instruct. Llama 3.2 is licensed under the Llama 3.2 Community License, Copyright © Meta Platforms, Inc. Use of our model must comply with Meta’s Acceptable Use Policy.

Model Details

License: apache-2.0
Finetuned from model: meta-llama/Llama-3.2-11B-Vision-Instruct

Code: https://github.com/PKU-YuanGroup/LLaVA-CoT

Benchmark Results

MMStar	MMBench	MMVet	MathVista	AI2D	Hallusion	Average
57.6	75.0	60.3	54.8	85.7	47.8	63.5
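
As a quick sanity check, the Average column appears to be the unweighted mean of the six benchmark scores. A minimal sketch of that arithmetic (the dictionary simply restates the table; the unweighted-mean interpretation is an assumption):

```python
# Scores copied from the table above.
scores = {
    "MMStar": 57.6,
    "MMBench": 75.0,
    "MMVet": 60.3,
    "MathVista": 54.8,
    "AI2D": 85.7,
    "Hallusion": 47.8,
}

# Unweighted mean over the six benchmarks, rounded to one decimal place.
average = sum(scores.values()) / len(scores)
print(round(average, 1))  # 63.5
```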

Reproduction

To reproduce our results, use VLMEvalKit with the following settings.

Parameter	Value
do_sample	True
temperature	0.6
top_p	0.9
max_new_tokens	2048

You may change them in this file (lines 80-83), and modify max_new_tokens wherever it appears in the file.

Note: We follow the same settings as Llama-3.2-11B-Vision-Instruct, except that we extend the max_new_tokens to 2048.
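
For reference, the settings in the table map directly onto standard Hugging Face `generate()` keyword arguments. The sketch below only collects them in a dictionary; the name `GENERATION_KWARGS` is hypothetical, and how VLMEvalKit wires these values internally is not shown here.

```python
# Sampling settings from the table above, expressed as generate() keyword arguments.
GENERATION_KWARGS = {
    "do_sample": True,       # sample instead of greedy decoding
    "temperature": 0.6,
    "top_p": 0.9,            # nucleus sampling threshold
    "max_new_tokens": 2048,  # extended so the step-by-step output is not truncated
}

# Assuming `model` and `inputs` were prepared as in the Transformers example:
# outputs = model.generate(**inputs, **GENERATION_KWARGS)
```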

After you get the results, filter the model output and keep only the text between <CONCLUSION> and </CONCLUSION>.

In theory this should make no difference, but empirically we observe some performance difference because the judge model, GPT-4o, can sometimes be inaccurate.

When only the text between <CONCLUSION> and </CONCLUSION> is kept, most answers can be extracted directly by VLMEvalKit's own matching, which is much less biased than relying on the judge.
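
The filtering step can be sketched with a regular expression. This is a minimal illustration, not the authors' script: the function name `extract_conclusion` is hypothetical, and falling back to the full output when no tags are found is an assumption.

```python
import re

# Non-greedy match so only the first <CONCLUSION>...</CONCLUSION> pair is captured;
# DOTALL lets the conclusion span multiple lines.
_CONCLUSION_RE = re.compile(r"<CONCLUSION>(.*?)</CONCLUSION>", re.DOTALL)


def extract_conclusion(output: str) -> str:
    """Return the text inside the first <CONCLUSION> block, stripped of whitespace."""
    match = _CONCLUSION_RE.search(output)
    if match is None:
        # Assumption: keep the whole output if the model did not emit the tags.
        return output.strip()
    return match.group(1).strip()


sample = "Some reasoning text.\n<CONCLUSION>\nThe answer is B.\n</CONCLUSION>"
print(extract_conclusion(sample))  # The answer is B.
```

Applying this function to each prediction before scoring leaves only the final answer for the evaluator to match.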

How to Get Started with the Model

You can use the inference code for Llama-3.2-11B-Vision-Instruct.

Training Details

Training Data

The model is trained on the LLaVA-CoT-100k dataset.

Training Procedure

The model is fine-tuned with llama-recipes using the following settings. Using the same settings should reproduce our results.

Parameter	Value
FSDP	enabled
lr	1e-5
num_epochs	3
batch_size_training	4
use_fast_kernels	True
run_validation	False
batching_strategy	padding
context_length	4096
gradient_accumulation_steps	1
gradient_clipping	False
gradient_clipping_threshold	1.0
weight_decay	0.0
gamma	0.85
seed	42
use_fp16	False
mixed_precision	True
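
To make two of these settings concrete, the sketch below computes the learning rate used in each epoch and the effective batch size per optimizer step. It assumes that `gamma` is a multiplicative decay applied once per epoch (a StepLR-style schedule) and that `batch_size_training` is a per-device value; the GPU count of 8 is purely hypothetical, since the number of devices is not stated here.

```python
def lr_per_epoch(base_lr: float, gamma: float, num_epochs: int) -> list[float]:
    """Learning rate for each epoch under per-epoch multiplicative decay."""
    return [base_lr * gamma**epoch for epoch in range(num_epochs)]


def effective_batch_size(per_device_batch: int, grad_accum_steps: int, world_size: int) -> int:
    """Samples contributing to one optimizer step across all devices."""
    return per_device_batch * grad_accum_steps * world_size


# Values from the table: lr=1e-5, gamma=0.85, num_epochs=3.
schedule = lr_per_epoch(1e-5, 0.85, 3)
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr = {lr:.3e}")

# batch_size_training=4 and gradient_accumulation_steps=1 come from the table;
# world_size=8 is a hypothetical GPU count used only for illustration.
print(effective_batch_size(4, 1, 8))
```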

Bias, Risks, and Limitations

The model may generate biased or offensive content, similar to other VLMs, due to limitations in the training data. Technically, the model's performance in aspects like instruction following still falls short of leading industry models.