Instructions for using ubitech-edg/llava-7b-cpt-sft with libraries and local apps.

Libraries

How to use ubitech-edg/llava-7b-cpt-sft with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="ubitech-edg/llava-7b-cpt-sft")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("ubitech-edg/llava-7b-cpt-sft")
model = AutoModelForMultimodalLM.from_pretrained("ubitech-edg/llava-7b-cpt-sft", device_map="auto")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use ubitech-edg/llava-7b-cpt-sft with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "ubitech-edg/llava-7b-cpt-sft"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ubitech-edg/llava-7b-cpt-sft",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
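
The same request can be issued from Python. The following is a minimal sketch using only the standard library; `build_chat_payload` and `post_chat` are hypothetical helper names, and the base URL assumes the default `vllm serve` address used in the curl call above.

```python
import json
import urllib.request

MODEL_ID = "ubitech-edg/llava-7b-cpt-sft"


def build_chat_payload(prompt, image_url, model=MODEL_ID):
    """Build an OpenAI-style chat request with one text part and one image part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def post_chat(payload, base_url="http://localhost:8000"):
    """POST the payload to /v1/chat/completions and return the parsed JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Usage (requires a running server, so it is left commented out here):
# payload = build_chat_payload("Describe this image in one sentence.", "<image URL>")
# reply = post_chat(payload)
# print(reply["choices"][0]["message"]["content"])
```

Keeping payload construction separate from the network call makes the request body easy to inspect or log before it is sent.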

Use Docker

docker model run hf.co/ubitech-edg/llava-7b-cpt-sft

SGLang

How to use ubitech-edg/llava-7b-cpt-sft with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "ubitech-edg/llava-7b-cpt-sft" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ubitech-edg/llava-7b-cpt-sft",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
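
The curl call above references a remote image. To send a local file instead, OpenAI-compatible servers commonly accept a base64 `data:` URL in the `image_url` field; whether a given server version does so is an assumption to verify. The sketch below shows only the encoding step, and `image_to_data_url` is a hypothetical helper name.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path):
    """Encode a local image file as a base64 data URL."""
    mime_type, _ = mimetypes.guess_type(str(path))
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"not a recognised image file: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string can be placed in the `"url"` field of the `image_url` content part in place of the HTTPS link.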

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "ubitech-edg/llava-7b-cpt-sft" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ubitech-edg/llava-7b-cpt-sft",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

LLaVA 7B — Multimodal Supervised Fine-Tuning (CPT-SFT)

Model type: Vision-Language Causal Model
Base model: ubitech-edg/llava-7b-cpt
License: Llama 2 Community License
Framework: Axolotl + DeepSpeed ZeRO-1 (PyTorch 2.5.1 + CUDA 12.1)

Overview

llava-7b-cpt-sft is the final multimodal supervised fine-tuned version of LLaVA 1.5 7B.
It builds upon the multimodal continual-pretrained model (ubitech-edg/llava-7b-cpt), combining rich visual grounding with instruction-following and question-answering abilities.

This stage refines both the text and image reasoning layers using synthetic QA data while retaining the full multimodal processor and vision encoder.

Training was performed on the Leonardo EuroHPC supercomputer using Axolotl and DeepSpeed ZeRO-1 in bfloat16 precision. After training, the LoRA adapters were merged into the final weights.
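
Merging a LoRA adapter folds the low-rank update into the dense weight, so no adapter is needed at inference time. In the standard LoRA formulation the merged weight is W + (alpha / r) · B · A. The toy sketch below illustrates that arithmetic in plain Python; it is not the merge code used for this model, and the matrices are made-up values.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A), the merged dense weight."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # shape: out_features x in_features
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example: a 2x2 weight with a rank-1 adapter (illustrative values only).
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[0.5, -0.5]]   # rank x in_features
lora_b = [[1.0], [2.0]]  # out_features x rank
merged = merge_lora(weight, lora_a, lora_b, alpha=2, rank=1)
print(merged)  # [[2.0, -1.0], [2.0, -1.0]]
```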

Training Setup

| Component | Specification |
|---|---|
| Objective | Multimodal supervised fine-tuning (image–text QA) |
| Base model | `ubitech-edg/llava-7b-cpt` |
| Adapter type | LoRA (merged into full model) |
| Precision | bfloat16 |
| Hardware | 8 nodes × 2 × NVIDIA A100 64 GB GPUs |
| Framework | Axolotl + DeepSpeed ZeRO-1 (PyTorch 2.5.1 / CUDA 12.1) |
| Runtime | ~24 hours |
| Checkpoints | 1 per epoch |
| Vision tower | Active (unfrozen multimodal processing) |
| Dataset split | 70% train / 30% validation |

Dataset

This multimodal SFT stage uses the synthetic QA dataset for text reasoning and can optionally be paired with image–caption data from the prior continual-pretraining stage.

| Dataset | Description |
|---|---|
| `axolotl_deduplicated_synthetic_qa.jsonl` | Text-based instruction-following and question-answering dataset |
| `mm_captions_chat.jsonl` | Image–caption dialogues, aligning visual grounding with natural language |

Together, these datasets enhance visual question answering, caption reasoning, and multimodal instruction following.
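
The 70% / 30% train/validation split listed in the training setup can be pictured as a seeded shuffle followed by a cut. The sketch below is not Axolotl's own splitting code; the seed, the helper names, and the record fields are hypothetical and stand in for lines of a JSONL file.

```python
import json
import random


def read_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def split_records(records, val_fraction=0.3, seed=42):
    """Shuffle records deterministically and split off a validation fraction."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Hypothetical records; real data would come from read_jsonl("<dataset>.jsonl").
records = [{"id": i, "question": f"q{i}", "answer": f"a{i}"} for i in range(10)]
train, val = split_records(records)
print(len(train), len(val))  # 7 3
```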

Hyperparameters

| Parameter | Value |
|---|---|
| Sequence length | 2048 |
| Micro batch size | 1 |
| Gradient accumulation | 4 |
| Epochs | 1 |
| Learning rate | 0.00015 |
| LR scheduler | cosine |
| Optimizer | AdamW (8-bit) |
| Warmup steps | 10 |
| Weight decay | 0.0 |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| LoRA target modules | `q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj` |
| Gradient checkpointing | ✅ |
| Flash attention | ❌ (disabled for multimodal stability) |
| Validation set size | 0.3 |
| Evals per epoch | 1 |
| Image size | 512 |
| Resize algorithm | bilinear |
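
A few derived quantities follow from these values. The sketch below computes the effective batch size (assuming all 16 GPUs from the training setup act as data-parallel workers), the LoRA scaling factor alpha / r, and a linear-warmup cosine schedule of the usual form. The total step count is not stated in the model card, so `TOTAL_STEPS` is a hypothetical placeholder.

```python
import math

MICRO_BATCH_SIZE = 1
GRAD_ACCUM_STEPS = 4
NUM_GPUS = 8 * 2  # 8 nodes x 2 GPUs, from the training setup
LEARNING_RATE = 0.00015
WARMUP_STEPS = 10
LORA_RANK = 16
LORA_ALPHA = 32

# Samples per optimizer step, assuming every GPU is a data-parallel worker.
effective_batch = MICRO_BATCH_SIZE * GRAD_ACCUM_STEPS * NUM_GPUS

# Scaling applied to the low-rank update in the standard LoRA formulation.
lora_scale = LORA_ALPHA / LORA_RANK


def cosine_lr(step, total_steps, base_lr=LEARNING_RATE, warmup=WARMUP_STEPS):
    """Linear warmup to base_lr, then cosine decay to zero at total_steps."""
    if step < warmup:
        return base_lr * step / max(1, warmup)
    progress = (step - warmup) / max(1, total_steps - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


print(effective_batch, lora_scale)  # 64 2.0

TOTAL_STEPS = 1000  # hypothetical; the real step count is not stated
for step in (0, 5, 10, 505, 1000):
    print(step, cosine_lr(step, TOTAL_STEPS))
```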

Tokenizer & Processor

| Component | Description |
|---|---|
| Tokenizer type | `AutoTokenizer` |
| Processor type | `AutoProcessor` |
| Pad token | `<pad>` (ID 32001) |
| Chat template | `llava` |

The processor is fully multimodal, handling both image and text inputs with unified preprocessing.
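
For manual prompting, the prompt string in the usage example follows a `USER: ... ASSISTANT:` layout with an `<image>` placeholder. The sketch below builds a single-turn prompt in that layout; `build_prompt` is a hypothetical helper, and the processor's own chat template remains the authoritative source for the exact format.

```python
IMAGE_TOKEN = "<image>"


def build_prompt(question, num_images=1):
    """Build a single-turn prompt with one <image> placeholder per image."""
    image_block = "".join(f"{IMAGE_TOKEN}\n" for _ in range(num_images))
    return f"USER: {image_block}{question.strip()}\nASSISTANT:"


print(build_prompt("Describe what is happening in this image."))
```

Passing `num_images=0` yields a text-only prompt with no placeholder.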

Usage Example

Perform visual question answering or image–text chat directly with transformers:

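import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "ubitech-edg/llava-7b-cpt-sft"

# LLaVA 1.5 checkpoints are loaded with LlavaForConditionalGeneration.
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

image = Image.open("example.jpg").convert("RGB")
prompt = "USER: <image>\nDescribe what is happening in this image.\nASSISTANT:"

# Move inputs to the model's device and cast floating-point tensors to bfloat16.
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.bfloat16)

with torch.inference_mode():
    # do_sample=True is needed for temperature and top_p to take effect.
    output = model.generate(**inputs, max_new_tokens=150, do_sample=True, temperature=0.7, top_p=0.9)

# The decoded text contains the prompt followed by the model's answer.
print(processor.decode(output[0], skip_special_tokens=True))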
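
Because the full sequence is decoded, the printed text contains the prompt as well as the answer. The sketch below keeps only the text after the last `ASSISTANT:` marker; `extract_answer` is a hypothetical helper and the sample string is invented for illustration.

```python
def extract_answer(decoded, marker="ASSISTANT:"):
    """Return the text after the last marker, or the whole string if absent."""
    _, found, tail = decoded.rpartition(marker)
    return tail.strip() if found else decoded.strip()


# Invented sample of what a decoded sequence might look like.
decoded = "USER: \nDescribe what is happening in this image.\nASSISTANT: A dog runs on a beach."
print(extract_answer(decoded))  # A dog runs on a beach.
```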

Model size: 7B params (Safetensors)
Tensor type: BF16

Model tree for ubitech-edg/llava-7b-cpt-sft

Base model: llava-hf/llava-1.5-7b-hf
Adapter: ubitech-edg/llava-7b-cpt
Adapter: this model