LLaVA 7B — Multimodal Supervised Fine-Tuning (CPT-SFT)
Model type: Vision-Language Causal Model
Base model: ubitech-edg/llava-7b-cpt
License: Llama 2 Community License
Framework: Axolotl + DeepSpeed ZeRO-1 (PyTorch 2.5.1 + CUDA 12.1)
Overview
llava-7b-cpt-sft is the final multimodal supervised fine-tuned version of LLaVA 1.5 7B.
It builds upon the multimodal continual-pretrained model (ubitech-edg/llava-7b-cpt), combining rich visual grounding with instruction-following and question-answering abilities.
This stage refines both the text and image reasoning layers using synthetic QA data while retaining the full multimodal processor and vision encoder.
Training was performed on the Leonardo EuroHPC supercomputer using Axolotl and DeepSpeed ZeRO-1 in bfloat16 precision. The LoRA adapters were then merged into the final weights.
Training Setup
| Component | Specification |
|---|---|
| Objective | Multimodal supervised fine-tuning (image–text QA) |
| Base model | ubitech-edg/llava-7b-cpt |
| Adapter type | LoRA (merged into full model) |
| Precision | bfloat16 |
| Hardware | 8 nodes × 2 × NVIDIA A100 64 GB GPUs |
| Framework | Axolotl + DeepSpeed ZeRO-1 (PyTorch 2.5.1 / CUDA 12.1) |
| Runtime | ~24 hours |
| Checkpoints | 1 per epoch |
| Vision tower | Active (unfrozen multimodal processing) |
| Dataset split | 70% train / 30% validation |
Dataset
This multimodal SFT stage uses the synthetic QA dataset for text reasoning and can optionally be paired with visual data from the prior continual-pretraining stage.
| Dataset | Description |
|---|---|
| axolotl_deduplicated_synthetic_qa.jsonl | Text-based instruction-following and question-answering dataset |
| mm_captions_chat.jsonl | Image–caption dialogues, aligning visual grounding with natural language |
Together, these datasets enhance visual question answering, caption reasoning, and multimodal instruction following.
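The 70% / 30% train/validation split listed in the training setup can be illustrated with a small sketch. This is not the Axolotl implementation. The helper name `split_jsonl_records`, the seed, and the `question`/`answer` fields are assumptions made for illustration, since the record schema of the dataset files is not documented here.

```python
import json
import random


def split_jsonl_records(lines, val_fraction=0.3, seed=42):
    """Parse JSONL lines and split them into (train, validation) lists."""
    records = [json.loads(line) for line in lines if line.strip()]
    rng = random.Random(seed)  # seeded so the split is reproducible
    rng.shuffle(records)
    n_val = int(round(len(records) * val_fraction))
    return records[n_val:], records[:n_val]


# Hypothetical records: the real field names in the dataset files may differ.
sample_lines = [json.dumps({"question": f"q{i}", "answer": f"a{i}"}) for i in range(10)]
train, val = split_jsonl_records(sample_lines)
print(len(train), len(val))  # 7 3
```

With a real file, the lines could be read with `open(path, encoding="utf-8")` and passed to the same helper.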
Hyperparameters
| Parameter | Value |
|---|---|
| Sequence length | 2048 |
| Micro batch size | 1 |
| Gradient accumulation | 4 |
| Epochs | 1 |
| Learning rate | 0.00015 |
| LR scheduler | cosine |
| Optimizer | AdamW (8-bit) |
| Warmup steps | 10 |
| Weight decay | 0.0 |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Gradient checkpointing | ✅ |
| Flash attention | ❌ (disabled for multimodal stability) |
| Validation set size | 0.3 |
| Evals per epoch | 1 |
| Image size | 512 |
| Resize algorithm | bilinear |
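A few derived quantities follow from the tables above and may help when reproducing the run. The sketch below is arithmetic only, under two assumptions not stated in the card: all 16 GPUs act as data-parallel ranks (which is how ZeRO-1 is normally used), and the language model has Llama-2-7B-style dimensions (hidden size 4096, MLP size 11008, 32 layers). The parameter count covers the language-model layers only and ignores any vision-tower modules that the target names might also match.

```python
# Values copied from the Training Setup and Hyperparameters tables.
nodes, gpus_per_node = 8, 2
micro_batch_size = 1
gradient_accumulation = 4
lora_r, lora_alpha = 16, 32

# ASSUMPTION: every GPU is a data-parallel rank.
world_size = nodes * gpus_per_node
effective_batch = micro_batch_size * gradient_accumulation * world_size

# Standard LoRA scales the low-rank update by alpha / r.
lora_scaling = lora_alpha / lora_r

# ASSUMPTION: Llama-2-7B-style language model dimensions.
hidden, intermediate, num_layers = 4096, 11008, 32
module_shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, hidden),
    "v_proj": (hidden, hidden),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

# Each adapted linear layer adds r * (d_in + d_out) trainable parameters.
per_layer = sum(lora_r * (d_in + d_out) for d_in, d_out in module_shapes.values())
lora_params = per_layer * num_layers

print(effective_batch, lora_scaling, lora_params)  # 64 2.0 39976960
```

Under these assumptions, each optimizer step sees 64 sequences and the adapters add roughly 40 million trainable parameters before merging.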
Tokenizer & Processor
| Component | Description |
|---|---|
| Tokenizer type | AutoTokenizer |
| Processor type | AutoProcessor |
| Pad token | <pad> (ID 32001) |
| Chat template | llava |
The processor is fully multimodal, handling both image and text inputs with unified preprocessing.
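As a rough illustration of the `llava` prompt layout, the sketch below builds the same string that the usage example passes to the processor. The helper `build_llava_prompt` is hypothetical and not part of any library. The template bundled with the processor is authoritative and may differ in whitespace, so `processor.apply_chat_template` is the safer route when it is available.

```python
def build_llava_prompt(user_text: str, num_images: int = 1) -> str:
    """Build a single-turn prompt with one <image> placeholder per image."""
    image_tokens = "<image>\n" * num_images
    return f"USER: {image_tokens}{user_text}\nASSISTANT:"


prompt = build_llava_prompt("Describe what is happening in this image.")
print(repr(prompt))
# 'USER: <image>\nDescribe what is happening in this image.\nASSISTANT:'
```

Each `<image>` placeholder must correspond to one image passed to the processor.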
Usage Example
Perform visual question answering or image–text chat directly with transformers:
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "ubitech-edg/llava-7b-cpt-sft"

# Load the processor and model with the image-text-to-text auto class.
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

image = Image.open("example.jpg").convert("RGB")
prompt = "USER: <image>\nDescribe what is happening in this image.\nASSISTANT:"

# Move inputs to the model's device and cast floating-point tensors to bfloat16.
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.bfloat16)

with torch.inference_mode():
    # do_sample=True is needed for temperature and top_p to take effect.
    output = model.generate(**inputs, max_new_tokens=150, do_sample=True, temperature=0.7, top_p=0.9)

print(processor.decode(output[0], skip_special_tokens=True))
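Decoding the full output sequence returns the prompt together with the generated reply. A minimal sketch for keeping only the reply is shown below. The helper `extract_answer` and the sample string are hypothetical, and the approach assumes the `ASSISTANT:` marker from the prompt survives decoding.

```python
def extract_answer(decoded: str, marker: str = "ASSISTANT:") -> str:
    """Return the text after the last marker, or the whole string if absent."""
    _, found, answer = decoded.rpartition(marker)
    return answer.strip() if found else decoded.strip()


# Hypothetical decoded output, for illustration only.
decoded = "USER: \nDescribe what is happening in this image.\nASSISTANT: A dog runs on a beach."
print(extract_answer(decoded))  # A dog runs on a beach.
```

An alternative that avoids string matching is to slice off the prompt tokens before decoding, for example `output[0][inputs["input_ids"].shape[-1]:]`.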