LLaVA 7B — Supervised Fine-Tuning (SFT) on Synthetic QA

Model type: Vision-Language Causal Model (text-finetuned LLaVA-1.5)
Base model: llava-hf/llava-1.5-7b-hf
License: Llama 2 Community License
Framework: Axolotl + DeepSpeed ZeRO-1 (PyTorch 2.5.1 + CUDA 12.1)


Overview

llava-7b-sft is a supervised fine-tuned version of LLaVA 1.5 7B, trained on a synthetic instruction-following dataset of question–answer pairs to enhance text understanding and reasoning.
Although it is derived from a multimodal base, this SFT run fine-tunes only the language-model component using LoRA adapters, which were later merged into the full model weights.

This model therefore supports text-only generation natively (without PEFT) and retains compatibility with the multimodal processor and vision configuration from LLaVA.
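
For context, merging LoRA adapters into the base weights is the standard PEFT merge step. A minimal sketch, assuming a hypothetical local adapter directory (not the exact script used for this release):

import torch
from peft import PeftModel
from transformers import LlavaForConditionalGeneration

# Hypothetical paths: the public base checkpoint and a local directory holding the SFT LoRA adapter
base_id = "llava-hf/llava-1.5-7b-hf"
adapter_dir = "./llava-sft-lora"  # assumption, for illustration only

base = LlavaForConditionalGeneration.from_pretrained(base_id, torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()  # fold the LoRA deltas into the weights
merged.save_pretrained("./llava-7b-sft-merged")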

Training was conducted on the Leonardo EuroHPC system using Axolotl and DeepSpeed ZeRO-1.


Training Setup

Component Specification
Objective Supervised fine-tuning (instruction-following QA)
Adapter type LoRA (merged into full model)
Precision bfloat16
Hardware 8 nodes × 2 NVIDIA A100 64 GB GPUs (16 GPUs total)
Framework Axolotl 0.6 + DeepSpeed ZeRO-1 (PyTorch 2.5.1 + CUDA 12.1)
Runtime ~24 hours
Checkpoints 2 per epoch
Vision tower Frozen during SFT
Dataset split 70% train / 30% validation

Dataset

Name: axolotl_deduplicated_synthetic_qa.jsonl
Type: Instruction-following synthetic QA dataset (Alpaca-style)

Each record contains a single-turn question and a high-quality generated answer.
This SFT data improves the model’s reasoning, language coherence, and conversational QA quality.
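
For illustration only (this is not a record from the actual dataset), an Alpaca-style JSONL entry has roughly the following shape and can be inspected with the Python standard library:

import json

# Hypothetical record following the common Alpaca convention (instruction / input / output)
line = '{"instruction": "Explain the principle of energy conservation.", "input": "", "output": "Energy can neither be created nor destroyed; it only changes form."}'
record = json.loads(line)
print(record["instruction"])
print(record["output"])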


Hyperparameters

Parameter Value
Sequence length 2048
Micro batch size 1
Gradient accumulation 4
Epochs 1
Learning rate 0.0002
LR scheduler cosine
Optimizer AdamW (8-bit)
Warmup steps 10
Weight decay 0.0
LoRA rank (r) 16
LoRA alpha 32
LoRA dropout 0.05
LoRA target modules q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Gradient checkpointing Enabled
Flash attention Enabled
Validation set size 0.3
Evals per epoch 2
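
With 16 GPUs, a micro batch size of 1 and 4 gradient-accumulation steps correspond to an effective global batch size of 1 × 4 × 16 = 64 sequences per optimizer step. The LoRA settings above map onto a PEFT configuration roughly as follows (a sketch, not the exact config emitted by Axolotl):

from peft import LoraConfig

# Mirrors the hyperparameter table above; task_type is an assumption for the causal LM head
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)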

Tokenizer & Processor

Component Description
Tokenizer type AutoTokenizer
Processor type AutoProcessor (compatible with LLaVA image+text inputs)
Pad token <pad> (ID 32001)
Chat template llava

The processor configuration allows image or text inputs; however, this release focuses on text-based supervised tuning.
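
For text-only prompts, the bundled chat template can render a USER/ASSISTANT-style prompt. A minimal sketch (the exact template output may vary with the transformers version):

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("ubitech-edg/llava-7b-sft")

# Build a text-only conversation and render it with the LLaVA chat template
messages = [
    {"role": "user", "content": [{"type": "text", "text": "Explain the principle of energy conservation."}]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
print(prompt)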


Files Included

This repository contains the fully merged model weights and all required configs for direct use with transformers:

  • config.json
  • model-*.safetensors
  • tokenizer.json
  • tokenizer_config.json
  • tokenizer.model
  • special_tokens_map.json
  • processor_config.json
  • preprocessor_config.json
  • vision_config.json
  • image_processor_config.json
  • README.md

Usage Example

To run text-based generation with this model:

import torch
from transformers import AutoProcessor, AutoModelForCausalLM

model_id = "ubitech-edg/llava-7b-sft"

processor = AutoProcessor.from_pretrained(model_id)
# Note: if the merged checkpoint keeps the LLaVA model configuration, AutoModelForCausalLM may
# not accept it; in that case load it with transformers.LlavaForConditionalGeneration instead.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

prompt = "USER: Explain the principle of energy conservation.\nASSISTANT:"
inputs = processor(text=prompt, return_tensors="pt").to(model.device)

with torch.inference_mode():
    # do_sample=True is required for temperature and top_p to take effect
    outputs = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7, top_p=0.9)

print(processor.decode(outputs[0], skip_special_tokens=True))