DriveSense-VLM
SFT-optimized vision-language model for autonomous-vehicle rare hazard detection.
DriveSense-VLM is a LoRA-fine-tuned Qwen2.5-VL-3B-Instruct that takes a single dashcam frame and returns structured JSON describing safety-critical hazards: bounding box, hazard label, severity, chain-of-thought reasoning, and the recommended ego-vehicle action.
Model details
| Field | Value |
|---|---|
| Base model | Qwen/Qwen2.5-VL-3B-Instruct |
| Adapter | LoRA (rank 32, alpha 64), merged into base weights |
| Quantization | bitsandbytes NF4 (4-bit), double-quant, bfloat16 compute |
| Vision encoder | Qwen2.5-VL ViT in fp16 (left unquantized for grounding accuracy) |
| Output schema | JSON: hazards[]{bbox_2d, label, severity, reasoning, action}, scene_summary, ego_context |
| Image resolution | 672 × 448 (16h × 24w = 384 patches at 28×28 patch size) |
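
The patch count in the image-resolution row follows from dividing each side by the patch size. A minimal sketch of that arithmetic; the function name `patch_grid` is illustrative and not part of the repository:

```python
def patch_grid(width: int, height: int, patch: int = 28) -> tuple[int, int, int]:
    """Return (rows, cols, total) patches for an image whose sides are multiples of `patch`."""
    if width % patch or height % patch:
        raise ValueError("image sides must be multiples of the patch size")
    rows, cols = height // patch, width // patch
    return rows, cols, rows * cols


rows, cols, total = patch_grid(width=672, height=448)
print(rows, cols, total)  # 16 24 384
```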
Training
| Field | Value |
|---|---|
| Dataset | 2,754 nuScenes examples (rarity-filtered + LLM counterfactual augmentation) |
| Epochs | 5 |
| Eval loss | 0.312 |
| LoRA targets | q_proj, k_proj, v_proj, o_proj, up_proj, down_proj |
| Hardware | Google Colab Pro A100 |
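
As a rough illustration, the adapter hyperparameters reported in this card (rank 32, alpha 64, six target projections) could be expressed as a `peft` `LoraConfig`. This is a sketch, not the training script: `task_type` is an assumption, and values the card does not state (dropout, bias handling) are left at library defaults. The `peft` import is deferred so the snippet runs without it installed.

```python
# Values below come from the model card; task_type is an assumption.
LORA_KWARGS = {
    "r": 32,
    "lora_alpha": 64,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj"],
    "task_type": "CAUSAL_LM",  # assumed, not stated in the card
}


def lora_scaling(r: int, lora_alpha: int) -> float:
    """Standard LoRA scaling factor applied to the low-rank update: alpha / r."""
    return lora_alpha / r


def build_lora_config():
    """Build a peft LoraConfig from the kwargs above (requires `pip install peft`)."""
    from peft import LoraConfig

    return LoraConfig(**LORA_KWARGS)


print(lora_scaling(LORA_KWARGS["r"], LORA_KWARGS["lora_alpha"]))  # 2.0
```

With rank 32 and alpha 64, the low-rank update is scaled by 2.0 before being added to the frozen weights.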
Evaluation
Detection quality
| Metric | Value |
|---|---|
| Parse rate (valid JSON) | 99.1% |
| Mean IoU | 0.550 |
| Severity classification | 82.9% accuracy |
| F1 (hazard detection) | 0.107 |
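
For readers unfamiliar with the metrics, the following sketch shows how IoU between two boxes and F1 from match counts are conventionally computed. The card does not say how predictions are matched to ground truth (IoU threshold, label agreement), so the counts passed to `f1_from_counts` are hypothetical and this is not the project's evaluation code.

```python
def iou(box_a, box_b) -> float:
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Two 200x200 boxes overlapping in a 100x100 region.
print(round(iou([100, 100, 300, 300], [200, 200, 400, 400]), 3))  # 0.143
# Hypothetical counts: few hits and many misses give a low F1.
print(round(f1_from_counts(tp=2, fp=1, fn=8), 3))  # 0.308
```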
Optimization
| Metric | Value |
|---|---|
| Compression ratio | 3.1× (vs. fp16 base) |
| VRAM reduction | 68% |
| torch.compile speedup | 1.48× over eager |
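
The compression ratio and the VRAM reduction are roughly consistent with each other. A short sketch of the arithmetic, under the simplifying assumption that memory use scales only with weight size (activations and the KV cache are ignored):

```python
def reduction_from_ratio(compression_ratio: float) -> float:
    """Fraction of memory saved if footprint shrinks by `compression_ratio`."""
    return 1.0 - 1.0 / compression_ratio


print(f"{reduction_from_ratio(3.1):.1%}")  # 67.7%
```

A 3.1× smaller footprint implies about a 67.7% saving, close to the reported 68%. The ratio is below the 4× that a pure 16-bit to 4-bit conversion would give, plausibly because the vision encoder stays in fp16 and the quantization constants add overhead.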
Quick start
```python
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image
import torch

# Hugging Face repo id of this model card.
REPO = "jayanth7111/DriveSense-VLM"

processor = AutoProcessor.from_pretrained(REPO)
model = AutoModelForImageTextToText.from_pretrained(
    REPO,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
model.eval()

PROMPT = (
    "Analyze this dashcam image for safety hazards. Return JSON with hazards array "
    "containing bbox_2d (normalized 0-1000), label, severity (low/medium/high/critical), "
    "reasoning, and action for each hazard. Include scene_summary and ego_context "
    "(weather, time_of_day, road_type)."
)

image = Image.open("dashcam.jpg").convert("RGB")
messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text", "text": PROMPT},
]}]

# Render the chat template to text, then tokenize the text and image together.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=300, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
print(processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
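
The decoded string still has to be parsed and its boxes mapped back to pixels. The sketch below assumes `bbox_2d` is ordered `[x1, y1, x2, y2]` on the 0-1000 scale named in the prompt; the card does not state the coordinate order. The helper names and the sample output are hypothetical, not real model output.

```python
import json


def to_pixels(bbox_2d, width: int, height: int, scale: int = 1000):
    """Map a box normalized to 0..scale onto pixel coordinates (assumes x1, y1, x2, y2 order)."""
    x1, y1, x2, y2 = bbox_2d
    return (
        round(x1 * width / scale),
        round(y1 * height / scale),
        round(x2 * width / scale),
        round(y2 * height / scale),
    )


def parse_hazards(raw: str, width: int, height: int) -> list:
    """Parse the model's JSON reply; return [] when the reply is not a valid JSON object."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, dict):
        return []
    return [
        {**hazard, "bbox_px": to_pixels(hazard["bbox_2d"], width, height)}
        for hazard in data.get("hazards", [])
    ]


# Hypothetical reply, shaped like the output schema in the model details.
sample = json.dumps({
    "hazards": [{
        "bbox_2d": [250, 500, 500, 750],
        "label": "pedestrian_crossing",
        "severity": "high",
        "reasoning": "placeholder",
        "action": "placeholder",
    }],
    "scene_summary": "placeholder",
    "ego_context": {},
})

for hazard in parse_hazards(sample, width=672, height=448):
    print(hazard["label"], hazard["severity"], hazard["bbox_px"])
# pedestrian_crossing high (168, 224, 336, 336)
```

Returning an empty list on a parse failure keeps downstream code simple; callers that need to track the parse rate would count those failures separately.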
Intended use
- Portfolio / research demonstration of VLM fine-tuning, quantization, and grounding for the autonomous-driving domain.
- Educational reference implementation of a structured-output VLM pipeline.
Not intended for: deployment in any safety-critical or production autonomous-driving system.
Limitations
- Low recall (6.1%): the model is conservative and frequently misses hazards present in the scene; suitable for ranking / triage, not as a sole detector.
- Label fragmentation: semantically similar hazards (e.g. `pedestrian_in_path`, `pedestrian_crossing`) are treated as distinct classes by the F1 calculator, depressing the score.
- Limited geographic / sensor diversity: trained on three nuScenes blobs only; expect degraded performance on dashcams that differ substantially in mounting, FoV, or weather.
- No temporal context: inference is single-frame, so hazards that depend on motion cues (e.g. cut-ins, pedestrian intent) are detected less reliably.
- Quantization noise: NF4 reduces VRAM but introduces a small accuracy delta vs. fp16.
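
One way to gauge how much label fragmentation costs is to map near-synonymous labels onto a shared class before scoring. The sketch below is illustrative only: the `CANONICAL` mapping is hypothetical, boxes are ignored, and the project's actual F1 calculator may work differently.

```python
from collections import Counter

# Hypothetical mapping; only these two labels are named in the card.
CANONICAL = {
    "pedestrian_in_path": "pedestrian",
    "pedestrian_crossing": "pedestrian",
}


def canonicalize(label: str) -> str:
    """Map a fine-grained label to its canonical class, leaving unknown labels unchanged."""
    return CANONICAL.get(label, label)


def label_f1(pred_labels, gold_labels, normalize: bool = False) -> float:
    """Label-only F1 for one frame, optionally after canonicalizing both sides."""
    if normalize:
        pred_labels = [canonicalize(label) for label in pred_labels]
        gold_labels = [canonicalize(label) for label in gold_labels]
    pred, gold = Counter(pred_labels), Counter(gold_labels)
    tp = sum((pred & gold).values())  # multiset intersection = matched labels
    fp = sum(pred.values()) - tp
    fn = sum(gold.values()) - tp
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


pred = ["pedestrian_in_path"]
gold = ["pedestrian_crossing"]
print(label_f1(pred, gold))                  # 0.0
print(label_f1(pred, gold, normalize=True))  # 1.0
```

Reporting both the strict and the normalized score makes it visible how much of the gap comes from labeling granularity rather than missed detections.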
Files
| File | Purpose |
|---|---|
| `*.safetensors` | NF4-quantized merged model weights |
| `config.json` | Model architecture + quantization config |
| `quant_config.json` | bitsandbytes quantization metadata |
| `tokenizer*`, `*.json` | Processor / tokenizer / chat template |
| `examples/*.jpg` | Sample dashcam frames for the Gradio demo |
| `README.md` | This model card |
Links
- GitHub repo: https://github.com/jayanth922/DriveSense-VLM
- Colab demo: `notebooks/05_demo.ipynb`
- Base model: Qwen/Qwen2.5-VL-3B-Instruct
- Datasets: nuScenes, DADA-2000
License
Apache-2.0. Inherits the Qwen2.5-VL license for the base weights.