---
library_name: transformers
license: apache-2.0
base_model: Qwen/Qwen2.5-VL-3B-Instruct
tags:
- autonomous-driving
- hazard-detection
- vision-language-model
- lora
- bitsandbytes
- nf4
datasets:
- nuScenes
pipeline_tag: image-text-to-text
---
# DriveSense-VLM
**Supervised fine-tuned vision-language model for detecting rare hazards in autonomous-driving scenes.**
DriveSense-VLM is a LoRA-fine-tuned [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)
that takes a single dashcam frame and returns structured JSON describing safety-critical
hazards: bounding box, hazard label, severity, chain-of-thought reasoning, and the
recommended ego-vehicle action.
[Open in Colab](https://colab.research.google.com/github/jayanth922/DriveSense-VLM/blob/main/notebooks/05_demo.ipynb)
[GitHub repository](https://github.com/jayanth922/DriveSense-VLM)

---
## Model details
| | |
|---|---|
| **Base model** | Qwen/Qwen2.5-VL-3B-Instruct |
| **Adapter** | LoRA (rank 32, alpha 64), merged into base weights |
| **Quantization** | bitsandbytes NF4 (4-bit), double-quant, bfloat16 compute |
| **Vision encoder** | Qwen2.5-VL ViT in fp16 (left unquantized to preserve grounding accuracy) |
| **Output schema** | JSON: `hazards[]{bbox_2d, label, severity, reasoning, action}`, `scene_summary`, `ego_context` |
| **Image resolution** | 672 × 448 px (16 rows × 24 columns = 384 patches of 28 × 28 px) |
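
The patch count in the table follows directly from the resolution. A minimal sketch of the arithmetic, assuming the 28 × 28 figure is the pixel footprint of one patch as stated above (the helper name `patch_grid` is illustrative and not part of the repo):

```python
def patch_grid(width: int, height: int, patch: int = 28) -> tuple[int, int, int]:
    """Return (rows, cols, total) for an image tiled into square patches."""
    if width % patch or height % patch:
        raise ValueError("width and height must be multiples of the patch size")
    rows, cols = height // patch, width // patch
    return rows, cols, rows * cols


print(patch_grid(672, 448))  # (16, 24, 384)
```
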
---
## Training
| | |
|---|---|
| **Dataset** | 2,754 nuScenes examples (rarity-filtered + LLM counterfactual augmentation) |
| **Epochs** | 5 |
| **Eval loss** | 0.312 |
| **LoRA targets** | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `up_proj`, `down_proj` |
| **Hardware** | Google Colab Pro A100 |
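
The adapter settings listed in this card map onto a `peft` `LoraConfig` roughly as shown below. This is a sketch reconstructed from the table values, not the repository's training script; options the card does not state (dropout, bias, task type, learning rate) are left at library defaults here and may differ from what was actually used.

```python
from peft import LoraConfig

# Rank and alpha come from "Model details"; target modules from the table above.
# With standard LoRA scaling, the adapter update is multiplied by lora_alpha / r = 2.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj"],
)
```
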
---
## Evaluation
### Detection quality
| Metric | Value |
|---|---|
| Parse rate (valid JSON) | 99.1% |
| Mean IoU | 0.550 |
| Severity classification | 82.9% accuracy |
| F1 (hazard detection) | 0.107 |
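
For reference, IoU is the intersection area of a predicted and a ground-truth box divided by their union area. The sketch below shows the standard computation for one pair of axis-aligned boxes, assuming `bbox_2d` is ordered `[x1, y1, x2, y2]` (the corner order is an assumption; this card only states the 0-1000 normalization). How predictions are matched to ground truth in the repo's evaluation is not described here.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap width/height, clamped to zero when the boxes do not intersect
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Two 400 x 400 boxes overlapping on half their width: IoU is 1/3
print(round(iou([100, 200, 500, 600], [300, 200, 700, 600]), 3))  # 0.333
```
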
### Optimization
| Metric | Value |
|---|---|
| Compression ratio | 3.1× (vs. fp16 base) |
| VRAM reduction | 68% |
| `torch.compile` speedup | 1.48× over eager |
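
The compression ratio and VRAM reduction are consistent with each other if the VRAM figure is dominated by weight storage (an assumption; this card does not say how VRAM was measured). A compression ratio of r leaves 1/r of the original footprint, which is a reduction of 1 − 1/r:

```python
def reduction_from_ratio(ratio: float) -> float:
    """Fraction of memory saved when the footprint shrinks by `ratio`."""
    return 1.0 - 1.0 / ratio


print(f"{reduction_from_ratio(3.1):.0%}")  # 68%
```
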
---
## Quick start
```python
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image
import torch

# Hugging Face Hub repo id of this model
REPO = "jayanth7111/DriveSense-VLM"

processor = AutoProcessor.from_pretrained(REPO)
# The checkpoint is stored in bitsandbytes NF4, so `bitsandbytes` and
# `accelerate` need to be installed alongside `transformers`.
model = AutoModelForImageTextToText.from_pretrained(
    REPO,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
model.eval()

PROMPT = (
    "Analyze this dashcam image for safety hazards. Return JSON with hazards array "
    "containing bbox_2d (normalized 0-1000), label, severity (low/medium/high/critical), "
    "reasoning, and action for each hazard. Include scene_summary and ego_context "
    "(weather, time_of_day, road_type)."
)

image = Image.open("dashcam.jpg").convert("RGB")
messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text", "text": PROMPT},
]}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Send inputs to the device chosen by device_map instead of hard-coding "cuda"
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=300, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt
print(processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
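
The decoded string still has to be parsed before use. The sketch below is one way to do that, under stated assumptions: `bbox_2d` is ordered `[x1, y1, x2, y2]`, the JSON object may be surrounded by extra text, and unparseable outputs are simply dropped. `parse_hazards` and the `bbox_px` key are hypothetical names, not part of the repo.

```python
import json


def parse_hazards(raw: str, width: int, height: int) -> list[dict]:
    """Parse model output and add pixel-space boxes scaled from the 0-1000 range."""
    # Keep only the outermost {...} span in case the model adds surrounding text
    start, end = raw.find("{"), raw.rfind("}")
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return []  # a small share of outputs is not valid JSON (see parse rate)
    hazards = []
    for hazard in data.get("hazards", []):
        x1, y1, x2, y2 = hazard["bbox_2d"]  # assumed order: [x1, y1, x2, y2]
        hazards.append({
            **hazard,
            "bbox_px": [
                round(x1 * width / 1000), round(y1 * height / 1000),
                round(x2 * width / 1000), round(y2 * height / 1000),
            ],
        })
    return hazards


sample = '{"hazards": [{"bbox_2d": [250, 500, 750, 1000], "label": "pedestrian_in_path"}]}'
print(parse_hazards(sample, width=672, height=448)[0]["bbox_px"])  # [168, 224, 504, 448]
```
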
---
## Intended use
- **Portfolio / research demonstration** of VLM fine-tuning, quantization, and grounding for
the autonomous-driving domain.
- **Educational** reference implementation of a structured-output VLM pipeline.
**Not intended for**: deployment in any safety-critical or production autonomous-driving system.
---
## Limitations
- **Low recall (6.1%)** — the model is conservative and frequently misses hazards present in
the scene; suitable for ranking / triage, not as a sole detector.
- **Label fragmentation** — semantically similar hazards (e.g. `pedestrian_in_path`,
`pedestrian_crossing`) are treated as distinct classes by the F1 calculator, depressing
the score.
- **Limited geographic / sensor diversity** — trained on three nuScenes blobs only; expect
degraded performance on dashcams that differ substantially in mounting, FoV, or weather.
- **No temporal context** — single-frame inference. Hazards that require motion cues (e.g.
cut-ins, pedestrian intent) are weaker.
- **Quantization noise** — NF4 reduces VRAM but introduces a small accuracy delta vs. fp16.
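
One way to gauge how much label fragmentation depresses F1 is to collapse near-synonymous labels into a canonical class before scoring. The sketch below is illustrative only: the `ALIASES` mapping and the set-based, per-image F1 are assumptions, not the repo's evaluation code.

```python
# Hypothetical alias table; the real label set may need a different grouping.
ALIASES = {
    "pedestrian_in_path": "pedestrian",
    "pedestrian_crossing": "pedestrian",
}


def canonical(label: str) -> str:
    """Map a fine-grained label to its canonical class (identity if unmapped)."""
    return ALIASES.get(label, label)


def f1(pred: set, gold: set) -> float:
    """Set-based F1 between predicted and ground-truth labels for one image."""
    true_pos = len(pred & gold)
    if true_pos == 0:
        return 0.0
    precision, recall = true_pos / len(pred), true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


pred, gold = {"pedestrian_crossing"}, {"pedestrian_in_path"}
print(f1(pred, gold))  # 0.0
print(f1({canonical(p) for p in pred}, {canonical(g) for g in gold}))  # 1.0
```
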
---
## Files
| File | Purpose |
|---|---|
| `*.safetensors` | NF4-quantized merged model weights |
| `config.json` | Model architecture + quantization config |
| `quant_config.json` | bitsandbytes quantization metadata |
| `tokenizer*`, `*.json` | Processor / tokenizer / chat template |
| `examples/*.jpg` | Sample dashcam frames for the Gradio demo |
| `README.md` | This model card |
---
## Links
- **GitHub repo**: <https://github.com/jayanth922/DriveSense-VLM>
- **Colab demo**: [`notebooks/05_demo.ipynb`](https://colab.research.google.com/github/jayanth922/DriveSense-VLM/blob/main/notebooks/05_demo.ipynb)
- **Base model**: [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)
- **Datasets**: [nuScenes](https://www.nuscenes.org/), DADA-2000
## License
Released under Apache-2.0. The base weights remain subject to the
[Qwen2.5-VL license](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct/blob/main/LICENSE).