BLIP Fine-Tuned for Traffic Navigation Captioning
Model card for the BLIP image-captioning model fine-tuned with a 3-stage progressive LoRA workflow for traffic navigation captions.
Model Details
- Model name (local): final_model
- Base model: Salesforce/blip-image-captioning-base (~248M parameters)
- Fine-tuning method: LoRA (Low-Rank Adaptation), 3-stage progressive (vision encoder → text decoder → joint)
- Framework: PyTorch + Hugging Face Transformers
- Files in repo:
model.safetensors, tokenizer.json, tokenizer_config.json, vocab.txt, preprocessor_config.json, config.json, generation_config.json, special_tokens_map.json
Short Description
This model was adapted from BLIP to generate grounded, navigation-style captions for traffic scenes using a parameter-efficient, three-stage LoRA fine-tuning procedure. The approach keeps the majority of the base weights frozen while adding and training small low-rank adapters in attention and projection layers.
Intended Use
- Primary: Research and prototyping of image-to-text captioning for traffic/navigation scenarios.
- Secondary: Integration into navigation-assist systems, dataset analysis, or as a baseline for further fine-tuning.
Limitations and Risks
- Trained on a small, domain-specific dataset (427 images); may not generalize to unseen cities, weather conditions, or camera viewpoints.
- Captions are not guaranteed to be safety- or privacy-compliant; do not rely on them for life-critical navigation decisions.
- The model may hallucinate objects or spatial relations; verify with downstream modules or human oversight when used in production.
Training Data
- Dataset name: Traffic Navigation Caption Dataset (Vijayawada, Andhra Pradesh, India)
- Size: 427 images (341 train / 42 val / 44 test)
- Annotations: COCO-style JSON with two caption levels: (1) a global scene description, and (2) grounded navigation captions with region references.
- Data license: Not specified here; include the dataset license in the repo if redistributing.
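To make the two-level annotation scheme concrete, below is an illustrative COCO-style entry. The field names, file name, and region format are assumptions for illustration only; the actual dataset schema may differ.

```python
import json

# Hypothetical COCO-style annotation entry showing the two caption levels
# described above; exact field names in the real dataset may differ.
annotation = {
    "images": [
        {"id": 1, "file_name": "vijayawada_0001.jpg", "width": 1280, "height": 720}
    ],
    "annotations": [
        {   # level 1: global scene description
            "id": 10, "image_id": 1,
            "caption": "A busy two-lane road with vehicles and a pedestrian crossing ahead."
        },
        {   # level 2: grounded navigation caption with a region reference
            "id": 11, "image_id": 1,
            "caption": "An auto-rickshaw in the left lane is slowing near the crossing.",
            "region": [412, 388, 180, 150]   # assumed [x, y, width, height] in pixels
        },
    ],
}

print(json.dumps(annotation, indent=2))
```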
Fine-tuning Setup (3-stage LoRA)
Stage 1 (Vision encoder - ViT)
- Target modules: qkv projections
- Rank: 16, Alpha: 32, Dropout: 0.05
- Trainable params: ~589,824 (≈0.24%)
- Epochs: 10, LR: 5e-5
Stage 2 (Text decoder)
- Target modules: query, value
- Rank: 32, Alpha: 64, Dropout: 0.05
- Trainable params: ~2,359,296 (≈0.95%)
- Epochs: 8, LR: 3e-5
Stage 3 (Joint fine-tuning)
- Target: combined adapters on both vision and text modules
- Rank: 16, Alpha: 32, Dropout: 0.05
- Trainable params: ~1,769,472 (≈0.71%)
- Epochs: 6, LR: 1e-5
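The three stage configurations above can be sketched with the Hugging Face peft library. This is a minimal sketch under two assumptions not confirmed by this card: that peft was the adapter framework used, and that the target module names match BLIP's implementation ("qkv" for the ViT encoder's fused projection; "query"/"value" in the text decoder's attention).

```python
from peft import LoraConfig

# Stage 1: adapters on the ViT encoder's fused qkv projection (assumed name "qkv")
stage1 = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                    target_modules=["qkv"])

# Stage 2: adapters on the text decoder's attention query/value projections
stage2 = LoraConfig(r=32, lora_alpha=64, lora_dropout=0.05,
                    target_modules=["query", "value"])

# Stage 3: joint rank-16 adapters on both sets of modules
stage3 = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                    target_modules=["qkv", "query", "value"])
```

Each stage would apply its config via `get_peft_model(model, config)` before training.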
Optimizer: AdamW
Batch: 4 (effective 16 with gradient accumulation)
Mixed precision: FP16 enabled
Hardware used (report): NVIDIA Tesla T4 (15GB)
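The reported trainable-parameter counts are consistent with standard LoRA accounting (parameters per adapted weight matrix = r·(d_in + d_out)), assuming ViT-base/BERT-base shapes: 12 layers, hidden size 768, fused qkv output 3×768, and query/value adapted in both self- and cross-attention of the decoder. A quick arithmetic check:

```python
# Sanity check of the reported trainable-parameter counts, assuming
# ViT-base / BERT-base shapes (12 layers, hidden 768, fused qkv output 3*768).
def lora_params(r, d_in, d_out):
    # one adapter pair: A (r x d_in) plus B (d_out x r)
    return r * d_in + d_out * r

layers, hidden = 12, 768

# Stage 1: rank 16 on the encoder's fused qkv projection (768 -> 2304) per layer
stage1 = layers * lora_params(16, hidden, 3 * hidden)

# Stage 2: rank 32 on query and value (768 -> 768) in both self- and
# cross-attention of each decoder layer -> 4 adapted matrices per layer
stage2 = layers * 4 * lora_params(32, hidden, hidden)

# Stage 3: rank-16 adapters on both module sets combined
stage3 = (layers * lora_params(16, hidden, 3 * hidden)
          + layers * 4 * lora_params(16, hidden, hidden))

print(stage1, stage2, stage3)  # 589824 2359296 1769472
```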
Evaluation
Test set: 44 held-out images
Key metrics (mean):
- BLEU-1: Base 0.01936 → Fine-tuned 0.02158 (+11.45%)
- BLEU-4: Base 0.00787 → Fine-tuned 0.01033 (+31.27%)
- METEOR: Base 0.069998 → Fine-tuned 0.074931 (+7.05%)
- ROUGE-L: Base 0.12089 → Fine-tuned 0.13612 (+12.60%)
- Semantic similarity: Base 0.11853 → Fine-tuned 0.12770 (+7.74%)
Other stats (means): average caption length increased from 8.86 to 9.77 tokens; inference time decreased from ~451.6 ms to ~395.0 ms per image on the reported hardware.
For full metric JSON outputs see base_metrics.json and finetuned_metrics.json (included in the repository).
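The percentage gains above are relative improvements over the base model. They can be recomputed from the reported means; since those means are themselves rounded, the result may differ from the table in the last digit:

```python
# Relative improvement of the fine-tuned model over the base model,
# computed from the reported (rounded) mean scores.
def rel_improvement(base, finetuned):
    return 100.0 * (finetuned - base) / base

print(f"BLEU-4:  {rel_improvement(0.00787, 0.01033):+.2f}%")   # roughly +31.3%
print(f"ROUGE-L: {rel_improvement(0.12089, 0.13612):+.2f}%")   # roughly +12.6%
```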
Example: Load & Inference
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load the fine-tuned model and its processor from the current directory
model = BlipForConditionalGeneration.from_pretrained(".")
processor = BlipProcessor.from_pretrained(".")

# BLIP expects RGB input; convert in case the image has an alpha channel
image = Image.open("path/to/traffic.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

# Beam search tends to produce more fluent navigation captions
outputs = model.generate(**inputs, max_new_tokens=150, num_beams=5)
caption = processor.decode(outputs[0], skip_special_tokens=True)
print(caption)
Files
- model.safetensors: model weights
- tokenizer and vocab files: tokenizer.json, tokenizer_config.json, vocab.txt, special_tokens_map.json
- config files: config.json, generation_config.json, preprocessor_config.json
- evaluation outputs: base_metrics.json, finetuned_metrics.json
Recommended Citation
If you use this model in research, cite the BLIP paper and reference the LoRA approach. Example:
Li et al., "BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation" (ICML 2022)
Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models" (ICLR 2022)
You may also cite the internal project report included in the repository: COMPLETE_METHODOLOGY_AND_RESULTS.txt.
License
License for the model weights and tokenizer is not specified here. Add a LICENSE file to the repo with the chosen license (e.g., Apache-2.0, CC-BY-4.0, or a non-commercial license) before publishing.
Contact
For questions, contact the model author/maintainer (add contact info in the repo or in the Hugging Face model settings).
Last updated: 2025-12-20