BLIP Fine-Tuned for Traffic Navigation Captioning
Model card for the BLIP image-captioning model fine-tuned with a 3-stage progressive LoRA workflow for traffic navigation captions.
Model Details
- Model name (local): final_model
- Base model: Salesforce/blip-image-captioning-base (~248M parameters)
- Fine-tuning method: LoRA (Low-Rank Adaptation), 3-stage progressive (vision encoder → text decoder → joint)
- Framework: PyTorch + Hugging Face Transformers
- Files in repo:
model.safetensors, tokenizer.json, tokenizer_config.json, vocab.txt, preprocessor_config.json, config.json, generation_config.json, special_tokens_map.json
Short Description
This model was adapted from BLIP to generate grounded, navigation-style captions for traffic scenes using a parameter-efficient, three-stage LoRA fine-tuning procedure. The approach keeps the majority of the base weights frozen while adding and training small low-rank adapters in attention and projection layers.
Intended Use
- Primary: Research and prototyping of image-to-text captioning for traffic/navigation scenarios.
- Secondary: Integration into navigation-assist systems, dataset analysis, or as a baseline for further fine-tuning.
Limitations and Risks
- Trained on a small, domain-specific dataset (427 images); may not generalize to unseen cities, weather conditions, or camera viewpoints.
- Captions are not guaranteed to be safety- or privacy-compliant; do not rely on them for life-critical navigation decisions.
- The model may hallucinate objects or spatial relations; verify with downstream modules or human oversight when used in production.
Training Data
- Dataset name: Traffic Navigation Caption Dataset (Vijayawada, Andhra Pradesh, India)
- Size: 427 images (341 train / 42 val / 44 test)
- Annotations: COCO-style JSON with two caption levels: (1) a global scene description, and (2) grounded navigation captions with region references.
- Data license: Not specified here; include the dataset license in the repo if redistributing.
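To make the two-level annotation scheme concrete, below is an illustrative COCO-style entry. The field names, file name, and region format are assumptions for illustration only; the actual dataset schema may differ.

```python
import json

# Hypothetical COCO-style annotation entry showing the two caption levels
# described above; exact field names in the real dataset may differ.
annotation = {
    "images": [
        {"id": 1, "file_name": "vijayawada_0001.jpg", "width": 1280, "height": 720}
    ],
    "annotations": [
        {   # level 1: global scene description
            "id": 10, "image_id": 1,
            "caption": "A busy two-lane road with vehicles and a pedestrian crossing ahead."
        },
        {   # level 2: grounded navigation caption with a region reference
            "id": 11, "image_id": 1,
            "caption": "An auto-rickshaw in the left lane is slowing near the crossing.",
            "region": [412, 388, 180, 150]   # assumed [x, y, width, height] in pixels
        },
    ],
}

print(json.dumps(annotation, indent=2))
```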
Fine-tuning Setup (3-stage LoRA)
Stage 1 (Vision encoder - ViT)
- Target modules: qkv projections
- Rank: 16, Alpha: 32, Dropout: 0.05
- Trainable params: ~589,824 (≈0.24%)
- Epochs: 10, LR: 5e-5
Stage 2 (Text decoder)
- Target modules: query, value
- Rank: 32, Alpha: 64, Dropout: 0.05
- Trainable params: ~2,359,296 (≈0.95%)
- Epochs: 8, LR: 3e-5
Stage 3 (Joint fine-tuning)
- Target: combined adapters on both vision and text modules
- Rank: 16, Alpha: 32, Dropout: 0.05
- Trainable params: ~1,769,472 (≈0.71%)
- Epochs: 6, LR: 1e-5
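The three stage configurations above can be sketched with the Hugging Face peft library. This is a minimal sketch under two assumptions not confirmed by this card: that peft was the adapter framework used, and that the target module names match BLIP's implementation ("qkv" for the ViT encoder's fused projection; "query"/"value" in the text decoder's attention).

```python
from peft import LoraConfig

# Stage 1: adapters on the ViT encoder's fused qkv projection (assumed name "qkv")
stage1 = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                    target_modules=["qkv"])

# Stage 2: adapters on the text decoder's attention query/value projections
stage2 = LoraConfig(r=32, lora_alpha=64, lora_dropout=0.05,
                    target_modules=["query", "value"])

# Stage 3: joint rank-16 adapters on both sets of modules
stage3 = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                    target_modules=["qkv", "query", "value"])
```

Each stage would apply its config via `get_peft_model(model, config)` before training.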
Optimizer: AdamW
Batch: 4 (effective 16 with gradient accumulation)
Mixed precision: FP16 enabled
Hardware used (report): NVIDIA Tesla T4 (15GB)
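The reported trainable-parameter counts are consistent with standard LoRA accounting (parameters per adapted weight matrix = r·(d_in + d_out)), assuming ViT-base/BERT-base shapes: 12 layers, hidden size 768, fused qkv output 3×768, and query/value adapted in both self- and cross-attention of the decoder. A quick arithmetic check:

```python
# Sanity check of the reported trainable-parameter counts, assuming
# ViT-base / BERT-base shapes (12 layers, hidden 768, fused qkv output 3*768).
def lora_params(r, d_in, d_out):
    # one adapter pair: A (r x d_in) plus B (d_out x r)
    return r * d_in + d_out * r

layers, hidden = 12, 768

# Stage 1: rank 16 on the encoder's fused qkv projection (768 -> 2304) per layer
stage1 = layers * lora_params(16, hidden, 3 * hidden)

# Stage 2: rank 32 on query and value (768 -> 768) in both self- and
# cross-attention of each decoder layer -> 4 adapted matrices per layer
stage2 = layers * 4 * lora_params(32, hidden, hidden)

# Stage 3: rank-16 adapters on both module sets combined
stage3 = (layers * lora_params(16, hidden, 3 * hidden)
          + layers * 4 * lora_params(16, hidden, hidden))

print(stage1, stage2, stage3)  # 589824 2359296 1769472
```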
Evaluation
Test set: 44 held-out images
Key metrics (mean):
- BLEU-1: Base 0.01936 → Fine-tuned 0.02158 (+11.45%)
- BLEU-4: Base 0.00787 → Fine-tuned 0.01033 (+31.27%)
- METEOR: Base 0.069998 → Fine-tuned 0.074931 (+7.05%)
- ROUGE-L: Base 0.12089 → Fine-tuned 0.13612 (+12.60%)
- Semantic similarity: Base 0.11853 → Fine-tuned 0.12770 (+7.74%)
Other stats (means): average caption length increased from 8.86 to 9.77 tokens; inference time decreased from ~451.6 ms to ~395.0 ms per image on the reported hardware.
For full metric JSON outputs see base_metrics.json and finetuned_metrics.json (included in the repository).
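The percentage gains above are relative improvements over the base model. They can be recomputed from the reported means; since those means are themselves rounded, the result may differ from the table in the last digit:

```python
# Relative improvement of the fine-tuned model over the base model,
# computed from the reported (rounded) mean scores.
def rel_improvement(base, finetuned):
    return 100.0 * (finetuned - base) / base

print(f"BLEU-4:  {rel_improvement(0.00787, 0.01033):+.2f}%")   # roughly +31.3%
print(f"ROUGE-L: {rel_improvement(0.12089, 0.13612):+.2f}%")   # roughly +12.6%
```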
Example: Load & Inference
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load the fine-tuned model and its processor from the current directory
model = BlipForConditionalGeneration.from_pretrained(".")
processor = BlipProcessor.from_pretrained(".")

# BLIP expects RGB input; convert in case the image has an alpha channel
image = Image.open("path/to/traffic.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

# Beam search tends to produce more fluent navigation captions
outputs = model.generate(**inputs, max_new_tokens=150, num_beams=5)
caption = processor.decode(outputs[0], skip_special_tokens=True)
print(caption)
Files
- model.safetensors: model weights
- tokenizer and vocab files: tokenizer.json, tokenizer_config.json, vocab.txt, special_tokens_map.json
- config files: config.json, generation_config.json, preprocessor_config.json
- evaluation outputs: base_metrics.json, finetuned_metrics.json
Recommended Citation
If you use this model in research, cite the BLIP paper and reference the LoRA approach. Example:
Li et al., "BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation" (ICML 2022)
Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models" (ICLR 2022)
You may also cite the internal project report included in the repository: COMPLETE_METHODOLOGY_AND_RESULTS.txt.
License
License for the model weights and tokenizer is not specified here. Add a LICENSE file to the repo with the chosen license (e.g., Apache-2.0, CC-BY-4.0, or a non-commercial license) before publishing.
Contact
For questions, contact the model author/maintainer (add contact info in the repo or in the Hugging Face model settings).
Last updated: 2025-12-20