
BLIP Fine-Tuned for Traffic Navigation Captioning

Model card for the BLIP image-captioning model fine-tuned with a 3-stage progressive LoRA workflow for traffic navigation captions.

Model Details

  • Model name (local): final_model
  • Base model: Salesforce/blip-image-captioning-base (~248M parameters)
  • Fine-tuning method: LoRA (Low-Rank Adaptation), 3-stage progressive (vision encoder → text decoder → joint)
  • Framework: PyTorch + Hugging Face Transformers
  • Files in repo: model.safetensors, tokenizer.json, tokenizer_config.json, vocab.txt, preprocessor_config.json, config.json, generation_config.json, special_tokens_map.json

Short Description

This model was adapted from BLIP to generate grounded, navigation-style captions for traffic scenes using a parameter-efficient, three-stage LoRA fine-tuning procedure. The approach keeps the majority of the base weights frozen while adding and training small low-rank adapters in attention and projection layers.

Intended Use

  • Primary: Research and prototyping of image-to-text captioning for traffic/navigation scenarios.
  • Secondary: Integration into navigation-assist systems, dataset analysis, or as a baseline for further fine-tuning.

Limitations and Risks

  • Trained on a small, domain-specific dataset (427 images); may not generalize to unseen cities, weather conditions, or camera viewpoints.
  • Captions are not guaranteed to be safety- or privacy-compliant; do not rely on them for life-critical navigation decisions.
  • The model may hallucinate objects or spatial relations; verify with downstream modules or human oversight when used in production.

Training Data

  • Dataset name: Traffic Navigation Caption Dataset (Vijayawada, Andhra Pradesh, India)
  • Size: 427 images (341 train / 42 val / 44 test)
  • Annotations: COCO-style JSON with two caption levels: (1) a global scene description, and (2) grounded navigation captions with region references.
  • Data license: Not specified here β€” include the dataset license in the repo if redistributing.
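The two-level, COCO-style annotation scheme described above can be parsed with plain `json`. The snippet below is a minimal sketch; the file name, field names (in particular the `level` key distinguishing global from navigation captions), and caption text are hypothetical stand-ins, not the repository's actual schema:

```python
import json

# Hypothetical minimal COCO-style annotation file following the
# two-level scheme: one global scene caption plus grounded
# navigation captions per image.
coco = {
    "images": [{"id": 1, "file_name": "scene_001.jpg"}],
    "annotations": [
        {"image_id": 1, "level": "global",
         "caption": "Busy junction with an auto-rickshaw ahead."},
        {"image_id": 1, "level": "navigation",
         "caption": "Keep left; a bus is stopped in the right lane."},
    ],
}

# Group captions by image id, as a captioning dataloader would.
captions_by_image = {}
for ann in coco["annotations"]:
    captions_by_image.setdefault(ann["image_id"], []).append(ann["caption"])

print(json.dumps(captions_by_image, indent=2))
```

A real loader would read the JSON from disk and pair each image's captions with the pixel data via `file_name`.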

Fine-tuning Setup (3-stage LoRA)

  • Stage 1 (Vision encoder - ViT)

    • Target modules: qkv projections
    • Rank: 16, Alpha: 32, Dropout: 0.05
    • Trainable params: ~589,824 (≈0.24%)
    • Epochs: 10, LR: 5e-5
  • Stage 2 (Text decoder)

    • Target modules: query, value
    • Rank: 32, Alpha: 64, Dropout: 0.05
    • Trainable params: ~2,359,296 (≈0.95%)
    • Epochs: 8, LR: 3e-5
  • Stage 3 (Joint fine-tuning)

    • Target: combined adapters on both vision and text modules
    • Rank: 16, Alpha: 32, Dropout: 0.05
    • Trainable params: ~1,769,472 (≈0.71%)
    • Epochs: 6, LR: 1e-5
  • Optimizer: AdamW

  • Batch size: 4 (effective batch size 16 via gradient accumulation)

  • Mixed precision: FP16 enabled

  • Hardware: NVIDIA Tesla T4 (15 GB), as reported
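The adapter mechanics shared by all three stages can be illustrated with a small NumPy sketch. This is not the training code; the dimensions are hypothetical, and in the actual model the frozen weight W and the trainable A/B pairs live inside BLIP's attention and projection layers with the rank/alpha values listed above:

```python
import numpy as np

rng = np.random.default_rng(0)

def lora_linear(x, W, A, B, alpha, r):
    # LoRA forward pass: frozen base projection plus a scaled
    # low-rank update BA, i.e. y = x W^T + (alpha / r) x A^T B^T.
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

d, r, alpha = 64, 16, 32          # stage-1 rank/alpha; d is illustrative
W = rng.normal(size=(d, d))       # frozen base weight (never updated)
A = rng.normal(size=(r, d))       # trainable down-projection
B = np.zeros((d, r))              # trainable up-projection, zero-initialized

x = rng.normal(size=(3, d))
y = lora_linear(x, W, A, B, alpha, r)

# With B zero-initialized, the adapter contributes nothing at step 0,
# so fine-tuning starts exactly from the pretrained behavior.
assert np.allclose(y, x @ W.T)
```

Because only A and B receive gradients, each stage trains well under 1% of the ~248M base parameters, matching the trainable-parameter counts above.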

Evaluation

Test set: 44 held-out images

Key metrics (mean):

  • BLEU-1: Base 0.01936 → Fine-tuned 0.02158 (+11.45%)
  • BLEU-4: Base 0.00787 → Fine-tuned 0.01033 (+31.27%)
  • METEOR: Base 0.069998 → Fine-tuned 0.074931 (+7.05%)
  • ROUGE-L: Base 0.12089 → Fine-tuned 0.13612 (+12.60%)
  • Semantic similarity: Base 0.11853 → Fine-tuned 0.12770 (+7.74%)

Other stats (means): average caption length increased from 8.86 to 9.77 tokens; inference time decreased from ~451.6 ms to ~395.0 ms per image on the reported hardware.

For full metric JSON outputs see base_metrics.json and finetuned_metrics.json (included in the repository).
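The percentage gains above are relative improvements over the base model. A tiny helper (the function name is illustrative, not from the repo) reproduces them from the table's scores; note that recomputing from the rounded values shown here can differ from the reported percentages in the last digit:

```python
def relative_gain(base, finetuned):
    """Percentage improvement of the fine-tuned score over the base score."""
    return (finetuned - base) / base * 100

# BLEU-4 from the table above: 0.00787 -> 0.01033
gain = relative_gain(0.00787, 0.01033)
print(f"BLEU-4 gain: {gain:.1f}%")  # about +31.3% from the rounded inputs
```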

Example: Load & Inference

from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load the fine-tuned weights and processor from this repo directory.
model = BlipForConditionalGeneration.from_pretrained(".")
processor = BlipProcessor.from_pretrained(".")
model.eval()

# BLIP expects RGB input; convert in case the file is grayscale or RGBA.
image = Image.open("path/to/traffic.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=150, num_beams=5)
caption = processor.decode(outputs[0], skip_special_tokens=True)
print(caption)

Files

  • model.safetensors β€” model weights
  • tokenizer and vocab files β€” tokenizer.json, tokenizer_config.json, vocab.txt, special_tokens_map.json
  • config files β€” config.json, generation_config.json, preprocessor_config.json
  • evaluation outputs: base_metrics.json, finetuned_metrics.json

Recommended Citation

If you use this model in research, cite the BLIP paper and reference the LoRA approach. Example:

Li et al., "BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation" (ICML 2022)
Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models" (ICLR 2022)

You may also cite the internal project report included in the repository: COMPLETE_METHODOLOGY_AND_RESULTS.txt.

License

License for the model weights and tokenizer is not specified here. Add a LICENSE file to the repo with the chosen license (e.g., Apache-2.0, CC-BY-4.0, or a non-commercial license) before publishing.

Contact

For questions, contact the model author/maintainer (add contact info in the repo or in the Hugging Face model settings).


Last updated: 2025-12-20
