---
language: en
license: mit
tags:
- vision
- image-captioning
pipeline_tag: image-to-text
---

# PG-InstructBLIP model

PG-InstructBLIP is a finetuned version of InstructBLIP with Flan-T5-xxl as the language model. It was introduced in the paper [Physically Grounded Vision-Language Models for Robotic Manipulation](https://iliad.stanford.edu/pg-vlm/) by Gao et al.

## Model description

PG-InstructBLIP is finetuned on the [PhysObjects dataset](https://drive.google.com/file/d/1ThZ7p_5BnMboK_QE13m1fPKa4WGdRcfC/view?usp=sharing), an object-centric dataset of 36.9K crowd-sourced and 417K automated physical-concept annotations of common household objects. This finetuning improves the model's understanding of physical object concepts by capturing human priors about these concepts from visual appearance.

## Example Usage and Installation

This model is designed to be used with the LAVIS library. Please install [salesforce-lavis](https://pypi.org/project/salesforce-lavis/) and download this model's weights through git-lfs or a direct download.

```python
import torch
from PIL import Image
from omegaconf import OmegaConf

from lavis.models import load_model, load_preprocess
from lavis.common.registry import registry

import requests

url = "https://iliad.stanford.edu/pg-vlm/example_images/ceramic_bowl.jpg"
example_image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

vlm = load_model(
    name='blip2_t5_instruct',
    model_type='flant5xxl',
    checkpoint='pg-vlm/pgvlm_weights.bin',  # replace with location of downloaded weights
    is_eval=True,
    device="cuda" if torch.cuda.is_available() else "cpu"
)

model_cls = registry.get_model_class('blip2_t5_instruct')
model_type = 'flant5xxl'
preprocess_cfg = OmegaConf.load(model_cls.default_config_path(model_type)).preprocess
vis_processors, _ = load_preprocess(preprocess_cfg)
processor = vis_processors["eval"]

question_samples = {
    'prompt': 'Question: Classify this object as transparent, translucent, or opaque? Respond unknown if you are not sure. Short answer:',
    'image': torch.stack([processor(example_image)], dim=0).to(vlm.device)
}

print(vlm.generate(question_samples, length_penalty=0, repetition_penalty=1, num_captions=3))
# (['opaque', 'translucent', 'transparent'], tensor([-0.0448, -4.1387, -4.2793], device='cuda:0'))
```
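
With `num_captions=3`, `generate` returns candidate answers alongside per-candidate log-scores (the tensor in the example output above). As a rough sketch, assuming those semantics, the scores can be softmax-normalized into a distribution over the candidates; `answer_distribution` is a helper name introduced here for illustration, not part of LAVIS:

```python
import math

def answer_distribution(captions, scores):
    """Softmax-normalize per-caption log-scores into probabilities.

    Hypothetical helper for illustration; `captions` and `scores` mirror
    the two elements returned by `vlm.generate` above.
    """
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return dict(zip(captions, (e / total for e in exps)))

# Using the example output above:
dist = answer_distribution(
    ['opaque', 'translucent', 'transparent'],
    [-0.0448, -4.1387, -4.2793],
)
print(dist)  # 'opaque' carries almost all of the probability mass
```

This kind of normalization is convenient when thresholding answers (e.g. falling back to "unknown" when no candidate is confident).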