luodian committed on
Commit 284388a · 1 Parent(s): c121efd

Update README.md

Files changed (1):
  1. README.md +56 -0
README.md CHANGED
@@ -35,6 +35,62 @@ license: other
 
 [Live Demo (soon)](https://otter.cliangyu.com/) | [Paper (soon)]()
 
+ ## 🦦 Simple Code For Otter-9B
+
+ Here is an example of multi-modal in-context learning (ICL) with 🦦 Otter. We provide two demo images with corresponding instructions and answers, then ask the model to answer a new instruction for a query image. You may change the instruction and see how the model responds.
+
+ ```python
+ import requests
+ import torch
+ import transformers
+ from PIL import Image
+
+ # OtterForConditionalGeneration comes from the Otter codebase; this import
+ # path assumes the Otter repo's module layout.
+ from otter.modeling_otter import OtterForConditionalGeneration
+
+ # Load the pretrained Otter-9B checkpoint, sharding it across available devices.
+ model = OtterForConditionalGeneration.from_pretrained(
+     "luodian/otter-9b-hf", device_map="auto"
+ )
+ tokenizer = model.text_tokenizer
+ image_processor = transformers.CLIPImageProcessor()
+
+ # Two in-context demonstration images plus the query image.
+ demo_image_one = Image.open(
+     requests.get(
+         "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True
+     ).raw
+ )
+ demo_image_two = Image.open(
+     requests.get(
+         "http://images.cocodataset.org/test-stuff2017/000000028137.jpg", stream=True
+     ).raw
+ )
+ query_image = Image.open(
+     requests.get(
+         "http://images.cocodataset.org/test-stuff2017/000000028352.jpg", stream=True
+     ).raw
+ )
+
+ # Preprocess to pixel values, then add the frame and batch axes the model
+ # expects: (batch, num_media, num_frames, C, H, W) = (1, 3, 1, C, H, W).
+ vision_x = (
+     image_processor.preprocess(
+         [demo_image_one, demo_image_two, query_image], return_tensors="pt"
+     )["pixel_values"]
+     .unsqueeze(1)
+     .unsqueeze(0)
+ )
+
+ # The prompt holds two (instruction, answer) demonstrations followed by the
+ # query instruction; each <image> token marks where an image slots in.
+ model.text_tokenizer.padding_side = "left"
+ lang_x = model.text_tokenizer(
+     [
+         "<image> User: what does the image describe? GPT: <answer> two cats sleeping. <|endofchunk|> <image> User: what does the image describe? GPT: <answer> a bathroom sink. <|endofchunk|> <image> User: what does the image describe? GPT: <answer>"
+     ],
+     return_tensors="pt",
+ )
+ generated_text = model.generate(
+     vision_x=vision_x.to(model.device),
+     lang_x=lang_x["input_ids"].to(model.device),
+     attention_mask=lang_x["attention_mask"].to(model.device),
+     max_new_tokens=256,
+     num_beams=1,
+     no_repeat_ngram_size=3,
+ )
+
+ print("Generated text: ", model.text_tokenizer.decode(generated_text[0]))
+ ```
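+
+ The decoded output contains the full prompt followed by the model's continuation, so you will usually want only the text after the last `<answer>` tag. A minimal post-processing sketch, assuming the prompt format above:
+
+ ```python
+ # Keep only the newly generated answer: take everything after the last
+ # <answer> tag and drop a trailing <|endofchunk|> marker if one is emitted.
+ full_output = model.text_tokenizer.decode(generated_text[0])
+ answer = full_output.split("<answer>")[-1].split("<|endofchunk|>")[0].strip()
+ print("Parsed answer: ", answer)
+ ```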
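+
+ To try other instructions, it can be convenient to assemble the in-context prompt programmatically rather than editing the string by hand. A small sketch; the `make_icl_prompt` helper is hypothetical, not part of the Otter API:
+
+ ```python
+ # Hypothetical helper: build the ICL prompt from (instruction, answer)
+ # demonstration pairs plus a final query instruction.
+ def make_icl_prompt(demos, query):
+     parts = [
+         f"<image> User: {instruction} GPT: <answer> {answer} <|endofchunk|>"
+         for instruction, answer in demos
+     ]
+     parts.append(f"<image> User: {query} GPT: <answer>")
+     return " ".join(parts)
+
+ # Reproduces the prompt used above; swap in your own pairs to experiment.
+ prompt = make_icl_prompt(
+     demos=[
+         ("what does the image describe?", "two cats sleeping."),
+         ("what does the image describe?", "a bathroom sink."),
+     ],
+     query="what does the image describe?",
+ )
+ lang_x = model.text_tokenizer([prompt], return_tensors="pt")
+ ```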
+
 ## 🦦 Overview
 
 <div style="text-align:center">