luodian committed on
Commit 284388a · 1 Parent(s): c121efd

Update README.md

Files changed (1):
  1. README.md +56 -0
README.md CHANGED
@@ -35,6 +35,62 @@ license: other
 
 [Live Demo (soon)](https://otter.cliangyu.com/) | [Paper (soon)]()
 
+ ## 🦦 Simple Code For Otter-9B
+
+ Here is an example of multi-modal in-context learning (ICL) with 🦦 Otter. We provide two demo images with corresponding instructions and answers, then ask the model to answer a new instruction for a query image. You may change the instruction and see how the model responds.
+
+ ```python
+ import requests
+ import torch
+ import transformers
+ from PIL import Image
+
+ # OtterForConditionalGeneration comes from the Otter codebase; this import
+ # path assumes the Otter repo's module layout.
+ from otter.modeling_otter import OtterForConditionalGeneration
+
+ # Load the pretrained Otter-9B checkpoint, sharding it across available devices.
+ model = OtterForConditionalGeneration.from_pretrained(
+     "luodian/otter-9b-hf", device_map="auto"
+ )
+ tokenizer = model.text_tokenizer
+ image_processor = transformers.CLIPImageProcessor()
+
+ # Two in-context demonstration images plus the query image.
+ demo_image_one = Image.open(
+     requests.get(
+         "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True
+     ).raw
+ )
+ demo_image_two = Image.open(
+     requests.get(
+         "http://images.cocodataset.org/test-stuff2017/000000028137.jpg", stream=True
+     ).raw
+ )
+ query_image = Image.open(
+     requests.get(
+         "http://images.cocodataset.org/test-stuff2017/000000028352.jpg", stream=True
+     ).raw
+ )
+
+ # Preprocess to pixel values, then add the frame and batch axes the model
+ # expects: (batch, num_media, num_frames, C, H, W) = (1, 3, 1, C, H, W).
+ vision_x = (
+     image_processor.preprocess(
+         [demo_image_one, demo_image_two, query_image], return_tensors="pt"
+     )["pixel_values"]
+     .unsqueeze(1)
+     .unsqueeze(0)
+ )
+
+ # The prompt holds two (instruction, answer) demonstrations followed by the
+ # query instruction; each <image> token marks where an image slots in.
+ model.text_tokenizer.padding_side = "left"
+ lang_x = model.text_tokenizer(
+     [
+         "<image> User: what does the image describe? GPT: <answer> two cats sleeping. <|endofchunk|> <image> User: what does the image describe? GPT: <answer> a bathroom sink. <|endofchunk|> <image> User: what does the image describe? GPT: <answer>"
+     ],
+     return_tensors="pt",
+ )
+ generated_text = model.generate(
+     vision_x=vision_x.to(model.device),
+     lang_x=lang_x["input_ids"].to(model.device),
+     attention_mask=lang_x["attention_mask"].to(model.device),
+     max_new_tokens=256,
+     num_beams=1,
+     no_repeat_ngram_size=3,
+ )
+
+ print("Generated text: ", model.text_tokenizer.decode(generated_text[0]))
+ ```
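+
+ The decoded output contains the full prompt followed by the model's continuation, so you will usually want only the text after the last `<answer>` tag. A minimal post-processing sketch, assuming the prompt format above:
+
+ ```python
+ # Keep only the newly generated answer: take everything after the last
+ # <answer> tag and drop a trailing <|endofchunk|> marker if one is emitted.
+ full_output = model.text_tokenizer.decode(generated_text[0])
+ answer = full_output.split("<answer>")[-1].split("<|endofchunk|>")[0].strip()
+ print("Parsed answer: ", answer)
+ ```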
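+
+ To try other instructions, it can be convenient to assemble the in-context prompt programmatically rather than editing the string by hand. A small sketch; the `make_icl_prompt` helper is hypothetical, not part of the Otter API:
+
+ ```python
+ # Hypothetical helper: build the ICL prompt from (instruction, answer)
+ # demonstration pairs plus a final query instruction.
+ def make_icl_prompt(demos, query):
+     parts = [
+         f"<image> User: {instruction} GPT: <answer> {answer} <|endofchunk|>"
+         for instruction, answer in demos
+     ]
+     parts.append(f"<image> User: {query} GPT: <answer>")
+     return " ".join(parts)
+
+ # Reproduces the prompt used above; swap in your own pairs to experiment.
+ prompt = make_icl_prompt(
+     demos=[
+         ("what does the image describe?", "two cats sleeping."),
+         ("what does the image describe?", "a bathroom sink."),
+     ],
+     query="what does the image describe?",
+ )
+ lang_x = model.text_tokenizer([prompt], return_tensors="pt")
+ ```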
+
 ## 🦦 Overview
 
 <div style="text-align:center">