Update README.md
README.md CHANGED

```diff
@@ -27,7 +27,7 @@ language:
 
 ## Summary
 
-- Mantis is
+- Mantis-Fuyu is a Fuyu-based LMM with **interleaved text and image as inputs**, trained on Mantis-Instruct under academic-level resources (i.e. 36 hours on 16xA100-40G).
 - Mantis is trained to have multi-image skills including co-reference, reasoning, comparing, and temporal understanding.
 - Mantis reaches state-of-the-art performance on five multi-image benchmarks (NLVR2, Q-Bench, BLINK, MVBench, Mantis-Eval), and also maintains strong single-image performance on par with CogVLM and Emu2.
 
@@ -58,10 +58,11 @@ image2 = "image2.jpg"
 images = [Image.open(image1), Image.open(image2)]
 
 # load processor and model
-from mantis.models.mllava import MLlavaProcessor, LlavaForConditionalGeneration
-
+# from mantis.models.mllava import MLlavaProcessor, LlavaForConditionalGeneration
+from mantis.models.mfuyu import MFuyuForCausalLM, MFuyuProcessor
+processor = MFuyuProcessor.from_pretrained("TIGER-Lab/Mantis-8B-Fuyu")
 attn_implementation = None  # or "flash_attention_2"
-model =
+model = MFuyuForCausalLM.from_pretrained("TIGER-Lab/Mantis-8B-Fuyu", device_map="cuda", torch_dtype=torch.bfloat16, attn_implementation=attn_implementation)
 
 generation_kwargs = {
     "max_new_tokens": 1024,
```
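The hunks above only show a fragment of the README's inference snippet: the prompt construction and the actual generation call fall outside the diff. A minimal sketch of how the shown pieces might fit together, assuming the processor and model follow standard Hugging Face call conventions — `processor(text=..., images=..., return_tensors="pt")`, `model.generate`, and `processor.batch_decode` are assumptions, not shown in the diff:

```python
def build_generation_kwargs(max_new_tokens=1024):
    # Mirrors the generation_kwargs dict in the README snippet; only
    # max_new_tokens is visible in the diff, so no other keys are invented.
    return {"max_new_tokens": max_new_tokens}


def run_inference(image_paths, prompt):
    """Load the Fuyu-based checkpoint and generate a response.

    Heavy imports are kept local so the module can be imported without
    the mantis package or a GPU being available.
    """
    import torch
    from PIL import Image
    from mantis.models.mfuyu import MFuyuForCausalLM, MFuyuProcessor

    processor = MFuyuProcessor.from_pretrained("TIGER-Lab/Mantis-8B-Fuyu")
    attn_implementation = None  # or "flash_attention_2"
    model = MFuyuForCausalLM.from_pretrained(
        "TIGER-Lab/Mantis-8B-Fuyu",
        device_map="cuda",
        torch_dtype=torch.bfloat16,
        attn_implementation=attn_implementation,
    )

    images = [Image.open(p) for p in image_paths]
    # Assumption: standard Hugging Face processor signature; the diff does
    # not show how text and images are combined for the Fuyu model.
    inputs = processor(text=prompt, images=images, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, **build_generation_kwargs())
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0]
```

The model-loading call itself (`device_map="cuda"`, `torch_dtype=torch.bfloat16`, `attn_implementation`) is taken verbatim from the diff; everything around it is a sketch.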