Mantis-llava-7b

Image-Text-to-Text

Transformers

DongfuJiang committed on Apr 13, 2024

Commit 8b56eb5 (verified) · 1 parent: 0ddead0

Update README.md

Files changed (1)

README.md +39 -33

@@ -3,55 +3,61 @@ tags:
 - generated_from_trainer
 base_model: llava-hf/llava-1.5-7b-hf
 model-index:
-- name: llava_1.5_7b_v2_4096
   results: []
 ---
-<!-- This model card has been generated automatically according to the information the Trainer had access to. You
-should probably proofread and complete it, then remove this comment. -->
-# llava_1.5_7b_v2_4096
-This model is a fine-tuned version of [llava-hf/llava-1.5-7b-hf](https://huggingface.co/llava-hf/llava-1.5-7b-hf) on an unknown dataset.
-## Model description
-More information needed
-## Intended uses & limitations
-More information needed
-## Training and evaluation data
-More information needed
-## Training procedure
-### Training hyperparameters
-The following hyperparameters were used during training:
-- learning_rate: 1e-05
-- train_batch_size: 1
-- eval_batch_size: 1
-- seed: 42
-- distributed_type: multi-GPU
-- num_devices: 8
-- gradient_accumulation_steps: 16
-- total_train_batch_size: 128
-- total_eval_batch_size: 8
-- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
-- lr_scheduler_type: cosine
-- lr_scheduler_warmup_ratio: 0.03
-- num_epochs: 1.0
-### Training results
-### Framework versions
-- Transformers 4.39.2
-- Pytorch 2.2.1
-- Datasets 2.17.1
-- Tokenizers 0.15.2

 - generated_from_trainer
 base_model: llava-hf/llava-1.5-7b-hf
 model-index:
+- name: Mantis-llava-7b
   results: []
 ---
+# Mantis: Interleaved Multi-Image Instruction Tuning
+**Mantis** is a multimodal conversational AI model that can chat with users about images and text. It is optimized for multi-image reasoning and takes interleaved text and images as input when generating responses.
+Mantis is trained on the newly curated dataset **Mantis-Instruct**, a large-scale multi-image QA dataset that covers various multi-image reasoning tasks.
+|[Demo](https://huggingface.co/spaces/TIGER-Lab/Mantis) | [Blog](https://tiger-ai-lab.github.io/Blog/mantis) | [Github](https://github.com/TIGER-AI-Lab/Mantis) |  [Models](https://huggingface.co/collections/TIGER-Lab/mantis-6619b0834594c878cdb1d6e4) |
+![Mantis](./overall_barchart.jpeg)
+## Inference
+You can install the Mantis GitHub code as a Python package:
+```bash
+pip install git+https://github.com/TIGER-AI-Lab/Mantis.git
+```
+then run inference with the code in [examples/run_mantis.py](https://github.com/TIGER-AI-Lab/Mantis/blob/main/examples/run_mantis_hf.py).
+Alternatively, you can run the model without the Mantis package, using only Hugging Face Transformers. See [examples/run_mantis_hf.py](https://github.com/TIGER-AI-Lab/Mantis/blob/main/examples/run_mantis_hf.py) for details. The example below uses the Mantis package:
+```python
+import torch
+from PIL import Image
+from mantis.models.mllava import chat_mllava, MLlavaProcessor, LlavaForConditionalGeneration
+# load the two input images (replace the paths with your own files)
+image1 = "image1.jpg"
+image2 = "image2.jpg"
+images = [Image.open(image1), Image.open(image2)]
+# load processor and model
+processor = MLlavaProcessor.from_pretrained("TIGER-Lab/Mantis-llava-7b")
+model = LlavaForConditionalGeneration.from_pretrained("TIGER-Lab/Mantis-llava-7b", device_map="auto", torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2")
+# chat
+text = "<image> <image> What's the difference between these two images? Please describe as much as you can."
+response, history = chat_mllava(text, images, model, processor)
+print("USER: ", text)
+print("ASSISTANT: ", response)
+# The image on the right has a larger number of wallets displayed compared to the image on the left. The wallets in the right image are arranged in a grid pattern, while the wallets in the left image are displayed in a more scattered manner. The wallets in the right image have various colors, including red, purple, and brown, while the wallets in the left image are primarily brown.
+text = "How many items are there in image 1 and image 2 respectively?"
+response, history = chat_mllava(text, images, model, processor, history=history)
+print("USER: ", text)
+print("ASSISTANT: ", response)
+# There are two items in image 1 and four items in image 2.
+```
+## Training
+Training code will be released soon.
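
Note on the inference example in this diff: the first prompt carries one `<image>` placeholder per input image (two placeholders for two images). The following is a minimal sketch, not part of the commit, of how such a prompt could be assembled and checked before calling `chat_mllava`. The helpers `build_prompt` and `check_prompt` are hypothetical names introduced here for illustration, and the one-placeholder-per-image rule is an assumption inferred from the two-image example, not something the README states.

```python
# Placeholder string as it appears in the README's example prompt.
IMAGE_TOKEN = "<image>"


def build_prompt(question: str, num_images: int) -> str:
    """Hypothetical helper: prefix `question` with one placeholder per image."""
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    placeholders = " ".join([IMAGE_TOKEN] * num_images)
    # strip() removes the leading space when there are no images
    return f"{placeholders} {question}".strip()


def check_prompt(text: str, num_images: int) -> None:
    """Hypothetical check: assumes placeholder count should equal image count."""
    found = text.count(IMAGE_TOKEN)
    if found != num_images:
        raise ValueError(
            f"prompt has {found} placeholder(s) but {num_images} image(s) were given"
        )


prompt = build_prompt("What's the difference between these two images?", 2)
check_prompt(prompt, 2)
print(prompt)
```

The resulting string could then be passed as `text` alongside the `images` list in the example above. Follow-up turns that reuse `history`, like the second question in the diff, carry no new placeholders, so a check of this kind would apply only to the first turn.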