| # Introduction |
|
|
MixSense is a series of models based on the widely adopted vision encoder-projector-LLM architecture. In this release, we provide the Llama-3-MixSenseV1.1 checkpoint. Compared to [version 1.0](https://huggingface.co/Zero-Vision/Llama-3-MixSense), we changed the vision encoder from [SigLIP 400M](https://huggingface.co/google/siglip-so400m-patch14-384) to [Florence-2-large-ft's DaViT vision encoder](https://huggingface.co/microsoft/Florence-2-large-ft) and added more VQA data to the fine-tuning stage.
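For orientation, the sketch below shows how such an encoder-projector-LLM pipeline is typically wired. It is a minimal illustration only: the module names and the MLP projector design are assumptions (in the spirit of LLaVA-style models), not the released MixSense implementation.

```python
# Illustrative sketch of a vision encoder-projector-LLM pipeline.
# Hypothetical module names; NOT the actual MixSense code.
import torch
import torch.nn as nn


class VisionProjector(nn.Module):
    """Maps vision-encoder patch features into the LLM embedding space (assumed 2-layer MLP)."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.mlp(vision_feats)


def fuse_multimodal(image_embeds: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
    # Prepend projected image tokens to the text token embeddings before the LLM decoder.
    return torch.cat([image_embeds, text_embeds], dim=1)
```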
|
|
We have developed an innovative data processing method that complements the training process, reducing training costs while improving training effectiveness. The models are trained on our restructured dataset. Details of the data organization and the related research papers will be available soon.
|
|
| # QuickStart |
|
|
| ## Requirements |
|
|
| ``` |
| conda create -n mixsense python==3.10 -y |
| conda activate mixsense |
| pip install torch transformers==4.37.2 accelerate pillow |
| ``` |
|
|
| ## Usage |
|
|
The following demo script is provided as `Llama-3-Mixsense/demo.py`:
|
|
| ```python |
| import torch |
| import transformers |
| from transformers import AutoModelForCausalLM, AutoTokenizer |
| from PIL import Image |
| import warnings |
| |
| |
| # disable some warnings |
| transformers.logging.set_verbosity_error() |
| transformers.logging.disable_progress_bar() |
| warnings.filterwarnings("ignore") |
| |
| # set device |
| device = "cuda" # or cpu, or npu (ASCEND 910B support) |
| |
| # create model |
| model = AutoModelForCausalLM.from_pretrained( |
| "Zero-Vision/Llama-3-MixSenseV1_1", |
| torch_dtype=torch.float16, # float32 for cpu |
| device_map="auto", |
| trust_remote_code=True, |
| ) |
| tokenizer = AutoTokenizer.from_pretrained( |
| "Zero-Vision/Llama-3-MixSenseV1_1", |
| trust_remote_code=True, |
| ) |
| |
qs = "Describe the image in detail."
| input_ids = model.text_process(qs, tokenizer).to(device) |
| |
| image = Image.open("example.jpg") |
| image_tensor = model.image_process([image]).to(dtype=model.dtype, device=device) |
| |
| # generate |
| with torch.inference_mode(): |
| output_ids = model.generate( |
| input_ids, |
| images=image_tensor, |
| max_new_tokens=2048, |
| use_cache=True, |
| eos_token_id=[ |
| tokenizer.eos_token_id, |
| tokenizer.convert_tokens_to_ids(["<|eot_id|>"])[0], |
| ], |
| ) |
| |
| print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()) |
| ``` |
|
|
| ## Eval |
|
|
We provide `Llama-3-Mixsense/llama3mixsense.py` for evaluation with [VLMEvalKit](https://github.com/open-compass/VLMEvalKit).
|
|
| # License |
|
|
This project uses datasets and checkpoints that are subject to their respective original licenses, including but not limited to Llama 3, SigLIP, and Florence-2; users must comply with all terms and conditions of those licenses. Meta Llama 3 is licensed under the [Meta Llama 3 Community License](https://llama.meta.com/llama3/license/), Copyright © Meta Platforms, Inc. All Rights Reserved. The Florence-2 model is released under the [MIT License](https://opensource.org/licenses/MIT). The project itself is licensed under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).
|
|
| # Acknowledgement |
|
|
Our code is largely based on [LLaVA](https://github.com/haotian-liu/LLaVA).
This demo was built following [Bunny](https://huggingface.co/BAAI/Bunny-Llama-3-8B-V).
|
|