| # F-LMM: Grounding Frozen Large Multimodal Models | |
|  | |
| ## Introduction | |
| This is the official release of the paper **F-LMM: Grounding Frozen Large Multimodal Models**. | |
| It is currently under construction. | |
| > [**F-LMM: Grounding Frozen Large Multimodal Models**](https://arxiv.org/abs/2406.05821), | |
| > Size Wu, Sheng Jin, Wenwei Zhang, Lumin Xu, Wentao Liu, Wei Li, Chen Change Loy | |
| > [Bibtex](https://github.com/wusize/F-LMM#citation) | |
| ## TODO | |
| - [x] Training code | |
| - [x] Evaluation code and checkpoints | |
| - [ ] Interactive Demo | |
| ## Dependencies | |
| 1. This project is built on [Xtuner](https://github.com/InternLM/xtuner). The segmentation modules | |
| including the U-Net and training losses are | |
| from [MMSegmentation](https://github.com/open-mmlab/mmsegmentation) and | |
| [MMDetection](https://github.com/open-mmlab/mmdetection). Please refer to the official documentation of these toolkits for installation guidance. | |
| 2. The version of [transformers](https://github.com/huggingface/transformers) used in this project is v4.39.1. We | |
| find that versions beyond v4.40.0 cannot reproduce the reported performance (we are still debugging this issue). | |
| 3. Accelerate is used to build the evaluation pipeline of our models. Please refer to its official | |
| [webpage](https://github.com/huggingface/accelerate) for installation. | |
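The version constraint in item 2 can be checked before training or evaluation. The snippet below is a minimal sketch, not part of this repository: the helper names `parse_version`, `matches_pinned`, and `installed_transformers_version` are hypothetical, and the only fact taken from this README is the pinned version v4.39.1.

```python
import re
from importlib import metadata

PINNED_TRANSFORMERS = "4.39.1"  # version stated in this README


def parse_version(text):
    """Turn '4.39.1' into (4, 39, 1); anything after the third field is ignored."""
    numbers = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    return tuple(numbers)


def matches_pinned(installed, pinned=PINNED_TRANSFORMERS):
    """Return True when the installed version equals the pinned one."""
    return parse_version(installed) == parse_version(pinned)


def installed_transformers_version():
    """Return the installed transformers version string, or None if it is absent."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_transformers_version()
    if found is None:
        print("transformers is not installed")
    elif matches_pinned(found):
        print(f"transformers {found} matches the pinned version")
    else:
        print(f"warning: transformers {found} differs from {PINNED_TRANSFORMERS}")
```

The check compares for equality with the pinned version rather than guessing which later releases might still work, since this README only confirms v4.39.1.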
| ## Data Preparation | |
| **[PNG](https://github.com/BCV-Uniandes/PNG) Dataset.** Download images `train2017` and `val2017` | |
| from COCO's official [website](https://cocodataset.org/#home) and put them under `data/coco`. Download annotation | |
| files `png_coco_train2017.json` and `png_coco_val2017.json` from PNG's project [page](https://bcv-uniandes.github.io/panoptic-narrative-grounding/#downloads) | |
| and put them under `data/coco/annotations`. Download mask annotations `panoptic_train2017(.json)` and `panoptic_val2017(.json)` from | |
| COCO's official [website](http://images.cocodataset.org/annotations/panoptic_annotations_trainval2017.zip) and put | |
| them under `data/coco/annotations`. | |
| **[RefCOCO Series](https://github.com/lichengunc/refer).** Please refer to MMDetection's | |
| [tutorial](https://mmdetection.readthedocs.io/en/latest/user_guides/dataset_prepare.html#refcoco-dataset-preparation) | |
| to prepare RefCOCO datasets. | |
| **[VisCoT](https://github.com/deepcs233/Visual-CoT).** We have prepared the test images on | |
| [Google Drive](https://drive.google.com/drive/folders/1j25nY7i47OudmyzZFyps8NmzVHx6sf5O?usp=drive_link). Download and | |
| extract the zip files under `data/cot`. | |
| ```text | |
| F-LMM/ | |
| ├── data | |
|     ├── cot | |
|     ├── coco | |
|         ├── annotations | |
|             ├── panoptic_train2017.json | |
|             ├── panoptic_val2017.json | |
|             ├── png_coco_train2017.json | |
|             ├── png_coco_val2017.json | |
|             ├── panoptic_train2017 # panoptic masks | |
|             ├── panoptic_val2017 # panoptic masks | |
|         ├── refcoco | |
|             ├── instances.json | |
|             ├── refs(unc).p | |
|         ├── refcoco+ | |
|             ├── instances.json | |
|             ├── refs(unc).p | |
|         ├── refcocog | |
|             ├── instances.json | |
|             ├── refs(umd).p | |
|         ├── train2017 | |
|         ├── val2017 | |
|         ├── train2014 | |
| ``` | |
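Before launching a run, it can help to confirm that the files described above are in place. The following is a minimal sketch under stated assumptions: `EXPECTED_PATHS` and `missing_paths` are hypothetical names, and the list covers only the PNG and VisCoT locations spelled out in the text (the RefCOCO files follow MMDetection's tutorial and are not checked here).

```python
from pathlib import Path

# Paths taken from the data preparation instructions in this README,
# relative to the repository root.
EXPECTED_PATHS = [
    "data/cot",
    "data/coco/train2017",
    "data/coco/val2017",
    "data/coco/annotations/png_coco_train2017.json",
    "data/coco/annotations/png_coco_val2017.json",
    "data/coco/annotations/panoptic_train2017.json",
    "data/coco/annotations/panoptic_val2017.json",
]


def missing_paths(root="."):
    """Return the expected paths that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED_PATHS if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Missing files or folders:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All expected data paths are present.")
```

Run it from the `F-LMM/` root; it only tests for existence and does not validate file contents.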
| ## Checkpoints | |
| **SAM.** Please obtain the checkpoint `sam_vit_l_0b3195.pth` of the pretrained SAM model from SAM's official | |
| [webpage](https://github.com/facebookresearch/segment-anything#model-checkpoints). | |
| ```text | |
| F-LMM/ | |
| ├── checkpoints | |
|     ├── sam_vit_l_0b3195.pth | |
| ``` | |
| **Large Multimodal Models.** Weights of off-the-shelf LMMs are downloaded automatically from Hugging Face when running | |
| training or evaluation. | |
| ## Run | |
| ### Train | |
| ```shell | |
| export PYTHONPATH=. | |
| NPROC_PER_NODE=8 xtuner train configs/deepseek_vl/frozen_deepseek_vl_1_3b_chat_unet_sam_l_refcoco_png.py --deepspeed deepspeed_zero2 | |
| ``` | |
| Training with `deepspeed_zero3` is currently buggy; we plan to resolve this issue in the future. | |
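When training is launched from a Python script rather than the shell, the same command can be assembled programmatically. This is a minimal sketch, not part of the repository: `build_train_launch` is a hypothetical helper that reuses only the environment variables and arguments shown in the training command, and it refuses `deepspeed_zero3` because of the issue noted here.

```python
import os


def build_train_launch(config, nproc_per_node=8, deepspeed="deepspeed_zero2"):
    """Return (env, argv) mirroring the shell training command in this README."""
    if deepspeed == "deepspeed_zero3":
        # This README reports bugs with deepspeed_zero3.
        raise ValueError("deepspeed_zero3 is currently unsupported; use deepspeed_zero2")
    env = dict(os.environ, PYTHONPATH=".", NPROC_PER_NODE=str(nproc_per_node))
    argv = ["xtuner", "train", config, "--deepspeed", deepspeed]
    return env, argv
```

The result can be passed to `subprocess.run(argv, env=env, check=True)` from the repository root.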
| ### Test | |
| **Checkpoints.** | |
| The checkpoints of our trained models are available on | |
| [Google Drive](https://drive.google.com/drive/folders/1bvrDqm9m4MvcocuwvvkGf_qYRBfvr0K7?usp=sharing). Download and put | |
| them under `checkpoints/`. | |
| | # | LMM | Configs | Checkpoints | | |
| |:--:|:---------------------:|:------------------------------------------------------------------------------------------------------------------------------:|:-------------------------------------------------------------------------------------------:| | |
| | 1 | LLaVA-1.5-7B | [frozen_llava_1_5_vicuna_7b_unet_sam_l_refcoco_png](configs/llava/frozen_llava_1_5_vicuna_7b_unet_sam_l_refcoco_png.py) | [model](https://drive.google.com/file/d/1opjFe15B5L5JJ78gE_FsXvDnwSlwSHhh/view?usp=sharing) | | |
| | 2 | LLaVA-Next-Vicuna-7B | [frozen_llava_next_vicuna_7b_unet_sam_l_refcoco_png](configs/llava_next/frozen_llava_next_vicuna_7b_unet_sam_l_refcoco_png.py) | [model](https://drive.google.com/file/d/1N-olLqhZdPEySt8Asu2cvLJBaL1VHTqa/view?usp=drive_link) | | |
| | 3 | LLaVA-Next-Mistral-7B | [frozen_llava_next_mistral_7b_unet_sam_l_refcoco_png](configs/llava_next/frozen_llava_next_mistral_7b_unet_sam_l_refcoco_png.py) | [model](https://drive.google.com/file/d/13rHaEZ62Q-VX5iKhOQnlm4yH1TMOBalH/view?usp=drive_link) | | |
| | 4 | DeepSeekVL-1.3B | [frozen_deepseek_vl_1_3b_chat_unet_sam_l_refcoco_png](configs/deepseek_vl/frozen_deepseek_vl_1_3b_chat_unet_sam_l_refcoco_png.py) | [model](https://drive.google.com/file/d/1UXcjJrrpTm1bNphvPNjvol9gUfvzNbjA/view?usp=drive_link) | | |
| | 5 | DeepSeekVL-7B | [frozen_deepseek_vl_7b_chat_unet_sam_l_refcoco_png](configs/deepseek_vl/frozen_deepseek_vl_7b_chat_unet_sam_l_refcoco_png.py) | [model](https://drive.google.com/file/d/1LOwIAYVyR51e34ksV9jz-GGiFfmkZLj_/view?usp=drive_link) | | |
| | 6 | MiniGemini-2B | [frozen_mgm_gemma_2b_unet_sam_l_refcoco_png](configs/mgm/frozen_mgm_gemma_2b_unet_sam_l_refcoco_png.py) | [model](https://drive.google.com/file/d/13wHk-dHa4in1rfIRzKCf-xEHwhaCz_6Y/view?usp=drive_link) | | |
| | 7 | MiniGemini-7B | [frozen_mgm_vicuna_7b_unet_sam_l_refcoco_png](configs/mgm/frozen_mgm_vicuna_7b_unet_sam_l_refcoco_png.py) | [model](https://drive.google.com/file/d/1Gg57bLJfx2zvYQyyE7Fjfw3hCq9ucVyN/view?usp=drive_link) | | |
| | 8 | MiniGemini-HD-7B | [frozen_mgm_vicuna_7b_hd_unet_sam_l_refcoco_png](configs/mgm/frozen_mgm_vicuna_7b_hd_unet_sam_l_refcoco_png.py) | [model](https://drive.google.com/file/d/1CDRI1l0FdTra7EZH_NNEha_QfA2cdbYb/view?usp=drive_link) | | |
| | 9 | HPT-Air | [frozen_hpt_air_unet_sam_l_refcoco_png](configs/hpt/frozen_hpt_air_unet_sam_l_refcoco_png.py) | [model](https://drive.google.com/file/d/1_gU4olEjsYvBvcq6yWGklSxNAv-Yz44T/view?usp=drive_link) | | |
| | 10 | HPT-Air-1.5 | [frozen_hpt_air_1_5_unet_sam_l_refcoco_png](configs/hpt/frozen_hpt_air_1_5_unet_sam_l_refcoco_png.py) | [model](https://drive.google.com/file/d/1Q-asMx7C3onXnmxqEZzecMHHCccqkzaP/view?usp=drive_link) | | |
| **Panoptic Narrative Grounding (PNG).** | |
| ```shell | |
| export PYTHONPATH=. | |
| accelerate launch scripts/multiprocess_eval_png.py \ | |
| configs/deepseek_vl/frozen_deepseek_vl_1_3b_chat_unet_sam_l_refcoco_png.py \ | |
| --checkpoint checkpoints/frozen_deepseek_vl_1_3b_chat_unet_sam_l_refcoco_png.pth | |
| ``` | |
| **Referring Expression Segmentation (RES).** | |
| ```shell | |
| export PYTHONPATH=. | |
| accelerate launch scripts/multiprocess_eval_refcoco.py \ | |
| configs/deepseek_vl/frozen_deepseek_vl_1_3b_chat_unet_sam_l_refcoco_png.py \ | |
| --checkpoint checkpoints/frozen_deepseek_vl_1_3b_chat_unet_sam_l_refcoco_png.pth --concat | |
| ``` | |
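The PNG and RES commands differ only in the script and the `--concat` flag, so they can be generated for any config in the table. The sketch below is illustrative: `EVAL_SCRIPTS` and `build_eval_command` are hypothetical names, and it assumes each checkpoint file is named after its config stem, as in the DeepSeekVL-1.3B example; the files downloaded from Google Drive may be named differently.

```python
from pathlib import Path

# Script path and extra flags per task, copied from the commands in this README.
EVAL_SCRIPTS = {
    "png": ("scripts/multiprocess_eval_png.py", []),
    "refcoco": ("scripts/multiprocess_eval_refcoco.py", ["--concat"]),
}


def build_eval_command(task, config, checkpoint_dir="checkpoints"):
    """Return the argv list for `accelerate launch` on the given task and config."""
    script, extra = EVAL_SCRIPTS[task]
    # Assumption: the checkpoint shares the config's file stem.
    checkpoint = Path(checkpoint_dir) / (Path(config).stem + ".pth")
    return [
        "accelerate", "launch", script, config,
        "--checkpoint", checkpoint.as_posix(), *extra,
    ]
```

For example, `build_eval_command("refcoco", "configs/hpt/frozen_hpt_air_unet_sam_l_refcoco_png.py")` yields a command that points at `checkpoints/frozen_hpt_air_unet_sam_l_refcoco_png.pth` and ends with `--concat`.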
| **Visual Chain-of-Thought Reasoning.** | |
| For now, we only implement VisCoT on DeepSeekVL models, which work well with | |
| multi-image inputs. Some examples of visual CoT are shown below. | |
|  | |
| ***1. Inference.*** | |
| ```shell | |
| export PYTHONPATH=. | |
| accelerate launch scripts/visual_cot/visual_cot_inference.py configs/deepseek_vl/frozen_deepseek_vl_1_3b_chat_unet_sam_l_refcoco_png.py \ | |
| --checkpoint checkpoints/frozen_deepseek_vl_1_3b_chat_unet_sam_l_refcoco_png.pth \ | |
| --version v1 --save_folder the/directory/of/result/json/files --discard_sam | |
| ``` | |
| ***2. Evaluate using ChatGPT.*** | |
| ```shell | |
| export OPENAI_API_KEY="your_openai_api_key" | |
| python scripts/visual_cot/gpt_eval_cot_score_single.py --result_file a/single/json/file # evaluate a single json file | |
| python scripts/visual_cot/gpt_eval_cot_score.py --result_dir the/directory/of/all/json/files # evaluate all json files | |
| ``` | |
| ## Demo | |
| **Grounded Human-AI Conversation**. An interactive demo is coming soon. Below are some examples of grounded conversation. | |
|  | |
| ## Citation | |
| ```bibtex | |
| @misc{wu2024flmm, | |
| title={F-LMM: Grounding Frozen Large Multimodal Models}, | |
| author={Size Wu and Sheng Jin and Wenwei Zhang and Lumin Xu and Wentao Liu and Wei Li and Chen Change Loy}, | |
| year={2024}, | |
| eprint={2406.05821}, | |
| archivePrefix={arXiv}, | |
| primaryClass={cs.CV} | |
| } | |
| ``` | |
| ## License | |
| This project is licensed under [NTU S-Lab License 1.0](LICENSE). | |
| ## Acknowledgement | |
| This project would not be possible without the community's open-source efforts on large multimodal models, including | |
| [LLaVA](https://huggingface.co/llava-hf), [DeepSeek-VL](https://github.com/deepseek-ai/DeepSeek-VL), | |
| [MiniGemini](https://github.com/dvlab-research/MGM) and [HPT](https://github.com/HyperGAI/HPT). We also | |
| thank the [transformers](https://github.com/huggingface/transformers) and | |
| [openmmlab](https://github.com/open-mmlab) teams for the open-source codebases that facilitated the development of this project. | |