---
license: apache-2.0
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
metrics:
- bleu
base_model:
- Qwen/Qwen2.5-7B-Instruct
---

[[Paper]](https://arxiv.org/abs/2407.17331) [[GitHub]](https://github.com/deepglint/unicom)

## Embodied Ability Evaluation: Performance on RoboVQA and OpenEQA

Red indicates the best result in each row.

| Benchmark | Metric | MLCD<br>Embodied-7B | LLaVA<br>OneVision-7B | GPT-4v | RoboMamba |
| :-- | :-- | :-: | :-: | :-: | :-: |
| RoboVQA | BLEU1 | <span style="color:red">73.16</span> | 38.12 | - | 54.9 |
| | BLEU2 | <span style="color:red">66.39</span> | 33.56 | - | 44.2 |
| | BLEU3 | <span style="color:red">60.61</span> | 31.76 | - | 39.5 |
| | BLEU4 | <span style="color:red">56.56</span> | 30.97 | - | 36.3 |
| OpenEQA | Object State Recognition | <span style="color:red">71.83</span> | - | 63.2 | - |
| | Object Recognition | <span style="color:red">49.46</span> | - | 43.4 | - |
| | Functional Reasoning | 54.38 | - | <span style="color:red">57.4</span> | - |
| | Spatial Understanding | <span style="color:red">48.64</span> | - | 33.6 | - |
| | Attribute Recognition | <span style="color:red">67.08</span> | - | 57.2 | - |
| | World Knowledge | <span style="color:red">53.87</span> | - | 50.7 | - |
| | Object Localization | <span style="color:red">43.06</span> | - | 42.0 | - |
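
For reference, the BLEU1-BLEU4 rows above follow the standard n-gram-precision definition of BLEU. Below is a minimal sketch of that computation using NLTK; it is an illustration of the metric only, not the repository's actual scoring script, and the two example sentences are made up:

```python
# Minimal illustration of BLEU1-BLEU4 with NLTK (not the repository's
# scoring script); the reference/candidate sentences are hypothetical.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "pick up the red cup on the table".split()  # ground-truth answer
candidate = "pick up the red cup".split()               # model prediction

smooth = SmoothingFunction().method1  # avoids zero scores on short answers
for n in range(1, 5):
    # BLEU-n uses uniform weights over the first n n-gram orders
    weights = tuple(1.0 / n for _ in range(n)) + (0.0,) * (4 - n)
    score = sentence_bleu([reference], candidate, weights=weights,
                          smoothing_function=smooth)
    print(f"BLEU{n}: {score:.4f}")
```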

## General Ability Evaluation: Comparison with LLaVA OneVision-7B, GPT-4v, and GPT-4o

| Dataset | Split | MLCD<br>Embodied-7B | LLaVA<br>OneVision-7B | GPT-4v | GPT-4o |
| :-- | :-: | :-: | :-: | :-: | :-: |
| AI2D | test | 79.9 | 81.4 | 78.2 | 94.2 |
| ChartQA | test | 83.0 | 80.0 | 78.5 | 85.7 |
| DocVQA | test | 91.6 | 87.5 | 88.4 | 92.8 |
| InfoVQA | val | 73.9 | 70.7 | - | - |
| InfoVQA | test | 70.0 | 68.8 | - | - |
| MMMU | val | 47.3 | 48.8 | 56.8 | 69.1 |
| MMStar | test | 58.5 | 61.7 | 57.1 | 63.9 |
| OCRBench | - | 749.0 | 697.0 | 656.0 | 805.0 |
| RealWorldQA | test | 68.9 | 66.3 | 61.4 | 58.6 |
| SeedBench | image | 74.9 | 75.4 | 49.9 | 76.2 |
| MMBench | en-dev | 81.1 | 83.2 | 81.3 | 83.4 |
| MMBench | en-test | 80.1 | 80.8 | 75.0 | - |
| MME | test | 578/1603 | 418/1580 | 517/1409 | - |

## Usage

### A. Installation

```bash
git clone https://github.com/deepglint/unicom
cd unicom

# Upgrade pip and install necessary dependencies
pip install --upgrade pip
pip install -e ".[train]"
```
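
As a quick sanity check after installing, the editable install should make the repo's `llava` package importable (a minimal sketch; the package name is inferred from the `llava/benchmark/...` script paths used below):

```python
# Sanity check: the editable install should expose the repo's `llava`
# package (the benchmark scripts below live inside it).
import llava
print("llava imported from:", llava.__file__)
```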

### B. Inference

```bash
git clone https://github.com/deepglint/unicom
cd unicom
pip install --upgrade pip
pip install -e ".[train]"
pip install flash-attn --no-build-isolation

CUDA_VISIBLE_DEVICES=0 python infer.py --model_dir DeepGlint-AI/MLCD-Embodied-7B

# example:
# >> Enter 'exit' to end the conversation, 'reset' to clear the chat history.
# >> Enter image file paths (comma-separated): ./asserts/logo.png
# >> User: <image>What kind of animal is it in this picture?
# >> Assistant: The image features a stylized representation of a cat, characterized by its vibrant and abstract depiction.
# >> User: What color is this cat?
# >> Assistant: The cat in the image is primarily white with blue, orange and pink accents, creating a visually appealing and unique appearance.
# >> User: <image>Please describe this image.
# >> Assistant: This is a highly creative piece of cat-head artwork. Using multicolor gradients and an abstract style, it renders the cat's head as a lively, color-rich visual statement. The cat's eyes are rendered in gold and look strikingly expressive,
#    while the pink nose adds a touch of cuteness. The overall design blends modern art with the traditional cat-head motif, creating a visual effect that is both distinctive and captivating.
```

### C. Evaluation for Embodied Ability

#### Step 1

Download the raw data by following [OpenEQA](https://github.com/facebookresearch/open-eqa/tree/main/data) and [RoboVQA](https://console.cloud.google.com/storage/browser/gdm-robovqa) (val split).

#### Step 2

Convert the raw data into the format required for model evaluation:

```bash
# Convert the OpenEQA benchmark. Note: replace the paths with your own.
python llava/benchmark/make_openeqa_bmk.py

# Convert the RoboVQA benchmark. Note: replace the paths with your own.
python llava/benchmark/make_robovqa_bmk.py
```
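
To spot-check a converted file, it can be opened with pandas (a minimal sketch; it assumes `pyarrow` or `fastparquet` is installed, and the column names are whatever the conversion scripts actually produce, so inspect `df.columns`):

```python
# Inspect a converted benchmark parquet; the schema shown by df.columns
# is produced by the conversion scripts, not assumed here.
import pandas as pd

df = pd.read_parquet("/path/to/your/benchmarks/RoboVQA/robovqa.parquet")
print(df.shape)
print(df.columns.tolist())
print(df.head())
```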

#### Step 3

Make sure that your top-level directory structure looks like this:

```
|--/path/to/your/benchmarks
|  |--OpenEQA
|  |  |--openeqa_scannet.parquet
|  |  |--openeqa_hm3d.parquet
|  |--RoboVQA
|     |--robovqa.parquet
|--/path/to/your/images
   |--openeqa_val
   |  |--scannet-v0
   |  |  |--002-scannet-scene0709_00
   |  |  |--xxx-scannet-scenexxxx_xx
   |  |--hm3d-v0
   |     |--000-hm3d-BFRyYbPCCPE
   |     |--xxx-hm3d-xxxxxxxxxxx
   |--robovqa_val
      |--robovqa_221911
      |--robovqa_xxxxxx
```
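
Optionally, verify the layout programmatically before running the evaluation. This minimal sketch only checks for the paths shown in the tree above; adjust the two roots to your machine:

```python
# Verify that the expected benchmark files and image folders exist
# (paths taken from the directory tree above).
from pathlib import Path

bmk_root = Path("/path/to/your/benchmarks")
image_root = Path("/path/to/your/images")

expected = [
    bmk_root / "OpenEQA" / "openeqa_scannet.parquet",
    bmk_root / "OpenEQA" / "openeqa_hm3d.parquet",
    bmk_root / "RoboVQA" / "robovqa.parquet",
    image_root / "openeqa_val" / "scannet-v0",
    image_root / "openeqa_val" / "hm3d-v0",
    image_root / "robovqa_val",
]
for path in expected:
    print("OK  " if path.exists() else "MISSING", path)
```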

#### Step 4

Run the evaluation script. OpenEQA's protocol scores answers with an LLM judge, which is why the script needs API credentials:

```bash
# Note: replace 'YOUR_API_KEY', 'YOUR_ENDPOINT', 'bmk_root', 'image_folder' with your own.
bash scripts/eval/eval_robo.sh /path/to/your/model
```

### D. Evaluation for General Ability

Install the evaluation tool and execute the evaluation script:

```bash
pip install lmms-eval==0.2.0

PYTHONPATH=./ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m accelerate.commands.launch \
    --main_process_port=12444 \
    --num_processes=8 \
    -m lmms_eval \
    --model llava \
    --model_args pretrained=DeepGlint-AI/MLCD-Embodied-7B,conv_template=qwen_1_5 \
    --tasks mme \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix mlcd \
    --output_path ./eval_log/
```
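
The command above evaluates MME; most of the other benchmarks in the table are also available as lmms-eval tasks, and `--tasks` accepts a comma-separated list (for example, `--tasks mme,mmbench_en_dev`). Exact task names vary between lmms-eval versions, so confirm them against the v0.2.0 task registry. Logs and scores are written under `./eval_log/`.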

We would like to express our gratitude to [Huajie Tan](https://huggingface.co/tanhuajie2001), [Yumeng Wang](https://huggingface.co/devymex), and [Yin Xie](https://huggingface.co/Yin-Xie) for their significant contributions to the experimental validation of MLLMs.