Spaces:

MJ-Bench
/

README

Running

App Files Files Community

Zhaorun commited on Jul 9, 2024

Commit

bbb3390

verified ·

1 Parent(s): f0c9ee0

Update README.md

Browse files

Files changed (1) hide show

README.md +5 -3

README.md CHANGED Viewed

@@ -11,17 +11,19 @@ pinned: false
 Project page: https://mj-bench.github.io/
-![Dataset Overview](https://raw.githubusercontent.com/MJ-Bench/MJ-Bench.github.io/main/static/images/dataset_overview.png)
 While text-to-image models like DALLE-3 and Stable Diffusion are rapidly proliferating, they often encounter challenges such as hallucination, bias, and the production of unsafe, low-quality output. To effectively address these issues, it is crucial to align these models with desired behaviors based on feedback from a multimodal judge. Despite their significance, current multimodal judges frequently undergo inadequate evaluation of their capabilities and limitations, potentially leading to misalignment and unsafe fine-tuning outcomes.
 To address this issue, we introduce MJ-Bench, a novel benchmark which incorporates a comprehensive preference dataset to evaluate multimodal judges in providing feedback for image generation models across four key perspectives: **alignment**, **safety**, **image quality**, and **bias**.
 Specifically, we evaluate a large variety of multimodal judges including
 - 6 smaller-sized CLIP-based scoring models
 - 11 open-source VLMs (e.g. LLaVA family)
 - 4 and close-source VLMs (e.g. GPT-4o, Claude 3)
 We are actively updating the [leaderboard](https://mj-bench.github.io/) and you are welcome to submit the evaluation result of your multimodal judge on [our dataset](https://huggingface.co/datasets/MJ-Bench/MJ-Bench) to [huggingface leaderboard](https://huggingface.co/spaces/MJ-Bench/MJ-Bench-Leaderboard).

 Project page: https://mj-bench.github.io/
 While text-to-image models like DALLE-3 and Stable Diffusion are rapidly proliferating, they often encounter challenges such as hallucination, bias, and the production of unsafe, low-quality output. To effectively address these issues, it is crucial to align these models with desired behaviors based on feedback from a multimodal judge. Despite their significance, current multimodal judges frequently undergo inadequate evaluation of their capabilities and limitations, potentially leading to misalignment and unsafe fine-tuning outcomes.
 To address this issue, we introduce MJ-Bench, a novel benchmark which incorporates a comprehensive preference dataset to evaluate multimodal judges in providing feedback for image generation models across four key perspectives: **alignment**, **safety**, **image quality**, and **bias**.
+![Dataset Overview](https://raw.githubusercontent.com/MJ-Bench/MJ-Bench.github.io/main/static/images/dataset_overview.png)
 Specifically, we evaluate a large variety of multimodal judges including
 - 6 smaller-sized CLIP-based scoring models
 - 11 open-source VLMs (e.g. LLaVA family)
 - 4 and close-source VLMs (e.g. GPT-4o, Claude 3)
+-
+![Evaluation result](https://github.com/MJ-Bench/MJ-Bench.github.io/blob/main/static/images/radar_plot.png)
 We are actively updating the [leaderboard](https://mj-bench.github.io/) and you are welcome to submit the evaluation result of your multimodal judge on [our dataset](https://huggingface.co/datasets/MJ-Bench/MJ-Bench) to [huggingface leaderboard](https://huggingface.co/spaces/MJ-Bench/MJ-Bench-Leaderboard).