Update README.md
README.md CHANGED
@@ -24,7 +24,7 @@ We also release **EYE4ALL**, a human-annotated dataset for evaluating multimodal
> However, research progress is hindered by the lack of comprehensive benchmarks and by existing evaluation predictors lacking at least one of these key properties: (1) *Alignment with human judgments*, (2) *Long-sequence processing*, (3) *Inference efficiency*, and (4) *Applicability to multi-objective scoring*.
> To address these challenges, we propose a plug-and-play architecture to build a robust predictor, **MULTI-TAP** (**Multi**-Objective **T**ask-**A**ware **P**redictor), capable of both multi- and single-objective scoring.
> **MULTI-TAP** can produce a single overall score, utilizing a reward head built on top of a large vision-language model (LVLM).
- > We show that **MULTI-TAP** is robust in terms of application to different LVLM architectures, achieving significantly higher performance than existing metrics (*e.g.*, +42.3 Kendall's
+ > We show that **MULTI-TAP** is robust in terms of application to different LVLM architectures, achieving significantly higher performance than existing metrics (*e.g.*, +42.3 Kendall's tau-c compared to IXCREW-S on FlickrExp) and even on par with the GPT-4o-based predictor, G-VEval, with a smaller size (7-8B).
> By training a lightweight ridge regression layer on the frozen hidden states of a pre-trained LVLM, **MULTI-TAP** can produce fine-grained scores for multiple human-interpretable objectives.
> **MULTI-TAP** outperforms VisionREWARD, a high-performing multi-objective reward model, in both performance and efficiency on multi-objective benchmarks and on our newly released text-image-to-text dataset, **EYE4ALL**.
> Our new dataset, consisting of chosen/rejected human preferences (**EYE4ALLPref**) and human-annotated fine-grained scores across seven dimensions (**EYE4ALLMulti**), can serve as a foundation for developing more accessible AI systems by capturing the underlying preferences of users, including blind and low-vision (BLV) individuals.
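
The updated abstract mentions training a lightweight ridge regression layer on the frozen hidden states of a pre-trained LVLM to obtain per-objective scores. Below is a minimal sketch of that general recipe, not the MULTI-TAP implementation; all names, shapes, and the seven-dimension count (matching EYE4ALLMulti) are illustrative assumptions.

```python
# Illustrative sketch only: a lightweight ridge regression head per objective,
# fit on frozen hidden states of a pre-trained LVLM. Names and shapes are
# placeholders, not the MULTI-TAP code.
import numpy as np
from sklearn.linear_model import Ridge

# Suppose the frozen LVLM has already been run over N (image, text) examples
# and one pooled hidden-state vector per example has been extracted.
N, d, num_objectives = 1000, 4096, 7               # 7 dimensions, as in EYE4ALLMulti
hidden_states = np.random.randn(N, d)              # stand-in for real LVLM features
human_scores = np.random.rand(N, num_objectives)   # stand-in for fine-grained annotations

# One ridge regressor per objective; the LVLM itself stays frozen, so only
# these linear heads are trained.
heads = [Ridge(alpha=1.0).fit(hidden_states, human_scores[:, k])
         for k in range(num_objectives)]

def predict_multi_objective(features: np.ndarray) -> np.ndarray:
    """Return a (num_objectives,) vector of scores for a single example."""
    return np.array([head.predict(features[None, :])[0] for head in heads])

print(predict_multi_objective(hidden_states[0]))
```

Because only the linear heads are fit and the LVLM stays frozen, the expensive forward pass can be computed once and its hidden states reused across all objectives.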