Update README.md
README.md CHANGED
@@ -24,7 +24,7 @@ We also release **EYE4ALL**, a human-annotated dataset for evaluating multimodal
> However, research progress is hindered by the lack of comprehensive benchmarks and by existing evaluation predictors lacking at least one of these key properties: (1) *Alignment with human judgments*, (2) *Long-sequence processing*, (3) *Inference efficiency*, and (4) *Applicability to multi-objective scoring*.
> To address these challenges, we propose a plug-and-play architecture to build a robust predictor, **MULTI-TAP** (**Multi**-Objective **T**ask-**A**ware **P**redictor), capable of both multi- and single-objective scoring.
> **MULTI-TAP** can produce a single overall score, utilizing a reward head built on top of a large vision-language model (LVLM).
- > We show that **MULTI-TAP** is robust in terms of application to different LVLM architectures, achieving significantly higher performance than existing metrics (*e.g.*, +42.3 Kendall's
+ > We show that **MULTI-TAP** is robust in terms of application to different LVLM architectures, achieving significantly higher performance than existing metrics (*e.g.*, +42.3 Kendall's tau-c compared to IXCREW-S on FlickrExp) and even on par with the GPT-4o-based predictor, G-VEval, with a smaller size (7-8B).
> By training a lightweight ridge regression layer on the frozen hidden states of a pre-trained LVLM, **MULTI-TAP** can produce fine-grained scores for multiple human-interpretable objectives.
> **MULTI-TAP** outperforms VisionREWARD, a high-performing multi-objective reward model, in both performance and efficiency on multi-objective benchmarks and on our newly released text-image-to-text dataset, **EYE4ALL**.
> Our new dataset, consisting of chosen/rejected human preferences (**EYE4ALLPref**) and human-annotated fine-grained scores across seven dimensions (**EYE4ALLMulti**), can serve as a foundation for developing more accessible AI systems by capturing the underlying preferences of users, including blind and low-vision (BLV) individuals.
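
The updated abstract mentions training a lightweight ridge regression layer on the frozen hidden states of a pre-trained LVLM to obtain per-objective scores. Below is a minimal sketch of that general recipe, not the MULTI-TAP implementation; all names, shapes, and the seven-dimension count (matching EYE4ALLMulti) are illustrative assumptions.

```python
# Illustrative sketch only: a lightweight ridge regression head per objective,
# fit on frozen hidden states of a pre-trained LVLM. Names and shapes are
# placeholders, not the MULTI-TAP code.
import numpy as np
from sklearn.linear_model import Ridge

# Suppose the frozen LVLM has already been run over N (image, text) examples
# and one pooled hidden-state vector per example has been extracted.
N, d, num_objectives = 1000, 4096, 7               # 7 dimensions, as in EYE4ALLMulti
hidden_states = np.random.randn(N, d)              # stand-in for real LVLM features
human_scores = np.random.rand(N, num_objectives)   # stand-in for fine-grained annotations

# One ridge regressor per objective; the LVLM itself stays frozen, so only
# these linear heads are trained.
heads = [Ridge(alpha=1.0).fit(hidden_states, human_scores[:, k])
         for k in range(num_objectives)]

def predict_multi_objective(features: np.ndarray) -> np.ndarray:
    """Return a (num_objectives,) vector of scores for a single example."""
    return np.array([head.predict(features[None, :])[0] for head in heads])

print(predict_multi_objective(hidden_states[0]))
```

Because only the linear heads are fit and the LVLM stays frozen, the expensive forward pass can be computed once and its hidden states reused across all objectives.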