Benchmarking Trustworthiness of Multimodal Large Language Models: A Comprehensive Study
Paper • 2406.07057
Collection of external scientific material relevant to the project
Note: General overview of the performance of popular inference engines across different workload scenarios and GPU platforms.
Note: This paper is about proper, systematic evaluation as a basis for adjusting claims appropriately. Such evaluation reveals hidden fragilities and failure modes (such as reliance on masking artifacts, shortcut learning, or flawed reasoning) and highlights how models may perform well on specific benchmarks while failing to generalize.