When Judgment Becomes Noise: How Design Failures in LLM Judge Benchmarks Silently Undermine Validity Paper • 2509.20293 • Published Sep 24, 2025 • 8
When Do Neural Nets Outperform Boosted Trees on Tabular Data? Paper • 2305.02997 • Published May 4, 2023
MARVIS: Modality Adaptive Reasoning over VISualizations Paper • 2507.01544 • Published Jul 2, 2025 • 13
LiveBench: A Challenging, Contamination-Free LLM Benchmark Paper • 2406.19314 • Published Jun 27, 2024 • 23
TuneTables: Context Optimization for Scalable Prior-Data Fitted Networks Paper • 2402.11137 • Published Feb 17, 2024