Update root eval_baseline.py to 17 tasks + long-horizon + health retry
951f2d1 (Verified) Pathikreet committed on Apr 25