Add MiniF2F benchmark harness and improve proof agent robustness 8c51ce7 p4r5kpftnp-cmd Claude Sonnet 4.6 committed on May 15