Add evaluate.py: Benchmark evaluation on HumanEval + MBPP f7a5fb7 verified teolm30 committed 4 days ago