SWE-bench-Live's Collections

Cross-platform-bench

These benchmarks evaluate LM agents on SWE and computer-use tasks across different operating systems.