Generates full app code from a simple description
Tracks performance of LLMs, VLMs, and agents on web navigation tasks