I tried claude-4, gemini, o3, etc. to run this stress test with the latest version of browser use,
with vision, memory, planning... all on default settings.
But I kept getting 8-10/31 on this.
What's the expected score? It feels like this should be passing at a much higher rate.