Google Launches Gemini 2.5 Computer Use Model for UI-Based Agents
Executive Summary
Google has released the Gemini 2.5 Computer Use model in public preview, accessible via the Gemini API in Google AI Studio and Vertex AI. This specialized model is built on Gemini 2.5 Pro's capabilities and lets developers create AI agents that visually understand and interact with web and mobile user interfaces to automate tasks. Google claims the model outperforms competing solutions on key benchmarks at lower latency, and it includes integrated safety features to mitigate the risks of giving a model control of a computer.
Key Takeaways
* Product: The Gemini 2.5 Computer Use model, a specialized model for powering AI agents that can operate graphical user interfaces (GUIs).
* Functionality: Enables agents to perform human-like actions such as clicking, typing, and scrolling to navigate websites, fill forms, and operate interactive elements.
* Implementation: Accessed via a new `computer_use` tool in the Gemini API. The tool operates in an iterative loop: the model receives the user's request, a screenshot of the environment, and a history of recent actions, then proposes the next UI action; client code executes that action, captures a fresh screenshot, and sends it back until the task is complete.
* Performance: Claims superior performance and lower latency compared to leading alternatives on multiple web and mobile control benchmarks like Online-Mind2Web.
* Use Cases: Designed for building personal assistants, workflow automation tools, and robust UI testing systems.
* Safety Controls: Ships with built-in safety features, including a per-step safety service that assesses each proposed action before execution, and requires end-user confirmation for potentially high-risk actions such as making a purchase.
* Availability: Available now in public preview for developers via the Gemini API on Google AI Studio and Vertex AI. The model is optimized for web browsers and shows strong promise for mobile UI control, but it is not yet optimized for desktop OS-level control.
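The agent loop from the Implementation and Safety Controls bullets can be sketched as follows. This is a minimal, self-contained illustration with a mocked model step, not the real Gemini SDK: the `UIAction` type, the action names such as `click_at` and `type_text_at`, and the callback signatures are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class UIAction:
    """Hypothetical stand-in for a structured UI action proposed by the model."""
    name: str                            # e.g. "click_at", "type_text_at" (assumed names)
    args: dict = field(default_factory=dict)
    requires_confirmation: bool = False  # flagged for high-risk steps, e.g. a purchase

def run_agent_loop(
    goal: str,
    screenshot: bytes,
    model_step: Callable[[str, bytes], Optional[UIAction]],
    execute: Callable[[UIAction], bytes],
    confirm: Callable[[UIAction], bool],
    max_steps: int = 10,
) -> list[str]:
    """Screenshot -> model -> action loop, as described above.

    model_step returns the next proposed action (None when the task is done),
    execute performs the action in the browser and returns a fresh screenshot,
    confirm asks the end user to approve an action flagged as high-risk.
    """
    history: list[str] = []
    for _ in range(max_steps):
        action = model_step(goal, screenshot)
        if action is None:                # model signals task completion
            break
        if action.requires_confirmation and not confirm(action):
            history.append(f"declined:{action.name}")
            break                         # stop rather than act without approval
        screenshot = execute(action)      # act, then re-capture the UI state
        history.append(action.name)
    return history
```

In a real integration, `model_step` would call the Gemini API with the `computer_use` tool enabled, and `execute` would drive a browser automation layer; the loop structure and the per-action confirmation gate are the parts the announcement describes.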
Strategic Importance
This release positions Google to compete directly in the growing agentic AI space, enabling developers to build more capable agents that can automate complex workflows beyond traditional API integrations. It provides a critical building block for creating general-purpose AI assistants that can interact with the digital world as humans do.