Gemini analyzes the scene and replies with a function call such as “click,” “type,” or “scroll,” which the client executes.
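To make the client side of that exchange concrete, here is a minimal sketch of how a client might dispatch such a function call. Everything in it is an assumption for illustration: the `FunctionCall` shape, the argument names (`x`, `y`, `text`, `dy`), and the `RecordingClient` stand-in are hypothetical and do not reflect the actual Gemini API schema. A real client would drive a browser or desktop instead of logging strings.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class FunctionCall:
    """Hypothetical shape of a model reply: an action name plus arguments."""
    name: str
    args: Dict[str, Any] = field(default_factory=dict)


class RecordingClient:
    """Stand-in for a real UI driver; it only records what it was asked to do."""

    def __init__(self) -> None:
        self.log: List[str] = []

    def click(self, x: int, y: int) -> None:
        self.log.append(f"click at ({x}, {y})")

    def type_text(self, text: str) -> None:
        self.log.append(f"type {text!r}")

    def scroll(self, dy: int) -> None:
        self.log.append(f"scroll by {dy}")


def execute(call: FunctionCall, client: RecordingClient) -> None:
    """Dispatch one function call to the matching client method."""
    # Map the action names mentioned in the text to client methods.
    handlers: Dict[str, Callable[..., None]] = {
        "click": client.click,
        "type": client.type_text,
        "scroll": client.scroll,
    }
    handler = handlers.get(call.name)
    if handler is None:
        # Refuse unknown actions rather than guessing what the model meant.
        raise ValueError(f"unsupported action: {call.name}")
    handler(**call.args)


if __name__ == "__main__":
    client = RecordingClient()
    for call in (
        FunctionCall("click", {"x": 10, "y": 20}),
        FunctionCall("type", {"text": "hello"}),
        FunctionCall("scroll", {"dy": -300}),
    ):
        execute(call, client)
    print("\n".join(client.log))
```

Keeping the name-to-handler mapping in one table means an unrecognized action fails loudly instead of being silently ignored, which is usually the safer default when the caller is a model.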