OpenAI’s Atlas Browser Agent Shows Promise But Faces Limitations in Real-World Testing

OpenAI’s Web Automation Agent Shows Mixed Results in Early Testing

OpenAI’s newly announced Atlas browser with integrated ChatGPT features includes an “Agent Mode” that reportedly can “get work done for you” by actively clicking, scrolling, and reading through webpages. According to reports from early testing, this agentic AI demonstrates promising capabilities but faces significant limitations in practical applications.

Sources indicate that while Agent Mode represents a significant step forward in web automation, the feature remains in “preview mode” with clear constraints on session length that limit its utility for extended tasks.

Gaming Capabilities Demonstrate Basic Competence

In one test scenario, the Atlas agent was tasked with playing the popular tile-sliding game 2048. The report states that the agent successfully navigated to the game website, closed tutorial pop-ups, and figured out how to use arrow keys to play without human guidance.

Analysts suggest the agent showed some strategic thinking, with activity summaries indicating it was “looking ahead for simple strategies” and attempting to align tiles for merging. However, the testing revealed limitations when the agent stopped playing after just four minutes with a score of 356, requiring additional prompts to continue. The final score of 3,164 points was reportedly similar to what a novice human player might achieve.
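For readers unfamiliar with the game, a “simple look-ahead strategy” in 2048 can be as basic as trying each of the four moves on a copy of the board and picking the one that merges the most tiles. The Python sketch below illustrates that idea only; it is not the agent's actual method, which the report does not describe, and the function names (`slide_row_left`, `move`, `best_move`) are hypothetical. It also ignores the random tile that the real game spawns after every move.

```python
from typing import List, Optional, Tuple

Board = List[List[int]]  # 0 marks an empty cell


def slide_row_left(row: List[int]) -> Tuple[List[int], int]:
    """Slide one row left, merging each pair of equal neighbours once."""
    tiles = [v for v in row if v != 0]
    merged: List[int] = []
    points = 0
    i = 0
    while i < len(tiles):
        if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
            merged.append(tiles[i] * 2)   # two equal tiles become one
            points += tiles[i] * 2        # the merged value is the score gained
            i += 2
        else:
            merged.append(tiles[i])
            i += 1
    return merged + [0] * (len(row) - len(merged)), points


def move(board: Board, direction: str) -> Tuple[Board, int]:
    """Apply 'left', 'right', 'up' or 'down' without spawning a new tile."""
    vertical = direction in ("up", "down")
    reverse = direction in ("right", "down")
    rows = [list(col) for col in zip(*board)] if vertical else [list(r) for r in board]
    new_rows: Board = []
    total = 0
    for row in rows:
        src = row[::-1] if reverse else row
        slid, pts = slide_row_left(src)
        new_rows.append(slid[::-1] if reverse else slid)
        total += pts
    if vertical:
        new_rows = [list(col) for col in zip(*new_rows)]  # transpose back
    return new_rows, total


def best_move(board: Board) -> Optional[str]:
    """One-step look-ahead: prefer the most points, then the most empty cells."""
    best_dir: Optional[str] = None
    best_key = (-1, -1)
    for direction in ("left", "right", "up", "down"):
        new_board, points = move(board, direction)
        if new_board == board:
            continue  # the move changes nothing, so it is not legal
        empties = sum(cell == 0 for row in new_board for cell in row)
        if (points, empties) > best_key:
            best_key = (points, empties)
            best_dir = direction
    return best_dir  # None means no legal move is left


if __name__ == "__main__":
    board = [
        [2, 2, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0],
    ]
    print(best_move(board))
```

A greedy rule like this tends to stall quickly, which is consistent with novice-level scores; stronger players and programs look several moves ahead and account for where new tiles may appear.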

Multi-Website Navigation Shows Promise

Perhaps more impressive was the agent’s performance in building a playlist across multiple platforms. When asked to turn a Pittsburgh public radio station’s broadcast into a Spotify playlist, the agent reportedly ran into difficulties on Radio Garden and switched to the station’s official website at WYEP.org.

The report indicates the agent successfully identified “Now Playing” information, logged into Spotify, searched for songs, and added them to a new playlist. However, technical constraints limited most sessions to just a few minutes, allowing only 2-4 songs to be added before the agent stopped working.

Email Processing and Data Collection Capabilities

In another test, the Atlas agent was tasked with scanning emails to create a reference spreadsheet of PR contacts. According to the testing report, the agent correctly identified Gmail as the email platform and differentiated between personal and professional accounts.

The agent reportedly performed targeted searches for recent PR emails, clicked through messages, and extracted contact information including names, email addresses, phone numbers, and company details. Within seven minutes, it had created a well-formatted Google Sheet with 12 complete entries, though it stopped before processing all 164 emails returned by the initial search.

Content Creation and Website Building

The testing also explored the agent’s ability to create web content. When asked to build a Star Trek fan site on Neocities dedicated to the character Tuvix, the agent reportedly generated a basic but functional website in just two minutes.

Sources indicate the agent aggregated information from various Trek-related sources including Memory Alpha and incorporated headers like “The Hero Starfleet Murdered” and “Justice for Tuvix.” However, the report notes the agent used direct links to externally hosted images rather than uploading copies, resulting in broken images on the final site.

Practical Applications Show Mixed Success

In a practical test involving Texas electricity plans, the agent spent eight minutes navigating PowerToChoose.org before recommending a specific plan with its associated fact sheet. According to analysis from an energy expert, the agent “didn’t screw up the assignment” by selecting a fixed-rate plan rather than variable pricing, though the specific recommendation might not have been optimal for all users.

However, the agent struggled significantly with downloading Steam game demos, reportedly spending nearly ten minutes in confusion loops while attempting to locate and download macOS demos before the test was abandoned.

Technical Constraints Limit Practical Utility

Across multiple tests, the most significant limitation appeared to be what OpenAI describes as “technical constraints on session length.” The report states that most tasks were limited to just a few minutes of automated activity, preventing the completion of more extensive assignments.

Analysts suggest that while Atlas’ Agent Mode demonstrates impressive capabilities in interpreting webpage content and navigating interfaces, the session limitations significantly reduce its utility as a “set it and forget it” automation tool. The testing concluded that for simple, repetitive tasks that can be spot-checked by humans, the technology already shows practical value despite its current constraints.