According to Tom’s Guide, classic Pokémon games from the 1990s have become a major benchmark for testing advanced AI systems like GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro. In February 2025, Anthropic researcher David Hershey livestreamed Claude 3.7 Sonnet attempting to play Pokémon Red; the model frequently got stuck for hours and never completed the game. Google’s Gemini 2.5 Pro, by contrast, did beat Pokémon Blue in May 2025, but it took far more time and steps than a human player would. Researchers point to the game’s demanding mix of long-term strategy, memory, and planning as the key reason these models struggle, even though the franchise has sold over 489 million video game copies worldwide as of early 2026.
The hilarious struggle is real
Here’s the thing that’s both funny and telling: these are models that can write code and pass bar exams. But ask one to navigate from Pallet Town to Pewter City? You might as well be asking a toddler to file your taxes. The report notes that earlier Claude models would just wander aimlessly, unable to get past the first town. It’s a stark reminder that “knowing” something in a training dataset is wildly different from being able to execute a plan in a dynamic, unpredictable environment. I think that’s the core insight. The AI might have the entire Pokémon wiki memorized, but turning that knowledge into a coherent, step-by-step journey to become the champion is a whole other beast. It’s clumsy. It gets stuck in menus. It forgets what it was doing. Sound familiar? It should—it’s basically how I play after not touching a game for six months.
Why Pokémon is the perfect AI sandbox
So why Pokémon? Why not a harder modern game? The article, citing an explanation from Perplexity AI, nails a few key reasons. First, it’s turn-based. This removes the need for lightning-fast reflexes, letting researchers isolate pure planning and reasoning. Second, it’s less constrained than old benchmarks like Pong or Atari games. There are hundreds of Pokémon, different paths, and countless strategies. This open-endedness exposes an AI’s weaknesses in a way a simple game of Pong never could. And finally, let’s be honest: it’s culturally familiar. Researchers grew up with it. They intuitively know what a “good” playthrough looks like, making it easier to evaluate failure. Plus, livestreams of an AI bumbling through Viridian Forest are just more entertaining to watch than a graph of accuracy percentages.
What this really tells us about AI’s limits
This isn’t just a quirky experiment. It’s exposing a fundamental gap in today’s large language models. As independent researcher Peter Whidden noted, these models “know almost everything about Pokémon” but can’t execute. They lack persistent working memory and the ability to chain together hundreds of small decisions toward a distant goal. It’s the difference between having a perfect roadmap and actually driving the car across the country, dealing with traffic, detours, and your own need for a snack break. The success of specialized bots using methods like Monte Carlo Tree Search shows there are other paths to “intelligence” in games. But for the general-purpose LLMs we’re all talking about? Pokémon highlights that they’re brilliant assistants but terrible independent operators when a task requires sustained, logical execution. That has huge implications for anyone hoping to deploy AI for complex, multi-step business or logistics tasks.
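To make the contrast concrete, here’s a minimal sketch of what those specialized bots do differently. This is generic Monte Carlo Tree Search on a toy turn-based game invented for illustration (reach position 10 in at most 6 moves of +1 or +2), not anything derived from the actual Pokémon bots; all names here (`Node`, `mcts`, `rollout`, etc.) are hypothetical. The point is the loop: instead of recalling facts, the search repeatedly simulates futures and backs up results, which is exactly the kind of sustained decision-chaining LLMs lack.

```python
import math
import random

# Toy single-player game (a stand-in, NOT Pokémon): start at position 0,
# reach GOAL within MAX_STEPS using moves of +1 or +2. Reward 1 on success.
GOAL, MAX_STEPS = 10, 6
ACTIONS = (1, 2)

def step(state, action):
    pos, t = state
    return (pos + action, t + 1)

def is_terminal(state):
    pos, t = state
    return pos >= GOAL or t >= MAX_STEPS

def reward(state):
    return 1.0 if state[0] >= GOAL else 0.0

class Node:
    def __init__(self, state, parent=None, action=None):
        self.state, self.parent, self.action = state, parent, action
        self.children, self.visits, self.value = [], 0, 0.0

    def untried_actions(self):
        tried = {c.action for c in self.children}
        return [a for a in ACTIONS if a not in tried]

def ucb1(child, parent_visits, c=1.4):
    # Standard UCB1: balance average value (exploitation) vs. uncertainty.
    if child.visits == 0:
        return float("inf")
    return child.value / child.visits + c * math.sqrt(
        math.log(parent_visits) / child.visits)

def rollout(state):
    # Play randomly to the end of the game and report the outcome.
    while not is_terminal(state):
        state = step(state, random.choice(ACTIONS))
    return reward(state)

def mcts(root_state, iterations=500):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend via UCB1 while fully expanded.
        while not node.untried_actions() and node.children:
            node = max(node.children, key=lambda c: ucb1(c, node.visits))
        # 2. Expansion: add one untried child (unless terminal).
        if not is_terminal(node.state) and node.untried_actions():
            a = random.choice(node.untried_actions())
            child = Node(step(node.state, a), parent=node, action=a)
            node.children.append(child)
            node = child
        # 3. Simulation: random playout from the new node.
        value = rollout(node.state)
        # 4. Backpropagation: credit the whole path with the result.
        while node is not None:
            node.visits += 1
            node.value += value
            node = node.parent
    # Act on the most-visited child: the robust standard choice.
    return max(root.children, key=lambda c: c.visits).action

if __name__ == "__main__":
    print("chosen first move:", mcts((0, 0)))
```

Nothing here “knows” anything about the game beyond the simulator itself; the competence comes entirely from search and bookkeeping. That’s why such bots finish games that encyclopedic LLMs can’t.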
From social experiment to AI lab
It’s poetic, really. The article reminds us that this all has a precursor in the 2014 “Twitch Plays Pokémon” phenomenon, where millions of people collaboratively (and chaotically) input commands to play the game. That was a test of crowd-sourced, organic “intelligence.” Now, we’re testing silicon-based intelligence in the same digital arena. The game that taught a generation about type advantages and grinding levels is now teaching AI researchers about the boundaries of machine reasoning. It’s a fascinating full-circle moment for a 30-year-old franchise. The takeaway is clear: if you want to see if an AI can truly think ahead and adapt, don’t give it an exam. Give it a Game Boy and a copy of Pokémon Red. The results will be far more revealing, and honestly, a lot more fun to watch.
