Could AI “world models” be the most viable path to fully photorealistic interactive VR?
Much of the “virtual reality” seen in science fiction involves putting on a headset, or connecting a neural interface, to enter an interactive virtual world that looks and behaves as if entirely real. In contrast, while high-end VR of today can have relatively realistic graphics, it’s still very obviously not real, and these virtual worlds take years of development and hundreds of thousands or millions of dollars to build. Worse, mainstream standalone VR has graphics that in the absolute best cases look like an early PS4 game, and on average look closer to a late-stage PS2 title.
With each new Quest headset generation, Meta and Qualcomm have delivered a doubling of GPU performance. While impressive, this path will take decades to reach even the performance of today’s PC graphics cards, never mind getting anywhere near photorealism, and many of those future gains will be spent on increasing resolution. Techniques like eye-tracked foveated rendering and neural upscaling will help, but can only go so far.
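To put “decades” in rough perspective, here’s a back-of-the-envelope sketch. The TFLOPS figures and generation cadence below are illustrative assumptions for the sake of the arithmetic, not measured specs.

```python
import math

# Illustrative assumptions, not measured specs: a current standalone
# headset GPU versus a current high-end PC GPU, in rough FP32 TFLOPS.
standalone_tflops = 2.0    # assumed ballpark for a mobile-class VR chipset
pc_gpu_tflops = 80.0       # assumed ballpark for a flagship desktop GPU
years_per_generation = 3   # assumed cadence between headset generations

doublings_needed = math.log2(pc_gpu_tflops / standalone_tflops)
years_needed = doublings_needed * years_per_generation
print(f"~{doublings_needed:.1f} doublings, roughly {years_needed:.0f} years")
# -> ~5.3 doublings, roughly 16 years, before any gains are spent on resolution
```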
Gaussian splatting does allow for photorealistic graphics on standalone VR, but splats represent only a moment in time, and have to be captured from the real-world or pre-rendered as 3D environments in the first place. Adding real-time interactivity requires a hybrid approach that incorporates traditional rendering.
But there may be an entirely different path to photorealistic interactive virtual worlds. One far stranger, and fraught with its own issues, yet potentially far more promising.
Yesterday, Google DeepMind revealed Genie 3, an AI model that generates a real-time interactive video stream from a text prompt. It’s essentially a near-photorealistic video game where each frame is entirely AI-generated, with no traditional rendering or image input involved at all.
Google calls Genie 3 a “world model”, but it could also be described as an interactive video model. The initial input is a text prompt, the real-time input is mouse and keyboard, and the output is a video stream.
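There is no public API for Genie 3, but in rough pseudocode terms the interaction loop looks more like a game engine’s than a chatbot’s. The sketch below is purely illustrative; `model.initialize` and `model.step` are hypothetical names, not anything Google has published.

```python
import time

def run_world_model(model, prompt, get_user_input, display, fps=24):
    """Hypothetical interaction loop for an interactive 'world model':
    a text prompt seeds the world, then every frame is generated from
    the running context plus the latest keyboard/mouse action."""
    state = model.initialize(prompt)               # world conjured from text
    frame_budget = 1.0 / fps
    while True:
        start = time.monotonic()
        action = get_user_input()                  # e.g. WASD + mouse-look deltas
        frame, state = model.step(state, action)   # next frame, fully AI-generated
        display(frame)                             # no traditional rendering involved
        time.sleep(max(0.0, frame_budget - (time.monotonic() - start)))
```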
What’s remarkable about the Genie series, as with many other generative AI systems, is the staggering pace of progress.
Revealed in early 2024, the original Genie was primarily focused on generating 2D side-scrollers at 256×256 resolution, and could only run for a few dozen frames before the world would glitch and smear into an inconsistent mess, which is why the sample clips shown were only a second or two long.
Genie 1, from February 2024.
Then, in December, Genie 2 stunned the AI industry by demonstrating a world model for 3D graphics, with first-person or third-person control via standard mouse-look and WASD or arrow key controls. It output at 360p 15fps and could run for around 10-20 seconds, after which the world would begin to lose coherence.
Genie 2’s output was also blurry and low-detail, with the distinctly AI-generated look you might recognize from the older video generation models of a few years ago.
Genie 2 from December (left) vs Genie 3
Genie 3 is a significant leap forward. It outputs highly realistic graphics at 720p 24fps, with environments remaining fully consistent for 1 minute, and “largely” consistent for “several minutes”.
If you’re still not quite sure what Genie 3 does in practice, let me spell it out clearly: you type a description of the virtual world you want, and within a few seconds it appears on screen, navigable through standard keyboard and mouse movement controls.
And these virtual worlds are not static. Doors will open when you approach them, moving objects will cast dynamic shadows, and you can even see physics interactions like splashes and ripples in water as objects disturb it.
In this demo, you can see the character’s boots disturbing puddles on the ground.
Arguably the most fascinating aspect of Genie 3 is that these behaviors are emergent, learned by the underlying AI model during training rather than preprogrammed. While human developers often spend months implementing simulations of just one aspect of physics, Genie 3 simply has this knowledge baked in. That’s why Google calls it a “world model”.
More involved interactivity can be achieved by specifying the interactions in the prompt.
In one example clip, the prompt “POV action camera of a tan house being painted by a first person agent with a paint roller” was entered to essentially generate a photorealistic wall painting mini-game.
Prompt: “POV action camera of a tan house being painted by a first person agent with a paint roller”
Genie 3 also adds support for “promptable world events”, from changing the weather to adding new objects and characters.
These event prompts could come from the player, for example through voice input, or be pre-scheduled by a world creator.
This could one day allow for a near-infinite variety of new content and events in virtual worlds, in contrast to the weeks or months it takes a traditional development team to ship updates.
Genie 3’s “promptable world events” in action.
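To picture the “pre-scheduled by a world creator” case, a creator-authored event timeline might look something like the sketch below. The format and the `apply_event` call are hypothetical illustrations, not anything Google has described.

```python
# Purely illustrative: a creator-authored timeline of "world event" prompts
# to inject into a hypothetical running world model at set times (in seconds).
scheduled_events = [
    (30,  "a sudden thunderstorm rolls in"),
    (90,  "a herd of deer crosses the path ahead"),
    (180, "the sun sets and streetlights flicker on"),
]

def fire_due_events(world, elapsed_seconds, already_fired):
    """Apply any scheduled event prompts whose time has passed."""
    for t, event_prompt in scheduled_events:
        if elapsed_seconds >= t and t not in already_fired:
            world.apply_event(event_prompt)   # hypothetical API
            already_fired.add(t)
```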
Of course, 720p 24fps is far below what modern gamers expect, and gameplay sessions last a lot longer than a minute or two. Given the pace of progress, though, those basic technical limitations will likely fade away in coming years.
When it comes to adapting a model like Genie 3 for VR, other more mundane issues emerge.
The model would at minimum need to take a 6DoF head pose as input, as well as directional movement, and ideally would need to incorporate your hands and even body pose, unless you just want to walk around the world without directly interacting with any object.
None of that is impossible in theory, but would likely require a much wider training dataset and significant architectural changes to the model.
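As a rough illustration of how much richer that conditioning signal would need to be than keyboard and mouse, a per-frame VR input might look something like this speculative sketch; none of these field names come from Google.

```python
from dataclasses import dataclass, field

@dataclass
class Pose:
    position: tuple[float, float, float]             # meters, world space
    orientation: tuple[float, float, float, float]   # quaternion (x, y, z, w)

@dataclass
class VRFrameInput:
    """Speculative per-frame conditioning for a VR-ready world model,
    versus Genie 3's current keyboard-and-mouse action input."""
    head: Pose                                   # 6DoF head pose from the headset
    left_hand: Pose | None = None                # None if the hand isn't tracked
    right_hand: Pose | None = None
    move_axis: tuple[float, float] = (0.0, 0.0)  # thumbstick locomotion
    body_joints: list[Pose] = field(default_factory=list)  # optional body tracking
```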
It would also, of course, need to output stereoscopic imagery, though the second eye’s view could be synthesized from the first, either with AI view synthesis or traditional techniques like YORO.
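As a toy illustration of second-eye synthesis, assuming a per-pixel depth estimate were available (for example from a separate monocular depth model), the generated eye’s pixels could be reprojected by their stereo disparity. Production techniques like YORO are considerably more sophisticated than this sketch.

```python
import numpy as np

def synthesize_second_eye(left_rgb, depth_m, focal_px, ipd_m=0.063):
    """Toy depth-based reprojection: shift each left-eye pixel horizontally
    by its stereo disparity to approximate the right-eye view. Disocclusion
    holes are left black; real methods fill or hallucinate them."""
    h, w, _ = left_rgb.shape
    right = np.zeros_like(left_rgb)
    disparity = (focal_px * ipd_m / np.maximum(depth_m, 0.01)).astype(int)
    xs = np.arange(w)
    for y in range(h):
        new_x = xs - disparity[y]           # right eye sees objects shifted left
        valid = (new_x >= 0) & (new_x < w)
        right[y, new_x[valid]] = left_rgb[y, xs[valid]]
    return right
```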
Latency could also be a concern, but Google claims Genie 3 has an end-to-end control latency of 50 milliseconds, which is surprisingly close to the 41.67 ms theoretical minimum for a 24fps flatscreen game. If a future model can run at 90fps, then combined with VR reprojection, latency shouldn’t be an issue.
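The comparison above is just frame-time arithmetic:

```python
# Frame-time budgets implied by the figures above.
for fps in (24, 90):
    print(f"{fps}fps -> {1000 / fps:.2f} ms per frame")
# 24fps -> 41.67 ms per frame  (Google's claimed 50 ms latency is ~1.2 frames)
# 90fps -> 11.11 ms per frame  (the per-frame budget a VR-ready model would need)
```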
However, there is another far more fundamental issue with AI “world models” like Genie 3 that will limit their scope, and it’s why traditional rendering won’t be going away any time soon.
That issue is called steerability – how closely the output matches the details of your text prompt.
You’ve likely seen impressive examples of highly realistic AI image generation in recent years, and of AI video generation (such as Google DeepMind’s Veo 3) in recent months. But if you haven’t used them yourself, you may not realize that while these models will follow your instructions in a general sense, they’ll often fail to match details you specify.
Further, if their output includes something you don’t want, even adjusting the prompt to remove it will often fail. As an example, I recently asked Veo 3 to generate a video involving someone holding a hotdog with only ketchup, no mustard. But no matter how sternly I stressed that detail, the model would not generate a hotdog without mustard.
In traditionally rendered video games, you see exactly what the developers intended you to see. The minute details of the art direction and style create a unique feel to the virtual world, often painstakingly achieved through years of refinement.
In contrast, the output of AI models comes from a latent space shaped by patterns in the training data. The text prompt is closer to a hyperdimensional coordinate than a truly understood command, and thus will never match exactly what an artist had in mind. This becomes even trickier when promptable world events get involved.
Of course, the steerability of AI world models will improve over time too. But it’s a far tougher challenge than just boosting the resolution and memory horizon, and may never allow the kind of precise control of a traditional game engine.
Prompt: “A classroom where on the blackboard at the front of the room it says GENIE-3 MEMORY TEST and underneath is a beautiful chalk picture of an apple, a mug of coffee, and a tree. The classroom is empty except for this. Outside the window are trees and a few cars driving past.”
Still, even with the steerability issue, it would be foolish not to see the appeal of eventual photorealistic interactive VR worlds that you can conjure into existence by simply speaking or typing a description. AI world models seem uniquely well positioned to deliver on the promise of Star Trek’s Holodeck, more so than even using AI to generate assets for traditional rendering.
To be clear, we’re still in the early days of AI “world models”. There are a number of major challenges to solve, and it will likely be many years until a VR-capable one that can run for hours arrives for your headset. But the pace of progress here is stunning, and the potential is tantalizing. This is an area of research we’ll be paying very close attention to.