Not to belittle this or anything (it does look good and show promise), it feels like they somehow generate several consistent (but discrete) views of a given world, then feed all that to the good old pose estimation + gaussian splatting workflow. Whenever you leave the generated area (which isn't exactly huge on the few I tested) you get tell-tale signs of GS.
Yeah, if the entire point is that you can move around inside those worlds, I'd have expected a bit more "walkability" - maybe a few different viewpoints that each have their own Gaussian splatting? Right now, it dissolves pretty quickly once you change the location.
This was my take as well — this is just pose estimation from generated stereo panoramic images.
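If that guess is right, the offline part would look a lot like a standard photogrammetry-plus-3DGS workflow. A purely speculative sketch (not anything World Labs has confirmed), assuming a folder of consistent generated views, the COLMAP CLI on PATH, and a placeholder train_gaussians.py standing in for whichever 3DGS trainer you'd use:

    # Hypothetical "generated views -> camera poses -> Gaussian splats" pipeline.
    # Assumes generated views in ./views and the COLMAP CLI installed.
    import os
    import subprocess

    def run(cmd):
        print(">>", " ".join(cmd))
        subprocess.run(cmd, check=True)

    os.makedirs("sparse", exist_ok=True)

    # 1. Ordinary structure-from-motion on the generated views to recover poses.
    run(["colmap", "feature_extractor",
         "--database_path", "scene.db", "--image_path", "views"])
    run(["colmap", "exhaustive_matcher", "--database_path", "scene.db"])
    run(["colmap", "mapper",
         "--database_path", "scene.db", "--image_path", "views",
         "--output_path", "sparse"])

    # 2. Fit a Gaussian-splatting scene to the posed images.
    #    train_gaussians.py is a placeholder for any 3DGS trainer that reads
    #    a COLMAP reconstruction.
    run(["python", "train_gaussians.py", "--images", "views", "--sparse", "sparse"])

That would also fit the behaviour described above: once you leave the region covered by the generated views, you just get the usual splat artifacts.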
Yeah, it's more of a somewhat-3D drawing of a frame that you can navigate inside, rather than a world that happens to fit whatever image you use as an input but also makes sense as a standalone world when you walk around. For being a "world" model, it doesn't seem to grasp physical space very well.
The interior scenes look and walk great, but any scenes with/in exteriors seem kind of bad.
Are there any experts that could help me bootstrap myself on the current literature on "world models?"
In this current generation, "world models" is basically a marketing term. You can research Gaussian splatting, novel view synthesis, neural radiance fields (NeRF), etc. I find Mr Nerf good to follow: https://x.com/janusch_patas
There is another thing called world models that involves predicting the state of something after some action. But this is a very very limited area of research. My understanding of this is that there just isn't much data of action->reaction.
Same issue with Gaussian splatting/NeRF really: very little data (relative to text/images/videos) of text -> 3D splats. I'd guess what World Labs is doing is text -> image -> splats, but of course that's just speculation.
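To make the action -> reaction framing concrete: the minimal version is just supervised learning on logged (state, action, next state) transitions, which is exactly where the data shortage bites. A toy sketch in plain PyTorch, with made-up dimensions and random stand-in data:

    # Toy action-conditioned world model: predict the next state from (state, action).
    # Purely illustrative; dimensions and data are invented.
    import torch
    import torch.nn as nn

    STATE_DIM, ACTION_DIM = 16, 4

    model = nn.Sequential(
        nn.Linear(STATE_DIM + ACTION_DIM, 128),
        nn.ReLU(),
        nn.Linear(128, STATE_DIM),  # predicted next state
    )
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Stand-in for a dataset of logged transitions.
    states = torch.randn(1024, STATE_DIM)
    actions = torch.randn(1024, ACTION_DIM)
    next_states = torch.randn(1024, STATE_DIM)

    for step in range(100):
        pred = model(torch.cat([states, actions], dim=-1))
        loss = nn.functional.mse_loss(pred, next_states)
        opt.zero_grad()
        loss.backward()
        opt.step()

The hard part isn't the model, it's collecting enough real (state, action, next state) data to train it on, which is the scarcity pointed out above.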
> There is another thing called world models that involves predicting the state of something after some action. But this is a very very limited area of research. My understanding of this is that there just isn't much data of action->reaction.
Folks interested in this can look up Yann LeCun's work on world models and JEPA, which his team at Meta created. This lecture is a nice summary of his thinking on this space and also why he isn't a fan of autoregressive LLMs: https://www.youtube.com/watch?v=yUmDRxV0krg
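As I understand the JEPA idea (a rough paraphrase, not Meta's actual architecture), the key move is to predict in representation space instead of pixel space: encode the context, encode the target with a slowly-updated copy of the encoder, and train a predictor to match the target embedding. A minimal sketch with invented dimensions:

    # Rough JEPA-style sketch: predict the *embedding* of the target view/frame,
    # not its pixels. Everything here (sizes, encoders, EMA rate) is made up.
    import copy
    import torch
    import torch.nn as nn

    DIM = 64

    context_encoder = nn.Sequential(nn.Linear(DIM, 128), nn.ReLU(), nn.Linear(128, 32))
    target_encoder = copy.deepcopy(context_encoder)  # updated by EMA, not by gradients
    for p in target_encoder.parameters():
        p.requires_grad_(False)
    predictor = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 32))

    opt = torch.optim.Adam(
        list(context_encoder.parameters()) + list(predictor.parameters()), lr=1e-3)

    context = torch.randn(256, DIM)  # e.g. visible patches / current observation
    target = torch.randn(256, DIM)   # e.g. masked patches / future observation

    for step in range(100):
        pred = predictor(context_encoder(context))
        with torch.no_grad():
            tgt = target_encoder(target)
        loss = nn.functional.mse_loss(pred, tgt)  # the loss lives in latent space
        opt.zero_grad()
        loss.backward()
        opt.step()
        # Slow EMA update of the target encoder (a common anti-collapse trick).
        with torch.no_grad():
            for pt, pc in zip(target_encoder.parameters(), context_encoder.parameters()):
                pt.mul_(0.99).add_(0.01 * pc)

Roughly, the pitch in the lecture is that predicting abstract representations sidesteps having to model every pixel of an inherently unpredictable future, which is part of his argument against autoregressive generation.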
It's amazing to see how this space is developing. About 7 years ago I was building "spatial media" with https://ayvri.com
Nobody believed us when we said AI would create 3D virtual worlds that were indistinguishable from the real thing, and we'd be able to transport people to different places.
I particularly like the artistic effect of the drawing that brings the person into this world. Like a point-cloud that then gets "filled in".
I have little doubt this was a design decision and I think it is very well executed.
Even more amazing to me is that the tech to create these really existed 7 years ago (would have been slower to train but most methods don't need the latest GPUs). This means there are no doubt more improvements just waiting to be discovered!
I'm looking forward to the future of games and movies if these world models keep improving. Imagine if anyone with an interesting idea could sketch it, plug it into a world model and share the result with everyone. It'd open up a huge amount of possibilities.
Not to mention being able to explore worlds from already existing works. Care to go for a ride on a broomstick? How about simply walking into Mordor? It's exciting.
Something about the camera perspective creates a skew that makes things feel artificial to me. It's a minor thing that bothers me, but I'd like the geometry to feel more like what I normally see. Video generation models tend to feel more natural in perspective.
It would be nice to have these world models integrated with Blender.
Blog post: https://www.worldlabs.ai/blog/marble-world-model (https://news.ycombinator.com/item?id=45907541)
Slightly off-topic: I've just watched this takedown of an AI-generated chart-topping song: https://www.youtube.com/watch?v=rGremoYVMPc&lc=UgxfDvqX1G6kp...
OK, so I've talked about this phenomenon with ChatGPT, and I think the issue is that to a lot of people, a song needs to be more than just a "song". There's some sort of requirement for it to be the un-faked result of certain experiences: it has to relate to something happening in reality, be derived from it, and not exist in a vacuum separated from the rest of reality. Otherwise, to them, the music isn't "real".
Endless drone ambient music disagrees with you that there is any sort of "requirement of certain experiences". Some of it is basically someone hitting play on a modular synth patch and letting it run until it sounds done, and (some) people are still fine with listening to it.
That's totally insane and amazing.
wow, it’s slop!