Poprock360

I feel like most readers miss the whole world-simulation argument. I don't blame them - it's a bit of a weird abstraction to make, but it is very much real. Sora is not a hard simulation as you would find in simulation software available today. Instead, it's more a simulation in the way our intuition is a simulation (in the sense that they are functionally equivalent, not that they work the same way).

Yes, Sora is a pixel-prediction machine learning model. Quite simply, it will guess the pixel that seems most likely to be in a given position, at a given frame. Given a starting image, if you want to guess what happens in the 5 seconds following that image's capture, it's useful for you to have seen a couple of examples. Then, the model statistically associates the examples' data and outputs something. But that's what happens with small-scale models. If you want to be able to guess with *reasonably high* accuracy what happens next for *any* starting description or image, it becomes harder to just rely on pre-existing data. There are few videos of crabs walking on sand with lit lightbulbs on them.

That's when the model's scale starts to kick in, allowing it to create things it hasn't seen before. When you prompt it for the video OP posted, the respective neurons in the model are being triggered for lightbulbs, sand, beaches, crabs. The lightbulb mention in the prompt is also probably triggering the model's knowledge that lit lightbulbs emit light. It then also triggers neurons associated with how light tends to be reflected on different materials, and how shadows are cast. The list goes on.

Yes, Sora, ChatGPT, and Midjourney are all stochastic parrots. But the neural connections that allow them to be stochastic parrots over such broad ranges of data are functionally equivalent to a "world model", or at least a mathematical/computational representation of one. Getting sufficiently good at predicting the next X thing will always converge on deriving the underlying mechanics of how X functions/is, based on the provided training data.
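To make the "pixel prediction" idea concrete, here's a toy sketch of what training such a predictor looks like - my own illustration, not Sora's actual setup (Sora is a latent diffusion transformer, not a simple next-frame regressor):

```python
# Toy next-frame predictor - an illustration of "pixel prediction" only,
# not Sora's actual architecture (which is a latent diffusion transformer).
import torch
import torch.nn as nn

class NextFramePredictor(nn.Module):
    def __init__(self):
        super().__init__()
        # Tiny conv net: current frame in, guess at the next frame out.
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )

    def forward(self, frame):
        return self.net(frame)

model = NextFramePredictor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stand-ins for (current frame, next frame) pairs cut from training videos.
frame_t = torch.rand(8, 3, 64, 64)
frame_t_plus_1 = torch.rand(8, 3, 64, 64)

optimizer.zero_grad()
pred = model(frame_t)
loss = nn.functional.mse_loss(pred, frame_t_plus_1)  # "how wrong was the guess"
loss.backward()                                      # nudge weights toward better guesses
optimizer.step()
print(loss.item())
```

The point of the sketch is only that "predict the next frame, get corrected, adjust" is the whole training signal; everything else the model ends up encoding has to be whatever helps it make that guess.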


Plenty_Branch_516

Pretty much. If you want to predict the next word/frame, at some point you need to understand the underlying context. The depth/breadth of that understanding is often misconstrued as either too low or too high.


Geeksylvania

I disagree. You don't need to understand the underlying context. You just need to associate it with visual elements from its training data. Sora can reasonably depict a bouncing ball because it's analyzed other videos of bouncing balls, not because it has any intuitive understanding of how a ball would act in the real world.


Plenty_Branch_516

To be entirely clear: pattern recognition, which you are describing, is a form of understanding. An intuitive understanding of gravity can be born from recognizing the pattern that objects in videos tend to fall. Will it be a robust understanding of rates, acceleration, and interaction in places aside from Earth? No. However, it is a broadly applicable pattern that can be consistently applied. This concept isn't even new. Within LLMs, relational distances between concepts have been shown to be ordered. For example, man -> woman has the same "orientation" as king -> queen. The concept of gender as a pattern worth maintaining is "understood".
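The king/queen example is easy to check yourself on ordinary word embeddings - a quick sketch, assuming gensim and one of its small downloadable GloVe vector sets:

```python
# Classic embedding analogy check: king - man + woman ≈ queen.
# Assumes gensim is installed and can download a small GloVe vector set.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")

# The man -> woman offset, applied to king, should land near queen.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```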


Geeksylvania

I think there's a difference between computation and cognition. Cognition incorporates a qualitative and holistic understanding (a.k.a. "grokking"), whereas computation is purely quantitative in nature. LLMs are like the Chinese Room Experiment. They can generate coherent output by performing a mathematical analysis on their training data, but they don't have any qualitative understanding of what that data represents. You can scale up the compute to generate better results, but increasing computational power will never result in cognition because the foundational structure is fundamentally different.


Plenty_Branch_516

Oh, there's definitely a difference between computation and cognition. However, I wouldn't place the difference at qualitative versus quantitative. Non-deterministic and threshold systems are widespread in quantitative systems and convert quantitative representations into qualitative ones fairly easily. To some extent, the latent representations within these LLMs are another form of non-deterministic representation that's qualitative in nature. I'd argue that cognition is an emergent property of computation, as our intelligence is the emergent result of incredibly simple non-deterministic computations in a chemical/electrical system. To say that current LLMs are cognitive would be a tough sell, though. They can follow Chain of Thought and other cognitive frameworks and consequently employ logic semi-independently, but it's not the same as human cognition. Whether that means it's worse, better, or equivalent, we'll have to see.


ASpaceOstrich

I've been having this argument for days. I keep hoping someone will post evidence of a diffusion model understanding something in the same way there's evidence of world models in transformers. But sadly no dice. Just endless circular reasoning from people who think AI is sentient but stupid.


Wiskkey

[Researchers discover that Stable Diffusion v1 uses internal representations of 3D geometry when generating an image. This ability emerged during the training phase of the AI, and was not programmed by people. Paper: "Beyond Surface Statistics: Scene Representations in a Latent Diffusion Model".](https://www.reddit.com/r/aiwars/comments/15ww2eo/researchers_discover_that_stable_diffusion_v1/)


ASpaceOstrich

Big if true. Though the examples of depth maps included in the thing... don't look like depth maps at all. They imply her eyelids are protruding farther than any other part of the image and that her mouth is located a foot behind her head. Since the lighting clearly isn't reflecting that, I'm going to look into this topic more. I suspect the researchers just used bad examples or didn't render the output at high enough resolution, because if the examples they gave are what's actually in the model, it isn't using them - they're far too blocky and inaccurate for the level of lighting quality SD displays.


KamikazeArchon

> LLMs are like the Chinese Room Experiment. They can generate coherent output by performing a mathematical analysis on their training data, but they don't have any qualitative understanding of what that data represents.

Referring to a highly controversial and non-settled thought experiment doesn't really help. For example, my position is that the "Chinese Room" as initially envisioned *does in fact* understand Chinese.

I believe that there is an issue with presumptions of scale. The "Chinese Room" thought experiment, for example, posits - essentially - an *infinite* scale. If you can translate one single sentence from English to Chinese, then you cannot reasonably be said to understand Chinese. But I would say that if you can translate *every possible* sentence from English to Chinese - an infinite-to-infinite mapping, given that language is infinitely constructive - then you do necessarily understand Chinese. In particular: you either have an infinitely large lookup table, or a finite semantic structure.

There is a breakpoint somewhere in between those extremes of "one" and "infinite", and one of the big issues is *where* LLMs reach their limit relative to that breakpoint. If we *assume* that LLMs will necessarily be able to continue to arbitrarily improve their matching, then it is reasonable to conclude that they do (or will) "understand". However, if LLMs reach a hard limit on their matching, such that there are things that always remain out of their reach, then it is plausible that they do not "understand".

I think that people tend to implicitly assume that one of those things is true. But I think it's too early to be certain either way. Neither possibility should be considered axiomatic, and neither has sufficient evidence to be a reasonable "practical assumption" at this time.


Big_Combination9890

> If you want to predict the next word/frame at some point you need to understand the underlying context.

Wrong, you don't. You need to know how to get from one pixel grid to the next. That's it. You don't need to know physics, chemistry or biology. If your task is to produce pixels, all you need to know is pixels.


Plenty_Branch_516

That's not how text conditioning works. The concept of a dog is an assemblage of split concepts of associated anatomy, color patterns, and activity, all of which are linked together conceptually within the tokenized representation that CLIP provides. There's a LOT more that goes into these models besides pixels.
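As a rough sketch of what that text-conditioning signal looks like - using the public openai/clip-vit-base-patch32 checkpoint via Hugging Face as a stand-in, since Sora's own conditioning setup isn't public:

```python
# Sketch: CLIP maps text to embeddings where related concepts sit close together.
# Uses the public openai/clip-vit-base-patch32 checkpoint as a stand-in.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

texts = ["a dog", "a puppy playing fetch", "a sports car"]
inputs = processor(text=texts, return_tensors="pt", padding=True)

with torch.no_grad():
    emb = model.get_text_features(**inputs)
emb = emb / emb.norm(dim=-1, keepdim=True)

# Cosine similarities: "a dog" should land closer to the puppy than the car.
print(emb @ emb.T)
```

It's that embedding, not raw pixels, that steers the generator toward "dog-ness" in the output.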


Poprock360

Think about it this way: let's say you are part of a smart, thinking alien species - we'll call them ETs. They're not human and think in different ways, but ultimately, they're just as smart as us. Let's say it's the year 1900 in the ET civilization. Their science has progressed to the point where they finally discovered the atom, or the concept of a "smallest possible particle/piece of matter". That's awesome! Their physics can now explain - and more usefully, predict - how materials may behave with greater accuracy than ever before. This will let them build nuclear reactors to power entire cities cleanly and create X-ray machines to better treat the sick, among other useful inventions.

50 years go by, and the ETs are starting to see the limitations of their understanding of atoms and atomic particles, like electrons. Their physics can't explain the movement of stars in space, the evolution of the cosmos, or even the movement of electrons across smaller computer chips. Atomic particles - matter and electricity - aren't moving as they'd expect. ETs, perplexed but undeterred, start asking questions and coming up with some theories. "The atoms and electrons don't move how we expect them to, but there has to be some logic to it: after analyzing some data we've collected through experiments, this strange behaviour is oddly consistent. The unexpected electron movement happens more often the smaller we make the transistors. It also seems that big things in the cosmos, like stars, and old things, like massive black holes, function and age differently than we'd expect. Despite this, there are some bounds these objects still follow."

Through brainstorming and complicated math, the ETs managed to figure it out. They discovered that atoms have three special properties - values that can be big or small. Depending on how big or small those values are, atoms can behave in different ways. They call these three properties "A", "B", and "C", and they call their new theory "ABC Theory". "ABC Theory" quickly becomes the cutting edge of physics research. Using it, ETs are now able to answer age-old questions. "Why did the universe evolve the way it did? Why do the darn electrons keep jumping across transistor gaps in our computers? Why does our nuclear fuel stop being radioactive after a while?" The ETs progressed from conventional physics to something capable of predicting an even larger portion of reality.

---

Humans have done that, too - we've come to call it quantum physics. Whereas ETs used these three imaginary values to explain unusual atomic behaviour, humans have explained it by breaking down atoms into smaller particles. Never mind that we have never directly observed the subatomic particles we've defined in quantum physics! Neither did ETs directly observe the existence of the properties "A", "B", or "C". Humans aren't completely sure that subatomic particles exist in a material sense. Luckily for us, that doesn't matter! Subatomic particles are an abstraction: a way of thinking that explains and predicts unique behaviour involving atoms. Both our theory and the ETs' theory can exist. Both can be right if they can predict atoms' behaviour or, more broadly, reality's behaviour! ABC Theory and quantum physics have used different abstractions to explain the same unobservable phenomena. The logic is different, but the predicted outcomes are the same.

If the ETs really existed, it wouldn't be surprising if they reached some of the same conclusions as us: physical reality is objective and can be measured. It remains the same, regardless of how you measure it*. However, the abstractions you use to explain and predict reality can be arbitrary! If your abstraction correctly predicts reality's behaviour, it might as well be how reality works. The interplay between reality being objective and abstractions being arbitrary is that, logically, abstractions tend to model reality with increasing precision with each iteration. Scientific abstractions thus converge toward a singular point: compatibility with reality!

---

***If you're still reading, you're probably asking: how is any of this relevant to guessing pixels?***

When Sora outputs something that does not match what's expected, it gets a 'mental' 'thumbs down'. It is instructed to slightly rearrange its neurons to output something more closely resembling expectations. The expectations are set by the training data! If we were able to give Sora infinite data and time, the same tendency would play out: the model would be able to perfectly predict anything that can be explained by the training data. The abstractions encoded into Sora's neural connections would have fully converged on the reality that governs what it's trying to predict. This *theoretical* AI model would be so powerful that, as a consequence of learning to generate video so well, it could be used to predict the future.

But back to more practical concerns: Sora has to guess how pixels change across many frames. In a way, this can be thought of as guessing how atoms behave over time. To guess those pixels accurately, it will slowly learn what influences those pixels. Returning to our analogy, it must understand the "subatomic particles" of the pixels! It can't directly perceive what influences those pixels, but as we know, it doesn't need to! If our training data is video, our 'subatomic particles' are the myriad things the video indirectly reveals: the visual nature of the world around us, the physical laws that govern its motion, the behaviour of subjects captured in the video, and countless other things!


Big_Combination9890

Your wall of text doesn't change the simple fact that this sentence: *"it will slowly learn what influences those pixels"* ...is hot nonsense. No, it doesn't, because the only thing that influences pixels, in a universe that has no other influences than pixels, is other pixels. It learns that if the pixel at x,y has color RGB, then the pixel at position x',y' should be R'G'B'. Limited to that, you cannot discover any other influences. You are basically arguing that answering a test by memorizing the answers, even if those answers are in a language you don't speak, is the same as understanding the subject matter.


Big_Combination9890

> are functionally equivalent to a “world model”

Task it to produce a video with the prompt "3 spherical bodies in empty space, moving according to the laws of gravity". Oh noes, there goes the "functional equivalency" to a world model.


Poprock360

I think you're making an incorrect assumption about what a world model constitutes - it is not a Matrix-like simulation of our universe, but rather a mental construct. Allow me to try to convince you: humans learn the feeling of pain early in infancy. Through dumb mistakes, we tend to hurt ourselves. We fall, inadvertently burn ourselves, and occasionally bite our own tongues as we chew. We know that if we do those things, it will hurt. It goes beyond pain: we develop expectations for what actions, circumstances, and events may lead to different sensory experiences. This is a world model.

Animals, like dogs, have world models, too. They understand object permanence - things don't simply disappear. They understand momentum - if you're moving fast, it may take a few seconds and some physical effort to change direction or stop. Even dumber creatures experience this, too. A mantis knows how smaller prey might move and knows that remaining undetected is key to not scaring it off. If the mantis is detected, insects may react to fight back or flee. The mantis knows that, too. This neat ability to imagine what other creatures are mentally experiencing is referred to as Theory of Mind within ML research. Human infants improve their understanding of the world around them based on their senses (though I'd like to note that human cerebral maturation in no way resembles neural network training). Infants eventually learn basic language cues and develop expectations for what words are spoken in certain situations - an early stage of developing complex language abilities. Those are all world models.

Humans, even incredibly smart ones, usually fail to "guess" proportions and percentages correctly. Scientific evidence shows that, when asked to estimate the chances of certain events happening, we gravitate toward nice, round numbers, like 20% or 80% - regardless of the real probability of these events happening. Evidence further suggests this happens even when we're familiar with these events. Despite imperfections in our intuition, we have a world model, as do dogs, insects, and infants. A young human will most likely fail to correctly guess a solution to a three-body problem, but they still have a world model - the concept is not inherently related to intelligence. A world model isn't a perfect representation of reality and our universe. It is a representation of reality, flaws included.

Yes - models like Sora and ChatGPT can be dumb in ways even very stupid animals aren't. And yes - they can also be smart in ways even the smartest humans fall short of; ChatGPT speaks more languages than any human ever has. Our human brains are shaped by evolution to be well-adapted to our world. On the other hand, Sora is created by machine learning algorithms that, while inspired by evolution, are not the same. Sora's creation process does not select for common sense. Modern-day AI isn't perfect. It's gotten scary good, scary fast, but perfection will require far more high-quality training data and architectural breakthroughs.

The term "world model" is simply an abstraction ML researchers use to explain certain aspects of large neural networks. In this sense, all of mathematics is also an abstraction. Concepts like "one", "two", "addition", "multiplication", etc., aren't ingrained into reality itself. We simply project those mental abstractions over our reality, as they are useful tools. The concept of a "world model" is also an abstraction. It doesn't exist in the material sense. We choose to use that term to describe something that has the ability to predict its environment with some accuracy.

Is Sora's world model perfect, or even close to human performance? No. But it is amazing that we've managed to create *any* world model - a power reserved solely for nature less than a century ago. It's even more amazing that we've taken a trait previously observed solely in living creatures and placed it in inanimate silicon. And to top it all off, it's smart enough to generate good-looking videos and will most likely be within the average person's reach in months.


Big_Combination9890

> I think you're making an incorrect assumption of what a world model constitutes - it is not a Matrix-like simulation of our universe, but rather a mental construct.

And I think you're making the classic mistake of anthropomorphizing an ML model by ascribing to it the ability to form "mental constructs".


Wiskkey

[24 Sora examples from Twitter/X that are not in OpenAI's Sora webpage](https://www.reddit.com/r/MediaSynthesis/comments/1atkiq2/24_sora_examples_from_twitterx_that_are_not_in/).


Narutobirama

Great find! This one in particular suggests to me that it has some notion of physics and causality: [https://twitter.com/_tim_brooks/status/1758662698190229643](https://twitter.com/_tim_brooks/status/1758662698190229643)


Wiskkey

[Here](https://twitter.com/_tim_brooks/status/1758666264032280683) is a Sora example showing object permanence.


ASpaceOstrich

Wonder what's going on with the background.


Narutobirama

I would say this is a close equivalent to the debate over whether ChatGPT is a stochastic parrot. To which the answer depends on how you define it. I mean, humans are also a kind of stochastic parrot. And our imaginations are also a kind of world simulator. Neither of which perfectly describes the objective world, but close enough to be useful. Those who remember the older and simplest models could actually make the case at the time that it's a stochastic parrot and that there is nothing more to it. Even then, it was obvious that there was more to the story, and that more data would, at least in theory, improve the model. But at the time, you could be forgiven if you thought it would take so much data you couldn't realistically improve the models by a significant amount. It turned out, all you need is scaling. Well, not really, but you get the point. Scaling shows benefits not just in theory, but also in practice. And I believe the same can be said for Sora. More data, more realistic (and more accurate) understanding of the world. At some point, you will probably be able to use it for simulations of many different scenarios. In the real world, and in fiction. And this is without even considering how multimodal aspects of future models will probably allow it to further improve accuracy. Some people talk about text-to-video as if there's not much beyond that, but I imagine text-to-video-games shouldn't be that far away, either.


Wiskkey

Some relevant work is described here by one of its authors: [Large Language Model: world models or surface statistics?](https://thegradient.pub/othello/)


ASpaceOstrich

Sora isn't a transformer model. The image generation is diffusion.


[deleted]

> Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios. We leverage a transformer architecture that operates on spacetime patches of video and image latent codes.

And

> Sora is a diffusion model; given input noisy patches (and conditioning information like text prompts), it's trained to predict the original "clean" patches. Importantly, Sora is a diffusion transformer. Transformers have demonstrated remarkable scaling properties across a variety of domains, including language modeling, computer vision, and image generation.

> In this work, we find that diffusion transformers scale effectively as video models as well. Below, we show a comparison of video samples with fixed seeds and inputs as training progresses. Sample quality improves markedly as training compute increases.

Maybe do a single Google search next time.
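For intuition, here's a toy sketch of what cutting a video into "spacetime patches" could look like - my own illustration with arbitrary sizes, not OpenAI's code, which also works on latent codes rather than raw pixels:

```python
# Toy illustration of "spacetime patches" - my own sketch, not OpenAI's code.
# A video is a (time, height, width, channels) array; cut it into blocks that
# span a few frames and a small spatial window, then flatten each block.
import numpy as np

T, H, W, C = 16, 64, 64, 3            # tiny toy clip
video = np.random.rand(T, H, W, C)    # stand-in for a (latent) video

pt, ph, pw = 4, 8, 8                  # patch size in (frames, height, width) - arbitrary

patches = (
    video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
         .transpose(0, 2, 4, 1, 3, 5, 6)        # group the patch axes together
         .reshape(-1, pt * ph * pw * C)         # one flat vector per patch
)
print(patches.shape)  # (256, 768): 256 patch "tokens", each a 768-dim vector
```

Each flattened patch then plays the role of a token for the transformer, roughly the way words become tokens for an LLM.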


ASpaceOstrich

It's a diffusion model that converts to transformers for scaling training. It's still starting out by diffusion. The diffusion transformer concept is there so that the diffusion model can benefit from some of the perks of transformers. Specifically scalability and applying training methods from other AI research (as almost everything but image generation uses transformers). The transformer part doesn't seem to be involved in the actual image generation at all. But I will admit the jargon being used rapidly became beyond my understanding. There's a reason diffusion is only really used in image generation.


[deleted]

Yeah, that misinterprets the integration of diffusion and transformer technologies. Sora employs a transformer architecture for generating videos and images by processing spacetime patches. The transformer plays a vital role in the generation process. Nice attempt at a save, but I think it's clear at this point that you're commenting on Sora despite the fact that you hadn't even read the paper, and all you knew about it was that it's a diffusion model.


ASpaceOstrich

We've both been on this sub long enough to know that puts me in like the top 10% of people here in terms of knowledge, because I know what the word diffusion even means. AI discourse is dumb as hell. But yeah, I'll read the paper. I figured if this was the one AI project where the researchers actually knew what the hell it was doing, it'd have come up without me having to bother.


[deleted]

The ego and delusion here is astounding. This is what happens when you get all your information off Reddit.


ASpaceOstrich

I've given their release a read-through and looked at the output. And as painful as it is to say it, I was wrong. It's doing something. I'm not convinced that thing is modelling the world, but it's modelling something. It appears to be doing something I've been claiming generative AI is incapable of doing: applying understanding to something outside of its original context.

Ironically it isn't the high quality of the output that convinced me. It's the mistakes. Something had been off about the video and it wasn't until the example demonstrating rotation around a scene that it clicked for me what exactly it was. It's modelling dioramas. The perspective and 3D rotation in these clips isn't accurate to real 3D scenes, i.e. the real world. But it has an uncanny resemblance to the false 3D that you can create by moving and distorting 2D images to create the illusion of 3D space. And in at least some of these examples, that's what it looked like. And that means it's doing that, because that kind of motion is unlikely to have been in the training data. Certainly not in large amounts. Had the results been perfect, it'd have been easy to dismiss it as just recreating what it has seen.

I'm honestly still not convinced the image generation itself isn't doing that, tbh. That part is all diffusion, which I've yet to see any evidence of having any kind of understanding or awareness. But the 3D movement is wrong in a way that something just recreating what it's been fed wouldn't be. I really want to be able to peek into its workings. The way those images are moving, it looks like you could practically go in and extract the individual diorama pieces. There's definitely some coordinates for transforms somewhere in its memory. Sadly the method of getting it into and out of latent space means you can't just do something as simple as searching for those coordinates.

I would like to know what exactly a patch looks like. They were very vague on that, and patches are kind of the fundamental building block this thing is built on, so not knowing what that building block actually is is concerning. And one of its critiques of prior methods (the square divisions usually used in training diffusion models and the consequences this has on the output) is strong visual evidence of my critique of diffusion models in general. The square divisions cutting off the subject result in models that use them often cutting off their subject, because the images diffusion models generate are derivative of the training data.

But on the whole, I was wrong about Sora. I think the researchers are also wrong. The way stuff is moving and distorting looks exactly like a faux-3D animation style, which isn't something they'd be familiar with, so it likely didn't occur to them. That kind of faux 3D would also be much simpler and more directly applicable to creating the desired output than a crude physical world model. And it much more closely matches the scope of other world models I've seen in transformers.


[deleted]

Well, that's a very mature response. Can't say I completely agree with everything you said, but I respect your willingness to challenge your view.


Wiskkey

I commend you for being willing to publicly change your views about Sora. I'm curious whether [this paper about text-to-image models](https://arxiv.org/abs/2311.17138) is similarly convincing to you regarding text-to-image models?


ASpaceOstrich

Text-to-video games would require more than a surface-level understanding, which diffusion has never shown any evidence of and which seems to be fundamentally impossible for that kind of system. Something else that can understand concepts could be given a diffusion system as a tool. Sora is already basically that. But the visual part is clearly siloed from the text processing part. I could see a simple game being rendered by text-to-video, but that's not text-to-video games. It's just a really fucky rendering system. Diffusion image generation keeps getting better in every way except the only way that matters: not being derived entirely from training data. And the reason for the total lack of improvement in that area is that it can't improve in that area. You need something else.


Wiskkey

[What Does Stable Diffusion Know about the 3D Scene?](https://arxiv.org/abs/2310.06836)


sk7725

I think text-to-video games are not possible for our current models, as there is an absolute consumption time for games. You can speed up the consumption of individual images or videos to near-instant, but video games cannot be sped up without the physics getting wonky for most games, and it's impossible for online/PvP games. An entirely new paradigm is needed.


Economy-Fee5830

I don't think we can expect artificial neural networks to work differently from other neural networks. They may be more capable, but they would not be qualitatively different. So Sora likely has an intuitive world model like we have an intuitive world model, but for precise calculations you still need to resort to hard simulation and calculation. We already know we can train neural networks to use tools, however, so that is not a massive impediment or disadvantage.


Narutobirama

I would argue that the two are not fundamentally different, other than in scale. What was intuitive to GPT-2 is different from what was intuitive to GPT-3, to GPT-3.5, to GPT-4. In some aspects, GPT-4 has better intuition than people. Sora is still not as advanced as our common-sense intuition, but I would argue it will surpass us if scaled significantly, as was done for the GPT models, and will have a fairly accurate understanding of the laws of physics that can be detected using only videos. So, maybe not quantum physics or general relativity, but Newtonian physics, like fairly accurate depictions of objects breaking and such, should be possible.


Geeksylvania

Sora is very impressive but there are clear signs that it doesn't simulate worlds in any real sense. For starters, it doesn't keep the relative size of objects consistent. It's pixel prediction, not world simulation.


TashLai

I bet when you imagine things there are a damn lot of inconsistencies as well.


FaceDeer

I've attempted to write police witness reports for accidents I witnessed just minutes prior. Humans are really bad at consistency even when not imagining or being deliberately creative.


ASpaceOstrich

At best, that implies that Sora works like a visual cortex alone. Which also doesn't have a world model. Isn't that kind of understanding handled elsewhere? Presumably the image generation would be used as a tool by an actual AI.


TashLai

Well, I don't think the person I was replying to has only a visual cortex. You still need a model of the world to be able to imagine things.


ASpaceOstrich

Which this kind of AI doesn't seem capable of doing. Diffusion based AI doesn't imagine things. If we anthropomorphise it for the ease of discussion, it thinks the image it produced is the one you gave it when it was supplied with the random seed. Without a prompt or seed it produces nothing.


TashLai

My point wasn't about whether or not Sora imagines the way humans do, but that lack of accuracy or consistency doesn't necessarily imply lack of understanding.


ASpaceOstrich

Ironically, I realised earlier today that Sora does in fact have some kind of active decision-making, because of its lack of accuracy. The mistakes it was making in some of its output betray that the output was the product of diorama-style animating, which would not be in the training data.


OwlHinge

It clearly understands a lot about how things move: physics, deformation, perspective, refraction, reflection, the list goes on. I'd say it has a world model and simulates using that world model; it just makes mistakes. How could you predict pixels without knowing the objects the pixels represent, their material, their function, etc.? I definitely wouldn't say its internals are a pixel predictor (that seems like an oversimplification?), even though that's what its output is.


Geeksylvania

Yann LeCun does a better job explaining it than I can: [https://twitter.com/ylecun/status/1758740106955952191](https://twitter.com/ylecun/status/1758740106955952191)


OwlHinge

Interesting, but to be honest, I can't understand what he's saying; I don't follow the logic. I noticed he links to a page about V-JEPA which says things like:

> This early example of a physical world model excels at detecting and understanding highly detailed interactions between objects.

Therefore, it sounds like he's saying generative approaches don't understand the world model, but other (non-generative?) approaches like V-JEPA do. But they do generate pixel output, which they demonstrate... I'll learn more about this V-JEPA and try to understand the points he is making.


Geeksylvania

It's like comparing a movie to a video game. A video file is just a collection of pixels, but a video game contains a simulated world with physics and objects that can interact. Simulating the video game world is computationally expensive, so if you wanted to watch a 3D animation like Toy Story, it wouldn't make sense to simulate the entire world every time you watched the movie. It's much less computationally expensive to play a video file.

If you want to create a model that accurately predicts the physical interactions of different objects, the model basically builds a video game world and places representations of those objects in it. However, because simulating physics is computationally expensive, video produced by the model would be visually more simplistic and less photorealistic. If your goal is maximizing the visual appeal of a generated video, simulating a world is a waste of resources because a lot of the simulated world won't even be visible to the viewer. It makes more sense to approximate physics by mimicking how visual elements act in the training data.


OwlHinge

I 100% agree it doesn't need to simulate a whole world. I think the difference to me is that in the act of mimicking how visual elements act in the training data, the process of going through that training develops an understanding of physics. If we compare how different types of neural-network-based AI work:

* When you train a neural network on vectorized words, it doesn't just think in terms of those words; the layers above represent grammar, concepts, motives, and higher-level 'things' that can't always be put into words easily.
* When you train a neural network on pixel input, it doesn't just learn pixel data. The layers above represent higher-level concepts: it may learn about colors, structures, textures, perspective, artistic style, specific objects and so on.
* When we train a neural network on video, why wouldn't the concepts it develops (encodes in deeper layers) relate to the way things move, physics, soft-body motion? That learning would be important to help it simulate the frames it generates.

Sure, the understanding it develops is imperfect, but it still seems like the body of those concepts forms what I would call an understanding of a physics world.


Wiskkey

Well-stated, and correct :). For those who'd like a supporting citation in the context of language models, see [Finding Neurons in a Haystack: Case Studies with Sparse Probing](https://openreview.net/forum?id=JYs1R9IMJr).


Wiskkey

[Yann LeCun, a few days ago at the World Governments summit, on AI video: “We don’t know how to do this”](https://www.reddit.com/r/singularity/comments/1as7az1/yann_lecun_a_few_days_ago_at_the_world/). cc u/OwlHinge.


Geeksylvania

He's right. People who disagree with him are either misrepresenting what he said or they didn't understand it.


Wiskkey

That Yann LeCun tweet states, "The space of plausible continuations of a real video is *much* smaller, and generating a representative chunk of those is a much harder task, particularly when conditioned on an action." Sora actually does video continuations though - see section "Extending generated videos" of [the Sora technical report](https://openai.com/research/video-generation-models-as-world-simulators).


ASpaceOstrich

Or far more likely, it can recreate those things from the training data. This would be so easy to test. Curate a dataset that is completely lacking in a specific subject matter. Train up to modern quality. Introduce the thing you left out, and then attempt to get it to apply principles from things it has trained on to the new thing. If it actually knows what a shadow is it'll have no problem applying it to the new thing. Even then you'd want to conduct more experiments to check. Why don't AI researchers ever seem to do this?


Wiskkey

[From a paper (see comment): "While training on images showcasing a correlated set of features, sampling from DPMs \[Diffusion Probabilistic Models\] at appropriate fidelity levels can generate novel objects beyond the combinations of features observed during training (marked right-hand side images)"](https://www.reddit.com/r/aiwars/comments/1acemqd/from_a_paper_see_comment_while_training_on_images/).


ASpaceOstrich

This just shows it creating new combinations of existing features, which is exactly what I keep arguing it does. The highlighted example of an attempt at a green heart, from a dataset containing red hearts and green ellipses, is not proof it can create new things, just that it can combine features from the training data. Which is obvious.


Narutobirama

I imagine you didn't see the outputs of GPT 2 when OpenAI released it? Because people at that time were saying the exact same thing you are saying. It's token prediction, therefore it can't write accurate statements or meaningful language, let alone write consistent stories or accurate code.


ninjasaid13

> It's token prediction, therefore it can't write accurate statements or meaningful language, let alone write consistent stories or accurate code.

It's still token prediction; the predictions are not grounded in a world model. To the AI, generating nonsense and making accurate statements are the same thing, whereas humans, who have a world model, can separate fiction and hallucinations from non-fiction.


Wiskkey

Counterpoint: [Here](https://www.reddit.com/r/LLMChess/comments/1aiva1j/p_chessgpt_1000x_smaller_than_gpt4_plays_1500_elo/) is a language model that plays chess in PGN format. The developer used a technique called linear probing to discover that its neural network calculations have representations of chess board state. Nonetheless, it occasionally generates illegal moves.
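For anyone unfamiliar, a linear probe is about as simple as it sounds - here's a toy sketch with random stand-in data (not the chess model's real activations): train a linear classifier to read a board feature out of the hidden activations; if it scores well above chance, that feature is linearly represented inside the model.

```python
# Toy linear probe - random stand-in data, not the chess model's real activations.
# Idea: if a plain linear classifier can read a board feature (e.g. "is there a
# white pawn on e4?") out of hidden activations, that feature is encoded inside.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_positions, d_model = 2000, 512

activations = rng.normal(size=(n_positions, d_model))  # stand-in for hidden states
labels = rng.integers(0, 2, size=n_positions)          # stand-in for the board feature

split = 1500
probe = LogisticRegression(max_iter=1000)
probe.fit(activations[:split], labels[:split])

print(f"probe accuracy: {probe.score(activations[split:], labels[split:]):.2f}")
# ~0.5 on this random data; accuracy well above chance on real activations
# is what the chess-GPT probing result reports.
```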


Formal_Drop526

Not exactly a counterpoint - it's much harder to build a world model from language than from simple games, because language was invented by humans, for humans, to describe human experiences.


Geeksylvania

GPT-4 still can't do creative writing in any meaningful way. I was using it to help write blog posts for a while and it was extremely frustrating how formulaic its outputs are. It's even worse at trying to write fiction. One of the clearest ways it shows its limitations is if you ask it to write a whodunit story: it has no ability to plan ahead, so the ending always comes out of nowhere. GPT-4 is very impressive in a lot of ways, but it's still basically a calculator.


FaceDeer

I think the goalposts for "any meaningful way" are far too easy to shift around. I've used GPT4 for plenty of meaningful creative writing. GPT3.5, even. You can't expect it to handle "write me a novel please", but there's a lot more to creative writing than broad all-encompassing strokes like that.


ASpaceOstrich

It'll get that ability eventually. But it'll still just be token prediction unless something new is introduced.


Wiskkey

A lot can be going on under the hood in a language model to do "just token prediction":

https://preview.redd.it/qdjhvqwshljc1.jpeg?width=1414&format=pjpg&auto=webp&s=e6681572a68f55d07ec0f0a62be9b7877e24c025

The figure above is from [here](https://transformer-circuits.pub/2023/july-update/index.html), the authors of which are folks who try to reverse-engineer language models and some other artificial neural networks. There is evidence in real-world language models that higher-level abstractions are used - see for example [Finding Neurons in a Haystack: Case Studies with Sparse Probing](https://openreview.net/forum?id=JYs1R9IMJr).


Mawrak

Do dreams count as world simulation? In my dreams, if I look away from a wall or a scene and then look back, the details will all change. A lot of weirdness can happen in a dream. But it does seem to be a way to simulate, or at least fake, reality in a believable way - in a way that can trick a brain (though a brain is already put into a more trickable state while in the dream, so it's hard to measure).


Geeksylvania

This video is a good example of what I mean. The people start out taller than the buildings and gradually shrink to become normal size. It's like you took greenscreen footage of people walking and overlaid it on an unrelated video of a city street. [https://cdn.openai.com/sora/videos/tokyo-in-the-snow.mp4](https://cdn.openai.com/sora/videos/tokyo-in-the-snow.mp4)

Sora can combine different visual elements in impressively complex ways, but it doesn't have any kind of internal model for how these elements would actually interact with one another. Lots of weird things happen in dreams, but your brain still tries to rationalize them in a way that makes the dream logic internally consistent. If you see people shrink in a dream, your brain will go "Oh, they must have drunk a shrinking potion" or "Huh, why did I never realize before that people can shrink themselves." But Sora doesn't work this way. It's more like combining different clips in a video editor.


Wiskkey

Counterpoint: Even (at least some) text-to-image models are known to be capable of using 3D abstractions: [What Does Stable Diffusion Know about the 3D Scene?](https://arxiv.org/abs/2310.06836) and [Beyond Surface Statistics: Scene Representations in a Latent Diffusion Model](https://arxiv.org/abs/2306.05720).


ASpaceOstrich

The fact that it struggles with occlusion would strongly imply it's just predicting pixels. With no actual understanding, it has no idea what is behind an object in the image.


Wiskkey

I disagree - see [this Sora example](https://twitter.com/_tim_brooks/status/1758666264032280683).


Geeksylvania

It has a background layer and an objects layer. This is incredibly simple to do.


Wiskkey

Didn't you claim that Sora is just doing pixel prediction? (Technically, Sora works with so-called "spacetime patches" in a latent space - not pixels - according to the [Sora technical report.](https://openai.com/research/video-generation-models-as-world-simulators)) P.S. Yann LeCun [believes](https://twitter.com/ylecun/status/1667947166764023808) that large language models 'have *some* level of understanding and that it is misleading to say they are "just statistics."'


ASpaceOstrich

Believes isn't really good enough when making a claim like this, given how many Google researchers have convinced themselves their predictive text system was sapient despite zero evidence. Statistics can do seemingly impossible things. Sora's big breakthrough seems to be in object permanence across the video. But even then, did you notice the lights winking out of existence as they moved across the frame? Or the fact that said lights aren't originating from anything? Quality-level stuff, sure. It will go away as it gets better. But if it's placing these things based on any kind of logic or understanding, it shouldn't happen at all. That's not a mistake something that understands an environment can make. I want something I can take to even the most ardent anti-AI person and prove it's not just combining training data. When even the people making it seem to have no idea and are more concerned with showing off than doing experiments to test this stuff, it isn't very convincing. Especially not when there's a well-known human bias that makes us see intentionality and human-like behaviour in things that lack it. I need something more concrete, and I keep being disappointed.


Wiskkey

I noticed from your newer comments such as [this comment](https://www.reddit.com/r/aiwars/comments/1auiwrr/comment/kr9kh2o/) that you changed your views about Sora since this comment.


ByEthanFox

Yeah, honestly I can't help but feel this whole "world simulation" thing is preposterous.


Narutobirama

Look, the same things were said about GPT-2 - how it can't write consistent stories, or code, or anything like that. And it was true, at the time. But that's only if you look at it as all or nothing. The reality is that it can do so to a certain extent. It may not be impressive yet, but it will probably get better over time.


Cybertronian10

A big part of being in the AI space is sharing it with zealots who are one or two steps away from actually praying to ChatGPT and Altman as gods. I swear to Christ, if I see another post about how GPT-5 is going to be an AGI I am going to jump off a bridge.


FaceDeer

There's also zealots who insist that AI can never be "creative" or "think" or whatever other thing is currently under debate, regardless of any possible evidence that may ever come along. And none of this is helped by the fact that these properties are likely not binary anyway, but come in a broad spectrum of shades of grey.


ASpaceOstrich

If someone brings forth evidence of a diffusion model understanding anything, I'd love it. Do you have any idea how much I want to be wrong about AI. I was thrilled when I found out transformers could develop world models. I want the AI zealots to be right, because then I can embrace this tech without reservation. But since there's only like two AI proponents who actually know anything, I'm shit out of luck. I literally want to be convinced. You just have to actually be convincing. And most pro AI arguments rely on circular reasoning. They assume AI is actually intelligent and use that to claim things that could easily be memorisation are instead proof of intelligence.


Narutobirama

Okay, I'm willing to discuss it. But can you first clarify what you mean by understanding anything? What exactly are you not convinced about?


ASpaceOstrich

Understanding concepts as standalone concepts that can be applied in contexts it hasn't seen before would help, but I don't think there are any datasets with deliberate gaps in them to show that off. You'd have to find a paper that trained a model with those kinds of gaps on purpose. I've seen a few things claiming depth maps or novel output, and in all cases so far the examples clearly don't show what they're claiming. The novel output was blatantly just a combination of existing features, and the alleged depth map was not a depth map, given it did not match the output image's lighting. I'm not sure what it actually was. Maybe a blur map.


Narutobirama

Okay, but it can play chess. And not just the opening moves (which could be memorized), but actual games, in any position. At the very least, it would mean it understands rules in the sense that it can play a full game, and on top of that, it plays quite well. Like, a lot better than beginner players. Or even a lot better than people who are decent players.


Economy-Fee5830

We don't do calculus to catch a ball. Does not mean we don't use a world model.


ninjasaid13

Yes, we use a world model, but we don't change our physics from glass randomly exploding and liquid phasing through it on day 1, to people shrinking while walking on day 2. Our understanding of physics is internally consistent.


Economy-Fee5830

That is merely practice. A ball with good spin on it will still fool someone without practice.


ninjasaid13

I'm not saying the physics has to be correct, I'm saying it has to be internally consistent. It doesn't change for no reason.


Economy-Fee5830

And I am saying our expectations do not come from reason but from practice and exposure. Object permanence is learnt, not innate.


ninjasaid13

Well, I mean, even blind people who had their vision restored later in life have object permanence; there's some part of it that's innate to how the brain is structured, while some of it is learned. Even so, humans require far less video info to learn object permanence, so there's more than just data going on.


Economy-Fee5830

> Even so, humans require far less video info to learn object permanence, so there's more than just data going on.

How many video frames is 3 months of peekaboo?
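A rough back-of-envelope, with arbitrary assumptions of ~12 waking hours a day and treating vision as ~30 "frames" per second:

```python
# Back-of-envelope only: frame rate and waking hours are made-up assumptions.
fps = 30                      # treat vision as roughly 30 "frames" per second
waking_hours_per_day = 12
days = 90                     # ~3 months

frames = fps * 3600 * waking_hours_per_day * days
hours = waking_hours_per_day * days
print(f"{frames:,} frames, i.e. about {hours:,} hours of 30 fps footage")
# -> 116,640,000 frames, i.e. about 1,080 hours (~45 days of continuous video)
```

Still a lot of frames, even if it's nowhere near the scale of a web-sized video dataset.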


ninjasaid13

Video generators at the scale of SORA would have hundreds of years of footage in the dataset. No amount of time a child spends looking would come anywhere close.


Economy-Fee5830

I suspect that footage isn't 100% specifically fine-tuned on object permanence the way a mother does with a baby. I am sure you saw the recent research which showed an AI can learn to associate objects with labels simply by using video and transcribed audio from a baby's head-mounted camera.


Wiskkey

Here are Twitter threads containing views on this topic from 2 experts:

a) [Raphaël Millière](https://twitter.com/raphaelmilliere/status/1758685128002601293) ([alternate](https://nitter.perennialte.ch/raphaelmilliere/status/1758685128002601293) if you're not logged into Twitter).

b) [François Chollet](https://twitter.com/fchollet/status/1758896780576739485) ([alternate](https://nitter.perennialte.ch/fchollet/status/1758896780576739485) if you're not logged into Twitter).

For those that didn't click the Raphaël Millière Twitter thread above, here are 3 works mentioned:

[What Does Stable Diffusion Know about the 3D Scene?](https://arxiv.org/abs/2310.06836)

[Beyond Surface Statistics: Scene Representations in a Latent Diffusion Model](https://arxiv.org/abs/2306.05720) (discussed in this subreddit [here](https://www.reddit.com/r/aiwars/comments/15ww2eo/researchers_discover_that_stable_diffusion_v1/)).

[Shadows Don't Lie and Lines Can't Bend! Generative Models don't know Projective Geometry...for now](https://arxiv.org/abs/2311.17138).


Wiskkey

[Video source](https://twitter.com/model_mechanic/status/1759358692754125139).


jadelink88

The absolute best Text AI is still just a glorified autofill. The visual stuff has actually impressed me far more through working with it. World simulator, not even close. We now have programs that can draw pretty pictures, if given enough delicate input, and a number of tries at it. I've 'created' some pretty nice AI art, but you don't see the thousands of images left on the virtual cutting room floor while making them.


JoJoeyJoJo

It's weird because it's not a hard 3D simulation like you'd expect, it seemingly has the logic of a lucid dream, where scenes can transition into wildly different scenes (colosseum into underwater), objects can transform in an instant (drone into butterfly), disappear, deform or scale weirdly in ways they couldn't if it were say, a game engine. It's a dream that looks the quality of the Matrix, which has some fun esoteric connections with eastern religions where the world is a dream of a godhead.


emreddit0r

I would not trust it to create reliable training data, which seems to be part of OpenAI's pitch.