Digital Event Horizon
A recent viral video from OpenAI's Sora AI video generator showcases the limitations of current AI video models in generating coherent gymnastics routines. The nonsensical synthesis errors, dubbed "jabberwocky," highlight the need for more accurate training data and better algorithms to capture human movement and physics rules.
The Sora AI video generator created a viral video of a gymnast who sprouts extra limbs and loses her head during an Olympic-style floor routine.
AI models based on transformer technology struggle to produce coherent, original outputs and often rely on statistical associations between words and images.
Sora's training data may not have included enough gymnastics videos to accurately capture limb-level precision and the physics rules governing human movement.
The model relies on statistical associations between pixels in the training dataset to predict the next frame, leading to nonsensical results.
Improving AI video generation requires increasing training dataset sizes and incorporating accurate metadata labeling, as well as developing new algorithms that capture the nuances of human movement and physics.
In a recent viral video, OpenAI's newly launched Sora AI video generator created a gymnast who sprouts extra limbs and briefly loses her head during what appears to be an Olympic-style floor routine. The nonsensical synthesis errors in the video, which we have dubbed "jabberwocky," hint at technical details about how AI video generators work and how they might get better in the future.
AI models based on transformer technology are fundamentally imitative in nature. They're great at transforming one type of data into another type, or morphing one style into another. What they're not great at (yet) is producing coherent generations that are truly original. So if you happen to provide a prompt that closely matches a training video, you might get a good result. Otherwise, you may get madness.
When examining how the video fails, one must first consider how Sora "knows" how to create anything that resembles a gymnastics routine to begin with. During the training phase, when the Sora model was created, OpenAI fed example videos of gymnastics routines (among many other types of videos) into a specialized neural network that associates the progression of images with text-based descriptions of them.
Later, when the finished model is running and you give a video-synthesis model like Sora a written prompt, it draws upon statistical associations between words and images to produce a predictive output. It's continuously making next-frame predictions based on the last frame of the video. But Sora has another trick for attempting to preserve coherency over time: giving the model foresight of many frames at a time.
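To make the point about frame context concrete, here is a minimal toy sketch in Python. It is not how Sora is implemented; it only assumes a "frame" can be reduced to a single number (the position of one limb) and compares a predictor that sees only the last frame with one that sees the last two. The function names and the `avg_step` value are hypothetical and chosen purely for illustration.

```python
def rollout_last_frame_only(start, avg_step, n):
    """Predict each new frame from the previous frame alone.

    With one frame of context the predictor cannot see motion, so it
    applies a fixed "average" step (a hypothetical value standing in
    for whatever the training data suggested).
    """
    frames = [start]
    for _ in range(n):
        frames.append(frames[-1] + avg_step)
    return frames


def rollout_with_window(first, second, n):
    """Predict each new frame from the two previous frames.

    The difference between the two frames carries the motion forward.
    """
    frames = [first, second]
    for _ in range(n):
        frames.append(frames[-1] + (frames[-1] - frames[-2]))
    return frames


def max_error(predicted, truth):
    """Largest per-frame gap between a rollout and the true motion."""
    return max(abs(p - t) for p, t in zip(predicted, truth))


# Assumed ground truth: a limb moving 3 units per frame for 6 frames.
true_motion = [3.0 * t for t in range(6)]

single = rollout_last_frame_only(true_motion[0], avg_step=1.0, n=5)
windowed = rollout_with_window(true_motion[0], true_motion[1], n=4)

print("last frame only:", max_error(single, true_motion))
print("two-frame window:", max_error(windowed, true_motion))
```

In this toy, the single-frame rollout drifts further from the true motion with every step, while the two-frame rollout tracks it exactly. Real video models are far more complex, but the sketch suggests why looking across many frames at once can help keep a subject consistent.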
The problem is not unique to Sora. All AI video generators can produce wildly nonsensical results when your prompts reach too far past their training data, as we saw earlier this year when testing Runway's Gen-3. In fact, we ran some gymnast prompts through the latest open source AI video model that may rival Sora in some ways, Hunyuan Video, and it produced similar twirling, morphing results.
The video itself is a prime example of what happens when an AI model fails to keep a video sequence coherent. We see a view of what looks like a floor gymnastics routine taking place at an Olympics-like sporting event. The subject of the video flips and flails as new legs and arms rapidly and fluidly emerge and morph out of her twirling and transforming body.
At one point, about 9 seconds in, she loses her head, and it reattaches to her body spontaneously. As venture capitalist Deedy Das pointed out when he originally shared the video on X, this is a perfect example of what we like to call "jabberwocky." This phenomenon occurs when an AI model completely fails to produce a plausible output.
The way the model was trained also explains why Sora's gymnastics video contains these nonsensical synthesis errors. Because the neural network learns only to associate the progression of images with text-based descriptions of them, the coherence of a generated sequence depends on how well the training examples cover the motion being requested.
However, Sora's training data may not have included enough gymnastics videos to accurately capture limb-level precision and physics rules governing human movement. As a result, when generating new video sequences, the model relies on statistical associations between pixels in the training dataset to predict the next frame.
This approach can lead to wildly nonsensical results, as seen in Sora's gymnastics video. Other AI video generators face the same challenge when attempting to generate coherent videos that adhere to physics rules.
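A loose analogy for this failure mode can be sketched in Python. The example below is an assumption-laden toy, not Sora's method: it replaces pixels with made-up pose labels and replaces the neural network with a simple table of counts. The pose names, the training sequences, and the fallback rule are all hypothetical.

```python
from collections import Counter, defaultdict


def train(sequences):
    """Count which pose follows which in the training sequences."""
    table = defaultdict(Counter)
    overall = Counter()
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            table[current][following] += 1
            overall[following] += 1
    return table, overall


def predict_next(table, overall, current):
    """Return the most frequent follower of `current`.

    For a pose never seen in training, fall back to the most common
    next pose overall, whether or not it is physically plausible.
    """
    if current in table:
        return table[current].most_common(1)[0][0]
    return overall.most_common(1)[0][0]


# Hypothetical training set: mostly walking, one simple jump, no flips.
training_sequences = [
    ["stand", "step", "stand", "step", "stand"],
    ["stand", "jump", "land", "stand"],
]
table, overall = train(training_sequences)

print(predict_next(table, overall, "stand"))     # well covered by training
print(predict_next(table, overall, "mid_flip"))  # never seen in training
```

For a pose the toy has seen often, the prediction is sensible. For "mid_flip", which never appears in the training sequences, it simply guesses the most common pose, so the gymnast would snap from mid-air to standing. The counts know nothing about physics, only about what tended to follow what.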
To improve AI video generation, researchers will need to increase the size of their training datasets and incorporate more accurate metadata labeling. Additionally, they will need to develop new algorithms that can better capture the nuances of human movement and physics.
Related Information:
https://arstechnica.com/information-technology/2024/12/twirling-body-horror-in-gymnastics-video-exposes-ais-flaws/
Published: Fri Dec 13 09:34:27 2024 by llama3.2 3B Q4_K_M