Image: Sora displayed on a smartphone with the OpenAI logo visible in the background.
There’s a new model in town – and pretty soon, we’re all going to be talking about it… a lot!
OpenAI has just announced its Sora model (named after the Japanese word for “sky”), and it threatens to make all kinds of video marketing projects obsolete overnight.
Want to generate a compelling video? You don’t need to hire dozens of people to operate cameras or stand in front of them. You don’t need to go to a set: just type a text prompt and you’ll get an amazing video that would otherwise have cost you tens of thousands of dollars to make.
It’s pretty hard to wrap your head around everything Sora is going to do, but it shouldn’t take long to see the effects once OpenAI finally releases the model.
When OpenAI’s explainer page says “Sora is able to generate entire videos in one go, or extend generated videos to make them longer”, you kind of get an idea of the power of this model!
So how does it work?
OpenAI explains that the diffusion model starts with something that looks like noise and gradually removes that noise. The company also notes that, like its previous models, Sora works with small units of data to generate its results.
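To make that idea concrete, here is a minimal Python sketch of how diffusion sampling works in general. This is an illustration of the concept only, not OpenAI’s code; the `denoiser` network and the step count are hypothetical placeholders.

```python
import torch

# A toy caricature of diffusion sampling: begin with pure noise and let a
# trained network strip the noise away a little at a time. The `denoiser`
# here is a hypothetical stand-in for such a network.
def generate(denoiser, shape, num_steps=50):
    x = torch.randn(shape)                   # start from something that "looks like noise"
    for step in reversed(range(num_steps)):  # walk the noise schedule backwards
        predicted_noise = denoiser(x, step)  # the model estimates the remaining noise
        x = x - predicted_noise / num_steps  # remove a small fraction of it
    return x                                 # the fully "denoised" sample
```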
You can find this explanation on the page:
“Similar to GPT models, Sora uses a transformer architecture, unlocking superior scaling performance. We represent videos and images as collections of smaller units of data called patches, each of which is akin to a token in GPT. By unifying how we represent data, we can train diffusion transformers on a wider range of visual data than was possible before, spanning different durations, resolutions, and aspect ratios.”
There’s more about these patches in a technical resource related to the announcement:
“We take inspiration from large language models, which gain general-purpose capabilities by training on Internet-scale data… The success of the LLM paradigm is made possible, in part, by the use of tokens that elegantly unify diverse text modalities: code, mathematics and various natural languages. …Whereas LLMs have text tokens, Sora has visual patches. Patches have previously been shown to be an effective representation for models of visual data. We find that patches are a highly scalable and effective representation for training generative models on various types of videos and images.”
There is also this piece, which clarifies further:
“At a high level, we transform videos into patches by first compressing videos into a lower-dimensional latent space and then decomposing the representation into spatiotemporal patches.”
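As a rough illustration of what that pipeline might look like, here is a short Python sketch. The latent shape and the patch sizes are assumptions made for the example, and the compression step is assumed to have already happened; none of this is Sora’s actual configuration.

```python
import torch

def video_to_patch_tokens(latent_video, t_patch=2, h_patch=4, w_patch=4):
    """Carve a compressed latent video (T, C, H, W) into spatiotemporal patches.

    Each patch is flattened into a single vector, playing the same role a
    text token plays for a GPT-style model. Shapes here are illustrative.
    """
    T, C, H, W = latent_video.shape
    blocks = latent_video.reshape(
        T // t_patch, t_patch, C, H // h_patch, h_patch, W // w_patch, w_patch
    )
    # Group the three "which block" axes together, then flatten each block.
    blocks = blocks.permute(0, 3, 5, 1, 2, 4, 6)
    tokens = blocks.reshape(-1, t_patch * C * h_patch * w_patch)
    return tokens

# Example: suppose a hypothetical encoder has already compressed a clip into
# an (8 frames, 16 channels, 32, 32) latent volume.
latent = torch.randn(8, 16, 32, 32)
tokens = video_to_patch_tokens(latent)
print(tokens.shape)  # torch.Size([256, 512]) -- 256 patch "tokens" of size 512
```

Clips with different lengths or resolutions simply produce more or fewer tokens of the same size, which is what lets a single transformer train across varied durations, resolutions, and aspect ratios.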
The company is also upfront about some of the technology’s limitations, and it discusses detection: certain markers attached to the output, for example, should always tell us that an AI created a particular video.
To learn more, let’s turn to MIT Technology Review, which ran a piece by William Douglas Heaven last week.
Heaven goes over some of Sora’s most impressive abilities, including how it handles occlusion: the program can keep track of objects even as they pass out of view and reappear.
At the same time, he suggests the technology is “not perfect” and raises the possibility that OpenAI is cherry-picking its video results to make the model look better than it is. Because Sora isn’t out yet, we can’t be sure.
OpenAI is working on safety and trying to limit prompts that could create harmful deepfakes, but if you’ve been following the AI revolution, you know that’s easier said than done. Regardless, I wanted to get this out there so people are aware of what’s going on. As OpenAI writes:
“Sora serves as the foundation for models that can understand and simulate the real world, a capability that we believe will be an important step in achieving AGI.”