
Are you ready to be shocked? Runway’s Gen-2 reveals the major flaws in text-to-video technology!

Creating Movies with AI – A Closer Look at Runway’s Gen-2

Artificial Intelligence (AI) is steadily inching its way into the film industry. Joe Russo, director of Marvel’s “Avengers: Endgame,” recently predicted that AI would be able to produce complete films within two years. While this may seem optimistic, it is clear that AI technology is advancing at an unprecedented pace. The latest addition to this trend is Runway’s Gen-2, a commercial AI model that can generate videos from text prompts or existing images. The technology could be a boon for creatives who need a quick fix or a way to come up with novel ideas. However, our tests reveal that Gen-2 still has significant limitations alongside its considerable potential.

Text-to-Video is an Emerging Area of Focus

Text-to-video is an emerging area of focus in AI, increasingly becoming a hot topic among tech giants. Runway’s Gen-2 is one of the first commercially available text-to-video models in the market. It is an upgraded version of Gen-1, which was released in February, and has a larger dataset for training. Several tech giants have demoed text-to-video models in recent years, but these models are still in the research stage and are only accessible to a few data scientists and engineers.

Exploring Gen-2’s Capabilities

To see how well Gen-2 performed, we tested the model by submitting various text prompts. We tried to cover a wide range of genres, styles, and angles that a director, professional or armchair, might like to see on the silver screen. We found that the model’s output had certain limitations, such as low frame rates and blurriness, which could be due to optimization for reduced computing costs.

The model’s generated clips also tend to exhibit artifacts, including pixelation around fast-moving objects. The model was also inconsistent with the physics and anatomy of the objects and people it generated. In addition, it appeared to have difficulty comprehending the prompts’ nuances and sticking to specific descriptors, resulting in unexpected outcomes.

Furthermore, testing revealed the limitations of the model’s training dataset. Limited data on a particular genre, style, or animation type can hurt the model’s ability to generate high-quality images or videos in that style. Overall, however, Gen-2 did pass a surface-level bias test, generating relatively diverse content.

Gen-2: From Novelty to Useful Tool?

The above findings suggest that while Gen-2 is, given its numerous artifacts and limitations, more of a toy or novelty for now, it could become a valuable tool for video and image creation in the future, especially for artists and designers. As the technology stabilizes and AI algorithms continue to improve, filmmakers and animators may eventually be able to build concepts entirely around AI-generated inputs.

However, building an entire movie solely with AI still seems like an over-optimistic speculation. The customizations and edits required to fix generated content may need more resources than filming the footage in the first place. Ethical concerns, such as creating deepfakes that can be misused, remain a risk that needs to be managed.

Conclusion

In conclusion, Runway’s Gen-2 is an exciting step forward for the AI industry in creating video content from text input. The outcomes of testing the model highlight some clear limitations that will need to be addressed if the technology is to become a valuable tool for video workflows. The technology is still evolving, but it is shaping up to be a reliable source of inspiration when you’re in a jam, and creatives may find use for it. Whether AI will eventually be able to produce an entire feature film remains unclear. There is still much work to be done before that point is reached.

Summary:

Runway’s Gen-2 is a commercially available text-to-video AI model that generates videos from text prompts or existing images. While it is one of the first in the market, the results of testing the model reveal several technical limitations. The frame rate of the model’s output is low, and the videos tend to be grainy and blurry with artifacts. The model may have issues with nuance and descriptors in the prompts and lacks consistency with physics and anatomy. The model’s output quality can be limited by its training dataset. Though Gen-2 has potential, its results are more like a novelty or a toy than a reliable video workflow tool. As technology evolves, it may become a valuable tool for creatives, but AI being able to produce a complete movie in two years is optimistic.

—————————————————-


In a recent panel interview with Collider, Joe Russo, the director of Marvel blockbuster movies like “Avengers: Endgame,” predicted that within two years, AI will be able to create an entire movie.

I’d say that’s a pretty optimistic timeline. But we are getting closer.

This week, Runway, the Google-backed AI startup that helped develop the AI image generator Stable Diffusion, launched Gen-2, a model that generates videos from text prompts or an existing image. (Gen-2 previously had limited, waitlisted access.) The follow-up to Runway’s Gen-1 model released in February, Gen-2 is one of the first commercially available text-to-video models.

“Commercially available” is an important distinction. As the next logical frontier in generative AI after text and images, text-to-video is becoming a larger area of focus, particularly among tech giants, several of which have demoed text-to-video models over the past year. But those models remain firmly in the research stage, inaccessible to all but a select few data scientists and engineers.

Of course, first is not necessarily best.

Out of personal curiosity and as a service to you, dear readers, I ran some prompts through Gen-2 to get an idea of what the model can and cannot achieve. (Runway currently provides around 100 seconds of video generation for free.) There wasn’t much method to my madness, but I tried to capture a variety of angles, genres, and styles that a director, professional or armchair, might like to see on the silver screen, or on a laptop, as the case may be.

One limitation of Gen-2 that became immediately apparent is the frame rate of the four-second videos the model outputs. It’s quite low, and noticeably so, to the point where the clips look almost like slideshows in places.

Runway Gen-2

Image Credits: Runway

What’s unclear is whether this is a glitch with the technology or Runway’s attempt to save on computing costs. In any case, it makes Gen-2 a rather unattractive proposition for publishers hoping to avoid post-production work.
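
If you want to check the frame rate for yourself, a downloaded clip’s metadata tells the story. Below is a minimal sketch using OpenCV; the filename “gen2_clip.mp4” is a placeholder for wherever you saved the export, not anything Runway produces by default.

```python
# Minimal sketch: inspect the frame rate and duration of a downloaded Gen-2 clip.
# Assumes OpenCV is installed (pip install opencv-python) and the clip is saved
# locally as "gen2_clip.mp4" -- the filename is a placeholder.
import cv2

cap = cv2.VideoCapture("gen2_clip.mp4")
if not cap.isOpened():
    raise SystemExit("Could not open the clip -- check the path.")

fps = cap.get(cv2.CAP_PROP_FPS)                   # reported frames per second
frame_count = cap.get(cv2.CAP_PROP_FRAME_COUNT)   # total frames in the file
duration = frame_count / fps if fps else 0.0

print(f"fps: {fps:.2f}, frames: {int(frame_count)}, duration: {duration:.2f}s")
cap.release()
```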

Beyond the frame rate issue, I found that Gen-2-generated clips tend to share a certain graininess or blurriness, as if some sort of old Instagram filter had been applied to them. Other artifacts occur in places as well, such as pixelation around objects when the “camera” (for lack of a better word) quickly circles or zooms in on them.

As with many generative models, Gen-2 isn’t particularly consistent with regard to physics or anatomy, either. Like something conjured up by a surrealist, the arms and legs of people in Gen-2-produced videos merge and separate again as objects melt into the floor and disappear, their reflections warping and distorting. And, depending on the prompt, faces can resemble dolls, with glowing, emotionless eyes and pale skin reminiscent of cheap plastic.

Runway Gen-2

Image Credits: Runway

To top it off, there is the issue of content. Gen-2 seems to have a hard time understanding nuance, clinging to particular descriptors in prompts while ignoring others, seemingly at random.

Runway Gen-2

Image Credits: Runway

One of the prompts I tried, “A video of an underwater utopia, shot with an old camera, in the style of a ‘found footage’ movie,” produced no such utopia, just what looked like a first-person dive through an anonymous coral reef. Gen-2 had problems with my other prompts as well: it couldn’t generate a zoom shot for a prompt that specifically asked for a “slow zoom,” and it didn’t quite nail the look of your average astronaut.

Could the issues be related to Gen-2’s training dataset? Maybe.

Gen-2, like Stable Diffusion, is a diffusion model, which means that it learns to gradually subtract noise from an initial image made entirely of noise, bringing it closer, step by step, to the text prompt. Diffusion models learn through training on millions or billions of examples; in an academic paper detailing Gen-2’s architecture, Runway says the model was trained on an internal dataset of 240 million images and 6.4 million video clips.
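
To make the “gradually subtract noise” idea concrete, here is a toy sketch of the reverse (denoising) loop that diffusion models run at sampling time. The noise predictor and schedule below are stand-ins for illustration, not Runway’s actual network or settings.

```python
# Toy illustration of a DDPM-style reverse (denoising) loop.
# `predict_noise` stands in for a trained network conditioned on the text
# prompt; it is a placeholder, not Runway's model.
import numpy as np

def predict_noise(x, t, prompt):
    """Placeholder for a learned noise predictor eps_theta(x_t, t, prompt)."""
    return np.zeros_like(x)  # a real model would return its noise estimate here

def sample(prompt, shape=(64, 64, 3), steps=50, rng=np.random.default_rng(0)):
    betas = np.linspace(1e-4, 0.02, steps)   # simple linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    x = rng.standard_normal(shape)           # start from pure noise
    for t in reversed(range(steps)):
        eps = predict_noise(x, t, prompt)    # estimate the noise present in x_t
        # Remove the predicted noise and rescale toward x_{t-1}
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            x += np.sqrt(betas[t]) * rng.standard_normal(shape)  # re-inject noise
    return x  # with a trained model, this would be a frame matching the prompt

frame = sample("a CEO walking into a conference room")
```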

Diversity in those examples is key. If the dataset doesn’t contain much footage of, for example, animation, the model, lacking reference points, won’t be able to generate animation of reasonable quality. (And animation is a broad field; even if the dataset did have clips of anime and hand-drawn animation, the model wouldn’t necessarily generalize well to all types of animation.)

Runway Gen-2

Image Credits: Runway

On the plus side, Gen-2 passes a bias test at the surface level. While generative AI models like DALL-E 2 have been found to reinforce societal biases, generating images of positions of authority, such as “CEO” or “director,” that depict predominantly white men, Gen-2 was a little more diverse in the content it generated, at least in my tests.

Runway Gen-2

Image Credits: Runway

Fed the prompt “A video of a CEO walking into a conference room,” Gen-2 generated a video of men and women (though more men than women) seated around something like a conference table. Meanwhile, the result of the prompt “A video of a doctor working in an office” shows a vaguely Asian-looking female doctor behind a desk.

However, the results for any prompt containing the word “nurse” were less promising, consistently showing young white women. The same goes for the phrase “a person who waits tables.” Clearly, there is work to be done.
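
If you wanted to turn this kind of spot check into something more systematic, you could loop a set of occupation prompts through the model and review what comes back. The sketch below uses a hypothetical generate_video wrapper, since Runway’s API details aren’t covered here, and leaves the labelling of results as a manual review step.

```python
# Sketch of a simple bias spot-check: run occupation prompts repeatedly and
# save the outputs for manual review. `generate_video` is a hypothetical
# wrapper around whatever interface you use to reach Gen-2; it is not a
# documented Runway API call.
from pathlib import Path

def generate_video(prompt: str) -> bytes:
    """Hypothetical stand-in that returns a generated MP4 as raw bytes."""
    raise NotImplementedError("plug in your own Gen-2 access here")

PROMPTS = [
    "A video of a CEO walking into a conference room",
    "A video of a doctor working in an office",
    "A video of a nurse checking on a patient",
    "A video of a person who waits tables",
]

def collect(prompts=PROMPTS, runs_per_prompt=5, out_dir="bias_probe"):
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for prompt in prompts:
        for i in range(runs_per_prompt):
            clip = generate_video(prompt)                     # one generation per run
            name = f"{prompt[:30].replace(' ', '_')}_{i}.mp4"
            (out / name).write_bytes(clip)                    # review these by hand

if __name__ == "__main__":
    collect()
```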

The bottom line from all of this, for me, is that Gen-2 is more of a novelty or toy than a truly useful tool in any video workflow. Could the results be edited into something more coherent? Maybe. But depending on the video, it would potentially require more work than shooting footage in the first place.

That’s not to be dismissive of the technology, though. It’s impressive what Runway has done here, effectively beating the tech giants to the text-to-video punch. And I’m sure some users will find uses for Gen-2 that don’t require photorealism, or a lot of customization. (Runway CEO Cristóbal Valenzuela recently told Bloomberg that he sees Gen-2 as a way to offer artists and designers a tool that can help them with their creative processes.)

Runway Gen-2

Image Credits: Runway

I found a few myself. Gen-2 can, in fact, understand a range of styles, such as anime and claymation, which lend themselves to the lower frame rate. With a bit of fiddling and editing work, it wouldn’t be impossible to stitch a few clips together into a narrative piece.
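
For the stitching part, one low-effort route is ffmpeg’s concat demuxer. The sketch below assumes ffmpeg is installed and that the clips share a codec, resolution, and frame rate; the filenames are placeholders, not anything Gen-2 produces.

```python
# Sketch: concatenate a handful of Gen-2 clips into one file using ffmpeg's
# concat demuxer. Assumes ffmpeg is on PATH and the clips share codec,
# resolution, and frame rate; the filenames are placeholders.
import subprocess
from pathlib import Path

clips = ["shot_01.mp4", "shot_02.mp4", "shot_03.mp4"]  # placeholder filenames

# The concat demuxer reads a text file listing the inputs in playback order.
list_file = Path("clips.txt")
list_file.write_text("".join(f"file '{c}'\n" for c in clips))

subprocess.run(
    ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
     "-i", str(list_file), "-c", "copy", "stitched.mp4"],
    check=True,  # raise if ffmpeg exits with an error
)
```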

Lest you worry about the potential for deepfakes, Runway says it’s using a combination of AI and human moderation to prevent users from generating videos that include pornography or violent content, or that violate copyrights. I can confirm there is a content filter, and an overzealous one at that. But of course, those methods aren’t foolproof, so we’ll have to see how well they work in practice.

Runway Gen-2

Image Credits: Runway

But at least for now, filmmakers, animators and CGI artists, and ethicists can rest easy. It will be at least a couple of iterations down the line before Runway’s technology comes anywhere close to generating movie-quality images, assuming it ever gets there.

Runway’s Gen-2 shows the limitations of today’s text-to-video tech


—————————————————-