Meta unveils AI tool that creates GIF-like videos from text prompts

Meta has unveiled its video generation AI (Copyright: Meta)
By Luke Hurst

Meta has unveiled an artificial intelligence (AI) programme that takes text-to-image generation to the next level: generating videos from text prompts.

Facebook’s parent company released a slew of short videos based on text prompts, building on the recent developments in text-to-image artificial intelligence creations.

The videos are created by an AI that learns what the world looks like from text and image data that is paired up. It also learns what motion looks like by studying video footage without any associated text.

By then melding these two sets of learnings together, it creates relevant video footage with just a basic text prompt.

It is a burgeoning field of AI research, and Meta says its new Make-A-Video system “has the potential to open new opportunities for creators and artists”.

“With just a few words or lines of text, Make-A-Video can bring imagination to life and create one-of-a-kind videos full of vivid colors, characters, and landscapes. The system can also create videos from images or take existing videos and create new ones that are similar,” the company said in a statement.

So what do these videos actually look like? Meta announced Make-A-Video with posts on social media, encouraging followers on Twitter to come up with some prompts, which it duly fed into its algorithm.

The results are impressive, but there is something distinctly unnerving about the videos.

Make-A-Video is not yet open to the public to use, but Meta has showcased the three functions it currently has.

The first one is making a video with just a line of text, and this can be rendered as a surreal, realistic, or stylised video.

Then it has the option to take a still image and bring it to life in the form of a video.

Finally, it can take a video and generate different versions of it.

Images brought to life

Meta announced Make-A-Scene earlier this year, which generates photorealistic illustrations and artworks using text and freeform sketches as prompts.

That came alongside another major leap forward in text-to-image technology, with the release of DALL-E 2 from AI research company OpenAI.

With DALL-E 2, anyone can sign up and feed prompts into it, creating their own weird and wonderful still images. If you wanted, for example, a picture of a cat wearing boots in the mud, voilà.

DALL-E 2's creation for the prompt 'a cat wearing boots in the mud' (credit: DALL-E 2)

Or, aliens hovering over the London skyline.

DALL-E 2's creation for the prompt 'aliens hovering over Big Ben' (credit: DALL-E 2)

With Make-A-Video, Meta has joined a number of other companies pushing at the forefront of AI-generated video, which is technically and financially a harder task than image creation.

That’s because, according to the authors behind another video creation model, Phenaki, “there is much less high quality data available and the computational requirements are much more severe”.

In a research paper announcing the results of their programme, which is capable of stringing together a video much longer than the Make-A-Video ones, they write that for image generation there are datasets with billions of image-text pairs, while for text-video datasets the numbers are “substantially smaller”.

Make-A-Video is attempting to overcome this shortage of text-video data with “unsupervised learning” - essentially leaving its AI to learn what realistic motion looks like without a text label attached to the videos it studies.

“Our intuition is simple,” wrote the authors behind Meta’s research paper. “Learn what the world looks like and how it is described from paired text-image data, and learn how the world moves from unsupervised video footage”.

Meta has indicated its goal is to one day make the technology available to the public, but it has not said when this will happen.