Researchers at Google have unveiled Vlogger, a framework that can create a video from a single image and an audio recording. Vlogger builds on the success of recent generative diffusion models. OpenAI, for example, recently introduced the impressive Sora, which generates almost photorealistic video from a text prompt. In autumn 2023, HeyGen was introduced, an AI that can translate video recordings into different languages, so that anyone can suddenly appear multilingual if they want to. Vlogger is meant to combine all of this.
The research team led by doctoral student Enric Corona from the Universitat Politècnica de Catalunya has developed a method that is said to go beyond previous work. Realistic videos of speaking people are created using a two-stage pipeline. In the first stage, according to the researchers, body movements are generated from the audio input and a still image showing a person in a pose. In the second stage, the result is translated into video frames using an image-to-image model.
This approach is intended to create videos of variable length whose content can also be controlled. For example, a single image can be used to create several videos in which the person moves differently. Unlike some previous work, Vlogger is said to work without being trained on the individual person shown, among other things. In addition, the output is meant to be photorealistic, and both the audio and the body movements are meant to be controllable.
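The two-stage structure described above can be illustrated with a small Python sketch. It is not the researchers' code: every name in it (`Motion`, `generate_motion`, `render_frames`, `make_video`) is hypothetical, the "motion" is reduced to a single mouth-opening value derived from loudness, and the second stage only returns text labels where the real system would use an image-to-image model. The sketch merely shows how one still image plus audio of any length can yield a correspondingly long sequence of frames.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Motion:
    """Hypothetical per-frame control values (the real system predicts far more)."""
    frame_index: int
    mouth_open: float  # 0.0 = closed, 1.0 = fully open


def generate_motion(reference_image: str, audio: List[float],
                    fps: int, sample_rate: int) -> List[Motion]:
    """Stage 1 stand-in: one motion record per output frame.

    The reference image is accepted because stage 1 is described as using it,
    but this toy version only looks at the loudness of the audio.
    """
    samples_per_frame = sample_rate // fps
    motions: List[Motion] = []
    for start in range(0, len(audio) - samples_per_frame + 1, samples_per_frame):
        window = audio[start:start + samples_per_frame]
        loudness = sum(abs(s) for s in window) / len(window)
        motions.append(Motion(len(motions), min(1.0, loudness)))
    return motions


def render_frames(reference_image: str, motions: List[Motion]) -> List[str]:
    """Stage 2 stand-in: returns a text label per frame instead of pixels."""
    return [f"{reference_image}@frame{m.frame_index}:mouth={m.mouth_open:.2f}"
            for m in motions]


def make_video(reference_image: str, audio: List[float],
               fps: int = 25, sample_rate: int = 16000) -> List[str]:
    """Chain both stages; the frame count follows the audio length."""
    motions = generate_motion(reference_image, audio, fps, sample_rate)
    return render_frames(reference_image, motions)


if __name__ == "__main__":
    one_second = [0.5] * 16000  # synthetic audio, assumed 16 kHz
    frames = make_video("person.png", one_second)
    print(len(frames), frames[0])
```

Because the frame count is derived from the audio alone, a longer recording simply produces more frames from the same image, which is the "variable length" property in miniature.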
Vlogger also allows details such as facial expressions to be adjusted in videos that have already been created. One example shows, among other things, the same sequence edited so that the person keeps either their eyes or their mouth closed.
As with HeyGen, videos can be translated into other languages. In one example video, however, it is noticeable that the lip movements do not quite match the sound and appear only partially synchronized. In general, the videos created with Vlogger still look somewhat artificial in places.
(mack)