ByteDance Unveils OmniHuman-1: The First General Multi-Modal Human Video Generation Project, Moving Beyond Simple Audio-Driven Digital Humans
The rise of Diffusion Transformer technology has significantly improved the quality of general AI video generation, and mainstream video models already deliver impressive results. Human body animation synthesis, however, had seen little comparable progress until now.
Existing end-to-end human animation models rely heavily on curated, filtered datasets during training, which limits their generalization ability and makes it difficult to scale up training data.
Audio-driven models are mostly limited to facial or portrait animation, while pose-driven models have stringent requirements for input image perspectives and backgrounds. These limitations have hindered significant advancements in human body animation models.
ByteDance's newly unveiled OmniHuman-1 is genuinely impressive. It generates ultra-realistic human videos, handles speech, singing, and human-object interaction well, and supports a variety of input modalities and image styles.
Project Overview
OmniHuman-1, launched by the ByteDance team, is a novel human animation generation model. It produces high-quality human videos by blending multiple conditioning signals such as text, audio, and pose. Built on the Diffusion Transformer architecture, the model employs an "all-condition" training strategy to bring large-scale data into the training process, breaking through the limits that traditional methods face in data scale and generalization ability.
OmniHuman can generate highly realistic human videos, supporting diverse human portrait content (such as close-ups, half-body, full-body), different image styles, as well as complex actions and interactions between humans and objects. It also supports multiple driving modalities (audio-driven, video-driven, and combined driving), greatly enhancing the diversity and flexibility of generated videos.
DEMO
- Audio-Driven Single Image
- Support for Cartoons, Animals, and Special Poses
- Hand Optimization
- Action Imitation
Features
- Image to Dynamic Video: With just a single portrait image, combined with motion signals such as pure audio, pure video, or a combination of the two, OmniHuman-1 can generate realistic full-body dynamic videos. This breaks the previous limitation of animating only faces or upper bodies; gestures and movements during speech are vividly displayed.
- Multi-Modal Input Fusion: By fusing multiple input signals such as text, audio, and body motion, and by training with the "all-condition" method, OmniHuman-1 learns from larger and richer datasets. This improves generation quality and reduces the amount of training data that would otherwise be discarded.
- Support for Diverse Content and Styles: OmniHuman-1 handles a wide range of human portrait content, including close-ups, half-body shots, and full-body shots. It accurately generates both speaking and singing scenarios, processes human-object interactions and complex body movements, and is compatible with various image styles such as cartoon characters, artificial objects, animals, and even challenging poses.
- Flexible Input: OmniHuman-1 accepts input images of arbitrary aspect ratio and noticeably improves hand-gesture generation. It also adopts a progressive, multi-stage training scheme that accounts for the influence of each condition, ensuring that generated videos adapt seamlessly to various formats.
Technical Features
- Multi-Condition Mixed Training: An innovative multi-condition training strategy integrates motion-related conditions such as text, audio, and pose into a single training process. It follows two principles: first, tasks with stronger conditions can leverage weaker-condition tasks and their data, expanding the usable training set; second, the stronger the condition, the lower its training ratio. By introducing the driving modalities in stages (text first, then audio, then pose) and balancing the training ratios appropriately, the model makes full use of large-scale mixed data, learns more effectively, and overfits less (a minimal sketch of such a schedule appears after this list).
- Unique Condition Injection Method: OmniHuman-1 injects audio and pose conditions in distinct ways. For audio, features are extracted with the wav2vec model, compressed by an MLP, and concatenated with features from adjacent timestamps to form audio tokens, which are injected into the model through a cross-attention mechanism. For pose, a pose encoder encodes the pose heatmap sequence; the resulting features are concatenated with those of adjacent frames to form pose tokens, which are stacked with the noise latent features along the channel dimension and fed into the model, enabling precise motion control (see the second sketch after this list).
- Efficient Appearance Condition Processing: To handle appearance conditions, the original denoising DiT backbone is reused to encode the reference image. The reference image is encoded into a latent representation by a VAE, and both this latent and the noisy video latent are converted into token sequences and fed into the DiT together. By modifying the 3D Rotary Position Embedding (RoPE) so that the temporal component of the reference tokens is set to zero, the network can distinguish reference tokens from video tokens and transfer appearance features effectively without adding extra parameters (see the third sketch after this list).
- Intelligent Inference Strategy: At inference time, conditions are activated according to the driving scenario. When driving with audio, all conditions except pose are activated; for combinations involving pose, all conditions are activated; when driving with pose alone, audio is disabled. For audio and text under multiple conditions, Classifier-Free Guidance (CFG) is applied, and a CFG annealing strategy is proposed that gradually reduces the guidance scale during inference. This balances expressiveness against computational efficiency, reduces artifacts such as unnatural wrinkles, and preserves lip synchronization and motion expressiveness (see the final sketch after this list).
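To make the two training principles concrete, here is a minimal Python sketch of a staged schedule in which each stage adds a stronger modality and stronger conditions are sampled with a lower ratio. The stage names and ratio values are illustrative assumptions, not the values used by OmniHuman-1.

```python
import random

# Illustrative staged schedule: each stage adds a stronger modality, and
# stronger conditions are trained with a lower sampling ratio.
# Stage names and ratios are assumptions, not OmniHuman-1's actual values.
STAGES = [
    {"name": "stage1_text",      "ratios": {"text": 1.0}},
    {"name": "stage2_add_audio", "ratios": {"text": 1.0, "audio": 0.5}},
    {"name": "stage3_add_pose",  "ratios": {"text": 1.0, "audio": 0.5, "pose": 0.25}},
]

def sample_active_conditions(stage):
    """Drop each condition at random according to its training ratio, so data
    lacking strong annotations can still contribute to training."""
    return [c for c, ratio in stage["ratios"].items() if random.random() < ratio]

if __name__ == "__main__":
    random.seed(0)
    for stage in STAGES:
        counts = {c: 0 for c in stage["ratios"]}
        for _ in range(1000):
            for c in sample_active_conditions(stage):
                counts[c] += 1
        print(stage["name"], counts)
```

Running it prints how often each condition was active in each stage; pose, the strongest condition, appears least often, mirroring the "stronger condition, lower training ratio" principle.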
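The condition injection path can be sketched as follows, assuming the audio features have already been extracted by a frozen wav2vec-style encoder. The layer sizes, the ±2-frame window used for "adjacent" concatenation, and all tensor shapes are assumptions for illustration, not the released architecture.

```python
import torch
import torch.nn as nn

class ConditionInjection(nn.Module):
    """Sketch of the two injection paths: audio tokens via cross-attention,
    pose features stacked with the noise latents along the channel dimension.
    All dimensions and the +/- `window` of adjacent frames are assumptions."""

    def __init__(self, audio_dim=768, pose_ch=17, latent_ch=16, model_dim=512, window=2):
        super().__init__()
        self.window = window
        k = 2 * window + 1
        # MLP that compresses wav2vec-style audio features.
        self.audio_mlp = nn.Sequential(
            nn.Linear(audio_dim, model_dim), nn.SiLU(), nn.Linear(model_dim, model_dim))
        # Adjacent-timestamp concatenation -> audio tokens in model_dim.
        self.audio_proj = nn.Linear(model_dim * k, model_dim)
        # Video tokens attend to audio tokens.
        self.cross_attn = nn.MultiheadAttention(model_dim, num_heads=8, batch_first=True)
        # Pose heatmaps (with adjacent frames fused) -> latent-sized features.
        self.pose_encoder = nn.Conv2d(pose_ch * k, latent_ch, kernel_size=3, padding=1)

    def _concat_adjacent(self, x):
        # x: (B, T, C). Concatenate each frame with its +/- window neighbours
        # along the feature dim (wrap-around padding used here for brevity).
        shifted = [torch.roll(x, shifts=s, dims=1) for s in range(-self.window, self.window + 1)]
        return torch.cat(shifted, dim=-1)

    def forward(self, video_tokens, audio_feats, pose_heatmaps, noise_latents):
        # audio_feats: (B, T, audio_dim), e.g. from a frozen wav2vec encoder.
        audio_tokens = self.audio_proj(self._concat_adjacent(self.audio_mlp(audio_feats)))
        video_tokens, _ = self.cross_attn(video_tokens, audio_tokens, audio_tokens)

        # pose_heatmaps: (B, T, pose_ch, H, W); noise_latents: (B, T, latent_ch, H, W).
        B, T, C, H, W = pose_heatmaps.shape
        fused = self._concat_adjacent(pose_heatmaps.reshape(B, T, -1))
        fused = fused.reshape(B * T, -1, H, W)                  # (B*T, pose_ch*k, H, W)
        pose_feats = self.pose_encoder(fused).reshape(B, T, -1, H, W)
        # Channel-wise stacking with the noisy video latents.
        return video_tokens, torch.cat([noise_latents, pose_feats], dim=2)

if __name__ == "__main__":
    m = ConditionInjection()
    out_tokens, out_latents = m(
        video_tokens=torch.randn(1, 64, 512),
        audio_feats=torch.randn(1, 8, 768),
        pose_heatmaps=torch.randn(1, 8, 17, 32, 32),
        noise_latents=torch.randn(1, 8, 16, 32, 32),
    )
    print(out_tokens.shape, out_latents.shape)  # (1, 64, 512) (1, 8, 32, 32, 32)
```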
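The appearance-conditioning trick can be illustrated by how the temporal index for the 3D RoPE might be assigned: reference-image tokens receive time index zero while video tokens keep their frame index. Token counts, dimensions, and the exact RoPE layout below are assumptions.

```python
import torch

def build_time_indices(num_ref_tokens: int, num_frames: int, tokens_per_frame: int):
    """Temporal index used by the 3D RoPE: reference-image tokens get t = 0,
    video tokens keep a frame index (here 1..num_frames, an assumption),
    so the shared DiT can tell the two token types apart."""
    ref_t = torch.zeros(num_ref_tokens, dtype=torch.long)
    vid_t = torch.arange(1, num_frames + 1).repeat_interleave(tokens_per_frame)
    return torch.cat([ref_t, vid_t])

def temporal_rope_angles(time_idx: torch.Tensor, dim: int = 64, base: float = 10000.0):
    """Standard RoPE rotation angles for the temporal axis only
    (the spatial axes of the 3D RoPE are omitted for brevity)."""
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    return time_idx.to(torch.float32)[:, None] * freqs[None, :]   # (num_tokens, dim // 2)

if __name__ == "__main__":
    t_idx = build_time_indices(num_ref_tokens=256, num_frames=8, tokens_per_frame=256)
    print(t_idx[:3], t_idx[-3:])                 # reference tokens -> 0, last frame -> 8
    print(temporal_rope_angles(t_idx).shape)     # torch.Size([2304, 32])
```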
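Finally, a sketch of the inference-time condition switching and the CFG annealing idea. The mode names, guidance values, and the linear decay schedule are assumptions used only to illustrate the described behavior.

```python
def active_conditions(mode: str) -> dict:
    """Illustrative mapping from driving scenario to activated conditions,
    following the rules described in the list above."""
    if mode == "audio":          # audio-driven: everything except pose
        return {"text": True, "audio": True, "pose": False}
    if mode == "pose":           # pose-driven alone: audio disabled
        return {"text": True, "audio": False, "pose": True}
    if mode == "audio+pose":     # combined driving: all conditions on
        return {"text": True, "audio": True, "pose": True}
    raise ValueError(f"unknown driving mode: {mode}")

def annealed_cfg_scale(step: int, num_steps: int, start: float = 7.5, end: float = 2.0) -> float:
    """CFG annealing sketch: decay the guidance scale linearly over the sampling
    steps (start/end values and the linear schedule are assumptions)."""
    return start + (end - start) * step / max(num_steps - 1, 1)

if __name__ == "__main__":
    print(active_conditions("audio"))
    for step in (0, 12, 24):
        print(f"step {step:2d}: cfg = {annealed_cfg_scale(step, 25):.2f}")
```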