Recent character image animation methods based on diffusion models, such as Animate Anyone, have made significant progress in generating consistent and generalizable character animations. However, these approaches fail to produce reasonable associations between characters and their environments. To address this limitation, we introduce Animate Anyone 2, which aims to animate characters with environment affordance. Beyond extracting motion signals from the source video, we additionally capture environmental representations as conditional inputs. The environment is formulated as the region excluding the character, and our model generates characters that populate this region while maintaining coherence with the environmental context. We propose a shape-agnostic mask strategy that more effectively characterizes the relationship between character and environment. Furthermore, to enhance the fidelity of object interactions, we leverage an object guider to extract features of interacting objects and employ spatial blending to inject them. We also introduce a pose modulation strategy that enables the model to handle more diverse motion patterns. Experimental results demonstrate the superior performance of the proposed method.
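To make the environment formulation concrete, below is a minimal sketch of how a shape-agnostic environment condition could be built from a character segmentation mask. The random dilation used here, as well as the function names `shape_agnostic_mask` and `environment_condition` and their parameters, are assumptions for illustration only and do not reproduce the exact strategy described in the paper.

```python
import numpy as np
import cv2

def shape_agnostic_mask(char_mask, max_dilate_frac=0.15, rng=None):
    """Coarsen a character segmentation mask so its outline no longer traces
    the character silhouette (illustrative stand-in for the paper's
    shape-agnostic mask strategy)."""
    rng = rng or np.random.default_rng()
    h, w = char_mask.shape
    # Random, resolution-proportional dilation; force an odd kernel size >= 3.
    k = max(3, int(rng.uniform(0.02, max_dilate_frac) * max(h, w)) | 1)
    kernel = np.ones((k, k), np.uint8)
    return cv2.dilate((char_mask > 0).astype(np.uint8), kernel)

def environment_condition(frame, char_mask):
    """Environment input = source frame with the coarse character region removed."""
    m = shape_agnostic_mask(char_mask)[..., None]  # broadcast over RGB channels
    return frame * (1 - m)
```

Because the mask boundary is randomized rather than tight, the model cannot infer body shape from the mask itself and must instead learn how a character of the given identity should occupy the vacated region.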
Unlike previous character image animation methods, which rely solely on motion signals to animate the character, Animate Anyone 2 additionally extracts environmental representations from the driving video, enabling the animated character to exhibit environment affordance.
The framework of Animate Anyone 2. We capture environmental information from the source video. The environment is formulated as the region devoid of the character and is incorporated as a model input, enabling end-to-end learning of character-environment fusion. To preserve object interactions, we additionally inject features of the objects interacting with the character; these features are extracted by a lightweight object guider and merged into the denoising process via spatial blending (see the sketch below). To handle more diverse motions, we propose a pose modulation approach that better represents the spatial relationships among body limbs.
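As a rough illustration of the object-feature pathway, the following sketch shows one plausible way a lightweight object guider and mask-weighted spatial blending could be wired in PyTorch. The `ObjectGuider` layer layout, the `spatial_blend` formula, and all sizes are assumptions made for exposition, not the actual released architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectGuider(nn.Module):
    """Lightweight multi-scale encoder for the pixels of interacting objects
    (layer sizes are illustrative, not the paper's architecture)."""
    def __init__(self, in_ch=3, dims=(64, 128, 256)):
        super().__init__()
        self.stages = nn.ModuleList()
        c = in_ch
        for d in dims:
            self.stages.append(nn.Sequential(
                nn.Conv2d(c, d, 3, stride=2, padding=1), nn.SiLU()))
            c = d

    def forward(self, obj_rgb, obj_mask):
        x = obj_rgb * obj_mask          # keep only the interacting object
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)             # one feature map per scale
        return feats

def spatial_blend(unet_feat, obj_feat, obj_mask):
    """Mask-weighted injection of object features into a denoising feature map.
    Assumes channel counts already match; a real model would align them,
    e.g. with 1x1 convolutions."""
    h, w = unet_feat.shape[-2:]
    obj_feat = F.interpolate(obj_feat, size=(h, w), mode="bilinear",
                             align_corners=False)
    m = F.interpolate(obj_mask, size=(h, w), mode="nearest")
    return unet_feat * (1 - m) + obj_feat * m
```

In use, `spatial_blend` would be applied at the matching resolution inside the denoising network, one scale per object-guider output, so object appearance is injected only inside the object region while the surrounding features are left untouched.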
Animate Anyone 2 generates characters with contextually coherent environmental interactions, featuring seamless character-scene integration and robust character-object interaction.
Animate Anyone 2 demonstrates robust capability in handling diverse and intricate motions, while ensuring character consistency and maintaining plausible interactions with the environmental context.
Animate Anyone 2 is capable of generating interactions between characters, ensuring the plausibility of their movements and coherence with the surrounding environment.
Viggle can swap the character in a video for a provided character image, a scenario similar to the application of our method. We compare our results with the latest Viggle V3. Viggle's outputs exhibit only rough blending of the character with the environment, lack natural motion, and fail to capture the interaction between the character and its surroundings. In contrast, our results exhibit higher fidelity.
MIMO is the method most relevant to our task setting: it decomposes a video into human, background, and occlusion layers based on depth, then composes these elements to generate the character video. Our approach demonstrates superior robustness and finer detail preservation.