Recent character image animation methods based on diffusion models, such as Animate Anyone, have made significant progress in generating consistent and generalizable character animations. However, these approaches fail to produce reasonable associations between characters and their environments. To address this limitation, we introduce Animate Anyone 2, which aims to animate characters with environment affordance. Beyond extracting motion signals from the source video, we additionally capture environmental representations as conditional inputs. The environment is formulated as the region excluding the character, and our model generates characters that populate this region while maintaining coherence with the environmental context. We propose a shape-agnostic mask strategy that more effectively characterizes the relationship between character and environment. Furthermore, to enhance the fidelity of object interactions, we leverage an object guider to extract features of interacting objects and employ spatial blending to inject them. We also introduce a pose modulation strategy that enables the model to handle more diverse motion patterns. Experimental results demonstrate the superior performance of the proposed method.
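To make the environment formulation concrete, below is a minimal sketch of how a shape-agnostic environment condition could be built from a character segmentation mask. The random dilation used here, as well as the function names `shape_agnostic_mask` and `environment_condition` and their parameters, are assumptions for illustration only and do not reproduce the exact strategy described in the paper.

```python
import numpy as np
import cv2

def shape_agnostic_mask(char_mask, max_dilate_frac=0.15, rng=None):
    """Coarsen a character segmentation mask so its outline no longer traces
    the character silhouette (illustrative stand-in for the paper's
    shape-agnostic mask strategy)."""
    rng = rng or np.random.default_rng()
    h, w = char_mask.shape
    # Random, resolution-proportional dilation; force an odd kernel size >= 3.
    k = max(3, int(rng.uniform(0.02, max_dilate_frac) * max(h, w)) | 1)
    kernel = np.ones((k, k), np.uint8)
    return cv2.dilate((char_mask > 0).astype(np.uint8), kernel)

def environment_condition(frame, char_mask):
    """Environment input = source frame with the coarse character region removed."""
    m = shape_agnostic_mask(char_mask)[..., None]  # broadcast over RGB channels
    return frame * (1 - m)
```

Because the mask boundary is randomized rather than tight, the model cannot infer body shape from the mask itself and must instead learn how a character of the given identity should occupy the vacated region.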
Unlike previous character image animation methods, which rely solely on motion signals to animate the character, Animate Anyone 2 additionally extracts environmental representations from the driving video, enabling the animated character to exhibit environment affordance.
The framework of Animate Anyone 2. We capture environmental information from the source video. The environment is formulated as the region devoid of the character and is incorporated as a model input, enabling end-to-end learning of character-environment fusion. To preserve object interactions, we additionally inject features of the objects interacting with the character; these features are extracted by a lightweight object guider and merged into the denoising process via spatial blending (see the sketch below). To handle more diverse motions, we propose a pose modulation approach that better represents the spatial relationships among body limbs.
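As a rough illustration of the object-feature pathway, the following sketch shows one plausible way a lightweight object guider and mask-weighted spatial blending could be wired in PyTorch. The `ObjectGuider` layer layout, the `spatial_blend` formula, and all sizes are assumptions made for exposition, not the actual released architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectGuider(nn.Module):
    """Lightweight multi-scale encoder for the pixels of interacting objects
    (layer sizes are illustrative, not the paper's architecture)."""
    def __init__(self, in_ch=3, dims=(64, 128, 256)):
        super().__init__()
        self.stages = nn.ModuleList()
        c = in_ch
        for d in dims:
            self.stages.append(nn.Sequential(
                nn.Conv2d(c, d, 3, stride=2, padding=1), nn.SiLU()))
            c = d

    def forward(self, obj_rgb, obj_mask):
        x = obj_rgb * obj_mask          # keep only the interacting object
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)             # one feature map per scale
        return feats

def spatial_blend(unet_feat, obj_feat, obj_mask):
    """Mask-weighted injection of object features into a denoising feature map.
    Assumes channel counts already match; a real model would align them,
    e.g. with 1x1 convolutions."""
    h, w = unet_feat.shape[-2:]
    obj_feat = F.interpolate(obj_feat, size=(h, w), mode="bilinear",
                             align_corners=False)
    m = F.interpolate(obj_mask, size=(h, w), mode="nearest")
    return unet_feat * (1 - m) + obj_feat * m
```

In use, `spatial_blend` would be applied at the matching resolution inside the denoising network, one scale per object-guider output, so object appearance is injected only inside the object region while the surrounding features are left untouched.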
Animate Anyone 2 generates characters with contextually coherent environmental interactions, featuring seamless character-scene integration and robust character-object interaction.
Animate Anyone 2 demonstrates robust capability in handling diverse and intricate motions, while ensuring character consistency and maintaining plausible interactions with the environmental context.
Animate Anyone 2 is capable of generating interactions between characters, ensuring the plausibility of their movements and coherence with the surrounding environment.
Viggle can swap the character in a video for a provided character image, a scenario similar to the application of our method. We compare our results with the latest Viggle V3. Viggle's outputs exhibit only rough blending of the character with the environment, lack natural motion, and fail to capture the interaction between the character and its surroundings. In contrast, our results exhibit higher fidelity.
MIMO is the method most relevant to our task setting: it decomposes a video into human, background, and occlusion layers based on depth, then composes these elements to generate the character video. Our approach demonstrates superior robustness and finer detail preservation.