
Abstract

Relational video customization refers to the creation of personalized videos that depict user-specified relations between two subjects, a crucial task for comprehending real-world visual content. While existing methods can personalize subject appearances and motions, they still struggle with complex relational video customization, where precise relational modeling and high generalization across subject categories are essential. The primary challenge arises from the intricate spatial arrangements, layout variations, and nuanced temporal dynamics inherent in relations; consequently, current models tend to overemphasize irrelevant visual details rather than capturing meaningful interactions. To address these challenges, we propose \(\textbf{DreamRelation}\), a novel approach that personalizes relations through a small set of exemplar videos, leveraging two key components: Relational Decoupling Learning and Relational Dynamics Enhancement. First, in Relational Decoupling Learning, we disentangle relations from subject appearances using a relation LoRA triplet and a hybrid mask training strategy, ensuring better generalization across diverse relationships. Furthermore, we determine the optimal design of the relation LoRA triplet by analyzing the distinct roles of the query, key, and value features within MM-DiT's attention mechanism, making DreamRelation the first relational video generation framework with explainable components. Second, in Relational Dynamics Enhancement, we introduce a space-time relational contrastive loss, which prioritizes relational dynamics while minimizing the reliance on detailed subject appearances. Extensive experiments demonstrate that DreamRelation outperforms state-of-the-art methods in relational video customization. Code and models will be made publicly available.
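
To make the first component more concrete, below is a minimal PyTorch sketch of how a relation LoRA triplet could attach to the query, key, and value projections of an MM-DiT attention block. The assignment of relation versus subject LoRAs across Q/K/V shown here is an illustrative assumption, not the paper's final design, which is derived from the analysis of each feature's role.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 4, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)      # only the LoRA weights are trained
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)   # the low-rank update starts as a no-op
        self.scale = scale

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

class AttentionProjectionsWithTriplet(nn.Module):
    """Q/K/V projections of one attention block, each wrapped with a LoRA.
    Hypothetical split for illustration: relation LoRAs on Q and K
    (relational structure), a subject LoRA on V (appearance)."""
    def __init__(self, dim: int, rank: int = 4):
        super().__init__()
        self.to_q = LoRALinear(nn.Linear(dim, dim), rank)  # relation LoRA
        self.to_k = LoRALinear(nn.Linear(dim, dim), rank)  # relation LoRA
        self.to_v = LoRALinear(nn.Linear(dim, dim), rank)  # subject LoRA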

Overall Framework of DreamRelation


Overall framework of \(\textbf{DreamRelation}\). Our method decomposes relational video customization into two concurrent processes. (1) In Relational Decoupling Learning, Relation LoRAs in the relation LoRA triplet capture relational information, while Subject LoRAs focus on subject appearances. This decoupling process is guided by a hybrid mask training strategy based on their corresponding masks. (2) In Relational Dynamics Enhancement, the proposed space-time relational contrastive loss pulls relational dynamics features (anchor and positive features), derived from pairwise frame differences, closer together, while pushing them away from appearance features (negative features) of single-frame outputs. During inference, subject LoRAs are excluded to avoid introducing undesired appearances and to enhance generalization.
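
As a rough illustration of the second process, the sketch below implements a triplet-style objective in the spirit described above: dynamics features from pairwise frame differences serve as anchor and positive, while single-frame appearance features act as negatives. The margin form and the feature construction are assumptions for illustration, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def dynamics_features(frame_feats: torch.Tensor) -> torch.Tensor:
    """Differences of adjacent frame features (B, T, D) as a simple
    proxy for relational dynamics, yielding (B, T-1, D)."""
    return frame_feats[:, 1:] - frame_feats[:, :-1]

def space_time_relational_contrastive_loss(anchor, positive, negatives,
                                           margin: float = 0.5):
    """Triplet-style loss: pull dynamics features (anchor, positive)
    together and push them away from single-frame appearance features
    (negatives). anchor, positive: (B, D); negatives: (B, N, D)."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos_sim = (anchor * positive).sum(-1)                    # (B,)
    neg_sim = torch.einsum("bd,bnd->bn", anchor, negatives)  # (B, N)
    # Hinge: each negative should be at least `margin` less similar
    # to the anchor than the positive is.
    return F.relu(neg_sim - pos_sim.unsqueeze(-1) + margin).mean()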

Relational Video Customization Results of DreamRelation

Qualitative Comparison of Relational Video Customization

Reference

@article{wei2025DreamRelation,
  title={DreamRelation: Relation-Centric Video Customization},
  author={Wei, Yujie and Zhang, Shiwei and Yuan, Hangjie and Gong, Biao and Tang, Longxiang and Wang, Xiang and Qiu, Haonan and Li, Hengjia and Tan, Shuai and Zhang, Yingya and Shan, Hongming},
  journal={arXiv preprint arXiv:2503.07602},
  year={2025}
}