Video identity customization seeks to synthesize realistic, temporally coherent videos of a specific subject, given a single reference image and a text prompt. This task presents two core challenges: (1) maintaining identity consistency while aligning with the described appearance and actions, and (2) generating natural, fluid motion without unrealistic stiffness. To address these challenges, we introduce Proteus-ID, a novel diffusion-based framework for identity-consistent and motion-coherent video customization. First, we propose a Multimodal Identity Fusion (MIF) module that unifies visual and textual cues into a joint identity representation using a Q-Former, providing coherent guidance to the diffusion model and mitigating modality imbalance. Second, we present a Time-Aware Identity Injection (TAII) mechanism that dynamically modulates identity conditioning across denoising steps, improving fine-detail reconstruction. Third, we propose Adaptive Motion Learning (AML), a self-supervised strategy that reweights the training loss based on optical-flow-derived motion heatmaps, enhancing motion realism without requiring additional inputs. To support this task, we construct Proteus-Bench, a high-quality dataset comprising 200K curated clips for training and 150 individuals from diverse professions and ethnicities for evaluation. Extensive experiments demonstrate that Proteus-ID outperforms prior methods in identity preservation, text alignment, and motion quality, establishing a new benchmark for video identity customization.
Built on a pre-trained Diffusion Transformer (DiT), Proteus-ID integrates three key components: Multimodal Identity Fusion (MIF), Time-Aware Identity Injection (TAII), and Adaptive Motion Learning (AML). Given a reference image and a user prompt, MIF uses a Q-Former to integrate identity text embeddings with visual features prior to denoising. TAII incorporates timestep embeddings to adaptively modulate identity conditioning during denoising. AML enhances motion realism by introducing a self-supervised motion signal that reweights the training loss, without requiring additional inputs at inference.
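The AML idea above can be illustrated with a minimal NumPy sketch: turn a dense optical-flow field into a normalized motion heatmap, then use it to reweight a per-pixel denoising loss so that errors in high-motion regions count more. This is an assumption-laden illustration, not the paper's exact formulation; in particular the `gamma` boost factor and the `1 + gamma * heat` weighting form are hypothetical choices.

```python
import numpy as np

def motion_weight_map(flow, gamma=1.0, eps=1e-8):
    """Convert a dense optical-flow field of shape (H, W, 2) into a
    per-pixel loss-weight map. Hypothetical sketch: the heatmap comes
    from flow magnitude, normalized to [0, 1], then mapped to weights
    in [1, 1 + gamma] so static regions keep a baseline weight of 1."""
    mag = np.linalg.norm(flow, axis=-1)       # (H, W) motion magnitude
    heat = mag / (mag.max() + eps)            # normalize to [0, 1]
    return 1.0 + gamma * heat                 # boost moving regions

def motion_weighted_mse(pred_noise, true_noise, flow, gamma=1.0):
    """Diffusion denoising MSE reweighted by motion: squared error in
    high-motion pixels contributes more to the training objective."""
    w = motion_weight_map(flow, gamma)
    err = (pred_noise - true_noise) ** 2      # per-pixel squared error
    return float((w * err).mean())
```

With an all-zero flow field the weight map is uniformly 1 and the loss reduces to a plain MSE; as motion magnitude grows in a region, its contribution grows proportionally, which is the self-supervised reweighting effect AML relies on.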
Qualitative comparison across ID-Animator, ConsisID, FantasyID, Concat-ID, EchoVideo, and Proteus-ID. Prompt: "A woman with long dark hair, wearing dangling earrings and traditional attire with a red collar, is engaged in a conversation with another person. With subtle head movements and slight shifts in her gaze ... Her demeanor transitions from a neutral look to a gentle smile ..."
Qualitative comparison across ID-Animator, ConsisID, FantasyID, Concat-ID, EchoVideo, and Proteus-ID. Prompt: "A man with short brown hair, wearing black-framed glasses and a dark navy sweater, meticulously adjusts camera settings in a studio, slightly leaning in as he aligns the viewfinder. He carefully sets the tripod height, frames the shot with his hands, and gently presses the shutter button ..."
Qualitative comparison across ID-Animator, ConsisID, FantasyID, Concat-ID, EchoVideo, and Proteus-ID. Prompt: "A relaxed man with a captivating smile, navigating through an underwater environment in a sophisticated diving suit with gold and black accents. He confidently gestures with black leather gloves. Red rock formations create a dramatic backdrop ..."
@article{Proteus-ID,
title={Proteus-ID: ID-Consistent and Motion-Coherent Video Customization},
author={Zhang, Guiyu and Shi, Chen and Jiang, Zijian and Xiang, Xunzhi and Qian, Jingjing and Shi, Shaoshuai and Jiang, Li},
year={2025}
}