Proteus-ID: ID-Consistent and Motion-Coherent Video Customization

Guiyu Zhang1, Chen Shi1, Zijian Jiang1, Xunzhi Xiang2,
Jingjing Qian1, Shaoshuai Shi3, Li Jiang1†
1The Chinese University of Hong Kong, Shenzhen, 2Nanjing University, 3Voyager Research, Didi Chuxing

Abstract

Video identity customization seeks to synthesize realistic, temporally coherent videos of a specific subject, given a single reference image and a text prompt. This task presents two core challenges: (1) maintaining identity consistency while aligning with the described appearance and actions, and (2) generating natural, fluid motion without unrealistic stiffness. To address these challenges, we introduce Proteus-ID, a novel diffusion-based framework for identity-consistent and motion-coherent video customization. First, we propose a Multimodal Identity Fusion (MIF) module that unifies visual and textual cues into a joint identity representation using a Q-Former, providing coherent guidance to the diffusion model and mitigating the imbalance between visual and textual conditioning. Second, we present a Time-Aware Identity Injection (TAII) mechanism that dynamically modulates identity conditioning across denoising steps, improving fine-detail reconstruction. Third, we propose Adaptive Motion Learning (AML), a self-supervised strategy that reweights the training loss based on optical-flow-derived motion heatmaps, enhancing motion realism without requiring additional inputs. To support this task, we construct Proteus-Bench, a high-quality dataset comprising 200K curated clips for training and 150 individuals from diverse professions and ethnicities for evaluation. Extensive experiments demonstrate that Proteus-ID outperforms prior methods in identity preservation, text alignment, and motion quality, establishing a new benchmark for video identity customization.

Video

Overview of Proteus-ID

Figure: The Proteus-ID framework.

Built on a pre-trained DiT, Proteus-ID integrates three key components: Multimodal Identity Fusion (MIF), Time-Aware Identity Injection (TAII), and Adaptive Motion Learning (AML). Given a reference image and user prompt, MIF uses a Q-Former to integrate identity text embeddings with visual features prior to denoising. TAII incorporates timestep embeddings to adaptively modulate identity conditioning during denoising. AML enhances motion realism by introducing a self-supervised motion signal to reweight the training loss—without requiring additional inputs at inference.
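
For concreteness, the sketch below shows one way the three components could be wired up in PyTorch. It is a minimal illustration of the mechanisms described above, not the authors' implementation: all class names, dimensions, and wiring (MultimodalIdentityFusion, TimeAwareIdentityInjection, adaptive_motion_loss) are our own illustrative assumptions.

```python
# Minimal PyTorch sketch of the three components described above.
# All names, dimensions, and wiring are illustrative assumptions,
# not the authors' released code.
import torch
import torch.nn as nn


class MultimodalIdentityFusion(nn.Module):
    """MIF (assumed): learnable queries attend, Q-Former style, to the
    concatenation of identity text tokens and visual feature tokens."""
    def __init__(self, dim=768, num_queries=32, num_layers=4, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        layer = nn.TransformerDecoderLayer(dim, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)

    def forward(self, text_tokens, visual_tokens):
        # text_tokens: (B, Lt, D), visual_tokens: (B, Lv, D)
        memory = torch.cat([text_tokens, visual_tokens], dim=1)
        q = self.queries.unsqueeze(0).expand(text_tokens.size(0), -1, -1)
        return self.decoder(q, memory)  # joint identity tokens: (B, Q, D)


class TimeAwareIdentityInjection(nn.Module):
    """TAII (assumed): an AdaLN-style scale/shift over the identity
    tokens, predicted from the diffusion timestep embedding, so the
    strength of identity conditioning varies across denoising steps."""
    def __init__(self, dim=768):
        super().__init__()
        self.to_scale_shift = nn.Sequential(nn.SiLU(), nn.Linear(dim, 2 * dim))

    def forward(self, id_tokens, t_emb):
        # id_tokens: (B, Q, D), t_emb: (B, D) from the DiT backbone
        scale, shift = self.to_scale_shift(t_emb).chunk(2, dim=-1)
        return id_tokens * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)


def adaptive_motion_loss(pred, target, flow_mag):
    """AML (assumed): reweight the per-pixel diffusion loss with a
    heatmap derived from optical-flow magnitude. Training-time only."""
    # pred/target: (B, T, C, H, W); flow_mag: (B, T, H, W)
    heat = flow_mag / (flow_mag.amax(dim=(-2, -1), keepdim=True) + 1e-6)
    weight = 1.0 + heat.unsqueeze(2)  # (B, T, 1, H, W), emphasizes motion
    return (weight * (pred - target) ** 2).mean()
```

Note that in this sketch the flow-derived heatmap only reweights the training objective; nothing about inference changes, which matches the paper's claim that no additional inputs are needed at test time.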

Comparison

Visual Results

BibTeX

@article{Proteus-ID,
  title={Proteus-ID: ID-Consistent and Motion-Coherent Video Customization},
  author={Zhang, Guiyu and Shi, Chen and Jiang, Zijian and Xiang, Xunzhi and Qian, Jingjing and Shi, Shaoshuai and Jiang, Li},
  year={2025}
}