<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>CameraCtrl</title>
<link href="./CameraCtrl_files/style.css" rel="stylesheet">
<script type="text/javascript" src="./CameraCtrl_files/jquery.js"></script>
<script type="text/javascript" src="./CameraCtrl_files/jquery.mlens-1.0.min.js"></script>
</head>
<body>
<div class="content">
<h1><strong>CameraCtrl: Enabling Camera Control for Video <br> Diffusion Models</strong></h1>
<p id="authors" class="serif">
<span style="font-size: 1.0em">
<a href="https://hehao13.github.io">Hao He<sup>1</sup></a>
<a href="https://justimyhxu.github.io">Yinghao Xu<sup>3</sup></a>
<a href="https://guoyww.github.io">Yuwei Guo<sup>1</sup></a>
<a href="https://web.stanford.edu/~gordonwz/">Gordon Wetzstein<sup>3</sup></a>
<a href="http://daibo.info">Bo Dai<sup>2</sup></a>
<a href="https://www.ee.cuhk.edu.hk/~hsli/">Hongsheng Li<sup>1</sup></a>
<a href="https://ceyuan.me">Ceyuan Yang<sup>2</sup></a>
</span>
<br>
<br>
<span style="font-size: 0.9em; margin-top: 0.6em">
<a><sup>1</sup>The Chinese University of Hong Kong</a>
<a><sup>2</sup>Shanghai Artificial Intelligence Laboratory</a>
<a><sup>3</sup>Stanford University</a>
</span>
</p>
<font size="+1">
<p style="text-align: center;" class="sansserif">
<a href="https://arxiv.org/abs/2404.02101" style="font-weight: bold;">[arXiv Report]</a>
<a href="https://github.com/hehao13/CameraCtrl" style="font-weight: bold;">[Code]</a>
<a href="#bibtex" style="font-weight: bold;">[BibTeX]</a>
<a href="https://huggingface.co/spaces/hehao13/CameraCtrl-svd" style="font-weight: bold;">[HF Demo]</a>
</p><br>
</font>
<div style="text-align:center;">
<img src="./CameraCtrl_files/teaser.png" width="100%" alt="teaser_figure">
</div>
</div>
<div class="content">
<p style="text-align:center; font-size: 2em; font-weight: bold" class="sansserif">Abstract</p>
<p style="font-size: 1.2em; margin-left:5em; margin-right:5em;" class="serif">Controllability plays a crucial role in video generation, as it allows users to create and edit content more precisely. Existing models, however, lack control over camera pose. To alleviate this issue, we introduce <code>CameraCtrl</code>, enabling accurate camera pose control for video diffusion models. Our approach explores effective camera trajectory parameterization along with a plug-and-play camera pose control module that is trained on top of a video diffusion model, leaving the other modules of the base model untouched. Moreover, a comprehensive study on the effect of various training datasets is conducted, suggesting that videos with diverse camera distributions and appearance similar to the base model indeed enhance controllability and generalization. Experimental results demonstrate the effectiveness of <code>CameraCtrl</code> in achieving precise camera control with different video generation models, marking a step forward in the pursuit of dynamic and customized video storytelling from textual and camera pose inputs.</p>
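<p style="font-size: 1.2em; margin-left:5em; margin-right:5em;" class="serif">The camera trajectory parameterization referred to above uses per-pixel Plücker embeddings: each camera pose becomes a 6-channel map of ray moments and ray directions. The snippet below is an illustrative NumPy sketch, not the released code; it assumes the world-to-camera convention x_cam = R x_world + t and a pinhole intrinsics matrix K.</p>

```python
import numpy as np

def plucker_embedding(K, R, t, H, W):
    """Per-pixel Plucker embedding (o x d, d) for a camera with
    intrinsics K and world-to-camera extrinsics R, t (x_cam = R @ x_world + t)."""
    o = -R.T @ t                                       # camera center in world coords
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)   # (H, W, 3) homogeneous pixels
    d = pix @ np.linalg.inv(K).T @ R                   # back-project, rotate to world frame
    d = d / np.linalg.norm(d, axis=-1, keepdims=True)  # unit ray directions
    m = np.cross(o, d)                                 # ray moments o x d
    return np.concatenate([m, d], axis=-1)             # (H, W, 6) Plucker map
```

<p style="font-size: 1.2em; margin-left:5em; margin-right:5em;" class="serif">One map per frame, stacked over time, gives the camera encoder a dense geometric description of the whole trajectory.</p>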
</div>
<div class="content">
<p style="text-align:center; font-size: 2em; font-weight: bold" class="sansserif">Demo Video</p>
<div style="text-align:center; margin-bottom:1em;">
<video class="clickplay" width="80%" controls>
<source src="./CameraCtrl_files/Demo_video.mp4" type="video/mp4">
</video>
</div>
</div>
<div class="content">
<p style="text-align:center; font-size: 2em; font-weight: bold" class="sansserif">Framework</p> <br>
<img src="./CameraCtrl_files/architecture.png" style="width:90%;" alt="architecture_figure" class="summary-img"> <br>
<p style="font-size: 1.2em; margin-left:5em; margin-right:5em;" class="serif"> <strong>Framework of <code>CameraCtrl</code>.</strong> (a) Given a pre-trained video diffusion model, <code>CameraCtrl</code> trains a camera encoder on top of it, which takes Plücker embeddings as input and outputs multi-scale camera representations. These features are then integrated into the temporal attention layers of the U-Net at their respective scales to control the video generation process. (b) Details of the camera injection process. The camera features and the latent features are first combined through element-wise addition. A learnable linear layer then further fuses the two representations, which are fed into the first temporal attention layer of each temporal block.</p>
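<p style="font-size: 1.2em; margin-left:5em; margin-right:5em;" class="serif">The injection step in (b) can be sketched in a few lines of PyTorch. This is an illustrative module, not the released implementation; the class and layer names are our own, and the tensor layout (batch, frames, tokens, channels) is an assumption.</p>

```python
import torch
import torch.nn as nn

class CameraInjector(nn.Module):
    """Sketch of the camera injection step: add camera features to latent
    features element-wise, then fuse them with a learnable linear layer
    before the first temporal attention layer of each temporal block."""
    def __init__(self, channels: int):
        super().__init__()
        self.fuse = nn.Linear(channels, channels)  # learnable fusion layer

    def forward(self, latent: torch.Tensor, camera: torch.Tensor) -> torch.Tensor:
        # latent, camera: (batch, frames, tokens, channels) at the same U-Net scale
        return self.fuse(latent + camera)
```

<p style="font-size: 1.2em; margin-left:5em; margin-right:5em;" class="serif">Because only this small module and the camera encoder are trained, the base model's weights stay untouched, which is what makes the control plug-and-play.</p>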
</div>
<div class="content">
<p style="text-align:center; font-size: 2em; font-weight: bold" class="sansserif">Visualization Results</p> <br>
<p style="font-size: 1.3em" class="serif"> <code>CameraCtrl</code> for general text-to-video generation</p> <br>
<div style="text-align:center; margin-bottom:1em;">
<video class="clickplay" width="90%" controls>
<source src="./CameraCtrl_files/generat_t2v_object.mp4" type="video/mp4">
</video>
</div>
<div style="text-align:center; margin-bottom:1em;">
<video class="clickplay" width="90%" controls>
<source src="./CameraCtrl_files/general_t2v_scene.mp4" type="video/mp4">
</video>
</div> <br>
<p style="font-size: 1.3em" class="serif"> Same text prompt + Different camera trajectories</p>
<div style="text-align:center; margin-bottom:1em;">
<video class="clickplay" width="90%" controls>
<source src="./CameraCtrl_files/different_traj_same_prompt.mp4" type="video/mp4">
</video>
</div> <br>
<p style="font-size: 1.3em" class="serif"> <code>CameraCtrl</code> for personalized text-to-video generation</p> <br>
<div style="text-align:center; margin-bottom:1em;">
<video class="clickplay" width="90%" controls>
<source src="./CameraCtrl_files/realistic_vision.mp4" type="video/mp4">
</video>
</div>
<div style="text-align:center; margin-bottom:1em;">
<video class="clickplay" width="90%" controls>
<source src="./CameraCtrl_files/toonyou.mp4" type="video/mp4">
</video>
</div> <br>
<p style="font-size: 1.3em" class="serif"> <code>CameraCtrl</code> for image-to-video generation</p>
<div style="text-align:center; margin-bottom:1em;">
<video class="clickplay" width="90%" controls>
<source src="./CameraCtrl_files/i2v_object.mp4" type="video/mp4">
</video>
</div>
<div style="text-align:center; margin-bottom:1em;">
<video class="clickplay" width="90%" controls>
<source src="./CameraCtrl_files/i2v_scene.mp4" type="video/mp4">
</video>
</div> <br>
<p style="font-size: 1.3em" class="serif"> Integrating <code>CameraCtrl</code> with other video control methods</p> <br>
<div style="text-align:center; margin-bottom:1em;">
<video class="clickplay" width="96%" controls>
<source src="./CameraCtrl_files/integrate_with_others.mp4" type="video/mp4">
</video>
</div> <br>
</div>
<div class="content" id="bibtex">
<p style="text-align:left; font-size: 2em; font-weight: bold" class="serif">BibTeX</p>
<code>
@misc{he2024cameractrl,<br>
title={CameraCtrl: Enabling Camera Control for Text-to-Video Generation},<br>
author={Hao He and Yinghao Xu and Yuwei Guo and Gordon Wetzstein and Bo Dai and Hongsheng Li and Ceyuan Yang},<br>
year={2024},<br>
eprint={2404.02101},<br>
archivePrefix={arXiv},<br>
primaryClass={cs.CV}<br>
}
</code>
</div>
<div class="content">
<p class="serif">
We borrow the source code of this project page from <a href="https://dreambooth.github.io/">DreamBooth</a>.
</p>
</div>
</body>
</html>