Learning Camera Movement Control from Real-World Drone Videos

Yunzhong Hou1, Liang Zheng1, Philip Torr2
1Australian National University, 2University of Oxford

"To record as is, not to create from scratch."


Abstract

This study seeks to automate camera movement control for filming existing subjects into attractive videos, in contrast to creating non-existent content by directly generating pixels. We select drone videos as our test case due to their rich and challenging motion patterns, distinctive viewing angles, and precise controls. Existing AI videography methods struggle with limited appearance diversity in simulation training, the high cost of recording expert operations, and the difficulty of designing heuristic-based goals that cover all scenarios. To avoid these issues, we propose a scalable method that involves collecting real-world training data to improve diversity, extracting camera trajectories automatically to minimize annotation costs, and training an effective architecture that does not rely on heuristics. Specifically, we collect 99k high-quality trajectories by running 3D reconstruction on online videos, connecting camera poses from consecutive frames to form 3D camera paths, and using a Kalman filter to identify and remove low-quality data. Moreover, we introduce DVGFormer, an auto-regressive transformer that leverages the camera path and images from all past frames to predict camera movement in the next frame. We evaluate our system across 38 synthetic natural scenes and 7 real city 3D scans. We show that our system effectively learns to perform challenging camera movements such as navigating through obstacles, maintaining low altitude to increase perceived speed, and orbiting towers and buildings, all of which are very useful for recording high-quality videos.

Data collection pipeline

Top left: For scraped YouTube videos, we run shot change detection to split the videos into clips of individual scenes. Top right: We then use COLMAP to reconstruct the 3D scene and recover camera poses from video frames. Bottom: Finally, we connect camera poses from consecutive frames to form 3D camera trajectories and apply a Kalman filter to discard low-quality reconstructions whose camera poses in neighboring frames are drastically different.
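To make the quality check concrete, here is a minimal Python sketch of the idea, assuming a constant-velocity Kalman filter over the reconstructed camera positions: a clip is flagged when a measured position jumps implausibly far from the filter's prediction. The function name, noise parameters, and threshold below are illustrative assumptions, not the paper's exact implementation.

import numpy as np

def is_low_quality(positions, process_var=1e-2, meas_var=1e-1, thresh=3.0):
    """Flag a trajectory if any camera position deviates from the
    constant-velocity Kalman prediction by more than `thresh` standard
    deviations (Mahalanobis distance). `positions` is a (T, 3) array."""
    dim = positions.shape[1]
    # State [position, velocity] with a constant-velocity transition model.
    F = np.eye(2 * dim)
    F[:dim, dim:] = np.eye(dim)                         # x' = x + v
    H = np.hstack([np.eye(dim), np.zeros((dim, dim))])  # observe position only
    Q = process_var * np.eye(2 * dim)
    R = meas_var * np.eye(dim)

    x = np.concatenate([positions[0], np.zeros(dim)])   # initial state
    P = np.eye(2 * dim)
    for z in positions[1:]:
        # Predict the next state.
        x = F @ x
        P = F @ P @ F.T + Q
        # Innovation: discrepancy between measurement and prediction.
        y = z - H @ x
        S = H @ P @ H.T + R
        if np.sqrt(y @ np.linalg.solve(S, y)) > thresh:
            return True                                 # implausible jump
        # Update with the accepted measurement.
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ y
        P = (np.eye(2 * dim) - K @ H) @ P
    return False

A clip flagged this way typically corresponds to a reconstruction failure where camera poses in neighboring frames disagree, matching the criterion described above.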

Model overview of DVGFormer

To predict the camera motion a_t for time step t, the auto-regressive architecture takes as input a long horizon of camera poses {c_0, ..., c_t}, actions {a_0, ..., a_{t-1}}, images {x_0, ..., x_t}, and their monocular depth estimations from all past frames. Each action a_t is broken into N intermediate steps {a_t^0, ..., a_t^{N-1}} between time steps t and t+1.
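The prediction loop at inference time can be summarized with the short sketch below. Everything here other than the paper's symbols is an assumed placeholder (the model call signature and the helpers render, estimate_depth, and integrate_pose); it only illustrates how the history grows and how the N intermediate actions are integrated into the next camera pose, not DVGFormer's actual interface.

import torch

@torch.no_grad()
def rollout(model, env, num_steps, n_substeps):
    # History buffers for camera poses c_t, actions a_t, images x_t, depths.
    poses = [env.initial_pose()]
    actions, images, depths = [], [], []
    for t in range(num_steps):
        x_t = env.render(poses[-1])        # image x_t from the current pose
        d_t = env.estimate_depth(x_t)      # monocular depth estimation
        images.append(x_t)
        depths.append(d_t)
        # Attend to the full history {c_0..c_t}, {a_0..a_{t-1}}, {x_0..x_t}
        # and predict the next action chunk {a_t^0, ..., a_t^{N-1}}.
        a_t = model(poses, actions, images, depths)  # (n_substeps, action_dim)
        actions.append(a_t)
        # Integrate the N intermediate steps to obtain pose c_{t+1}.
        pose = poses[-1]
        for k in range(n_substeps):
            pose = env.integrate_pose(pose, a_t[k])
        poses.append(pose)
    return poses, actions

Predicting a chunk of N sub-steps per frame lets the controller act at a finer temporal resolution than the image frame rate.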

More results

BibTeX

@article{hou2024dvgformer,
  author    = {Hou, Yunzhong and Zheng, Liang and Torr, Philip},
  title     = {Learning Camera Movement Control from Real-World Drone Videos},
  journal   = {arXiv preprint},
  year      = {2024},
}