Generating Long Videos of Dynamic Scenes
Abstract
We present a video generation model that accurately reproduces object motion, changes in camera viewpoint, and new content that arises over time.
Existing video generation methods often fail to produce new content as a function of time while maintaining consistencies expected in real environments, such as plausible dynamics and object persistence.
A common failure mode is content that never changes, because the model over-relies on inductive biases to provide temporal consistency, such as a single latent code that dictates content for the entire video.
At the other extreme, without long-term consistency, generated videos may morph unrealistically between different scenes.
To address these limitations, we prioritize the time axis by redesigning the temporal latent representation and learning long-term consistency from data by training on longer videos.
To this end, we leverage a two-phase training strategy, where we separately train using longer videos at a low resolution and shorter videos at a high resolution.
To evaluate the capabilities of our model, we introduce two new benchmark datasets with explicit focus on long-term temporal dynamics.
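The two-phase strategy described above can be illustrated with a toy sketch. The clip sizes below (128 frames at 64x64 for the low-resolution phase, 8 frames at 256x256 for the super-resolution phase) and the helper names are illustrative assumptions, not values taken from the paper; the point is that trading frames for resolution keeps the per-clip budget comparable across the two phases:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_clip(num_frames, resolution):
    # Toy stand-in for a real training clip: (frames, height, width).
    return rng.standard_normal((num_frames, resolution, resolution))

def pixel_budget(num_frames, resolution):
    # Rough memory proxy: total pixels processed per clip.
    return num_frames * resolution * resolution

# Phase 1: long clips at low resolution, so the model can observe
# long-term dynamics (new content, camera motion) at affordable cost.
long_low = sample_clip(128, 64)

# Phase 2: short clips at high resolution, so a separate network can
# learn fine spatial detail from only a few frames at a time.
short_high = sample_clip(8, 256)

# Both phases land on the same per-clip budget, which is the point
# of splitting training along the time and resolution axes.
print(pixel_budget(128, 64), pixel_budget(8, 256))
# → 524288 524288
```

Note that halving the spatial resolution by 4x in each dimension frees a 16x longer temporal window at equal cost, which is what lets the low-resolution phase learn consistency over much longer horizons.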
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Cite as: arXiv:2206.03429 [cs.CV] (or arXiv:2206.03429v2 [cs.CV] for this version)
Code, datasets, and pretrained models are available.