PROGRAM 1

Visual Intelligence

RP1-3

Interpretable models for movie understanding

3D-aware Image Synthesis via Learning Structural and Texture Representations

Learning Hierarchical Cross-Modal Association for Co-speech Gesture Generation

Instance Localization for Self-supervised Detection Pretraining

BlockPlanner: City Block Generation with Vectorized Graph Representation

Movies store a large amount of multimodal knowledge. They are a rich source of common-sense knowledge about the world: about actions and their effects, about people’s behaviors and emotions, and about stories. Movies provide rich visual content spanning long periods of time and tell full stories with complex interactions, emotions, and events. Our goal is to produce a system (algorithm + database) that takes an image or video as input, describes the semantic content of the scene and what the people in it are doing, and understands the situation depicted. To achieve this goal, we propose to build a knowledge database covering a wide variety of situations and to train a system to parse scenes and produce detailed descriptions of their content.

Aim 1:

Network dissection and understanding the internal representation learned by dynamic neural networks. Network dissection is a family of methods for characterizing the internal representation that a neural network learns while solving a task. Characterizing this representation opens the door to new approaches for unsupervised object discovery and for unsupervised learning of common-sense knowledge.
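
As a rough illustration of what one dissection-style measurement can look like, the sketch below scores how well a single unit aligns with a labeled visual concept. It assumes one common formulation, which the text above does not specify: the unit's activation maps are thresholded at a high quantile, and the resulting binary regions are compared with concept segmentation masks using intersection over union (IoU). The function names, the `quantile` parameter, and the list-of-lists data layout are all hypothetical choices made for this example.

```python
from typing import Dict, List, Sequence, Tuple

Grid = List[List[float]]  # one 2D activation map for a single unit
Mask = List[List[bool]]   # one 2D binary mask for a single concept


def quantile_threshold(maps: Sequence[Grid], quantile: float) -> float:
    """Return the activation value found at the given quantile over all positions."""
    values = sorted(v for grid in maps for row in grid for v in row)
    if not values:
        raise ValueError("no activations given")
    index = min(int(quantile * len(values)), len(values) - 1)
    return values[index]


def unit_concept_iou(maps: Sequence[Grid], masks: Sequence[Mask],
                     quantile: float = 0.75) -> float:
    """IoU between the unit's strongly activated positions and a concept's masks.

    Assumption: activation maps and masks have already been resized to the
    same spatial resolution, one pair per image.
    """
    threshold = quantile_threshold(maps, quantile)
    intersection = 0
    union = 0
    for grid, mask in zip(maps, masks):
        for act_row, mask_row in zip(grid, mask):
            for act, labeled in zip(act_row, mask_row):
                active = act >= threshold
                intersection += int(active and labeled)
                union += int(active or labeled)
    return intersection / union if union else 0.0


def best_concept(maps: Sequence[Grid], concept_masks: Dict[str, Sequence[Mask]],
                 quantile: float = 0.75) -> Tuple[str, float]:
    """Return the concept whose masks overlap most with the unit, plus its IoU."""
    scores = {name: unit_concept_iou(maps, masks, quantile)
              for name, masks in concept_masks.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]
```

In this sketch, a unit whose best score is high for some concept would be read as a detector for that concept, even though the network was never trained on that concept's labels. That is the sense in which such measurements can support unsupervised object discovery.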

Aim 2:

Understanding movies. Understanding a movie requires analyzing the video at different time scales and reasoning about different types of events. Following the gaze of people in videos provides an important signal for understanding people and their actions. In this project, we present an approach for following gaze in video: it predicts where a person in the video is looking, even when the object being looked at appears in a different frame. The system can then be applied to a variety of tasks, including movie understanding, activity recognition, and social interaction prediction.