Title: Video Event Recognition and Prediction Based on Temporal Structure Analysis
Speaker: Kang Li
Date: Thursday, September 4th
Time: 9:30 AM - 11:30 AM
Location: 442 Dana
Professor Yun Fu (Advisor)
Professor Jennifer G. Dy
Professor Yizhou Sun
The growing ubiquity of multimedia information in today's world has made video a favored information vehicle and given rise to an astonishing volume of social media and surveillance footage. Consumer-grade video is abundant on the Internet, and it is now easier than ever to download multimedia material of any kind and quality. This raises a series of technological demands for automatic video understanding, which has motivated the research community to work toward such capabilities. As a result, current trends in cognitive vision promise to recognize complex events and self-adapt to different environments, while managing and integrating several types of knowledge.
One important problem that would significantly enhance semantic-level video analysis is activity and event understanding, which aims to accurately describe video content using key semantic elements, such as activities and events. A well-known challenge is the long-standing semantic gap between computable low-level features and the semantic information they encode. In this thesis, several studies of high-level video content understanding are presented that address these difficulties and effectively narrow the semantic gap. In particular, we focus on two types of videos: human activity video and unconstrained consumer video. The proposed temporal structure analysis frameworks significantly extend the domains of video that can be understood by machine vision systems.
In the area of human activity recognition, we observe that when a time-critical decision is needed, no existing work utilizes the temporal structure of videos for early prediction of ongoing human activity. We therefore present a general activity prediction framework in which human activities are characterized by a complex temporal composition of constituent simple actions and interacting objects. We then extend this work to 3D action prediction, motivated by the recent advent of cost-effective sensors such as the Kinect depth camera. By treating 3D action data as a multivariate time series (m.t.s.) synchronized to a shared common clock (frames), we propose a stochastic-process model, a Marked Point Process (MPP), that represents a 3D action as temporal dynamic patterns, capturing both timing and strength information.
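To make the marked-point idea concrete, the following is a minimal toy sketch, not the thesis's actual MPP model: one channel of a multivariate time series (say, one joint coordinate from depth-sensor skeleton data) is reduced to marked points, each a (frame, strength) pair taken at a significant local extremum. The function name, threshold, and extremum criterion are illustrative assumptions.

```python
import numpy as np

def extract_marked_points(series, threshold=0.5):
    """Toy sketch: reduce one channel of a multivariate time series
    (e.g., one joint coordinate of 3D skeleton data) to marked points
    (frame index, strength) at local extrema whose magnitude exceeds a
    threshold. Illustrative only; the thesis's MPP model is richer."""
    points = []
    for t in range(1, len(series) - 1):
        is_peak = series[t] > series[t - 1] and series[t] > series[t + 1]
        is_valley = series[t] < series[t - 1] and series[t] < series[t + 1]
        if (is_peak or is_valley) and abs(series[t]) >= threshold:
            points.append((t, float(series[t])))  # (timing, mark/strength)
    return points

# Synthetic single-channel signal with two strong swings.
signal = np.array([0.0, 0.2, 0.9, 0.3, -0.1, -0.8, -0.2, 0.1])
print(extract_marked_points(signal))  # [(2, 0.9), (5, -0.8)]
```

Applying such an extractor per channel yields a point pattern over the shared frame clock, which is the kind of timing-plus-strength representation a marked point process can model.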
In the area of unconstrained consumer video understanding, we likewise focus on the temporal structure of video content through a semantic-segment-based design, in which each video clip is represented as a series of varying videography words. Unique videography signatures of different events can then be automatically identified using statistical analysis methods. We explore the use of videography analysis in several applications, including content-based video retrieval, video summarization (both visual and textual), and videography-based feature pooling.
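As a rough illustration of the videography-word representation, and under the assumption of a small hypothetical vocabulary (the thesis's actual codebook and statistical analysis are more involved), a clip's sequence of videography-word IDs can be aggregated into a normalized histogram that serves as an event signature:

```python
from collections import Counter

def videography_signature(word_sequence, vocab_size):
    """Toy sketch: aggregate a clip's sequence of videography-word IDs
    (e.g., per-segment camera-motion/shot-type labels) into a normalized
    histogram usable as an event signature. The vocabulary here is a
    hypothetical stand-in for the thesis's learned codebook."""
    counts = Counter(word_sequence)
    total = len(word_sequence)
    return [counts.get(w, 0) / total for w in range(vocab_size)]

# Two clips described as word-ID sequences over a 4-word vocabulary.
clip_a = [0, 0, 1, 2, 0]
clip_b = [3, 3, 2, 3, 1]
print(videography_signature(clip_a, 4))  # [0.6, 0.2, 0.2, 0.0]
print(videography_signature(clip_b, 4))  # [0.0, 0.2, 0.2, 0.6]
```

Signatures of this form can then be compared with any standard histogram distance for retrieval, or used as pooled features downstream.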