Human action recognition in monocular RGB-D video sequences
Generative Adversarial Networks (GANs)
In recent years, Generative Adversarial Networks (GANs) have gained considerable popularity in computer vision. GAN-based approaches have shown strong results in image synthesis (Reed et al., 2016), image super-resolution (Ledig et al., 2017), image-to-image translation (Isola et al., 2017), and other tasks. In this section, we briefly review the mathematical model behind the GAN framework and its training procedure.

A GAN consists of two components (see FIGURE 2.7): a generator G and a discriminator D. Given an input noise vector z sampled from a normal distribution $p_z(z)$, the generator G is trained to produce an image $x = G(z)$ that is indistinguishable from samples of the real data distribution $p_{\mathrm{data}}(x)$. The generated image is fed into the discriminator D alongside a stream of images drawn from the real distribution; D is trained to estimate the probability that a given sample comes from the real distribution rather than from G. To this end, the discriminator's decisions over real data should be accurate, which is achieved by maximizing $\mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)]$. Meanwhile, given a fake sample $G(z)$ with $z \sim p_z(z)$, the discriminator is expected to output a probability $D(G(z))$ close to zero, i.e. to maximize $\mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$. The generator, on the other hand, is trained to increase the chance that D assigns a high probability to a fake example, i.e. to minimize $\mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$. Combining both objectives, D and G play a minimax game over the following loss function $L(D, G)$:
\[
\min_G \max_D \; L(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]. \tag{2.13}
\]
In practice, both G and D are neural networks. The loss function $L(D, G)$ in equation (2.13) can be optimized with gradient-based methods, since both G and D are differentiable with respect to their inputs and parameters. Radford, Metz, and Chintala, 2015 introduced a family of architectures called Deep Convolutional GANs (DCGANs) that make GAN training more stable, and showed that GANs can learn good image representations for supervised learning and generative modeling. In Chapter 3, we will examine the potential of GANs for analyzing actions in videos.
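To make the training procedure concrete, the sketch below shows one alternating update of the minimax game in equation (2.13), written here in PyTorch as an illustrative assumption (any automatic-differentiation framework would do). The two small MLP networks, the latent and data dimensions, and the learning rates are hypothetical choices for illustration only and are not the DCGAN architecture mentioned above; as is common in practice, the generator step uses the non-saturating surrogate $-\log D(G(z))$ instead of literally minimizing $\log(1 - D(G(z)))$.

```python
import torch
import torch.nn as nn

# Illustrative dimensions and architectures (assumptions, not from the text).
latent_dim, data_dim = 64, 784  # e.g. a flattened 28x28 image

G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                  nn.Linear(256, data_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(data_dim, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real_batch):
    batch_size = real_batch.size(0)
    ones = torch.ones(batch_size, 1)   # label "real"
    zeros = torch.zeros(batch_size, 1) # label "fake"

    # Discriminator step: maximize log D(x) + log(1 - D(G(z))),
    # i.e. minimize the binary cross-entropy with real/fake labels.
    z = torch.randn(batch_size, latent_dim)        # z ~ p_z(z)
    fake_batch = G(z).detach()                     # block gradients into G
    d_loss = bce(D(real_batch), ones) + bce(D(fake_batch), zeros)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: non-saturating loss -log D(G(z)), the practical
    # equivalent of minimizing log(1 - D(G(z))).
    z = torch.randn(batch_size, latent_dim)
    g_loss = bce(D(G(z)), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```

In a full training run, `train_step` would be called repeatedly on mini-batches drawn from $p_{\mathrm{data}}(x)$, alternating the discriminator and generator updates as above.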
Related reviews and public datasets
Previous reviews
We first consider earlier reviews on video-based human action recognition. Looking at the major conferences and journals in computer vision and image processing, several early surveys have been published (Aggarwal and Cai, 1999; Moeslund and Granum, 2001; Wang, Hu, and Tan, 2003; Moeslund, Hilton, and Krüger, 2006; Turaga et al., 2008). For instance, Aggarwal and Cai, 1999 reviewed methods for human motion analysis, focusing on three major areas: motion analysis, tracking a moving human from a single view or multiple cameras, and recognizing human actions from image sequences. Moeslund and Granum, 2001 reviewed approaches to human motion capture. They described motion analysis systems as a hierarchical process with four steps: initialization, tracking, pose estimation, and recognition, and organized the literature according to this taxonomy. Wang, Hu, and Tan, 2003 presented an overview of human motion analysis in which motion analysis was described as a three-level process comprising human detection, human tracking, and behavior understanding. Moeslund, Hilton, and Krüger, 2006 surveyed work on human motion capture and analysis, centered on initialization, tracking, pose estimation, and recognition. Turaga et al., 2008 reviewed the major approaches for recognizing human actions and activities; they characterized “actions” as simple motion patterns, typically executed by a single person, whereas “activities” are more complex and involve coordinated actions among a small number of humans.

Many reviews of human action recognition approaches have been published since 2010 (e.g. Poppe, 2010; Weinland, Ronfard, and Boyer, 2011; Popoola and Wang, 2012; Ke et al., 2013; Aggarwal and Xia, 2014; Guo and Lai, 2014). For instance, Poppe, 2010 focused on image representation and action classification methods, and a similar survey by Weinland, Ronfard, and Boyer, 2011 also concentrated on approaches for action representation and classification. Popoola and Wang, 2012 presented a survey focusing on contextual abnormal human behavior detection for surveillance applications. Ke et al., 2013 reviewed human action recognition methods for both static and moving cameras, covering problems such as feature extraction, representation techniques, action detection, and classification. Aggarwal and Xia, 2014 reviewed human action recognition based on 3D data, especially RGB and depth information acquired by 3D sensors, while Guo and Lai, 2014 surveyed existing approaches to still-image-based action recognition.

More recently, Cheng et al., 2015 reviewed human action recognition approaches, classifying all methodologies into two categories: single-layered approaches and hierarchical approaches. Vrigkas, Nikou, and Kakadiaris, 2015 grouped human action recognition methods into two main categories, “unimodal” and “multimodal”, and then reviewed action classification methods for each. The work of Subetha and Chitrakala, 2016 mainly focused on human action recognition and human-object interaction methods. Presti and La Cascia, 2016 reviewed human action recognition based on 3D skeletons, summarizing the main technologies, both hardware and software, for classifying actions from skeletal data. Finally, Kang and Wildes, 2016 summarized various action recognition and detection algorithms, focusing on the encoding and classification of motion features.
Benchmark datasets for human action recognition in videos
With the growing study of human action recognition methods, many benchmark datasets have been recorded and published. Much of the progress in human action recognition has been demonstrated on these standard benchmarks, which allow researchers to develop, evaluate, and compare new approaches for recognizing human actions in videos. In this section, we summarize the most important benchmark datasets, ranging from early datasets containing simple actions acquired in controlled environments, e.g. KTH (Schuldt, Laptev, and Caputo, 2004), Weizmann (Gorelick et al., 2007) or IXMAS (Weinland, Ronfard, and Boyer, 2006), to recent large-scale benchmarks providing complex actions and human behaviors from real-world scenarios, e.g. Sports-1M (Karpathy et al., 2014), with over a million video samples, and NTU-RGB+D (Shahroudy et al., 2016). TABLE 3.2 lists these datasets and their main characteristics.