Objectives and structure of the thesis:
The main objective of this thesis is to design specialized classifier ensembles that address class imbalance when recognizing faces from video in face re-identification applications. Such a system must avoid biasing its performance towards the correct classification of the negative class, and must classify faces from videos accurately in the presence of imbalance. Its design must also overcome the limitations of existing methods for imbalanced data classification, both in general and in face re-identification. First, the literature is reviewed in Chapter 1 for (1) face re-identification methods, (2) imbalanced data classification approaches, and (3) performance metrics for evaluating classification systems under imbalance. Chapter 2 then presents the experimental methodology used in this thesis. In Chapter 3, a sample selection technique called Trajectory Under-Sampling (TUS) is proposed that is specialized for face re-identification and is used to design ensembles of classifiers trained on imbalanced data. This sampling technique exploits the fact that natural sub-clusters exist in this application and can be found using tracking information.
In fact, this sampling method can be used in any application where natural sub-clusters exist and a one-versus-all learning strategy is used. Two ensembles are proposed using TUS, and TUS appears to be more effective than the general-purpose sampling methods in the literature for this application because it improves the diversity-accuracy trade-off between the classifiers of the ensemble. In the proposed ensembles, the classifiers are trained on data subsets with different imbalance levels, and their combination is optimized based on their performance in terms of the F-measure. This ensemble method is then extended by further under-sampling the trajectories to their support vectors, with the intention of decreasing the performance bias of classifiers trained on imbalanced subsets and of retaining only the most informative samples in the data. The content of Chapter 3 is an extended version of a conference paper presented at ICPRAM 2016. In the next step (Chapter 4), a new Boosting ensemble method called Progressive Boosting (PBoost) is proposed. In this Boosting algorithm, the skew level of the data used for designing the classifiers in the ensemble increases progressively.
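To make the idea concrete, the following is a minimal sketch of trajectory-based under-sampling, assuming the negative samples are grouped by a trajectory identifier obtained from the tracker; the function names, variable names, and imbalance levels are illustrative placeholders, not the implementation from Chapter 3.

```python
import numpy as np

def trajectory_under_sample(X_neg, traj_ids, n_trajectories, rng=None):
    """Keep the negative samples belonging to `n_trajectories` randomly chosen trajectories."""
    rng = np.random.default_rng(rng)
    unique_ids = np.unique(traj_ids)
    chosen = rng.choice(unique_ids, size=n_trajectories, replace=False)
    return X_neg[np.isin(traj_ids, chosen)]

def build_tus_subsets(X_pos, X_neg, traj_ids, levels=(1, 2, 4, 8)):
    """Build one training subset per imbalance level: all positives plus k negative trajectories."""
    subsets = []
    for k in levels:
        X_k = trajectory_under_sample(X_neg, traj_ids, n_trajectories=k)
        X = np.vstack([X_pos, X_k])
        y = np.concatenate([np.ones(len(X_pos)), np.zeros(len(X_k))])
        subsets.append((X, y))
    return subsets
```

Each such subset would then train one member of the ensemble, and the members are combined with weights chosen according to their performance in terms of the F-measure, as described above.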
In contrast to the ensembles proposed in Chapter 3, the classifiers in the ensemble proposed in this chapter are not trained on imbalanced subsets of data; they are only validated on imbalanced subsets. Instead, the classifiers are trained on balanced subsets, since training on balanced data increases the accuracy of each individual classifier in the ensemble, which can lead to better overall ensemble performance. These subsets are drawn from a growing imbalanced set, and the selection depends on the importance of samples according to weights assigned through the proposed Boosting learning strategy. The classifiers in this ensemble are validated on this growing set, which makes the ensemble more robust to varying levels of imbalance. The weights of the samples and of the classifiers are updated from one iteration to the next based on performance in terms of the F-measure. To this end, the loss function of the ensemble is modified using the F-measure, and each classifier is assigned a weight according to its F-measure. As with the ensemble methods proposed in Chapter 3, the proposed TUS appears to be more effective for this application than the general-purpose sampling methods in the literature. However, the proposed PBoost algorithm can be used in any application: when no application-specific under-sampling method is available, it works with general-purpose sampling methods such as random under-sampling and cluster under-sampling.
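The sketch below illustrates the kind of progressive loop described above, assuming binary labels with the target (minority) class labelled 1. The balanced-resampling rule, the exponential weight update, and the decision-tree base learner are placeholders; they are not the exact PBoost formulation of Chapter 4, which modifies the Boosting loss function itself with the F-measure, F = 2 * precision * recall / (precision + recall).

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.tree import DecisionTreeClassifier

def pboost(X_pos, neg_batches, seed=0):
    """X_pos: minority-class samples; neg_batches: list of negative-sample arrays,
    one block appended to the validation set per Boosting iteration."""
    rng = np.random.default_rng(seed)
    X_val = np.asarray(X_pos, dtype=float)           # validation set grows in skew
    y_val = np.ones(len(X_val))
    w = np.ones(len(X_val)) / len(X_val)             # sample weights over the validation set
    ensemble, alphas = [], []
    for X_neg_t in neg_batches:
        # grow the imbalanced validation set with the next block of negatives
        X_val = np.vstack([X_val, X_neg_t])
        y_val = np.concatenate([y_val, np.zeros(len(X_neg_t))])
        w = np.concatenate([w, np.full(len(X_neg_t), 1.0 / len(X_neg_t))])
        w /= w.sum()
        # draw a balanced training subset, biased towards high-weight negatives
        pos = np.flatnonzero(y_val == 1)
        neg = np.flatnonzero(y_val == 0)
        neg_sel = rng.choice(neg, size=len(pos), replace=len(neg) < len(pos),
                             p=w[neg] / w[neg].sum())
        idx = np.concatenate([pos, neg_sel])
        clf = DecisionTreeClassifier(max_depth=3).fit(X_val[idx], y_val[idx])
        # weight the classifier by its F-measure on the growing imbalanced set
        alpha = max(f1_score(y_val, clf.predict(X_val)), 1e-6)
        ensemble.append(clf)
        alphas.append(alpha)
        # increase the weight of misclassified validation samples
        miss = clf.predict(X_val) != y_val
        w = w * np.exp(alpha * miss)
        w /= w.sum()
    return ensemble, alphas
```

The key points the sketch is meant to convey are that each classifier is trained on a balanced subset, validated on the progressively more skewed set, and weighted by its F-measure on that set.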
Face Recognition from Video Surveillance
One of the applications of biometrics, and of face biometrics in particular, is in video surveillance systems. A video surveillance system consists of a number of video cameras used at airports, banks, department stores, etc., for security purposes. In traditional video surveillance systems, multiple video feeds are displayed for human operators who must detect specific people or behaviours. Human operators perform poorly in this setting, because a person or behaviour of interest may appear only rarely in the scene, and the amount of time spent watching the videos is itself a problem. Newer video surveillance systems improve on this by using automatic face recognition algorithms to assist or replace human operators. Face recognition can be applied in many video surveillance settings. Still-to-video face recognition is used in watch-list screening applications: a high-quality image of a person is captured, and a model of the face is created and stored in the gallery during enrolment. During the deployment (or operational) stage, faces are captured from the video streams, and the model of each detected face is compared to the models in the gallery. In video-to-video face recognition (face re-identification), the reference models are instead created from face captures taken from video streams.
In the more general form of "person re-identification", the face and/or the overall appearance of individuals can be used to recognize them. Video possesses two main characteristics that make it more practical for recognition than still images:

– A video of a person contains multiple images of that person, and may be captured under different uncontrolled conditions in the operational environment. It therefore carries different types of information than still images; for example, the different pose angles of a person's face in a video can provide additional information about the face.

– Video affords temporal information. For example, the trajectory of facial feature points across consecutive frames differs from one person to another (Zhao et al., 2003).

A typical face recognition system for video surveillance that uses both spatial and temporal information from video comprises five major modules: face segmentation (detection), feature extraction, classification, decision, and tracking. Figure 1.1 presents a general block diagram for this application.
The segmentation (face-head detector) module initiates a new track when a face appears in the scene and extracts the region of interest (ROI) of the face (i.e., the bounding box around the face). The tracker uses local object detection to locate the face in each video frame based on information from previous frames. The tracker must therefore periodically interact with the detector (segmentation module) to initiate new tracks, or to validate and/or update the object template with new detector bounding boxes (Kiran et al., 2019) (see Breitenstein et al. (2009); Smeulders et al. (2014); Huang et al. (2019) for more on visual tracking and tracking-by-detection). Features are then extracted from the ROIs by the feature extraction module and passed to the classification module.
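As a rough illustration of how these modules interact, the skeleton below shows one possible per-frame processing loop; the detector, tracker, extractor, classifier, and decision interfaces are hypothetical placeholders used only to make the data flow of Figure 1.1 explicit, not an actual implementation.

```python
# Hypothetical per-frame loop illustrating the data flow of Figure 1.1.
def process_frame(frame, detector, tracker, extractor, classifiers, decision):
    detections = detector.detect(frame)          # segmentation: face ROIs (bounding boxes)
    tracks = tracker.update(frame, detections)   # tracking: {track_id: current ROI}
    results = {}
    for track_id, roi in tracks.items():
        features = extractor.extract(frame, roi)                   # feature extraction per ROI
        scores = {name: clf.score(features)                        # one classifier per individual
                  for name, clf in classifiers.items()}
        results[track_id] = decision.accumulate(track_id, scores)  # fuse scores over the trajectory
    return results
```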