Humans invariably perceive the world multimodally. From a piece of cloth to a car, from a conversation to a concert, each object and event around us is discovered through the perceptual senses of vision, sound, touch, smell and taste. Among our many innate abilities is the ability to associate and combine information from these different modalities (or senses) to understand and interact with our surroundings (Smith and Gasser, 2005). With regard to audition and vision in particular, humans are adept at meaningfully integrating the auditory and visual characteristics of many real-world objects, events and actions. Among other things, we rely on our audio-visual (AV) sensory systems to identify and locate vehicles in our vicinity, to follow what a neighbour says amid background noise, and to recognize occluded events from their sound or inaudible objects from their appearance. It is noteworthy how easily we carry out these and many other related tasks. Taking inspiration from this remarkable human capability, our objective in this thesis is to design algorithms that enable machines to describe and extract such objects and events from videos using joint AV analysis.
Robust scene understanding, as in our previous example, is achieved through both the similarities and the differences between what the audio and visual signals capture. Similarity allows one signal to be enhanced using the other, while complementarity allows the available modality to compensate for the other when it is degraded. To illustrate this, note that for certain sources temporal variations in sound are also reflected in visual motion. Indeed, humans subconsciously use lip movements to “enhance” what someone says in noisy environments (Krishnan et al., 2014). On the other hand, object or event identification through audio (or vision) is not hindered by noise or changes in the other modality. These are interesting phenomena that we wish to utilize and highlight in this thesis.
Research on this topic is not only intellectually interesting but also practically relevant. The ubiquity of AV data opens up several application areas where jointly analyzing audio and visual signals could help humans and machines alike. A few of them are listed below:
• Joint processing can be utilized for film post-production and, more generally, video content creation and manipulation in the entertainment industry. Specifically, audio-visual speech analysis could aid the painstaking processes of dubbing, mixing or automated dialogue replacement (ADR). The latter involves re-recording dialogue in a studio after filming due to issues such as synchronization, dramatic effect, line corrections, etc. (Woodhall, 2011). More recently, such analysis has found an interesting application in virtual character animation (Zhou et al., 2018; Shlizerman et al., 2018).
• The explosion of video content on the Internet is now common knowledge. This research will find significant use in structuring and indexing such data for efficient storage and better content-based retrieval.
• Surveillance cameras that capture both audio and visual signals are becoming increasingly popular. However, these recordings are often of low quality (Ramakrishnan, 2018). This presents a perfect use case for joint processing to improve speech intelligibility, detect anomalous events such as gunshots, or enhance the recorded signals. The application to biometrics is also well known (Bredin and Chollet, 2007).
• Assisting hearing- or sight-impaired people by using joint models to automatically synthesize video from audio and vice versa.
• Such research should find immediate applications in assisting unimodal approaches to important tasks such as video object segmentation and audio source separation.
• Translating multimodal sensory capabilities to algorithms will be a boon to robotics. Robust AV analysis would support not only machine understanding of the surroundings but also interaction with them and with people (Karras et al., 2017).
Promising research directions often come with equally daunting challenges, and the topic under consideration is no exception. It is noteworthy that the mechanism and location by which AV signals are bound in the brain to create multimodal objects still remain unknown to neuroscientists (Atilgan et al., 2018). With particular regard to this thesis, we must (i) define AV objects/events to decide what we want to look for; (ii) hypothesize ways of associating the modalities to discover said objects; and (iii) deal with the limited annotations, noise and scale of the datasets we wish to use for training and testing. Some of these main difficulties are explained below. We tackle several of them in this thesis.
What is an AV object? Providing a general definition for the so-called AV object is not straightforward (Kubovy and Schutz, 2010). A recent study (Bizley et al., 2016) defines it as a perceptual construct which occurs when a constellation of stimulus features is bound within the brain. Such a definition encompasses correlated AV signals like the link between a speaker’s lip movements and voice. However, whether the definition covers objects like cars, where such relations are not visible, is a question of interpretation. In this thesis, for all practical purposes, we simply focus on the class of sounds that have an identifiable visual source. These would typically fall into one of the following sub-categories: (1) objects such as musical instruments with unique, distinguishable sound characteristics, (2) objects such as vehicles which produce multiple sounds, possibly from different sub-parts such as wheels, engine, etc. and (3) sounds like air horns which are produced by multiple visual sources. We rely on the underlying tasks, data annotations and/or our formulations to evade the ambiguity and subjectivity associated with our definition. We also avoid commonly encountered visual objects such as televisions and tape recorders, as these devices essentially reproduce sound and only add to the complexity of source identifiability and differentiability.
Intra-class audio and visual variations. Audio and visual characteristics of objects (events) can vary widely even within a single class. For a class like motorcycle, the instance-level characteristics can be as diverse as the number of bike models. Other common sources of visual diversity are illumination, viewpoint and color variations.
Scarcity of annotated datasets. We wish to use videos of real-world events. While large amounts of such data can be obtained from websites like YouTube, the annotation is usually limited to a few unreliable, noisy textual tags provided by the user at upload time. This requires developing algorithms capable of learning and performing several complex tasks at scale with minimal supervision.
Noise in data. The amateurish quality and unconstrained nature of user-generated videos result in different types of noise in both modalities. Standard sources of noise include background audio and poor microphone or camera quality. Many videos are artificially altered to suit the content creator’s needs. Specifically, the original audio track may be overlaid or completely replaced with music. In other cases, the content is modified or created with static, sometimes irrelevant, images. Moreover, salient cues in the two modalities, such as the visual appearance of a car and the sound of its engine, may be asynchronous, that is, the audio and visual cues appear at different times in the video. This is also referred to as the problem of alignment in other works using the visual and textual modalities (Alayrac et al., 2016).