Yonsei University and QMUL workshop on Audio and Visual Learning for Multimedia Perception and Production

Where: UG1, G.O. Jones Building, Mile End Campus, Queen Mary University of London (number 25 in [map]).

  • 28 January 2019
    9:00-9:30 Registration, tea & coffee
    9:30-9:40 Welcome & introduction to CIS, Prof Andrea Cavallaro
    Deep learning-based speech signal processing: recent research activities of the DSP Lab at Yonsei University

    Speech is still one of the most convenient ways of designing natural human-computer interaction systems, partly because of the convenience of sensors such as microphones. Deep learning methods, which have brought about paradigm shifts in many research fields, now also play a key role in natural speech interfaces such as speech enhancement, automatic speech/speaker recognition, and text-to-speech. In this talk, we introduce the recent research activities of the DSP & AI Lab at Yonsei University. We briefly describe three projects: 1) background noise removal for audio/video clips, 2) emotional text-to-speech systems, and 3) audio-visual signal processing applications. All of the projects described in this talk require in-depth knowledge of deep learning techniques as well as speech signal processing theory.

    Hong-Goo Kang
    Signal processing methods for sound recognition

    Audio analysis - also called machine listening - involves the development of algorithms capable of extracting meaningful information from audio signals such as speech, music, or environmental sounds. Sound recognition, also called sound event detection, is a core problem in machine listening and is still considered open in the case of recognizing multiple overlapping sounds in complex noisy environments. Applications of sound recognition are numerous, including smart homes/smart cities, ambient assisted living, biodiversity assessment, security/surveillance, and audio archive management. In this talk I will present current research on methods for sound recognition in complex acoustic environments, using elements from signal processing and machine learning theory. The talk will also cover ongoing research community efforts for the public evaluation of sound recognition methods, as well as a discussion of the limitations of current audio analysis technologies and of promising directions for future machine listening research.

    Emmanouil Benetos
    Active power normalization for panning-based binaural rendering

    Binaural rendering of audio objects can be performed in two ways. The first is to render directly through convolution with the BRIR/HRIR corresponding to the location of the object. The second is panning-based rendering, in which the object is panned to virtual locations prior to rendering. Panning-based rendering downmixes the panned signals obtained through convolution with the BRIR/HRIR of each virtual location. In this downmixing process, high-frequency components are often attenuated due to the comb-filter effect. As a remedy, a frequency-dependent gain normalization can be applied to the panning process, but only when the panning gain is known. This presentation introduces an active power normalization method for panning-based binaural rendering. The proposed method calculates subband power-normalization factors so that the power of the downmixed signals equals that of the panned signals. This prevents the high-frequency attenuation of the downmixed signals and, as a result, compensates for the spectral distortion due to the comb-filter effect. The method can also be applied to channel signals rendered with vector-based panning algorithms such as VBAP and MDAP. It was applied to the binaural renderer of MPEG-H 3D Audio, the latest standard for panning-based binaural rendering, to evaluate subjective and objective performance. The test results showed that the MUSHRA score improved by about 15 points.
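    The core idea can be sketched in a few lines: per subband, scale the downmix so its power matches the summed power of the individual panned signals. This is a minimal illustration with invented function and variable names, not the MPEG-H implementation.

```python
import numpy as np

def active_power_normalization(panned_subbands):
    """Per-subband power normalization for a downmix of panned signals.

    panned_subbands: array of shape (num_virtual_sources, num_subbands,
    num_frames). Returns the normalized downmix, shape (num_subbands,
    num_frames).
    """
    # Plain downmix: comb-filter effects can attenuate high subbands.
    downmix = panned_subbands.sum(axis=0)

    # Target power: sum of the individual panned-signal powers per subband.
    target_power = (np.abs(panned_subbands) ** 2).sum(axis=(0, 2))
    downmix_power = (np.abs(downmix) ** 2).sum(axis=1)

    # Normalization factor per subband (guard against silent subbands).
    eps = 1e-12
    gain = np.sqrt(target_power / (downmix_power + eps))

    return gain[:, None] * downmix
```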

    Young-cheol Park
    Combining digital signal processing and computer modelling and simulation in cardiac arrhythmia studies

    The heart is a complex biological system whose dynamics, the cardiac rhythms, never cease to fascinate us. Since the very beginning of cardiac electrophysiology, some form of signal processing, modelling and simulation (the so-called thought experiments) has been used to understand cardiac rhythms. Nowadays, the availability of computing power has led to the development of digital signal processing and computer modelling and simulation techniques for common clinical tasks such as diagnosis, prevention and treatment. In this talk, I will discuss my work in combining digital signal processing and computer simulation techniques for understanding cardiac rhythms. Specifically, I will focus on an abnormal cardiac rhythm called fibrillation, whose mechanisms remain poorly understood to date.

    Jesús Requena Carrión
    11:00-11:30 Coffee break
    3D tracking from a compact microphone array co-located with a camera

    We address the problem of 3D audio-visual speaker tracking using a compact platform with co-located audio-visual sensors and no depth camera. We present a face-detection-driven approach supported by mapping 3D hypotheses to the image plane for visual feature matching. We then propose a video-assisted audio likelihood computation, which relies on a GCC-PHAT-based acoustic map. Audio and video likelihoods are fused in a particle filtering framework. The proposed approach copes with reverberant and noisy environments, and can deal with the speaker being occluded, outside the camera’s Field of View (FoV), not facing the sensing platform, or far from it. Experimental results show that we can provide accurate speaker tracking both in 3D and on the image plane.
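    GCC-PHAT itself is a standard delay-estimation technique: whiten the cross-spectrum of a microphone pair so only phase remains, and locate the correlation peak. A minimal single-pair sketch follows; names and parameters are illustrative, not from the paper.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the time delay of `sig` relative to `ref` with GCC-PHAT."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    # Phase transform: keep only phase information, discard magnitude.
    cross /= np.abs(cross) + 1e-12
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    # Rearrange so that zero lag sits in the middle of the search window.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs  # delay in seconds
```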

    Xinyuan Qian
    A deep learning-based stress detection algorithm using speech signals

    In this research, we propose a deep learning-based psychological stress detection algorithm using speech signals. With increasing demand for communication between humans and intelligent systems, automatic stress detection is becoming an interesting research topic. Stress can be reliably detected by measuring the level of specific hormones (e.g., cortisol), but this is not a convenient method for detecting stress in human-machine interactions. The proposed algorithm first extracts mel-filterbank coefficients from preprocessed speech data and then predicts the stress state as a binary decision (i.e., stressed or unstressed) using long short-term memory (LSTM) and feed-forward networks. To evaluate the performance of the proposed algorithm, speech, video, and bio-signal data were collected in a well-controlled environment. We used only the speech signals, from subjects whose salivary cortisol level varied by more than 10%, in the decision process. The proposed algorithm achieved 66.4% accuracy in detecting the stress state of 25 subjects, demonstrating the potential of speech signals for automatic stress detection.
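    The mel-filterbank front end mentioned above can be sketched as follows: frame the signal, take per-frame magnitude spectra, and apply triangular mel-spaced filters. The parameter values here are common defaults, not necessarily those used in the study.

```python
import numpy as np

def mel_filterbank_features(signal, fs=16000, n_fft=512, n_mels=40,
                            frame_len=400, hop=160):
    """Log mel-filterbank coefficients, shape (n_frames, n_mels)."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # Triangular filters centred at equally spaced points on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    # Frame the signal and compute windowed magnitude spectra.
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len]
                       * np.hamming(frame_len) for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))

    return np.log(mag @ fbank.T + 1e-10)
```

    In the full system these feature sequences would then be fed to the LSTM classifier.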

    Hyewon Han
    Background light estimation for depth-dependent underwater image restoration

    Light undergoes a wavelength-dependent attenuation and loses energy along its propagation path in water. In particular, the absorption of red wavelengths is greater than that of green and blue wavelengths in open ocean waters. This reduces the red intensity of the scene radiance reaching the camera and results in non-uniform light, known as background light, that varies with scene depth. Restoration methods that compensate for this colour loss often assume constant background light and distort the colour of the water region(s). To address this problem, we propose a restoration method that compensates for the colour loss due to the scene-to-camera distance of non-water regions without altering the colour of pixels representing water. This is achieved by ensuring that background light candidates are selected from pixels representing water and then estimating the non-uniform background light without prior knowledge of the scene depth. Experimental results show that the proposed approach outperforms existing methods in preserving the colour of water regions.
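    A widely used underwater image-formation model, and its inversion given an estimated background light and scene depth, can be sketched as follows. The attenuation coefficients are illustrative, and this is a generic model inversion, not the authors' method.

```python
import numpy as np

def restore_underwater(image, background_light, depth, beta=(0.7, 0.3, 0.1)):
    """Invert a simple underwater image-formation model:

        I_c = J_c * exp(-beta_c * d) + B_c * (1 - exp(-beta_c * d))

    where J is the scene radiance, B the background light and d the
    scene-to-camera distance. Red attenuates fastest in open water,
    hence the larger illustrative beta for the red channel.
    image: HxWx3 RGB floats in [0, 1]; depth: HxW distances.
    """
    beta = np.asarray(beta)
    t = np.exp(-beta[None, None, :] * depth[..., None])  # transmission map
    J = (image - background_light * (1.0 - t)) / np.maximum(t, 1e-3)
    return np.clip(J, 0.0, 1.0)
```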

    Chau Yi Li
    12:00-12:10 Multi-camera matching of spatio-temporal binary features

    Wide baselines and unknown camera poses considerably degrade the performance of feature matching across moving cameras. To improve matching accuracy, we accumulate temporal information within each view by tracking local binary features, which are described by comparing the intensities of pixel pairs.
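    A BRIEF-style binary descriptor of the kind referred to above, built from pixel-pair intensity comparisons and matched by Hamming distance, can be sketched as follows. This is a generic illustration, not the authors' exact descriptor.

```python
import numpy as np

def binary_descriptor(patch, pairs):
    """BRIEF-style descriptor: one bit per pixel-pair intensity comparison.

    patch: 2D image patch; pairs: (K, 4) integer array of sample
    coordinates (y1, x1, y2, x2). Returns a length-K bit vector.
    """
    bits = patch[pairs[:, 0], pairs[:, 1]] < patch[pairs[:, 2], pairs[:, 3]]
    return bits.astype(np.uint8)

def hamming_distance(d1, d2):
    """Matching cost between two binary descriptors."""
    return int(np.count_nonzero(d1 != d2))
```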

    Alessio Xompero
    Deformable kernel networks for joint image filtering

    Joint image filters are used to transfer structural details from a guidance image, used as a prior, to a target image, in tasks such as enhancing spatial resolution and suppressing noise. Previous methods based on convolutional neural networks (CNNs) combine nonlinear activations of spatially-invariant kernels to estimate structural details and regress the filtering result. In this work, we instead explicitly learn sparse and spatially-variant kernels. We propose a CNN architecture and its efficient implementation, called the deformable kernel network (DKN), that outputs sets of neighbours and the corresponding weights adaptively for each pixel. The filtering result is then computed as a weighted average. We demonstrate the effectiveness and flexibility of our models on the tasks of depth map upsampling, saliency map upsampling, cross-modality image restoration, texture removal, and semantic segmentation. In particular, we show that the weighted averaging process with sparsely sampled 3 × 3 kernels outperforms the state of the art by a significant margin.
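    The output stage described above, a per-pixel weighted average over a sparse, adaptively chosen neighbourhood, can be sketched as follows. This is a toy re-implementation of the idea, not the authors' code; in the real network, the offsets and weights are predicted by the CNN from the guidance image.

```python
import numpy as np

def sparse_adaptive_filter(target, offsets, weights):
    """Weighted average over a sparse, spatially-variant neighbourhood.

    target: HxW image; offsets: HxWxKx2 integer (dy, dx) per pixel;
    weights: HxWxK, assumed to sum to 1 per pixel. Each pixel picks
    its own K neighbours and blends them.
    """
    H, W, K, _ = offsets.shape
    ys, xs = np.mgrid[0:H, 0:W]
    out = np.zeros((H, W))
    for k in range(K):
        # Clamp sampled coordinates to the image border.
        ny = np.clip(ys + offsets[..., k, 0], 0, H - 1)
        nx = np.clip(xs + offsets[..., k, 1], 0, W - 1)
        out += weights[..., k] * target[ny, nx]
    return out
```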

    Bumsub Ham
    Multimodal acoustic sensing from a micro aerial vehicle

    The ego-noise generated by the motors and propellers of a micro aerial vehicle (MAV) masks environmental sounds and considerably degrades the quality of on-board sound recordings. We propose a multi-modal analysis approach that jointly exploits audio and video to enhance the sounds of multiple targets captured from an MAV equipped with a microphone array and a video camera. First, we develop a time-frequency spatial filtering algorithm that enhances the sound from a desired direction by exploiting the time-frequency sparsity of acoustic signals. We then develop an audio-visual joint analysis algorithm that estimates the locations of potential sound sources from the video recording. Finally, we extract the sound of each individual speaker by steering multiple time-frequency spatial filters towards the estimated target directions. Experimental results with real outdoor data verify the robustness of the proposed multi-modal approach for multiple speakers in extremely low-SNR scenarios.
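    The idea behind time-frequency spatial filtering can be illustrated for the simplest case of two microphones: keep only the time-frequency bins whose inter-microphone phase difference is consistent with a plane wave from the target direction. This is a simplified sketch with invented parameter names, not the algorithm from the talk.

```python
import numpy as np

def tf_mask_two_mics(X1, X2, fs, n_fft, mic_dist, target_angle_deg,
                     c=343.0, tol=1.0):
    """Binary time-frequency mask selecting a target direction of arrival.

    X1, X2: STFTs of two microphones, shape (n_bins, n_frames);
    mic_dist in metres; tol is the allowed phase error in radians.
    Returns the masked STFT of microphone 1.
    """
    n_bins = X1.shape[0]
    freqs = np.arange(n_bins) * fs / n_fft
    # Expected inter-mic delay for a plane wave from the target direction.
    tau = mic_dist * np.cos(np.deg2rad(target_angle_deg)) / c
    expected = 2 * np.pi * freqs * tau
    observed = np.angle(X1 * np.conj(X2))
    # Wrapped phase error between observation and target-direction model.
    err = np.angle(np.exp(1j * (observed - expected[:, None])))
    mask = (np.abs(err) < tol).astype(float)
    return mask * X1
```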

    Lin Wang
    12:50-13:00 Closing remarks

    Organiser
    Changjae Oh

    Logistics
    Vandana Rajan
    Mohamed Ilyes Lakhal
    Imran Rashid

    Summary of the day

    We welcome the delegation from Yonsei University to Queen Mary University of London.

    The workshop on Audio and Visual Learning for Multimedia Perception and Production started with a presentation of the Centre for Intelligent Sensing by Prof Andrea Cavallaro, followed by an introduction to the workshop and to the QMUL-Yonsei University collaboration by Dr Changjae Oh.

    2019 Yonsei and QMUL workshop
    The first half of the event saw presentations from four academics.

    Prof Hong-Goo Kang focused on speech signal processing and deep learning for human-computer interaction. You can learn more about Prof Kang’s DSP & AI Lab [here].

    The next talk was on machine listening for sound scene recognition by Dr Emmanouil Benetos, co-leader of the Machine Listening Lab [link].

    Prof Young-cheol Park presented his work on the binaural renderer of MPEG-H 3D Audio. More work by Prof Park can be found [here].

    Dr Jesus Requena [website] concluded the first session with a talk on modelling the heart and identifying abnormal cardiac rhythms.
    The second half of the workshop had four research students presenting their recently published papers on 3D audio/visual tracking [paper], deep learning for stress detection from speech signals [paper], underwater image enhancement [paper] and multi-camera matching of feature points [paper].

    Two academics concluded this session.

    Dr Bumsub Ham talked about deformable kernel networks for joint image filtering. More information on his research can be found on his [website].

    Dr Lin Wang presented his work on sound enhancement from a microphone array mounted on a drone, which was published in the IEEE Sensors Journal [link].
    2019 Yonsei and QMUL workshop
    The event concluded with a networking session where technical discussions highlighted the overlapping research interests of the two institutions.

    We would like to thank the people who made the event a success: Dr Changjae Oh as the main organiser; the logistics team of Vandana Rajan, Mohamed Ilyes Lakhal and Imran Rashid; and all the speakers and participants.

    Queen Mary University of London, through its Centre for Intelligent Sensing, looks forward to continuing the collaboration with Yonsei University.

    [Pictures of the event]