2020 Intelligent Sensing Summer School (September 1-4)
Themes: AI for Audio, AI for Vision, AI for Multimodal Data.
CORSMAL challenge: The CORSMAL challenge will see participants divided into teams to compete on a task involving multimodal (audio and visual) data, to be solved within a limited time span. The teams will present their solutions to a judging panel, which will vote for the best results. Datasets will be provided.
Platform: Zoom. The main Zoom room will be used for presentations, Q&A sessions and the awards session. Dedicated breakout Zoom rooms will be allocated to the teams participating in the CORSMAL challenge.
Target audience: Researchers from industry, postdocs, PhD & MSc students. QMUL PhD students will receive Skills Points for participating, presenting or helping.
Programme at a glance
| BST times | Tuesday, 1 September | Wednesday, 2 September | Thursday, 3 September | Friday, 4 September |
| --- | --- | --- | --- | --- |
| 10:00 | The CORSMAL challenge (working in teams) | The CORSMAL challenge (working in teams) | The CORSMAL challenge (working in teams) | |
| 11:00 | | | | |
| 12:00 | AI for Multimodal Data | AI for Audio | AI for Vision | The CORSMAL challenge (judges meet) |
| 13:00 | | | | The CORSMAL challenge (presentation of the results) |
| 14:00 | The CORSMAL challenge (working in teams) | The CORSMAL challenge (working in teams) | The CORSMAL challenge (working in teams) | |
| 15:00 | | | | |
| 16:00 | | | | Closing and awards |
Registration: [here].
Send an [email] for late registrations.
For queries: [email]
Detailed programme (all times are in BST)
1 September
12:00 | Welcome and opening [slides]
AI for Multimodal Data
12:10 | Tutorial: Audio-visual variational speech enhancement | Xavier Alameda-Pineda (INRIA, France)
Speech enhancement is the task of extracting clean speech from a noisy mixture. While the sound signal may suffice in low-noise conditions, the visual input can bring complementary information to help the enhancement process. In this tutorial, two strategies based on variational auto-encoders will be described and compared: (i) the systematic use of both audio and video against (ii) the automatic inference of the optimal mixture between audio and video.
12:50 | Q&A and discussion
13:00 | Identify, locate and separate: Audio-visual object extraction in large video collections using weak supervision | Slim Essid (Telecom ParisTech, France)
We tackle the problem of audiovisual scene analysis for weakly-labeled data. To this end, we build upon our previous audiovisual representation learning framework to perform object classification in noisy acoustic environments and integrate audio source enhancement capability. This is made possible by a novel use of non-negative matrix factorization for the audio modality. Our approach is founded on the multiple instance learning paradigm. Its effectiveness is established through experiments over a challenging dataset of music instrument performance videos. We also show encouraging visual object localization results.
Audio-visual modalities fusion for efficient emotion recognition with active learning capabilities | Antonios Gasteratos (DUT, Greece)
The efficient fusion of different modalities, mainly audio and visual data, is essential for achieving advanced recognition performance. In this short presentation, we discuss some of the available training strategies for developing a multimodal emotion recognition system, exploiting state-of-the-art Convolutional Neural Networks (CNNs) as unimodal feature extractors. Extensive results will demonstrate the main advantages and disadvantages of each strategy, aiming to generalise the findings to the broader challenge of modality fusion. Moreover, the idea of incorporating temporal properties into such a system with Long Short-Term Memory (LSTM) cells will be presented, and efficient Reinforcement Learning (RL) techniques will be discussed to enable online operation.
Multimodal speech enhancement | Yu Tsao (CITI, Taiwan)
Human-human and human-machine communications occur in verbal or nonverbal forms. Among the nonverbal forms, visual cues play a major role. In various applied speech technologies, systems that integrate audio and visual cues generally provide better performance than audio-only counterparts. In this talk, we first present our recent research progress on incorporating audio and video information to facilitate better speech enhancement (SE) capability based on a multimodal deep learning framework. In addition to visual cues, bone-conducted speech signals, which manifest speech-phoneme structures, also complement their air-conducted counterparts. In the second part of the talk, we will present our recently proposed ensemble-learning-based strategies that integrate the bone- and air-conducted speech signals to perform SE. The experimental results on two Mandarin corpora indicate that the multimodal SE structure significantly outperforms the single-source SE counterparts.
When vision fails, using language and audio to understand videos | Michael Wray (University of Bristol, UK)
In this talk, I will discuss the benefit of using language when dealing with the problem of video understanding, as well as potential issues that arise from the open set of language expressions. I will also show how considering both language and audio provides a boost in performance and strengthens the underlying learned multi-modal space.
14:00 | Q&A and discussion
14:15 | Presentation of the CORSMAL challenge
14:30 | Challenge starts
2 September
11:00 | Open hour for challenge questions
AI for Audio
12:00 | AI and audio in DCASE Challenge | Annamaria Mesaros (Tampere University, Finland)
The DCASE Challenge is by now a well-known and eagerly awaited yearly evaluation campaign focused on audio research for sound scene analysis. Currently dominated by deep learning solutions, the challenge has become a good testing ground for AI in audio, with a variety of topics that include acoustic scene classification, sound event detection, detection and localization, and audio captioning. This talk will briefly introduce a selection of audio tasks from different years, along with the related datasets and solutions.
Understanding the reverberation environment for immersive media | Philip Jackson (University of Surrey, UK)
Just as light and shading provide spatial information in an image, the way sound reflects and reverberates around an environment can be illuminating too. For example, from a room's response to an impulsive sound, it is possible to perceive the presence of nearby reflectors (such as a wall), the size of the room, whether the materials are hard or soft, and the approximate location of the sound source, including its distance. Hence reverb is among the most valuable tools in the sound designer's media production process for setting the scene, depth and perspective. In immersive media, the reverb takes on an additional spatial dimension. This talk presents recent audio, visual and audio-visual techniques to extract the information needed to reproduce the room impression spatially, with an emphasis on the kinds of acoustical processing and evaluation that may be less familiar. In concluding, various applications of the spatial reverb and future work will be discussed.
Frontiers for machine learning in speech (and hearing) technology: A brief analysis | Thomas Hain (The University of Sheffield, UK)
For many decades speech technology and machine learning have not just co-existed but driven progress in their respective fields. This trend continues, with fundamental changes in the last decade to the approaches taken to solve the highly complex problem of understanding speech. This short talk provides a glimpse of the current strategies and novel methods used to advance core speech and hearing technology.
12:45 | Q&A and discussion
13:00 | ConflictNET: end-to-end learning for speech-based conflict intensity estimation | Vandana Rajan
Computational paralinguistics aims to infer human emotions, personality traits and behavioural patterns from speech signals. In particular, verbal conflict is an important example of human-interaction behaviour, whose detection would enable monitoring and feedback in a variety of applications. In this work, we propose an end-to-end convolutional-recurrent neural network architecture that learns conflict-specific features directly from raw speech waveforms.
Deep generative variational autoencoding for replay spoof detection in automatic speaker verification | Bhusan Chettri
Voice biometric systems use automatic speaker verification (ASV) technology for user authentication. Even though ASV is among the most convenient means of biometric authentication, its robustness and security in the face of spoofing attacks (or presentation attacks) is of growing concern and is now well acknowledged by the speech research community. A spoofing attack involves illegitimate access to the personal data of a targeted user. Replay is among the simplest attacks to mount, yet it is difficult to detect reliably, and it is the focus of this talk. I will present our preliminary results on the application of variational autoencoders for replay spoofing detection using two benchmark spoofing datasets.
Memory controlled sequential self attention for sound recognition | Arjun Pankajakshan
The attention mechanism is arguably one of the most powerful concepts in machine learning today. In this talk, we discuss our ideas for efficiently implementing self-attention on top of a baseline machine learning model for the task of sound event detection (SED).
Automatic lyrics transcription based on hybrid speech recognition | Emir Demirel
Automatic Lyrics Transcription (ALT) can be defined as transcribing the linguistic content of singing voice signals, and can be perceived as speech recognition in the singing domain. In this talk, we will go through how to adapt the state of the art in speech recognition to the singing domain. In particular, we will study how to build the acoustic, language and pronunciation models based on the domain-specific properties of singing. The talk is shaped around the state-of-the-art system in lyrics transcription.
Audio tagging using a linear noise modelling layer | Shubhr Singh
In this work, we explore the potential of modelling the noise distribution of target labels in an audio dataset by adding a linear noise modelling layer on top of a baseline deep neural network. We show that adding the layer improves the accuracy of the model and we compare the performance of the noise modelling layer with an alternative approach of adopting a noise-robust loss function.
13:40 | Q&A and discussion
3 September
11:00 | Open hour for challenge questions
AI for Vision
12:00 | Fully articulated hand tracking for HoloLens 2 | Federica Bogo (Microsoft, Switzerland)
In this talk, I'll present an overview of the work behind the fully articulated hand tracking for HoloLens 2. Our journey starts with an academic paper published four years ago. Developing the hand tracking technology that shipped on HoloLens 2 required many technical innovations, combining both high-level algorithmic changes and low-level optimization. I'll focus, in particular, on two of these innovations: the introduction of a novel 3D surface model for the hand and the use of DNNs.
Perception through action and for action | Dinesh Jayaraman (University of Pennsylvania, US)
Computer vision has recently made great progress on problems such as finding categories of objects and scenes, and poses of people in images. However, studying such tasks in isolated, disembodied contexts, divorced from the physical source of their images, is insufficient to build intelligent robotic agents. In this talk, I will give an overview of my research focusing on remarrying vision to action, by asking: How might vision benefit from the ability to act in the world, and vice versa? Could embodied visual agents teach themselves through interaction and experimentation? Are there actions they might perform to improve their visual perception? Could they exploit vision to perform complex control tasks?
Underwater vision and 6D pose estimation | Ayoung Kim (KAIST, South Korea)
To enable robust manipulation underwater, several components are required, ranging from perception to manipulation. Unfortunately, underwater environments present different light conditions from terrestrial ones, and RGB sensing is very limited. In this talk, I would like to introduce (1) underwater image dehazing for visibility improvement and (2) deep-learning-based pose estimation for object grasping. For dehazing, three classes of approaches (model-based, image-processing-based, and learning-based) will be introduced, and we will discuss the pros and cons of each. The second topic, 6D pose estimation, exploits deep learning to learn underlying primitives for improved orientation estimation.
12:45 | Q&A and discussion
13:00 | Cast-GAN: learning to remove colour cast from underwater images | Chau Yi Li
Underwater images are degraded by blur and colour cast caused by the attenuation of light in water. To remove the colour cast with neural networks, images of the scene taken under white illumination are needed as reference for training, but are generally unavailable. We exploit open data and typical colour distributions of objects to create a synthetic image dataset that reflects degradations naturally occurring in underwater photography. We use this dataset to train Cast-GAN, a Generative Adversarial Network whose loss function includes terms that eliminate artefacts that are typical of underwater images enhanced with neural networks.
Temporal action localization with Variance-Aware Networks | Tingting Xie
This work addresses the problem of temporal action localization with Variance-Aware Networks (VANs), i.e., DNNs that use second-order statistics in the input and the output of regression tasks. Results show that incorporating second-order statistics surpasses the accuracy of virtually all other two-stage networks without involving any additional parameters.
Novel-view human action synthesis | Mohamed Ilyes Lakhal
Given a video of a person performing an action recorded from an input camera view, novel-view human action synthesis aims at synthesizing the appearance of the person as seen from a target camera view. In this talk, we present the View Decomposition Network (VDNet), a deep generative model that addresses the synthesis problem using a view-decomposition assumption. The design of the solution is inspired by the observation that each view combines information shared across all views with view-specific information.
Semantic adversarial attacks for privacy protection | Ali Shahin Shamsabadi
Images shared on social media are routinely analysed by machine learning models for content annotation and user profiling. These automatic inferences reveal to the service provider sensitive information that a naive user might want to keep private. The unwanted inference of trained machine learning models can be prevented via data modification. We show how to modify images, while maintaining or even enhancing their visual quality, prior to sharing them online, so that classifiers cannot infer private information from the visual content.
13:40 | Q&A and discussion
4 September
The CORSMAL Challenge
11:00 | Challenge submission (code and presentation)
12:00 | Panel of judges meet
13:00 | Presentation by the teams
14:00 | Panel of judges deliberate
16:00 | Awards
Organisers | CORSMAL Challenge
Emmanouil Benetos, Changjae Oh, Lin Wang, Alessio Xompero
Sponsors