2020 Intelligent Sensing Summer School (1-4 September)

Themes: AI for Audio, AI for Vision, AI for Multimodal data.

CORSMAL challenge: The CORSMAL challenge will see participants divided into teams that compete on a task involving multimodal (audio and visual) data, to be solved within a limited time span. Each team will present its solution to a judging panel that will vote for the best results. Datasets will be provided.

Platform: Zoom. The main Zoom room will be allocated for Presentations, Q&A sessions and Award sessions. Dedicated breakout Zoom rooms will be allocated to the teams participating in the CORSMAL challenge.

Target audience: Researchers from industry, postdocs, and PhD & MSc students. QMUL PhD students will receive Skills Points for participating, presenting or helping.

Programme at a glance

BST times | Tuesday, 1 September | Wednesday, 2 September | Thursday, 3 September | Friday, 4 September
10:00 | | The CORSMAL challenge (working in teams) | The CORSMAL challenge (working in teams) | The CORSMAL challenge (working in teams)
12:00 | AI for Multimodal Data | AI for Audio | AI for Vision | The CORSMAL challenge (judges meet)
13:00 | | | | The CORSMAL challenge (presentation of the results)
14:00 | The CORSMAL challenge (working in teams) | The CORSMAL challenge (working in teams) | The CORSMAL challenge (working in teams) |
16:00 | | | | Closing and awards


Registration: [here]. Send an [email] for late registrations.

For queries: [email]



Detailed programme (all times are in BST)

1 September
12:00 Welcome and opening [slides]
AI for Multimodal Data
12:10
Tutorial: Audio-visual variational speech enhancement

Speech enhancement is the task of extracting clean speech from a noisy mixture. While the sound signal may suffice in low-noise conditions, the visual input can bring complementary information to help the enhancement process. In this tutorial, two strategies based on variational auto-encoders will be described and compared: (i) the systematic use of both audio and video versus (ii) the automatic inference of the optimal mixture of audio and video.

[slides]
Xavier Alameda-Pineda
INRIA, France
12:50 Q&A and discussion
13:00
Identify, locate and separate: Audio-visual object extraction in large video collections using weak supervision

We tackle the problem of audiovisual scene analysis for weakly-labeled data. To this end, we build upon our previous audiovisual representation learning framework to perform object classification in noisy acoustic environments and integrate audio source enhancement capability. This is made possible by a novel use of non-negative matrix factorization for the audio modality. Our approach is founded on the multiple instance learning paradigm. Its effectiveness is established through experiments over a challenging dataset of music instrument performance videos. We also show encouraging visual object localization results.

Slim Essid
Telecom ParisTech, France

Audio-visual modalities fusion for efficient emotion recognition with active learning capabilities

The efficient fusion of different modalities, mainly audio and visual data, is essential for achieving advanced recognition performance. In this short presentation, we discuss some of the available training strategies for developing a multimodal emotion recognition system that exploits state-of-the-art Convolutional Neural Networks (CNNs) as unimodal feature extractors. Extensive results demonstrate the main advantages and disadvantages of each strategy, aiming to generalise the findings to the broader challenge of modality fusion. Moreover, we present the idea of incorporating temporal properties into such a system using Long Short-Term Memory (LSTM) cells, and discuss efficient Reinforcement Learning (RL) techniques that enable online operation.

Antonios Gasteratos
DUT, Greece

Multimodal speech enhancement

Human-human and human-machine communications occur in verbal or nonverbal forms. Among the nonverbal forms, visual cues play a major role. In various applied speech technologies, systems that integrate audio and visual cues generally provide better performance than audio-only counterparts. In this talk, we first present our recent research progress on incorporating audio and video information to facilitate better speech enhancement (SE) capability based on a multimodal deep learning framework. In addition to visual cues, bone-conducted speech signals, which manifest speech-phoneme structures, also complement their air-conducted counterparts. In the second part of this talk, we will present our recently proposed ensemble-learning-based strategies that integrate the bone- and air-conducted speech signals to perform SE. The experimental results on two Mandarin corpora indicate that the multimodal SE structure significantly outperforms the single-source SE counterparts.

Yu Tsao
CITI, Taiwan

When vision fails, using language and audio to understand videos

In this talk, I will discuss the benefit of using language when dealing with the problem of video understanding, as well as potential issues that arise from the open set of language expressions. I will also show how considering both language and audio provides a boost in performance and strengthens the underlying learned multi-modal space.

Michael Wray
University of Bristol, UK
14:00 Q&A and discussion
14:15 Presentation of the CORSMAL challenge
14:30 Challenge starts


2 September
11:00 Open hour for challenge questions
AI for Audio
12:00
AI and audio in DCASE Challenge

The DCASE Challenge is a well-known and eagerly awaited yearly evaluation campaign focused on audio research for sound scene analysis. Currently dominated by deep learning solutions, the challenge has become a good testing ground for AI in audio, with a variety of topics that include acoustic scene classification, sound event detection, detection and localization, and audio captioning. This talk will briefly introduce a selection of audio tasks from different years, together with the related datasets and solutions.

Annamaria Mesaros
Tampere University, Finland

Understanding the reverberation environment for immersive media

Just as light and shading provide spatial information in an image, the way sound reflects and reverberates around an environment can be illuminating too. For example, from a room's response to an impulsive sound, it is possible to perceive the presence of nearby reflectors (such as a wall), the size of the room, whether the materials are hard or soft, and the approximate location of the sound source, including its distance. Hence reverb is among the most valuable tools in the sound designer's media production process for setting the scene, depth and perspective. In immersive media, the reverb takes on an additional spatial dimension. This talk presents recent audio, visual and audio-visual techniques to extract the information needed to reproduce the room impression spatially, with an emphasis on the kinds of acoustical processing and evaluation that may be less familiar. In concluding, various applications of the spatial reverb and future work will be discussed.

Philip Jackson
University of Surrey, UK

Frontiers for machine learning in speech (and hearing) technology: A brief analysis

For many decades, speech technology and machine learning have not just co-existed but driven progress in each other's fields. This trend continues, with fundamental changes in the last decade to the approaches taken to solve the highly complex problem of understanding speech. This short talk provides a glimpse of the current strategies and novel methods used to advance core speech and hearing technology.
Thomas Hain
The University of Sheffield, UK
12:45 Q&A and discussion
13:00
ConflictNET: end-to-end learning for speech-based conflict intensity estimation

Computational paralinguistics aims to infer human emotions, personality traits and behavioural patterns from speech signals. In particular, verbal conflict is an important example of human-interaction behaviour, whose detection would enable monitoring and feedback in a variety of applications. In this work, we propose an end-to-end convolutional-recurrent neural network architecture that learns conflict-specific features directly from raw speech waveforms.

Vandana Rajan

Deep generative variational autoencoding for replay spoof detection in automatic speaker verification

Voice biometric systems use automatic speaker verification (ASV) technology for user authentication. Although it is among the most convenient means of biometric authentication, the robustness and security of ASV in the face of spoofing attacks (or presentation attacks) are of growing concern and are now well acknowledged by the speech research community. A spoofing attack involves illegitimate access to the personal data of a targeted user. Replay is among the simplest attacks to mount yet difficult to detect reliably, and is the focus of this talk. I will present our preliminary results on the application of variational autoencoders for replay spoofing detection using two benchmark spoofing datasets.

Bhusan Chettri

Memory controlled sequential self attention for sound recognition

The attention mechanism is arguably one of the most powerful concepts in machine learning today. In this talk, we discuss our ideas for efficiently implementing self-attention on top of a baseline machine learning model for the task of sound event detection (SED).

Arjun Pankajakshan

Automatic lyrics transcription based on hybrid speech recognition

Automatic Lyrics Transcription (ALT) can be defined as transcribing the linguistic content of singing voice signals, and can be viewed as speech recognition in the singing domain. In this talk, we will go through how to adapt the state of the art in speech recognition to the singing domain. In particular, we will study how to build the acoustic, language and pronunciation models based on the domain-specific properties of singing. The talk is shaped around the state-of-the-art system in lyrics transcription.

Emir Demirel

Audio tagging using a linear noise modelling layer

In this work, we explore the potential of modelling the noise distribution of target labels in an audio dataset by adding a linear noise modelling layer on top of a baseline deep neural network. We show that adding the layer improves the accuracy of the model and we compare the performance of the noise modelling layer with an alternative approach of adopting a noise robust loss function.

Shubhr Singh
13:40 Q&A and discussion


3 September
11:00 Open hour for challenge questions
AI for Vision
12:00
Fully articulated hand tracking for HoloLens 2

In this talk, I'll present an overview of the work behind the fully articulated hand tracking for HoloLens 2. Our journey starts with an academic paper, published four years ago. Developing the hand tracking technology that was shipped on HoloLens 2 required many technical innovations - combining both high-level algorithmic changes and low-level optimization. In this talk I'll focus, in particular, on two of these innovations: the introduction of a novel 3D surface model for the hand and the use of DNNs.

Federica Bogo
Microsoft, Switzerland

Perception through action and for action

Computer vision has recently made great progress on problems such as finding categories of objects and scenes, and poses of people in images. However, studying such tasks in isolated disembodied contexts, divorced from the physical source of their images, is insufficient to build intelligent robotic agents. In this talk, I will give an overview of my research focusing on remarrying vision to action, by asking: How might vision benefit from the ability to act in the world, and vice versa? Could embodied visual agents teach themselves through interaction and experimentation? Are there actions they might perform to improve their visual perception? Could they exploit vision to perform complex control tasks?

Dinesh Jayaraman
University of Pennsylvania, US

Underwater vision and 6D pose estimation

To enable robust underwater manipulation, several components are required, spanning perception to manipulation. Unfortunately, underwater environments experience different lighting conditions than terrestrial ones, and RGB sensing is very limited. In this talk, I will introduce (1) underwater image dehazing for visibility improvement and (2) deep-learning-based pose estimation for object grasping. For dehazing, three approaches will be introduced: model-based, image-processing-based, and learning-based; we will discuss the pros and cons of each. The second topic, 6D pose estimation, exploits deep learning to learn underlying primitives for improved orientation estimation.

Ayoung Kim
KAIST, South Korea
12:45 Q&A and discussion
13:00
Cast-GAN: learning to remove colour cast from underwater images

Underwater images are degraded by blur and colour cast caused by the attenuation of light in water. To remove the colour cast with neural networks, images of the scene taken under white illumination are needed as reference for training, but are generally unavailable. We exploit open data and typical colour distributions of objects to create a synthetic image dataset that reflects degradations naturally occurring in underwater photography. We use this dataset to train Cast-GAN, a Generative Adversarial Network whose loss function includes terms that eliminate artefacts that are typical of underwater images enhanced with neural networks.

Chau Yi Li

Temporal action localization with Variance-Aware Networks

This work addresses the problem of temporal action localization with Variance-Aware Networks (VANs), i.e., DNNs that use second-order statistics in the input and the output of regression tasks. Results show that incorporating second-order statistics surpasses the accuracy of virtually all other two-stage networks without involving any additional parameters.

Tingting Xie

Novel-view human action synthesis

Given a video of a person performing an action recorded from an input camera view, novel-view human action synthesis aims at synthesizing the appearance of the person as seen from a target camera view. In this talk, we present the View Decomposition Network (VDNet), a deep generative model that addresses the synthesis problem using a view-decomposition assumption. The design of the solution is inspired by the observation that each view contains information shared across all views as well as view-specific information.

Mohamed Ilyes Lakhal

Semantic adversarial attacks for privacy protection

Images shared on social media are routinely analysed by machine learning models for content annotation and user profiling. These automatic inferences reveal to the service provider sensitive information that a naive user might want to keep private. The unwanted inference of trained machine learning models can be prevented via data modification. We show how to modify images, while maintaining or even enhancing their visual quality, prior to sharing them online, so that classifiers cannot infer private information from the visual content.

Ali Shahin Shamsabadi
13:40 Q&A and discussion


4 September
The CORSMAL Challenge
11:00 Challenge submission (code and presentation)
12:00 Panel of judges meet
13:00 Presentation by the teams
14:00 Panel of judges deliberate
16:00 Awards



Organisers CORSMAL Challenge
Emmanouil Benetos

Changjae Oh

Lin Wang

Alessio Xompero

Sponsors
Alan Turing Institute
Institute of Applied Data Science
CORSMAL
EPSRC



Past events

Summer schools

2019 Summer School

2018 Summer School

2017 Summer School

2016 Summer School

2015 Summer School

2014 Summer School

2013 Summer School


Other events

2018/19 CIS PhD Welcome day

2017/18 CIS PhD Welcome day

2016/17 CIS PhD Welcome day

2015/16 CIS PhD Welcome day

CIS Spring Camp 2016

Sensing and graphs week

Commercialisation bootcamp

Sensing and IoT week

Software workshop