2021 Intelligent Sensing Winter School (7-10 December)
Themes: AI for sound perception, AI for visual perception, AI for multimodal perception.
CORSMAL challenge: The CORSMAL challenge will see participants complete a task using audio, visual or multimodal data. Solutions by the teams will be presented to a judging panel that will vote for the best results. Datasets and baseline models will be provided. To participate in the challenge, please fill in this online form (or send this form to corsmal-challenge@qmul.ac.uk).
Platform: Zoom [link].
Registration: [here]. Send an [email] for late registrations.
Target audience: Researchers from industry, postdocs, PhD & MSc students. QMUL PhD students will receive Skills Points for participating.
Programme at a glance
GMT times | Tuesday, 7 December | Wednesday, 8 December | Thursday, 9 December | Friday, 10 December
13:00 | AI for multimodal perception | AI for visual perception | AI for sound perception | CORSMAL Challenge: presentations by each team and final leaderboard
14:00 | CORSMAL Challenge | | |
For queries: [email]
Detailed programme (all times are in GMT)
7 December
12:45 | Welcome and opening talk
AI for multimodal perception
13:00 | Efficient, robust, multimodal representations for human behavior analysis
Stylianos Asteriadis, University of Maastricht, Netherlands
The talk will start with an overview of the area of automated human behavior recognition and some of the major challenges in the field. We will discuss why and how we can leverage the latest advances in Computer Vision and Artificial Intelligence to build multimodal computational models able to infer human emotions and activities. Such tools usually need to function in demanding environments, both in terms of highly personalized behavioral cues and in terms of high levels of spontaneity in performing activities or expressing one’s emotions. For these reasons, different modalities can be brought together for human behavior recognition, and their temporal interplay can be taken into account by employing technologies able to discover informative parts or patterns in a sequence and learn their own representations in the absence of widely available annotations.
Multimodal speech understanding
Naomi Harte, Trinity College Dublin, Ireland
Human speech interaction is a multifaceted process. As researchers strive to develop robust and desirable speech technology solutions, there has been an increasing realisation that we need to exploit this multimodal aspect of speech. This short talk will focus on the potential of multimodal speech analysis and look at how the advent of deep learning architectures has allowed us to make significant progress in the domains of audio-visual speech recognition and multimodal conversational analysis.
Separating object sounds from videos and robot interactions
Ruohan Gao, Stanford University, USA
Objects generate unique sounds due to their physical properties and interactions with other objects and the environment. However, object sounds are usually observed not as separate entities, but as a single audio channel that mixes all their frequencies together. In this talk, I will first present how we disentangle object sounds from unlabeled video by leveraging the natural correspondence between visual objects and the sounds they make. Then I will introduce our latest work that enables robots to actively interact with objects in the environment and separate the sounds they make.
13:45 | Q&A and discussion
14:00 | Multi-modal robotic visual-tactile localisation and detection of surface cracks
Francesca Palermo
Localising and recognising mechanical fractures is an important task in hazardous environments during waste decommissioning. We present a novel method to detect surface cracks with visual and tactile sensing. The proposed algorithm localises cracks in remote environments through videos/photos taken by an onboard robot camera. The identified areas of interest are then explored by a robot with a tactile sensor. This approach may also be implemented in extreme environments, since gamma radiation does not interfere with the sensing mechanism of fibre optic-based sensors.
Grasping robot integration and prototyping: the GRIP software framework
Brice Denoun
Robotic manipulation is fundamental to many real-world applications; however, it remains an unsolved problem and a very active research area. New algorithms for robot perception and control are frequently proposed by the research community. These methods must be thoroughly evaluated in realistic conditions before they can be adopted by industry. This process can be extremely time consuming, mainly due to the complexity of integrating different hardware and software components. Hence, we propose the Grasping Robot Integration and Prototyping (GRIP) system, a robot-agnostic software framework that enables visual programming and fast prototyping of robotic grasping and manipulation tasks. We present several applications that have been programmed with GRIP.
CORSMAL Challenge: Audio-visual object classification for human-robot collaboration
Alessio Xompero
Acoustic and visual sensing can support the contactless estimation of the weight of a container and the amount of its content when a person manipulates it, prior to the handover to a robot. However, the opaqueness or transparency of both container and content, and the variability of materials, shapes and sizes, make this problem challenging. I will present the challenge and the tasks that participants have to solve during the winter school, along with the accompanying dataset, performance measures, baselines and state-of-the-art methods.
14:40 | Q&A and discussion
15:00 | Challenge starts
8 December
11:00 | Open hour for challenge questions [link]
AI for visual perception
13:00 | Towards robots manipulating objects like humans
Grégory Rogez, Naverlabs Europe, France
We are interested in robot learning from human demonstration in the context of object grasping. In our pipeline, a robot observes a person manipulating some objects; then, given a single image of these objects, it can predict how a human would grasp them and eventually grasps them in a similar way. First, this requires understanding the interaction of the human hand with the object during manipulation, a non-trivial problem exacerbated by the occlusions of the hand by the manipulated object. Then, in order to predict feasible grasps for the robot, we need to understand the semantic content of the image, its geometric structure and all potential interactions with a hand physical model. In this talk, I will give an overview of our recent work along these lines published in CVPR'20, ECCV'20, ICRA'21 and 3DV'21.
Coarse-to-fine imitation learning: robot manipulation from a single demonstration
Edward Johns, Imperial College London, UK
I will describe a new visual imitation learning method which we have recently developed for robot manipulation. The method enables novel, everyday tasks to be learned in the real world, from just a single human demonstration and without any prior object knowledge. This is in stark contrast to alternative imitation learning methods today, which usually require either multiple demonstrations, continual environment resetting, or prior object knowledge. The method is simple: following the single demonstration, train a visual state estimator with self-supervised learning, which represents a certain point near the object. Then, during testing, move the robot to this point in a straight line and simply replay the original demonstration. This work was published at ICRA 2021, and I will also describe our follow-up work, published at CoRL 2021, which extends this to multi-stage tasks.
Active scene understanding with robot manipulations
Shuran Song, Columbia University, USA
Most computer vision algorithms are built with the goal of understanding the physical world. Yet, as reflected in standard vision benchmarks and datasets, these algorithms continue to assume the role of a passive observer, only watching static images or videos without the ability to interact with the environment. This assumption becomes a fundamental limitation for applications in robotics, where systems are intrinsically built to actively engage with the physical world. In this talk, I will present some recent work from my group that demonstrates how we can enable robots to leverage their ability to interact with the environment in order to better understand what they see: from discovering objects' identity and 3D geometry, to discovering the physical properties of novel objects through different dynamic interactions.
13:45 | Q&A and discussion
14:00 | Vision-based localization methods to benefit camera pose estimation
Meng Xu
Camera pose describes the position and orientation of a camera in a world coordinate system. In this talk, I will introduce several methods for camera pose estimation, either using end-to-end regression integrated with a classification process or using image matching with stricter constraints to improve estimation performance.
Towards safe human-to-robot handovers of unknown containers
Yik Lung Pang
Human-to-robot handovers of unknown containers, such as drinking glasses with liquids, can be unsafe for both the human and the robot. Simulating the handover can enable safe training and testing of robot control algorithms. However, recreating the scenario in simulation often requires scanned 3D object models or expensive equipment, such as motion capture systems and markers. In this talk, we propose a real-to-simulation framework to safely assess robot controllers for human-to-robot handovers, using vision-based estimations of the human hands and containers from videos of humans manipulating the containers. We also propose a method for estimating a safe region on the container for grasping, and quantify the safety of the human and the object in simulation using noisy estimates from a range of perceptual algorithms.
3S-Net: arbitrary semantic-aware style transfer
Bingqing Guo
In recent years, a cluster of stunning style transfer approaches has evolved, covering both single-style and multi-style transfer. However, few studies consider the style consistency between identical semantic objects in the style images and the content image. In multi-style transfer in particular, the merged style is obtained through a simple linear combination with given weights, so the textures of each source style are mixed and the result lacks aesthetic value. To overcome this problem, I will introduce 3S-Net, which achieves semantic-aware style transfer mainly through Two-Step Semantic Instance Normalization (2SSIN) and Semantic Style Swap.
An affordance detection pipeline for resource-constrained devices
Tommaso Apicella
Affordance detection consists of predicting the possibility of a specific action on an object. In particular, we are interested in a semi-autonomous scenario with a human in the loop. In this scenario, a human first moves their robotic prosthesis (e.g. lower arm and hand) towards an object, and then the prosthesis selects the part of the object to grasp. The proposed solution tackles the main challenges: the indirectly controlled camera position, which influences the quality of the view, and the limited computational resources available.
Decentralised person re-identification with selective knowledge aggregation
Shitong Sun
Existing person re-identification (Re-ID) methods mostly follow a centralised learning paradigm, in which all training data are gathered into a single collection for model learning. This paradigm is limited when data from different sources cannot be shared due to privacy concerns. However, current attempts at privacy-protected decentralised person Re-ID handle poorly the adaptation of the generalised model to maximise its performance on individual client-domain Re-ID tasks, owing to a lack of understanding of data heterogeneity across domains; this is known as poor ‘model personalisation’. In this work, we present a new Selective Knowledge Aggregation approach to decentralised person Re-ID that optimises the trade-off between model personalisation and generalisation.
14:40 | Q&A and discussion
9 December
11:00 | Open hour for challenge questions [link]
AI for sound perception
13:00 | Target speech extraction
Marc Delcroix, NTT Communication Science Laboratories, Japan
Human selective listening ability enables us to listen to a desired speaker or sound in a mixture using clues about that speaker or sound. Target speech extraction aims at computationally reproducing this human ability, i.e., estimating the speech signal of the desired speaker in a speech mixture using a short enrollment utterance or a video of the speaker as a clue. In this talk, we will introduce neural network-based target speech extraction, focusing on enrollment-based approaches. We will also briefly discuss extending the idea beyond speech to realize universal sound extraction, where the target signal can be any arbitrary sound.
Machine learning for indoor acoustics
Antoine Deleforge, Inria Nancy, France
Close your eyes and clap your hands. Can you hear the shape of the room? What about the carpet on the floor? In this talk, we will see how machine learning, physics and signal processing can be jointly leveraged to tackle these difficult inverse problems, with applications ranging from acoustic diagnosis to audio augmented reality. In particular, we will introduce the unifying methodological framework of "virtual acoustic space learning", review some of its recent promising results and discuss some of its current bottlenecks.
Modelling overlapping sound events: a multi-label or multi-class problem?
Huy Phan, Queen Mary University of London, UK
Polyphonic sound events are the main error source of audio event detection (AED) systems. This work investigates framing the task as a multi-class classification problem by considering each possible label combination as one class (a toy illustration of this encoding is sketched after this abstract). To circumvent the large number of classes arising from the combinatorial explosion, the event categories are decomposed into multiple groups to form a multi-task problem in a divide-and-conquer fashion, where each task is a multi-class classification problem. A network architecture is then devised for multi-task modelling. Experiments on a database with a high degree of event overlap show that the proposed approach results in more favorable performance than the widely used multi-label approach.
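The following toy sketch illustrates the label-combination-as-class encoding described in the abstract above; it is an assumption for illustration only, not the speaker's code. The event names are made up, and the decomposition into groups and the network architecture are omitted.

```python
# Toy illustration (assumed, not the authors' implementation): map each combination
# of overlapping event labels to a single class index for multi-class training.
from itertools import combinations

events = ["speech", "dog_bark", "siren"]  # hypothetical event categories

# Enumerate every non-empty combination of simultaneously active events; each
# combination becomes one class. Note the combinatorial explosion: 2^N - 1 classes.
label_combinations = [
    frozenset(combo)
    for r in range(1, len(events) + 1)
    for combo in combinations(events, r)
]
class_index = {combo: idx for idx, combo in enumerate(label_combinations)}

# A frame where speech and a siren overlap maps to one class id.
print(class_index[frozenset({"speech", "siren"})])
```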
13:45 | Q&A and discussion
14:00 | Mixup augmentation for generalizable speech separation
Ashish Alex
Deep learning has advanced the state of the art of single-channel speech separation. However, separation models may overfit the training data, and generalization across datasets is still an open problem in real-world noisy conditions. In our paper, we address the generalization problem with Mixup as a data augmentation approach. Mixup creates new training examples from linear combinations of samples during mini-batch training (a minimal illustrative sketch follows after this abstract). We propose four variations of Mixup and assess the improved generalization of a speech separation model, DPRNN, with cross-corpus evaluation on the LibriMix, TIMIT and VCTK datasets. We show that training DPRNN with the proposed Data-only Mixup augmentation variation improves performance on an unseen dataset in noisy conditions compared to baseline SpecAugment-augmented models, while having comparable performance on the source dataset.
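For reference, here is a minimal sketch of Mixup-style augmentation on a mini-batch of waveforms, assuming PyTorch. The function name, Beta parameter and mixing recipe are illustrative assumptions and do not reproduce the four variations proposed in the paper; the corresponding target handling is also omitted.

```python
# Minimal Mixup sketch for audio mini-batches (illustrative only, not the
# authors' implementation; mixing of the separation targets is omitted).
import torch

def mixup_batch(waveforms: torch.Tensor, alpha: float = 0.4) -> torch.Tensor:
    """Mix each example with a randomly chosen partner from the same mini-batch."""
    lam = torch.distributions.Beta(alpha, alpha).sample()   # mixing coefficient in (0, 1)
    perm = torch.randperm(waveforms.size(0))                 # random partner for each example
    return lam * waveforms + (1.0 - lam) * waveforms[perm]   # convex combination of pairs

# Example: a batch of 8 mono clips, 2 seconds at 8 kHz.
batch = torch.randn(8, 16000)
mixed = mixup_batch(batch)
```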
Joint pitch detection and score transcription for piano music recordings
Lele Liu
Automatic music transcription is the task of obtaining a machine- or human-readable score from a music recording using computer algorithms. In this talk, we will discuss our ideas on predicting note pitches and symbolic musical scores using deep learning methods, including choices of input/output representations and different model architectures.
Learning music audio representations via weak language supervision
Ilaria Manco
Learning good audio representations is critical to solving music information retrieval (MIR) tasks. Most current approaches use supervised learning on audio data, which requires task-specific training on annotated datasets. In this talk, I will introduce our recent paper in which we explore multimodal pre-training on weakly aligned audio-text pairs to learn general-purpose music audio representations, and show that these can be adapted to a wide range of downstream MIR tasks.
Protecting gender and identity with disentangled speech representations
Dimitrios Stoidis
Besides its linguistic component, our speech is rich in biometric information that can be inferred by classifiers. Learning privacy-preserving representations for speech signals enables downstream tasks, such as speech recognition, without sharing unnecessary private information about an individual. In this short presentation, we will show how gender recognition and speaker verification tasks can be reduced to a random guess, protecting against classification-based attacks.
Cross-lingual hate speech detection in social media
Aiqi Jiang
Most hate speech detection research focuses on a single language, generally English, which limits its generalisability to other languages. In this talk, we discuss hate speech detection in a multilingual scenario and present our idea: a cross-lingual capsule network learning model coupled with extra domain-specific lexical semantics for cross-lingual hate speech (CCNL-Ex).
14:40 | Q&A and discussion
10 December
The CORSMAL Challenge
11:30 | Submission of the results
13:00 | Presentations by each team (methods and results) [link] [slides]
14:00 | Presentation of the leaderboard and closing talk
CORSMAL Challenge organisers
Changjae Oh
Lin Wang
Alessio Xompero
Sponsors