3D audio has attracted growing interest in the machine learning community in recent years. Its applications are wide-ranging, from virtual and real conferencing to game development, music production, autonomous driving, surveillance and many more. In this context, Ambisonics prevails among 3D audio formats for its simplicity, effectiveness and flexibility. Ambisonics recordings enable strong performance in many machine learning tasks, usually bringing a significant improvement over the mono and stereo formats. Tasks such as Sound Source Localization, Speech and Emotion Recognition, Sound Source Separation, Speech Enhancement and Denoising, and Acoustic Echo Cancellation, among others, benefit from three-dimensional representations of the sound field, leading to higher accuracy and perceived quality.
The L3DAS project (Learning 3D Audio Sources) aims to encourage and foster research on the aforementioned topics. In particular, the L3DAS21 Challenge focuses on 2 tasks: 3D Speech Enhancement and 3D Sound Event Localization and Detection, both relying on Ambisonics recordings. First, we provide the training and development sets, along with a supporting Python-based API to facilitate data download and preprocessing. We will also supply baseline results for both tasks, obtained using state-of-the-art deep learning architectures. In a second step, we will release the test sets without ground-truth labels. Participants must submit the results they obtain on these test sets. Finally, the ranking of the challenge will be presented at the IEEE Workshop on MLSP and released on this page.
The L3DAS21 dataset contains multiple-source and multiple-perspective B-format Ambisonics audio recordings. We sampled the acoustic field of a large office room, placing two first-order Ambisonics microphones in the center of the room and moving a loudspeaker that reproduces the analytic signal across 252 fixed spatial positions. Relying on the collected Ambisonics impulse responses (IRs), we augmented existing clean monophonic datasets to obtain synthetic three-dimensional sound sources by convolving the original sounds with our IRs. Clean files have been extracted from the LibriSpeech and FSD50K datasets.
We aimed to create plausible and varied 3D scenarios that reflect real-life situations in which sounds and disparate types of background noise coexist in the same 3D reverberant environment. We provide normalized raw waveforms as predictor data, while the target data varies according to the task, as specified in the next section.
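The augmentation step described above, convolving a clean mono sound with a multichannel impulse response, can be sketched as follows. This is a minimal illustration and not the dataset's actual generation code: the function name `spatialize`, the random stand-in signals, and the channel-first array layout are all assumptions made for the example.

```python
import numpy as np

def spatialize(mono, ir):
    """Convolve a mono signal with each channel of a multichannel IR.

    mono: array of shape (n_samples,)
    ir:   array of shape (n_channels, ir_len), e.g. 4 channels for
          first-order B-format (layout assumed for this sketch)
    Returns an array of shape (n_channels, n_samples + ir_len - 1).
    """
    mono = np.asarray(mono, dtype=np.float64)
    ir = np.asarray(ir, dtype=np.float64)
    if mono.ndim != 1 or ir.ndim != 2:
        raise ValueError("expected mono of shape (n,) and ir of shape (c, m)")
    # Full linear convolution, one IR channel at a time.
    return np.stack([np.convolve(mono, channel) for channel in ir])

# Stand-in data (hypothetical): random "speech" and a decaying 4-channel IR.
rng = np.random.default_rng(0)
mono = rng.standard_normal(1600)
ir = rng.standard_normal((4, 256)) * np.exp(-np.arange(256) / 40.0)

ambi = spatialize(mono, ir)  # synthetic 4-channel recording
```

For long signals and IRs, an FFT-based convolution would be faster; direct convolution is used here only to keep the sketch dependency-free.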
We propose 2 tasks: 3D Speech Enhancement and 3D Sound Event Localization and Detection. These tasks are aimed at fulfilling real-world needs related to real and virtual conferencing. Especially in multi-speaker scenarios, it is very important to properly understand the nature of a sound event and its position within the environment, the content of the sound signal, and how best to leverage it for a specific application (e.g., teleconferencing rather than assistive listening or entertainment, among others).
Each task presents 2 separate sub-tasks: 1-mic and 2-mic recordings, respectively containing the sounds acquired by one Ambisonics microphone and by an array of two Ambisonics microphones. To the best of our knowledge, this is the first time that Ambisonics recordings performed with 2 microphones are considered for machine learning purposes. We expect higher accuracy and reconstruction quality when taking advantage of the dual spatial perspective of the two microphones. Both tasks rely on the same audio recordings, but with completely different targets, as described below.
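To make the difference between the two sub-tasks concrete, the sketch below builds a model input for each case. It assumes that each first-order B-format recording is stored as a 4-channel array of shape (4, n_samples) and that the two microphones are simply stacked along the channel axis. The helper `make_input` and this stacking order are hypothetical and are not taken from the challenge API.

```python
import numpy as np

def make_input(mic_a, mic_b=None):
    """Build the input array for the 1-mic or 2-mic sub-task.

    mic_a, mic_b: first-order B-format recordings, each assumed to have
    shape (4, n_samples). With one microphone the result has 4 channels;
    with two, the channels are stacked to give 8.
    """
    mic_a = np.asarray(mic_a)
    if mic_a.ndim != 2 or mic_a.shape[0] != 4:
        raise ValueError("mic_a must have shape (4, n_samples)")
    if mic_b is None:
        return mic_a
    mic_b = np.asarray(mic_b)
    if mic_b.shape != mic_a.shape:
        raise ValueError("mic_b must have the same shape as mic_a")
    # Stack microphone A's channels first, then microphone B's.
    return np.concatenate([mic_a, mic_b], axis=0)

# Placeholder recordings (hypothetical): one second at 16 kHz.
mic_a = np.zeros((4, 16000))
mic_b = np.ones((4, 16000))

one_mic = make_input(mic_a)         # shape (4, 16000)
two_mic = make_input(mic_a, mic_b)  # shape (8, 16000)
```

Stacking along the channel axis lets the same network architecture serve both sub-tasks, with only the number of input channels changing.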