The objective of this task is the enhancement of speech signals immersed in a noisy 3D environment. Models are expected to extract the monophonic voice signal from a 3D mixture containing various background noises, optionally supported by an RGB image. For this task we therefore also provide the clean monophonic speech signal as the target.
The evaluation metric for this task combines the short-time objective intelligibility (STOI), which estimates the intelligibility of the output speech signal, and the word error rate (WER), computed to assess the effect of the enhancement on speech recognition. The final metric is (STOI + (1 − WER)) / 2, which lies in the range 0 to 1; higher values are better.
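The combination above can be sketched in a few lines of Python. This is a minimal illustration, not the official evaluation script: the function names word_error_rate and task1_metric are hypothetical, the STOI value is assumed to have been computed elsewhere, and WER is clipped to [0, 1] on the assumption that this is what keeps the final score inside the stated range (an unclipped WER can exceed 1 when the hypothesis contains many insertions).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def task1_metric(stoi: float, wer: float) -> float:
    """Combine STOI and WER as (STOI + (1 - WER)) / 2."""
    # Assumption: WER is clipped to [0, 1] so the score stays in range.
    wer = min(max(wer, 0.0), 1.0)
    return (stoi + (1.0 - wer)) / 2.0
```

For example, an enhanced signal with STOI 0.8 whose transcript has a WER of 0.5 would score (0.8 + 0.5) / 2 = 0.65 under this sketch.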
To generate the spatial sound scenes, the computed IRs are convolved with clean sound samples belonging to distinct sound classes. The noise sound event database used for Task 1 is the well-known FSD50K dataset. In particular, we selected 12 transient classes representative of the noise sounds that can be heard in an office (computer keyboard, drawer open/close, cupboard open/close, finger snapping, keys jangling, knock, laughter, scissors, telephone, writing, chink and clink, printer) and 4 continuous noise classes (alarm, crackle, mechanical fan and microwave oven).
Furthermore, we extracted clean speech signals (without background noise) from LibriSpeech, keeping only sound files of up to 12 seconds.
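The convolution step described above can be illustrated with a short NumPy sketch. It is not the generation pipeline used for the dataset: the function name spatialize is hypothetical, and the multichannel IR is assumed to be stored channel-first as an array of shape (n_channels, ir_length).

```python
import numpy as np


def spatialize(mono: np.ndarray, ir: np.ndarray) -> np.ndarray:
    """Convolve a mono signal with each channel of a multichannel IR.

    mono: shape (n_samples,)
    ir:   shape (n_channels, ir_length)  (assumed channel-first layout)
    Returns an array of shape (n_channels, n_samples + ir_length - 1).
    """
    mono = np.asarray(mono, dtype=float)
    ir = np.asarray(ir, dtype=float)
    if mono.ndim != 1 or ir.ndim != 2:
        raise ValueError("expected a 1-D signal and a 2-D impulse response")
    # Full linear convolution, one IR channel at a time.
    return np.stack([np.convolve(mono, channel) for channel in ir])
```

With a 4-channel first-order Ambisonics IR, the result is a 4-channel spatialized version of the dry sample; several such outputs can then be summed to form a mixture.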
The main characteristics of the L3DAS23 Task1 section are:
The predictor data of this section are released as 8-channel, 16 kHz, 16-bit WAV files, each consisting of 2 sets of first-order Ambisonics recordings (4 channels each). The channel order is [WA,YA,ZA,XA,WB,YB,ZB,XB], where A/B refers to the microphone used and WYZX are the B-format Ambisonics channels. A CSV file named audio_image.csv links each audio file to its respective image.
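Given the channel order above, the two microphones can be separated by slicing. The following is a minimal sketch assuming the audio is already loaded as a NumPy array in channel-first layout, shape (8, n_samples); the names CHANNEL_ORDER and split_mics are hypothetical. Common WAV readers such as soundfile.read and scipy.io.wavfile.read return (n_samples, n_channels), so their output would need to be transposed first.

```python
import numpy as np

# Channel labels in file order, as listed in the dataset description.
CHANNEL_ORDER = ["WA", "YA", "ZA", "XA", "WB", "YB", "ZB", "XB"]


def split_mics(audio: np.ndarray):
    """Split an (8, n_samples) array into microphone A and microphone B.

    Each returned array has shape (4, n_samples) with channels in W, Y, Z, X order.
    """
    audio = np.asarray(audio)
    if audio.ndim != 2 or audio.shape[0] != len(CHANNEL_ORDER):
        raise ValueError("expected an array of shape (8, n_samples)")
    mic_a = audio[:4]  # WA, YA, ZA, XA
    mic_b = audio[4:]  # WB, YB, ZB, XB
    return mic_a, mic_b
```

A single-microphone model could, for instance, use only mic_a, while a dual-microphone model would consume both arrays.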
The dataset is organized as follows:
The Task 1 dataset can be downloaded from Kaggle, either through the appropriate API or from the appropriate web page.