The objective of this task is to enhance speech signals immersed in a noisy 3D environment. Models are expected to extract the monophonic voice signal from a 3D mixture containing various background noises. For this task we therefore also provide the clean monophonic speech signal as the target.
The evaluation metric for this task combines the short-time objective intelligibility (STOI), which estimates the intelligibility of the output speech signal, and the word error rate (WER), computed to assess the effect of the enhancement on speech recognition. The final metric is given by (STOI+(1−WER))/2, which lies in the 0-1 range; higher values are better.
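As a minimal sketch of how the two scores combine, the function below evaluates (STOI+(1−WER))/2 for a single pair of values. The function name `task1_metric` is hypothetical, and clamping both inputs to the 0-1 range is an assumption made here so that the result stays in the stated range (a raw WER can exceed 1 when the recognizer inserts many words); it is not taken from the text above.

```python
def task1_metric(stoi: float, wer: float) -> float:
    """Combine STOI and WER as (STOI + (1 - WER)) / 2.

    Both inputs are clamped to [0, 1]. This clamping is an assumption of
    this sketch, made so the result stays in the documented 0-1 range.
    """
    stoi = min(max(stoi, 0.0), 1.0)
    wer = min(max(wer, 0.0), 1.0)
    return (stoi + (1.0 - wer)) / 2.0


if __name__ == "__main__":
    # Illustrative values only, not results from any real system.
    print(task1_metric(stoi=0.8, wer=0.2))
```

Because the two terms are averaged with equal weight, a gain of 0.1 in STOI and a reduction of 0.1 in WER move the final score by the same amount.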
To generate the spatial sound scenes, the measured room impulse responses (IRs) are convolved with clean sound samples belonging to distinct sound classes. The noise sound event database used for Task 1 is the well-known FSD50K dataset. In particular, we selected 12 transient classes representative of the noise sounds that can be heard in an office: computer keyboard, drawer open/close, cupboard open/close, finger snapping, keys jangling, knock, laughter, scissors, telephone, writing, chink and clink, printer, and 4 continuous noise classes: alarm, crackle, mechanical fan and microwave oven.
Furthermore, we extracted clean speech signals (without background noise) from LibriSpeech, taking only sound files up to 10 seconds long.
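The convolution step described above can be illustrated with a small, dependency-free sketch: a mono source is convolved with each channel of a multichannel room IR, and the spatialized sources are summed into one mixture. The function names (`convolve`, `spatialize`, `mix`) and the toy values are hypothetical, and this is not the dataset's actual generation code. A practical pipeline would most likely use an FFT-based convolution for speed.

```python
def convolve(signal, ir):
    """Direct (time-domain) linear convolution of two 1-D sequences."""
    out = [0.0] * (len(signal) + len(ir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(ir):
            out[i + j] += s * h
    return out


def spatialize(mono, multichannel_ir):
    """Convolve one mono source with every channel of a room IR."""
    return [convolve(mono, ir_channel) for ir_channel in multichannel_ir]


def mix(scenes):
    """Sum spatialized sources channel by channel, zero-padding to the longest."""
    n_channels = len(scenes[0])
    length = max(len(scene[0]) for scene in scenes)
    mixture = [[0.0] * length for _ in range(n_channels)]
    for scene in scenes:
        for c, channel in enumerate(scene):
            for n, sample in enumerate(channel):
                mixture[c][n] += sample
    return mixture


if __name__ == "__main__":
    # Toy 2-channel IRs and very short "signals", for illustration only.
    speech = spatialize([1.0, 0.5], [[1.0, 0.0], [0.5, 0.25]])
    noise = spatialize([0.2], [[1.0, 0.5], [0.0, 1.0]])
    print(mix([speech, noise]))
```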
The main characteristics of the L3DAS21 Task1 section are:
The predictor data of this section are released as 8-channel, 16 kHz, 16-bit WAV files, consisting of 2 sets of first-order Ambisonics recordings (4 channels each). The channel order is [WA,YA,ZA,XA,WB,YB,ZB,XB], where A/B refers to the microphone used and WYZX are the B-format Ambisonics channels.
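To make the channel layout concrete, the following sketch splits an 8-channel recording into the two microphones using the order given above. The function name `split_microphones` is hypothetical, and the input is assumed to be a sequence of 8 per-channel sample sequences (channels first); audio loaded as (samples, channels) would need to be transposed before the call.

```python
CHANNEL_ORDER = ["WA", "YA", "ZA", "XA", "WB", "YB", "ZB", "XB"]


def split_microphones(channels):
    """Split 8 channels into per-microphone dicts keyed by W, Y, Z, X.

    `channels` must hold 8 per-channel sample sequences in CHANNEL_ORDER.
    """
    if len(channels) != len(CHANNEL_ORDER):
        raise ValueError(
            f"expected {len(CHANNEL_ORDER)} channels, got {len(channels)}"
        )
    named = dict(zip(CHANNEL_ORDER, channels))
    # The first four channels belong to microphone A, the last four to B.
    mic_a = {name[0]: named[name] for name in CHANNEL_ORDER[:4]}
    mic_b = {name[0]: named[name] for name in CHANNEL_ORDER[4:]}
    return mic_a, mic_b


if __name__ == "__main__":
    # Dummy one-sample channels whose value equals the channel index.
    dummy = [[float(i)] for i in range(8)]
    mic_a, mic_b = split_microphones(dummy)
    print(mic_a["W"], mic_b["W"])
```

A model that uses only one microphone can simply keep the first four channels; the W channel of either microphone is the omnidirectional component of the B-format signal.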