L3DAS23 - Task 2


The aim of this task is to detect the temporal activity, spatial position, and typology of a known set of sound events immersed in a set of simulated 3D acoustic environments. We consider up to 3 simultaneously active sounds, which may belong to the same class. Models are expected to predict the list of active sound events and their respective locations at regular intervals of 100 milliseconds. We use a joint metric for localization and detection: the F-score based on location-sensitive detection [1].
This metric counts a prediction as a true positive only if the sound class is correctly predicted in a temporal frame and its predicted location lies within a Cartesian distance of at most 1.75 m from the true position.
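The per-frame matching rule can be sketched as follows. This is an illustrative re-implementation of the criterion described above, not the official evaluation code; the function names and the `(class_id, position)` tuple representation are assumptions.

```python
import numpy as np

def frame_tp_fp_fn(preds, refs, dist_threshold=1.75):
    """Match predicted events to reference events within one 100 ms frame.

    preds/refs: lists of (class_id, np.array([x, y, z])) tuples.
    A prediction counts as a true positive only if its class matches a
    still-unmatched reference event and the Euclidean (Cartesian) distance
    to that reference is at most `dist_threshold` metres.
    Representation and names are illustrative, not the official tool.
    """
    matched = [False] * len(refs)
    tp = 0
    for cls, pos in preds:
        for i, (ref_cls, ref_pos) in enumerate(refs):
            if matched[i] or cls != ref_cls:
                continue
            if np.linalg.norm(pos - ref_pos) <= dist_threshold:
                matched[i] = True
                tp += 1
                break
    fp = len(preds) - tp   # predictions left unmatched
    fn = len(refs) - tp    # reference events left undetected
    return tp, fp, fn

def f_score(tp, fp, fn):
    """F-score from counts accumulated over all frames."""
    return 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0
```

In practice the counts would be accumulated over every 100 ms frame of the whole evaluation set before computing the final F-score.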

Sound event classes

To generate the spatial sound scenes, the measured room impulse responses (RIRs) are convolved with clean sound samples belonging to distinct sound classes. The sound event database used for Task 2 is the well-known FSD50K dataset. In particular, we selected 14 classes representative of the sounds that can be heard in an office: computer keyboard, drawer open/close, cupboard open/close, finger snapping, keys jangling, knock, laughter, scissors, telephone, writing, chink and clink, printer, female speech, and male speech.
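The convolution step can be sketched as below. This is only a minimal illustration of spatializing a mono sample with a multichannel RIR, assuming a 4-channel first-order Ambisonics RIR; it is not the dataset's actual generation pipeline, and the function name and array shapes are assumptions.

```python
import numpy as np
from scipy.signal import fftconvolve

def spatialize(clean, rir):
    """Convolve a mono clean sample with a 4-channel first-order
    Ambisonics RIR to place the source in the room.

    clean: 1-D array of samples.
    rir:   2-D array of shape (n_taps, 4), one column per Ambisonics channel.
    Returns an array of shape (len(clean) + n_taps - 1, 4).
    Illustrative sketch only, not the dataset generation code.
    """
    # One linear convolution per Ambisonics channel
    return np.stack(
        [fftconvolve(clean, rir[:, ch]) for ch in range(rir.shape[1])],
        axis=1,
    )
```

With a real RIR, each output channel carries the source filtered by the room's response at that channel, which is what encodes the source position in the Ambisonics scene.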

Dataset specs

The main characteristics of the L3DAS23 SELD dataset are:

  • 900 30-second-long data points (7.5 hours in total)
  • Sampling rate: 32 kHz, 16 bit
  • Over 1000 sound event samples from FSD50K (14 sound classes)
  • 700+ RIR positions for each microphone position in a selection of 68 virtual environments (300k+ possible RIR positions)
  • separate sets with 1, 2 or 3 overlaps
  • one 512x512 px RGB image for each microphone position

The Task 2 dataset is organized as follows:

  • L3DAS23_Task2_{train, dev}
    • data
      • splitX_ov{1,2,3}_Y_A.wav (first-order Ambisonics recording of mic A)
      • splitX_ov{1,2,3}_Y_B.wav (first-order Ambisonics recording of mic B)
      • ...
    • labels
      • label_splitX_ov{1,2,3}_Y.csv (CSV containing the temporal and spatial location of each sound and its class)
      • ...
      • audio_image.csv (CSV where each line is a pair of the form: splitX_ovZ_Y,imagefilename.png)

where X is in the range [0,3] in the train set and X=4 in the dev set; Y is an incremental number; ov1 stands for a maximum of one overlapping sound, ov2 for a maximum of two overlapping sounds, and ov3 for a maximum of three overlapping sounds.
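A small parser for this naming scheme can make the convention concrete. The regular expression below simply encodes the pattern described above; the function name and returned dictionary keys are illustrative choices, not part of the dataset tooling.

```python
import re

# Matches names of the form splitX_ov{1,2,3}_Y_{A,B}.wav,
# e.g. "split2_ov3_17_A.wav" (pattern as described in the text).
NAME_RE = re.compile(
    r"split(?P<split>\d+)_ov(?P<overlap>[123])_(?P<idx>\d+)_(?P<mic>[AB])\.wav"
)

def parse_name(filename):
    """Decode a Task 2 audio file name into its components (illustrative)."""
    m = NAME_RE.fullmatch(filename)
    if m is None:
        raise ValueError(f"unrecognized file name: {filename}")
    return {
        "split": int(m.group("split")),      # 0-3 in train, 4 in dev
        "overlap": int(m.group("overlap")),  # max simultaneous sounds
        "idx": int(m.group("idx")),          # incremental number Y
        "mic": m.group("mic"),               # microphone A or B
    }
```

For example, `parse_name("split2_ov3_17_A.wav")` identifies a train-set recording from mic A with up to three overlapping sounds.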

The Task 2 dataset can be downloaded from Kaggle, either through the Kaggle API or from the dataset's web page.

Example (from L3DAS21)

[1] A. Mesaros, S. Adavanne, A. Politis, T. Heittola and T. Virtanen, "Joint Measurement of Localization and Detection of Sound Events," 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2019, pp. 333-337, doi: 10.1109/WASPAA.2019.8937220.