A Multi-Resolution CRNN-Based Approach for Semi-Supervised Sound Event Detection in DCASE 2020 Challenge
EntityUAM. Departamento de Tecnología Electrónica y de las Comunicaciones
PublisherInstitute of Electrical and Electronics Engineers Inc. (IEEE)
10.1109/ACCESS.2021.3088949IEEE Access 9 (2021): 89029-89042
Funded byThis work was supported in part by the Project Deep Speech for Forensics and Security (DSForSec) under Grant RTI2018-098091-B-I00, in part by the Ministry of Science, Innovation and Universities of Spain, and in part by the European Regional Development Fund (ERDF)
ProjectGobierno de España. RTI2018-098091-B-I00
SubjectsDCASE 2020 Task 4; multi-resolution; Sound event detection; Telecomunicaciones
Rights© The author(s)
Esta obra está bajo una Licencia Creative Commons Atribución 4.0 Internacional.
Sound Event Detection is a task with a rising relevance over the recent years in the field of audio signal processing, due to the creation of specific datasets such as Google AudioSet or DESED (Domestic Environment Sound Event Detection) and the introduction of competitive evaluations like the DCASE Challenge (Detection and Classification of Acoustic Scenes and Events). The different categories of acoustic events can present diverse temporal and spectral characteristics. However, most approaches use a fixed time-frequency resolution to represent the audio segments. This work proposes a multi-resolution analysis for feature extraction in Sound Event Detection, hypothesizing that different resolutions can be more adequate for the detection of different sound event categories, and that combining the information provided by multiple resolutions could improve the performance of Sound Event Detection systems. Experiments are carried out over the DESED dataset in the context of the DCASE 2020 Challenge, concluding that the combination of up to 5 resolutions allows a neural network-based system to obtain better results than single-resolution models in terms of event-based F1-score in every event category and in terms of PSDS (Polyphonic Sound Detection Score). Furthermore, we analyze the impact of score thresholding in the computation of F1-score results, finding that the standard value of 0.5 is suboptimal and proposing an alternative strategy based in the use of a specific threshold for each event category, which obtains further improvements in performance
This item appears in the following Collection(s)
Showing items related by title, author, creator and subject.