A Multi-Resolution CRNN-Based Approach for Semi-Supervised Sound Event Detection in DCASE 2020 Challenge
Entity
UAM. Departamento de Tecnología Electrónica y de las Comunicaciones
Publisher
Institute of Electrical and Electronics Engineers Inc. (IEEE)
Date
2021-06-14
Citation
10.1109/ACCESS.2021.3088949
IEEE Access 9 (2021): 89029-89042
ISSN
2169-3536 (online)
DOI
10.1109/ACCESS.2021.3088949
Funded by
This work was supported in part by the Project Deep Speech for Forensics and Security (DSForSec) under Grant RTI2018-098091-B-I00, in part by the Ministry of Science, Innovation and Universities of Spain, and in part by the European Regional Development Fund (ERDF)
Project
Gobierno de España. RTI2018-098091-B-I00
Editor's Version
https://doi.org/10.1109/ACCESS.2021.3088949
Subjects
DCASE 2020 Task 4; multi-resolution; Sound event detection; Telecomunicaciones
Rights
© The author(s)
Abstract
Sound Event Detection has gained relevance in recent years in the field of audio signal processing, owing to the creation of dedicated datasets such as Google AudioSet and DESED (Domestic Environment Sound Event Detection) and the introduction of competitive evaluations such as the DCASE Challenge (Detection and Classification of Acoustic Scenes and Events). Different categories of acoustic events can present diverse temporal and spectral characteristics, yet most approaches use a fixed time-frequency resolution to represent the audio segments. This work proposes a multi-resolution analysis for feature extraction in Sound Event Detection, hypothesizing that different resolutions can be better suited to detecting different sound event categories, and that combining the information provided by multiple resolutions could improve the performance of Sound Event Detection systems. Experiments are carried out on the DESED dataset in the context of the DCASE 2020 Challenge. They show that combining up to 5 resolutions allows a neural network-based system to obtain better results than single-resolution models, both in terms of event-based F1-score in every event category and in terms of PSDS (Polyphonic Sound Detection Score). Furthermore, we analyze the impact of score thresholding on the computation of F1-score results, finding that the standard value of 0.5 is suboptimal and proposing an alternative strategy based on a specific threshold for each event category, which yields further improvements in performance.
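To illustrate the class-specific thresholding idea mentioned in the abstract, here is a minimal sketch. It is not the authors' implementation. It assumes clip-level scores and a plain binary F1 rather than the event-based F1 used in the challenge, and the function names (`f1_score`, `best_thresholds`), the threshold grid, and the toy data are all hypothetical.

```python
import numpy as np


def f1_score(y_true, y_pred):
    """Binary F1 from boolean arrays; 0.0 when the score is undefined."""
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0


def best_thresholds(scores, labels, grid=None):
    """For each class (column), pick the grid threshold with the highest F1.

    scores: (n_clips, n_classes) array of posteriors in [0, 1].
    labels: (n_clips, n_classes) array of 0/1 ground-truth labels.
    The grid is an assumption; ties keep the lowest threshold.
    """
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)
    labels = labels.astype(bool)
    thresholds = np.empty(scores.shape[1])
    for c in range(scores.shape[1]):
        f1s = [f1_score(labels[:, c], scores[:, c] >= t) for t in grid]
        thresholds[c] = grid[int(np.argmax(f1s))]
    return thresholds


# Toy validation data (made up for illustration): 4 clips, 2 event classes.
toy_scores = np.array([[0.22, 0.91],
                       [0.37, 0.62],
                       [0.41, 0.83],
                       [0.12, 0.57]])
toy_labels = np.array([[0, 1],
                       [1, 0],
                       [1, 1],
                       [0, 0]])

per_class = best_thresholds(toy_scores, toy_labels)
fixed_f1 = [f1_score(toy_labels[:, c].astype(bool), toy_scores[:, c] >= 0.5)
            for c in range(2)]
tuned_f1 = [f1_score(toy_labels[:, c].astype(bool), toy_scores[:, c] >= per_class[c])
            for c in range(2)]
print("per-class thresholds:", per_class)
print("F1 at 0.5:", fixed_f1, "F1 at tuned thresholds:", tuned_f1)
```

In this toy case a single threshold of 0.5 suits neither class, while a threshold chosen per class separates both. In practice the thresholds would be tuned on a validation set and then applied to held-out data, to avoid optimistic estimates.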
Authors
De Benito-Gorron, Diego
Ramos Castro, Daniel
Toledano, Doroteo T.