Feature learning with raw-waveform CLDNNs for Voice Activity Detection
Entity: UAM. Departamento de Tecnología Electrónica y de las Comunicaciones
Publisher: International Speech Communication Association
Date: 2016-09-12
Citation: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. International Speech Communication Association, 2016, pp. 3668-3672
ISSN: 2308-457X
DOI: 10.21437/Interspeech.2016-268
Editor's Version: http://dx.doi.org/10.21437/Interspeech.2016-268
Subjects: Feature extraction; Network architecture; Speech communication; Speech processing; Deep neural networks; Frequency modeling; Long short-term memory; Model architecture; Pre-processing step; Proposed architectures; Speech recognition systems; Voice activity detection; Speech recognition; Telecomunicaciones
Rights: © 2016 ISCA

Abstract
Voice Activity Detection (VAD) is an important preprocessing step in any state-of-the-art speech recognition system. Choosing the right set of features and model architecture can be challenging and is an active area of research. In this paper we propose a novel approach to VAD that tackles feature and model selection jointly. The proposed method is based on a CLDNN (Convolutional, Long Short-Term Memory, Deep Neural Network) architecture fed directly with the raw waveform. We show that using the raw waveform allows the neural network to learn features directly for the task at hand, which is more powerful than using log-mel features, especially in noisy environments. In addition, a CLDNN, which takes advantage of both frequency modeling with the CNN and temporal modeling with the LSTM, is a much better model for VAD than a DNN. The proposed system achieves over 78% relative improvement in False Alarms (FA) at the operating point of 2% False Rejects (FR) in both clean and noisy conditions, compared to a DNN of comparable size trained on log-mel features. In addition, we study the impact of model size and the learned features to provide a better understanding of the proposed architecture.
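As a rough illustration of the pipeline the abstract describes (a convolutional front end applied directly to raw-waveform frames, an LSTM for temporal modeling, and a feed-forward output producing a per-frame speech posterior), the sketch below runs a forward pass in plain NumPy. All layer sizes, kernel lengths, and random weights here are illustrative assumptions and do not reproduce the paper's actual configuration or trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv_features(frame, kernels, stride=10):
    """Time-convolve a raw-waveform frame with a kernel bank, apply ReLU,
    then max-pool over time, yielding one feature vector per frame
    (the CNN acts as a learned 'filterbank' replacing log-mel features)."""
    k = kernels.shape[1]
    n_out = (frame.size - k) // stride + 1
    acts = np.empty((kernels.shape[0], n_out))
    for t in range(n_out):
        acts[:, t] = kernels @ frame[t * stride : t * stride + k]
    return np.maximum(acts, 0.0).max(axis=1)

def lstm_step(x, h, c, W, U, b):
    """One step of a standard LSTM cell; gate order i, f, o, g."""
    n = h.size
    z = W @ x + U @ h + b
    i, f, o = (sigmoid(z[j * n : (j + 1) * n]) for j in range(3))
    g = np.tanh(z[3 * n :])
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

# Hypothetical sizes chosen for the demo, not the paper's values.
n_filt, k_len, n_hid = 8, 50, 16
kernels = rng.standard_normal((n_filt, k_len)) * 0.1
W = rng.standard_normal((4 * n_hid, n_filt)) * 0.1
U = rng.standard_normal((4 * n_hid, n_hid)) * 0.1
b = np.zeros(4 * n_hid)
w_out = rng.standard_normal(n_hid) * 0.1

frames = rng.standard_normal((20, 400))  # 20 frames of raw audio, 25 ms @ 16 kHz
h, c = np.zeros(n_hid), np.zeros(n_hid)
probs = []
for frame in frames:
    feat = conv_features(frame, kernels)   # CNN: frequency modeling on raw samples
    h, c = lstm_step(feat, h, c, W, U, b)  # LSTM: temporal modeling across frames
    probs.append(sigmoid(w_out @ h))       # output layer: speech posterior per frame
probs = np.array(probs)
print(probs.shape)  # -> (20,)
```

In a real VAD system each posterior would be thresholded at an operating point (e.g. the threshold giving 2% FR) to emit a speech/non-speech decision per frame.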
Authors: Zazo Candil, Rubén; Sainath, Tara N.; Simko, Gabor; Parada, Carolina