Frame-by-frame language identification in short utterances using deep neural networks

González Domínguez, Javier; López-Moreno, Ignacio; Moreno, Pedro J.; González Rodríguez, Joaquín

UAM_Biblioteca

Entity

UAM. Departamento de Tecnología Electrónica y de las Comunicaciones

Publisher

Elsevier Ltd

Publication date

2015-04-01

Citation

Neural Networks 64 (2015): 49-58

ISSN

0893-6080 (print); 1879-2782 (online)

DOI

10.1016/j.neunet.2014.08.006

Publisher's version

http://dx.doi.org/10.1016/j.neunet.2014.08.006

Subjects

DNNs; I-vectors; Real-time LID; Telecommunications

URI

http://hdl.handle.net/10486/674271

Note

This is the author’s version of a work that was accepted for publication in Neural Networks. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms, may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Neural Networks, vol. 64 (2015), DOI 10.1016/j.neunet.2014.08.006.

Abstract

This work addresses the use of deep neural networks (DNNs) in automatic language identification (LID) focused on short test utterances. Motivated by their recent success in acoustic modelling for speech recognition, we adapt DNNs to the problem of identifying the language of a given utterance from its short-term acoustic features. We show that DNNs are particularly suitable for LID in real-time applications, owing to their capacity to emit a language identification posterior at each new frame of the test utterance. We then analyse different aspects of the system, such as the amount of required training data, the number of hidden layers, the relevance of contextual information and the effect of the test utterance duration. Finally, we propose several methods to combine frame-by-frame posteriors. Experiments are conducted on two different datasets: the public NIST Language Recognition Evaluation 2009 (3 s task) and a much larger corpus of 5 million utterances, known as Google 5M LID, obtained from different Google services. Reported results show relative improvements of DNNs over the i-vector system of 40% on the LRE09 3 s task and 76% on Google 5M LID.
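The abstract mentions combining frame-by-frame posteriors without naming the methods. As an illustration only, the sketch below averages per-frame log posteriors, which is one common way to turn frame-level outputs into an utterance-level decision; it is an assumption, not necessarily one of the authors' methods. The function names, the `eps` floor, and the input layout (one list of per-language probabilities per frame) are hypothetical.

```python
import math


def combine_frame_posteriors(frame_posteriors, eps=1e-10):
    """Average per-frame log posteriors over an utterance.

    frame_posteriors: list of frames, each a list of per-language
    probabilities (assumed layout, not taken from the paper).
    Returns (index of best-scoring language, list of average log scores).
    """
    if not frame_posteriors:
        raise ValueError("need at least one frame")
    num_langs = len(frame_posteriors[0])
    totals = [0.0] * num_langs
    for frame in frame_posteriors:
        for lang, p in enumerate(frame):
            # Floor the probability so that log(0) cannot occur.
            totals[lang] += math.log(max(p, eps))
    scores = [t / len(frame_posteriors) for t in totals]
    best = max(range(num_langs), key=lambda i: scores[i])
    return best, scores


def running_decisions(frame_posteriors, eps=1e-10):
    """Return the decision after each new frame, as a streaming system might."""
    decisions = []
    totals = None
    for frame in frame_posteriors:
        if totals is None:
            totals = [0.0] * len(frame)
        for lang, p in enumerate(frame):
            totals[lang] += math.log(max(p, eps))
        # Dividing by the frame count would not change the argmax.
        decisions.append(max(range(len(totals)), key=lambda i: totals[i]))
    return decisions
```

With such a scheme the decision can be refreshed after every frame, which fits the real-time use the abstract describes. Early decisions may change as more frames arrive.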

File list

Name

frame_gonzalez-dominguez_NN_2015_ps.pdf

Size

585.6 KB

Format

PDF

Collections containing this item

Producción científica en acceso abierto de la UAM [20408]
