Automatic labelling of audio segments into different phonetic units

This undergraduate thesis (Trabajo Fin de Grado) aims to label audio segments as phones and triphones. To that end, we run experiments with systems based on MFCC extraction, delta and delta-delta features, SAT, MMI and fMMI, and finally train several deep neural networks (DNNs) to perform the labelling. We rely mainly on two tools: Kaldi and Theano. Kaldi provides the tools to train the systems that precede the neural networks, and it also lets us train a neural network. The Theano libraries let us train a further set of deep neural networks on GPU architectures. Because the DNNs give better speech-labelling results than the other systems developed in this project, we focus on their results. We compare the DNN results with those of the other systems, and we also compare the DNNs with one another, taking into account training time and accuracy measured over words or frames. The data come from LDC's Switchboard database, serial number LDC97S62, which comprises around 2400 two-sided telephone conversations in English. The database is split into subsets for training and for validating or testing the trained systems. Finally, we discuss future lines of research on DNNs for audio segment labelling, such as convolutional and recurrent networks.

Files in this item: Herreros_Salamanca_Daniel_tfg.pdf (PDF, 822.2Kb)

Author: Herreros Salamanca, Daniel

Except where otherwise noted, this item's license is described as https://creativecommons.org/licenses/by-nc-nd/4.0/
