Machine Learning en Bases de Datos de Lenguaje Natural

García Gutiérrez, Álvaro

UAM_Biblioteca

Author

García Gutiérrez, Álvaro

Advisor

Sánchez-Montañés Isla, Manuel Antonio

Entity

UAM. Departamento de Ingeniería Informática

Date

2016-06

Subjects

Computer Science

URI

http://hdl.handle.net/10486/676778

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license.

Abstract

The rapid growth of the Internet in recent years has created a need to organize efficiently the new information that is stored every day. As a result, content-based document management tasks are becoming increasingly important. Text classification is one such task: it consists of organizing texts written in natural language into different classes. Among the possible approaches to this problem, we restrict ourselves to machine learning techniques, that is, processes that build classifiers automatically. These classifiers learn the characteristics of each class through prior training on a set of pre-classified texts. We also cover the different text preprocessing techniques, which must be applied with care in order to solve any given classification problem effectively.

Throughout this work we compare several of these classification and preprocessing techniques, focusing most of our effort on a specific classification problem known as sentiment analysis. This problem consists of classifying texts according to the emotions they express, a task of growing interest given the rapid development of social networks. To this end, we first study the existing classification methodologies in order to understand the limitations and advantages of each. We then select some of these methodologies and test them on different databases, comparing them and ultimately trying to build the classifier that obtains the best results. This process is carried out by analyzing four phases: the prior representation of the documents, feature extraction and selection, the construction of a classifier, and the evaluation of that classifier.
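The four phases named in the abstract can be illustrated with a minimal sketch. It is an assumption-laden toy, not the method used in the thesis: the abstract does not say which classifier, features, or data were used, so a bag-of-words representation, a document-frequency filter, and a multinomial Naive Bayes classifier are swapped in here for illustration. The corpus, the stopword list, and every function name below are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical toy corpus, invented for illustration (not data from the thesis).
TRAIN = [
    ("I loved this film, it was great", "pos"),
    ("what a great and happy ending", "pos"),
    ("great acting, I loved it", "pos"),
    ("I hated this film, it was awful", "neg"),
    ("what an awful and boring ending", "neg"),
    ("boring acting, I hated it", "neg"),
]
TEST = [("a great film, loved it", "pos"), ("awful, boring film", "neg")]

# Assumed stopword list, chosen only to fit the toy corpus.
STOPWORDS = {"i", "a", "an", "and", "it", "this", "was", "what", "the"}


def tokenize(text):
    """Phase 1 (representation): a document becomes a list of lowercase tokens."""
    return re.findall(r"[a-z']+", text.lower())


def select_features(docs, min_count=2):
    """Phase 2 (feature selection): keep non-stopword tokens that appear
    in at least `min_count` training documents."""
    doc_freq = Counter()
    for text, _ in docs:
        doc_freq.update(set(tokenize(text)))
    return {t for t, n in doc_freq.items() if n >= min_count and t not in STOPWORDS}


def train_naive_bayes(docs, vocab):
    """Phase 3 (classifier construction): estimate log priors and
    Laplace-smoothed log likelihoods for each class."""
    class_docs = Counter(label for _, label in docs)
    token_counts = defaultdict(Counter)
    for text, label in docs:
        token_counts[label].update(t for t in tokenize(text) if t in vocab)
    model = {}
    for label, n_docs in class_docs.items():
        total = sum(token_counts[label].values())
        log_prior = math.log(n_docs / len(docs))
        log_like = {
            t: math.log((token_counts[label][t] + 1) / (total + len(vocab)))
            for t in vocab
        }
        model[label] = (log_prior, log_like)
    return model


def predict(model, vocab, text):
    """Return the class with the highest posterior log score."""
    tokens = [t for t in tokenize(text) if t in vocab]
    scores = {
        label: log_prior + sum(log_like[t] for t in tokens)
        for label, (log_prior, log_like) in model.items()
    }
    return max(scores, key=scores.get)


def accuracy(model, vocab, docs):
    """Phase 4 (evaluation): fraction of held-out documents classified correctly."""
    correct = sum(predict(model, vocab, text) == label for text, label in docs)
    return correct / len(docs)


vocab = select_features(TRAIN)
model = train_naive_bayes(TRAIN, vocab)
print("selected features:", sorted(vocab))
print("held-out accuracy:", accuracy(model, vocab, TEST))
```

A corpus this small says nothing about real performance. The sketch only shows how the phases feed into one another: the feature set chosen in phase 2 constrains what the classifier in phase 3 can learn, and phase 4 measures the result on texts the classifier has not seen.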

Files in this item

Name

Garcia_Gutierrez_Alvaro_tfg.pdf

Size

1.056 MB

Format

PDF

This item appears in the following Collection(s)

Student works (doctoral theses, master's theses, bachelor's theses, etc.) [19985]

Except where otherwise noted, this item's license is described as https://creativecommons.org/licenses/by-nc-nd/4.0/
