Using Big Data tools for topic modeling of university speeches and clustering of the results

Aguirre Tamaral, Alejandro

Author

Aguirre Tamaral, Alejandro

Entity

UAM. Departamento de Ingeniería Informática

Date

2016-01

Subjects

Big data; World Wide Web; Computer science

URI

http://hdl.handle.net/10486/670710

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license.

Abstract

A student's reasons for choosing one university among the many available can be difficult to quantify. Apart from distance, the main reason for the choice is plausibly the reputation of the university. Several rankings classify the most prestigious universities based on certain features they possess. There are therefore attributes that allow one university to obtain a better classification than another (according to the criteria of the ranking in question). This Bachelor's Thesis (Trabajo de Fin de Grado, TFG) examines the hypothesis that the topics addressed by university rectors in their speeches could serve as an additional attribute for classifying universities. The work covers both the collection (crawling) of these speeches and their processing to obtain results. In this way, the search parameters can be adjusted to the requirements of the analysis. Moreover, since the data are known, they can be pre-processed to optimise the results obtained by the text analytics algorithms. Regarding the topics of the speeches, the first step is to follow a methodology for collecting multiple speeches from the universities to be analysed. Several data collection methods are combined to obtain the largest possible number of samples. Once the speeches have been obtained and validated, they can be analysed with various text analytics algorithms. After running an algorithm that finds the topics of each speech, differences can be sought between the speeches of highly ranked universities and those of lower-ranked ones. In this way, it is possible to check whether the difference in classification between universities is discernible through a resource as accessible as a rector's speech.

In addition to the topics, one can also try to distinguish a university's classification by the number of speeches hosted on its website and by how easily they can be downloaded (which depends on how well organised the website is).
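The comparison step described in the abstract can be illustrated with a minimal sketch. It does not reproduce the thesis's pipeline or its topic modeling algorithm, neither of which this record specifies. As a simpler stand-in, it pre-processes speeches (lowercasing, tokenising, stop-word removal) and lists the terms whose relative frequency is higher in one group of speeches than in another. The stop-word list, the function names, and the toy speeches are all hypothetical and were made up for illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "our", "we",
              "is", "for", "on", "will"}

def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower())
            if t not in STOP_WORDS]

def relative_frequencies(speeches):
    """Return each term's share of all tokens across a group of speeches."""
    counts = Counter()
    for speech in speeches:
        counts.update(tokenize(speech))
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()} if total else {}

def distinctive_terms(group_a, group_b, top_n=3):
    """Terms from group_a ranked by how much more frequent they are than in group_b."""
    freq_a = relative_frequencies(group_a)
    freq_b = relative_frequencies(group_b)
    diffs = {t: freq_a[t] - freq_b.get(t, 0.0) for t in freq_a}
    # Sort by difference (descending), then alphabetically for a stable order.
    ranked = sorted(diffs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:top_n]]

# Toy, made-up speeches (assumption: NOT data from the thesis).
high_ranked = [
    "Our research funding will support research in science.",
    "We invest in research and innovation.",
]
lower_ranked = [
    "The campus buildings and the campus parking.",
    "We welcome students to the campus.",
]

print(distinctive_terms(high_ranked, lower_ranked))
print(distinctive_terms(lower_ranked, high_ranked))
```

A term-frequency difference is only a rough proxy for topics. A topic modeling algorithm would instead group co-occurring terms into topics and assign each speech a mixture of them, and the same group comparison could then be applied to topic weights rather than to individual terms.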
