Handwritten text processing: word-image clustering techniques (original title: Procesamiento de textos manuscritos: Técnicas de agrupamiento de imágenes de palabras)

Romero Vejar, Luis Eduardo

UAM_Biblioteca

Author

Romero Vejar, Luis Eduardo

Advisor

Colás Pasamontes, José

Entity

UAM. Departamento de Tecnología Electrónica y de las Comunicaciones

Date

2016-07

Subjects

Telecommunications

URI

http://hdl.handle.net/10486/675124

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license.

Abstract

The processing of handwritten texts, particularly historical texts (incunabula), is an area of great interest in countries with a broad and valuable cultural heritage, as is the case in Spain. To date, the archives and document collections that hold historical documents of this kind have only been able to digitize (photograph) them, both to provide access over the Internet and to preserve a documentary wealth that, in our case, dates back many centuries. The aim of this work is to develop a software tool that extracts the text from photographs of these documents. To that end, the tool will integrate several blocks that preprocess images of pages of historical documents, whose quality is generally far from optimal for legibility. Once preprocessed, each image is segmented at line level and then at word level. The word images that make up each page are thereby extracted and subsequently grouped by similarity to generate visual dictionaries, which let specialists transcribe them manually so that the transcription process can then be automated. A tool will be built in MATLAB that integrates all of these processes so that it can be used by people who are not specialists in computing. Objective evaluation techniques will also be developed, drawing on the documentary resources of this kind available in the HCTLAB group.
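The line-level and word-level segmentation step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which segmentation method the thesis uses, so this example assumes a simple projection-profile approach on an already binarized page. It is written in Python rather than MATLAB, and the names `runs_of_ink`, `segment_lines`, `segment_words` and the `min_gap` parameter are hypothetical and do not come from the thesis.

```python
from typing import List, Tuple

# Assumption: the page is already binarized, with 1 = ink and 0 = background.
Image = List[List[int]]


def runs_of_ink(profile: List[int], min_gap: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) pairs, end exclusive, of non-zero stretches in a
    projection profile. Stretches separated by fewer than min_gap zeros
    are merged into one."""
    runs: List[Tuple[int, int]] = []
    start = None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i                      # a stretch of ink begins
        elif value == 0 and start is not None:
            runs.append((start, i))        # the stretch ends at a blank
            start = None
    if start is not None:
        runs.append((start, len(profile)))

    merged: List[Tuple[int, int]] = []
    for run in runs:
        if merged and run[0] - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], run[1])   # gap too small: same unit
        else:
            merged.append(run)
    return merged


def segment_lines(page: Image) -> List[Tuple[int, int]]:
    """Find text lines as row ranges using the horizontal projection."""
    profile = [sum(row) for row in page]
    return runs_of_ink(profile)


def segment_words(page: Image, line: Tuple[int, int],
                  min_gap: int = 2) -> List[Tuple[int, int]]:
    """Find words inside one line as column ranges using the vertical
    projection. Gaps narrower than min_gap columns are treated as
    spacing between letters, not between words."""
    top, bottom = line
    width = len(page[0])
    profile = [sum(page[r][c] for r in range(top, bottom))
               for c in range(width)]
    return runs_of_ink(profile, min_gap)
```

Each word range returned by `segment_words` could then be cropped from the page and compared with other word images to group them by similarity. Real historical pages have skewed lines and touching strokes, so a fixed `min_gap` would likely need tuning or replacing with a more robust method.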

Files in this item

Name

Romero_Vejar_LuisEduardo_tfg.pdf

Size

2.403Mb

Format

PDF

This item appears in the following Collection(s)

Trabajos de estudiantes (tesis doctorales, TFMs, TFGs, etc.) [19985]

Except where otherwise noted, this item's license is described as https://creativecommons.org/licenses/by-nc-nd/4.0/
