Data mining from scientific literature

K Shah, Parantu

UAM_Biblioteca

Author

K Shah, Parantu

Advisor

Bork, Peer; Valencia Herrera, Alfonso

Entity

UAM. Departamento de Biología Molecular

Date

2005-12-21

Subjects

Bioinformatics - Doctoral theses; Biology and Biomedicine / Biology

URI

http://hdl.handle.net/10486/674061

Note

Unpublished doctoral thesis defended at the Universidad Autónoma de Madrid, Facultad de Ciencias, Departamento de Biología Molecular. Date of defense: 21-12-2005

Abstract

Function annotation in the genomic context is one of the major challenges facing the discipline of Bioinformatics today. Sequences of entire genomes are continuously being deposited in public databases, waiting to be analyzed and annotated. Computational methods and data coming out from various types of high-throughput experiments are now being used to assist in functional annotations and knowledge discovery. Published findings, mostly analyzing the roles of individual genes, are used for gene annotations. Similarly, curated sets of facts established in the literature are required in order to check the quality of computational methods and analysis of high-throughput data. Hence, there is a great demand for information extraction tools to extract structured information about genes and gene products from scientific literature automatically and prepare knowledgebases. Before one sets out to devise tools for information extraction from scientific literature, several questions must be answered. Where does the useful information reside? Is this information structured enough to be extracted? What tools should be utilized for accurate retrieval and extraction of information? Also, how useful is mining of information from biomedical texts for advancing the present level of knowledge? Moreover, tools developed for processing general English should also be checked for their usability on biomedical texts. The work presented in this thesis tries to answer the questions posed above. Keyword-based analysis of full-text articles from Nature Genetics was carried out in order to analyze and compare the distribution of information in different sections of papers. Keyword-based methods, while very useful for exploring the overall structure and contents of articles, do not provide the exact relationships mentioned in the literature. Biologically important events and relationships can only be extracted using structured templates based on the contents of sentences describing events of interest, which is a non-trivial task.
The potential of predicate argument structures for providing semantic templates for accurate information extraction was explored for verbs describing gene expression, molecular interactions and signal transduction. Predicate argument structures (PAS) were defined for important verbs by analyzing sentences from abstracts as well as full-text articles; they were then compared systematically with PropBank PAS for general English in order to characterize domain-specific usage of predicates in biomedical texts. A database of transcript diversity was generated using a composite procedure that combined retrieval of appropriate sentences from MEDLINE and extraction of information using rules based on PAS. Support vector machines proved to be the best sentence categorization/retrieval method when compared to other retrieval methods. LSAT, a database of alternative transcripts, was generated after the PAS-based information extraction step. Information residing in LSAT was utilized for MeSH term and gene annotations, and for studying the extent of synergy and preference of different transcript diversity generating mechanisms by different organ systems.

Files in this item

Name

shah_parantu_k.pdf

Size

20.05Mb

Format

PDF

Description

Text of the doctoral thesis

This item appears in the following Collection(s)

Student works (doctoral theses, master's theses, bachelor's theses, etc.) [19966]
