Friday, January 20, 2012

Gene clustering using Latent Semantic Indexing of MEDLINE abstracts

http://memphis.edu/binf/RaminWebpage.htm

Recent advances in genomics and DNA microarray technology enable investigators to simultaneously analyze the expression of thousands of genes under different experimental conditions. However, understanding the functional relationships between co-regulated genes is a formidable task, requiring first-hand knowledge of the biological characteristics of each gene. There are a variety of public electronic resources from which investigators may assemble gene information. For instance, there are over 10,000 annotated human genes in LocusLink and nearly 13 million citations archived in MEDLINE. However, better automated tools are needed to aid in the extraction and utilization of gene information from these databases. My lab has been collaborating with Dr. Michael Berry (Professor of Computer Science at The University of Tennessee, Knoxville; http://www.cs.utk.edu/~berry/) to develop a new software environment called Semantic Gene Organizer (SGO) (http://shad.cs.utk.edu/sgo/sgo.html) to automatically extract gene relationships from titles and abstracts in MEDLINE citations. SGO uses a variant of the vector-space model of information retrieval called Latent Semantic Indexing (LSI). LSI applies a classical factorization method from linear algebra (singular value decomposition) to identify conceptual relationships between documents. Our studies have provided proof-of-principle that LSI is a robust automated method for identifying gene-to-keyword and gene-to-gene relationships from the biological literature. Future aims of this project include: 1) expanding the gene-document collection to include all genes in the LocusLink database; and 2) using SGO to expand gene ontology terms and functional gene annotation.
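To make the LSI step concrete, here is a minimal sketch in Python with NumPy. It is not SGO's code. The gene names, the toy text, the raw-count weighting, and the choice of k = 2 are all assumptions made for illustration. Each "gene document" stands in for the pooled MEDLINE titles and abstracts about one gene. The sketch builds a term-by-document matrix, truncates its singular value decomposition, and compares genes by cosine similarity in the reduced space.

```python
import numpy as np

# Hypothetical toy "gene documents": each stands in for the pooled MEDLINE
# titles and abstracts about one gene. Names and text are invented.
gene_docs = {
    "geneA": "kinase phosphorylation signaling kinase receptor",
    "geneB": "receptor signaling phosphorylation pathway",
    "geneC": "ribosome translation rna ribosome protein translation",
    "geneD": "translation rna protein synthesis",
}


def build_term_document_matrix(docs):
    """Return (terms, names, matrix); rows are terms, columns are documents."""
    names = list(docs)
    terms = sorted({word for text in docs.values() for word in text.split()})
    row_of = {term: i for i, term in enumerate(terms)}
    matrix = np.zeros((len(terms), len(names)))
    for col, name in enumerate(names):
        for word in docs[name].split():
            # Raw counts only (an assumption); real systems usually
            # apply a term-weighting scheme before factorizing.
            matrix[row_of[word], col] += 1.0
    return terms, names, matrix


def lsi_document_vectors(matrix, k):
    """Represent each document in a rank-k latent space via truncated SVD."""
    _, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # Keep the k largest singular values; scale document coordinates by them.
    return (s[:k, None] * vt[:k, :]).T


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.where(norms == 0, 1.0, norms)
    return unit @ unit.T


terms, names, tdm = build_term_document_matrix(gene_docs)
doc_vectors = lsi_document_vectors(tdm, k=2)
similarity = cosine_similarity_matrix(doc_vectors)

for i, name in enumerate(names):
    print(name, np.round(similarity[i], 2))
```

In this toy set, geneA and geneB land close together in the latent space, as do geneC and geneD, while the two groups stay apart. A gene-to-keyword query could be handled in the same space by projecting a keyword vector with the left singular vectors, which this sketch leaves out.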
