My project

Tuesday, February 8, 2011

An Intelligent Digital Library System for Biologists

Jeffrey Stone

Dept. of Computer Science,

University of Vermont

jestone@zoo.uvm.edu

Xindong Wu

Dept. of Computer Science,

University of Vermont

xwu@emba.uvm.edu

Marc Greenblatt

Department of Medicine,

University of Vermont

Marc.Greenblatt@vtmednet.org

Abstract

To aid researchers in obtaining, organizing and

managing biological data, we have developed a

sophisticated digital library system that utilizes

advanced data mining techniques. Our digital

library system is centralized with Web links to

publicly accessible data repositories. Our digital

library is based on a framework used for

conventional libraries and an object-oriented

paradigm, and will provide personalized user-centered

services based on the user’s areas of

interest and preferences. To make personalized

service possible, a “user profile” that represents the

preferences of an individual user is constructed

based upon the user’s past activities, goals indicated

by the user, and options. Utilizing these user

profiles, our system will make relevant information

available to the user in an appropriate form,

amount, and level of detail with minimal user effort.

1. Introduction

Recent advances in the fields of computational

biology, cloning and genetics have resulted in vast

amounts of data, which are providing an

unprecedented volume of knowledge to researchers

and medical personnel. This information is

critical for the understanding of biological structure

and function and has enabled the development of

new treatment approaches for disease, such as gene

therapy and pharmacogenetics. However, the

amount of data that the researcher must digest on a

daily basis has become unmanageable. Furthermore,

the view of biological phenomena as being composed

of a number of subdisciplines (e.g., structural

biology, genomics, proteomics, and biochemistry)

has served to further complicate the issue. To obtain

a coherent picture of biological phenomena at the

molecular, cellular, and organism levels, one must

look both at all of these attributes and at the

relationships among them. To do that currently

requires finding which databases contain the relevant

information and then searching through the databases

one by one.

To aid the researcher in this task of information

retrieval and organization, we are developing a

sophisticated digital library system that utilizes data

mining techniques and user profiling to recommend

items to the user. Our digital library system is

centralized with Web links to publicly accessible data

repositories. Based on the framework of a

conventional library, our system will also provide

user-centered services based on a user’s past activities

and preferences.

The core of our project is an agent architecture

that provides advanced services by combining data

mining capabilities with domain knowledge in the

form of a semantic network. The semantic network

will impart a knowledge structure through which the

system can “reason” and draw conclusions about

biological data objects and will provide a federated

view of the many disparate databases of interest to

biologists. In the development of our semantic

network, we have included the concepts from several

established controlled vocabularies, chief among

them being the National Library of Medicine’s

Unified Medical Language System (UMLS). Our

complete semantic network consists of 183 semantic

types and 69 relationships.

2. Library System Design

Our approach begins from the centralized,

structured view of a conventional library, and seeks

to provide access to the digital library through

electronic means including the Internet, while

maintaining the advantages of decentralization, rapid

evolution and flexibility of the Web. The core of our

project is a knowledge object modeling of data

repositories, and an agent architecture that provides

advanced services by combining data mining

capabilities with domain knowledge. The knowledge

objects are defined as an integration of the

object-oriented paradigm with rules, which together

provide a flexible and powerful environment for

deductive retrieval and pattern matching.

Proceedings of the 2004 IEEE Computational Systems Bioinformatics Conference (CSB 2004)

0-7695-2194-0/04 $20.00 © 2004 IEEE

To make personalized service possible, a “user

profile” representing the preferences of an individual

user is constructed based upon the user’s past

activities, goals indicated by the user, and options.

Utilizing these user profiles, our system will make

relevant information available to the user in an

appropriate form, amount, and level of detail, and

especially with minimal user effort.
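
As a rough illustration of the profile described above, the following Python sketch stores the three inputs (past activities, stated goals, and options) and uses them to order candidate items. The class name, the scoring formula, and the sample item titles are assumptions made for this example; the paper does not specify how the profile is represented or scored.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class UserProfile:
    """Hypothetical profile: stated goals, options, and past activity."""
    goals: set = field(default_factory=set)             # topics the user declared
    options: dict = field(default_factory=dict)         # e.g. display preferences
    activity: Counter = field(default_factory=Counter)  # topic -> times viewed

    def record_view(self, topics):
        """Update past-activity counts with the topics of a viewed item."""
        self.activity.update(topics)

    def interest(self, topic):
        """Assumed score: viewing frequency plus a fixed bonus for stated goals."""
        total = sum(self.activity.values())
        freq = self.activity[topic] / total if total else 0.0
        return freq + (1.0 if topic in self.goals else 0.0)

    def rank(self, items):
        """Order item titles (title -> topic list) by summed interest, best first."""
        return sorted(items, key=lambda t: -sum(self.interest(x) for x in items[t]))


# Illustrative data only.
profile = UserProfile(goals={"p53"}, options={"detail": "summary"})
profile.record_view(["p53", "apoptosis"])
profile.record_view(["apoptosis"])
candidates = {
    "Apoptosis review": ["apoptosis"],
    "TP53 mutation database": ["p53"],
    "Kinase survey": ["kinase"],
}
ranked = profile.rank(candidates)
print(ranked)
```

Here the stated goal outweighs raw viewing frequency, so the item matching the goal is listed first even though another topic was viewed more often. A real system would tune that balance.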

3. Semantic network-based dictionary

One crucial component of our digital library

system is a dictionary of biological terminology.

This dictionary will play an important role in

building the user profiles as well as the categorization

rules of each item in our digital library.

In the construction of the dictionary, we are

presented with some difficulties due to the nature of

biological data. Some of the problems encountered

are multiple names for the same protein or gene in

different organisms, the dependence of a function on the

biological state in which it takes place, and

multiple functions for the same protein. These

problems preclude the use of a simple hierarchical

dictionary structure.

To overcome these obstacles and provide a model

that can accurately represent the information contained

in multiple biological databases, we have developed

our dictionary as a semantic network of biological

terminology utilizing a directed-graph-based

paradigm. Our semantic network strives to provide a

categorization of biological concepts and

relationships among these concepts. The semantic

network will impart a knowledge structure through

which our system can “reason” and draw conclusions

about biological data objects. The Unified Medical

Language System (UMLS) contains a large semantic

network of its own that we have used as a base for

our system [1]. However, the UMLS is in some

aspects not general enough for use in categorizing

multiple biomedical databases and also contains too

many terms that are outside of the scope of our

project. Therefore, we have trimmed some of the

detail from the UMLS system and added new types

and relationships to this system to provide a more

general coverage of biological databases. Our

complete semantic network consists of 183 semantic

types and 69 relationships.

Our semantic network is composed of nodes

representing semantic types and relationships between

these nodes. Each node represents a category of

either a biological entity or an event. The entities

and events used in our semantic network result from

a merging of some of the concept names in the

National Library of Medicine’s Unified Medical

Language System and the Gene Ontology

Consortium’s controlled vocabulary [2].
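
To make the directed-graph idea concrete, here is a minimal Python sketch of a semantic network in which nodes are semantic types and edges are labeled relationships. The class, the relation names ("isa", "participates_in"), and the four sample types are assumptions for illustration and are not taken from the actual 183-type network. The simple "reasoning" shown is inheritance of a relationship along "isa" links.

```python
from collections import defaultdict


class SemanticNetwork:
    """Directed graph: nodes are semantic types, edges are labeled relationships."""

    def __init__(self):
        self.edges = defaultdict(set)  # (source, relation) -> set of targets

    def add(self, source, relation, target):
        self.edges[(source, relation)].add(target)

    def ancestors(self, node):
        """All types reachable from node by following 'isa' edges."""
        seen, stack = set(), [node]
        while stack:
            for parent in self.edges.get((stack.pop(), "isa"), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def holds(self, source, relation, target):
        """True if the relation is asserted for source or one of its ancestors
        and points at target or one of target's ancestors."""
        sources = {source} | self.ancestors(source)
        targets = {target} | self.ancestors(target)
        return any(self.edges.get((s, relation), set()) & targets for s in sources)


# Illustrative types and relationships only.
net = SemanticNetwork()
net.add("Enzyme", "isa", "Protein")
net.add("Protein", "isa", "Entity")
net.add("Apoptosis", "isa", "Biological Process")
net.add("Protein", "participates_in", "Biological Process")

print(net.holds("Enzyme", "participates_in", "Apoptosis"))
print(net.holds("Apoptosis", "participates_in", "Enzyme"))
```

Because a node may have several outgoing "isa" edges, this structure can place one protein under more than one category, which a strict tree-shaped dictionary cannot do.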

4. Generation of rules

The rules that are used for the recommendation of

items in our system are generated with the popular

open-source Weka data mining package [3].

Regeneration of rules will take place at the end of

each session, or optionally during a session when

prompted by the user. At the time of regeneration of

the rules, relevant data is extracted from the user’s

profile and passed to the Weka J48 program. This

program will generate the classification rules that will

be the basis for the recommendation of items in the

library. After the rules are generated, they are saved

into the user’s profile.
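
The extraction step can be pictured as writing profile data to ARFF, the plain-text input format that Weka reads. The Python sketch below renders nominal attributes and rows as ARFF text. The function name, the attribute names (semantic_type, source, viewed), and the sample rows are hypothetical; the paper does not say which profile fields are passed to J48.

```python
def profile_to_arff(relation, attributes, rows):
    """Render nominal training data as ARFF text.

    attributes: list of (name, [allowed values]); the last one is the class here.
    rows: list of value tuples in the same order as attributes.
    """
    lines = [f"@relation {relation}", ""]
    for name, values in attributes:
        lines.append(f"@attribute {name} {{{','.join(values)}}}")
    lines += ["", "@data"]
    for row in rows:
        # Reject values that were not declared for their attribute.
        for value, (name, values) in zip(row, attributes):
            if value not in values:
                raise ValueError(f"{value!r} is not a declared value of {name}")
        lines.append(",".join(row))
    return "\n".join(lines) + "\n"


# Illustrative profile data only.
attributes = [
    ("semantic_type", ["gene", "protein", "disease"]),
    ("source", ["genbank", "pubmed"]),
    ("viewed", ["yes", "no"]),
]
rows = [
    ("gene", "genbank", "yes"),
    ("protein", "pubmed", "yes"),
    ("disease", "pubmed", "no"),
]
arff = profile_to_arff("user_profile", attributes, rows)
print(arff)
```

The resulting text could be saved to a file and given to a decision tree learner such as J48, which would then produce classification rules predicting the last attribute from the others.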

5. Conclusions

Through the combination of data mining, user

profiles and a semantic network, our system can aid

researchers in obtaining, organizing and managing

biological data. Through the use of recommendation

agents, our system will help the novice researcher

discover new topics and items to look at, and the

expert to quickly view new items that pertain to their

research area.

6. References

[1] NLM Unified Medical Language System, National Library of Medicine,

http://www.nlm.nih.gov/research/umls

[2] Gene Ontology Consortium,

http://www.geneontology.org

[3] Ian H. Witten and Eibe Frank, Data Mining: Practical machine learning tools with Java

implementations, Morgan Kaufmann, San Francisco, 2000.
