Discovery by scent: Discovery browsing system based on the Information Foraging Theory

2012 IEEE International Conference on Bioinformatics and Biomedicine Workshops (BIBMW), 978-1-4673-2747-3/12/$31.00 ©2012 IEEE

J. Caleb Goodwin & Trevor Cohen, MBChB, PhD

The University of Texas at Houston Houston, Texas

[email protected] [email protected]

Thomas Rindflesch, PhD The National Library of Medicine

Bethesda, Maryland [email protected]

Abstract - This work presents a discovery browsing system based on the Information Foraging Theory (IFT). Discovery browsing is a type of information seeking behavior where the expert user interacts iteratively with a literature-based discovery system to explore poorly understood relationships with the end goal of formulating a hypothesis or gaining insight by uncovering novel points of view. The mathematical model underlying the IFT is predictive of information seeking behavior of foragers on the World Wide Web (WWW) in a plethora of scenarios. We hypothesize that a discovery browsing system based upon the IFT can assist the user in the process of discovery by automatically making available the concepts to which the user would most likely attend. Given initial terms from a user, the discovery browsing system mines a semantic network of over 26 million object-relation-object pairs from 7.8 million MEDLINE citations and presents a ranked sub-graph, which is the prediction of where the interesting concepts (ideally discoveries) lie. In this work, we present the theoretical foundations and design of the discovery browsing system. To demonstrate its efficacy, we replicate two recent discoveries and demonstrate that it is able to predict the concepts that were determined as playing a role in novel hypotheses proposed by scientists.

Keywords: Information foraging theory, Literature-based discovery, spreading activation, discovery browsing

I. INTRODUCTION

Paradoxically, a known contributor to knowledge deficiency in science is the body of scientific knowledge itself [1]. As more knowledge is created, one strategy devised by scientists to deal with the resulting information overload is to specialize in narrowly focused areas [2, 3]. The result is an increasingly compartmentalized body of scientific knowledge, which can result in undiscovered public knowledge [2]. According to Swanson, “knowledge can be public, yet undiscovered, if independently created fragments are logically related but never retrieved, brought together, and interpreted” [3]. An emerging discipline known as literature-based discovery (LBD) seeks to reconnect the fragmented body of literature by developing methods that facilitate hypothesis generation by uncovering unknown relationships [2].

This work extends the existing LBD system known as Semantic MEDLINE [4], which relies upon semantic predications created using the natural language processing (NLP) tool SemRep [5]. SemRep extracts concept-relation-concept triplets known as predications from MEDLINE citations (titles and abstracts). SemRep predications have shown potential for uncovering latent knowledge in several studies [6, 7]; however, a predication graph for a given seed term can contain hundreds of concepts and thousands of relationships, which places an extreme cognitive load on the user.

One method for dealing with the potential explosion of relationships is to develop systems to facilitate discovery browsing. Wilkowski et al. [6] presented the notion of discovery browsing as human-machine cooperative reciprocity where the user “focuses system output iteratively based on stipulations that bring relevant relations into clearer focus by narrowing choices, thus controlling the explosion of potential relationships often generated in LBD”. Their initial conceptualization of discovery browsing relied upon the degree centrality of the concepts, a measure of the importance of a vertex in a graph [8], to rank paths, and upon human input to direct the focus of the browsing. In this work, we extend that initial work by investigating methods that give the machine a more active role in discovery browsing, namely predicting the concepts to which the user would most likely attend. The ultimate goal is to develop a system that explores multiple paths simultaneously at several levels of depth from the seed terms and provides recommendations to the user either by ranking concepts or ranking sub-graphs.

The foundation of this work lies in computational cognitive modeling. Cognitive models have been used to model human behavior in a myriad of scenarios [9, 10]. In this work, we develop an initial model that replicates the results of discovery browsing. The intuition behind this approach is that if cognitive models of discovery browsing can be developed then these models can serve as a foundation for agents that can actively assist the user in discovery browsing by making active recommendations.

In this paper, we describe a discovery browsing system grounded in the computational cognitive model of information seeking behavior known as the IFT. According to IFT, the utility of an information item is assessed by its information scent, which can be thought of as a “rational analysis of categorization of cues according to their expected utility” [9]. According to IFT, users attend to the cues with the highest expected utility given their information need. Therefore, if one can accurately model information scent, it is possible to predict the items that the user would most likely access.

The underlying computation of information scent in this work and previous invocations of information scent [11] is based on the Adaptive Control of Thought-Rational (ACT-R) theory of human associative memory [10]. According to this theory, human semantic memory is contained in a semantic network and the retrieval of concepts from the semantic network is based on a spreading activation mechanism that computes the probability of a memory item being needed based on the current context and past use. The spreading activation algorithm, based on the ACT-R theory of human associative memory, traverses the predication graph to identify a sub-graph composed of the concepts with the highest information scent, which is the prediction of where interesting relations and concepts (ideally discoveries) lie.

This work presents a novel advance in the use of predications for LBD by developing a model that assists users in exploring the predication space by making active predictions based on formal models of human information seeking behavior. In addition, to the best of our knowledge, this work presents the first LBD system developed based on insights from IFT or rational computational cognitive models in general. Finally, though similar algorithms based on concepts from spreading activation (known as Constrained Spreading Activation (CSA) [12]) have been applied to information retrieval (IR), this work presents the first attempt at using this family of models for LBD.

II. BACKGROUND

A. Literature-based Discovery

The field of LBD began with the pioneering work of Swanson [2]. The two major variants of LBD are known as open and closed discovery, both of which follow the so-called “A-B-C paradigm” [13]. In open discovery, the user begins with an initial seed term (A term), such as “Raynaud’s Disease”, and explores related terms to identify previously unrecognized connections. The concepts that link to the A term are known as the linking terms (B terms). For example, if “Raynaud’s Disease” is the A term, a possible linking term (B term) is the concept “Platelet Aggregation”. In the next step, the concepts that link to the B terms are known as the target terms (C terms). In closed discovery, the inquiry begins with both A and C terms, and the discovery involves finding the linking term(s) (B terms) that connect the knowledge related to the A and C terms.
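The A-B-C paradigm can be illustrated with a minimal Python sketch. The predications below are a small hand-made toy set (the relation names and every concept other than “Raynaud’s Disease” and “Platelet Aggregation” are assumptions chosen for illustration, not data from this study), and the function names are hypothetical.

```python
from collections import defaultdict

# Toy (subject, relation, object) triples; illustrative only, not study data.
PREDICATIONS = [
    ("Raynaud's Disease", "ASSOCIATED_WITH", "Platelet Aggregation"),
    ("Fish Oil", "INHIBITS", "Platelet Aggregation"),
    ("Fish Oil", "AFFECTS", "Blood Viscosity"),
    ("Raynaud's Disease", "ASSOCIATED_WITH", "Blood Viscosity"),
    ("Aspirin", "INHIBITS", "Platelet Aggregation"),
]


def build_neighbors(predications):
    """Undirected adjacency: concept -> set of directly linked concepts."""
    neighbors = defaultdict(set)
    for subj, _relation, obj in predications:
        neighbors[subj].add(obj)
        neighbors[obj].add(subj)
    return neighbors


def open_discovery(neighbors, a_term):
    """Map each C term (not directly linked to A) to the B terms that reach it."""
    candidates = defaultdict(set)
    for b_term in neighbors[a_term]:
        for c_term in neighbors[b_term]:
            if c_term != a_term and c_term not in neighbors[a_term]:
                candidates[c_term].add(b_term)
    return dict(candidates)


def closed_discovery(neighbors, a_term, c_term):
    """Return the B terms linked to both the A term and the C term."""
    return neighbors[a_term] & neighbors[c_term]


neighbors = build_neighbors(PREDICATIONS)
open_result = open_discovery(neighbors, "Raynaud's Disease")
closed_result = closed_discovery(neighbors, "Raynaud's Disease", "Fish Oil")
```

Open discovery starts from the A term alone and returns candidate C terms together with the B terms that reach them; closed discovery starts from both ends and returns only the shared B terms.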

B. Related works

An in-depth review of all LBD methods is beyond the scope of this paper and the interested reader is directed to [14]. For this paper, we focus the review on methods that utilize semantic predications for LBD. Semantic MEDLINE provides an alternative method for browsing the MEDLINE literature using semantic relations [4] and has been used in several LBD studies [6, 7]. Semantic MEDLINE relies upon the NLP system SemRep [5] to extract semantic predications in the form (subject, predicate, object) from MEDLINE abstracts.

The first paper to propose leveraging predications for LBD is [15], and subsequent work leveraged the approach for drug discovery [16] and hypothesis generation by integrating predications with microarray data [17]. These works rely upon the notion of the discovery pattern, which can be viewed as performing inference over the predication graph utilizing rules. Ahlers et al. [16] describe the discovery pattern as containing “a set of conditions to be satisfied for the discovery of new relations between concepts”. Since the initial work of Hristovski, the Semantic MEDLINE system has been used to propose several hypotheses [6, 7]. Specifically, these studies proposed a mechanistic link between cortisol, testosterone, and age-related sleep quality decline [7] and an underlying etiology relating sleep and depression [6]. The work by Cohen [18-20] is focused on encoding the predications in a high-dimensional vector space. A notable insight of this approach is the ability to perform analogical reasoning and encode typed relations in the high-dimensional space.

C. Overview of cognitive architectures and cognitive modeling

The term “cognitive architecture” was first introduced to cognitive science in 1971 [21]. According to Anderson [10], a cognitive architecture is a “specification of the structure of the brain at a level of abstraction that explains how it achieves the function of the mind”. The ACT-R theory asserts that the mind is comprised of structural modules and these modules correspond to brain regions. Example modules include the declarative memory module and visual perception module. The function of the mind (cognitive processes), according to the ACT-R theory, emerges through interaction of the modules. Thus, a cognitive model within the ACT-R cognitive architecture is a specification of the interaction of multiple modules. IFT is a cognitive model created within the cognitive architecture ACT-R and is used to understand a specific cognitive process (e.g. information seeking on the WWW). Likewise, in this investigation, we are interested in gaining insight into the specific task of discovery browsing and the model exists within the theoretical foundation of the ACT-R architecture.

The ACT-R architecture, upon which this work and IFT draw, is considered a hybrid symbolic-subsymbolic system. The ACT-R theory originated with Anderson’s theory of human associative memory (HAM) [22] and was subsequently updated with insights from rational analysis (discussed in detail in Section III) to include Bayesian interpretations of several components of the architecture [23-25]. Increasingly, the ACT-R architecture is informed by insights from neuroscience, with a major focus of current ACT-R research being the prediction of blood oxygen level-dependent (BOLD) responses in different brain regions during performance of a cognitive task [26].

III. THEORETICAL AND MATHEMATICAL FRAMEWORK

A. Information scent and literature-based discovery

Rational analysis is a methodology for studying cognition that focuses on understanding the demands placed on the cognitive system by the environment to arrive at a model of the computational problem that the agent is solving based on these demands [10]. This approach arises from a line of research that has investigated the interplay between information systems and human memory. Interestingly, the rational analysis methodology, which was created initially for studying human memory, began with the observation made by Anderson (shown below) that human memory and information management systems face the same computational challenge.

The framing we offer for human memory comes from a subfield of computer science called, curiously, information retrieval. The generic information retrieval system has a database of stored items and must respond with a subset of these, given a query that consists of some keywords. …. Such systems have to deal with access to very large databases in the presence of very limited and uncertain cues. We think this is essentially the human situation. [27]

Following this initial observation, numerous studies have contributed by developing information management systems based on insights from rational models of cognition. Examples of this include predicting document accesses [28, 29] and personalized document management systems [30]. Furthermore, insight into human cognition has resulted by investigating the computational problems solved by information management systems. An example is insight gained into human semantic network organization by investigating how computers organize graph-based information [31]. This work is best understood in the context of this lineage of research and has the overarching goal of leveraging the insights from the rational analysis of cognition (specifically IFT) to develop a theoretical foundation for LBD as well as systems that can assist the user in exploring the predication space.

The goal of IFT is to provide insight into information seeking behavior and the underlying cognitive processes that guide these strategies. According to [9], information scent refers to “the detection and use of cues, such as World Wide Web links or bibliographic citations, that provide users with concise information about content that is not immediately available”. For example, consider the search results of a typical search engine shown in Figure 1. According to IFT, the user will select the link with the highest information scent based on proximal cues such as the Web Page title to maximize the probability of satisfying the information need with the distal information content (i.e. the Web page associated with a hyperlink).

Figure 1: Information scent and the WWW. Adapted from [9]

In mapping information scent to discovery browsing on Semantic MEDLINE, the potential proximal cues are concept labels, relation types, and semantic types of the concepts connected to the seed terms. An additional proximal cue, which has proven useful in previous studies using Semantic MEDLINE for LBD [6, 7], is the structure of the predication graph generated from the seed terms (e.g. degree centrality). In discovery browsing, the user enters seed terms and then evaluates the utility of the connected concepts based on the aforementioned cues. Figure 2 represents the view of the forager after the initial seed terms. In the context of discovery browsing, the utility can be thought of as “discovery scent”, which is the forager’s assessment of whether a concept lies on the path to discovery. The forager explores some of the distal concepts by modifying the seed terms, which shifts the direction of the search toward areas that the forager expects to lead to a discovery or to a poorly understood area.

The goal of this initial work is to model information scent for discovery browsing. Such a model should rank highly the concepts that the user finds interesting and should also reflect the decision points that the user made in focusing the output of an LBD system during discovery browsing. The initial conceptualization of information scent for discovery browsing presented in this work uses a spreading activation algorithm inspired by the ACT-R theory of human memory. This particular instantiation of spreading activation deviates from the previous ACT-R and IFT models in that the mathematical framework has been modified to take advantage of the structure of the predication graph as well as recent insights in smoothing probability distributions using graph-based algorithms [32]. However, the theoretical motivation is unchanged: we seek to rank the concepts based on the posterior probability (information scent) that the user would attend to one of the linked concepts.

[Figure 1 labels: axes “Time” and “Information scent (Expected Utility)”; elements: current Web page, hyperlinks, linked pages.]

Figure 2: Information scent and LBD

The overarching goal of formalizing discovery browsing within the theoretical framework of IFT is two-fold. First, discovery browsing is a potentially new class of information seeking behavior distinguished by expert searchers who are seeking to formulate novel hypotheses. To the authors’ knowledge, discovery browsing does not fit easily into any of the existing information seeking paradigms, which limits our ability to rationally leverage techniques from well-studied fields such as IR to support users in discovery browsing. Second, the use of computational cognitive models can guide the development of tools to support the user in discovery browsing by automatically making available the concepts to which they would most likely attend (distal concepts in Figure 2) based on their expression of the information need (current concept in Figure 2), thereby enabling the user to browse the predication space in active collaboration with the machine.

B. Information scent calculation

In this work and many previous applications of IFT, information scent is calculated using the computational model of the ACT-R theory of human memory [10]. It is important to note that the assumptions of the ACT-R human memory theory and IFT differ, but the rational analyses of the computational task are similar and consequently the mathematical framework is identical. The ACT-R theory of human memory is based on the hypothesis that the human memory system actively predicts the memory items most likely to be needed based on the current context and past access of memory items [27]. In contrast, IFT is based on the idea that the forager is assessing the utilities (information scent) of external items based on proximal cues (e.g. textual description for a hyperlink) and selects the proximal cue (i.e. hyperlink) that will most likely satisfy the user’s information need. In both cases, the computational problem faced by the agent is calculating the utility of distal information given the proximal cues.

The Bayesian spreading activation equation used in ACT-R and IFT is presented in Equation 1, where $A_i$ denotes the activation (information scent) of concept $i$ and the sum runs over the sources of evidence $j$. The parameter $B_i$ is known as the base-level activation and reflects the prior probability of a given item being accessed independent of the current context. The parameter $S_{ji}$ is known as the strength of association and measures how likely a source of evidence (a seed term or a term in the second level of inference) is to be encountered given another concept and is thus context dependent. In Bayesian terms, this is known as the likelihood. The attentional weight ($W_j$) is a measure of the validity of a given piece of evidence. For example, a term could be weighted based on an entropy measure to take into account term generality.

$$A_i = B_i + \sum_j W_j S_{ji} \qquad (1)$$
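As a minimal sketch of how the spreading activation equation ranks concepts, the Python below assumes the standard ACT-R form, activation = base-level activation + sum over sources of (attentional weight × strength of association). The function name, dictionary layout, and numeric values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def rank_by_activation(base_levels, weights, strengths):
    """Rank target concepts by A_i = B_i + sum_j W_j * S_ji.

    base_levels: target -> base-level activation B_i
    weights:     source -> attentional weight W_j
    strengths:   target -> {source: strength of association S_ji}
    """
    scores = {}
    for target, base in base_levels.items():
        spread = sum(
            weights[source] * strength
            for source, strength in strengths.get(target, {}).items()
        )
        scores[target] = base + spread
    # Highest activation (information scent) first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical values: two seed terms "a" and "c", two candidate concepts.
ranked = rank_by_activation(
    base_levels={"x": -1.0, "y": -0.5},
    weights={"a": 1.0, "c": 1.0},
    strengths={"x": {"a": 0.8, "c": 0.6}, "y": {"a": 0.2}},
)
```

Here concept "x" starts with the lower prior but is associated with both seed terms, so the context term outweighs the prior and it is ranked first.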

In this work, the base-level activation of a given concept in the predication graph is estimated based on degree centrality, which is motivated by previous works demonstrating the utility of degree centrality in LBD [6, 7] and recent studies providing evidence that degree centrality computed on a semantic network is reflective of the probability of an item being retrieved [31]. The degree centrality measure used in this work (presented in Table 1) originates with [33], which provides a generalization of degree centrality for weighted graphs by taking into account the number of ties (Equation 2) and the weight of ties (Equation 3), combined using a linear integration parameter $\alpha \in [0, 1]$ (Equation 4).

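TABLE 1. Prior probability equations

$$k_i = \sum_j x_{ij} \qquad (2)$$

$$s_i = \sum_j w_{ij} \qquad (3)$$

$$B_i = (1 - \alpha) \log k_i + \alpha \log s_i \qquad (4)$$

Here $k_i$ is the number of ties of concept $i$ ($x_{ij} = 1$ if concepts $i$ and $j$ are connected and 0 otherwise), $s_i$ is the total weight of those ties ($w_{ij}$ is the weight of the tie between $i$ and $j$), $\alpha$ is the linear integration parameter, and $B_i$ is the base-level activation.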
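The prior can be sketched in Python under the assumption that it is the logarithm of the weighted degree centrality of [33], that is, the number of ties raised to (1 - α) times the total tie weight raised to α. The function name and the example weights are hypothetical.

```python
import math


def log_degree_centrality(tie_weights, alpha=0.5):
    """Log of the generalized degree centrality k**(1 - alpha) * s**alpha.

    tie_weights: weights of the ties incident to one concept
    alpha:       integration parameter in [0, 1]; 0 counts ties only,
                 1 uses the total tie weight only
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    k = len(tie_weights)   # number of ties
    s = sum(tie_weights)   # total weight of ties
    if k == 0 or s <= 0:
        return float("-inf")  # isolated concept: zero prior mass
    return (1 - alpha) * math.log(k) + alpha * math.log(s)


# A concept with three ties of weight 2, 3 and 5 (hypothetical counts).
prior = log_degree_centrality([2, 3, 5], alpha=0.5)
```

With α = 0.5 the measure is the log of the geometric mean of tie count and tie weight, so a concept is rewarded both for having many distinct neighbors and for being mentioned often with them.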

Before proceeding, two concepts need to be defined. The discovery graph ($G_D$) is the subset of the predication graph ($G_P$) wherein the likelihoods of the concepts given the seed terms are calculated. For example, given two seed terms A and C, the $G_D$ for the first level of inference would comprise the subset of the $G_P$ containing the concepts A and C and all of the intermediary concepts that link A and C. The statistics computed on the $G_P$ are query-independent and are used for estimating the prior probability and background language model of the concepts in the information scent calculation. The statistics computed on the $G_D$ are query-dependent and are used for computing the context (i.e. likelihood) of a concept being relevant given the seed terms.

The strength of association equations are presented in Table 2. These equations are based on a generalized framework for smoothing language models using graphs [32]. Language models originated in machine translation research [34] and speech recognition [35] and were first applied to IR by Ponte and Croft [36]. Language models have many desirable properties. For example, they provide theoretical justification for commonly used heuristics such as term frequency (TF), inverse document frequency (IDF) weighting, and document length normalization [37-39]. This particular language model computes the likelihood of a concept in a graph given the seed terms while taking into account properties of the graph such as the degree centrality of the target concept in the discovery graph ($G_D$) and the semantic relatedness between the target and source.

Language models smooth the probability distributions with respect to a background language model. For example, a language model for a document is smoothed with the language model constructed from a large corpus of text. In this case, the $G_D$ language model and the background language model are combined by linear interpolation. The background language model is defined according to Equation 5 and the $G_D$ language model is defined according to Equation 6. The smoothing parameter $\lambda$ controls the linear integration and takes values in $[0, 1]$. In Equation 7, there are several possible instantiations of the semantic relatedness and degree centrality terms. In this work, the semantic relatedness is calculated based on the cosine distance between the two concepts in a high-dimensional space. The concept space was built using a method similar to the Predication-based Semantic Indexing (PSI) method [20] using the Semantic Vectors Package Version 3.6. For each concept, a vector is encoded that allows for measurable associations with other concepts in the predication graph regardless of the relation type as well as measurable associations between concepts that occur with the same relation type. The degree centrality term is the degree centrality of a concept in the $G_D$ computed according to Equation 4.

[Figure 2 labels: axes “Time” and “Information scent (Expected Utility)”; elements: current concept, semantic relations, distal concepts.]
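The semantic relatedness term can be illustrated with a plain cosine similarity between two concept vectors. This is only a sketch: the vectors below are made-up three-dimensional examples, whereas the concept space in this work is built with a PSI-like method that is not reproduced here, and the function name is hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for a zero vector
    return dot / (norm_u * norm_v)


# Hypothetical concept vectors.
parallel = cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

Because the measure depends only on direction, concepts that are mentioned with very different frequencies can still be judged closely related.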

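TABLE 2. Strength of association equations

(5) Background language model
(6) Discovery graph language model
(7) Strength of association: logarithm of the linear interpolation of the two language models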
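The smoothing step can be sketched as a linear interpolation of two probabilities followed by a logarithm. This is an assumption-laden illustration, not the exact form of Equations 5 to 7: in particular, which of the two models receives the weight λ is assumed here, and the function name and probabilities are hypothetical.

```python
import math


def smoothed_strength(p_discovery, p_background, lam=0.9):
    """log((1 - lam) * p_discovery + lam * p_background).

    p_discovery:  probability of the target under the query-dependent model
    p_background: probability of the target under the query-independent model
    lam:          smoothing parameter in [0, 1] (assumed to weight the
                  background model in this sketch)
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    mixed = (1.0 - lam) * p_discovery + lam * p_background
    return math.log(mixed) if mixed > 0.0 else float("-inf")


# A target never seen with the source still gets a finite score
# because the background model contributes probability mass.
unseen = smoothed_strength(p_discovery=0.0, p_background=0.01, lam=0.9)
seen = smoothed_strength(p_discovery=0.5, p_background=0.01, lam=0.9)
```

The practical effect of smoothing is visible in the first call: without the background term the logarithm would be undefined for a zero probability.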

The information scent algorithm is defined as follows and presented in Table 3. Two steps of inference are used in this study. The source concept spreads activation to all concepts that are connected to it by a semantic relation. In the first level of inference, the attentional weight of each seed term is set to the initial seed weight, and the posterior probability of the target concepts is computed using Equation 1. In the second level of inference, the top N concepts with the highest scent are selected and the relations from these top N concepts and existing concepts in the graph are added to the discovery graph. The idea is similar to pseudo-relevance feedback in IR, where the top N ranked documents are selected and used as evidence to re-rank the remaining documents [40]. In the second level of inference, the attentional weight for the top N concepts is set to the posterior probability computed in the first level. This has the effect of causing the activation to flow preferentially through the network. The top N concepts are treated as sources (weighted by their attentional weights), the posterior probability of the concepts in the third layer is computed using Equation 1, and the ranked concepts in the discovery graph are returned.

TABLE 3. Information scent algorithm

global variables:
    predicationGraph, the complete predication graph
    levelsOfInference, number of levels of inference
    initialSeedWeight, initial attentional weight given to seed terms
    topN, number of nodes for each level of inference

functions:
    createDiscoveryGraph (vertexList, discoveryGraph)
        inputs: vertexList, list of vertices whose connections are added
                discoveryGraph, network of vertices
        returns: discoveryGraph, updated with relations from vertexList

function spreadingActivation (seedVertices, discoveryGraph)
    inputs: seedVertices, list of seed vertices
            discoveryGraph, graph for spreading activation
    local variables: activeList, list of vertices that have not spread activation

    for vertex in seedVertices
        vertex.attentionalWeight ← log(initialSeedWeight)
        activeList ← vertex

    createDiscoveryGraph(activeList, discoveryGraph)

    for i := 1 to levelsOfInference do
        while NOT activeList.empty()
            sourceVertex ← activeList.remove()
            spreadActivationFromNode(sourceVertex, discoveryGraph)

        for j := 1 to topN
            activeList ← remove(maxActivation(discoveryGraph))

        createDiscoveryGraph(activeList, discoveryGraph)

        for vertex in activeList
            vertex.attentionalWeight ← log(vertex.activation)

    for vertex in discoveryGraph
        vertex.activation ← vertex.activation + base-level activation of vertex (Equation 4)
end function

function spreadActivationFromNode (source, discoveryGraph)
    inputs: source, vertex from which activation is spread
            discoveryGraph, graph for spreading activation

    for target in discoveryGraph
        target.activation ← target.activation + source.attentionalWeight + strength of association between source and target (Equation 7)
end function
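The algorithm in Table 3 can be approximated by the following runnable Python sketch. It is a simplification under stated assumptions rather than the system's implementation: the graph is a plain dictionary whose edge values stand in for log-scale strengths of association, discovery graph expansion is replaced by reading neighbors lazily, the attentional weight of a selected concept is taken to be its accumulated log-scale activation, and all names and numbers are hypothetical.

```python
import math
from collections import defaultdict


def spread_activation(graph, prior, seeds, seed_weight=1.0, top_n=1, levels=2):
    """Two-level spreading activation in log space.

    graph: concept -> {neighbor: log-scale strength of association}
    prior: concept -> log-scale base-level activation
    seeds: seed concepts; seed_weight must be positive
    Returns concept -> final activation (spread + prior).
    """
    seed_set = set(seeds)
    activation = defaultdict(float)
    attentional = {seed: math.log(seed_weight) for seed in seeds}
    active = list(seeds)
    fired = set()

    for _ in range(levels):
        # Every active source spreads to the concepts it is linked to.
        for source in active:
            fired.add(source)
            for target, strength in graph.get(source, {}).items():
                if target in seed_set:
                    continue  # seeds are sources only in this sketch
                activation[target] += attentional[source] + strength
        # Select the top-N concepts that have not yet spread activation.
        candidates = [c for c in activation if c not in fired]
        candidates.sort(key=lambda c: activation[c], reverse=True)
        active = candidates[:top_n]
        for concept in active:
            attentional[concept] = activation[concept]

    # Add the base-level activation once, at the end.
    return {c: a + prior.get(c, 0.0) for c, a in activation.items()}


# Hypothetical three-layer network: seeds A and C, linking concepts b1 and b2.
graph = {
    "A": {"b1": -1.0, "b2": -3.0},
    "C": {"b1": -1.5, "b2": -2.0},
    "b1": {"A": -1.0, "C": -1.5, "d1": -0.5},
    "b2": {"A": -3.0, "C": -2.0, "d2": -0.5},
}
prior = {"b1": -0.1, "b2": -0.2, "d1": -0.3, "d2": -0.3}
scores = spread_activation(graph, prior, seeds=["A", "C"], top_n=1, levels=2)
ranking = sorted(scores, key=scores.get, reverse=True)
```

With `top_n=1`, only the stronger linking concept b1 becomes a source in the second level, so d1 is reached while d2 never receives activation; this is the preferential flow described in the text.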

C. Information scent calculation example

This section presents an example calculation of information scent on the simple three-layer network in Figure 3. The network has two seed terms, each with an attentional weight of 1.0. The edge weights represent the strength of association calculation from Equation 7.

Figure 3: Information scent and LBD

Tables 4 and 5 present the activation equations for the first and second level inference respectively. It should be noted that, to facilitate the interpretability of the results, the example does not use the logarithm that appears in Equations 4 and 7. The strength of association values between the terms in the network act as inhibitory and excitatory links and the connected terms can be conceptualized as competing for activation. For example, the result of the first level of inference is that one of the linking concepts receives a higher activation value than the other, which causes the activation to flow preferentially in the second level of inference. There is a resemblance to the Construction-Integration (CI) model of text comprehension proposed by Kintsch [41]. Retrieving all of the concepts linked to the seed terms is analogous to the construction phase, which generates a set of items where the “right element is likely to be among those generated, even though others will also be generated that are irrelevant or outright inappropriate” [41]. The information scent calculation (notably, CI utilized spreading activation as well) is similar to the integration phase, which is used “to strengthen the contextually appropriate elements and inhibit unrelated and inappropriate ones” [41].

TABLE 4. Information scent equations for first level inference 0.008 0.128

TABLE 5. Information scent equations for second level inference 0.000014 0.000027 0.000027

IV. METHODS AND MATERIALS

We used SemRep to extract nearly 25 million semantic predications from nearly 7 million MEDLINE citations (titles and abstracts dating from 1999-2010). The PSI model is available as a component of the Semantic Vectors open source package for distributional semantics research [42]. Two groupings of semantic types as defined in the UMLS were used in this experiment. The substance group comprised 24 semantic types that describe or classify physical entities such as lipids or genes. The disease and functions group comprised disease types, disease symptoms, or organ functions. The integration constant in Equation 4 was set to 0.5 and the integration constant in Equation 7 was set to 0.9.

IV. RESULTS

A. Replication of testosterone and sleep

Miller et al. [7] utilized the closed discovery LBD method in Semantic MEDLINE to explore the underlying etiology of age-related hormonal changes and sleep quality. The hypothesis proposed by [7] can be summarized as follows. As men age, testosterone levels are known to decrease. Testosterone is hypothesized to inhibit cortisol. Cortisol is known to play a role in insomnia and waking. The decrease in testosterone in aging men results in an increase in cortisol, which may play a role in decreased sleep quality.

Miller et al. [7] conducted the experiment by executing four queries using Semantic MEDLINE and analyzing the results. The goal of the discovery replication is to return a ranked list of concepts in which the concepts found useful in [7] are highly ranked. The top 20 concepts from the seed terms “Sleep” and “Testosterone” are shown in Table 6. From the results, we see that “Corticosterone” (Rank 9 with substance and function, Rank 11 substance only) and “Hydrocortisone” (Rank 4 substance and function, Rank 7 substance only) were highly ranked. Given the high ranks of the concepts “Corticosterone” and “Hydrocortisone”, we conclude that we were able to generate a discovery graph that to a large extent automates the discovery by [7].

TABLE 6. Results for seed terms “Sleep” and “Testosterone”

| Rank | Concept name (Functions/Diseases and Substances) | Concept name (Substances only) |
|---|---|---|
| 1 | Obesity | Nitric Oxide |
| 2 | Excretory function | Interleukin-6 |
| 3 | Aging | Tumor Necrosis Factor-alpha |
| 4 | Hydrocortisone | Leptin |
| 5 | Melatonin | Norepinephrine |
| 6 | Epilepsy | Complement System Proteins |
| 7 | Respiration | Hydrocortisone |
| 8 | Energy Metabolism | Melatonin |
| 9 | Corticosterone | Somatotropin |
| 10 | Prolactin | receptor expression |
| 11 | Immunologic function | Corticosterone |
| 12 | Sleep Apnea, Obstructive | Corticotropin-Releasing Hormone |
| 13 | Steroid hormone | Prolactin |
| 14 | Fatigue | Interleukin-18 |
| 15 | Circadian Rhythms | Steroid hormone |
| 16 | Orexin | ghrelin |
| 17 | Dehydroepiandrosterone Sulfate | orexin |
| 18 | Metabolic Suppression | Dehydroepiandrosterone Sulfate |
| 19 | Sleep Apnea Syndromes | HTR2A |
| 20 | 3 (or 17)-beta-hydroxysteroid dehydrogenase | 3 (or 17)-beta-hydroxysteroid dehydrogenase |

[Figure 3 node and edge labels: seed terms S1 (W1 = 1.0) and S2 (W2 = 1.0); terms T1 to T5, each with B = 0.2; edge weights 0.33, 0.2, 0.8, 0.8, 0.2, 0.33, 0.33, 0.2, 0.4, 0.4]


The results contain several potentially relevant concepts that were not pursued by [7]. For example, ghrelin (an appetite regulation hormone) and obesity were present, which could serve as a foundation for a larger hypothesis in the context of cortisol, testosterone, and age-related sleep decline. Interestingly, although the system was given no indication that the domain of interest was age-related sleep disorders, it was able to infer from the seed terms that aging was a probable concept of interest.

B. Replication of sleep and depression study

The discovery browsing method was presented by Wilkowski et al. [6] to explore the relationship between sleep and depression. Based on initial seed terms and user selections, the paths between concepts were ranked based on a summation of the degree centrality of the concepts in the path. An important note is that the method was interactive and the users focused the search on paths containing the CLOCK gene given its prominent role in regulating sleep. The result of discovery browsing was a graph containing the nine concepts shown in Table 7. There were several potential explanatory hypotheses implicit in the graph, and [6] focused on the role of inflammation (interleukin-6 and interleukin-1 beta), circadian rhythms (clock gene and melatonin), and the neurotransmitter norepinephrine. Individually, each component is known to play a role in depression. However, [6] report that a high-recall PubMed query returned no citations, indicating that the pathophysiology of circadian rhythms, norepinephrine, and inflammation in depression is a novel insight or, at minimum, a poorly understood area.
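The path ranking described above, a summation of the degree centrality of the concepts on each path, can be illustrated with a minimal sketch. This is not the implementation used in [6]: the toy graph, the abstract node names, and the use of raw (unnormalized) degree are assumptions made for illustration only.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Raw degree of each node in an undirected graph given as edge pairs."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    return {node: len(adjacent) for node, adjacent in neighbors.items()}


def rank_paths(paths, edges):
    """Score each path by the summed degree of its nodes, highest first."""
    degree = degree_centrality(edges)
    scored = [(sum(degree[node] for node in path), path) for path in paths]
    return sorted(scored, key=lambda item: item[0], reverse=True)


# Hypothetical toy graph; the letters stand in for concepts.
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("C", "E"), ("E", "D")]
paths = [["A", "B", "D"], ["A", "C", "D"], ["A", "C", "E", "D"]]

for score, path in rank_paths(paths, edges):
    print(score, " -> ".join(path))
# 10 A -> C -> E -> D
# 8 A -> C -> D
# 7 A -> B -> D
```

In this toy case the longest path ranks first because a plain summation accumulates degree over every node on the path.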

TABLE 7. Results from seed terms “serotonin” and “melatonin”

| Concept Name | Rank discovery graph (no stop list): Substances | Rank discovery graph (no stop list): Function/disease & Substances | Rank discovery graph (with stop list): Substances | Rank discovery graph (with stop list): Function/disease & Substances |
|---|---|---|---|---|
| Norepinephrine | 1 | 1 | 3 | 1 |
| CLOCK | 11 | 9 | 11 | 8 |
| Interleukin-6 | 41 | 65 | 37 | 45 |
| Interleukin-1 beta | 31 | 44 | 30 | 39 |
| Dopamine | 32 | 34 | 27 | 30 |
| Glutamate | 66 | 62 | 1 | 35 |
| Insulin | 42 | 69 | 38 | 50 |
| Interferon Type II | 72 | 97 | 57 | 82 |
| Cholesterol | 59 | 92 | 49 | 69 |

We used “Serotonin” and “Melatonin” as the seed terms for the replication. The stop list applied in some configurations is the one used in [6]. The rankings for the nine concepts are presented in Table 7. In three of the four configurations, the discovery made by [6] was contained within the top 50 concepts. The best performing configuration was using no stop list with the graph limited to substance interactions. In this particular configuration, the sub-graph was contained within the top 37 concepts. Additionally, it is important to note that we were able to simulate the decision point to focus on CLOCK genes, given that the concept “CLOCK” was in the first 25 concepts, causing activation to flow down this path to connected concepts.
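The top-50 and top-37 figures can be checked directly against the ranks in Table 7. The sketch below assumes that "the discovery made by [6]" refers to the four non-seed concepts named in the hypothesis (norepinephrine, CLOCK, interleukin-6, and interleukin-1 beta). The four configurations are indexed in table column order, and the names are hypothetical.

```python
# Ranks copied from Table 7, one tuple per concept, in table column order.
RANKS = {
    "Norepinephrine": (1, 1, 3, 1),
    "CLOCK": (11, 9, 11, 8),
    "Interleukin-6": (41, 65, 37, 45),
    "Interleukin-1 beta": (31, 44, 30, 39),
}


def worst_rank_per_configuration(ranks):
    """Lowest-placed (largest) rank of any listed concept in each column."""
    return [max(column) for column in zip(*ranks.values())]


worst = worst_rank_per_configuration(RANKS)
within_top_50 = sum(rank <= 50 for rank in worst)

print(worst)           # [41, 65, 37, 45]
print(within_top_50)   # 3
print(min(worst))      # 37
```

Under this reading, three of the four configurations place all four concepts within the top 50, and the best configuration places them within the top 37.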

V. DISCUSSION

We presented the design of an LBD system based on the theoretical foundation of IFT. This work makes several contributions. It is the first attempt to develop a discovery browsing system based on the theoretical foundations of the IFT. In addition, we showed that the model is capable of replicating previous discoveries made by scientists using Semantic MEDLINE. Finally, this work is an example of the potential for rational theories of cognition to inform the development of information management systems.

The major limitation of this work is that it stops short of proposing a computational theory of discovery browsing. We were able to replicate the results of two LBD findings. Notably, in the case of sleep and depression, the discovery browsing system was to some extent able to replicate the human decision to focus on CLOCK genes and was thus able to find the novel connections linking serotonin and melatonin. However, during discovery seeking, many intermediate decisions lead up to the final discovery. Further laboratory experiments are needed to record these intermediate steps and gain insight into the entire discovery browsing session. These intermediate decisions are crucial for formally extending the IFT to account for discovery browsing.

Currently, little is known about the scientific discovery process or how scientists engage in this type of information seeking behavior. Furthermore, no studies have investigated the behavior of scientists as they interact with LBD systems. It seems plausible that discovery seeking behavior could be divided into different categories, each of which could maximize a different utility function. By analogy, information seeking behavior on the Web is often classified as informational, navigational, or resource, with different categories existing within each [43]. For example, one scientist may browse Semantic MEDLINE looking for anomalous (i.e., highly novel) relations to inspire a future area in which a hypothesis could be formulated, whereas another may look for a connection, as in closed discovery, between two or more concepts that they hypothesize to be related. Once the types of discovery seeking behavior are defined, the next goal is to define information scent for each of the discovery classifications.

In summary, little is currently known about the underlying cognitive mechanisms at play during discovery browsing and even less is known about how to develop systems to support this type of behavior. This work is an early attempt at such a discovery support system, but further investigations are needed to progress this area of research.


This theoretical paradigm of IFT provides a space where we can theorize, test, experiment, and ultimately understand discovery seeking behavior and develop tools to assist in the process, which will ultimately expedite the growth of scientific knowledge.

V. CONCLUSION

We presented a theoretically motivated framework for LBD based on the IFT. We have shown that this framework can replicate past LBD experiments and holds promise in formulating new hypotheses.

ACKNOWLEDGMENTS

This research was funded in part by a training fellowship from the Keck Center of the Gulf Coast Consortia, on the Training Program in Biomedical Informatics, National Library of Medicine (NLM) T15LM007093. In addition, the research was supported in part by the Medical Informatics Training Program at the Lister Hill Center for Biomedical Communications and the Intramural Research Program of the National Institutes of Health, National Library of Medicine. We would like to thank Todd R. Johnson, PhD and Elmer V. Bernstam, MD for their input and support in this work. We also want to thank the anonymous reviewers whose comments resulted in significant improvement to the presentation of this material.

REFERENCES

1. Wilson, P., Unused relevant information in research and development. Journal of the American Society for Information Science, 1995. 46(1): p. 45-51.
2. Swanson, D.R., Fish-oil, Raynaud's Syndrome, and undiscovered public knowledge. Perspectives in Biology and Medicine, 1986. 30(1): p. 7-18.
3. Swanson, D.R., Undiscovered public knowledge. Library Quarterly, 1986. 56(2): p. 103-118.
4. Kilicoglu, H., et al., Semantic MEDLINE: A web application for managing the results of PubMed Searches, in Proceedings of the Third International Symposium for Semantic Mining in Biomedicine. 2008.
5. Rindflesch, T.C. and M. Fiszman, The interaction of domain knowledge and linguistic structure in natural language processing: interpreting hypernymic propositions in biomedical text. Journal of Biomedical Informatics, 2003. 36(6): p. 462-77.
6. Wilkowski, B., et al., Discovery browsing with semantic predications and graph theory. AMIA Annu Symp Proc. 2011, 2011.
7. Miller, C.M., et al., A closed literature-based discovery technique finds a mechanistic link between hypogonadism and diminished sleep quality in aging men. Sleep, 2012. 35(2): p. 279-285.
8. Freeman, L.C., Centrality in social networks: Conceptual clarification. Social Networks, 1978. 1: p. 215-239.
9. Pirolli, P. and S. Card, Information foraging, 1999: Oxford University Press.
10. Anderson, J., How can the human mind occur in the physical universe?, 2007, New York, NY: Oxford University Press.
11. Pirolli, P., Rational analyses of information foraging on the Web. Cognitive Science, 2005. 29: p. 343-373.
12. Crestani, F., Application of spreading activation techniques in information retrieval. Artificial Intelligence Review, 1997. 11: p. 453-482.
13. Weeber, M., et al., Using concepts in literature-based discovery: Simulating Swanson's Raynaud-fish oil and migraine-magnesium discoveries. Journal of the American Society for Information Science, 2001. 52(7): p. 548-557.
14. Bruza, P. and M. Weeber, Literature-based discovery, 2008: Springer-Verlag.
15. Hristovski, D., et al., Exploring semantic relations for literature-based discovery, in AMIA Annu Symp Proc 2006, 2006.
16. Ahlers, C.B., et al., Using the literature-based discovery paradigm to investigate drug mechanisms, in AMIA Annu Symp Proc 2007, 2007. p. 6-10.
17. Hristovski, D., et al., Combining semantic relations and DNA microarray data for novel hypothesis generation, in Linking literature, information, and knowledge for biology, 2010, SpringerLink.
18. Cohen, T., et al. Finding schizophrenia's Prozac: emergent relational similarity in predication space. in Quantum Interaction. 2011.
19. Cohen, T., et al., Logical leaps and quantum connectives: Forging paths through predication space, in AAAI Fall 2010 Symposium on Quantum Informatics for cognitive, social, and semantic processes, 2010.
20. Cohen, T., R. Schvaneveldt, and T. Rindflesch. Predication-based semantic indexing: Permutations as a means to encode predications in semantic space. in AMIA Annu Symp Proc. 2009.
21. Bell, C.G. and A. Newell, Computer Structures: Readings and Examples, 1971, New York: McGraw-Hill.
22. Anderson, J.R. and G.H. Bower, Human associative memory, 1973, Washington, DC: Winston & Sons.
23. Anderson, J.R., Reflections of the environment in memory. Psychological Science, 1991(2): p. 396-408.
24. Anderson, J.R., The adaptive nature of human categorization. Psychological Review, 1991. 98: p. 409-429.
25. Schooler, L.J. and J.R. Anderson, The role of process in the rational analysis of human memory. Cognitive Psychology, 1997. 32: p. 219-250.
26. Anderson, J.R., et al., A central circuit of the mind. Trends in Cognitive Sciences, 2008. 12: p. 136-143.
27. Anderson, J.R. and R. Milson, Human memory: An adaptive perspective. Psychological Review, 1989. 96: p. 703-719.
28. Goodwin, J.C., et al., Predicting biomedical document access as a function of past use. J Am Med Inform Assoc., 2011.
29. Recker, M.M. and J.E. Pitkow, Predicting document access in large multimedia repositories. ACM Transactions on Computer-Human Interaction (TOCHI), 1996. 3(4).
30. Van Maanen, L., et al., Personal publication assistant: abstract recommendation by a cognitive model. Cognitive Systems Research, 2009.
31. Steyvers, M. and T.L. Griffiths, Rational analysis as a link between human memory and information retrieval, in The Probabilistic Mind: Prospects from Rational Models of Cognition, 2010, Oxford University Press.
32. Mei, Q., D. Zhang, and C.X. Zhai. A general optimization framework for smoothing language models on graph structures. in SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on research and development in information retrieval. 2008. USA, New York, NY: ACM.
33. Opsahl, T., F. Agneessens, and J. Skvoretz, Node centrality in weighted networks: Generalizing degree and shortest paths. Social Networks, 2010. 32(3): p. 245-251.
34. Brown, P.F., et al., A statistical approach to machine translation. Computational Linguistics, 1990. 16(2): p. 79-85.
35. Jelinek, F., Statistical methods for speech recognition, 1997: MIT Press.
36. Ponte, J. and W.B. Croft. A language modeling approach to information retrieval. in Proceedings of the ACM SIGIR 98. 1998.
37. Singhal, A., C. Buckley, and M. Mitra. Pivoted document length normalization. in Proceedings of the 1996 ACM SIGIR Conference on Research and Development in Information Retrieval. 1996.
38. Hiemstra, D. and W. Kraaij. Twenty-one at TREC-7: Ad-hoc and cross language track. in Proceedings of Seventh Text Retrieval Conference (TREC-7). 1998.
39. Hiemstra, D., A probabilistic justification for using tf x idf term weighting in information retrieval. International Journal on Digital Libraries, 2000. 3: p. 131-139.
40. Lavrenko, V. and W.B. Croft. Relevance-based language models. in Proc. 24th ACM SIGIR Conf. on Research and Development in Information Retrieval. 2001.
41. Kintsch, W., Comprehension: A paradigm for cognition, 1998: Cambridge University Press.
42. Widdows, D. and T. Cohen, The semantic vectors package: New algorithms and public tools for distributional semantics, in Fourth IEEE International Conference on Semantic Computing (IEEE ICSC2010) 2010: Carnegie Mellon University, Pittsburgh, Pennsylvania.
43. Rose, D.E. and D. Levinson. Understanding user goals in Web search. in Proc WWW. 2004.
