
Gelsius: A Literature-Based Workflow for Determining Quantitative Associations between Genes and Biological Processes

Francesco Abate, Andrea Acquaviva, Elisa Ficarra, Roberto Piva, and Enrico Macii

Abstract—An effective methodology for extracting and quantifying knowledge from biomedical literature would allow researchers to organize and analyze the results of high-throughput experiments based on microarrays and next-generation sequencing technologies. Despite the large amount of raw information available on the web, a tool able to extract a measure of the correlation between a list of genes and biological processes is not yet available. In this paper, we present Gelsius, a workflow that uses biomedical literature to quantify the correlation between genes and terms describing biological processes. To achieve this target, we build different modules focusing on query expansion and document canonicalization. These modules improve the measurement of correlation, which is performed using a latent semantic analysis approach. To the best of our knowledge, this is the first complete tool able to extract a measure of gene-biological process correlation from literature. We demonstrate the effectiveness of the proposed workflow on six biological processes and a set of genes, by showing that correlation results for known relationships are in accordance with the definitions of gene functions provided by the NCI Thesaurus. In addition, the tool is able to propose new candidate relationships for later experimental validation. The tool is available at http://bioeda1.polito.it:8080/medSearchServlet/.

Index Terms—UMLS, gene ontology, thesaurus, ontologies, text mining


1 INTRODUCTION

The recent advances in gene expression analysis techniques, such as microarray analyses and DNA/RNA next-generation sequencing (NGS), coupled with the development of biomedical literature databases and data classification systems such as ontologies and thesauri, have enabled an unprecedented diffusion on the web of biomedical information related to genes and their involvement in biological processes and pathologies [1], [2].

This new and very exciting scenario comes with an intriguing challenge: extracting relevant knowledge from genetic and biomedical studies performed on a wide variety of biological processes and genes.

Correlation quantification, reflecting the current knowledge reported in biomedical literature, could be used to select the specific subset of genes that are most likely involved in the biological process under study, and on which subsequent analyses should focus. The correlation score should reflect the conceptual content of the biomedical literature pertaining to the topics related to the analysis.

To achieve this target, an information extraction tool should provide not only a quantification of the correlation of each gene with a biological process, but also a comparative evaluation of the relevance of each gene with respect to the others, as well as of the role of each gene across different biological processes.

Recently, numerous query expansion methods [29], [30], [31] and semantic similarity and indexing frameworks [33], [34], [28], [7] have been developed to facilitate the mining of scientific literature for meaningful biological information. In addition, techniques based on word thesauri (not focused on biomedical literature) have been proposed [35] for analyzing the relatedness of text. Despite these individual contributions, a complete workflow targeted at measuring gene-biological process correlation is still missing.

We present a complete workflow and its implementation in a tool called Gelsius, enabling quantitative measurement of the correlation between genes and biological processes according to biomedical literature knowledge. The proposed workflow exploits concept-based keyword expansion [29] to generate a set of terms used to retrieve related PubMed abstracts, which are canonicalized before being processed by a correlation engine implemented using a latent semantic indexing (LSI) [9] approach.

Gelsius implements an innovative pipeline whose building blocks are customized for the quantification of gene-biological process correlation. It exploits the completeness of the information provided by the UMLS Metathesaurus [3] and the biomedical literature in combination with an effective correlation extraction workflow. Fig. 1 outlines the proposed pipeline. Starting from a set of input terms specified by the scientist, the terms are expanded to the related concepts by exploiting UMLS relationships. The expanded terms are used as keywords for the PubMed web service. The collected documents are canonicalized into a language based on UMLS concepts. The biological relationship between the input terms is quantitatively evaluated through LSI, which measures the semantic distance between biological concepts (i.e., genes and biological processes) occurring in the document set.

IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 10, NO. 3, MAY/JUNE 2013 619

F. Abate, A. Acquaviva, E. Ficarra, and E. Macii are with the Polytechnic of Turin, Turin, Italy.

R. Piva is with the University of Turin, Turin, Italy.

Manuscript received 12 Mar. 2011; revised 27 Jan. 2013; accepted 5 Feb. 2013; published online 20 Feb. 2013.

For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TCBB-2011-03-0061.

Digital Object Identifier no. 10.1109/TCBB.2013.11.

1545-5963/13/$31.00 © 2013 IEEE Published by the IEEE CS, CI, and EMB Societies & the ACM

The proposed workflow embeds the following relevant features:

1. It extends the UMLS query expansion strategies [29] by introducing filtering based on Semantic Groups.

2. It implements an original algorithm for document retrieval that collects a set of specific documents to improve the subsequent correlation analysis.

3. It improves the correlation quantification performed by LSI by exploiting the canonicalization of abstracts.

4. It introduces a novel correlation metric, the biological correlation index (BCI), which enables the comparison of scores to evaluate the relevance of a gene across different biological processes.

We demonstrate the effectiveness of the proposed tool with a set of biological processes and genes as case studies, containing both known and unknown correlations. We use the former for validation and the latter to show how the proposed workflow is able to highlight unknown links. Validation is achieved by comparing the discovered relationships against NCI definitions [25].

2 RELATED WORK

Several methodologies based on graph-structured ontologies, such as GO, have been proposed to compute term similarity. They fall mainly into two classes: edge based, which count the number of edges between two terms, and node based, which rely on comparing the properties of the terms involved [19].

An alternative methodology that calculates semantic similarity by representing the gene products in a vector space model (VSM) has been proposed in [6]. By creating a matrix of GO terms by gene product, a pairwise comparison between GO term vectors is performed, resulting in a half-matrix of gene product similarities. However, the tool lacks generality in both the knowledge base and the type of correlations that can be found. Furthermore, as argued in [8], computing the similarity with standard techniques (pairwise comparison, cosine, etc.) in the VSM fails to take co-occurrences into account during the calculation. Similarly, mapping the GO terms and gene products onto the VSM prevents the co-occurrences of gene products in different GO terms from being considered. As detailed in Section 3.3, our tool overcomes these limitations by exploiting an LSI technique [9].

Various papers have applied LSI in the context of biomedical term correlation. However, most of the research has focused on finding correlated words for expanding the query, not on correlation quantification [9], [32], [37], [38], [31], [29]. In our work, we use a similar approach for term expansion [29]; however, our aim is to obtain a specific set of documents rather than to extend it. As such, we combine the query keywords with “AND” instead of ORing them. Moreover, we use a different filtering based on semantic groups, as we will explain in Section 3.1. A more general technique for the analysis of text relatedness based on word thesauri has been recently proposed in [35]. This work differs from ours in two main respects. First, even though the authors propose word correlation techniques, their aim is to quantify text relatedness rather than single-term correlation. Second, their method is not focused on biomedical literature.

The effectiveness of LSI for extracting the functional association between genes from a set of biological abstracts has been demonstrated by Homayouni et al. [7]. Although the study in [7] provides a proof of principle of the efficiency of adopting LSI for elucidating both implicit and explicit gene relationships from the biomedical literature, it does not present a complete workflow for gene-biological process correlation. In particular, the proposed method is applied to a prebuilt set of abstracts from PubMed; therefore, no automated abstract retrieval has been developed. Nevertheless, that paper points out the efficiency of adopting LSI for enhancing the latent correlation among gene terms and user keywords (i.e., gene terms or biological processes).

The main contribution of this paper is the development of an integrated pipeline for the quantification of gene-biological process correlation using biomedical literature as the knowledge base. The tool exploits basic text processing technologies that are extended and customized for correlation quantification. Moreover, it proposes a new correlation metric. A preliminary version of this tool was presented in [36] and also extended in [46] to integrate biological pathways in the correlation search. With respect to the previous version, the method has been substantially extended to take as input a list of genes and processes. To achieve this target, a new keyword expansion strategy and a new document abstract retrieval method have been developed. In addition, the experimental results section has been expanded with new results about the impact of semantic group filtering as well as new case studies based on additional biological processes to better characterize the tool and evaluate its effectiveness.

3 FLOW DESCRIPTION

In this section, we detail the steps shown in Fig. 2 that implement the Gelsius semantic analysis workflow. The user input consists of a list of gene names and biological processes. The output of Gelsius is the correlation score between all the genes and the biological processes in the input list. The final metric (i.e., BCI) allows the comparison of the impact of a set of genes on the same biological process as well as of the impact of single genes across biological processes.

3.1 Terms Expansion

The aim of the term expansion is to automatically extend the set of documents related to biomedical topics that could link the genes with the biological processes and eventually unveil nontrivial correlations. The term expansion expands the initial set of gene names, named initial gene set (IGS), and biological processes, named initial biological process set (IBPS). Specifically, for each pair (g, bp) composed of a gene and a biological process, the set of related terms and synonyms, namely the expanded concept set (ECS), is determined. All the input terms and the expanded terms are used as keywords to retrieve the set of related documents. For each (g, bp), a document set is retrieved, and the subsequent semantic analysis is then performed separately on these sets. Finally, since the objective of Gelsius is to measure the correlation between a list of genes and biological processes, the individual scores are represented in a common space through the BCI (see footnote 1).

Fig. 1. Semantic analysis workflow.

To achieve this objective, we extend the methodology presented in [29] in two main ways. First, we provide a customized implementation of the term generation from coconcepts, targeted to correlation measurement. Second, we do not restrict the coconcepts to the semantic type of the initial term; rather, we exploit the UMLS semantic network.

The expanded terms will be used in combination with the initial terms to generate a set of keywords for the retrieval of abstracts from PubMed, later used for correlation analysis.

Referring to Fig. 2, the block called software query interface to UMLS database implements the access to the UMLS database to retrieve the coconcepts and the terms related to a specified keyword string (i.e., initial keyword set).

The structure of the UMLS database assigns a concept unique identifier (CUI) to every medical concept it includes. Each concept can be associated with one or more string labels. When multiple strings are associated with a CUI, a preferred term (PT) is also indicated. A keyword string, either a gene term or a biological process, can match more than a single CUI. In this case, all the matching CUIs are considered for each keyword. The resulting CUIs are the concept candidates (CCs) for the corresponding keyword set. For instance, the keyword “VEGF” matches both “VEGF protein, human” (CUI C1171892) and “VEGFA gene” (CUI C1823619).

By exploiting UMLS relationships between CUIs, we then find all CUIs related to each CC. These CUIs are the expanded concepts (ECs), and the corresponding PTs are the expanded terms used as keywords in the ECS. Fig. 3 shows an example of expansion applied to the starting term “VEGF.”
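The expansion from keyword to concept candidates to expanded terms can be sketched as follows. This is a minimal illustration that assumes a tiny in-memory stand-in for the UMLS tables: the two CUIs for “VEGF” come from the text, while the `CX...` identifiers, their labels, the relations, and the function name `expand` are hypothetical and are not part of UMLS or of Gelsius.

```python
# Hypothetical stand-in for the UMLS lookup tables (not real UMLS content,
# except for the two "VEGF" CUIs quoted in the text).
LABEL_TO_CUIS = {"VEGF": ["C1171892", "C1823619"]}
PREFERRED_TERM = {
    "C1171892": "VEGF protein, human",
    "C1823619": "VEGFA gene",
    "CX000001": "illustrative related concept A",  # made-up concept
    "CX000002": "illustrative related concept B",  # made-up concept
}
RELATED = {  # made-up CUI-to-CUI relationships
    "C1171892": ["CX000001"],
    "C1823619": ["CX000001", "CX000002"],
}


def expand(keyword):
    """Return (concept candidates, expanded terms) for one keyword string."""
    # Every CUI whose label matches the keyword is a concept candidate (CC).
    candidates = LABEL_TO_CUIS.get(keyword, [])
    expanded = []
    for cc in candidates:
        # Every CUI related to a CC is an expanded concept (EC).
        for cui in RELATED.get(cc, []):
            if cui not in candidates and cui not in expanded:
                expanded.append(cui)
    # The preferred terms of the ECs are the keywords placed in the ECS.
    return candidates, [PREFERRED_TERM[cui] for cui in expanded]


ccs, ecs_terms = expand("VEGF")
```

In this toy setting both candidate CUIs are kept, and a concept reached from more than one candidate is added to the ECS only once.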

Fig. 2. Complete semantic analysis workflow. The user input terms, a gene list and a biological process, are expanded by exploiting the UMLS database to obtain a set of terms to be used as keywords for the PubMed web services. All the concepts in the resulting document set are mapped to UMLS CUIs by means of the Metamap program. Finally, the ad hoc version of LSI calculates the relationship between genes and biological processes according to the co-occurrences in the document set.

Fig. 3. Term expansion example.

1. By performing separate searches, we do not force documents to contain the whole list of gene terms. This would be needed if we wanted to compute the correlation between the list of genes considered as a whole and a biological process. However, two or more subsets of documents may overlap if the genes or the biological processes are correlated. Conversely, a single subset of documents can contain genes and biological processes other than the initial terms used for the search. Since the tool computes the correlations between all the terms present in the document subset anyway (as explained in Section 3.3), a list of these correlation scores can be provided as output. This information can be used to discover correlations that are not initially taken into account by the user (i.e., they are not present in the IGS).

Instead of simply limiting the expansion to those concepts belonging to the same semantic type as the initial keywords, we exploit the semantic network and semantic groups [5] of UMLS. The semantic network is a semantic representation of the UMLS concepts that maps the huge number of concepts into a set of 135 categories, indicating the semantic type of each CUI in the UMLS database. The semantic groups are a further conceptual abstraction layer that includes 15 categories, and all the semantic types in UMLS belong to one or more semantic groups (Table 1) [5].

Table 1 lists all the semantic groups of UMLS. The number of semantic types belonging to each group is reported as well.

To obtain the most valuable result, it is crucial to create a proper set of semantic types. Including all the types would lead to an overly generic search, possibly introducing noise for the subsequent steps. Including only one type, as discussed before, would cause a loss of meaningful documents. A viable approach is to exploit the semantic network of UMLS. The user selects the semantic groups most closely related to the topics of interest for the specific search. The ECs are then filtered according to these semantic groups. Note that each group, by definition, already includes a number of semantic types. We will quantify the impact of the selection of semantic groups in Section 4.

Here, we provide some insights about the selection procedure. Suppose that VEGFA and Angiogenesis are the gene and biological process pair. The first term is the acronym of Vascular Endothelial Growth Factor A and, according to the definition in the NCI Thesaurus, it is involved “in the regulation of blood vessel growth” [25]. The second term is a biological process consisting of the “development of new blood vessels” [25]. By applying a filter using only the Genes & Molecular Sequences group, only the terms strictly related to the gene field (e.g., gene symbols) would be considered.

Therefore, narrowing to the Genes & Molecular Sequences group would lead to an overly restricted set of documents. In fact, terms such as Angiostatins, Blood Vessels Morphogenesis, and Vasculogenesis, which play a fundamental role in defining the links between the two input terms, are removed from the ECS. Moreover, the restriction to the Genes & Molecular Sequences group prevents the consideration of keywords referring to protein names, which can be useful for correlation quantification. For instance, the keyword “VEGF protein, human,” belonging to the “Chemicals & Drugs” group in the UMLS database, would be discarded even though it potentially links the VEGFA gene and the angiogenesis process.
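The effect of the group selection on the ECS can be illustrated with a short sketch. It assumes each expanded concept carries the set of semantic groups it belongs to; the group of “VEGF protein, human” is taken from the text, whereas the groups assigned to the other two terms, the record layout, and the function name are assumptions made for illustration only.

```python
def filter_by_semantic_groups(expanded_concepts, selected_groups):
    """Keep the ECs that belong to at least one user-selected semantic group."""
    selected = set(selected_groups)
    return [ec["term"] for ec in expanded_concepts if ec["groups"] & selected]


# Toy ECS for the VEGFA / Angiogenesis pair.
ecs = [
    {"term": "VEGFA gene", "groups": {"Genes & Molecular Sequences"}},  # assumed group
    {"term": "VEGF protein, human", "groups": {"Chemicals & Drugs"}},   # from the text
    {"term": "Vasculogenesis", "groups": {"Physiology"}},               # assumed group
]

# Narrow selection: only gene-related terms survive.
narrow = filter_by_semantic_groups(ecs, ["Genes & Molecular Sequences"])

# Broader selection: protein and process terms that link the pair are kept.
broad = filter_by_semantic_groups(
    ecs, ["Genes & Molecular Sequences", "Chemicals & Drugs", "Physiology"]
)
```

With the narrow selection the protein and process terms are dropped, which is the loss of linking keywords described above.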

3.2 Document Abstract Retrieval and Canonicalization

The document abstract retrieval phase aims at collecting a set of document abstracts by performing a number of queries to PubMed web services using a combination of the keywords generated in the previous step.

The keyword query set generation selects the subset of keywords to be used for each query. Each query will return a subset of abstracts.

Other document retrieval approaches [38], [29] are not focused on correlation quantification. For this reason, we depart from them by logically ANDing the keywords instead of ORing them. This prevents the retrieval of documents that are more related to the expanded keywords than to the initial keywords.

The proposed approach is to formulate the query as a combination of both expanded and initial keywords, thus returning a set of documents that maintains the conceptual link between the initial keywords and the expanded ones.

To this purpose, we developed a query generation algorithm, reported in Fig. 4. To avoid an empty set of retrieved abstracts from PubMed, two new, less specific queries are formulated, each combining one initial keyword and one expanded keyword. This choice, in practice, guarantees that a nonempty set of documents is retrieved.
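A rough sketch of this query strategy is given below. It is not the algorithm of Fig. 4: it assumes that the two less specific queries are issued only when the basic query returns nothing, that labels are quoted as phrases, and that `search` is a caller-supplied function returning document identifiers. The function names and the fake index used in the example are hypothetical.

```python
from typing import Callable, Iterable, List, Set


def build_queries(gene: str, process: str, expanded: str) -> List[str]:
    """Return the basic AND query followed by the two less specific ones."""
    def conj(*terms: str) -> str:
        # Quote each label, since a label may consist of several words.
        return " AND ".join(f'"{t}"' for t in terms)

    return [conj(gene, process, expanded), conj(gene, expanded), conj(process, expanded)]


def retrieve(gene: str, process: str, ecs: Iterable[str],
             search: Callable[[str], List[str]]) -> Set[str]:
    """Collect document ids for one (gene, process) pair over the whole ECS."""
    doc_ids: Set[str] = set()
    for expanded in ecs:
        basic, *fallbacks = build_queries(gene, process, expanded)
        hits = list(search(basic))
        if not hits:  # assumption: fall back only when the basic query is empty
            for query in fallbacks:
                hits.extend(search(query))
        doc_ids.update(hits)
    return doc_ids


# Fake search backend with made-up document ids, standing in for PubMed.
fake_index = {
    '"VEGFA" AND "Angiogenesis" AND "Vasculogenesis"': ["d1", "d2"],
    '"VEGFA" AND "Angiostatins"': ["d3"],
}
docs = retrieve("VEGFA", "Angiogenesis", ["Vasculogenesis", "Angiostatins"],
                lambda q: fake_index.get(q, []))
```

Because every query keeps at least one initial keyword, the retrieved documents stay tied to the original pair rather than drifting toward the expanded terms.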

The document abstract canonicalization phase translates the text of the abstracts retrieved in the previous step from natural language form into UMLS concepts. This procedure is also called canonicalization [32]. In the proposed workflow, this step is done to reduce the ambiguity that could impact the subsequent correlation measurement step.

The workflow shown in Fig. 5 reports example sentences extracted from the document abstracts and canonicalized into UMLS concepts.

Fig. 4. Query generation algorithm. For each gene in the IGS and biological process in the IBPS, the algorithm first generates a basic query as a combination of gene name and biological process with one keyword from the ECS. Note that each keyword from the IGS, the IBPS, and the ECS is a label string; therefore, it may actually be composed of more than a single string.

TABLE 1. UMLS Semantic Groups

As a result, the document abstract canonicalization returns a set of documents in the common domain of biomedical terms expressed in the form of UMLS CUIs. This helps the subsequent LSI step, where the correlation among the CUIs contained in the documents is computed.
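To make the idea of canonicalization concrete, the sketch below replaces known phrases in a sentence with CUIs using a plain dictionary lookup. This is only a toy stand-in: it does not call or imitate the Metamap program, the phrase table is an assumption, and `CX_ANGIO` is a made-up placeholder rather than a real UMLS identifier (the two VEGF CUIs are the ones quoted in Section 3.1).

```python
import re

# Hypothetical phrase-to-CUI table; a real system would query UMLS instead.
PHRASE_TO_CUI = {
    "vegfa gene": "C1823619",
    "vegf protein, human": "C1171892",
    "angiogenesis": "CX_ANGIO",  # placeholder, not a real CUI
}


def canonicalize(text):
    """Return the sequence of CUIs found in the text, in reading order."""
    # Try longer phrases first so that multiword labels win over shorter ones.
    phrases = sorted(PHRASE_TO_CUI, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(p) for p in phrases), re.IGNORECASE)
    # Text that matches no known phrase is dropped from the canonical form.
    return [PHRASE_TO_CUI[m.group(0).lower()] for m in pattern.finditer(text)]


canonical_doc = canonicalize("The VEGFA gene regulates angiogenesis.")
```

After this step two abstracts that name the same concept with different surface strings would share the same CUI, which is what the later co-occurrence analysis relies on.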

3.3 Latent Semantic Analysis

Gelsius adopts the LSI algorithm to analyze the relationships between the terms (CUIs) contained in a set of documents (the retrieved abstracts). Rather than performing a simple statistical analysis of co-occurrence in the abstracts, LSI remaps the space of documents into a space of conceptual domains using a singular value decomposition (SVD) [8], [9]. Note that conceptual domains do not correspond to UMLS concepts.

The LSI algorithm is applied to each document set separately, where each set corresponds to a search originating from a gene-biological process pair. Hence, LSI is computed as many times as the number of gene-biological process pairs provided as input. Through LSI, a score in the interval [0,1] is obtained, expressing the semantic distance between the gene and the biological process in the space of the conceptual domains.

The preprocessing stage performed prior to the application of LSI, in which all the documents in natural language form are mapped into UMLS CUIs by means of Metamap, allows the analysis of the relationships between UMLS CUIs and documents, instead of between simple terms and documents, thus improving the match between CUIs belonging to different documents.

The LSI step proceeds as follows. It starts from a representation of CUIs and documents in a VSM through a two-dimensional occurrence matrix X in which the columns represent the dimensional space of a corpus of documents and the rows represent the dimensional space of the CUIs occurring in the documents. A generic element $x_{c,d} \in X$ represents the occurrence of the CUI c in the document d.

Note that as terms, we consider all the genes in the input gene list as well as the targeted biological process.
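As an illustration of the occurrence matrix, the following sketch builds X from documents that are already expressed as lists of CUIs. It assumes that an entry holds the raw number of occurrences of a CUI in a document; the text does not state whether counts, binary flags, or weighted values are used. The short identifiers `G`, `P`, and `Q` are placeholders, not real CUIs.

```python
def occurrence_matrix(docs):
    """Build the CUI-by-document matrix X (rows: CUIs, columns: documents)."""
    cuis = sorted({cui for doc in docs for cui in doc})
    row_of = {cui: i for i, cui in enumerate(cuis)}
    X = [[0] * len(docs) for _ in cuis]
    for d, doc in enumerate(docs):
        for cui in doc:
            X[row_of[cui]][d] += 1  # assumption: raw occurrence counts
    return cuis, X


# Three canonicalized toy documents.
docs = [["G", "P"], ["G", "G"], ["P", "Q"]]
cuis, X = occurrence_matrix(docs)
```

Row i of X is the vector of CUI `cuis[i]` over the corpus, so the rows for the input gene and for the biological process can be looked up by their CUIs.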

At this stage, the SVD is applied. The occurrence matrix is decomposed into a new CUI-by-domain matrix X, also called domain model matrix, expressing the degree of association between CUIs and conceptual domains. Thus, a generic element $x_{c,k} \in X$ represents how much the CUI c is related to the conceptual domain k.

Now, given two row vectors $\{x_i, x_j\} \in X$, corresponding to the CUIs i and j, the semantic correlation of the two CUIs is computed as the cosine similarity between the vectors $x_i$ and $x_j$. Fig. 6 shows a graphical representation of the resulting matrix of occurrences.

After applying the SVD, the biological correlation scores, which indicate the biological relationship between each gene in the input gene list and the biological process, are obtained by computing the cosine similarity between each gene CUI and the CUI of the biological process under study.
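The SVD and cosine steps can be sketched with NumPy as follows. This is an illustrative reading of the description, not the ad hoc LSI version used in Gelsius: scaling the left singular vectors by the singular values, the number of retained domains `k`, and the toy matrix are all assumptions, and the sketch returns the raw cosine without the mapping to [0,1].

```python
import numpy as np


def domain_matrix(X, k):
    """Project CUIs onto the k strongest conceptual domains via a truncated SVD."""
    U, s, _ = np.linalg.svd(np.asarray(X, dtype=float), full_matrices=False)
    return U[:, :k] * s[:k]  # one row per CUI, one column per domain


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0


# Toy corpus: the gene and the process never share a document, but both
# co-occur with a third CUI (rows: gene, process, mediator; columns: documents).
X = [[1, 0],
     [0, 1],
     [1, 1]]

raw_score = cosine(X[0], X[1])   # plain co-occurrence view of gene vs. process
D = domain_matrix(X, k=1)        # k=1 only to keep the toy example small
lsi_score = cosine(D[0], D[1])   # same pair, compared in the domain space
```

In this example the raw vectors of the gene and the process are orthogonal, while in the reduced domain space they point in the same direction, which illustrates how LSI can surface a link that direct co-occurrence misses.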

The cosine similarity is computed for each term (i.e., CUI) present in the original set of abstracts and therefore depends on the documents that the set contains. Among these CUIs, there are also the CUIs corresponding to the original pair (g, bp). In general, we are not interested in computing all the pairwise correlations. However, these correlations will be used to normalize the scores with respect to the document set. The results for the various (g, bp) pairs are compared with the BCI metric (explained in Section 4).

The final output of the tool is then a list of scores expressing the correlations between each gene and each biological process provided by the user as input.

3.4 Pairwise Correlation Comparison

The output of the LSI step is a list of scores for each gene and biological process. We refer to this type of metric as the semantic relationship score (SRS), expressing the correlation in a range between 0 and 1. The SRS is defined as the cosine similarity of two CUI vectors in the X matrix. Therefore, the higher the SRS, the greater the biological correlation according to the retrieved documents.

However, the SRS strongly depends on the corpus of documents because different corpora produce different correlation matrices and, thus, different results in terms of SRS. In fact, the cosine between two CUI vectors depends on the vector space, and the vector space changes according to the corpus of documents.

Thus, considering that Gelsius retrieves a specific corpus of documents for each gene-biological process pair, the SRSs computed on different corpora are not directly comparable. Since the input of Gelsius is a list of genes and biological processes and the final goal is to allow the comparison of correlation results across different searches, a metric called BCI has been introduced that enables the comparison between the LSI results obtained on different document sets. In this way, it is possible to evaluate the relevance of a gene across different biological processes as well as to rank the list of genes based on their correlation with a single biological process. In practice, we can map the LSI results into a normalized representation.

Fig. 5. Document abstract canonicalization. The text in doc_1, contained in the abstracts of documents coming from PubMed, is analyzed by Metamap. The concepts, such as VEGF-A, are detected and mapped to the corresponding UMLS CUIs to compose the article abstract in the form of UMLS CUI text.

Fig. 6. CUI-by-domain matrix. The CUIs of the gene term and of the biological process term each index a certain row in the matrix.

Additionally, since BCI allows cross comparisons, the scientist can evaluate the results against case studies where the relationships between genes and biological processes are well known.

We analyzed the SRS distribution (SRSD) on different X matrices to define a criterion that would allow us to compare SRSs resulting from different gene-biological process X matrices. The SRSD is defined as the density function that reports the percentage of occurrences, or frequencies, of the SRS scores computed between all the CUIs in the corpus of documents related to the specific gene-biological process pair.

The SRSD is divided into one hundred intervals of equal area, namely percentiles. In essence, the higher the number of occurrences of a certain SRS, the narrower the corresponding interval. We assign a progressive number to the intervals and define this number as the BCI. The higher the BCI, the greater the relative biological correlation between the biological process and the gene.
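As an illustration of the percentile construction, the following sketch assigns a BCI to a given SRS. The function name bci, the handling of ties, and the convention that an SRS lying on an interval boundary belongs to the lower interval are assumptions made for the example; they are not prescribed by the workflow.

```python
from bisect import bisect_right


def bci(all_srs, srs):
    """Return the 1..100 index of the equal-area interval containing srs.

    all_srs: every SRS computed on the corpus of one search (the SRSD sample).
    srs:     the SRS of the gene-process pair of interest.
    """
    if not all_srs:
        raise ValueError("the SRS distribution is empty")
    ranked = sorted(all_srs)
    n = len(ranked)
    # Number of scores that are lower than or equal to the target score.
    count = bisect_right(ranked, srs)
    # Integer ceiling of 100 * count / n, clamped to the valid range.
    index = (100 * count + n - 1) // n
    return max(1, min(100, index))


# Toy distribution: 200 evenly spread scores (hypothetical values).
scores = list(range(1, 201))
top = bci(scores, 200)   # 100: no score in the corpus is higher
mid = bci(scores, 136)   # 68: 68 percent of the scores are not higher
low = bci(scores, 1)     # 1: lowest interval
```

Because the index depends only on the rank of the SRS within its own distribution, two scores obtained on different corpora can be compared through their BCIs even when their absolute values differ.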

To explain how BCI works, we discuss an example of comparison among four correlation analyses involving four different biological processes. For each analysis, we focus on the SRS result of a single gene and its BCI. Among the various cases, we suppose that the scientist knows that Angiogenesis and VEGFA are strongly correlated. The BCI obtained from the Angiogenesis-VEGFA search is then used as a reference to evaluate the other correlation scores: BCL2-Apoptosis, BCL-XL-Antiapoptosis, and AKT1-Signal Transduction Pathways.

Fig. 7 plots the SRSD of the four independent analyses. The quantity on the y-axis is the percentage of occurrences of the SRSs. BCIs are emphasized by the dashed lines. To highlight the role of the BCI metric, we compare Figs. 7a and 7c, referred to as the first case study, and Figs. 7b and 7d, referred to as the second case study.

The first case study compares the correlation analysis between the VEGFA gene and the Angiogenesis process (Fig. 7a) with that between BCL-xL and Antiapoptosis (Fig. 7c). The vertical solid line highlights the SRS of the specific gene under analysis. In this experiment, the SRS of the VEGFA gene (52.3) (Fig. 7a) is clearly higher than the SRS of the BCL-xL gene (29.5) (Fig. 7c), but both SRSs fall in a region of the SRSD with relatively high SRS that contains very few terms. In essence, both genes are highly correlated with their respective biological process relative to the other terms. As a result, both VEGFA and BCL-xL obtain the highest BCI.

This is in accordance with biological evidence. In fact, the VEGFA gene is strongly involved in the Angiogenesis process [11], just as BCL-xL strongly regulates the Antiapoptosis process [15]. However, their SRSs are not equal. This is because, while VEGFA occurs in most of the documents concerning Angiogenesis, BCL-xL, as part of the BCL-2 family, may not be explicitly mentioned in all the documents. Thus, its SRS is lower than the score obtained for VEGFA.

The second case study highlights the opposite situation, in which a similar SRS corresponds to a large gap in terms of BCI. In this case, we compare the results of the correlation analysis between the BCL2 gene and Apoptosis (Fig. 7b) with those between AKT1 and Signal Transduction Pathways (Fig. 7d). The SRS of the BCL2 gene confirms that BCL2 is strongly involved in Apoptosis [16], and its value (36.1) corresponds to the highest BCI because it is one of the highest SRSs in the SRSD. Conversely, the AKT1 gene presents a similar SRS (23.5), only about 13 points lower than the BCL2 score, but its BCI is 68 compared with the 100 achieved by BCL2. This implies that, according to the information contained in the documents collected during the document abstract retrieval phase, there are other genes having a greater biological relationship with the biological process. In fact, using Gene Ontology as a source of validation, the term signal transduction (GO:0007165) has about 13,004 gene products, while the term apoptosis (GO:0006915) has 2,673 gene products. Consequently, compared to apoptosis, signal transduction involves a wider set of genes.

624 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 10, NO. 3, MAY/JUNE 2013

Fig. 7. SRSD plots for four experimental runs. Each histogram reports the density function distribution of all the possible SRSs in the experimental run. The area below the distribution has been divided into one hundred equal parts, each corresponding to an interval on the SRS axis. The dashed lines delineate the intervals corresponding to 10 percentiles. The green line next to the arrow highlights the position of the SRS within the overall distribution. In Figs. 7a and 7c, we compare the correlation analysis between the VEGFA gene and the Angiogenesis process against that between BCL-xL and Antiapoptosis. The SRS of the VEGFA gene is higher than the SRS of the BCL-xL gene, but both genes are among the terms most correlated with their respective biological processes. In Figs. 7b and 7d, we analyze the correlation between the BCL2 gene and the Apoptosis process and between the AKT1 gene and Signal Transduction Pathways. The BCL2 and AKT1 genes have similar SRSs, but the AKT1 gene is not among the terms most correlated with the Signal Transduction Pathways biological process.

On the other hand, since BCL2 is highly correlated with apoptosis, the two co-occur in the same document far more often than other gene-apoptosis pairs. That is why BCL2 has a high SRS and a high BCI, while AKT1 has a similar SRS but a lower BCI.

This analysis demonstrates that BCI is an effective way to interpret an individual SRSD and that it allows the comparison of results coming from different analysis runs. For the sake of completeness, both SRS and BCI are reported in the following experimental results.

4 EXPERIMENTAL ANALYSIS

In this section, we first evaluate the impact of the semantic groups on the correlation score. Then, we show the results of the correlation between a number of genes and biological processes, and we show that they match current scientific knowledge on the one hand and highlight unknown correlations on the other.

4.1 Terms Expansion Effects

To highlight the effect of semantic group filtering (Section 3.1), we computed the degree of biological correlation between a set of five input genes (VEGFA, ANGPT2, ANGPT1, BCL-XL, P53) and two biological processes (Angiogenesis and Apoptosis). In further detail, we calculated the biological correlation score between each process and gene under the following three semantic group filtering cases:

• Case A. All the semantic groups are considered, with no filtering action on the ECS.

• Case B. The GENE, ANAT, CHEM, PHEN, and PHYS semantic groups are selected (see Table 1). All the expanded keywords belonging to other semantic groups are automatically excluded.

• Case C. Only the GENE semantic group is considered, and only the expanded keywords belonging to it take part in the ECS.
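The three filtering cases can be sketched as a simple selection over the expanded keywords. This is only an illustrative sketch: the keyword list, its semantic group labels, and the names CASES and filter_keywords are hypothetical and do not reproduce the actual expansion performed by the tool.

```python
# Semantic groups admitted in each case; None means "no filtering" (Case A).
CASES = {
    "A": None,
    "B": {"GENE", "ANAT", "CHEM", "PHEN", "PHYS"},
    "C": {"GENE"},
}


def filter_keywords(expanded, allowed_groups):
    """Keep the expanded keywords whose semantic group is admitted."""
    if allowed_groups is None:
        return [keyword for keyword, _ in expanded]
    return [keyword for keyword, group in expanded if group in allowed_groups]


# Hypothetical expanded keywords, each tagged with a semantic group.
expanded_keywords = [
    ("VEGFA gene", "GENE"),
    ("vascular endothelial growth factor A", "CHEM"),
    ("blood vessel", "ANAT"),
    ("cell proliferation", "PHYS"),
    ("glioma", "DISO"),
]

case_a = filter_keywords(expanded_keywords, CASES["A"])  # all five keywords
case_b = filter_keywords(expanded_keywords, CASES["B"])  # "glioma" is dropped
case_c = filter_keywords(expanded_keywords, CASES["C"])  # only "VEGFA gene"
```

The narrower the set of admitted groups, the shorter the query used to retrieve abstracts, which is the effect examined in the results that follow.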

Fig. 8 depicts the results for all three cases. First, a general consideration applicable to the results of all the semantic groups can be drawn. The VEGFA, ANGPT2, and ANGPT1 genes are more involved in the Angiogenesis biological process, and they are generally involved in cell growth processes [11], [12]. The BCI results reflect this information, as these genes present the highest values in Fig. 8a. Conversely, according to the data in Figs. 8a and 8b, AKT1 and BAD are correctly the most correlated with the Apoptosis process (see the definitions in the NCI Thesaurus [25]: AKT1, “This gene is involved in signal transduction and negative regulation of apoptosis,” and BAD, “The gene plays a role in the positive regulation of cellular apoptosis”), compared to the lower scores of ANGPT1 and ANGPT2.

Considering the semantic groups, it can be observed in Fig. 8a that the correlation values of ANGPT2 and ANGPT1 sharply decrease when only the expanded keywords belonging to the GENE semantic group are considered (Case C) during the analysis run. The reason for this decrease is the corresponding reduction of the set of expanded keywords composing the query used to retrieve PubMed abstracts, which reduces the completeness of the text analysis. In particular, from Case B to Case C, 56 percent of the expanded keywords are cut for the ANGPT2 gene and 60 percent for ANGPT1, whereas from Case A to Case B, the cut is 3 and 2 percent, respectively. It is noteworthy that 48 and 45 percent of the expanded keywords belonging to the Amino Acid, Peptide, or Protein semantic type (CHEM semantic group) are cut for ANGPT2 and ANGPT1, respectively, thus also removing terms that link the gene to the biological process under investigation.

Fig. 8. Biological correlation scores between a set of five genes and two biological processes (Angiogenesis, Fig. 8a, and Apoptosis, Fig. 8b). The results, reported in BCI, refer to the three distinct document sets. Each document set is the result of a different filtering on the semantic groups.

Therefore, this example shows that restricting the selection of semantic groups impacts the correlation score. As a guideline for the scientist, we conclude that, for a generic search, it is more convenient to include at least the five most commonly related semantic groups of Case B. However, the method is robust enough to filter out the additional noise created by uncorrelated documents. Indeed, the results for Case A, where all the semantic groups are included, are very similar to the results obtained for Case B.

4.2 Correlation Results

To validate the effectiveness of the proposed workflow, we measured the degree of correlation between a set of input genes (VEGFA, ANGPT2, ANGPT1, AKT1, BCL-XL, BCL2, P53, PTEN, MAPK1, CCND1) and six significant biological processes (Angiogenesis, Vascular Permeability, Tumor Suppressor, Signal Transduction, Cell Cycle, and Apoptosis). For all the experimental analyses, we consider a keyword expansion strategy using five semantic groups: GENE, ANAT, CHEM, PHEN, and PHYS (referred to as Case B in Section 4.1; see footnote 2). These benchmarks have been selected because most of the correlations between genes and biological processes are known with a certain confidence. In particular, we take as a reference the definitions from the NCI Thesaurus [25].

The functional definitions of the considered gene set are reported in Table 2. We use them to demonstrate the coherence of the quantified correlation results with a trustworthy source of biomedical information.

Besides the NCI definitions, selected literature studies are used as a reference. It is worth noting an important distinction between the use of literature documents in the search phase and their use in the validation phase. In the first case, a large collection of documents and terms is automatically analyzed. In the second case, trusted and selected papers are used to report the current thinking of scientists about the correlation of a certain gene and biological process. As such, there is a clear distinction between the two phases. Moreover, Fig. 9 reports the histogram of the measurements expressed both in SRS and in BCI, while Fig. 10 reports the BCI using a different view, where scores are grouped by biological process (Fig. 10a) or by gene (Fig. 10b). These views help in understanding the comparative results discussed in what follows. To guide the discussion, each paragraph below is associated with a number that is referenced in Figs. 9 and 10.

4.2.1 Angiogenesis, Vascular Permeability and Apoptosis

Angiogenesis and Vascular Permeability are biologically correlated processes that are essential for the proliferation and survival of malignant glioma cells, which are responsible for the most devastating tumors [23], [24]. The analysis reported in Fig. 9b shows that, for both processes, the VEGFA, ANGPT1, and ANGPT2 genes are highly correlated, in accordance with the literature [11], [12]. This is further evident in Fig. 10b, where a cluster of these three genes with an elevated score corresponds to the angiogenesis process.

TABLE 2. Gene Definition from NCI Thesaurus

Fig. 9. Histogram representation of biological correlation scores between eleven genes and six biological processes. The scores are reported both in SRS (Fig. 9a) and BCI (Fig. 9b). To guide the discussion, the numbers reported along the histograms correspond to the numbers of the experimental section paragraphs.

2. These results may show some variations with respect to the results in Fig. 7 because the latter were obtained using all the semantic groups.

Concerning Apoptosis, the analysis of BCI reveals a relevant role of MAPK1, besides the main role of the BCL2, BCL-xL, AKT1, and P53 genes. Indeed, MAPK1 is sometimes associated with Apoptosis because it suppresses the apoptotic effect of some BCL2 family genes. This information is extremely important, especially for scientists interested in cancer-related topics. The correlation score between MAPK1 and Apoptosis is a BCI of 55. This value is relevant even if it is substantially lower than those of the BCL2, BCL-xL, AKT1, and P53 genes, which are more directly related to Apoptosis.

4.2.2 The Role of BCL-XL Gene in Vascular Permeability

The analysis highlights a remarkable correlation of the BCL-XL gene with Vascular Permeability, while it confirms, as highlighted by the results (Fig. 9b) and in accordance with its NCI definition (Table 2), its main involvement in the Apoptosis regulatory process. The correlation with Vascular Permeability is due to the fact that Angiogenesis, Vascular Permeability, and Apoptosis are biological processes discussed in many papers about tumor therapies. Therefore, Vascular Permeability and the BCL-xL gene very likely co-occur in many abstracts. In this case, the tool proposes a correlation that is not explicitly described in the literature references.

4.2.3 The Correlation of VEGFA with Apoptosis

Similar considerations can be applied to the correlation results of VEGFA. In fact, even if this gene is a known promoter of Angiogenesis and is strongly involved in cell growth, the results in Fig. 9b suggest a very high correlation (BCI equal to 70 percent) with Apoptosis because both processes are often associated with tumor therapies. Moreover, VEGFA mutations are sometimes directly involved in Apoptosis, as, for example, in the case of VEGFR ubiquitination and of mutations/deletions in the kinase domain of VEGFR, which induce Apoptosis in hormone-refractory breast cancer.

4.2.4 Other Genes Involved in Apoptosis and Tumor Suppressor

Genes BCL-2 and BCL-XL (belonging to the BCL-2 family), as well as P53 and AKT1, show a relevant correlation with Apoptosis (see Fig. 9). The remarkable scores of the BCL-2 family genes are confirmed by the NCI definitions (see Table 2). It can be observed that P53 is mostly involved in Apoptosis (again in agreement with its NCI definition) but also in the Tumor Suppressor process (the BCI reported in Fig. 9b is 67 percent in this case), which indeed triggers the apoptosis of tumor cells (i.e., the two biological processes are correlated). Note that, comparing the BCIs of the other genes with respect to Tumor Suppressor (see Fig. 10b), that of P53 is the highest among all the gene scores, highlighting its relevant role in tumor suppression. This correlation is not explicit in its NCI definition.

4.2.5 The Role of AKT1

The high correlation scores reported for the AKT1 gene with the Apoptosis (94 percent) and Signal Transduction (92 percent) biological processes are expected, since this gene is specifically “involved in signal transduction and negative regulation of apoptosis” (see the definition in [25]).

4.2.6 The Genes Correlated with a Cell Cycle Process

The CCND1 gene, which belongs to the cyclin gene family that controls the cell cycle through the cyclin-dependent kinases, obtains a score of 86 percent with the Cell Cycle process (see the gene's NCI definition above). However, the results suggest that a set of genes including AKT1, BCL-XL, ANGPT2, and P53 has a relevant correlation score as well, because these genes are involved in processes that take part in or indirectly deal with the cell cycle.

4.2.7 Comparison between SRS and BCI Results

The comparison between Figs. 9a and 9b highlights the effectiveness of the BCI metric in enhancing some key correlations.

For instance, in the case of more general processes, such as Signal Transduction and Cell Cycle, due to the high number of keywords and documents involved in the analysis, the SRS range is limited, reaching maximum values of 23.3 and 14 percent, respectively. Conversely, in the case of the more specific processes Angiogenesis and Tumor Suppressor, the maximum of the range is 63 and 43 percent, respectively (see Fig. 9a).

Fig. 10. Point-graph representation of biological correlation scores between 11 genes and six biological processes. All the scores are reported in BCI. Fig. 10a plots the scores with the set of genes on the abscissa. Conversely, in Fig. 10b, all the scores are reported with the biological processes on the abscissa. The point-graph representation allows the immediate identification of clusters of scores that highlight the correlation between genes and biological processes. To guide the discussion, the numbers reported next to the circles correspond to the numbers of the experimental section paragraphs.

This difference makes the comparison across processes much harder. For instance, the AKT1 gene is the most correlated with the (more generic) Signal Transduction process. In the same way, VEGFA is the gene most correlated with the (more specific) Angiogenesis process. Since the correlation expressed in BCI is normalized with respect to the SRS range, the impact of this difference is filtered out. Therefore, as expected, the AKT1 gene and Signal Transduction have a BCI correlation value of 92 percent, consistent with the correlation value of 100 percent between Angiogenesis and VEGFA (see Fig. 9b).

4.3 Experiments on a Large Gene Pool

To evaluate the scalability of the proposed tool, we performed two additional experiments. The first one is based on a study about the anaplastic lymphoma kinase (ALK) signature in anaplastic large cell lymphomas (ALCLs). In our experiment, we evaluated the correlation between ALCL as a biological process and a pool of 176 genes whose involvement in the ALK pathway was explored in [39]. We used Gelsius to perform the analysis of correlation between ALCL and this set of genes and to identify the most correlated ones. The computation took less than 1 hour. However, note that computation times are strongly affected by caching effects.

In the second additional experiment, we computed the correlation of the same set of genes with Neuroblastoma. In this case, we expected a lower number of genes showing significant correlation values (i.e., higher than 50 percent).

For these two experiments, we report the correlation results in Fig. 11. Among the large set of genes, we highlight some of them for the sake of discussion. In particular, concerning the evaluation of genes correlated with ALCL, we identified several strongly correlated genes. Among them are TUBA3 with a correlation of 100 percent, ARHGAP8 (95 percent), GAS1 (88 percent), DKC1 (95 percent), ICOS (73 percent), IL2RA (74 percent), BCL2A1 (77 percent), and, as expected, ALK (100 percent) and TNFRSF8 (90 percent). These results are in accordance with recent experimental findings [39], [40], [41] that demonstrated the strong correlation of these genes with ALCL.

In the evaluation of the correlation between neuroblastoma and the same set of genes, we obtained, as stated above, a lower number of strongly correlated genes. However, among them we can identify TUBA3 with a correlation of 93 percent, UBE2C (87 percent), and MYCN (100 percent). In this case as well, these results are in agreement with recent experimental findings [42], [43].
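The selection of the most correlated genes described above can be sketched as a threshold and ranking step. The scores below are the values quoted in the text for the highlighted genes only; the function name significant_genes and the tie-breaking by gene symbol are assumptions introduced for this example.

```python
# Correlation scores (percent) quoted in the text for the highlighted genes.
ALCL = {
    "TUBA3": 100, "ARHGAP8": 95, "GAS1": 88, "DKC1": 95, "ICOS": 73,
    "IL2RA": 74, "BCL2A1": 77, "ALK": 100, "TNFRSF8": 90,
}
NEUROBLASTOMA = {"TUBA3": 93, "UBE2C": 87, "MYCN": 100}


def significant_genes(scores, threshold=50):
    """Return (gene, score) pairs above the threshold, highest score first."""
    kept = [(gene, score) for gene, score in scores.items() if score > threshold]
    # Sort by decreasing score; ties are broken alphabetically by gene symbol.
    return sorted(kept, key=lambda item: (-item[1], item[0]))


alcl_ranked = significant_genes(ALCL)
neuro_ranked = significant_genes(NEUROBLASTOMA)

# Highlighted genes that are strongly correlated with both diseases.
shared = sorted(set(ALCL) & set(NEUROBLASTOMA))  # ["TUBA3"]
```

The same step, applied repeatedly with the gene list of an experiment as input, is one way a gene list could be pruned by correlation score.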

4.4 Comparison with GO-Based Tools

With respect to GO-based correlation computation tools such as G-SESAME, we point out that the proposed approach does not represent an alternative to the tools based on GO. GO is better suited to specific searches where GO terms are well known and identified. The tools that exploit GO for computing correlations between terms/keywords (or genes) are generally based on Wang’s algorithm [4] and, in most cases, they need the GO IDs as input. For this reason, different tools mapping terms onto GO IDs (such as AMIGO [22], QuickGO [21], GOA [21], GOSemSim [20], etc.) must run before computing the correlation between terms. Since several GO IDs can result for a single keyword, a manual process might be required to select the GO term that best represents the term/keyword itself. The proposed approach does not require the identification of GO terms. While this characteristic makes the tool more subject to noise, it provides user friendliness and flexibility. The proposed tool is more suitable for correlation analysis between biological processes and genes than GO-based correlation tools. Additionally, even considering the most appropriate GO terms, in some cases the GO-based correlation between biological processes and genes lacks accuracy. For the sake of comparison, we employed the AMIGO, GOA, and QuickGO GO-term search tools and GeneCards on the same case studies discussed in the paper to obtain the GO IDs. Then, we used the G-SESAME tool to calculate the correlations. Because of space limitations, we report a single relevant result concerning the correlation analysis between VEGFA and Angiogenesis. The most correlated GO term gives a correlation score of around 7 percent, while we expect a much higher correlation value according to the role of the VEGFA gene in the Angiogenesis process. Overall, these results highlight that the proposed tool overcomes some of the limitations of the currently available correlation computation tools.

Fig. 11. Correlation between a set of 178 genes and ALCL (blue bars) and neuroblastoma (red bars). The genes are sorted from the least to the most correlated. Some of the most correlated genes are highlighted.

5 CONCLUSION

The analysis workflow presented in this paper provides quantitative measurements of correlation among genes and biological processes from scientific publications. The workflow has been implemented using term expansion and document retrieval to find subsets of documents to be used for correlation extraction with an LSI approach. Each step has been customized for the purpose of getting reliable correlation quantification. To complete the workflow, a metric has been introduced to enable the comparison of correlation results obtained on different subsets of documents.

Results about semantic group filtering demonstrate that the workflow is robust with respect to the selection of semantic types for keyword expansion. However, they point out that a too restrictive selection using a single semantic group results in poor correlation quantification. On the other hand, the results of the correlation analysis conducted on six biological processes and 10 genes demonstrate that the workflow is reliable, as it confirms known correlations. Moreover, it allows the discovery of unknown or less explicit ones.

We conclude that the Gelsius tool we developed, which implements the proposed workflow, can be used by the scientist not only to perform an effective and quantitative exploration of the literature, but also to better understand and filter the results of high-throughput experiments (microarray analyses, NGS).

In the future, we would like to insert Gelsius in an NGS experiment analysis loop, where the gene list of interest is pruned in an iterative way by means of the correlation scores.

REFERENCES

[1] J.P. de Magalhães et al., “Next-Generation Sequencing in Aging Research: Emerging Applications, Problems, Pitfalls and Possible Solutions,” Ageing Research Rev., vol. 9, pp. 315-323, 2009.

[2] M. Kircher, “High-Throughput DNA Sequencing Concepts and Limitations,” Bioessays, vol. 32, pp. 524-536, 2010.

[3] B.L. Humphreys et al., “The Unified Medical Language System: An Informatics Research Collaboration,” J. Am. Medical Informatics Assoc., vol. 5, no. 1, pp. 1-11, 1998.

[4] J.Z. Wang et al., “A New Method to Measure the Semantic Similarity of GO Terms,” Bioinformatics, vol. 23, pp. 1274-1281, 2007.

[5] A.T. McCray et al., Aggregating UMLS Semantic Types for Reducing Conceptual Complexity. IOS Press, 2001.

[6] J. Chabalier et al., “A Transversal Approach to Predict Gene Product Networks from Ontology-Based Similarity,” BMC Bioinformatics, vol. 8, article 235, 2007.

[7] R. Homayouni et al., “Gene Clustering by Latent Semantic Indexing of MEDLINE Abstracts,” Bioinformatics, vol. 21, pp. 104-115, Jan. 2005.

[8] A.M. Gliozzo et al., “Domain Kernels for Text Categorization,” Proc. Ninth Conf. Computational Natural Language Learning, June 2005.

[9] N. Cristianini et al., “Latent Semantic Kernels,” J. Intelligent Information Systems, vol. 18, pp. 127-152, 2002.

[10] Nat’l Cancer Inst., “NCI Term Browser,” http://nciterms.nci.nih.gov, 2013.

[11] C. Cristina et al., “Increased Pituitary Vascular Endothelial Growth Factor-a in Dopaminergic D2 Receptor Knockout Female Mice,” Endocrinology, vol. 146, pp. 2952-2962, 2005.

[12] V. Nguyen et al., “Differential Response of Lymphatic, Venous and Arterial Endothelial Cells to Angiopoietin-1 and Angiopoietin-2,” BMC Cell Biology, vol. 8, article 10, 2007.

[13] I. Shiojima et al., “Regulation of Cardiac Growth and Coronary Angiogenesis by the Akt/PKB Signaling Pathway,” Genes and Development, vol. 20, pp. 3347-3365, 2006.

[14] P. Jiang et al., “The Bad Guy Cooperates with a Good Cop p53: Bad Is Transcriptionally Up-Regulated by p53 and Forms Bad/p53 Complex at the Mitochondria to Induce Apoptosis,” Molecular Cellular Biology, vol. 26, pp. 9071-9082, 2006.

[15] B.S. Chang et al., “The BH3 Domain of Bcl-xS Is Required for Inhibition of the Antiapoptotic Function of Bcl-xL,” Molecular Cellular Biology, vol. 19, pp. 6673-6681, 1999.

[16] J. Kim et al., “Glucose-Dependent Insulinotropic Polypeptide-Mediated Up-Regulation of β-Cell Antiapoptotic Bcl-2 Gene Expression Is Coordinated by Cyclic AMP (cAMP) Response Element Binding Protein (CREB) and cAMP-Responsive CREB Coactivator 2,” Molecular Cellular Biology, vol. 28, pp. 1644-1656, 2008.

[17] Medical Subject Headings, http://www.nlm.nih.gov/mesh/meshhome.html, 2013.

[18] F. Vazquez et al., “Phosphorylation of the PTEN Tail Regulates Protein Stability and Function,” Molecular Cellular Biology, vol. 20, pp. 5010-5018, 2000.

[19] C. Pesquita et al., “Semantic Similarity in Biomedical Ontologies,” PLoS Computational Biology, vol. 5, no. 7, article e1000443, 2009.

[20] G. Yu et al., “GOSemSim: An R Package for Measuring Semantic Similarity among GO Terms and Gene Products,” Bioinformatics, vol. 26, pp. 976-978, 2010.

[21] D. Binns et al., “QuickGO: A Web-Based Tool for Gene Ontology Searching,” Bioinformatics, vol. 25, pp. 3045-3046, 2009.

[22] M. Ashburner et al., “Gene Ontology: Tool for the Unification of Biology. The Gene Ontology Consortium,” Nat. Genetics, vol. 25, no. 1, pp. 25-29, May 2000.

[23] M.A. Meser et al., “Changes in Vascular Permeability and Expression of Different Angiogenic Factors Following Anti-Angiogenic Treatment in Rat Glioma,” PLoS ONE, vol. 5, no. 1, article e8727, 2010.

[24] R.H. Goldbrunner et al., “PTK787/ZK222584, an Inhibitor of Vascular Endothelial Growth Factor Receptor Tyrosine Kinases, Decreases Glioma Growth and Vascularization,” Neurosurgery, vol. 55, pp. 426-432, 2004.

[25] NCI Thesaurus, http://ncit.nci.nih.gov, 2013.

[26] B. Markova et al., “Novel Pathway in Bcr-Abl Signal Transduction Involves Akt-Independent, PLC-gamma1-Driven Activation of mTOR/p70S6-Kinase Pathway,” Oncogene, vol. 29, pp. 739-751, 2010.

[27] U. Galderisi et al., “Cell Cycle Regulation and Neural Differentiation,” Oncogene, vol. 22, pp. 5208-5219, 2003.

[28] X. Hu et al., “A Semantic Approach for Mining Hidden Links from Complementary and Non-Interactive Biomedical Literature,” Proc. SIAM Conf. Data Mining, pp. 200-209, Apr. 2006.

[29] Z. Weizhong et al., “Using UMLS-Based Re-Weighting Terms as a Query Expansion Strategy,” Proc. IEEE Int’l Conf. Granular Computing, pp. 217-222, 2006.

[30] A.R. Aronson et al., “Query Expansion Using the UMLS Metathesaurus,” Proc. AMIA Ann. Fall Symp., pp. 485-489, 1997.

[31] G. Leroy et al., “Meeting Medical Terminology Needs: The Ontology-Enhanced Medical Concept Mapper,” IEEE Trans. Information Technology in Biomedicine, vol. 5, no. 4, pp. 261-270, Dec. 2001.

[32] C.G. Chute, “An Evaluation of Concept Based Latent Semantic Indexing for Clinical Information Retrieval,” Proc. Ann. Symp. Computer Application Medical Care, pp. 639-643, 1992.

[33] C.G. Chute, “Latent Semantic Indexing of Medical Diagnoses Using UMLS Semantic Structures,” Proc. Ann. Symp. Computer Application Medical Care, pp. 639-643, 1992.

[34] T. Pedersen et al., “Measures of Semantic Similarity and Relatedness in the Biomedical Domain,” J. Biomedical Informatics, vol. 40, pp. 288-299, June 2007.


[35] G. Tsatsaronis et al., “Text Relatedness Based on a Word Thesaurus,” J. Artificial Intelligence Research, vol. 37, pp. 1-40, Jan. 2010.

[36] F. Abate et al., “An Automated Tool for Scoring Biomedical Terms Correlation Based on Semantic Analysis,” Proc. Int’l Conf. Complex, Intelligent and Software Intensive Systems (CISIS), 2010.

[37] W. Hersh et al., “Assessing Thesaurus-Based Query Expansion Using the UMLS Metathesaurus,” Proc. Am. Medical Informatics Assoc. Symp., 2000.

[38] Y. Guo et al., “Sheffield University and the TREC 2004 Genomics Track: Query Expansion Using Synonymous Terms,” Proc. 13th Text REtrieval Conf., 2004.

[39] R. Piva et al., “Functional Validation of the Anaplastic Lymphoma Kinase Signature Identifies CEBPB and BCL2A1 as Critical Target Genes,” J. Clinical Investigation, vol. 116, pp. 3171-3182, 2006.

[40] R. Piva et al., “Gene Expression Profiling Uncovers Molecular Classifiers for the Recognition of Anaplastic Large-Cell Lymphoma within Peripheral T-Cell Neoplasms,” J. Clinical Oncology, vol. 28, no. 9, pp. 1583-1590, 2012.

[41] L. Agnelli et al., “Identification of a 3-Gene Model as a Powerful Diagnostic Tool for the Recognition of ALK-Negative Anaplastic Large-Cell Lymphoma,” Blood, vol. 120, pp. 1274-1281, 2012.

[42] S. De Brouwer et al., “Meta-Analysis of Neuroblastomas Reveals a Skewed ALK Mutation Spectrum in Tumors with MYCN Amplification,” Clinical Cancer Research, vol. 16, pp. 4353-4362, 2010.

[43] K. De Preter et al., “Accurate Outcome Prediction in Neuroblastoma Across Independent Data Sets Using a Multigene Signature,” Clinical Cancer Research, vol. 16, pp. 1532-1541, 2010.

[44] The Cancer Genome Atlas Research Network, “Comprehensive Genomic Characterization of Squamous Cell Lung Cancers,” Nature, vol. 489, pp. 519-525, 2012.

[45] F. Abate, A. Acquaviva, G. Paciello, C. Foti, E. Ficarra, A. Ferrarini, M. Delledonne, I. Iacobucci, S. Soverini, G. Martinelli, and E. Macii, “Bellerophontes: An RNA-Seq Data Analysis Framework for Chimeric Transcripts Discovery Based on Accurate Fusion Model,” Bioinformatics, vol. 28, no. 16, pp. 2114-2121, 2012.

[46] F. Abate, C. Foti, A. Acquaviva, and E. Ficarra, “Integration of Literature with Heterogeneous Information for Genes Correlation Scoring,” to be published in ACM J. Emerging Technologies in Computing Systems.

Francesco Abate received the bachelor’s and master’s degrees from Politecnico di Torino in December 2004 and July 2007, respectively. He received the PhD degree in system and computer engineering from the Department of Control and Computer Engineering, Politecnico di Torino, in 2011. In 2008, he worked with the CAD Group, Politecnico di Torino, in the field of fault tolerance and embedded systems, and from June 2008 to December 2008, he was a visiting student at Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil. In 2009, he joined the EDA group, where he worked on the design of data mining tools applied to bioinformatics and the development of next-generation sequencing analysis pipelines. Since November 2011, he has been with the Rabadan Lab at Columbia University as a postdoctoral research scientist. He is currently working on the development of several pipelines mainly focused on the detection, characterization, and annotation of gene fusions and genomic translocations in RNA-Seq data. He is also working on the analysis of the effect of point mutations, copy number variations, and genetic aberrations in several cancer diseases, mainly anaplastic large cell lymphoma. His recent activity also involves the study and the discovery of new pathogens in RNA/DNA-Seq samples, and it aims at recognizing the role of new pathogens in the development of cancer. His research activity is focused on computational biology applied to cancer research. He is a member of the IEEE.

Andrea Acquaviva received the MSc degree from the University of Ferrara, Italy, and the PhD degree in electrical engineering from Bologna University, Italy. He is an assistant professor in the Department of Control and Computer Engineering, Politecnico di Torino. He was a research intern at Hewlett Packard Labs in Palo Alto, California, from 2001 to 2003 and a visiting researcher at the Ecole Polytechnique Federale de Lausanne from 2005 to 2007. He was a research consultant for Freescale Semiconductor from 2004 to 2006. He is currently a coordinator of the European FP7 Project TOUCHMORE regarding the design of a toolchain for next-generation heterogeneous multicore platforms. He is also involved in various FP7 and ARTEMIS European-funded projects concerning the design of software for next-generation multicore platforms. His research interests focus on low-power software for multicore and distributed wireless systems as well as bioinformatics (computational genetics, molecular simulation). In these fields, he is coauthor of more than 100 publications in international journals and peer-reviewed international conference proceedings. He is a member of the Executive Committee of the IEEE DATE conference, and he is a member of the technical program committee of several IEEE/ACM conferences, including DATE and ISLPED. He is a member of the IEEE.

Elisa Ficarra received the PhD degree from Politecnico di Torino in 2006. She has been an assistant professor at the Politecnico di Torino, Italy, since 2008. Her research interests include algorithms for biomolecular imaging and analyses of transcriptional and posttranscriptional regulatory mechanisms. In particular, her interests include algorithms for WES and RNA deep sequencing, microarray data analysis, clinical genomics, HTS-CLIP data analysis, miRNA regulation network design, tissue microarray data analysis, and high-resolution bioimaging. She is involved in many international collaborations and European projects. She is an invited member of many bioinformatics societies (e.g., ISCB, EMB), a reviewer for many international journals (including Nature Nanotechnology), a member of several IEEE/ACM TPCs, and an associate editor of IEEE Transactions on Information Technology in Biomedicine. She has several publications in international journals and peer-reviewed international conference proceedings as well as book chapters. She is a member of the IEEE.

630 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 10, NO. 3, MAY/JUNE 2013

Roberto Piva received an honors degree in biology and a doctoral degree in neurological sciences, and completed a residency in clinical pathology. He is an associate professor in Laboratory Medicine, University of Turin, and an adjunct assistant professor in pathology at the New York University Langone Medical Center. During the first part of his career, he focused his research on the study of the molecular mechanisms of cell death in human and mouse models of neurodegenerative disorders. From 1999 to 2002, he served as a postdoctoral fellow in the Department of Pathology, New York University, where he acquired expertise on the cell cycle, protein degradation through the ubiquitin pathway, and transgenic mouse models. From 2002 to 2003, he was a senior scientist for a biotech company, where he investigated the antitumoral activity of NF-kB inhibitors in B-cell malignancies. In 2004, he moved to the Centre for Experimental Research in Medicine, University of Turin, where he established his own laboratory as a recipient of the Brain Gain grant program from the Italian Ministry of University and Research. Since then, his research has been oriented toward the dissection of ALK signaling and the validation of therapeutic targets for haematological malignancies using a combination of gene expression profiling and functional screenings. He is the author of 50 peer-reviewed publications, a principal investigator on several funded research grants, and a supervisor of independent research.

Enrico Macii received the laurea degree in electrical engineering from Politecnico di Torino in 1990, the laurea degree in computer science from Universita di Torino in 1991, and the PhD degree in computer engineering from Politecnico di Torino in 1995. He is a full professor of computer engineering at Politecnico di Torino. Prior to that, he was an associate professor (from 1998 to 2001) and an assistant professor (from 1993 to 1998) at the same institution. From 1991 to 1997, he was also an adjunct faculty member at the University of Colorado, Boulder. Since 2007, he has been the vice rector for Research, Technology Transfer and EU Affairs at Politecnico di Torino. His research interests are in the design of electronic digital circuits and systems, with particular emphasis on low-power consumption aspects. In this field, he has authored around 400 scientific publications. He is a senior member of the IEEE.

