Discovering market trends in the biotechnology industry


184 Int. J. Business Intelligence and Data Mining, Vol. 6, No. 2, 2011

Copyright © 2011 Inderscience Enterprises Ltd.

Discovering market trends in the biotechnology industry

Haralampos Karanikas* Knowledge Management Lab., National & Kapodistrian University of Athens, Ano Ilisia 15771, Athens, Greece E-mail: [email protected] *Corresponding author

George Koundourakis General Secretariat for Information Systems, Ministry of Economy and Finance, Nikis 5-7, 10180, Athens, Greece E-mail: [email protected]

Ioannis Kopanakis Technological Educational Institute of Crete, Psillinaki 42, Ierapetra, GR-72200, Crete, Greece E-mail: [email protected] Website: www.csd.uoc.gr/~kopanak/

Thomas Mavroudakis Knowledge Management Lab., National & Kapodistrian University of Athens, Ano Ilisia 15771, Athens, Greece E-mail: [email protected]

Nikos Pelekis Department of Informatics, University of Piraeus, Piraeus, 80 Karaoli-Dimitriou St., GR-18534 Piraeus, Greece E-mail: [email protected]

Abstract: In this paper, we describe our approach to discovering trends in the biotechnology and pharmaceutical industry based on temporal text mining. Temporal text mining combines information extraction and data mining techniques over textual repositories; our main objective is to identify changes in associations among entities of interest over time. The approach consists of three main phases: information extraction (IE), the ontology-driven generalisation of templates and the discovery of associations over time. Treatment of the temporal dimension is essential to our approach, since it influences both the annotation part (IE) of the system and the mining part.

Keywords: temporal text mining; ontology; competitive intelligence.

Reference to this paper should be made as follows: Karanikas, H., Koundourakis, G., Kopanakis, I., Mavroudakis, T. and Pelekis, N. (2011) ‘Discovering market trends in the biotechnology industry’, Int. J. Business Intelligence and Data Mining, Vol. 6, No. 2, pp.184–201.

Biographical notes: Haralampos Karanikas is currently a Researcher at the University of Athens. He holds a BSc in Physics from the Aristotle University of Thessaloniki (GR) (1996). He received an MSc in Computer Science (1998) from UMIST (UK) and a PhD in the field of Temporal Text Mining from the University of Manchester. His research work has informed the formation of research proposals. His work aims at advancing the ways data-mining algorithms are used to analyse natural language text in an attempt to discover structure and implicit meanings 'hidden' within the text. His research interests also include document warehousing, ontology construction, data mining and business intelligence.

George Koundourakis works at the Ministry of Economy and Finance of Greece, General Secretariat for Information Systems. He received his BSc from the Computer Science Department of the University of Crete (1996). He subsequently joined the Department of Computation at the University of Manchester Institute of Science and Technology (UMIST) to pursue his MSc in Information Systems Engineering (1999) and his PhD in Data Mining (2001). He has been working for almost ten years in the field of data management and machine learning. His research interests include knowledge discovery, data mining, machine learning, spatio-temporal databases and geographical information systems.

Ioannis Kopanakis is an Assistant Professor at the Technological Educational Institute of Crete. He holds a Diploma in Computer Science from the University of Crete (1998), Greece, an MSc in Information Technology (1999) and a PhD in Computation (2003), both from UMIST, UK. His research interests include data mining, visual data mining and business intelligence. He has been involved in five Hellenic and three European research programmes. He has published 14 papers in journals and refereed conferences, and has demonstrated the results of his work in Europe and the USA.

Thomas Mavroudakis has been a Lecturer at the University of Athens since 2004. He holds an MSc in Mathematics and a PhD in Computation from the University of Manchester. His main research interests are knowledge modelling and computational linguistics. From 1999 to 2004, he was the Director of the Research and Technology Directorate, Minister Staff, Hellenic Ministry of National Defence. Since 1987, he has had long and continuous experience in international and national research projects funded by the EU, WEAG-CEPA 6, the Hellenic Government and the Ministry of National Defence.

Nikos Pelekis is a Post-Doc Researcher at the Department of Informatics, University of Piraeus, Greece. Recently, he was elected Lecturer at the Department of Statistics and Insurance Science, University of Piraeus. He received his BSc from the Computer Science Department of the University of Crete (1998). He subsequently joined the Department of Computation at the University of Manchester Institute of Science and Technology (UMIST) to pursue his MSc in Information Systems Engineering (1999) and his PhD in Moving Object Databases (2002). He has co-authored more than 40 research papers and book chapters.

1 Introduction and motivation

We are living in the 'Information Age': biomedical knowledge is growing at an amazing speed, making comprehension a time-consuming and mentally demanding task for humans (Morchen et al., 2008a). A vast amount of the data found in an organisation (some estimates run as high as 80%) is textual, such as reports and e-mails (Merrill Lynch, 2000). This type of unstructured data usually lacks metadata (data about data); as a consequence, there is no standard means to facilitate search, query and analysis. Knowledge Discovery in Text (KDT) and Text Mining (TM) are mostly automated techniques that aim to discover high-level information in huge amounts of textual data and present it to the potential user (analyst, decision-maker, etc.).

Text Mining processes unstructured textual information and examines it in an attempt to discover structure and implicit meanings 'hidden' within the text (Karanikas et al., 2000), utilising specialised data mining and Natural Language Processing (NLP) techniques operating on textual data. TM applications impose strong constraints on the usual NLP techniques (Rajman and Besançon, 1997), as the involvement of large volumes of textual data does not allow the integration of complex treatments.

Within our framework, IE is an essential phase in text processing. It facilitates the semi-automatic creation of (more or less) domain-specific metadata repositories. Such repositories can be further processed using standard data-mining techniques. The focus is on studying the relationships and implications among entities (such as companies and persons) and processes (such as business affairs) participating in specific events. The goal is to discover important association rules over time within a document collection, such that the presence of a set of entities or processes in a document implies the presence of another entity or process. For example, one might learn that whenever company A occurs in news text, it is highly probable that a technology area is also mentioned, for a specific time period.

The real-world case study scenario demonstrated in this paper involves a company that specialises in corporate intelligence (including competitive intelligence –CI) products and services for the biotechnology and pharmaceutical industry. These products and services are targeted mainly to managers in charge of business development (or investment/strategic level issues) who wish to make decisions on industry and company developments or simply to monitor these on a regular basis.

In particular, our approach is applied to a number of documents that describe new alliances and partnerships of competitors, from which we discover business trends of competitors (or in the market in general). Competitors moving into markets, or the emergence of a new market, are of paramount importance. This type of analysis enhances knowledge acquisition regarding markets and competitors, so that the business analyst is able to support the proposed strategic plans of the organisation.


Our main contributions include:

• integration on the database level of IE and DM to discover useful patterns from text

• incorporation of the time issue in the text discovery process

• incorporation of background knowledge via the use of ontology in the discovery process

• metadata model for the text repository

• an ontology-driven technique for discovering temporal association in text.

The rest of the paper is organised as follows. In Section 2, we discuss the process of analysing vast amounts of textual data and provide the definitions that will be used throughout the paper. In Section 3, we demonstrate our approach to performing temporal KDT with the use of a realistic case study. Section 4 reviews related work in the field, while in Section 5 we conclude and give some future research directions.

2 Knowledge Discovery in Text

Knowledge Discovery in Text (KDT) and TM constitute an emerging research area that tries to resolve the problem of information overload by using techniques from data mining, machine learning, NLP, Information Retrieval (IR), IE and knowledge management. As with any emerging research area, there is still no established vocabulary for KDT and TM, a fact that can lead to confusion when attempting to compare results. Often, the two terms are used to denote the same thing. KDT (Feldman and Dagan, 1995; Kodratoff, 2000) (or Knowledge Discovery in Textual Databases), Text Data Mining (Hearst, 1999) and TM (Rajman and Besançon, 1997; Nahm and Mooney, 2000; Tan, 1999; Hehenberger and Coupet, 1998; Sullivan, 2001) are some of the terms that can be found in the literature.

We use the term KDT to indicate the overall process of turning unstructured textual data into high-level information and knowledge, while the term TM is used for the step of the KDT process that deals with the extraction of patterns from textual data. By extending the definition of Knowledge Discovery in Databases (KDD) (Fayyad and Piatetsky, 1996), we provide the following one: KDT is the non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in unstructured textual data.

KDT is a multi-step process, which includes all the tasks from the gathering of documents to the visualisation of the extracted information. The main goal of KDT is to make patterns understandable to humans in order to facilitate a better understanding of the underlying data (Fayyad and Piatetsky, 1996). TM is the step in the KDT process consisting of the particular data mining and NLP algorithms that, under some acceptable computational efficiency limitations, produce a particular enumeration of patterns over a set of unstructured textual data.

The above-mentioned definition is better understood by comparing it with the related technology, data mining. Data mining works with structured data, often numerical in nature. TM is analogous to data mining in that it uncovers relationships in information; unlike data mining, however, it works with information stored in an unstructured collection of text documents.

The process of discovering knowledge in text includes three major stages (Sullivan, 2001; Kosala and Blockeel, 2000) (Figure 1):

1 Collect relevant documents: The first step is to identify which documents are to be retrieved. Once we have identified the source of our documents, we need to retrieve them (from the web or from an internal file system).

2 Pre-processing documents: This step includes any kind of transformation process applied to the original documents retrieved. These transformations may aim at obtaining the desired representation of documents, such as XML or SGML. The resulting documents are then processed to provide basic linguistic information about the content of each document. With an understanding of how words, phrases and sentences are structured, we can control text and extract information. From the TM perspective, our task in this pre-processing stage is to use language rules and word meanings to produce representations of text that are reusable by the other stages. The main areas of linguistics involved include morphology (the structure and form of words), syntax (how words and phrases form sentences) and semantics (the meaning of words and statements).

3 Text-mining operations: High-level information is extracted (metadata creation). Patterns and relationships are discovered within the extracted information.

Figure 1 Major KDT stages of processing (see online version for colours)
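The three stages above can be sketched as a minimal pipeline. The sketch below is purely illustrative: the in-memory document list, the toy tokeniser and the term-frequency "mining operation" are our own placeholders, not the system described in this paper.

```python
import re

def collect(sources):
    # Stage 1: gather raw documents from the chosen sources
    # (here simply an in-memory list instead of the web or a file system).
    return list(sources)

def preprocess(doc):
    # Stage 2: a toy linguistic pre-processing step: lower-case and tokenise.
    return re.findall(r"[a-z&]+", doc.lower())

def mine(token_lists):
    # Stage 3: a placeholder text-mining operation: count term frequencies.
    counts = {}
    for tokens in token_lists:
        for t in tokens:
            counts[t] = counts.get(t, 0) + 1
    return counts

docs = ["CombiChem and Chromagen sign drug screening deal",
        "CombiChem and Ono collaborate on drug discovery"]
freq = mine([preprocess(d) for d in collect(docs)])
print(freq["combichem"])  # 2: the term appears in both documents
```

In a real KDT deployment each stage would of course be far richer (crawlers, linguistic annotators, pattern miners), but the data flow between the three stages is the one shown.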

The accommodation of time into the KDT process provides the ability to discover richer knowledge, which may also indicate cause-and-effect associations. Temporal TM is an extension of TM, as it provides the capability of mining activity, rather than just states, from textual data.

The temporal issue mainly influences the NLP (temporal annotation) and mining (temporal data mining) phases: NLP extracts and assigns the temporal features to textual objects, and the mining techniques incorporate the temporal dimensions introduced into the analysis.

3 Temporal text-mining approach

The methodology we describe is a generic approach that can be applied to text databases of varying complexity. In our approach, the process of detecting patterns within and across text documents (TM) depends mainly on IE techniques. By identifying key entities in a text along with the temporal characteristics, one can find relationships between terms and identify unknown links (‘hidden’) between related topics.


We consider that each fact (or event, or template in IE terminology) defines a 'transaction' that creates the associations between textual items. In analogy with a traditional data-mining example, each fact is the corresponding market basket, and the textual items (template slots in IE terminology) assigned to it constitute the items in the basket. Consequently, our general goal of association extraction over time, given a set of facts, is to identify relationships between textual items and events, such that the presence of one pattern implies the presence of another pattern over time.
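The fact-as-transaction analogy can be made concrete in a few lines. In this sketch each extracted fact becomes an itemset, exactly as a market basket does in classical association mining; the fact contents are invented for illustration.

```python
# Each fact (a filled IE template) plays the role of one market basket;
# its slot values are the items in that basket.
facts = [
    {"CombiChem", "Chromagen", "Drug screening"},
    {"CombiChem", "Ono", "Drug discovery"},
    {"Pharmacopeia", "Pharmacia & Upjohn", "Screen molecules"},
]

def contains(fact, itemset):
    # A fact F "contains" an itemset X iff X is a subset of F.
    return itemset <= fact

# How many facts support the itemset {CombiChem}?
support_count = sum(contains(f, {"CombiChem"}) for f in facts)
print(support_count)  # 2
```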

The overall architecture of our system is shown in Figure 2. First, the IE module processes the input text to extract and fill the predefined templates. Then, the generalisation module performs the necessary generalisations upon specific slots of the template. Finally, the filled templates feed the temporal association discovery module, which extracts the rules.

Figure 2 Architecture flow

A metadata repository (and an associated model) is utilised by our approach to store data that describe the information extracted from the documents. More specifically, the role of the repository is to provide a neutral medium and a unified representation forming the basis for analysis. Additionally, it provides a more solid basis for large-scale applications on text.

In the following sections, we will demonstrate our methodology by presenting a sample of the case study, with some terms replaced owing to confidentiality.

3.1 Preparation of the CI case study

For any modern organisation, one of the most important issues nowadays, along with knowing its strengths and weaknesses and understanding its customers, is knowing its competitors. Competitive Intelligence is defined as "a systematic program for gathering and analysing information about your competitors' activities and general business trends to further your own company's goals" (Kahaner, 1996). Owing to the mainly unstructured format of the information sources, TM is a valuable tool here.

The real-world case study scenario demonstrated in this paper involves a company that specialises in corporate intelligence (including CI) products and services for the biotechnology and pharmaceutical industry. Within the company, much of the IE process has historically been done manually: a human analyst has had to search a set of information sources, gathering information and filtering, indexing and structuring it in a format appropriate for use. Examples of sources include press releases, scientific papers, etc. The corpus of press releases for the case study has been derived from Business Wire (www.businesswire.com). Business Wire specialises in the electronic delivery of news releases and information directly from companies, institutions and agencies to the media, financial community and consumers. The press releases the company is interested in concern business events within the life science domain.

Before we begin the process of gathering information about competitors, we have to define some organising principles or else we can find ourselves searching, retrieving and loading documents that do not meet our requirements.

The first task we have to accomplish to collect the relevant documents is to define the target competitors. In some industries, e.g., automobile, this is a simple matter; in several others, such as pharmaceuticals, it is quite different. For example, while developing and testing new drugs is a complex process often undertaken by large companies, smaller companies are often formed to develop and market a single drug, or to use a single technology, such as genetics, to improve the manufacturing process for some class of drugs. This example shows that we should not only choose large competitor companies for a target company, but also include smaller ones. In fact, smaller companies in a specific sector may indicate a new, dynamic market opportunity.

In our case study, a few of the companies we have targeted include CombiChem, Chromagen, Ono, Pharmacopeia, Pharmacia & Upjohn, Roche Bioscience, etc. The documents we have gathered from business press releases on the internet concern the target companies and, in particular, discuss collaborations among them on specific scientific issues.

The need for the analyst is to identify competitors’ movements into new scientific areas, potential alliances of competitors and the relationships between industry-specific sectors and technological areas. It is self-evident that these relationships over time would add great value in the decision process of the analyst. These are types of business problems that initially drive the need for TM.

A Text-Mining goal is an operational definition of the business goal; in other words, the TM goal states project objectives in technical terms. For example, in our case study, one of the business goals might be: "Should we move into new technological areas?". One of the sub-questions to answer is: "Are the competitors moving into new technological areas, and into which ones?" The corresponding TM goal might be: "According to the collaborations our competitors have signed in the last year, discover relationships with new technological areas".

3.2 Extracting information pieces

Following the problem definition and the collection of relevant documents is the phase of extracting information pieces from text. IE is the process of identifying essential pieces of information within a text, mapping them to standard forms and extracting them for use in later processing (Sullivan, 2002). In other words, it is the mapping of natural language texts (such as newswire reports, newspaper and journal papers and World Wide Web pages) into a predefined, structured representation (or templates), which, when filled, represent an extract of key information from the original text (Grishman, 1997). The information refers to entities of interest in the application domain (e.g., companies or persons), or relations between such entities, usually in the form of facts (or events) in which the entities take part (e.g., company takeovers, management successions). Once extracted, the information can then be stored, queried, mined, summarised in natural language, etc.

The IE component used in our framework is based on the GATE system (GATE, 2007). GATE allows a sequence of language-processing components to work together and, as a consequence, to mark up annotations on the input text. These include standard components such as sentence splitters and named entity recognisers. The result of the IE process is a set of named annotations over sections of the text.

Although the use of GATE was significant, especially for the named entity tasks, some of the work for fact extraction has been done manually, owing to the high precision required from the remaining components. Another reason is that, with regard to IE, the focus of the current study is on the outcome of this process.

In the current case study, we consider extracting information pieces from a collection of documents that describe corporate activities (1998–1999). We are interested in extracting a specific fact that concerns collaborations among companies on a specific technology subject at some point in time. For the sample document of Figure 3, the output of the IE subsystem, with the information pieces we are interested in extracting, is shown in Figure 4.

Figure 3 Sample collected document

Figure 4 Filled template with several slots that refer to collaboration facts

The above-mentioned fact of interest, which concerns collaborations among companies, is defined by the following attributes. First of all, the two companies that participate. Then, the collaboration type and the collaboration subject, which provide useful information about the relationship between the two companies. The collaboration subject usually includes technology terms of significance to the bio-knowledge worker; these terms point to technology areas/markets and are utilised to represent the current or future directions of company activities. The last two features involve time: the announced date is the date on which the fact was announced or published, and the valid date is the actual date on which the fact happened or is active. All these features have been defined in cooperation between the TM analyst and the domain expert. Table 1 presents case study examples of facts extracted.

Table 1 IE results of collaboration facts

ID  Company 1     Company 2           Collab. subject   DateAn      DateVal
1   Pharmacopeia  Pharmacia & Upjohn  Screen molecules  19/01/1999  10/01/1999
2   CombiChem     Chromagen           Drug screening    15/03/1999  15/03/1999
3   CombiChem     Ono                 Drug discovery    04/01/1999  10/04/2000
4   CombiChem     Roche Bioscience    Novel inhibitors  13/05/1998  05/06/1998
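A filled template of this kind can be represented as a simple record. The sketch below mirrors the slots of Table 1; the field names are ours, chosen for illustration, and do not claim to reproduce the system's actual schema.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class CollaborationFact:
    # Slots of the collaboration template filled by the IE step.
    company1: str
    company2: str
    subject: str
    date_announced: date   # when the fact was announced or published
    date_valid: date       # when the fact actually happened / is active

# Row 2 of Table 1 as a record.
fact = CollaborationFact("CombiChem", "Chromagen", "Drug screening",
                         date(1999, 3, 15), date(1999, 3, 15))
print(fact.company1, fact.subject)  # CombiChem Drug screening
```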

3.3 Metadata model

The information pieces extracted from documents are stored in a relational database according to our metadata model. The metadata model includes concepts derived from various existing standards such as the Dublin Core (DCMI, 2002), Common Warehouse Metamodel (CWM, 2001), Message Understanding Conference (MUC7) (Chinchor, 1997), TimeML (Pustejovsky et al., 2002) and Text Encoding Initiative (TEI, 2008).

Our metadata model is described using linguistic notation; as such, we refer to annotations upon the documents, where an annotation represents a form of metadata attached to a particular section of document content. An annotation has a type (or name), which is used to create classes of similar annotations, usually linked together by their semantics. An annotation set holds a number of annotations.

The most basic elements of IE metadata, to which we will refer as conceptual annotations, are Entities and Relations. Entities represent the main points of a document and may be terms or concepts such as person names, locations, an event's action and other information pieces of interest within the documents. Relations describe the relationships between terms or concepts such as organisations, persons and places. Combining entities and relationships, we can define facts of interest within the document collection.

The set of annotations we are utilising is organised in three levels:

• Structural annotations: This category of annotations is used to define the physical structure of the document (for example, the organisation into head, body, sections, paragraphs, sentences and tokens). This is particularly useful when we have to guide our analysis in specific parts of the document. For example, if we are planning to utilise research papers in a knowledge discovery process, just working with the abstract sections could be enough.

• Lexical annotations: These annotations are associated to a short span of text (smaller than a sentence), and identify lexical units that have some relevance for the framework: Named Entities, Terminology, Time Expressions, etc.

• Semantic/conceptual annotations: This type of annotation is not associated with any specific piece of text; instead, such annotations refer to lexical annotations. They (partially) correspond to what in MUC7 was termed 'Template Elements' and 'Template Relations' (Chinchor, 1997) (a template represents the final, tabular output format of the IE process). Ideally, they should correspond to instances.
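The three annotation levels can be modelled with a single annotation type that carries a level, a label and, for structural and lexical annotations, a character span; conceptual annotations instead point at the lexical annotations they are built from. The classes and the example sentence below are an illustrative sketch of this three-level organisation, not the actual metadata schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Annotation:
    level: str                      # "structural" | "lexical" | "conceptual"
    label: str                      # e.g., "Sentence", "NamedEntity", "TemplateRelation"
    span: Optional[tuple] = None    # (start, end) offsets; None for conceptual annotations
    refers_to: List["Annotation"] = field(default_factory=list)

text = "CombiChem signed a deal with Ono."
# Structural: the physical organisation of the document.
sent = Annotation("structural", "Sentence", (0, len(text)))
# Lexical: short spans identifying relevant lexical units.
ent1 = Annotation("lexical", "NamedEntity:Company", (0, 9))
ent2 = Annotation("lexical", "NamedEntity:Company", (29, 32))
# Conceptual: no span of its own; it refers to the lexical annotations.
rel = Annotation("conceptual", "TemplateRelation:Collaboration",
                 refers_to=[ent1, ent2])
print(text[slice(*ent1.span)])  # CombiChem
```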


The temporal aspects of documents that are considered relevant to our analysis are captured with TimeML. Basically, TimeML uses five tags: <EVENT>, <TIMEX3>, <SIGNAL>, <LINK> and <DocCreationTime>. The information corresponding to the last tag is captured in our model by the Dublin Core Date metadata tag, which, besides the creation date, also includes the publication and last-modified dates (DCMI, 2002). The scope of the annotation scheme comes directly from the interaction of the four remaining tags.

Two types of temporal entities are used:

1 verbs, nominalisations, adjectives and prepositions that indicate something that happened are marked as <EVENT>

2 explicit references to times and calendar dates are marked as <TIMEX3>.

How these two types of entities temporally interact is captured via <SIGNAL>, which marks explicit relations (‘twice’, ‘during’), and <LINK>, which marks implicit relations.
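As a toy illustration of this distinction, the snippet below marks up a sentence with EVENT, TIMEX3 and SIGNAL spans. The tag names follow TimeML, but the regular-expression "extraction" is our own simplification for illustration only; real TimeML annotation is far more sophisticated.

```python
import re

text = "CombiChem announced the agreement on 15 March 1999."

# <EVENT>: something that happened (here, a single verb of interest).
events = [(m.start(), m.end(), "EVENT")
          for m in re.finditer(r"announced", text)]
# <TIMEX3>: an explicit reference to a calendar date.
timexes = [(m.start(), m.end(), "TIMEX3")
           for m in re.finditer(r"\d{1,2} [A-Z][a-z]+ \d{4}", text)]
# <SIGNAL>: an explicit temporal relation word linking event and time.
signals = [(m.start(), m.end(), "SIGNAL")
           for m in re.finditer(r"\bon\b", text)]

print(events, timexes, signals)
```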

In Figure 5, the metadata model is presented.

Figure 5 Extracted information metadata, logical model


3.4 Defining conceptual levels for analysis

An important role of ontologies in our approach is to be utilised by the user during the analysis (mining) step to specify the conceptual levels of the discovered patterns. By utilising generalisation rules based on business ontologies (Figure 6), the values of the template slots (extracted textual items, e.g., CombiChem) are generalised at multiple conceptual levels.

Figure 6 Part of sample ontology

As a result of this process, a set of slots is generated for every generalised slot of the original set. The generated attributes are instances of the same slots, but at a different conceptual level (Table 2). This facilitates the discovery of association patterns at various conceptual levels.

Table 2 Generalised attributes

Template slot                    Generalised template slot
Company                          Bioscience sector to which the company belongs
  e.g., Pharmacopeia               Bioscience sector C, i.e., Molecules sector
  e.g., CombiChem                  Bioscience sector A, i.e., Drug sector
Collaboration subject            Generalised collaboration subject
  e.g., Collection of molecules    Micro molecules sector

By applying the generalisations to Table 1, which holds the output of the IE process, we obtain Tables 3 and 4: for the attributes company and collaboration subject, we get additional attributes with the generalised values.

Table 3 Generalised collaborations (1/2)

ID Company 1 GenCompany 1 Company 2 GenCompany 2

1 Pharmacopeia Bioscience Sector C Pharmacia & Upjohn Bioscience Sector B

2 CombiChem Bioscience Sector A Chromagen Bioscience Sector B

3 CombiChem Bioscience Sector A Ono Bioscience Sector B

4 CombiChem Bioscience Sector A Roche Bioscience Bioscience Sector B


Table 4 Generalised collaborations (2/2)

CollabSubject     GenCollabSubject  DateAn      DateVal
Screen molecules  TechAreaB         19/01/1999  10/01/1999
Drug screening    TechAreaA         15/03/1999  15/03/1999
Drug discovery    TechAreaA         04/01/1999  10/04/2000
Novel inhibitors  TechAreaB         13/05/1998  05/06/1998

The textual data set thus becomes more abstract and meaningful. The user is able to direct the analysis at the required conceptual levels and extract valuable knowledge that satisfies his or her needs.

Moreover, generalisation provides the frequency necessary for patterns to be discovered. Usually, specific facts do not appear frequently enough in the document collections that we keep. Generalising objects such as companies and products into market sectors, technology areas, etc., respectively, yields frequent facts and thus the ability to detect useful patterns.
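Under the assumption of a flat ontology lookup (company to bioscience sector, subject to technology area), the generalisation step can be sketched as follows. The mappings echo Tables 3 and 4; the dictionary-based lookup is our simplification of the ontology traversal.

```python
# Ontology lookups (sector and technology-area names follow Tables 3 and 4).
company_sector = {
    "Pharmacopeia": "Bioscience Sector C",
    "Pharmacia & Upjohn": "Bioscience Sector B",
    "CombiChem": "Bioscience Sector A",
    "Chromagen": "Bioscience Sector B",
    "Ono": "Bioscience Sector B",
    "Roche Bioscience": "Bioscience Sector B",
}
subject_area = {
    "Screen molecules": "TechAreaB",
    "Drug screening": "TechAreaA",
    "Drug discovery": "TechAreaA",
    "Novel inhibitors": "TechAreaB",
}

def generalise(fact):
    # Add generalised slots alongside the original ones (cf. Tables 3 and 4).
    g = dict(fact)
    g["gen_company1"] = company_sector[fact["company1"]]
    g["gen_company2"] = company_sector[fact["company2"]]
    g["gen_subject"] = subject_area[fact["subject"]]
    return g

fact = {"company1": "CombiChem", "company2": "Chromagen",
        "subject": "Drug screening"}
print(generalise(fact)["gen_subject"])  # TechAreaA
```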

3.5 Discovering association over time

Once we have extracted the required information from the documents and stored it in a database, we are able to apply temporal data-mining algorithms to discover valuable temporal patterns.

The general goal of association extraction, given a set of data items, is to identify relationships between attributes and items, such that the presence of one pattern implies the presence of another (Goebel and Gruenwald, 1999; Agrawal et al., 1993). Association rules upon textual items can be defined as follows:

Let I = {i1, i2, …, im} be a set of textual items (analogous to data items). A textual item varies from a single word to a multiword term, e.g., a person's name, a location, a company's name, etc. Let the Facts Database (FDB) be a database of facts (analogous to transactions), where each fact F consists of a set of textual items such that F ⊆ I. We use the term fact to refer to a set of textual items that refer to a specific piece of information in which the user is interested. For example, the statement that "Sibilo acquired Oracle" is a fact and consists of the textual items: Sibilo (company name), Acquire (event) and Oracle (company name). Given a textual itemset X ⊆ I, a fact F contains X if and only if X ⊆ F. An association rule is an implication of the form X ⇒ Y, where X ⊆ I, Y ⊆ I and X ∩ Y = ∅. The association rule X ⇒ Y holds in FDB with confidence c (the number of facts supporting the rule divided by the number of facts in which X is true) and support s (the number of facts supporting the rule divided by the total number of facts). The task of mining association rules is to find all the association rules whose support is larger than a minimum support threshold and whose confidence is larger than a minimum confidence threshold.

An example of such associations may be the following: The presence of Company A is associated with the presence of Technology Area B (support s, confidence c).
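The support and confidence measures defined above can be sketched directly over a facts database represented as sets of textual items (a minimal illustration; the sample facts are invented):

```python
def support(fdb, itemset):
    """Fraction of facts in the FDB containing every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= fact for fact in fdb) / len(fdb)

def confidence(fdb, x, y):
    """Confidence of the rule X => Y: support(X u Y) / support(X)."""
    return support(fdb, set(x) | set(y)) / support(fdb, x)

# Each fact is a set of textual items extracted from one document.
fdb = [
    {"Company A", "Technology Area B"},
    {"Company A", "Technology Area B"},
    {"Company A", "Technology Area C"},
    {"Company D", "Technology Area B"},
]
print(support(fdb, {"Company A", "Technology Area B"}))          # 0.5
print(confidence(fdb, {"Company A"}, {"Technology Area B"}))
```

A mining run keeps only those rules whose support and confidence exceed the user-defined thresholds.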


Each fact (collaboration) is treated individually to discover associations. However, where fact histories exist, temporal patterns can be discovered. In our case study, two temporal dimensions exist: the date that the fact was announced and the actual date that the fact took place. For these temporal dimensions, mining the textual data set at different temporal periods results in a series of sets of association rules, each set being the result of the mining process in one of the examined temporal periods. The strong association rules of one temporal period apply to the other temporal periods with the same, lower or higher support and confidence. The evolution of a rule along a temporal dimension is thus demonstrated by the fluctuation of its support and confidence across a series of temporal periods. After processing, the result is of the type: in period t, the presence of Company A and Company B is associated with the presence of Technology Area C (support s and confidence c over time).
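Tracking a rule's support and confidence across temporal periods can be sketched as follows (a minimal illustration; the period labels and sample facts are invented):

```python
from collections import defaultdict

def rule_strength_by_period(dated_facts, x, y):
    """Support and confidence of the rule X => Y in each temporal period.
    dated_facts: list of (period, set_of_items) pairs.
    Returns {period: (support, confidence)}."""
    by_period = defaultdict(list)
    for period, items in dated_facts:
        by_period[period].append(items)
    x, xy = set(x), set(x) | set(y)
    result = {}
    for period, facts in sorted(by_period.items()):
        n_x = sum(x <= f for f in facts)          # facts where X holds
        n_xy = sum(xy <= f for f in facts)        # facts supporting the rule
        result[period] = (n_xy / len(facts),
                          n_xy / n_x if n_x else 0.0)
    return result

dated_facts = [
    ("1999-03", {"CombiChem", "TechAreaA"}),
    ("1999-03", {"CombiChem", "TechAreaA"}),
    ("1999-04", {"CombiChem", "TechAreaB"}),
    ("1999-04", {"OtherCo", "TechAreaA"}),
]
print(rule_strength_by_period(dated_facts, {"CombiChem"}, {"TechAreaA"}))
```

Plotting the per-period pairs produces exactly the fluctuation curves shown in the figures of this section.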

In line with the TM goal explained in Section 3.1, specific queries by potential knowledge workers are expressed upon the text repository (for example, Tables 3 and 4).

Query: “find all associations between a set of companies, or the bioscience sector to which a company belongs, and any technology area for the period t”

Result: [Bayer AG] → Combinatorial chemistry software [Rule stronger in March].

According to this application scenario, the following tasks were performed:

1 User selects the target attribute, i.e., the one for which there is interest in finding association rules. In our case, we select the collaboration subject and the generalised collaboration subject, which represent the technology area to which a specific biotechnology collaboration belongs.

2 User selects the attributes, generalised or not, that will appear in the left-hand side of the rule. We select the attribute company 1 and the generalisation of company 1, which represents the bioscience sector to which the company belongs.

3 User must select a specific temporal dimension. To discover temporal rules, the algorithm utilises specific time slices that separate the dimension into contiguous temporal periods of equal length. After the selection of the temporal attribute, the user defines the length of the temporal periods to be examined, for example in minutes, hours, days, etc. Finally, the starting and ending points are defined. In our case, we have selected the dimension of the date that the fact actually happened; owing to the nature of the news stories we had to deal with (live news), the date that the fact was announced could also provide valuable information. Regarding the span of the temporal periods we intend to examine, we have selected months.
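The time-slicing of step 3 amounts to mapping each fact date onto the index of the equal-length period that contains it. A minimal sketch for monthly periods (the function name and starting point are our own illustration):

```python
from datetime import date

def month_slice(d, start):
    """Index of the monthly temporal period containing date d,
    counted from the user-defined starting point."""
    return (d.year - start.year) * 12 + (d.month - start.month)

start = date(1999, 1, 1)
print(month_slice(date(1999, 3, 15), start))   # 2 (third month)
print(month_slice(date(2000, 1, 1), start))    # 12
```

Facts sharing a slice index are mined together, giving one rule set per period as described above.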

We have chosen to present the most significant relations according to the measures of confidence and support. Additionally, we present relations that appear to be particularly important to the domain expert:

IF Company IS CombiChem THEN GenCollabSubject IS TechAreaA

[Rule stronger at the beginning of the year and especially in March, Figure 7]


Figure 7 CombiChem → TechAreaA

IF GenCompany IS Bioscience Sector A THEN GenCollabSubject IS TechAreaA

[Rule stronger at the end of the year and especially in November, Figure 8]

Figure 8 Bioscience Sector A → TechAreaA

IF GenCompany IS Bioscience Sector B THEN Collaboration Subject IS gene expression

[Rule stronger at the mid and end of the year and especially in November, Figure 9]

Figure 9 Bioscience Sector B → Gene expression

IF Company IS SmithKline Beecham THEN Collaboration Subject IS high throughput screening

[Rule stronger in November]

IF GenCompany IS Bioscience Sector C THEN GenCollabSubject IS TechAreaC

[Rule stronger in November, December]

IF Company IS Inspire Pharmaceuticals, IDBS THEN GenCollabSubject IS TechAreaA

[Rule stronger at the end of the year]

IF GenCompany IS Bioscience Sector B, Oxford Molecular Group PLC THEN Collaboration Subject IS metabolic disorders

[Rule stronger in September and October]


The significance of the above type of results is obvious to the domain analyst, who is able to detect movements of specific companies or bioscience sectors into technology areas over time. For a particular period of time, the interest of companies in specific technologies can be discovered and, through it, their competitors identified.

The above approach is suitable for mining association rules, and fluctuations of their strength, for textual items that appear in the same document. Additionally, by using the method described in Kamran and Hamilton (2001), we are able to discover association rules between textual items that appear in different documents. With this technique, a data set of time series is flattened by placing consecutive records in a single record. The number of records that are merged is called the time window size, and the attributes in the flattened record are renamed so as to avoid name clashes. Using this approach, we are able to mine association rules like the following:

IF at Time 1 GenCompany IS Bioscience Sector C AND at Time 2 Company IS Inspire Pharmaceuticals, IDBS THEN at Time 5 GenCollabSubject IS TechAreaC

IF at Time 10 Company IS SmithKline Beecham AND at Time 13 Company IS Inspire Pharmaceuticals, IDBS THEN at Time 20 GenCollabSubject IS TechAreaD
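The window-based flattening described above can be sketched as follows. This is a minimal illustration: the `T<i>_` renaming prefix and the sample records are our own choices, not taken from Kamran and Hamilton (2001).

```python
def flatten(records, window):
    """Merge every run of `window` consecutive time-series records into a
    single record, prefixing attribute names with the offset inside the
    window so that names from different time steps do not clash."""
    out = []
    for i in range(len(records) - window + 1):
        merged = {}
        for t, rec in enumerate(records[i:i + window]):
            for key, val in rec.items():
                merged[f"T{t}_{key}"] = val
        out.append(merged)
    return out

records = [{"Company": "A"}, {"Company": "B"}, {"Company": "C"}]
print(flatten(records, 2))
```

Standard (atemporal) association mining over the flattened records then yields inter-document rules of the kind shown above.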

4 Related work

There has been rather slow progress in research on integrating IE techniques and data mining for trend analysis. Most existing systems tend to have either a prescriptive structure, in which a number of topics are defined and the data is mined, or an atemporal one, in which a corpus of documents is mined for keywords, phrases, entities and relationships (Hurst, 2006). Most importantly, current techniques ignore most of the dynamic aspects, assuming that participating entities and formed links are static.

Research in TM that takes temporal issues into consideration is still scarce (Karanikas and Theodoulidis, 2002; Karanikas and Mavroudakis, 2005). Studies traditionally analyse past topics to detect rules that concern the past. At best, alerting mechanisms use simple temporal models that rely either on a sudden change in some rank or on the crossing of a threshold to determine the importance of a term over time (Hurst, 2006). TimeMines (Constructing Timelines with Statistical Models of Word Usage) automatically generates timelines by detecting documents that relate to a single topic (an event in a ‘history’). This is achieved exclusively by selecting and grouping semantic features in the text based on their statistical properties; significance tests are used to detect time periods in which words have a higher than usual frequency (Swan and Jensen, 2000). Wang and McCallum (2006) propose a generative probabilistic model to better capture evolving topics. In Fung et al. (2005), burst detection is used to cluster trends of words. E-Analyst (Mining of Concurrent Text and Time Series; Lavrenko et al., 2000), on the other hand, uses two types of time-stamped data, textual documents and numerical data, and attempts to find the relationship between them. The system discovers trends in time series of numeric data (e.g., stock prices) and attempts to characterise the content of the textual data (e.g., newspapers) that precedes these trends; the objective is to use this information to predict trends in the numeric data based on the content of the textual documents that precede them.


Additionally, there are approaches that use a partition of the time axis into various intervals to detect changes in text collections. In Mei and Zhai (2005), clustering is performed for each time step and clusters are connected with a graph model to obtain a temporal representation of the text collection. Similarly, in Schult and Spiliopoulou (2006), new clusters that cannot be connected to clusters from the previous time interval are candidates for new ontology concepts. An approach applied to a small excerpt of PubMed (Zhang and Chundi, 2006) decomposes the time axis related to a document archive, trying to minimise the information loss in the set of significant features per time step.

The work most closely related to ours is BioJournalMonitor, a decision-support system for the analysis of trends and topics in the biomedical literature (Morchen et al., 2008b). The authors describe mainly how a probabilistic topic model can be used to annotate recent papers with the most likely MeSH (ontology) terms before they are manually annotated. Additionally, they present a study on how to predict the inclusion of new terms in the MeSH ontology.

5 Conclusion and future work

The general goal of TM is to automatically extract information from textual data. In this paper, we have presented an application of our approach for discovering temporal rules and associations in text over time. We have shown how IE and temporal features can be utilised to assist the mining of associations over time. Additionally, we have presented the use of generalisation for discovering associations at the conceptual level the user desires, and our metadata model for keeping the required information in the database.

We have made no attempt to provide a strong evaluation of the methods presented. This paper describes the application of standard techniques to a textual data set in the context of a particular application: mining news for the biotechnology and pharmaceutical industry. Evaluation, therefore, should be in terms of how this method improves some task. For example, does trend analysis of this sort improve the precision and the speed of alerting mechanisms?

We have used various components to demonstrate the usefulness of our approach. In the near future, we are planning to apply our approach to large corpora. Another planned task is to extend the metadata model to capture the results of the mining phase for further analysis (meta-mining step).

References

Agrawal, R., Imienlinski, T. and Swami, A. (1993) ‘Mining association rules between sets of items in large databases’, Proceedings of the ACM SIGMOD Conf. on Management of Data, May, Washington DC, pp.207–216.

Berners-Lee, T., Hendler, J. and Lassila, O. (2001) ‘The Semantic Web’, Scientific American, May, pp.35–43.

Chinchor, N. (1997) MUC-7 Named Entity Task Definition, v. 3.5, http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/ne_task.html

CWM (2001) Common Warehouse Metamodel, Specifications Version 1.0, 2 February, pp.369–400.

DCMI (2002) Dublin Core Metadata Initiative, http://dublincore.org/, Accessed 10/2/2005.

Fayyad, U. and Piatetsky, S.G. (1996) ‘From data mining to knowledge discovery: an overview’, Advances in Knowledge Discovery and Data Mining, ISBN 978-0-262-56097-9, MIT Press, Cambridge, Mass., pp.1–34.

Feldman, R. and Dagan, I. (1995) ‘Knowledge discovery in textual databases (KDT)’, Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD-95), AAAI Press, Montreal, Canada, 20–21 August, pp.112–117.

Fung, G.P.C., Yu, J.X., Yu, P.S. and Lu, H. (2005) ‘Parameter free bursty events detection in text streams’, Proc. Intl. Conf. on Very Large Data Bases, Norway, Vol. 1, pp.181–192.

GATE (2007) http://www.gate.ac.uk, Accessed 10/3/2007.

Goebel, M. and Gruenwald, L. (1999) ‘A survey of data mining and knowledge discovery software tools’, ACM SIGKDD, Vol. 1, No. 1, pp.20–33.

Grishman, R. (1997) ‘Information extraction: techniques and challenges’, International Summer School, SCIE-97, Frascati, Italy, Vol. 1299, pp.10–27.

Hearst, M.A. (1999) ‘Untangling text data mining’, Proceedings of ACL ’99: the 37th Annual Meeting of the Association for Computational Linguistics, University of Maryland, USA, June.

Hehenberger, M. and Coupet, P. (1998) ‘Text mining applied to patent analysis’, Paper presented at the Annual Meeting of the American Intellectual Property Law Association (AIPLA), 15–17 October, Arlington, VA.

Hurst, M. (2006) Temporal Text Mining, American Association for Artificial Intelligence, www.aaai.org/Papers/Symposia/Spring/2006/SS-06-03/SS06-03-015.pdf

Kahaner, L. (1996) Competitive Intelligence: How to Gather, Analyze, and Use Information to Move Your Business to the Top, 1st ed., Simon & Schuster, New York, NY.

Kamran, K. and Hamilton, H.J. (2001) Temporal Rules and Temporal Decision Trees: A C4.5 Approach, Technical Report CS-2001-02.

Karanikas, H. and Mavroudakis, T. (2005) ‘Text mining software survey’, RANLP 2005, Text Mining Research Practice and Opportunities, Borovets, Bulgaria.

Karanikas, H. and Theodoulidis, B. (2002) Knowledge Discovery in Text and Text Mining Software, Technical Report, UMIST Department of Computation, Manchester, UK.

Karanikas, H., Tjortjis, C. and Theodoulidis, B. (2000) ‘An approach to text mining using information extraction’, Proc. PKDD 2000 Knowledge Management: Theory and Applications, September, Lyon, France, pp.165–178.

Kodratoff, Y. (2000) About Knowledge Discovery in Texts: A Definition and an Example, Unpublished Paper, http://www.cyberlex.pt/y%20kodratoff.pdf, Accessed 10/10/2008.

Kosala, R. and Blockeel, H. (2000) ‘Web mining research: a survey’, ACM SIGKDD, Vol. 2, pp.1–15.

Lavrenko, V., Schmill, M., Lawrie, D. and Ogilvie, P. (2000) ‘Mining of concurrent text and time series’, KDD-2000 Workshop on Text Mining, Boston, USA.

Lynch, M. (2000) e-Business Analytics, In-depth Report, Merrill Lynch & Co., USA, http://www.ml.com, Accessed 11/3/2001.

Mei, Q. and Zhai, C. (2005) ‘Discovering evolutionary theme patterns from text: an exploration of temporal text mining’, Proc. ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, pp.198–207.

Morchen, F., Dejory, M., Fradkin, D., Etienne, J., Wachmann, B. and Bundschus, M. (2008a) ‘Anticipating annotations and emerging trends in biomedical literature’, Proceedings Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Vol. 26, pp.954–962.

Morchen, F., Fradkin, D., Dejory, M. and Wachmann, B. (2008b) ‘Emerging trend prediction in biomedical literature’, Proceedings American Medical Informatics Association (AMIA) 2008 Annual Symposium, Washington DC, USA, p.29.

Nahm, U.Y. and Mooney, R.J. (2000) ‘Using information extraction to aid the discovery of prediction rules from text’, KDD-2000 Workshop on Text Mining, August, Boston, USA.

Pustejovsky, J., Roser, S., Setzer, A., Giazauskas, R. and Ingria, B. (2002) TimeML Annotation Guideline 1.00 (internal version 0.4.0), 21 July, http://www.cs.brandeis.edu/%7Ejamesp/arda/time/documentation/AnnotationGuideline-v0.4.0.pdf

Rajman, M. and Besançon, R. (1997) ‘Text mining: natural language techniques and text mining applications’, Proceedings of the 7th IFIP Working Conference on Database Semantics (DS-7), Leysin, Switzerland.

Rinaldi, F., Dowell, J., Hess, M., Persidis, A., Theodoulidis, B., Black, B., McNaught, J., Karanikas, H., Vasilakopoulos, A., Zervanou, K., Bernard, L., Zarri, G.P., Bruins Slot, H., Van der Touw, C., Daniel-King, M., Underwood, N., Lisowska, A., Van der Plas, L., Sauron, V., Spiliopoulou, M., Brunzel, M., Elleman, J., Orphanos, G., Mavroudakis, T. and Taraviras, S. (2003) ‘Parmenides: an opportunity for ISO TC37 SC4?’, ACL-2003 Workshop on Linguistic Annotation: Getting the Model Right, July, Sapporo, Japan, pp.38–45.

Schult, R. and Spiliopoulou, M. (2006) ‘Discovering emerging topics in unlabelled text collections’, Proc. East European ADBIS Conf., Thessaloniki, Greece, Vol. 4152 of Lecture Notes in Computer Science, Springer, pp.353–366.

Sullivan, D. (2001) Document Warehousing and Text Mining, Wiley Computer Publishing, USA.

Sullivan, D. (2002) ‘Enterprise content management: information extraction in enterprise content management’, DM Review, USA.

Swan, R. and Jensen, D. (2000) ‘TimeMines: constructing timelines with statistical models of word usage’, KDD-2000 Workshop on Text Mining, Boston, USA.

Tan, Ah-H. (1999) ‘Text mining: the state of the art and the challenges’, Proceedings of the Pacific Asia Conf. on Knowledge Discovery and Data Mining PAKDD’99 Workshop on Knowledge Discovery from Advanced Databases, China, pp.65–70.

TEI (2008) The Text Encoding Initiative, http://www.tei-c.org/, Chapters 6 and 7, Accessed 10/2/2008.

Wang, X. and McCallum, A. (2006) ‘Topics over time: a non-Markov continuous-time model of topical trends’, Proc. ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, Philadelphia, USA, pp.424–433.

Zhang, R. and Chundi, P. (2006) ‘Using time decompositions to analyze PubMed abstracts’, Proc. IEEE Symposium on Computer-Based Medical Systems, Maribor, Slovenia, pp.569–576.