Exploring the Clinical Notes of Pathology Ordering by Australian General

Practitioners: a text mining perspective

Zoe Yan Zhuang, Rasika Amarasiri, Leonid Churilov, Damminda Alahakoon

Monash University, Melbourne, Australia

Ken Sikaris

The University of Melbourne, Melbourne, Australia

E-mail: {zoe.zhuang, leonid.churilov}@buseco.monash.edu.au;

{rasika.amarasiri, damminda.alahakoon}@infotech.monash.edu.au; [email protected]

Abstract

A massive rise in the volume and cost of pathology ordering by general practitioners (GPs) concerns the government and has prompted various studies that aim to understand and improve ordering behavior. In this paper we attempt to understand the

reasons for and implications of pathology ordering by

general practitioners by applying an unsupervised text

mining technique on the clinical notes of the pathology

requests obtained from a pathology company in

Australia. Pathology requests are clustered into

different groups based on the information that is

included by the doctors in clinical notes accompanying

the requests. Features and patterns of the groups are

investigated and analyzed. The novelty of the paper is

in using text mining techniques to extract knowledge

from unstructured text data in the area of pathology

ordering and to understand the reasons for pathology ordering from the doctors’ perspective.

1. Introduction

Pathology ordering, both in Australia and worldwide, has been attracting special attention from government and provoking debate among researchers and practitioners. This is mainly due to perceived significant increases in both the volume and the cost of test ordering and to the complexity of the ordering system. In Australia, general practitioners (GPs) order and manage most pathology tests [1], and surveys show continued growth in the ordering rate by GPs [2]. Although the reasons for the increase may range from better communication between patients and GPs to increased concern about medical litigation and patient expectations [3], the appropriateness of pathology utilization by Australian GPs is not assured, and doctors’ ordering behavior is not yet fully understood.

Within a typical patient case scenario, the GP

encounters a patient with specific characteristics and

existing symptoms or diseases. The doctor checks the

patient’s condition and may decide to order certain

test(s) for the purpose of screening, diagnosis,

monitoring, or prognosis [4]. A pathology request is

initiated with the information about the patient and the

test(s) ordered. In addition, the doctor may include

some clinical notes explaining the context of this

ordering, which are relevant to the patient, the test(s)

ordered, or the doctor’s opinion of the ordering and are

expressed in free-text form. Thus, clinical notes are typically used as a means of communication between the GP and the laboratory pathologist concerning a particular requesting situation [3].

In the course of modern pathology practice,

millions of test request records are accumulated each

year in pathology laboratory databases. This

information can be of value to the study of the logic

and rationale of GP test ordering, which might shed

some light on the issues of appropriateness or

inappropriateness of GP pathology utilization [5], [6],

[7], [8]. At present, systematic or structured analyses of this information to understand the reasons for test

ordering are very limited due to the diverse medical

vocabulary and unformatted medical terms used by

individual doctors.

The objective of this paper is to investigate the

application of text mining techniques for understanding

the reasons or related issues of pathology ordering by

Australian GPs as captured in the clinical notes of

pathology requests.

Text mining is the automated or partially automated

processing of text by imposing structure upon text and

extracting useful information from text [9]. A novel

growing high dimensional feature map algorithm - the

High Dimensional Growing Self Organizing Map with

Randomness (HDGSOMr) [10], [11] - is used for the

efficient processing of text data in clinical notes.

Pathology requests are clustered into homogenous

Proceedings of the 40th Hawaii International Conference on System Sciences - 2007

1530-1605/07 $20.00 © 2007 IEEE

groups based on the similarity of the text in clinical

notes. These groups are then classified and

investigated, and the hidden information and patterns

are identified and analyzed.

The novelty of the paper is two-fold. Firstly, it

applies text mining to unstructured textual data in the

area of pathology ordering. Text mining in the medical domain has been applied to the mining of electronic

medical records [12], [13] and large medical journal

collections such as MEDLINE [14], but its application

in pathology ordering is still lacking. Secondly, by

choosing the clinical notes as the study object, this

research focuses on doctors’ perspective in tackling the

issues of pathology utilization by GPs, thus

complementing the previous research by Zhuang et al.

[15] that provided a patient-centric perspective on

pathology ordering. Zhuang et al. [15] applied data

mining techniques on pathology ordering data from a

pathology company in Australia and discovered

homogenous patient groups and different ordering

patterns within and across these groups, thus

addressing the question “What kinds of patients are

consuming what types of pathology tests and in which

manner?”. The research presented in this paper

explores the clinical notes provided by requesting GPs

using text mining techniques with the view to

understand the reasons why the doctors order the tests

they order.

The rest of the paper is organized as follows.

Section 2 describes the data preparation and text

mining process using HDGSOMr. Section 3 presents

the method of analyzing the clusters generated by the

algorithm. Sections 4 and 5 describe the special distinctive nodes and the other distinctive nodes, respectively. A summary and a discussion of future research are provided in Section 6.

2. Data preprocessing and text mining with

HDGSOMr

Clustering algorithms, such as feature map

algorithms based on the Self Organizing Map (SOM)

[16], are frequently used for the purposes of text

mining. The unsupervised nature of these algorithms

leads to the grouping of text based on the notion of

similarity. These algorithms require the textual data to

be encoded into numbers. This is done by converting

the text into phrases or words and then encoding them

using techniques such as Term Frequency-Inverse

Document Frequency (TF-IDF) [17]. Before encoding, the text is preprocessed to correct spelling mistakes, and words are reduced to their stem forms to reduce the complexity of the processing.
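As a minimal sketch of the encoding step described above, the following Python code removes stop words, applies a deliberately crude suffix-stripping stand-in for a real stemmer, and builds length-normalized TF-IDF vectors. The stop-word list, the `crude_stem` helper, the particular weighting (raw term frequency times log(N/df)), and the sample notes are all illustrative assumptions; the paper does not specify these details.

```python
import math
from collections import Counter

STOP_WORDS = {"a", "is", "the", "of", "and"}  # illustrative, not the paper's list


def crude_stem(word):
    """Very rough stand-in for a real stemmer such as Porter's algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(note):
    """Lowercase, split on whitespace, drop stop words, and stem."""
    words = note.lower().split()
    return [crude_stem(w) for w in words if w not in STOP_WORDS]


def tfidf_vectors(notes):
    """Return the vocabulary and one unit-length TF-IDF vector per note."""
    docs = [tokenize(n) for n in notes]
    vocab = sorted({t for d in docs for t in d})
    n_docs = len(docs)
    # Document frequency: number of notes containing each term.
    df = Counter(t for d in docs for t in set(d))
    vectors = []
    for d in docs:
        tf = Counter(d)
        vec = [tf[t] * math.log(n_docs / df[t]) for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec))
        vectors.append([x / norm for x in vec] if norm > 0 else vec)
    return vocab, vectors


# Hypothetical clinical notes, for illustration only.
notes = ["routine smear", "fasting glucose diabetes", "diabetes routine check"]
vocab, vectors = tfidf_vectors(notes)
```

In this toy example a term that appears in fewer notes (such as "smear") receives a larger weight than one shared across notes (such as "routine"), which is the property that lets a clustering algorithm separate notes by their distinctive wording.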

A growing high dimensional feature map algorithm,

High Dimensional Growing Self Organizing Map with

Randomness (HDGSOMr), is capable of producing

good clusters from very large collections of text in a

reasonably short time [10], [11]. An overview of the

technique is illustrated in Figure 1.

Figure 1. Diagram of the text mining technique

The process begins by pre-processing the text

records to be presented to the HDGSOMr algorithm. In

the next step, the algorithm clusters the text data

records and produces a map that has clusters organized

in such a way that related clusters are placed close to

each other. A hypertext map is produced from the

algorithm, which is then used as the user interface for

the technique. By clicking on the hyperlinks in the

map, the data analyst can navigate to the records

associated with the group. Finally, the records are

extracted and a summary of the other associated fields

and the full texts are presented in the detail page.

The described HDGSOMr-based text mining

technique was applied to a dataset obtained from XYZ

Pathology Company in Australia containing the

pathology requests by GPs from May 2002 to April

2004. Only requests with clinical notes are included in

the experiment. Altogether 1,152,605 records are

included for the text mining process.

The text records of clinical notes are pre-processed

first to correct the spelling and to remove commonly

used words such as “a”, “is” and “the” using a stop

word list. The text is then broken down into individual words and stemmed using Porter’s stemming algorithm [18] to obtain the root form of the words.

The resulting text from each record is then converted to

a normalized vector using the Term Frequency-Inverse

Document Frequency (TF-IDF) method [17]. In the

next step, the HDGSOMr algorithm is executed on the

preprocessed data, and the result is a map of 213

nodes. The size (number of records) and average quantization error (AQE) of each node are presented together with the node number in the hypertext map.

The AQE measures the average squared distance of all

the records in a node to the corresponding projection in

the map (weight vector of the node) [16]. The value of

AQE varies from 0 to 1, and the smaller the value, the

more similar the records are within a node. Each node

is also accompanied by the top 5 most dominant words

in the set of records mapped to that node along with

their relative percentage. A hypertext map (see Figure

2) is produced from the HDGSOMr algorithm, which

is then used as the user interface for the analysis. The

topology of the map is such that similar nodes are

located close to each other. By clicking on the

hyperlinks in the map, records associated with the node

can be extracted and a summary and details of other

fields such as age, gender, tests associated with the

records and full texts of clinical notes are presented in

the detail page.
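The per-node summary described above (AQE and the most dominant keywords with their relative percentages) can be sketched as follows. This is only an illustration under stated assumptions: the function names, the toy records, and the toy weight vector are hypothetical, and the AQE is computed as a plain mean squared Euclidean distance following the definition in the text, not taken from the HDGSOMr implementation.

```python
from collections import Counter


def average_quantization_error(vectors, weight):
    """Mean squared Euclidean distance from each record vector to the node weight."""
    if not vectors:
        return 0.0
    total = 0.0
    for vec in vectors:
        total += sum((x - w) ** 2 for x, w in zip(vec, weight))
    return total / len(vectors)


def top_keywords(token_lists, k=5):
    """Share of records (in %) containing each term, for the k most common terms."""
    n = len(token_lists)
    counts = Counter(t for tokens in token_lists for t in set(tokens))
    return [(term, 100.0 * c / n) for term, c in counts.most_common(k)]


# Toy node: three hypothetical records that all mention "diabetes".
tokens = [["diabetes"], ["diabetes", "glucose"], ["diabetes", "mellitus"]]
vectors = [[1.0, 0.0], [0.8, 0.6], [0.8, 0.6]]
weight = [0.9, 0.4]  # hypothetical weight vector of the node

aqe = average_quantization_error(vectors, weight)
keywords = top_keywords(tokens)
```

A keyword that occurs in every record of the node, such as "diabetes" here, comes out at 100%, which is how such keywords appear in the hypertext map.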

Figure 2. Hypertext map of nodes produced from the HDGSOMr

3. Overview of results

Each node is a cluster of request records with

similar input patterns of clinical notes by GPs. A total

of 213 nodes were generated by the text mining

process. As described in the previous section and

shown in Figure 2, each node is accompanied by its

size, average quantization error (AQE) and top 5

keywords with relative percentage. A weak negative correlation between node size and AQE can be observed (Pearson correlation -0.182, significant at the 0.01 level).

Based on the percentages of the top 5 keywords,

two types of clusters can be observed: distinctive and

“fuzzy” nodes. Distinctive nodes (accounting for about

65% of all pathology requests) are those nodes that

have at least one of the top 5 keywords present in

100% of the records in the node. For example, Node 2

in Figure 2 is a distinctive node because it has the

keyword “diabetes” that is present in all the records in

that node. Any given distinctive node has one clinical

indication which is determined by the keyword(s) that

is/are present in all the records in that node. Where more than one keyword is found in all the records of a given node, these keywords always occur together as a clinical term and usually refer to a single clinical indication. For example, one of the nodes has the three keywords “cervix”, “look” and “normal”, each with a percentage of 100%. Together these words form a clinical term which refers to the clinical condition of a Pap smear ordering.

“Fuzzy” nodes (~35%), on the other hand, do not have any keyword that occurs in all of the records in the node. These nodes usually contain more than one clinical indication for ordering. “Fuzzy” nodes require further mining in order to discover clinical implications that are less prominent in scale but still important.

[Figure 2 excerpt: Node 2, size 1,855, AQE 0.926; top keywords: diabetes 100%, glucose 9%, mellitus 9%, mha 9%, level 8%]
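The distinction between distinctive and “fuzzy” nodes reduces to a simple rule on the top-keyword percentages, sketched below. The function names and the second example node are hypothetical; the first example uses the keyword percentages shown for Node 2 in Figure 2.

```python
def is_distinctive(keyword_percentages):
    """A node is distinctive if any top keyword occurs in 100% of its records."""
    return any(pct >= 100 for pct in keyword_percentages.values())


def clinical_term(keyword_percentages):
    """Keywords present in every record; together they name the node's indication."""
    return [kw for kw, pct in keyword_percentages.items() if pct >= 100]


# Node 2 from Figure 2.
node_2 = {"diabetes": 100, "glucose": 9, "mellitus": 9, "mha": 9, "level": 8}
# Hypothetical node with no keyword present in all records.
fuzzy_node = {"pain": 40, "fatigue": 35, "check": 20, "blood": 15, "test": 10}

print(is_distinctive(node_2), clinical_term(node_2))
print(is_distinctive(fuzzy_node), clinical_term(fuzzy_node))
```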

In addition to separating the nodes by their distinctive or “fuzzy” nature, distinctive nodes can be further divided into subgroups for analysis. As

illustrated in Figure 3, using the topology of the map

and the relevant information linked to each node, two

distinctive groups of nodes are singled out due to their

“special” nature. Based on the combination of test

ordering patterns and the information in clinical notes,

these two groupings can be identified as “Pap smear”

group and “Warfarin” group. Their special

characteristics are as follows: within each grouping,

there are a number of similar nodes located next to

each other, the nodes are typically large in size, and

each individual node has the same pattern of test

ordering. Due to the large size of both groups (~23%

and ~15% of all the records respectively), as well as

their unique patterns of ordering, these groups are

analyzed separately from other distinctive nodes. A

summary of the node groups is presented in Table 1.

Table 1. Summary of node groups

| Group of nodes | Description | Number of nodes | Absolute size (number of records) | Relative size (%) |
|---|---|---|---|---|
| Pap smear | Nodes with “smear” or related terms as dominant key words. | 68 | 259,401 | 22.51 |
| Warfarin | Nodes with “warfarin” or related terms as dominant key words. | 14 | 177,026 | 15.36 |
| Other distinctive | Nodes with distinctive dominant key words. | 78 | 315,502 | 27.37 |
| Fuzzy | Nodes with no distinctive dominant key words. | 53 | 400,676 | 34.76 |
| Total | | 213 | 1,152,605 | 100 |

Figure 3. The method of analysis

[Flowchart: nodes in map; Distinctive? (yes: distinctive nodes; no: “fuzzy” nodes, left for future research); Special? (yes: special distinctive nodes; no: other distinctive nodes); pre-existing domain knowledge and clinical indication taxonomy; analysis by size, AQE, and clinical indication category]

To analyze the distinctive nodes systematically (“Pap smear”, “Warfarin”, or other distinctive nodes), the characteristic features of these nodes that are of specific interest need to be observed. The method of analysis is illustrated in Figure 3. For the purposes of this study, these node-specific features include quantitative criteria such as the node size, the corresponding AQE, and patient profiles, together with a qualitatively defined type of the node. The types are constructed qualitatively by systematically observing and summarizing the top 5 keywords for each of the distinctive nodes. Drawing on the domain knowledge of previous work on mining and structuring free-text medical records such as [19] and [20], and further considering the special clinical terms that emerged from mining the clinical notes for pathology ordering, we developed a taxonomy of the clinical indication categories. The clinical indications accompanying pathology requests as captured in the clinical notes are classified into seven categories as follows:

1. Clinical conditions including symptoms, clinical findings and signs such as “elevated cholesterol”, “high/low blood pressure”, “persistent cough”, etc.
2. Clinical problems or diseases such as “diabetes”, “pregnancy”, “infection”, etc.
3. Medications such as “lipitor”, “thyroxin”, etc.
4. Nature of ordering such as “screening”, “routine check”, “follow-up”, etc.
5. Sample descriptions such as “fasting blood/glucose”, “eye/ear swab”, etc.
6. Previous test results such as “previous abnormal LFT”, “unsatisfactory smear”, etc.
7. Management issues by laboratory such as “specimen not labeled”, “contact doctor”, etc.

Such a taxonomy is highly problem-specific: it is a direct product of the knowledge discovered through the text mining process, superimposed on the pre-existing knowledge available in the wider domain. This is consistent with the generic stages of a data mining project as discussed by Berry et al. [21].
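To make the seven-category taxonomy concrete, it can be represented as a lookup from example terms to categories, as in the sketch below. This is purely illustrative: in the study the node types were constructed qualitatively by the analysts, and the `TAXONOMY` table and `categorize` function are hypothetical names whose term sets contain only single words drawn from the examples listed above.

```python
# Example terms per category, taken from the examples in the list above.
TAXONOMY = {
    "clinical condition": {"elevated", "high", "low", "persistent"},
    "clinical problem": {"diabetes", "pregnancy", "infection"},
    "medication": {"lipitor", "thyroxin"},
    "nature of ordering": {"screening", "routine", "follow-up"},
    "sample description": {"fasting", "swab"},
    "previous test results": {"previous", "unsatisfactory"},
    "management": {"specimen", "labeled", "contact"},
}


def categorize(dominant_keywords):
    """Return the first category whose example terms include a dominant keyword."""
    for keyword in dominant_keywords:
        for category, terms in TAXONOMY.items():
            if keyword.lower() in terms:
                return category
    return None  # no match: left for manual review


print(categorize(["diabetes"]))
print(categorize(["Lipitor"]))
```

A lookup of this kind assigns each node to at most one category, and anything it cannot match still needs the manual, domain-informed judgment described in the text.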

4. Special distinctive nodes: results and

analysis

Pap smear nodes

Overall there are 68 nodes comprising 259,401 (22.51%) records relating to Pap smear orderings, which make up a large proportion of all the pathology orderings. Although all these nodes are related to Pap smear ordering, they vary in terms of size, AQE, descriptions, and average age of patients. The top 5 nodes are displayed in Table 2. Note that out of the 68 Pap smear nodes, the top 5 account for about 25% of all Pap smear related requests. The “top descriptive terms” are the top keywords (up to 5) that dominate the records. Terms with 100% domination are in bold italic font.

There is a moderate negative correlation between AQE and size (Pearson correlation -0.628, significant at the 0.01 level). The largest “Pap smear” nodes are very homogeneous, with very small AQEs (close to 0). A very small AQE indicates consistency of expression in the clinical notes, while a large size may reflect that the expression is commonly used by a large number of GPs. The expressions in clinical notes vary from node to node. Using the categories developed in the previous section, one can observe that the records in some nodes stress the nature of ordering (e.g. “routine smear”), some emphasize the clinical condition (e.g. “cervix look normal”), some include medication (e.g. “contraception”), and some indicate previous test results (e.g. “previous normal/abnormal smear”). The average patient age of each node also varies considerably, depending on individual situations. For instance, some nodes contain records of young patients (average age ~25 years), with “first smear” as the dominant term in clinical notes. In other nodes patients are much older (~59 years), with the term “postmenopause” dominating the clinical notes.

Table 2. Top 5 “Pap smear” nodes

| Rank | AQE | Size | % of Pap smear requests | Average age | Top descriptive terms |
|---|---|---|---|---|---|
| 1 | 0.0007 | 22,257 | 8.58 | 42.12 | routine smear warfarin |
| 2 | 0.0004 | 12,075 | 4.65 | 39.84 | cervix look normal discharge post |
| 3 | 0.1530 | 11,000 | 4.24 | 31.97 | contraception cervix look normal oral |
| 4 | 0.5093 | 10,070 | 3.88 | 38.36 | cin previous normal smear lmp |
| 5 | 0.3880 | 9,184 | 3.54 | 39.09 | abnormal grade low previous normal |

“Warfarin” nodes

Another significant proportion of pathology

ordering is “Warfarin” related ordering. These requests

always have the test “PR” (also called “INR”) ordered

and the clinical notes always indicate “Warfarin” as the

medication the patient is taking. Although there are

only 14 “Warfarin” nodes, they contain over 15% of all

pathology requests. Most “Warfarin” nodes are large

and homogenous. There is some negative correlation

(Pearson Correlation -0.478 significant at the 0.01

level) between AQE and size. The largest node

distinguishes itself by the extremely large size (53% of

all “Warfarin” orderings) and extremely small AQE

(0). Descriptions in clinical notes do not differ

significantly from node to node except the level of

detail GPs include in the notes. Level of detail varies

from the single word “Warfarin” indicating the

medication the patient is taking to detailed information

concerning the dose (e.g. 2mg) and time (e.g. 7pm) of

“Warfarin” intake by the patient. Only one node

indicates in clinical notes the nature of ordering (serial

testing) rather than the medication “Warfarin”. There

are no significant differences in age or gender across

these nodes. Top 5 “warfarin” nodes are presented in

Table 3.

Table 3. Top 5 “warfarin” nodes

| Rank | AQE | Size | % of warfarin requests | Average age | Female (proportion) | Top descriptive terms |
|---|---|---|---|---|---|---|
| 1 | 0 | 93,830 | 53.00 | 71.53 | 0.46 | rx warfarin |
| 2 | 0.0001 | 33,255 | 18.79 | 69.21 | 0.46 | warfarin |
| 3 | 0.0019 | 23,468 | 13.26 | 69.89 | 0.42 | mg warfarin |
| 4 | 0.6910 | 8,382 | 4.73 | 67.77 | 0.46 | mg warfarin pm |
| 5 | 0.0363 | 8,286 | 4.68 | 71.49 | 0.43 | serial PR |
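The Pearson correlation between node size and AQE reported above can be computed as in the following sketch. It uses only the five nodes listed in Table 3, so it illustrates the calculation and the direction of the relationship rather than reproducing the coefficient reported over all 14 “Warfarin” nodes; `pearson` is an illustrative helper, not code from the study.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Size and AQE of the five nodes in Table 3.
sizes = [93830, 33255, 23468, 8382, 8286]
aqes = [0.0, 0.0001, 0.0019, 0.6910, 0.0363]

r = pearson(sizes, aqes)
print(f"r = {r:.3f}")
```

The coefficient for these five rows is negative, in line with the observation that larger nodes tend to have smaller errors.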

5. Other distinctive nodes: results and

analysis

The taxonomy for reasons for ordering that consists

of seven categories as presented in Section 3 is used to

classify each distinctive node into one of the categories

according to its dominant keyword(s). As discussed in

section 3, in cases when there is more than one

keyword encountered in all the records in the node,

there is typically only one clinical indication for

pathology ordering. Thus, using this approach, each node is assigned to exactly one category. Table

4 summarizes the classification statistics for each

category.

Table 4. Classification of other distinctive nodes into seven clinical indication categories

| Category of nodes | Descriptive terms (dominant keywords) | Number of nodes | Absolute size |
|---|---|---|---|
| Clinical condition | Elevated, high, low, persistent, reoccurring, etc. | 17 | 64,606 |
| Clinical problem | Diabetes, hypertension, pain, infection, etc. | 35 | 147,037 |
| Medication | Rx, lipitor, thyroxin, etc. | 10 | 36,570 |
| Nature of ordering | Screening, routine, follow-up, etc. | 11 | 50,042 |
| Sample description | Fasting blood, etc. | 1 | 8,146 |
| Previous test results | Previous, abnormal, etc. | 2 | 8,269 |
| Management | Specimen, label, doctor, etc. | 2 | 832 |
| Total | | 78 | 315,502 |

Figure 4. Distribution of nodes into clinical indication categories

[Pie chart: clinical problem 46%, clinical condition 20%, nature of ordering 16%, medication 12%, sample description 3%, previous test result 3%, management < 1%]

There are 17 nodes with a total size of 64,606 requests (about 20% of the requests in other distinctive nodes) whose clinical notes mainly describe the clinical condition of the patient. The biggest grouping (35 nodes), with clinical problems or diseases as the dominant feature in clinical notes, consists of 147,037 requests, almost half (~47%) of the requests in other distinctive nodes. The medication grouping (10 nodes) contains 36,570 requests (~12%), while the nature of ordering grouping (11 nodes) makes up about 16% (50,042 requests). The groupings based on the categories of sample description (one node) and previous test results (two nodes) are of similar size (~2.6% each). The smallest grouping, management, consists of two nodes and accounts for only 0.26% of the requests in other distinctive nodes.

Figure 4 illustrates the distribution of the node categories.
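The percentages quoted above follow directly from the absolute sizes in Table 4. As a quick check, the short sketch below recomputes each category's share of the 315,502 requests in other distinctive nodes; the variable names are illustrative.

```python
# Number of requests per category, taken from Table 4.
category_sizes = {
    "clinical condition": 64606,
    "clinical problem": 147037,
    "medication": 36570,
    "nature of ordering": 50042,
    "sample description": 8146,
    "previous test results": 8269,
    "management": 832,
}

total = sum(category_sizes.values())
shares = {name: 100.0 * size / total for name, size in category_sizes.items()}

# Print the categories from largest to smallest share.
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {share:.2f}%")
```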

The biggest grouping of nodes, clinical problem nodes, is further broken down into smaller groups of nodes, each group representing an individual problem or disease. Zhuang et al. [15] used a keyword extraction method to analyze the clinical problems occurring in clinical notes. The selection of key words was based on the past literature or survey reports in the area of pathology ordering. In contrast, in this study the clinical problems are elicited using unsupervised text mining, which reflects the natural clustering of similar words or terms in clinical notes without any manual manipulation or domain expert input. The results presented here, therefore, may not necessarily reflect accurately the frequency of clinically pre-defined terms for problems or diseases, but rather provide an important general reflection of the way doctors commonly describe certain problems or diseases in day-to-day practice. For example, certain keywords such as “pain” and “infection” did not appear as pre-defined clinical problems in the study by Zhuang et al. [15], but in this study they emerge as dominant keywords for some nodes.

The clinical problem category based grouping

consists of 35 nodes that cover 10 different clinical

problems. The groups of nodes relating to the same

clinical problem are ranked according to their size

(number of requests for each clinical problem). The

statistics for each problem group are presented in Table

5.

To illustrate the variety of patterns of expression in clinical notes among nodes with the same dominating clinical problem, consider the “pregnancy”

nodes as an example. There are three nodes with

“pregnancy” as dominant keyword, each with a

different error and size. A closer look at the three

nodes reveals that the node with the smallest error

contains the simplest and most consistent expressions

of pregnancy problem, while the one with largest error

contains the longest and most sophisticated expressions

of pregnancy problem with other clinical

complications. Table 6 shows the comparison of the

three “pregnancy” nodes.

Table 5. Statistics for “clinical problem” groups

| Rank | Problem group | Number of nodes | Absolute size | Relative size (% of “clinical problem” nodes) |
|---|---|---|---|---|
| 1 | Hypertension | 3 | 23,148 | 15.74 |
| 2 | Pain | 4 | 22,624 | 15.39 |
| 3 | Lipid disorder | 8 | 20,053 | 13.64 |
| 4 | Diabetes | 7 | 17,299 | 11.77 |
| 5 | UTI | 3 | 16,276 | 11.07 |
| 6 | Infection | 3 | 14,684 | 9.99 |
| 7 | Pregnancy | 3 | 10,533 | 7.16 |
| 8 | Chest problem | 2 | 10,477 | 7.13 |
| 9 | Prostate problem | 1 | 7,468 | 5.08 |
| 10 | Abdominal problem | 1 | 4,475 | 3.04 |
| Total | | 35 | 147,037 | 100 |

Table 6. Comparison of the “pregnancy” nodes

| Node | AQE | Size | Description |
|---|---|---|---|
| 1 | 0.1352 | 6,916 | Simple and short description of pregnancy, e.g. “21 weeks pregnant”. |
| 2 | 0.6735 | 1,752 | Pregnancy with some other problem, e.g. “Pregnant haematuria”. |
| 3 | 0.8328 | 1,865 | Pregnancy with a long expression of other complications, e.g. “Patient is pregnant and has had possible contact with measles case”. |

6. Summary and Conclusions

In this paper an unsupervised text mining algorithm

was used on a large dataset of pathology requests

obtained from a pathology company including the free-text clinical notes. The underlying assumption is that test requests with similar wordings of clinical notes are clustered into one group while those with dissimilar wordings are clustered into different groups. Homogeneous

nodes of requests are generated as a result, and

characteristics such as size, average quantization error

(AQE), and most frequent keywords with percentage

are presented for each node.

The research design involved two stages. At the

first stage, the text mining algorithm was applied on

the text data to come up with emerging patterns. At the

second stage, these patterns were combined with

existing knowledge in a wider domain to provide a

framework for the analysis. The unsupervised HDGSOMr algorithm is well suited to discovering emerging patterns, but it does not provide the means for interpreting the results. Therefore, human input based on pre-existing domain knowledge was required for the analysis and interpretation of the results.

Based on whether there are distinctive terms

dominating the node or not, the nodes are separated

into distinctive and “fuzzy” nodes. Due to the scope of

this study, “fuzzy” nodes are filtered out for future

research while distinctive nodes are included for close

investigation.

“Fuzzy” nodes are the nodes generated from the text

mining process that do not have a distinctive keyword

that is encountered in all the records within the node.

Almost 35% of all the pathology requests are in “fuzzy” nodes exhibiting textual patterns in clinical notes that require further analysis. These nodes typically contain several clinical indications that may be less common than those observed in distinctive nodes. However, these indications are also important as they are indispensable parts of the clinical practice of pathology ordering. In addition, some interesting problems may be diluted and scattered across different “fuzzy” nodes because different GPs use various terms to describe the same problem. For example,

“fatigue” can often also be described as “tiredness”,

“lethargy”, “weakness”, etc. Therefore, “fuzzy” nodes

(especially large ones) need to be further investigated

to discover the hidden patterns at a deeper level.
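The distinction between distinctive and “fuzzy” nodes, and the diluting effect of synonyms, can be illustrated with a minimal sketch. It is not the HDGSOMr implementation: the function names, the whitespace tokenization, the toy notes, and the synonym map are all assumptions introduced here for illustration only.

```python
from collections import Counter

# Hypothetical synonym map (an assumption for illustration, not from the study).
SYNONYMS = {"tiredness": "fatigue", "lethargy": "fatigue", "weakness": "fatigue"}


def keyword_coverage(notes, synonyms=None):
    """Return {keyword: fraction of notes containing it} for one node."""
    synonyms = synonyms or {}
    counts = Counter()
    for note in notes:
        # Map each word to its canonical term; count it at most once per note.
        terms = {synonyms.get(word, word) for word in note.lower().split()}
        counts.update(terms)
    return {term: n / len(notes) for term, n in counts.items()}


def is_distinctive(notes, synonyms=None):
    """A node is treated as distinctive if some keyword occurs in every record."""
    return any(frac == 1.0 for frac in keyword_coverage(notes, synonyms).values())


# Toy node: four made-up clinical notes describing the same problem.
node = ["fatigue", "tiredness weight loss", "lethargy", "weakness dizzy"]
print(is_distinctive(node))            # no keyword is shared by all notes
print(is_distinctive(node, SYNONYMS))  # synonyms collapse to "fatigue"
```

Under these assumptions, the same four notes form a “fuzzy” node when compared word by word, but a distinctive one once the synonyms are mapped to a single term, which is one way the hidden patterns in large “fuzzy” nodes might be exposed.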

A general exploration of the distinctive nodes identifies two “special” groups of nodes: “Pap smear” and “Warfarin”. The nodes within each group are located together, are usually large in size, and exhibit unique test ordering patterns.

Different “Pap smear” nodes present different clinical implications of Pap smear ordering and different patient profiles for the orders. In contrast, “Warfarin” nodes differ from each other mainly in the level of detail expressed in the clinical notes concerning Warfarin intake. Other distinctive nodes are classified according to their types of clinical indication, such as clinical conditions, clinical problems, medications, nature of ordering, sample descriptions, previous test results, and management issues. Nodes based on clinical problems can be further subclassified into groups related to specific clinical problems, including hypertension, pain, lipid disorder, diabetes, UTI, infection, pregnancy, chest problem, prostate problem, and abdominal problem.

Average quantization errors (AQEs) vary from node to node, with a slight negative correlation with node size (Pearson correlation coefficient -0.182). Small error values may indicate consistency of expression in the clinical notes within a group (node) of data records. Usually, the shorter the expression, the more likely it is to be consistent, and the smaller the error may be. Larger error values do not necessarily mean that the node is “fuzzy” or that the quality of the clustering is “bad”. They may indicate that a certain clinical problem or implication can be expressed in considerably different ways by different doctors.
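The relationship between AQE and node size can be sketched as follows. The sketch assumes the usual self-organizing map definition of AQE, namely the mean Euclidean distance between the records mapped to a node and that node's weight vector; the exact formula used in this study may differ. The toy vectors and function names are made up for illustration and do not reproduce the reported coefficient.

```python
import math


def average_quantization_error(vectors, weight):
    """Mean Euclidean distance between a node's records and its weight vector."""
    return sum(math.dist(v, weight) for v in vectors) / len(vectors)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Toy nodes: (record vectors, node weight vector); all values are made up.
nodes = [
    ([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]], [1.0, 0.0]),
    ([[1.0, 1.0], [0.0, 0.0]], [0.5, 0.5]),
    ([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]], [0.5, 0.5]),
]
sizes = [len(vectors) for vectors, _ in nodes]
aqes = [average_quantization_error(vectors, w) for vectors, w in nodes]
print(sizes, [round(a, 3) for a in aqes], round(pearson(sizes, aqes), 3))
```

In this toy example the largest node consists mostly of identical records and therefore has the smallest error, so the coefficient is negative; a coefficient close to zero, as reported above, would indicate that size alone explains little of the variation in AQE.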

The size of the nodes may also have important implications. Large and homogeneous nodes require special attention because they are likely to capture both common and consistent patterns of pathology ordering. Some of the “Pap smear” and “Warfarin” nodes belong to this category. The type of a node reflects the clinical implications of ordering. It may contain important clinical information regarding the reasons for or issues of pathology ordering from a doctor’s perspective.

The vocabulary of clinical notes is considerably different from the language of formal medical texts, references and publications. Some doctors are very brief in expression, while others are quite wordy. Most of the sentences in the notes are incomplete, and the terms describing the same clinical indication can vary widely. Traditional statistical methods are of little use for analyzing this kind of information. Abstracting or coding the text is a feasible option, but effective coding may require prior knowledge and/or significant expert understanding of the text, and thus relies heavily on external domain knowledge as an input to the process. The unsupervised text mining technique used in this study, however, requires little prior understanding or domain knowledge. Rather, it may generate new knowledge that facilitates or enhances the process of abstracting or coding.


As a general exploration of a vast dataset of free-text clinical information, this research does not aim to provide a complete analysis of all the clinical indications of pathology requests. Presented here is just a top-level overview of the most frequently encountered patterns in the text. In addition, the information recorded in the clinical notes by GPs is subjective rather than objective. For example, the cholesterol or lipid level is seldom noted by doctors in numerical terms; instead, an abnormal level is often described as either “low” or “elevated”. This observation re-emphasizes the limitations of the proposed approach, compared with standard data mining on numerical data, with regard to numerical accuracy, but at the same time illustrates its benefit in making use of invaluable text information that is beyond the scope of classical numeric data mining.

As part of future research, the text mining results from this study will be combined with numeric data mining results to enhance the understanding of pathology ordering from both patient- and doctor-centric perspectives. Future research also includes an online GP learning activity. This program is designed to investigate why GPs order certain tests in particular circumstances and the factors influencing their test ordering, which are not fully understood at present.

Acknowledgements

This research was partially supported by the Australian Research Council Linkage Grant LP0347622.

The authors would like to acknowledge the support of XYZ pathology in providing the data.

We would like to thank the anonymous referees for constructive comments that helped us to improve the quality of this paper.

7. References

[1] Cohen, J., Piterman, L., McCall, L., and Segal, L., “Near-patient testing for serum cholesterol: attitudes of general practitioners and patients, appropriateness, and costs”, Medical Journal of Australia, 1998; 168:605-610.

[2] Britt, H., Miller, G.C., Charles, J., Knox, S., Valenti, L., Henderson, J., Pan, Y., Bayram, C., and Harrison, C., “General practice activity in Australia 2002-03”, Australian Institute of Health and Welfare (General Practice Series No. 14), 2004, p. 81.

[3] Guibert, R., Wicker, S., and Horrocks, M., Background Reading for QUP-GP workshop, “The development of a research proposal to address the appropriate use of pathology to general practice – stage 1”, The Royal Australian College of General Practitioners Research and Practice Support Directorate, prepared by 31 August – 1 September 2001.

[4] Zhuang, Z.Y., Churilov, L., Burstein, F., and Sikaris, K., “Combining data mining and case-based reasoning for intelligent decision support for pathology ordering by general practitioners in Australia”, Proceedings of the Conference on Creativity and Innovation in Decision Making and Decision Support (CIDMDS 2006), an IFIP TC8/WG 8.3 Open Conference, 2006.

[5] Vinning, R.F. and Mara, P., “General practitioners and pathology testing”, Medical Journal of Australia, 1998; 168:591-592.

[6] Van Walraven, C. and Naylor, C.D., “Do we know what inappropriate laboratory utilization is? A systematic review of laboratory clinical audits”, JAMA, 1998; 280:550-8.

[7] Lundberg, G.D., “The need for an outcomes research agenda for clinical laboratory testing”, JAMA, 1998; 280:565-6.

[8] Smellie, W.S.A., “Appropriateness of test use in pathology: a new era or reinventing the wheel?”, The Association of Clinical Biochemists, 2003; 40:585-592.

[9] Miller, T.W., Data and text mining, a business applications approach, Pearson Prentice Hall, New Jersey, 2005.

[10] Amarasiri, R., Alahakoon, D., Premaratne, M., and Smith, K., “HDGSOMr: A High Dimensional Growing Self Organizing Map Using Randomness for Efficient Web and Text Mining”, Proceedings of IEEE/ACM/WIC Conference on Web Intelligence (WI) 2005, Paris, France, pp. 215-221.

[11] Amarasiri, R., Ceddia, J., and Alahakoon, D., “Exploratory Data Mining Lead by Text Mining Using a Novel High Dimensional Clustering Algorithm”, Proceedings of 4th International Conference on Machine Learning Applications (ICMLA '05), Los Angeles, California, USA, 2005, pp. 267-272.


[12] Heinze, D.T., Morsch, M.L., and Holbrook, J., “Mining free-text medical records”, Proceedings of the AMIA 2001 Annual Symposium, American Medical Informatics Association, November 2001, pp. 254-8.

[13] Cerrito, P.C. and Cerrito, J.C., “Data and text mining the electronic medical record to improve care and to lower costs”, Proceedings of the Thirty-first Annual SAS Users Group International Conference, Cary, NC: SAS Institute Inc., 2006, Paper 077-31.

[14] Srinivasan, P. and Sehgal, A.K., “Mining MEDLINE for similar genes and similar drugs”, Department of Computer Science, The University of Iowa, TR# 03-02, July 2003.

[15] Zhuang, Z.Y., Churilov, L., and Sikaris, K., “Uncovering the patterns in pathology ordering by Australian general practitioners: a data mining perspective”, Proceedings of the 39th Annual Hawaii International Conference on Systems Sciences (HICSS’06) Track 5, 2006, p. 92c.

[16] Kohonen, T., Self Organizing Maps, Springer, 2001.

[17] Salton, G. and Buckley, C., “Term-weighting approaches in automatic text retrieval”, Information Processing & Management, 1988, 24 (5), 513-523.

[18] Porter, M., “An algorithm for suffix stripping”, Program, 1980, 14 (3), 130-137.

[19] Heinze, D.T., Morsch, M.L., and Holbrook, J., “Mining free-text medical records”, Proceedings of the AMIA 2001 Annual Symposium, American Medical Informatics Association, November 2001.

[20] Abidi, S.S.R. and Manickam, S., “Augment medical case base reasoning systems with clinical knowledge derived from heterogeneous electronic patient records”, 14th IEEE Symposium on Computer Based Medical Systems (CBMS’2001), 26-27 July 2001, Bethesda (USA).

[21] Berry, M.J.A. and Linoff, G.S., Mastering Data Mining, New York, John Wiley and Sons, 2000.
