Exploring the Clinical Notes of Pathology Ordering by Australian General
Practitioners: a text mining perspective
Zoe Yan Zhuang, Rasika Amarasiri, Leonid Churilov, Damminda Alahakoon
Monash University, Melbourne, Australia
Ken Sikaris
The University of Melbourne, Melbourne, Australia
E-mail: {zoe.zhuang, leonid.churilov}@buseco.monash.edu.au;
{rasika.amarasiri, damminda.alahakoon}@infotech.monash.edu.au; [email protected]
Abstract
A marked rise in the volume and cost of
pathology ordering by general practitioners (GPs)
concerns the government and has prompted various
studies that aim to understand and improve ordering
behavior. In this paper we attempt to understand the
reasons for and implications of pathology ordering by
general practitioners by applying an unsupervised text
mining technique on the clinical notes of the pathology
requests obtained from a pathology company in
Australia. Pathology requests are clustered into
different groups based on the information that is
included by the doctors in clinical notes accompanying
the requests. Features and patterns of the groups are
investigated and analyzed. The novelty of the paper is
in using text mining techniques to extract knowledge
from unstructured text data in the area of pathology
ordering and to understand the reasons for pathology
ordering from the doctors’ perspective.
1. Introduction
Pathology ordering both in Australia and worldwide
has been attracting special attention from government
and raising controversial arguments among researchers
and practitioners. This is mainly due to perceived
significant increases in both volume and expenses of
test ordering and the complexity of the ordering
system. In Australia, general practitioners (GPs) order
and manage most of the pathology tests [1], and
surveys show continuous growth in the ordering rate by
GPs in Australia [2]. Although reasons for the increase
may vary from better communication between patients
and GPs to increased concern of medical litigation and
patient expectations [3], the appropriateness of
pathology utilization by Australian GPs is not assured,
and doctors’ ordering behavior is not yet fully
understood.
Within a typical patient case scenario, the GP
encounters a patient with specific characteristics and
existing symptoms or diseases. The doctor checks the
patient’s condition and may decide to order certain
test(s) for the purpose of screening, diagnosis,
monitoring, or prognosis [4]. A pathology request is
initiated with the information about the patient and the
test(s) ordered. In addition, the doctor may include
some clinical notes explaining the context of this
ordering, which are relevant to the patient, the test(s)
ordered, or the doctor’s opinion of the ordering and are
expressed in a free-text form. Thus, clinical notes are
typically used as a means of communication between the
GP and the laboratory pathologist concerning a particular
requesting situation [3].
In the course of modern pathology practice,
millions of test request records are accumulated each
year in pathology laboratory databases. This
information can be of value to the study of the logic
and rationale of GP test ordering, which might shed
some light on the issues of appropriateness or
inappropriateness of GP pathology utilization [5], [6],
[7], [8]. At present, systematic or structured analyses of
this information to understand the reasons for test
ordering are very limited due to the diverse medical
vocabulary and unformatted medical terms used by
individual doctors.
The objective of this paper is to investigate the
application of text mining techniques for understanding
the reasons or related issues of pathology ordering by
Australian GPs as captured in the clinical notes of
pathology requests.
Text mining is the automated or partially automated
processing of text by imposing structure upon text and
extracting useful information from text [9]. A novel
growing high dimensional feature map algorithm - the
High Dimensional Growing Self Organizing Map with
Randomness (HDGSOMr) [10], [11] - is used for the
efficient processing of text data in clinical notes.
Pathology requests are clustered into homogenous
Proceedings of the 40th Hawaii International Conference on System Sciences - 2007
1530-1605/07 $20.00 © 2007 IEEE
groups based on the similarity of the text in clinical
notes. These groups are then classified and
investigated, and the hidden information and patterns
are identified and analyzed.
The novelty of the paper is two-fold. Firstly, it
applies text mining to unstructured textual data in the
area of pathology ordering. Text mining in the medical
domain has been applied to the mining of electronic
medical records [12], [13] and large medical journal
collections such as MEDLINE [14], but its application
in pathology ordering is still lacking. Secondly, by
choosing the clinical notes as the study object, this
research focuses on the doctors’ perspective in tackling the
issues of pathology utilization by GPs, thus
complementing the previous research by Zhuang et al.
[15] that provided a patient-centric perspective on
pathology ordering. Zhuang et al. [15] applied data
mining techniques on pathology ordering data from a
pathology company in Australia and discovered
homogenous patient groups and different ordering
patterns within and across these groups, thus
addressing the question “What kinds of patients are
consuming what types of pathology tests and in which
manner?”. The research presented in this paper
explores the clinical notes provided by requesting GPs
using text mining techniques with the view to
understand the reasons why the doctors order the tests
they order.
The rest of the paper is organized as follows.
Section 2 describes the data preparation and text
mining process using HDGSOMr. Section 3 presents
the method of analyzing the clusters generated by the
algorithm. Section 4 and section 5 describe the special
distinctive nodes and other distinctive nodes
respectively. Summary and discussion for future
research are provided in section 6.
2. Data preprocessing and text mining with
HDGSOMr
Clustering algorithms, such as feature map
algorithms based on the Self Organizing Map (SOM)
[16], are frequently used for the purposes of text
mining. The unsupervised nature of these algorithms
leads to the grouping of text based on the notion of
similarity. These algorithms require the textual data to
be encoded into numbers. This is done by converting
the text into phrases or words and then encoding them
using techniques such as Term Frequency-Inverse
Document Frequency (TF-IDF) [17]. Text is
preprocessed in order to eliminate spelling mistakes.
Stem forms of the words are produced to reduce
complexity of the processing.
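The encoding step described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation: the function names and the sample notes are assumptions, spelling correction and stemming are omitted, and the TF-IDF variant shown (raw term count multiplied by log(N/df), followed by L2 normalization) is only one of several common formulations.

```python
import math
import re
from collections import Counter


def tokenize(note):
    """Lower-case a clinical note and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", note.lower())


def tfidf_vectors(notes):
    """Return (vocabulary, vectors): one L2-normalized TF-IDF vector per note."""
    docs = [Counter(tokenize(n)) for n in notes]
    vocab = sorted({t for d in docs for t in d})
    n_docs = len(docs)
    # Document frequency: number of notes that contain each term.
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    vectors = []
    for d in docs:
        raw = [d[t] * math.log(n_docs / df[t]) for t in vocab]
        norm = math.sqrt(sum(x * x for x in raw))
        # A note whose terms all occur in every note has zero norm; keep zeros.
        vectors.append([x / norm if norm else 0.0 for x in raw])
    return vocab, vectors


# Hypothetical notes, for illustration only.
notes = ["routine smear", "rx warfarin", "warfarin 2 mg pm", "diabetes fasting glucose"]
vocab, vectors = tfidf_vectors(notes)
```

Each resulting vector has one component per vocabulary term, which is the numeric form a feature map algorithm needs as input.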
A growing high dimensional feature map algorithm,
High Dimensional Growing Self Organizing Map with
Randomness (HDGSOMr), is capable of producing
good clusters from very large collections of text in a
reasonably short time [10], [11]. An overview of the
technique is illustrated in Figure 1.
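Figure 1. Diagram of the text mining technique (Database → Extract Text & Pre-Process → HDGSOMr → Hypertext Map → Record Extractor Backend → Detailed Page)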
The process begins by pre-processing the text
records to be presented to the HDGSOMr algorithm. In
the next step, the algorithm clusters the text data
records and produces a map that has clusters organized
in such a way that related clusters are placed close to
each other. A hypertext map is produced from the
algorithm, which is then used as the user interface for
the technique. By clicking on the hyperlinks in the
map, the data analyst can navigate to the records
associated with the group. Finally, the records are
extracted and a summary of the other associated fields
and the full texts are presented in the detail page.
The described HDGSOMr-based text mining
technique was applied to a dataset obtained from XYZ
Pathology Company in Australia containing the
pathology requests by GPs from May 2002 to April
2004. Only requests with clinical notes are included in
the experiment. Altogether 1,152,605 records are
included for the text mining process.
The text records of clinical notes are pre-processed
first to correct the spelling and to remove commonly
used words such as “a”, “is” and “the” using a stop
word list. The text is then broken down into individual
words and stemmed using the Porter’s stemming
algorithm [18] to obtain the root form of the words.
The resulting text from each record is then converted to
a normalized vector using the Term Frequency-Inverse
Document Frequency (TF-IDF) method [17]. In the
next step, the HDGSOMr algorithm is executed on the
preprocessed data, and the result is a map of 213
nodes. Size (number of records) and average
quantization error (AQE) for each node are presented
together with the node number in the hypertext map.
The AQE measures the average squared distance of all
the records in a node to the corresponding projection in
the map (weight vector of the node) [16]. The value of
AQE varies from 0 to 1, and the smaller the value, the
more similar the records are within a node. Each node
is also accompanied by the top 5 most dominant words
in the set of records mapped to that node along with
their relative percentage. A hypertext map (see Figure
2) is produced from the HDGSOMr algorithm, which
is then used as the user interface for the analysis. The
topology of the map is such that similar nodes are
located close to each other. By clicking on the
hyperlinks in the map, records associated with the node
can be extracted and a summary and details of other
fields such as age, gender, tests associated with the
records and full texts of clinical notes are presented in
the detail page.
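As a rough illustration of the AQE definition given above, the following Python sketch computes the average squared Euclidean distance between the records mapped to a node and that node's weight vector. The function name and the toy vectors are assumptions. The paper does not state how the distance is scaled so that the AQE falls between 0 and 1, so the sketch applies no scaling.

```python
def average_quantization_error(records, weight):
    """Mean squared Euclidean distance from each record vector to the node weight vector."""
    if not records:
        raise ValueError("node has no records")
    total = 0.0
    for rec in records:
        total += sum((r - w) ** 2 for r, w in zip(rec, weight))
    return total / len(records)


# Toy example: three 2-dimensional record vectors mapped to one node.
records = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
weight = [1.0, 0.0]
aqe = average_quantization_error(records, weight)  # (0 + 0 + 2) / 3
```

A node whose records all coincide with its weight vector has an AQE of 0, which matches the reading that smaller values mean more similar records.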
Figure 2. Hypertext map of nodes produced from the HDGSOMr
3. Overview of results
Each node is a cluster of request records with
similar input patterns of clinical notes by GPs. A total
of 213 nodes were generated by the text mining
process. As described in the previous section and
shown in Figure 2, each node is accompanied by its
size, average quantization error (AQE) and top 5
keywords with relative percentage. A slightly negative
correlation (Pearson Correlation -0.182 significant at
the 0.01 level) between the size and error can be
observed.
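For readers who want to reproduce this kind of check, the correlation between node size and AQE can be computed as in the sketch below. The (size, AQE) pairs are hypothetical and are not taken from the study's map, so the result will not match the reported coefficient; the significance test is not included.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical (size, AQE) pairs for five nodes; not the study's data.
sizes = [22000, 12000, 9000, 4000, 1500]
aqes = [0.01, 0.05, 0.40, 0.35, 0.90]
r = pearson(sizes, aqes)
```

A negative value of `r` indicates that larger nodes tend to have smaller errors, which is the direction of the relationship reported here.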
Based on the percentages of the top 5 keywords,
two types of clusters can be observed: distinctive and
“fuzzy” nodes. Distinctive nodes (accounting for about
65% of all pathology requests) are those nodes that
have at least one of the top 5 keywords present in
100% of the records in the node. For example, Node 2
in Figure 2 is a distinctive node because it has the
keyword “diabetes” that is present in all the records in
that node. Any given distinctive node has one clinical
indication which is determined by the keyword(s) that
is/are present in all the records in that node. In cases
where more than one keyword can be found in all the
records in a given node, it can be observed that these
keywords always come together as a clinical term and
usually refer to only one clinical indication. For
example, one of the nodes has three keywords, “cervix”,
“look” and “normal”, each present in 100% of its
records. These words together form a clinical term
which refers to the clinical condition of a Pap smear
ordering.
[Figure 2 excerpt, Node 2: size 1,855; AQE 0.926; top keywords diabetes 100%, glucose 9%, mellitus 9%, mha 9%, level 8%]
“Fuzzy” nodes (~35%), on the other hand, do not
have any keyword encountered in all of the records in
the node. These nodes usually contain more than one
clinical indication for ordering. “Fuzzy” nodes require
further mining in order to discover clinical implications
that are less prominent in scale but still important.
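The rule that separates distinctive from “fuzzy” nodes can be expressed in a few lines. The sketch below is illustrative only: the function names are assumptions, the first dictionary uses the keyword percentages shown for Node 2 in Figure 2, and the second node is hypothetical. It also assumes the percentages are exact; if the map displays rounded values, a keyword shown as 100% might in fact be missing from a few records.

```python
def node_type(keyword_percentages):
    """Classify a node from the percentages (0-100) of its top keywords.

    A node is 'distinctive' if at least one top keyword occurs in 100% of
    its records, and 'fuzzy' otherwise.
    """
    if any(pct >= 100 for pct in keyword_percentages.values()):
        return "distinctive"
    return "fuzzy"


def dominant_terms(keyword_percentages):
    """Return the keywords that are present in all records of the node."""
    return [kw for kw, pct in keyword_percentages.items() if pct >= 100]


# Node 2 as shown in Figure 2.
node_2 = {"diabetes": 100, "glucose": 9, "mellitus": 9, "mha": 9, "level": 8}
# Hypothetical node with no keyword present in every record.
mixed = {"tired": 40, "fatigue": 35, "lethargy": 20, "weak": 12, "check": 10}
```

With these inputs, `node_type(node_2)` yields "distinctive" with "diabetes" as its only dominant term, while `node_type(mixed)` yields "fuzzy".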
In addition to separating the nodes by their
distinctive or “fuzzy” nature, distinctive nodes can be
further divided into subgroups for analysis. As
illustrated in Figure 3, using the topology of the map
and the relevant information linked to each node, two
distinctive groups of nodes are singled out due to their
“special” nature. Based on the combination of test
ordering patterns and the information in clinical notes,
these two groupings can be identified as the “Pap smear”
group and the “Warfarin” group. Their special
characteristics are as follows: within each grouping,
there are a number of similar nodes located next to
each other, the nodes are typically large in size, and
each individual node has the same pattern of test
ordering. Due to the large size of both groups (~23%
and ~15% of all the records respectively), as well as
their unique patterns of ordering, these groups are
analyzed separately from other distinctive nodes. A
summary of the node groups is presented in Table 1.
Table 1. Summary of node groups
Group of nodes    | Description                                                   | Number of nodes | Absolute size (number of records) | Relative size (%)
Pap smear         | Nodes with “smear” or related terms as dominant key words.    | 68              | 259,401                           | 22.51
Warfarin          | Nodes with “warfarin” or related terms as dominant key words. | 14              | 177,026                           | 15.36
Other distinctive | Nodes with distinctive dominant key words.                    | 78              | 315,502                           | 27.37
Fuzzy             | Nodes with no distinctive dominant key words.                 | 53              | 400,676                           | 34.76
Total             |                                                               | 213             | 1,152,605                         | 100
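Figure 3. The method of analysis (flowchart: nodes in map; Distinctive? if no, “fuzzy” nodes, left for future research; if yes, Special? separating special distinctive nodes from other distinctive nodes; analysis by size, AQE and clinical indication category, using a clinical indication taxonomy informed by pre-existing domain knowledge)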
For conducting systematic analysis of the distinctive
nodes (“Pap smear”, “Warfarin”, or other distinctive
nodes), the characteristic features of these nodes that
are of specific interest need to be observed. The
method of analysis is illustrated in Figure 3. For the
purposes of this study, these node-specific features
would include quantitative criteria such as the node
size, the corresponding AQE, patient profiles, and a
qualitatively defined type of the node. The types are
constructed qualitatively by systematically observing
and summarizing the top 5 keywords for each of the
distinctive nodes. Referring to the domain knowledge
of previous works on mining and structuring free-text
medical records such as [19] and [20], and further
considering the special clinical terms that emerged
from mining the clinical notes for pathology ordering,
we developed a taxonomy of the clinical indication
categories. The clinical indications accompanying
pathology requests as captured in the clinical notes are
classified into seven categories as follows:
1. Clinical conditions including symptoms,
clinical findings and signs such as “elevated
cholesterol”, “high/low blood pressure”,
“persistent cough”, etc.
2. Clinical problems or diseases such as
“diabetes”, “pregnancy”, “infection”, etc.
3. Medications such as “lipitor”, “thyroxin”, etc.
4. Nature of ordering such as “screening”,
“routine check”, “follow-up”, etc.
5. Sample descriptions such as “fasting
blood/glucose”, “eye/ear swab”, etc.
6. Previous test results such as “previous
abnormal LFT”, “unsatisfactory smear”, etc.
7. Management issues by laboratory such as
“specimen not labeled”, “contact doctor”, etc.
Needless to say, such a taxonomy is highly
problem-specific and is a direct product of the
knowledge discovery through the text mining process,
superimposed on the pre-existing knowledge
available in the wider domain. This is consistent with the
generic stages of a Data Mining project as discussed by
Berry et al. [21].
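The seven categories above can be represented as a small data structure. The sketch below is a hypothetical illustration: in the study the node types were assigned qualitatively by inspecting the top keywords, not by an automated lookup, and the term sets here contain only a few of the example terms quoted in the list. The names `Indication`, `EXAMPLE_TERMS` and `categorize` are assumptions.

```python
from enum import Enum


class Indication(Enum):
    CLINICAL_CONDITION = "clinical condition"
    CLINICAL_PROBLEM = "clinical problem or disease"
    MEDICATION = "medication"
    NATURE_OF_ORDERING = "nature of ordering"
    SAMPLE_DESCRIPTION = "sample description"
    PREVIOUS_TEST_RESULT = "previous test result"
    MANAGEMENT = "management issue by laboratory"


# A few example terms per category, taken from the list above (not exhaustive).
EXAMPLE_TERMS = {
    Indication.CLINICAL_CONDITION: {"elevated", "high", "low", "persistent"},
    Indication.CLINICAL_PROBLEM: {"diabetes", "pregnancy", "infection"},
    Indication.MEDICATION: {"lipitor", "thyroxin"},
    Indication.NATURE_OF_ORDERING: {"screening", "routine", "follow-up"},
    Indication.SAMPLE_DESCRIPTION: {"fasting", "swab"},
    Indication.PREVIOUS_TEST_RESULT: {"previous", "unsatisfactory"},
    Indication.MANAGEMENT: {"specimen", "contact"},
}


def categorize(dominant_keywords):
    """Return the first category whose example terms include a dominant keyword, else None."""
    words = {w.lower() for w in dominant_keywords}
    for category, terms in EXAMPLE_TERMS.items():
        if words & terms:
            return category
    return None
```

A lookup of this kind could only support, not replace, the manual step, because the same word (for example “low”) can belong to different indications depending on context.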
4. Special distinctive nodes: results and
analysis
Pap smear nodes
Overall there are 68 nodes consisting of 259,401
(22.51%) records relating to Pap smear orderings,
which make up a large proportion of all the
pathology orderings. Although all these nodes are
related to Pap smear ordering, they vary in terms of
size, AQE, descriptions, and average age of patients.
The top 5 nodes are displayed in Table 2. Note that out
of the 68 Pap smear nodes, the top 5 account for about
25% of all Pap smear related requests. The “top
descriptive terms” are the top keywords (up to 5) that
dominate the records. Terms with 100% of
domination are in bold italic font.
There is a medium negative correlation between
AQE and size (Pearson Correlation -0.628 significant
at the 0.01 level). The largest “Pap smear” nodes are
very homogenous, with very small AQEs (close to 0).
A very small AQE indicates consistency of
expression in the clinical notes, while a large size
may reflect that the expression is commonly used by a
large number of GPs. The expressions in clinical notes
vary from node to node. Using the categories
developed in the previous section, one can observe that
the records in some nodes stress the nature of ordering
(e.g. “routine smear”), some emphasize the clinical
condition (e.g. “cervix look normal”), some include
medication (e.g. “contraception”), and some indicate
previous test results (e.g. “previous normal/abnormal
smear”). The average patient age of each node also
varies considerably, depending on individual
situations. For instance, some nodes contain
records of young patients (average age ~25 years), with
“first smear” as the dominant term in clinical notes. In
other nodes patients are much older (~59), with the
term “postmenopause” dominating the clinical notes.
Table 2. Top 5 “Pap smear” nodes
Rank | AQE    | Size   | % of Pap smear requests | Average age | Top descriptive terms
1    | 0.0007 | 22,257 | 8.58                    | 42.12       | routine, smear, warfarin
2    | 0.0004 | 12,075 | 4.65                    | 39.84       | cervix, look, normal, discharge, post
3    | 0.1530 | 11,000 | 4.24                    | 31.97       | contraception, cervix, look, normal, oral
4    | 0.5093 | 10,070 | 3.88                    | 38.36       | cin, previous, normal, smear, lmp
5    | 0.3880 | 9,184  | 3.54                    | 39.09       | abnormal, grade, low, previous, normal
“Warfarin” nodes
Another significant proportion of pathology
ordering is “Warfarin” related ordering. These requests
always have the test “PR” (also called “INR”) ordered
and the clinical notes always indicate “Warfarin” as the
medication the patient is taking. Although there are
only 14 “Warfarin” nodes, they contain over 15% of all
pathology requests. Most “Warfarin” nodes are large
and homogenous. There is some negative correlation
(Pearson Correlation -0.478 significant at the 0.01
level) between AQE and size. The largest node
distinguishes itself by the extremely large size (53% of
all “Warfarin” orderings) and extremely small AQE
(0). Descriptions in clinical notes do not differ
significantly from node to node except in the level of
detail GPs include in the notes. The level of detail varies
from the single word “Warfarin” indicating the
medication the patient is taking to detailed information
concerning the dose (e.g. 2mg) and time (e.g. 7pm) of
“Warfarin” intake by the patient. Only one node
indicates in clinical notes the nature of ordering (serial
testing) rather than the medication “Warfarin”. There
are no significant differences in age or gender across
these nodes. The top 5 “Warfarin” nodes are presented in
Table 3.
Table 3. Top 5 “warfarin” nodes
Rank | AQE    | Size   | % of warfarin requests | Average age | Proportion female | Top descriptive terms
1    | 0      | 93,830 | 53.00                  | 71.53       | 0.46              | rx, warfarin
2    | 0.0001 | 33,255 | 18.79                  | 69.21       | 0.46              | warfarin
3    | 0.0019 | 23,468 | 13.26                  | 69.89       | 0.42              | mg, warfarin
4    | 0.6910 | 8,382  | 4.73                   | 67.77       | 0.46              | mg, warfarin, pm
5    | 0.0363 | 8,286  | 4.68                   | 71.49       | 0.43              | serial, PR
5. Other distinctive nodes: results and
analysis
The taxonomy for reasons for ordering that consists
of seven categories as presented in Section 3 is used to
classify each distinctive node into one of the categories
according to its dominant keyword(s). As discussed in
section 3, in cases when there is more than one
keyword encountered in all the records in the node,
there is typically only one clinical indication for
pathology ordering. Thus, using this approach, each
node can only be assigned to a single category. Table
4 summarizes the classification statistics for each
category.
Table 4. Classification of other distinctive nodes into seven clinical indication categories
Category of nodes    | Descriptive terms (dominant keywords)             | Number of nodes | Absolute size
Clinical condition   | Elevated, high, low, persistent, reoccurring, etc. | 17              | 64,606
Clinical problem     | Diabetes, hypertension, pain, infection, etc.     | 35              | 147,037
Medication           | Rx, lipitor, thyroxin, etc.                       | 10              | 36,570
Nature of ordering   | Screening, routine, follow-up, etc.               | 11              | 50,042
Sample description   | Fasting blood, etc.                               | 1               | 8,146
Previous test results | Previous, abnormal, etc.                         | 2               | 8,269
Management           | Specimen, label, doctor, etc.                     | 2               | 832
Total                |                                                   | 78              | 315,502
Figure 4. Distribution of nodes into clinical indication categories (pie chart: clinical problem 46%, clinical condition 20%, nature of ordering 16%, medication 12%, sample description 3%, previous test result 3%, management < 1%)
There are 17 nodes with a total size of 64,606
requests (about 20% of all requests in the other
distinctive nodes) with clinical notes mainly containing
the description of the clinical condition of the patient.
The biggest grouping (35 nodes), with clinical problems
or diseases as the dominant feature in clinical notes,
consists of 147,037 requests, almost half (~47%) of the
requests in the other distinctive nodes. The medication
category grouping (10 nodes) contains 36,570 requests
(~12%), while the nature of ordering category grouping
(11 nodes) makes up about 16% (50,042 requests). The
groupings that are based on the categories of sample
description (one node) and previous test results (two
nodes) are of nearly the same size (~2.6% each). The
smallest category grouping, management, consists of two
nodes and accounts for only 0.26% of the requests in
the other distinctive nodes.
Figure 4 illustrates the distribution of the node
categories.
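The category shares quoted above follow directly from the absolute sizes in Table 4. The short Python sketch below recomputes them; the variable names are assumptions, and the sizes are copied from Table 4.

```python
# Number of requests per clinical indication category (from Table 4).
category_sizes = {
    "Clinical condition": 64606,
    "Clinical problem": 147037,
    "Medication": 36570,
    "Nature of ordering": 50042,
    "Sample description": 8146,
    "Previous test results": 8269,
    "Management": 832,
}

total = sum(category_sizes.values())
# Share of each category as a percentage of all requests in these nodes.
shares = {name: 100.0 * size / total for name, size in category_sizes.items()}

for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name:22s}{share:6.2f}%")
```

The shares are percentages of requests rather than of node counts, which is why a single large node can outweigh several small ones.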
The biggest grouping of nodes, clinical problem
nodes, is further broken down into smaller groups of
nodes, each group representing an individual problem
or disease. Zhuang et al. [15] used a keyword extraction
method to analyze the clinical problems occurring in
clinical notes. The selection of key words was based on
the past literature or survey reports in the area of
pathology ordering. In contrast, in this study the
clinical problems are elicited using unsupervised text
mining, which reflects the natural clustering of similar
words or terms in clinical notes without any manual
manipulation or domain expert inputs. The results
presented here, therefore, may not necessarily
accurately reflect the frequency of clinically predefined
terms for problems or diseases, but rather
provide an important general reflection of the way
doctors commonly describe certain problems or diseases
in day-to-day practice. For example, certain keywords
such as “pain” and “infection” did not appear as
predefined clinical problems in the study by Zhuang et al.
[15], but in this study they emerge as dominant
keywords for some nodes.
The clinical problem category based grouping
consists of 35 nodes that cover 10 different clinical
problems. The groups of nodes relating to the same
clinical problem are ranked according to their size
(number of requests for each clinical problem). The
statistics for each problem group are presented in Table
5.
In order to illustrate the variety of different patterns
of expression in clinical notes among nodes with the same
dominating clinical problem, consider the “pregnancy”
nodes as an example. There are three nodes with
“pregnancy” as dominant keyword, each with a
different error and size. A closer look at the three
nodes reveals that the node with the smallest error
contains the simplest and most consistent expressions
of pregnancy problem, while the one with largest error
contains the longest and most sophisticated expressions
of pregnancy problem with other clinical
complications. Table 6 shows the comparison of the
three “pregnancy” nodes.
Table 5. Statistics for “clinical problem” groups
Rank  | Problem group     | Number of nodes | Absolute size | Relative size (% of “clinical problem” nodes)
1     | Hypertension      | 3               | 23,148        | 15.74
2     | Pain              | 4               | 22,624        | 15.39
3     | Lipid disorder    | 8               | 20,053        | 13.64
4     | Diabetes          | 7               | 17,299        | 11.77
5     | UTI               | 3               | 16,276        | 11.07
6     | Infection         | 3               | 14,684        | 9.99
7     | Pregnancy         | 3               | 10,533        | 7.16
8     | Chest problem     | 2               | 10,477        | 7.13
9     | Prostate problem  | 1               | 7,468         | 5.08
10    | Abdominal problem | 1               | 4,475         | 3.04
Total |                   | 35              | 147,037       | 100
Table 6. Comparison of the “pregnancy” nodes
Node | AQE    | Size  | Description
1    | 0.1352 | 6,916 | Simple and short description of pregnancy, e.g. “21 weeks pregnant”.
2    | 0.6735 | 1,752 | Pregnancy with some other problem, e.g. “Pregnant haematuria”.
3    | 0.8328 | 1,865 | Pregnancy with a long expression of other complications, e.g. “Patient is pregnant and has had possible contact with measles case”.
6. Summary and Conclusions
In this paper an unsupervised text mining algorithm
was used on a large dataset of pathology requests
obtained from a pathology company including the free-
text clinical notes. The underlying assumption is that
test requests with similar wordings of clinical notes are
clustered into one group while those with dissimilar
wordings are clustered into different groups. Homogenous
nodes of requests are generated as a result, and
characteristics such as size, average quantization error
(AQE), and most frequent keywords with percentage
are presented for each node.
The research design involved two stages. At the
first stage, the text mining algorithm was applied on
the text data to come up with emerging patterns. At the
second stage, these patterns were combined with
existing knowledge in a wider domain to provide a
framework for the analysis. The unsupervised
HDGSOMr algorithm is highly appropriate for
discovering emerging patterns, but it does not provide
the means for interpreting the results. Therefore, human
inputs based on pre-existing domain knowledge were
required for the
analysis and interpretation of the results.
Based on whether there are distinctive terms
dominating the node or not, the nodes are separated
into distinctive and “fuzzy” nodes. Due to the scope of
this study, “fuzzy” nodes are filtered out for future
research while distinctive nodes are included for close
investigation.
“Fuzzy” nodes are the nodes generated from the text
mining process that do not have a distinctive keyword
that is encountered in all the records within the node.
Almost 35% of all the pathology requests are in
“fuzzy” nodes exhibiting textual patterns in clinical
notes that requires further analysis. These nodes
typically contain several clinical indications that may
be less common than those observed in distinctive
nodes. However, these indications are also important as
they are indispensable parts of clinical practice of
pathology ordering. Besides, some interesting
problems may be diluted and scattered into different
“fuzzy” nodes because of the various terms different GPs
use to describe the same problem. For example,
“fatigue” can often also be described as “tiredness”,
“lethargy”, “weakness”, etc. Therefore, “fuzzy” nodes
(especially large ones) need to be further investigated
to discover the hidden patterns at a deeper level.
Through a general exploration of the distinctive
nodes two “special” groups of nodes can be identified:
“Pap smear” and “Warfarin”. Nodes within both
groups are located together, usually large in size, and
exhibit unique test ordering patterns.
Different “Pap smear” nodes present different
clinical implications of Pap smear ordering and
different patient profiles for the orderings. In contrast,
“Warfarin” nodes differ from each other mainly by the
level of detail expressed in clinical notes concerning
Warfarin intake. Other distinctive nodes are classified
according to their types of clinical indications such as
clinical conditions, clinical problems, medications,
nature of ordering, sample descriptions, previous test
results, and management issues. Nodes based on the
clinical problems can be further subclassified into
specific clinical problem related groups including
hypertension, pain, lipid disorder, diabetes, UTI,
infection, pregnancy, chest problem, prostate problem,
and abdominal problem.
Average quantization errors (AQEs) vary from node
to node, with a slightly negative correlation (Pearson
Correlation -0.182) to the size. Small error values may
indicate the consistency of expressions in clinical notes
within a group (node) of data records. Usually, the
shorter the expression, the more likely it can be
consistent, and the smaller the error may be. Larger
error values do not necessarily mean that the node is
“fuzzy” or the quality of the clustering is “bad”. They
may indicate that a certain clinical problem or
implication can be expressed in considerably different
ways by different doctors.
Size of the nodes may also generate important
implications. Large and homogenous nodes require
special attention because they are likely to capture both
common and consistent patterns of pathology ordering.
Some of the “Pap smear” and “Warfarin” nodes belong
to this category. Type of the nodes reflects the clinical
implications of ordering. It may contain important
clinical information in regard to reasons or issues of
pathology ordering from a doctor’s perspective.
The vocabulary of clinical notes is considerably
different from the language of formal medical texts,
references and publications. Some doctors are very
brief in expression while some others are quite wordy.
Most of the sentences in the notes are incomplete, and
the terms describing the same clinical indication can
vary widely. Traditional statistical methods are of
little use for analyzing this kind of information.
Abstracting or coding the text is a feasible option, but
prior knowledge and/or significant expert
understanding of the text may be required for
effective coding, which makes the process rely heavily
on external domain knowledge as an input. The
unsupervised text mining technique used in this study,
however, requires little pre-understanding or domain
knowledge. Rather, it may generate new knowledge
that facilitates or enhances the process of abstracting or
coding.
Presenting the case of a general exploration of the
vast dataset of free-text clinical information, this
research does not aim to provide the complete analysis
of all the clinical indications of pathology requests.
Presented here is just a top level overview of the most
frequently encountered patterns of the text. In addition,
the information recorded in the clinical notes by GPs is
subjective rather than objective. For example, the
cholesterol or lipid level is seldom noted by doctors in
numerical terms, but often an abnormal level is
interpreted as being either “low” or “elevated”. This
observation re-emphasizes the limitations of the
proposed approach as compared to standard data
mining on numerical data with regard to numerical
accuracy, but at the same time illustrates its benefits as
far as making use of invaluable text information that is
beyond the scope of classical numeric data mining.
As part of future research, the text mining results
from this study will be combined with numeric data
mining results to enhance understanding of pathology
ordering from both patient- and doctor-centric
perspectives. Future research also includes an online
GP learning activity. This program is designed to
investigate why GPs order certain tests in particular
circumstances and the factors influencing their test
ordering, which are not fully understood at present.
Acknowledgements
This research was partially supported by the
Australian Research Council Linkage Grant
LP0347622.
The authors would like to acknowledge the support of
XYZ pathology in providing the data.
We would like to thank anonymous referees for
constructive comments that helped us to improve the
quality of this paper.
7. References
[1] Cohen, J., Piterman, L., McCall, L., and Segal, L.,
“Near-patient testing for serum cholesterol:
attitudes of general practitioners and patients,
appropriateness, and costs”, Medical Journal of
Australia, 1998; 168:605-610.
[2] Britt, H., Miller, G.C., Charles, J., Knox, S.,
Valenti, L., Henderson, J., Pan, Y., Bayram, C.,
and Harrison, C., “General practice activity in
Australia 2002-03”, Australian Institute of Health
and Welfare (General Practice Series No. 14),
2004, p. 81.
[3] Guibert, R., Wicker, S., and Horrocks, M.,
Background Reading for QUP-GP workshop, “The
development of a research proposal to address the
appropriate use of pathology to general practice –
stage 1”, The Royal Australian College of General
Practitioners Research and Practice Support
Directorate, prepared by 31 August – 1 September
2001.
[4] Zhuang, Z.Y., Churilov, L., Burstein, F., Sikaris, K.,
“Combining data mining and case-based reasoning
for intelligent decision support for pathology
ordering by general practitioners in Australia”,
Proceedings of the Conference on Creativity and
Innovation in Decision Making and Decision
Support (CIDMDS 2006), an IFIP TC8/WG 8.3
Open Conference, 2006.
[5] Vinning R.F. and Mara P., “General practitioners
and pathology testing”, Medical Journal of
Australia, 1998; 168:591-592.
[6] Van Walraven, C. and Naylor, C.D., “Do we know
what inappropriate laboratory utilization is? A
systematic review of laboratory clinical audits”,
JAMA 1998; 280:550-8.
[7] Lundberg, G.D., “The need for an outcomes
research agenda for clinical laboratory testing”,
JAMA 1998; 280:565-6.
[8] Smellie, W.S.A., “Appropriateness of test use
in pathology: a new era or reinventing the
wheel?”, Annals of Clinical Biochemistry,
2003; 40:585-592.
[9] Miller, T. W. Data and text mining, a business
applications approach. Pearson Prentice Hall, New
Jersey, 2005.
[10] Amarasiri, R., Alahakoon, D., Premaratne, M. and
Smith, K. “HDGSOMr: A High Dimensional
Growing Self Organizing Map Using Randomness
for Efficient Web and Text Mining”, Proceedings
of IEEE/ACM/WIC Conference on Web
Intelligence (WI) 2005, Paris, France, pp. 215-221.
[11] Amarasiri, R., Ceddia, J. and Alahakoon, D.
“Exploratory Data Mining Lead by Text
Mining Using a Novel High Dimensional
Clustering Algorithm”, Proceedings of 4th
International Conference on Machine Learning
Applications (ICMLA '05), Los Angeles,
California, USA, 2005, pp. 267-272.
[12] Heinze D.T., Morsch, M. L., Holbrook, J. “Mining
free-text medical records”, Proceedings of the
AMIA 2001 Annual Symposium. American Medical
Informatics Association, November 2001, pp. 254-8.
[13] Cerrito, P. C. and Cerrito, J. C. “Data and text
mining the electronic medical record to improve
care and to lower costs”, Proceedings of the Thirty-
first Annual SAS Users Group International
Conference, Cary, NC: SAS Institute Inc. 2006,
Paper 077-31.
[14] Srinivasan, P. and Sehgal, A.K., “Mining
MEDLINE for similar genes and similar drugs”,
Department of Computer Science, The University
of Iowa TR# 03-02, July 2003.
[15] Zhuang, Z.Y., Churilov, L., Sikaris, K.,
“Uncovering the patterns in pathology ordering by
Australian general practitioners: a data mining
perspective,” Proceedings of the 39th Annual
Hawaii International Conference on Systems
Sciences (HICSS’06) Track 5, 2006, p. 92c.
[16] Kohonen, T., Self Organizing Maps, Springer,
2001.
[17] Salton, G. and Buckley, C. “Term-weighting
approaches in automatic text retrieval”,
Information Processing & Management, 1988, 24
(5), 513-523.
[18] Porter, M. “An algorithm for suffix stripping”,
Program, 1980, 14 (3), 130-137.
[19] Heinze, D.T., Morsch, M.L., and Holbrook, J.
“Mining free-text medical records”,
Proceedings of the AMIA 2001 Annual
Symposium. American Medical Informatics
Association: November 2001.
[20] Abidi, SSR and Manickam, S., “Augment
medical case base reasoning systems with
clinical knowledge derived from
heterogeneous electronic patient records”, 14th
IEEE Symposium on Computer Based
Medical Systems (CBMS’2001), 26-27 July
2001, Bethesda (USA).
[21] Berry, M.J.A. and Linoff, G.S. Mastering Data
Mining, New York, John Wiley and Sons, 2000.