Data mining of administrative claims data for pathology services


Proceedings of the 34th Hawaii International Conference on System Sciences - 2001

0-7695-0981-9/01 $10.00 (c) 2001 IEEE

Data Mining of Administrative Claims Data for Pathology Services

Simon Hawkins, Graham J. Williams, Rohan A. Baxter, Peter Christen, Michael J. Fett, Markus Hegland, Fuchun Huang, Ole Nielsen, Tatiana Semenova, Andrew Smith

Cooperative Research Centre for Advanced Computational Systems (ACSys)

GPO Box 664, Canberra ACT 2601, Australia.

Abstract

Australia has a universal health insurance scheme called Medicare. Medicare payments for pathology services generate voluminous transaction data on patients, doctors and pathology laboratories. The Health Insurance Commission (HIC) currently uses predictive models to monitor compliance with regulatory requirements. The HIC commissioned a project to investigate the generation of new features from the data. These features were summarised, visualised and used as inputs for clustering and outlier detection methods. Some initial interpretations and insights into the pathology service industry are discussed. Further work is required for feature selection, training of predictive models with the new features and the evaluation of performance against the currently deployed models.

1 Introduction

Australia has a universal health insurance scheme called Medicare. Medicare payments for pathology services generate voluminous transaction data on patients, doctors and pathology laboratories. These payments are administered by a government agency called the Health Insurance Commission (HIC). The HIC’s charter is to make accurate and timely payments while maintaining confidentiality and privacy. The administrative claims transaction data potentially contain valuable information about the nature of the pathology services industry and its regulatory compliance. This data mining project was undertaken by the Advanced Computational Systems Cooperative Research Centre (ACSys), which is a third-party research consultancy center contracted by the HIC. This paper reports on the project results as well as some project management issues arising from an out-sourced data mining project in the privacy-conscious health industry.

The business problem for the HIC is that of monitoring the compliance of pathology laboratories with the payment system’s regulatory framework. The HIC already has capabilities for training and deploying predictive models to aid in predicting levels of compliance. If a compliance level for a sample of pathology laboratories for a regulatory requirement is given, then the HIC can use predictive modeling techniques, such as neural networks and decision trees, to predict the compliance of a pathology laboratory. The resulting predictive models can then be deployed to alert the HIC whenever a pathology laboratory is predicted to be above a risk-threshold. These predictive models currently use a small number of features as inputs. These features typically include volumes of transactions, the types of pathology tests performed and the dollar value of tests performed each quarter for each pathology laboratory. They are statistics derived from the online transaction processing system.

For the current project, the HIC was interested in the question of what other features may be generated from the transaction data for the purposes of characterising pathology laboratory utilization patterns. In the context of the knowledge discovery in databases (KDD) process [4], this project focusses on the earlier steps of data preprocessing and feature generation (also called data transformation). In commercial data mining application areas, such as marketing, financial modelling and telecommunications, the feature generation task is vital for competitive success. However, the feature generation task has not been prominent in the literature [8]. The choice of the right features in financial modelling, for example, offers competitive advantages for analysts, which may explain this dearth of literature.

The later steps of the KDD process (pattern searching using data mining techniques, model evaluation and model deployment) were not the primary focus of this project. One key reason for structuring the project this way was the HIC’s legislated privacy requirements. It cannot legally release identified data about pathology laboratories, doctors or patients to a third party such as ourselves.

We now briefly describe the HIC’s concern with this study. The HIC aims to better understand the structure of the pathology services industry and the behaviour of the industry players, using the claims transactions that it processes. This understanding may have a bearing on:

• Ways of identifying unnecessary, wasteful and excessive servicing and inappropriate practice (in extreme cases, fraudulent practice).

• Policy recommendations for controlling pathology costs whilst maintaining pathology services [13].

• Policy recommendations for improving health care service delivery.

The pathology services transactions provide data about health care consumers (patients), doctors providing services (general practitioners and specialists) and pathology laboratories. A transaction arises for each Medicare item (which may comprise a standard combination of two or more tests) ordered when a patient visits a doctor. Except in rare cases, the doctor chooses the pathology laboratory that carries out the test. In most cases the pathology laboratory then makes a fee claim to the HIC for the pathology service provided. In other cases, the pathology laboratory invoices the patient directly, who then makes a claim to the HIC. For most tests, patients will go to a collection centre run by a pathology laboratory for specimen collection. For some tests, doctors collect the specimen themselves and make a fee claim to the HIC for collecting the specimen.

Pathology laboratories are not permitted to offer inducements to doctors to order large numbers of pathology tests from their laboratories. In rural and regional areas, doctors have limited choice of pathology laboratory. This prior knowledge of the pathology services industry suggests that patterns in the relationships between doctors and pathology services are of primary interest for assessing any inappropriate practices and for understanding the pathology industry’s structure and behaviour. Patterns in test ordering for particular patients by doctors are also of interest in assessing health care service delivery quality.

Period            File Size   Transactions
Quarter 1, 1997   680MB       4,448,547
Quarter 2, 1997   704MB       4,529,848
Quarter 3, 1997   706MB       4,496,426
Quarter 4, 1997   700MB       4,423,777
Quarter 1, 1998   749MB       4,777,493
Quarter 2, 1998   730MB       4,640,190
Quarter 3, 1998   758MB       4,819,261
Quarter 4, 1998   733MB       4,623,801
Total             5.8GB       36,759,343

Table 1. Number of transactions by quarter

The HIC has not yet completed its review of the results of this project. However we will motivate our results with some initial interpretations and hypothetical policy implications. The results should be of general interest to health administrators possessing health service transaction data.

The remainder of the paper is organised as follows: Section 2 describes the data organisation and data transformations undertaken before features could be generated and visualised efficiently and flexibly. Section 3 describes the new feature sets that were generated, some data mining methods that utilise these features and some visualisations of the feature sets. Section 4 describes additional features with time components that were investigated. Section 5 summarises the current and prospective insights gained from using the generated features. It also discusses whether the project has real benefits despite privacy requirements restricting the model evaluation and testing that could be done. Section 6 gives our conclusions.

2 Data Organisation

In this section, we describe the available data, the data transformations and the data organisation used to enable the fast access required for the feature generation methods.

2.1 Data Types

The project data were Medicare Benefits Schedule Category 6 (Pathology Services) transactions for the State of New South Wales for the eight quarters in 1997/1998. Table 1 summarizes the dataset. Additional data on referring doctor attributes were also provided.

Each transaction has 44 fields relating to four distinct entities. They are the pathology laboratory, which performs the pathology test, the doctor, who orders the test, the patient, for whom the test is ordered, and the transaction itself. Table 2 gives a summary of the transaction fields. The 36.8 million transactions covered 79 pathology laboratories, 20,314 doctors and 3,853,603 patients. The HIC is required by law to de-identify fields that could identify any individual entity. This was done by encrypting entity identifiers and postcodes. In lieu of unencrypted postcode location information, a RRMA field coded seven different types of geographic regions, including rural, metropolitan and city.

Entity                 Entity fields (meaning described in text where they arise)
Transaction            Test item number, date of service, date of processing, date of referral, date of lodgement, schedule fee for test, benefit paid, hospital indicator
Pathology Laboratory   Unique identifier, RRMA
Doctor                 Unique identifier, RRMA, specialty (GP or specialist)
Patient                Unique identifier, date of birth, gender, RRMA, home country, age

Table 2. Summary of transaction fields, grouped by the entity they describe

2.2 Data Transformation

The following pre-processing was performed:

• The five date fields were converted to day offsets, starting from January 1, 1970 (the Unix epoch starting date). The offsets for dates before January 1, 1970 are negative. This simplified the calculation of time lags used in feature generation.

• Empty field values were replaced with a marker value.

• Since some pathology tests have different item numbers in different years, all test item numbers were mapped to those current at June 1999.
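The date conversion in the first step can be sketched as follows. This is only an illustration in Python, not the Perl code used in the project: the ISO `YYYY-MM-DD` input format, the function name `to_day_offset` and the value of `MISSING_MARKER` are assumptions, since the paper states neither the raw date format nor the marker value.

```python
from datetime import date

EPOCH = date(1970, 1, 1)  # Unix epoch start date, as stated in the text

# Hypothetical marker for empty fields; the paper does not give the real value.
MISSING_MARKER = -(10 ** 6)


def to_day_offset(iso_date: str) -> int:
    """Convert a 'YYYY-MM-DD' string (assumed format) to days since the epoch.

    Dates before 1 January 1970 give negative offsets.
    Empty strings are mapped to the marker value.
    """
    if not iso_date:
        return MISSING_MARKER
    return (date.fromisoformat(iso_date) - EPOCH).days


# A time lag is then a plain integer subtraction of two offsets.
referral = to_day_offset("1997-03-01")
service = to_day_offset("1997-03-05")
lag_in_days = service - referral
```

Storing dates as integers in this way means that lag features reduce to subtraction on a single column pair, which is the simplification the text refers to.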

The preprocessing was done using the Perl scripting language [12], because we noticed an order of magnitude difference in performance between Perl and Tcl [10]. This performance difference is important considering the quantity of data involved. For example, a single pass of the data using Tcl took 72 hours, whereas it took 3 hours in Perl on our ten 167MHz-processor, 4.5 Gigabyte Sun 4000 Enterprise server. A number of passes over the data were required during the data transformation and data stratification process; a three day wait for each would soon take up a significant proportion of the project time.

2.3 Data Organisation and Access

The transactions were originally stored in a single large relational database table. One approach, and a current area of research in data mining, is to interface data mining methods with this relational database [6, 11]. SQL queries were used for ad hoc querying of the data throughout the project. However, our explorations of feature generation required fast access to one or more individual transaction columns, whereas a relational database provides fast access to individual transaction rows.

Alternative approaches for fast column access include data-cubes [1] and sufficient statistic caching [9]. These approaches are efficient for specific data mining methods such as association rules or clustering, but are not efficient enough for intensive exploration of interesting features. We developed a column-binary-flat file approach that was efficient, yet flexible enough, for feature generation. This organisation of the data allows our feature generation programs to selectively access one or more columns in an efficient, flexible way.
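The paper gives no details of the column-binary-flat file layout, so the following is only a minimal sketch of the general idea: each column is written to its own flat binary file, and a feature generation program reads just the columns it needs. The file naming, the 64-bit integer encoding and the function names are all assumptions made for illustration.

```python
import os
import tempfile
from array import array


def write_columns(rows, names, directory):
    """Write each integer column of `rows` to its own flat binary file."""
    for idx, name in enumerate(names):
        column = array("q", (row[idx] for row in rows))  # 64-bit signed ints
        with open(os.path.join(directory, name + ".bin"), "wb") as handle:
            column.tofile(handle)


def read_column(name, directory):
    """Read one column back without touching the other columns' files."""
    path = os.path.join(directory, name + ".bin")
    column = array("q")
    with open(path, "rb") as handle:
        column.fromfile(handle, os.path.getsize(path) // column.itemsize)
    return column


# Hypothetical toy transactions: (date of service offset, laboratory, test item).
rows = [(9862, 12, 3), (9863, 40, 3), (9870, 12, 7)]
names = ["date_of_service", "lab_id", "test_item"]

with tempfile.TemporaryDirectory() as tmp:
    write_columns(rows, names, tmp)
    labs = read_column("lab_id", tmp)  # only this column's file is read
```

With this layout, a pass that needs two of the 44 fields reads roughly two columns' worth of bytes rather than every full row.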

2.4 Data Stratification

Test subset                   Code   Description
Specialist, in hospital       sh     Test ordered by specialist doctor for patient in hospital.
Specialist, out of hospital   so     Test ordered by specialist doctor for patient out of hospital.
GP, in hospital               gh     Test ordered by General Practice (GP) doctor for patient in hospital.
GP, out of hospital           go     Test ordered by GP doctor for patient out of hospital.

Table 3. The four subsets of the stratified data

We stratified the data into the four subsets shown in table 3. The motivation for the stratification was two-fold:

• The subsets were smaller and more manageable for data manipulations.

• It was expected a priori that ordering patterns for each subset would be distinct. This expectation was only partially borne out: as discussed in section 3.1, test ordering patterns for the go and so subsets did not vary significantly.

3 Features for Pathology Laboratory utilization patterns

The purpose of this project was to find new features that could provide a basis for predictive modeling of pathology laboratory utilization patterns for compliance monitoring with regulatory requirements. As mentioned in section 1, existing features include counts of columns in the transaction table, including volumes of transactions, volume of types of pathology tests performed and the total dollar value of the tests performed.

In section 3.1 we examine the structure of the pathology laboratory market using clustering on relative proportions of types of pathology tests. The clusters found could be interpreted as ‘market niches’, summarising the relative test volumes from each market sector. Each pathology laboratory can be classified according to the market niche (or cluster) to which it belongs. Knowledge of market structure can have implications for health care financial policy and service delivery.

In section 3.2, pathology laboratories which are outliers with respect to various feature distributions are identified.

3.1 Relative volume of tests in each subset

The features generated for input into the clustering algorithm were the proportion of tests provided by a particular pathology laboratory in each of the four subsets. The proportion of tests, rather than absolute count of tests, is used in order to avoid pathology laboratory test volume affecting analyses.
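As an illustration of this feature, the sketch below assigns each transaction to one of the four subset codes of table 3 and then computes, per laboratory, the proportion of its tests in each subset. The input tuple layout and the boolean encodings of the specialty and hospital indicator fields are assumptions; the paper does not describe how these fields are coded.

```python
from collections import Counter, defaultdict

SUBSETS = ("gh", "go", "sh", "so")  # fixed feature order


def subset_code(is_specialist: bool, in_hospital: bool) -> str:
    """Return the table 3 code: s/g for the doctor type, h/o for the location."""
    return ("s" if is_specialist else "g") + ("h" if in_hospital else "o")


def subset_proportions(transactions):
    """Map each laboratory to its proportions of tests in SUBSETS order.

    `transactions` yields (lab_id, is_specialist, in_hospital) tuples.
    """
    counts = defaultdict(Counter)
    for lab_id, is_specialist, in_hospital in transactions:
        counts[lab_id][subset_code(is_specialist, in_hospital)] += 1
    features = {}
    for lab_id, counter in counts.items():
        total = sum(counter.values())
        features[lab_id] = tuple(counter[code] / total for code in SUBSETS)
    return features


# Hypothetical laboratory "A": three GP out-of-hospital tests, one specialist in-hospital test.
toy = [("A", False, False)] * 3 + [("A", True, True)]
features = subset_proportions(toy)
```

Because each laboratory's four values sum to one, a laboratory with ten times the volume of another yields the same feature vector if its mix of tests is the same.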

A k-means clustering method [5] was applied to the 79 pathology laboratories using the ‘relative proportion in each of the four subsets’ feature. The clustering method requires the number of clusters (k) to be given as an input parameter. Analyses were performed using the range between k = 2 and k = 10 as input to the clustering method. Distinct groups arose with k = 4, k = 4, and k = 5 clustering solutions. We choose to describe the k = 4 result because it can be interpreted as follows. The first two clusters in figure 1 contained laboratories that mainly processed pathology tests for specialists in and out of hospitals (sh and so), respectively. They contained 17 and 15 laboratories. Laboratories in cluster one processed about 65% of their tests for doctors in the sh category, and laboratories in cluster two did more than 60% of their tests for doctors in the so category. Cluster three covers the laboratories which almost exclusively did tests from doctors out of hospitals: 60% of their tests were for GPs out of hospitals and about 30% were for specialists out of hospitals. This is the largest cluster with 25 laboratories. Cluster four contains the 22 laboratories which processed more than 85% of their tests with GPs out of hospitals (go).

Figure 1. Laboratories clustered according to relative volume of tests for each doctor subset. For each cluster, the relative volume of tests in the four groups gh, go, sh and so is given. [Four panels, each with the subsets gh, go, sh, so on the x-axis and relative volume from 0 to 1 on the y-axis: Cluster 1 (contains 17 laboratories), Cluster 2 (contains 15 laboratories), Cluster 3 (contains 25 laboratories), Cluster 4 (contains 22 laboratories).]

The clusters seem to identify market niches for the pathology laboratories. The market niches could be due to geographical factors, to marketing niche, to doctor preferences or to other factors we have not considered. There are possible policy implications from this insight into market structure. For example, there have been changes in the pathology laboratory market in recent years due to bankruptcies and mergers. It should be interesting to see if this activity is focussed within any of the identified niches or across them.
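For readers unfamiliar with k-means, the sketch below shows the basic Lloyd iteration on four-dimensional proportion vectors. It is not the implementation used in the project (the paper cites [5] without giving details); the random initialisation, the squared Euclidean distance and the toy data are assumptions chosen for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's algorithm: returns (cluster label per point, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k distinct points as starting centroids
    labels = None
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the previous centroid if a cluster empties
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Hypothetical (gh, go, sh, so) proportions for four laboratories.
labs = [
    (0.00, 0.90, 0.00, 0.10),
    (0.00, 0.85, 0.05, 0.10),
    (0.10, 0.10, 0.70, 0.10),
    (0.05, 0.10, 0.65, 0.20),
]
labels, centroids = kmeans(labs, k=2)
```

Running the same routine for each k between 2 and 10 and inspecting the resulting centroids corresponds to the exploration described above; the choice of which k to report remains a matter of interpretation.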

3.2 Outlier Pathology Laboratories

A simple but effective data mining method is to examine outliers. For univariate continuous features, an outlier can be defined as a value outside two (or other suitable value) standard deviations from the mean. For univariate categorical features, an outlier can be defined as a relative proportion which differs from the mean by more than 10% (or other suitable value).
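Both definitions can be written down directly. The sketch below is an illustration under two stated assumptions: the standard deviation is the population standard deviation over laboratories, and "differs from the mean by more than 10%" is read as an absolute difference of more than 0.10 in the proportion (the text would also admit a relative reading). The laboratory names and values are hypothetical.

```python
from statistics import mean, pstdev


def continuous_outliers(values, n_sd=2.0):
    """Return keys whose value lies more than n_sd standard deviations from the mean."""
    mu = mean(values.values())
    sd = pstdev(values.values())
    return {key for key, v in values.items() if abs(v - mu) > n_sd * sd}


def proportion_outliers(proportions, threshold=0.10):
    """Return keys whose proportion differs from the mean by more than threshold."""
    mu = mean(proportions.values())
    return {key for key, p in proportions.items() if abs(p - mu) > threshold}


# Hypothetical feature values for ten laboratories; lab10 is far from the rest.
lag_feature = {f"lab{i}": 10.0 for i in range(1, 10)}
lag_feature["lab10"] = 100.0
flagged_continuous = continuous_outliers(lag_feature)

# Hypothetical proportions of one category across four laboratories.
share = {"a": 0.38, "b": 0.40, "c": 0.39, "d": 0.70}
flagged_proportion = proportion_outliers(share)
```

Both thresholds are parameters because, as the text notes, two standard deviations and 10% are only suggested defaults.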

We used many of the 44 transaction fields as features and then looked for outliers with respect to those features. We now give examples of features that were discriminatory (i.e. identified a small proportion of pathology laboratories as outliers) and the outlier laboratories that were found:

• The relative proportion of tests over the eleven pathology test categories. The test categories and overall average percentages are shown in table 4. Laboratory 9@6 had its percentage of tests in Basic tests significantly higher than the other laboratories. Laboratories 999 and +9% had test percentages for Chemical tests significantly higher than the other laboratories. These outlier results provide further insight into the structure of the pathology services market.

Group Number   Group Name           Percentage
1              Haematology          16
2              Chemical             28
3              Microbiology         13
4              Immunology           1
5              Tissue               1
6              Cytology             4
7              Cytogenetics         1
8              Infertility          0
9              Basic                0
10             Episode Initiation   36
11             Specimen Referred    0

Table 4. Percentage of tests in each test group for laboratories that have mainly out of hospital GPs

• Patient gender. The average male-female distribution of patients across all laboratories was 62% male and 38% female. The laboratories differing from this proportion and their outlier proportions are shown in table 5.

Laboratory Identifier   Proportion of Males to Females (female, male)
33$                     (0.54, 0.46)
335                     (0.70, 0.30)
%$, 3@5, 9$6, 944       (0.67, 0.33)

Table 5. Laboratories with relative gender proportion outliers

• Laboratories with outliers in the patient age distribution are shown in table 6.

Laboratory Identifier   More patients than average in:
%9                      late 40s
335                     late 70s
566                     mid 60s
9@6                     late 30s

Table 6. Laboratories with age distribution outliers

Some explanations will be due to differences in patient catchment area. For example, a pathology laboratory whose primary market is in regional coastal areas, with a high proportion of retirees, will be expected to have a higher proportion of older patients. We observe that pathology laboratory 335 performs relatively more tests for female patients and for older patients. More female patients can be explained by more older patients, as females live longer on average. More older patients could be explained by geographic location or market niche. Pathology laboratory 9@6 does more Basic tests and also has more patients in their late 30s than other laboratories. It may be that patients of this age group tend to have more Basic tests than other groups. The cause for this pathology laboratory having patients aged in their late 30s is once again possibly its geographic location, or market niche.

4 Features with a time component

Next, we describe some relevant temporal features and visualisations for characterising pathology laboratory behaviour. The pathology services market is dynamic. New pathology laboratories are formed (from mergers or ab initio), while others are disbanded. Doctors change their choice of pathology laboratories; some use only one laboratory, others use a combination and some switch, add new laboratories and trial new ones. These trends, and explanations behind these trends, are of interest to the HIC.

4.1 Doctors changing pathology laboratories

The HIC is interested in the ordering patterns of individual doctors as a function of the pathology laboratories used.

Figures 2 and 3 show visualisations of the relationship two doctors have with pathology laboratories over time. The first panel for each doctor shows the time pattern of ordering of different tests by that doctor from different laboratories. Pathology laboratories have been mapped to integers between 1 and 79 on the y-axis of the first panel in figures 2 and 3. Each × stands for any number of tests that the doctor has ordered with a specific laboratory in a given week. The second panel shows the total number of tests per week (TpW) this doctor ordered from all laboratories. The combination of these two plots shows when, where and how many pathology tests a doctor ordered.

Figure 2. Pattern of laboratory use and tests per week for doctor %5%4. [Two panels over Jan97 to Dec98: laboratory index (0 to 80) and tests per week (0 to 500).]

Figure 3. Pattern of laboratory use and tests per week for doctor 447. [Two panels over Jan97 to Dec98: laboratory index (0 to 80) and tests per week (0 to 80).]

Doctor %5%4 in figure 2 started test ordering from two new laboratories (at 18 and 55 on the y-axis) in July 1997, while simultaneously doubling the number of tests ordered from around 200 per week to 400 per week. It is interesting to consider what caused this change in behaviour. It could be that the doctor has changed from working half-time to full-time, since 400 tests per week is about average for a full-time GP.

Doctor 447 in figure 3 ceased ordering from two laboratories (those at 31 and 61 on the y-axis) in August 1998, and started ordering from a new one. During this transition, the doctor had a complete break from ordering (perhaps a holiday or relocation of practice) and then resumed ordering at previous test volumes.
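The data behind the two panels of figures 2 and 3 can be derived from a single doctor's transactions with one pass. The sketch below is an assumption about how this might be done, not the project's code: it takes (date of service offset, laboratory index) pairs, and it defines a week as seven consecutive days counted from the epoch rather than a calendar week.

```python
from collections import Counter


def weekly_patterns(transactions):
    """Summarise one doctor's ordering by week.

    `transactions` yields (day_offset, lab_index) pairs for a single doctor.
    Returns the set of (week, lab) marks for the first panel and a Counter
    of tests per week (TpW) for the second panel.
    """
    marks = set()
    tests_per_week = Counter()
    for day_offset, lab_index in transactions:
        week = day_offset // 7  # weeks counted from the epoch (an assumption)
        marks.add((week, lab_index))  # one mark however many tests that week
        tests_per_week[week] += 1
    return marks, tests_per_week


# Hypothetical doctor who switches from laboratory 18 to laboratory 55.
orders = [(0, 18), (3, 18), (7, 55), (8, 55), (13, 18)]
marks, tests_per_week = weekly_patterns(orders)
```

A change of laboratory then shows up as marks appearing at a new laboratory index, and a change in workload as a step in the tests-per-week series.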

Although there are over 20,000 doctors in the data, most of the ordering patterns between a doctor and pathology laboratories are relatively stable over time. Besides the two doctors shown, we have manually identified about fifty other interesting ordering patterns. Of course, it is desirable to automate this process, but data mining algorithms in the literature we have reviewed do not currently handle multivariate data with a time component (while we have only visualised two of the fields over time, there are other time-varying fields of interest). An important area of current data mining research is the development of algorithms and visualisation techniques for time series analysis [7] and event sequence analysis [2].

4.2 Service Lags

A service lag is defined as the time interval between date of referral (DOR) and date of service (DOS) for a pathology test. A chronically-ill patient may have tests ordered for future periodic visits to a doctor. In these circumstances, it is convenient for the patient to have a specimen taken before the next visit, so that the doctor has the results available for consultation. However the regulatory guidelines do not generally allow this to be done more than a year in advance.

The summary of service lags in figure 4 reveals that most tests are performed within 6 months of the referral date. This compares with just 0.18% of tests with a service lag of six months to one year and 0.04% of tests with a service lag of more than one year. Two break points between main time intervals fall approximately at 183 days (6 months) and 365 days (1 year). Although service lags of more than a year are infrequent, a closer examination of them indicated some data quality issues. It became apparent that the date of service and date of referral fields in these transactions had been subject to a high proportion of data entry errors. For example, 1997 is relatively commonly entered incorrectly as 1979. The significance of the six month gap is explained by reference to multiple test ordering rules in the Medicare pathology regulations: multiple tests can only be ordered up to six months ahead for seriously or chronically ill patients [3]. Outlier detection using the service lag feature reveals four laboratories with significantly longer moving average service lags than usual.
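The service lag and the three intervals discussed above can be expressed in a few lines. This is a sketch under assumptions: the break points of 183 and 365 days are taken from the text, but whether each boundary is inclusive is not stated, and the band labels and the handling of negative lags are choices made here for illustration.

```python
from datetime import date

SIX_MONTHS = 183  # break points quoted in the text, in days
ONE_YEAR = 365


def service_lag(date_of_referral: date, date_of_service: date) -> int:
    """Service lag in days: date of service (DOS) minus date of referral (DOR)."""
    return (date_of_service - date_of_referral).days


def lag_band(lag_days: int) -> str:
    """Assign a lag to one of the intervals discussed in the text."""
    if lag_days < 0:
        return "negative"  # service before referral: likely a data entry error
    if lag_days <= SIX_MONTHS:
        return "up to 6 months"
    if lag_days <= ONE_YEAR:
        return "6 months to 1 year"
    return "more than 1 year"


# A referral in March 1997 mistyped as 1979 produces a lag of many years.
typo_lag = service_lag(date(1979, 3, 1), date(1997, 3, 5))
typo_band = lag_band(typo_lag)
normal_band = lag_band(service_lag(date(1997, 3, 1), date(1997, 3, 5)))
```

Under this banding, the 1997/1979 transposition described above lands in the "more than 1 year" band, which is consistent with the data quality problems found among the longest lags.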

Figure 4. Frequency of service lags (time interval between DOR and DOS). [Log-log plot: x-axis days (DOS - DOR), 1 to 10000; y-axis number of occurrences, 1 to 1e+06.]

4.3 Patient Episodes

An episode is defined as the group of pathology tests ordered for a patient by the same doctor on the same day of consultation. Episode size is defined as the number of tests in an episode. Episode duration is defined as the maximum time between the date of referral for the episode tests and the date of service for an episode test.
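These definitions translate into a simple grouping. The sketch below assumes that the date of referral stands in for the day of consultation and that dates are already day offsets; both are assumptions, as are the tuple layout and the function name.

```python
from collections import defaultdict


def build_episodes(transactions):
    """Group tests into episodes and compute episode size and duration.

    `transactions` yields (patient, doctor, dor, dos) tuples with the dates
    as day offsets. An episode is keyed by (patient, doctor, dor).
    """
    service_dates = defaultdict(list)
    for patient, doctor, dor, dos in transactions:
        service_dates[(patient, doctor, dor)].append(dos)
    episodes = {}
    for key, dates in service_dates.items():
        referral_day = key[2]
        episodes[key] = {
            "size": len(dates),                       # number of tests
            "duration": max(dates) - referral_day,    # longest referral-to-service gap
        }
    return episodes


# Hypothetical data: patient p1 has three tests from one consultation with doctor d1.
toy = [
    ("p1", "d1", 100, 100),
    ("p1", "d1", 100, 101),
    ("p1", "d1", 100, 130),
    ("p2", "d1", 100, 102),
]
episodes = build_episodes(toy)
```

Episodes whose referral date or last service date falls outside the available data window would still need to be excluded separately.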

Various test-based features of episodes were examined. Laboratories that have more repeated tests in an episode than is typical are identified.

Approximately 6.5 million episodes were initiated in 1997. For 99% of these episodes, the tests in the episode were performed by a single laboratory.

Using episodes as features introduces the complication for analysis of windowing effects. Episodes that are initiated before the beginning of 1997 continue into 1997 and episodes that start near the end of 1998 do not end until after the available data window. We excluded these incomplete episodes from the analysis.

Figure 5 presents the distribution of the number of tests in each episode size. Typically episodes of size 2k are much more frequent than episodes of size 2k + 1. The most frequent episodes have size of 2 to 4 tests. Episodes of this size form the majority of episodes. There is a drop in tests at around size 60.

The issue in interpreting this feature is how many tests per episode can be clinically justified. The issue is a complicated one, but can be broken down into two separate issues. The first issue concerns multiple tests of the same type. This is indicated for chronically-ill patients, where it may be convenient to order multiple tests for the next weeks or months. The second issue concerns the number of different tests that can be ordered in the same episode with clinical justification.

Figure 5. Number of episodes by episode size. [Log-log plot: x-axis size of episode, 1 to 1000; y-axis total number of pathology tests for this size, 10 to 1e+08.]

5 Implications

We have used the features generated in this paper tidentify pathology laboratories and doctors who are outlierswith respect to distributions over these features. Addition-ally, as mentioned in the introduction, the features generatein this paper will be used in new predictive models and theirperformance compared with existing predictive models us-ing existing features.

As one would expect, some of the pathology laboratoryutilization patterns discovered using the new features presented here are novel, wherease others are known from aternative sources of knowledge. The client intends to investigate the novel patterns for explanation and significanceWhy are we not reporting on these patterns? Under theHealth Insurance Commission Act, as a third-party contrac-tor to the HIC, we can only legally receive de-identified in-formation about patients, pathology laboratories and doctors. This privacy requirement excludes interpretations ofresults using information that can identify entities (such astheir geographic market focus). In this paper, we have useexamples of how our results may possibly affect HIC policy.

Of longer term interest, outside the scope of the present study, is comparing pathology ordering patterns against a standard of best practice. This is difficult using Australian health data because of the absence of diagnostic information that would explain why the pathology test was ordered. In some cases, clinical diagnoses can be inferred from other administrative data. For example, there is a standard battery of tests for the second trimester of pregnancy, and so deviations from standard practice through under- or over-servicing may be observed. It is an open question whether this type of inference can be derived reliably from administrative claims data. For instance, over-servicing can be confounded with further co-morbidity investigations.
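If a standard battery were available as a list of test codes, the comparison could be sketched as a pair of set differences. The function name `compare_to_battery` and all test codes below are hypothetical, and tests outside the battery are only candidates for over-servicing, since they may reflect co-morbidity investigations.

```python
def compare_to_battery(ordered, battery):
    """Compare the tests ordered in an episode with a reference battery of tests."""
    ordered, battery = set(ordered), set(battery)
    return {
        # Battery tests that were not ordered: possible under-servicing.
        "missing": sorted(battery - ordered),
        # Ordered tests outside the battery: possible over-servicing,
        # or a legitimate investigation of a co-morbidity.
        "extra": sorted(ordered - battery),
    }


# Hypothetical battery and episode.
battery = ["T1", "T2", "T3"]
result = compare_to_battery(["T1", "T3", "T9"], battery)
print(result)
```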

6 Conclusion

We have generated new features from pathology service claims data that were then used to identify outlying laboratories and doctors. The features were also used to visualize doctors’ ordering practices.

Algorithms for automating the process of finding outliers, and for clustering entities characterised by features involving multivariate time series and outliers, are needed in this domain and are not currently available.

We have extended the range of features available to the HIC beyond those computed from counts of columns in the transaction table. We identified a number of new interesting features for use in predictive modeling. These features were summarised, visualised and used as inputs for clustering and outlier detection methods. Data organisation and data transformation methods were described for the efficient access and manipulation of these new features. Further work is required for feature selection and training of predictive models with the new features and evaluation of performance against the currently deployed models.

Acknowledgements

We thank the Health Insurance Commission (HIC) for access to the data and financial support of the project. Peter Christen was funded by the Swiss National Science Foundation (SNF) and the Novartis Stiftung, Switzerland. We thank the referees for their suggestions, which greatly improved the paper.

References

[1] S. Agarwal et al. On the computation of multidimensional aggregates. In Proc. VLDB’96, pages 506–521, 1996.

[2] R. Agrawal and R. Srikant. Mining sequential patterns. In Proc. of the 11th Int’l Conference on Data Engineering, pages 487–499, 1995.

[3] Commonwealth Department of Health and Family Services. Medicare Benefits Schedule Book. Australian Government Publishing Service, 1997.

[4] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. Advances in Knowledge Discovery and Data Mining, chapter From Data Mining to Knowledge Discovery: An Overview, pages 1–36. AAAI Press, Menlo Park, CA, 1996.

[5] J. Hartigan and M. Wong. A K-means clustering algorithm. Applied Statistics, 28:100–108, 1979.

[6] G. John and B. Lent. Sipping from the data firehose. In Third Int. Conf. on Knowledge Discovery and Data Mining, pages 199–202. AAAI Press, Menlo Park, CA, 1997.

[7] E. Keogh and P. Smyth. A probabilistic approach to fast pattern matching in time series databases. In Third Int. Conf. on Knowledge Discovery and Data Mining, pages 24–30. AAAI Press, Menlo Park, CA, 1997.

[8] H. Liu and H. Motoda. Feature selection for knowledge discovery and data mining. Kluwer Academic Publishers, Boston, 1998.

[9] A. Moore et al. Cached sufficient statistics for automated mining and discovery from massive data sources. Technical report, Robotics Institute and School of Computer Science, Carnegie Mellon University, 1999.

[10] J. Ousterhout. Tcl and the Tk toolkit. Addison Wesley Longman, 1994.

[11] S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. Data Mining and Knowledge Discovery, 4(2/3):89–125, 2000.

[12] L. Wall, T. Christiansen, and R. Schwartz. Programming Perl. O’Reilly and Associates, 1996.

[13] K. Wheelwright. Controlling pathology expenditure under Medicare - a failure of regulation? Federal Law Review, 22(1), 1995.
