On the validity of client-side vs server-side web log data analysis


Gi Woong Yun, Department of Telecommunications, School of Communication Studies,

Bowling Green State University, Bowling Green, Ohio, USA

Jay Ford, Center for Health Systems Research and Analysis,

University of Wisconsin–Madison, Wisconsin, USA

Robert P. Hawkins, School of Journalism and Mass Communication,

University of Wisconsin–Madison, Wisconsin, USA

Suzanne Pingree, Department of Life Sciences Communication,

University of Wisconsin–Madison, Wisconsin, USA

Fiona McTavish, Center for Health Systems Research and Analysis,

University of Wisconsin–Madison, Wisconsin, USA

David Gustafson, Department of Industrial Engineering, University of Wisconsin–Madison,

Wisconsin, USA, and

Haile Berhe, Center for Health Systems Research and Analysis,

University of Wisconsin–Madison, Wisconsin, USA

Abstract

Purpose – This paper seeks to discuss measurement units by comparing internet use with traditional media use, and to understand internet use from the traditional media use perspective.

Design/methodology/approach – The benefits and shortcomings of two log file types are carefully and exhaustively examined. Client-side and server-side log files are analyzed and compared using proposed units of analysis.

Findings – Server-side session time calculation was remarkably reliable and valid, based on its high correlation with the client-side time calculation. The analysis revealed that server-side log file session time measurement is more promising than researchers previously speculated.

Practical implications – The ability to identify each individual user and a low incidence of caching problems were strong advantages for the analysis. These web design implementations and this web log data analysis scheme are recommended for future web log analysis research.

Originality/value – This paper examined the validity of client-side and server-side web log data. The triangulation of the two datasets allows research designs and analysis schemes to be recommended.

Keywords Data analysis, Worldwide web, Measurement

Paper type Research paper



Internet Research, Vol. 16 No. 5, 2006, pp. 537-552. © Emerald Group Publishing Limited, 1066-2243. DOI 10.1108/10662240610711003

One of the main motivations of internet content providers in expanding the availability of multimedia is the perception that the internet provides unparalleled access to accurate usage data. It is generally felt that the web log traces left by individual internet users provide an unprecedented quantity and quality of information to researchers and to those who study consumer and market behavior.

However, there have been warning signs about the validity of web log data (Goldberg, 2001). In fact, some researchers paid only minor attention to its validity during their analysis (e.g. Davis, 2004; Eveland and Dunwoody, 1998a; Eveland and Dunwoody, 1998b; Jansen and Resnick, 2005; Phippen, 2004). This may be because internet use data collected from computers are expected to provide precise and detailed information about users’ internet use behavior (Eveland and Dunwoody, 1998a). Indeed, it is a reasonable assumption that internet use behavior tracked by computer software will be more valid than previous media use tracking methods. This high expectation of validity rests on the pinpoint accuracy of the client computers’ or server computers’ data collection software.

Some researchers have questioned the usefulness of transaction log data (Peters et al., 1993; Kurth, 1993; Larson, 1991). Others argued that data structures and the complex collection algorithm should be explored before the data are analyzed, as these contribute greatly to data quality and quantity (Phippen, 2004). For instance, the unit of analysis of the data needs attention before any scientific analysis of internet use data. Deciding on a proper unit of analysis is difficult, and the choice requires predicting and including analysis units ahead of data collection.

The validity of web data logs cannot be taken for granted, and there is much we still need to learn about how to collect and accurately interpret online activity. First, this paper discusses measurement units by comparing internet use with traditional media use, which helps us understand internet use from the traditional media use perspective. Second, it examines two log file types by carefully and exhaustively reviewing the benefits and shortcomings of the internet use data. Third, it discusses the implications of the unit of analysis for web log file analysis. Finally, it suggests some solutions, along with some questions for further discussion of this topic.

A unit of analysis

Many researchers have already utilized internet log data to understand individual patterns of knowledge seeking via the internet. They constructed variables to track which web pages users have visited (e.g. Eveland and Dunwoody, 1998a; Eveland and Dunwoody, 1998b; Phippen et al., 2004), what users have queried (e.g. Jansen and Spink, 2005; Jansen et al., 2005; Jones et al., 2000; Sandore, 1993; Taha, 2004), what they wrote while they were using a computer, who they communicated with, what they communicated, or how they communicated (e.g. McTavish et al., 2003; Phippen, 2004). These units of analysis of web site use have been operationalized based on the availability of web log data.

What to measure

Internet use is different from watching a network TV program, where millions of television viewers share a limited number of channel surfing patterns. Each internet user uniquely engages in non-linearly structured cyberspace. Therefore, it is not an easy task to record and analyze all users’ navigation behavior. However, some measurement units within a computerized recording system can be traced. The analysis units can be the amount of time spent during navigation or the number of computer file accesses.

Time. One of the most frequently measured units in media research is time. The sheer volume of time exposure has been investigated since the beginning of the media research field. Survey respondents are asked questions like “how many hours did you spend reading the newspaper per week?”, “how many hours did you watch television last week?”, or “how many hours did you watch television news?” Such questions are also applicable to computer users. In fact, computerized recording systems can trace smaller and more precise increments of time. The duration of a stay on a web site can represent “time” spent on this medium, and it can be measured in millisecond units.

Hawkins and Pingree (1997) have extensively discussed time as a unit of analysis for computer use. They suggested five time levels for computer use, where time represents the foundation of each measure. The five levels are:

(1) “a lifestyle time frame”;

(2) “multi-episode segment of time”;

(3) “an episode of use”;

(4) “individual message”; and

(5) “within-message”.

Each measure helps researchers explore the amount of time allocated to computer interaction as well as the nature of the interaction. “A lifestyle time frame” measures general use during a lifetime. “Multi-episode segment of time” measures media use during particular segments of the lifetime, such as hours, weeks, or months. An “episode of use” is a time frame used to measure a specific occasion of use. “Individual message” measures the time spent on a specific message, and the “within-message” frame can measure accesses to a certain section of the message (Hawkins and Pingree, 1997).

By targeting computer interaction in increasingly smaller segments or time intervals, researchers have explored how computer use becomes beneficial to individual users (e.g. Booske and Sainfort, 1998; Smaglik et al., 1998). The five time levels proposed by Hawkins and Pingree can be applied to internet use data as well, although some modifications are required before adopting their time units. Among the five levels, “multi-episode segment of time” can be applied to two different time levels. Both computer and internet use can be “multi-episode segments of time”. For instance, turning the computer on and off multiple times creates multi-episode segments. At the same time, connection and disconnection to the internet can create multi-episode segments through multiple log-ins and log-offs. Although both episodes can be treated as multi-episode segments of time, a multi-episode of internet use happens only within the multi-episode of computer use, because a computer must be turned on before an internet use episode can start.

If we accept both computer use and internet use as “multi-episode segments of time”, the rest of the use time levels are relatively easy to apply. For internet use measurements, “an episode of use” will start with the internet connection and end at the point of internet disconnection. Also, the “individual message” can be a certain domain access (e.g. www.cnn.com) and the “within message” can be a web page access within a domain (e.g. www.cnn.com/news/space.htm).


A session is defined as a set of sequentially or semantically related clicks, and session units are very convenient for calculating time from server-side web log data (e.g. Jones et al., 2000; Jansen and Spink, 2005; Jansen et al., 2005; Peters, 1993). In fact, a session is very similar to “an episode of use”. A session starts when a user connects to the server and stops when the user leaves the web site. One of the rationales for web log data miners to utilize the session is that it is a useful unit of analysis, which can be instrumental in calculating media use time. A more detailed discussion of the session unit appears in the method section of this paper.
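
To make the session unit concrete, the sketch below groups server-side log records into per-user sessions using an inactivity timeout, in the spirit of the 20 to 30 minute expiry discussed later in this paper. It is a minimal illustration, not the code of any system described here; the (user, timestamp) record format and the 30-minute threshold are our assumptions.

    from datetime import datetime, timedelta

    TIMEOUT = timedelta(minutes=30)  # assumed inactivity threshold

    def split_sessions(records):
        """Group (user, timestamp) log records into sessions per user.

        A new session starts whenever the gap between two consecutive
        requests from the same user exceeds TIMEOUT.
        """
        sessions = {}  # user -> list of sessions, each a list of timestamps
        for user, ts in sorted(records, key=lambda r: (r[0], r[1])):
            user_sessions = sessions.setdefault(user, [])
            if user_sessions and ts - user_sessions[-1][-1] <= TIMEOUT:
                user_sessions[-1].append(ts)   # continue the current session
            else:
                user_sessions.append([ts])     # open a new session
        return sessions

    # Example: two requests five minutes apart form one session;
    # a third request two hours later opens a second session.
    t0 = datetime(2001, 6, 1, 9, 0)
    log = [("user1", t0), ("user1", t0 + timedelta(minutes=5)),
           ("user1", t0 + timedelta(hours=2))]
    print(len(split_sessions(log)["user1"]))  # -> 2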

Frequency of login vs page request. Web log information such as the volume of data sent or received[1], the locations of users[2], or query strings can be measured in many ways. However, those measurements are less relevant to the quantity and quality of an individual’s media use. One of the more popular ways to measure internet use is the frequency of login, which is based on the record of users’ points of entry to the web site. Thus, the cumulative entries to a specific web site or internet program can be calculated. The frequency of login is equivalent to the number of episodes of internet use, or the number of sessions, in the internet multi-episode time frame.

Another method of recording user behavior is the frequency of page requests. When users access any web page, the user’s computer requests a certain uniform resource locator (URL). Server computers respond and send the requested data to the user’s computer. Subsequently, the user’s computer displays the received web site files on the screen. During this process, the server computers or the user’s computer can record the requested URLs, and the accumulated number of page requests can represent the amount of content accessed from the web site.

The frequency of login and page request measurements are similar to the time measure of media use. In fact, “page requests” can tell us more about media use than traditional media use measurement, because the recorded URLs reveal filenames which can be used to determine the specific contents. However, a meaningful analysis of page requests requires preparation. That is, files should be sorted and tagged into specific categories in order to derive the meaning of the content accessed by users during their navigation. The revelation of the URLs in a log file may not mean anything when the URLs are not classified in some meaningful way. Yet, tagging each file with a meaningful category requires a great deal of resources, even for a moderate-size web site. In fact, creating this metadata for each web page is one of the most time-consuming processes in web log data mining (e.g. Handschuh and Staab, 2002).
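
As an illustration of such tagging, the sketch below maps requested URL paths to content categories using a hand-built lookup of path prefixes. The category names and paths are hypothetical; in practice, this metadata must be curated entry by entry for each site.

    # Hypothetical prefix-to-category map; a real site needs a curated list.
    CATEGORIES = {
        "/news/": "news article",
        "/bbs/": "discussion board",
        "/info/": "basic information",
    }

    def categorize(url_path):
        """Return a content category for a requested URL path."""
        for prefix, category in CATEGORIES.items():
            if url_path.startswith(prefix):
                return category
        return "uncategorized"

    print(categorize("/news/space.htm"))  # -> "news article"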

Conventional data mining scholars have also worked on web log data mining, using data mining theory as a frame of reference. For instance, web log mining scholars use the term “web content mining”. It is the process that categorizes the content of a web site, and it is considered to be one of the most critical elements for a meaningful analysis of the data (Kanerva et al., 2004).

Web content mining includes important steps such as the data clean-up process. For instance, the web log records every file requested by a user’s browser. One screen access can leave multiple page requests in the web log data when one screen view requires multiple file accesses. If all page requests, or hits, are counted as web use, the access of the web page will be overestimated (Bertot and McClure, 1997). Therefore, the web content mining process should cut the overestimation by designating one file per screen view. Web log mining scholars call one screen access a “page access” or “page view” and distinguish it from “page requests” (Burton and Walther, 2001).
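
A minimal sketch of this clean-up step, assuming embedded resources can be recognized by file extension: it keeps only the one file that represents each screen view and discards the accompanying image, style sheet, and script requests. The extension list is illustrative.

    # Extensions assumed to indicate embedded resources, not page views.
    EMBEDDED = {".gif", ".jpg", ".png", ".css", ".js", ".ico"}

    def page_views(requested_paths):
        """Filter raw page requests (hits) down to page views."""
        views = []
        for path in requested_paths:
            ext = "." + path.rsplit(".", 1)[-1].lower() if "." in path else ""
            if ext not in EMBEDDED:
                views.append(path)  # keep the file that represents the screen
        return views

    hits = ["/home.asp", "/logo.gif", "/style.css", "/news.asp"]
    print(page_views(hits))  # -> ['/home.asp', '/news.asp']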


Defining an episode of use: stand-alone software vs internet. In the whole “lifestyle time frame”, computer use is only a part of lifetime media consumption, which includes television, radio, newspapers and many other media use experiences. Likewise, internet use is only a part of the whole computer use experience. We can conveniently divide computer use into stand-alone program use and internet-based computer use. Stand-alone computer use is any computer program use which does not demand an internet connection. Therefore, stand-alone programs include software such as stand-alone games, word processors, spreadsheets, or graphic design software. Although stand-alone computer programs occasionally make connections to the internet, they typically do not require a constant connection[3]. On the contrary, internet-based computer use requires a constant connection to the internet. The gate to the internet should be kept open continuously for the internet experience, and data should be exchanged between the user’s computer and the computers on the internet without any interruption.

Since stand-alone computer use occurs without an internet connection, it is logical to consider that stand-alone computer use is not internet use. Only after the connection to the internet and the exchange of data between the user’s computer and computers on the internet does an episode of internet use start. However, when a researcher includes stand-alone computer use in the study, turning on the computer will be the beginning of an episode.

Thus, the theoretical division between stand-alone program use and internet use can be relatively clear, but it may not be practical to separate the two in the contemporary computer use environment. In the real world, computer use includes a variety of use patterns. For instance, some users have constant network connections and constantly switch applications between stand-alone software and internet-based software. In fact, they may even use stand-alone, internet, and many other programs simultaneously by opening multiple windows and working on them at the same time. This type of user will constantly go in and out of the internet. Thus, if a researcher defines an episode of internet use by excluding stand-alone computer use, every switch between stand-alone computer use and internet use will create an episode of internet use. There is no doubt that this pattern of use will over-represent the number of episodes of internet use from the traditional media use measurement perspective.

The distinction between stand-alone computer use and internet use blurs even more when we adopt a more micro-level analysis. For instance, conventional electronic media use usually requires a constant connection: watching television requires the TV’s constant connection to the broadcasting frequency or cable. Therefore, watching TV almost always means connecting to the broadcasting frequency or cable. On the other hand, internet use does not always require a constant connection. Once the content is downloaded or cached on the user’s computer[4], internet users can browse the content without a live connection.

The micro-level analysis becomes more complicated when we consider stand-alone software designed for internet communication, such as e-mail. When users write e-mails with e-mail software, they typically open the program and start writing. During an e-mail writing session, the e-mail program does not need to connect to the internet; it only momentarily connects when users finish writing and push the send button. These examples indicate that the boundary between stand-alone computer use and internet use is quite complicated in the contemporary computer use environment, and it creates difficulties in measuring internet use as media use. Internet use researchers should not only be aware of this problem, but should also clarify how they included or excluded stand-alone use in their analysis algorithm.

Attention to the media. Another dimension that needs to be considered to obtain a valid measure of media use is an individual’s attention to the media. The measurement of media use often requires the assumption that the audience paid some level of attention during the measured media use time. When we ask an audience “how many hours do you spend watching TV?”, we assume that they paid at least some attention to the TV during the whole period, rather than no attention at all during the reported amount of TV watching time (Danaher, 1995). It is a reasonable assumption, since the answer is based on the user’s perceptual judgment of their media use time, although the answer is vulnerable to many measurement issues such as instrument decay, history, or social desirability effects (Babbie, 2004).

However, when we adopt a detailed computer recording system, the minimal attention assumption should be carefully re-examined. In fact, there is a serious problem in measuring users’ detailed behavior when the computer program cannot record users’ movements. There are many scenarios where the minimum attention level may not be satisfied. First of all, a web log file cannot tell whether users are sitting in front of the monitor and reading the information displayed on the screen. Users may work on something else while they are sitting in front of the computer, such as organizing their desks or answering phone calls (Catledge and Pitkow, 1995; Jansen and Spink, 2005). Additionally, typical web log data cannot accurately measure web use when users activate multiple browsers and access many web sites simultaneously. Furthermore, it is also possible that users may not read an entire page but navigate away after reading only the first line (Burton and Walther, 2001). Therefore, the duration of time spent on a specific web site can misrepresent the real media use duration. The internal validity of the time measured from a web log file will be seriously threatened if the above instances are prevalent among web users.

Two types of log files (client vs server)

Previous research on transaction log analysis (TLA) reported that the validity of the log file can be triangulated and examined. Some researchers (e.g. Barber and Riccalton, 2000) confirmed the validity of TLA and some (e.g. Nielsen, 1986) questioned it. We will look at the two different types of web log data: client-side and server-side.

Cost or privacy. Ideally, internet use data collection should seamlessly capture each individual’s entire internet use in both time and page request units. However, the likelihood of getting ideal data is very low due to the limitations of research methods. Therefore, researchers frequently collect only partial information about how individuals use the internet, depending on the method available to them.

In general, there are two ways of collecting internet use data: server-side collection and client-side collection. It is important to distinguish the characteristics of these two methods in terms of cost, privacy, and research burden, as well as their ability to collect specific and general internet use measures (Table I). Data collected from the server side use the web log file, which identifies users’ accesses to files on a certain web server. This method has several advantages. First of all, it is inexpensive. It only requires a single program which can collect data on the server-side computer, and most web servers are equipped with this functionality. In addition, minimal human resources are required for server-side data collection, which makes this method very economical. Second, server-side data collection is less invasive and reduces the ethical concerns of privacy, because a typical web server log cannot identify individuals. However, it should be noted that web site users can be identified by their computers’ IP addresses, which can pinpoint the locations of users’ computers. Nonetheless, unless users freely give away their private information to on-line surveys, the server-side method is relatively free from privacy problems, and it is not feasible for a researcher to link users’ private information with their web log data.

On the other hand, this relative safety from invading privacy is one of the critical shortcomings of the method. Because it cannot provide individual information such as user demographics, studies based solely on web log data can only produce limited research results. Although researchers can ask for voluntary responses from participants, research that relies on on-line users’ voluntary submission of their personal information can face severe sampling problems. Furthermore, the validity of such information is highly suspect, considering the anonymous nature of the internet.

Contrary to the server-side method, client-side data collection can be more accurate in measuring personal information. This is because the client-side method requires some contact with study participants. The client-side method works by installing a monitoring program on users’ computers, and the researchers are then able to collect personal information during those contacts. Researchers can retrieve the collected data from the user computer’s hard drive after a certain period of time. In fact, some client-side data collection software can constantly send accessed URLs from users’ computers to a remote computer which is readily accessible to researchers. This simultaneous collection process not only eliminates time delays in data collection, but it can also reduce the physical presence of researchers required for periodic data collection.

Regardless of the specifics of each method, client-side data collection provides detailed internet use information. Indeed, the collected data cover the whole internet use from the specific computer. However, quite contrary to the server-side data collection method, the cost of user recruitment for the client-side collection method can be high. This method not only needs software designed to collect data from research participants’ computers, but it also requires a software installation on each participant’s computer. The costs of software and of the human resources required for the installation process can be high enough to discourage many researchers from employing the client-side data collection method. Furthermore, there are serious concerns about users’ privacy. A participant’s personal information is easily exposed to the researcher at the internet user recruitment stage. Once a researcher has gathered personal information, s/he can link it with the user’s internet use data. Every key stroke is exposed to researchers, including sensitive private information such as bank account numbers, credit card numbers, personal health information, social security numbers, and so on. The burden on a researcher to securely lock this type of data in the database might be overwhelming, and it can hamper the human subject approval process.

Table I. Comparisons between client-side and server-side data

                      Client side                    Server side
Cost                  High for user recruitment      Economical
Privacy               Invasive; higher concerns      Less invasive; less ethical concerns
Demographic data      High validity                  Low validity
Staff involvement     High                           Low
Data coverage         Whole web use                  Only from a specific web server
Burden                Secure data storage;
                      human subject approval
Caching               No caching problem             Suffers caching problem

Multiple computer access. According to Burton and Walther (2001), the client-side data collection method can overcome file-caching problems because it records users’ activities from client computers. More importantly, beyond this advantage, the client-side collection method can produce a more inclusive data collection by recording all activities on the client computers, compared with the server-side data collection method, which can only record accesses to the content located on a specific web server.

For these reasons, client-side data cover a wider range of user behavior. However, an important disadvantage of the client-side data collection method is that the data collection program must be installed on every computer accessed by users to accomplish exhaustive navigation recording. If a study participant has a home computer and a work computer, the data collection program must be on both computers to truly capture their use patterns. Indeed, a complete and perfect data collection requires all computers accessed by participants to have the client-side log collection program. Obviously, the problem is that it is not only difficult to gain access to all users’ computers, but, also, the installation process can consume a significant amount of research resources. As we discussed, server-side data collection does not have this problem because the data only contain accesses to the specific web server. It can record all accesses to the web server, no matter which computer a participant accesses it from. All accesses from participants to the specific web server leave traces in the web server log file.

What to measure: time vs page request (page access)

Client-side recording systems can record virtually all mouse and keyboard moves (MacKenzie et al., 2001). Some of them have capabilities for recording point-to-point mouse moves, and some computer programs can record events[5]. The event handler’s report on significant browsing behaviors, such as focus, mouse down, resize, submit, and hyperlink events, will be a valuable instrument for media use time measurement (Etgen and Cantor, 1998).

Event recording is especially advantageous when a researcher wants to measure session time, which is equivalent to “episode of use” time. The client-side recording system can record the moment when the user’s web browser goes out of focus[6], which indicates that the user stopped accessing the web site. Also, it can record the moment of an in-focus event or program-open event. Therefore, the client-side data can tell the duration of an episode of use by subtracting the in-focus event time from the out-of-focus event time. In the same manner, it can even measure the duration of “within episode of use”.
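
To illustrate, the sketch below sums viewing time from a stream of in-focus and out-of-focus events, pairing each in-focus time with the next out-of-focus time. The event tuple format is our assumption, not the format of any particular client-side recorder.

    from datetime import datetime

    def focused_seconds(events):
        """Sum the time the browser was in focus.

        events: chronological list of (timestamp, kind) tuples,
        where kind is "focus_in" or "focus_out".
        """
        total = 0.0
        focus_start = None
        for ts, kind in events:
            if kind == "focus_in":
                focus_start = ts
            elif kind == "focus_out" and focus_start is not None:
                total += (ts - focus_start).total_seconds()
                focus_start = None  # wait for the next in-focus event
        return total

    events = [(datetime(2001, 6, 1, 9, 0, 0), "focus_in"),
              (datetime(2001, 6, 1, 9, 3, 30), "focus_out"),
              (datetime(2001, 6, 1, 9, 10, 0), "focus_in"),
              (datetime(2001, 6, 1, 9, 11, 0), "focus_out")]
    print(focused_seconds(events))  # -> 270.0 (3.5 min + 1 min)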

Such a function does not exist in the server-side data collection method. The method’s non-intrusive measurement characteristic cannot tell when a user minimizes or leaves the web site. The only indicator is the web server’s session connection time function, which is set to expire after a certain number of minutes of inactivity (typically 20 to 30 minutes). After the connection time expires, the server considers the user disconnected. This means that server-side data will almost always indicate that the last page access has the maximum connection time, because a web server does not know when a user has left the web site.

The time measurement from client-side data collection can be reasonably accurate because the out-of-focus function can detect the moment when the user’s attention leaves the web site. Since server-side data do not have this luxury, it is difficult to argue that server-side collection can present the duration of use. Although there is software which attempts to measure time from the server-side web log data (e.g. Webtrends), the internal validity of the server-side data’s time measurement needs to be carefully examined (Burton and Walther, 2001).

On the other hand, there are no such validity problems when a researcher employs page requests, or page accesses, as the unit of analysis. Both collection methods can measure the pages requested by users, although the server-side collection method is recommended over the client-side collection method due to the barriers of installing data collection software on subjects’ computers.

Research questions

The previous sections revealed the differences between server-side and client-side log files and also questioned the validity of the two data collection methods. Three research questions and corresponding hypotheses were developed:

RQ1. Will the time unit and the page request unit measures of the two methods becompatible?

H1. An individual user’s number of page requests and amount of time used are highly correlated between server-side and client-side logs.

Previous discussions indicate that the client-side web log is better at measuring time and the server-side web log is better at reporting page requests. However, the compatibility of the two different measurement units is not known. If the comparison of the two measurement methods shows a great deal of resemblance, it will provide high confidence in these variables. Furthermore, if the two data sets are compatible, we can interchangeably use page requests and duration of use in internet use research.

RQ2. Will the server-side data collection method accurately represent web use time?

H2. The web use duration of time calculated from the server-side data and the web use duration of time from the client-side data are highly correlated.

The server-side web log is more available to researchers due to its convenient collection process. If the time calculation from the server side can be as valid as the client-side log, researchers can substitute the server-side web log for the web use time variable without sacrificing the validity of the measurement.

RQ3. Is the server-side collection method better at recording multiple computer usethan the client-side collection method?

H3. A significant number of accesses are from computers that do not have client-side data recording software.


One of the shortcomings of the client-side collection method is that users may use multiple computers, and not all of them necessarily have the tracking program installed. To make this matter more complicated, the accelerated diffusion of the internet allows users to access a web site not only from a designated computer but from many places, such as a public library, an internet cafe, a public university computer, a portable device, and so on. We suspect that a substantial number of our web site users utilize multiple computers to access the internet. The comparison between the two data sets can tell whether users are using multiple computers to access the site.

Research method

Research participants

Participants for this research were recruited through an on-going internet health information research project, the Comprehensive Health Enhancement Support System (CHESS, http://chess.chsra.wisc.edu). A total of 145 participants voluntarily joined the research, all of them breast cancer patients. They were in various stages of breast cancer, and their ages ranged between 40 and 70.

When they joined the research, they were provided with a notebook computer with a modem. Also, they were informed that their computer would record their browsing activities. They were asked to access the CHESS web site only with the specially modified browser provided on their notebook computer. When they opened the browser, they were asked to enter their codename and password before they could navigate the web, and a program residing in the browser sent the navigated URLs to the researchers’ database through the internet. Both server-side and client-side data were collected over a one-year period, from 6/1/2001 to 5/31/2002. Because the server-side data contain all web site use data from multiple computers while the client-side data only record data from the provided computer, we were able to tell the users’ multiple computer use by comparing the two data sets.

It should be noted that the purpose of the research is to compare the two types of web log data, rather than to investigate the users’ navigation behavior and generalize the result. In other words, the internal validity of the web log data is the major topic of the research. Caching, session recognition, and time calculation are major threats to the internal validity of the research.

Methodological challenges

Caching. One of the most frequently criticized problems of server-side web log analysis is the caching problem. It occurs when a local computer uses its own memory, intercepts the user’s page request to the web server, and delivers the requested page from its own memory. Caching can prevent web page requests from leaving log records in the web server log file.

Local gateway servers such as proxy servers can store the requested files in their own memory, intercept the next request for the same files, and provide them to the individual computers (Batista and Silva, 2001; Goldberg, 2001; Reips, 2000; Yu et al., 2004). This can be a fatal flaw in a server-side web log file (Goldberg, 2001). However, some file types, which require server-side script execution, are not generally cached in browsers or proxy servers. File names with extensions such as “asp”, “cgi”, “jsp”, “php”, and many more, are not designed to be cached in local memory due to their required script executions on web servers. Therefore, those file types are supposed to be recognized by browsers and proxy servers and should not be distributed from local memory unless browsers or proxy servers are misconfigured. However, misconfiguration rarely happens because of frequent browser updates and the constant attention required for network server maintenance. Approximately 79 percent of this study’s web pages are in “asp” file format and, therefore, our data are relatively free from the caching problem.
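
A sketch of how a researcher might audit a server log for cache-vulnerable requests, assuming the script-generated extensions named above are safe from caching and that other (static) files may be served from local caches. The extension list and log contents are illustrative.

    # Server-side script formats, assumed not to be cached by browsers/proxies.
    SCRIPT_EXTS = {".asp", ".cgi", ".jsp", ".php"}

    def cache_vulnerable_share(paths):
        """Return the fraction of requests that local caching could hide."""
        static = [p for p in paths
                  if "." + p.rsplit(".", 1)[-1].lower() not in SCRIPT_EXTS]
        return len(static) / len(paths) if paths else 0.0

    log = ["/home.asp", "/about.html", "/list.asp", "/faq.html", "/join.asp"]
    print(cache_vulnerable_share(log))  # -> 0.4 (2 of 5 are static html)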

Individual user recognition and sessions. The collected IP addresses are one of the most valuable pieces of information enabling individual user recognition. However, they have become more difficult to use for identifying individual users due to the diffusion of the Dynamic Host Configuration Protocol (DHCP) and Network Address Translation (NAT). When DHCP or NAT is enabled, IP addresses are dynamically assigned to computers, and a computer is not designated with a fixed IP. Thus, an IP address alone cannot identify a unique user (Burton and Walther, 2001; Goldberg, 2001; Jansen and Spink, 2005; Jones et al., 1998; Wobus, 1998).

Unlike generic web log data, this study’s web log data can identify users owing to its secure login system. Users are forced to enter their assigned codename and password when they enter the CHESS web site. The secure login system was devised due to the private nature of the web site’s medical information. This secure login system not only resolves many problems of identifying sessions and calculating time from the data, but it also identifies a user when that user accesses the site from multiple computers. As a matter of fact, it is one of the most important implementations of this research project.

Time calculation algorithm. The problem of calculating time from the client-side data is relatively minor because it typically has an event recording function. However, it is still challenging to calculate time from the server-side web log. Fortunately, this study’s web log data identify a unique user from codenames. Several data mining and data cleaning processes were conducted before we were able to produce a reasonable time measurement.

For instance, this study’s client-side data had an unusually high amount of time spent on some web pages (over 60 minutes for a single web page access). However, it is difficult to imagine any user reading a web page for such a long period of time. Indeed, it was difficult to justify what users were doing when the web log data showed an unexplainably high number of minutes. It was highly unlikely that users were reading a single page of content about chemotherapy for more than 20 minutes.

The only way to resolve this problem was to review the particular content and services provided by the CHESS web site and determine the maximum session time for each content and service. After examining all CHESS service contents, we could be quite certain that users cannot spend more than 20 minutes reading a web page, except for the Bulletin Board Systems (BBS), which typically take more time to read and write messages[7]. Thus, the researchers decided to cut unusually high amounts of time spent on a web page access to 20 minutes. As for the BBS service, users’ BBS contents were typically short and, therefore, could not have taken more than 30 minutes to read and/or write. The researchers decided to reduce unusually high BBS use times to 30 minutes.

Server-side data collection required more data cleanup and filtering. The time field was calculated by subtracting the current page access time from the next page access time; unusual time lengths were reduced following the client-side data clean-up algorithm, and the last web page in the session was set to 0 minutes. The last page access in the session was ignored because the web server cannot capture the user’s exit point (Jansen et al., 2005).
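
Putting the steps together, the sketch below estimates per-page times within one session by differencing consecutive timestamps, caps implausible values (20 minutes for ordinary pages, 30 minutes for BBS pages, as decided above), and sets the last page to zero. The (timestamp, is_bbs) representation is our assumption, not the study’s actual data format.

    CAP_PAGE = 20 * 60   # seconds; maximum plausible ordinary page time
    CAP_BBS = 30 * 60    # seconds; maximum plausible BBS page time

    def session_page_times(session):
        """Estimate seconds spent on each page of one session.

        session: chronological list of (timestamp, is_bbs) tuples,
        where timestamp is a float (e.g. UNIX seconds).
        """
        times = []
        for (ts, is_bbs), (next_ts, _) in zip(session, session[1:]):
            raw = next_ts - ts                      # time until next request
            cap = CAP_BBS if is_bbs else CAP_PAGE
            times.append(min(raw, cap))             # reduce implausible values
        times.append(0.0)  # last page: exit point unknown, counted as zero
        return times

    session = [(0.0, False), (90.0, False), (5000.0, True), (5200.0, False)]
    print(session_page_times(session))  # -> [90.0, 1200, 200.0, 0.0]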


Results

The simple correlation analysis indicated that individual users’ page requests, or page accesses, and time measurements as obtained from server-side and client-side log files were highly correlated. However, the result fell short of our expectations. Considering the accuracy of the measurement methods we implemented, the correlation should have been higher, although we did not expect an absolutely perfect match between the two measurements (failing to support H1).

Specifically, the page access variable was created based on each user’s number of page accesses per day, and the access time variable was created by summing the amount of time spent on the web site on a specific day. The result indicated that the correlation between client-side page access and time was high, but it fell short of expectations (r = 0.86*, Table II). After careful examination of the data, we were able to attribute the residuals of the correlation to the uneven distribution of access time for the different services provided by the web site. Users spent longer times on some services than on others[8]. In other words, when users spent their time on services that required more time to explore, daily web page access times showed a high number of minutes even though they only accessed a small number of pages.
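
For reference, a correlation of this kind can be reproduced by aggregating both logs to user-days and computing a Pearson coefficient, as sketched below with illustrative numbers rather than the study’s data.

    from math import sqrt

    def pearson(xs, ys):
        """Plain Pearson correlation coefficient."""
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sqrt(sum((x - mx) ** 2 for x in xs))
        sy = sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy)

    # Illustrative user-day aggregates: page accesses vs minutes on site.
    page_accesses = [3, 10, 7, 25, 14]
    minutes = [2.5, 12.0, 9.5, 30.0, 11.0]
    print(round(pearson(page_accesses, minutes), 2))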

The correlation between server-side page access and time was even lower than that of the client side (r = 0.82*, Table II). It is suspected that the caching problem created more discrepancy on top of the uneven distribution of access time for different services. Further examination of the caching problem confirmed our speculation.

For instance, the client-side log reported an average of 28.5 seconds (s.d. = 17.87) per web page access while the server-side log showed 61 seconds (s.d. = 46.02) per page access. The difference between the two measurements was unexpected, and the investigation of this problem led the researchers to suspect “html” file caching problems, although only 21 percent of the files on the web server were in “html” format. It was likely that these files were retrieved from the local cache memory when they were repeatedly accessed by users, which could have made an impact on the result of the analysis. A comparison of the total “html” file accesses recorded by the client side and the server side confirmed this speculation. There were 49 percent[9] fewer page access files recorded on the server side. It was clear that the “html” file caching problem caused the discrepancy between server-side page access and server-side time.

This client-side and server-side file access discrepancy is related to the third research question. Through a number of session calculations, we found that 42 percent (61/145) of users had accessed the web site from multiple computers. Furthermore, although most of the subjects mainly used the provided computer, 10 percent (14/145) of users spent more than half of their access time in front of computers which did not have the tracking program. This result confirmed H3. A substantial number of users did use multiple computers to access the web site, and this can potentially be a critical problem for a client-side recording system.

Table II. Correlations between measurements

                                                          Correlation
Page access vs time (client-side)                         0.86*
Page access vs time (server-side)                         0.82*
Server-side session time vs client-side session time      0.94*

Notes: Server-side data were screened, and some server-side user records were excluded from the session time calculation when the server-side session record did not have a matching open session in the client-side record on a specific day; *p < 0.01

On the other hand, the server-side time is calculated by subtracting the current page time from the next page time. The caching problem did not impact this calculation method, because any missing page access time was absorbed into another page access time, and the whole session calculation time was not influenced by those cached page requests.

This analysis relates to the second research question. The server-side measured session time and the client-side measured session time resulted in a very high correlation (r = 0.94*, Table II). The high correlation exceeded our expectations, considering all the required assumptions and difficulties of server-side time measurement. Nonetheless, the result indicated that server-side time measurement can be implemented with careful preparation and calculation. In addition, since the server-side session time calculation is compatible with the highly valid client-side session time measurement, we can be more certain that the overestimation of the time spent per page access was due to the “html” file caching problem, and this accounts for the lower correlation between server-side page access and server-side time in Table II.

Discussion

The unexpected underperformance of the server-side page access calculation was disappointing. The one-fifth of web site files in “html” format substantially reduced the number of recorded server-side page requests. It is especially painful to accept the magnitude of the caching problem considering the relatively small number of “html” files on the web site. One can only imagine the validity of the web log file when the entire web site’s content files consist of cache-able “html” format. In essence, one thing that this analysis shows is that critics of the usefulness of the web log file are correct in that the server-side log cannot accurately represent web site page access.

The result of low server-side data performance provides a recommendation for future studies. The scientific analysis of server-side log data requires the web site to be produced with a server-side script language. If the entire web site is not cache-able, server-side data may accurately measure the number of files accessed by users.

On the other hand, the server-side session time calculation was remarkably reliable and valid, based on its high correlation with the client-side time calculation. The analysis revealed that server-side log file time measurement is more promising than the researchers previously speculated. However, it should be noted that the accurate time measure is limited to session time units. Individual page access time measures cannot be accurate if the caching problem persists.

While providing some good initial indications, this study needs to be replicated before the results can be generalized to generic web log data analysis. One weakness is the fact that we used a secure web site, which enabled us to identify individuals, and our caching problem was at a tolerable level. Typically, generic web server logs do not have those features. Despite this shortcoming, our web design implementations and web log data analysis scheme can serve as a basis for future web log analysis research. The ability to identify each individual user and the low incidence of caching problems were strong advantages of this approach.

Unfortunately, the client-side log file faced a serious challenge due to users’ multi-computer utilization, which made the method very incomplete. In particular, it missed a chunk of multiple computer use data that was covered by the server-side data collection method. Considering the increasing ownership of multiple computers, the diffusion of computers at work places, and the development of wireless mobile technologies, this problem will only grow in the future. Therefore, it would be wise to contemplate and resolve the methodological challenges of the client-side data collection method before researchers invest resources in collecting data from the client side. In the end, with the proper implementation of the web design and web log analysis algorithm, the server-side collection method can provide reliable and valid results, exceeding our previous expectations.

Analyzing web log data reminded us that the web was not developed for researchers, although it is a great place to research user activity. Research on the use of a medium must follow the medium’s development, as other media research has done. This is one of the limitations of research on this topic.

Notes

1. The sheer amount of data represents the bits and bytes of information exchanged among computers. It can overestimate the content when the data contain multimedia information, which requires larger storage space compared with text-based information. Therefore, it is a less valuable unit of analysis for social science scholars.

2. The locations of computers can be recognized by IP addresses. They are designated to eachcomputer when users’ computers are connected to the internet.

3. The authors of this paper decided to exclude intranets, because an intranet typically maintains a closed system where the general public cannot access information, although it provides a network environment.

4. Proxy servers sometimes save internet content files on their hard drives and transmit the files to individual computers. This process is designed to reduce redundant web traffic.

5. Events represent activities of running software such as save, delete, paste, font size change, and so on.

6. This means that the browser is not on top of the open windows.

7. We disabled the session time-out function on our web server because many users complained that they had lost what they were writing on the discussion board due to session time-outs. However, conventional wisdom is to limit session time to 20 to 30 minutes to save the web server’s computational resources.

8. Services such as the Video Gallery and Journaling were frequently accessed by users, and users stayed more than two minutes on average. On the other hand, the Diagnosis service, Message service, Basic Information service, and many other web pages were accessed for less than 30 seconds on average.

9. Server-side data were screened, and some server-side user records were excluded from this calculation when the server-side session record did not have a matching “open session” in the client-side record on a specific day.

References

Babbie, E. (2004), The Practice of Social Research, Wadsworth Publishing, Belmont, CA.

Barber, A.S. and Riccalton, C. (2000), “The use of the LS/2000 online public access catalogue at Newcastle University Library” (Grant No. SI/G/816), British Library Research and Development Department Report.


Batista, P. and Silva, M.J. (2001), “Web access mining from an on-line newspaper logs”, paper presented at the 12th International Meeting of the Euro Working Group on Decision Support Systems (EWG-DSS 2001), Cascais, Portugal.

Bertot, J.C. and McClure, C.R. (1997), “Web usage statistics: measurement issues and analytical techniques”, Government Information Quarterly, Vol. 14 No. 4, pp. 373-96.

Booske, B.C. and Sainfort, F. (1998), “Relation between quantitative and qualitative measures of information use”, International Journal of Human-Computer Interaction, Vol. 10 No. 1, pp. 1-21.

Burton, M.C. and Walther, J.B. (2001), “The value of web log data in use-based design and testing”, Journal of Computer-Mediated Communication, Vol. 6 No. 3.

Catledge, L.D. and Pitkow, J.E. (1995), “Characterizing browsing strategies in the World Wide Web”, paper presented at the 3rd International World Wide Web Conference, Darmstadt, Germany.

Danaher, P.J. (1995), “What happens to television ratings during commercial breaks?”, Journal of Advertising Research, Vol. 35 No. 1, pp. 37-47.

Davis, M.P. (2004), “Information-seeking behavior of chemists: a transaction log analysis of referral URLs”, Journal of the American Society for Information Science and Technology, Vol. 55 No. 3, pp. 326-32.

Etgen, M. and Cantor, J. (1998), What Does Getting WET (Web Event-logging Tool) Mean for Web Usability?, available at: www.itl.nist.gov/iaui/vvrg/hfweb/proceedings/etgen-cantor/index.html (accessed 22 October 2002).

Eveland, W.P. Jr and Dunwoody, S. (1998a), “Surfing the web for science: early data on the users and uses of The Why Files”, NISE Brief, Vol. 2 No. 2, pp. 1-10.

Eveland, W.P. Jr and Dunwoody, S. (1998b), “Users and navigation patterns of a science World Wide Web site for the public”, Public Understanding of Science, Vol. 7, pp. 285-311.

Goldberg, J. (2001), “Why web usage statistics are (worse than) meaningless”, available at: www.goldmark.org/netrants/webstats/ (accessed 15 October).

Handschuh, S. and Staab, S. (2002), “Authoring and annotation of web pages in CREAM”, Proceedings of the 11th International Conference on World Wide Web, Honolulu, HI.

Hawkins, R.P. and Pingree, S. (1997), “Measuring time frames of communication behaviors in computer use”, paper presented at the International Communication Association, Montreal, Canada.

Jansen, B. and Resnick, M. (2005), “Examining searcher perception of and interactions with sponsored results”, paper presented at the ACM Conference on Electronic Commerce, Vancouver, Canada.

Jansen, B. and Spink, A. (2005), “An analysis of web searching by European AlltheWeb.com users”, Information Processing and Management, Vol. 41, pp. 361-81.

Jansen, B., Jansen, K. and Spink, A. (2005), “Using the web to look for work: implications for online job seeking and recruiting”, Internet Research, Vol. 15 No. 1, pp. 49-66.

Jansen, B.J., Spink, A. and Pedersen, J. (2005), “A temporal comparison of AltaVista Web searching”, Journal of the American Society for Information Science and Technology, Vol. 56 No. 6, pp. 559-70.

Jones, S., Cunningham, S.J. and McNab, R. (1998), “Usage analysis of a digital library”, paper presented at Digital Libraries ’98, the 3rd ACM Conference on Digital Libraries, Pittsburgh, PA.

Jones, S., Cunningham, S.J., McNab, R. and Boddie, S. (2000), “A transaction log analysis of a digital library”, International Journal of Digital Libraries, Vol. 3, pp. 152-69.


Kanerva, A., Keeker, K., Risden, K., Schuh, E. and Czerwinski, M. (2004), Web Usability Research at Microsoft Corporation, Microsoft Corporation, available at: http://research.microsoft.com/users/marycz/webchapter.html (accessed 12 January 2004).

Kurth, M. (1993), “The limits and limitations of transaction log analysis”, Library Hi Tech, Vol. 42, pp. 98-104.

Larson, R.R. (1991), “Between Scylla and Charybdis: subject searching in the online catalog”, Advances in Librarianship, Vol. 15, pp. 175-236.

MacKenzie, I.S., Kauppinen, T. and Silfverberg, M. (2001), “Accuracy measures for evaluating computer pointing devices”, Proceedings of CHI 2001, Seattle, WA, March 31-April 5, pp. 9-16.

McTavish, F., Pingree, S., Hawkins, R. and Gustafson, D. (2003), “Cultural differences in use of an electronic discussion group”, Journal of Health Psychology, Vol. 8 No. 1, pp. 105-17.

Nielsen, B. (1986), “What they say they do and what they do: assessing online catalog use instruction through transaction monitoring”, Information Technology and Libraries, Vol. 5, pp. 28-34.

Peters, T. (1993), “The history and development of transaction log analysis”, Library Hi Tech, Vol. 42, pp. 41-66.

Peters, T.A., Kurth, M., Flaherty, P., Sandore, B. and Kaske, N.K. (1993), “An introduction to the special section on transaction log analysis”, Library Hi Tech, Vol. 42, pp. 38-9.

Phippen, A. (2004), “An evaluation methodology for virtual communities using web analytics”, Proceedings of the International Networks Conference, Plymouth.

Phippen, A., Sheppard, L. and Furnell, S. (2004), “A practical evaluation of web analysis”, Internet Research, Vol. 14 No. 4, pp. 284-93.

Reips, U.-D. (2000), “The web experiment method: advantages, disadvantages, and solutions”, in Birnbaum, M.H. (Ed.), Psychological Experiments on the Internet, Academic Press, San Diego, CA, pp. 89-117.

Sandore, B. (1993), “Applying the results of transaction log analysis”, Library Hi Tech, Vol. 42, pp. 87-97.

Smaglik, P., Hawkins, R.P., Pingree, S., Gustafson, D.H., Boberg, E. and Bricker, E. (1998), “The quality of interactive computer use among HIV-infected individuals”, Journal of Health Communication, Vol. 3, pp. 53-68.

Taha, A. (2004), “Wired research: transaction log analysis of e-journal databases to assess the research activities and trends in UAE university”, paper presented at the Nordic Conference on Information and Documentation, Aalborg, Denmark.

Wobus, J. (2004), DHCP FAQ, available at: www.dhcp-handbook.com/dhcp_faq.html (accessed 15 February 2004).

Yu, H.-F., Chen, Y.-M. and Tseng, L.-M. (2004), “Archive knowledge discovery by proxy cache”, Internet Research, Vol. 14 No. 1, pp. 34-47.

Corresponding author

Gi Woong Yun can be contacted at: [email protected]

