Data Preparation For Mining Web Browsing Patterns


Abstract— Data preparation for mining web browsing patterns confronts researchers and academicians with a few key questions: how to measure data quality (that is, how to qualify the data), how to preprocess the data, how to cluster the data based on its homogeneity or heterogeneity, and, after clustering, which algorithm to use for the analysis. In this report we discuss various mining algorithms, compare them, and rely on graphical interpretation to present our findings.

I. INTRODUCTION

As the World Wide Web is a great resource pool containing a large amount of information, links and user details, it offers great potential for mining web browsing patterns using different data preparation techniques. In this paper we look in detail at the various data preparation techniques for mining web browsing patterns.

II. WEB MINING

Web mining here refers primarily to web usage mining, which aims to learn the behavioral patterns of users and involves data collection, preprocessing, pattern discovery and pattern analysis. Data collection draws on user log details from the client side, the server side and proxy servers. Preprocessing covers user, transaction and session identification, data cleaning, checks for readiness and completeness, and path completion. Pattern discovery involves statistical analysis, clustering, matching and other statistically interpretable measurement techniques, while pattern analysis involves knowledge query mechanisms, mostly based on SQL or OLAP. (Lalithadevi, Ida, & Breen, 2013)

A. Data Collection

Data collection gathers information from the server side, the client side and proxy servers. The sources include web logs stored on the server in the most common log format¹ or in its extended variants, together with referral links, referral URLs, web cache and cookies, and explicit user input. They also include Java applets, remote agents, favorites and copies of web usage status, HTTP requests and server traffic load. (Lorenzo & Domenechl, 2007)
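As a rough illustration of how server log entries might be turned into structured records, the sketch below parses one line with a regular expression. It assumes the widely used NCSA Common Log Format; the field names (`ip`, `user`, `url`, and so on) and the sample line are illustrative choices, not taken from the cited sources.

```python
import re
from datetime import datetime

# Assumed layout (NCSA Common Log Format), for example:
# 192.168.1.5 - alice [11/Oct/2014:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 5120
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<version>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes_sent>\d+|-)'
)


def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Convert the textual fields that later steps need as typed values.
    record["timestamp"] = datetime.strptime(
        record["timestamp"], "%d/%b/%Y:%H:%M:%S %z"
    )
    record["status"] = int(record["status"])
    record["bytes_sent"] = (
        0 if record["bytes_sent"] == "-" else int(record["bytes_sent"])
    )
    return record
```

Lines that do not follow the assumed layout return `None`, so a caller can count or log them instead of silently mis-parsing them.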

B. Data Preprocessing

Data preprocessing deals with the heterogeneity of the unstructured data; its purpose is not to govern pattern identification but to support it. (Thiyagarajan & Venkatchalapathy, 2013) Preprocessing begins with data cleaning, for instance checking the status codes of server log entries and filtering on them, on the connection protocol used and on page content. (Jain & Purohit, 2011) It also covers user identification, which draws on the user's entries in the extended log format, including agent information, caching, IP address and the browser's handling of the HTML pages. Another

1 ipaddress-username-password-date-timestamp-url-version-status-code-bytessent

Data Preparation For Mining Web Browsing Patterns Faisal Farrukh


important aspect of data preprocessing is session identification, which covers the click stream, session duration, multiplicity of sessions, sessionization (based on time or on navigation) and session construction (again based on time or on navigation). Path completion relies on three approaches: reference length, which uses the time a user spends on a page to distinguish it from the auxiliary pages visited on the way; maximal forward reference, which follows a user's forward moves until a backward reference to an earlier page or index occurs; and time window, which bounds the duration of visits for pages, calls and transactions. (Vajk & Ivancsy, 2006)
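The data cleaning step described earlier in this subsection can be sketched as a simple filter over parsed log records. This is a minimal illustration under stated assumptions: each record is a dict with hypothetical keys `url`, `method` and `status`, and the suffix list marking embedded resources is an assumption rather than a rule from the cited papers.

```python
from urllib.parse import urlparse

# Assumption: requests for these suffixes are embedded resources, not page views.
RESOURCE_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")


def is_page_view(record):
    """Keep successful GET requests for pages; drop resources and errors."""
    path = urlparse(record["url"]).path.lower()
    if path.endswith(RESOURCE_SUFFIXES):
        return False
    if record["method"] != "GET":
        return False
    # Only 2xx status codes are treated as successful page deliveries.
    return 200 <= record["status"] < 300


def clean_log(records):
    """Return only the records that look like genuine page views."""
    return [r for r in records if is_page_view(r)]
```

Filtering on the URL path rather than the full URL keeps query strings such as `?v=2` from hiding a resource suffix.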

C. Pattern Discovery and Analysis

Pattern discovery and analysis is the next step after data preprocessing, and here the emphasis is on defining the approaches used for discovery and analysis. The analysis mostly relies on approaches such as clustering, classification, association rules and sequential patterns. This is different from merely mapping the data. Algorithms, classifiers, decision trees and support vector machines all come into play here. In a little more detail: classification is the supervised assignment of data to classes; association rules apply to the item sets of transactions, and the Apriori algorithm looks for items that users' transactions frequently access together; clustering focuses on the behavior exhibited by groups; and sequential pattern mining deals with inter-session patterns ordered in time, so it does not fit navigational data that are mostly observational at the beginning of the process. In this way, several discovery and analysis techniques can be applied to mine browsing patterns. (Levene & Broges) (Tomai & Mican)
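To make the association rule idea concrete, the sketch below computes the two standard measures, support and confidence, over a handful of sessions. The session contents and function names are invented for illustration only.

```python
def support(sessions, pages):
    """Fraction of sessions that contain every page in `pages`."""
    pages = set(pages)
    hits = sum(1 for s in sessions if pages <= set(s))
    return hits / len(sessions)


def confidence(sessions, antecedent, consequent):
    """Estimated share of sessions with `antecedent` that also hold `consequent`."""
    base = support(sessions, antecedent)
    if base == 0:
        return 0.0
    return support(sessions, set(antecedent) | set(consequent)) / base


# Hypothetical sessions, each a list of visited pages.
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products"],
    ["/home", "/about"],
    ["/products", "/cart"],
]
```

With this sample, `/home` and `/products` occur together in two of four sessions, and two of the three sessions containing `/products` also contain `/cart`.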

III. LITERATURE REVIEW

Contemporary research presents us with different situations and practical insights. One study recommends static caching of search engine results², from which web logs can be generated without much effort and on which users can base their analysis. Others advocate an indiscernibility approach, arguing that rough set theory can help extract information from extended web logs covering keywords, visit origins and website design with SEO. (Mele, 2013)

The User and Session Identification algorithm is also strongly considered for data preparation and pattern discovery analysis; it identifies users and sessions from IP details, user details, and session and log requests. The problem with the algorithm is that it reports a given IP address only once and derives the sessions from user activity and time spent. (Jose & Lal, 2013)
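A minimal sketch of user and session identification along these lines is given below. It assumes that a user is approximated by the pair of IP address and agent string, and that a gap of more than 30 minutes starts a new session; the record keys (`ip`, `agent`, `time`, `url`) and the timeout value are illustrative assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed inactivity threshold


def identify_users(records):
    """Group records by (ip, agent); each distinct pair is treated as one user."""
    by_user = defaultdict(list)
    for rec in records:
        by_user[(rec["ip"], rec["agent"])].append(rec)
    return by_user


def sessionize(user_records, timeout=SESSION_TIMEOUT):
    """Split one user's records into sessions wherever the gap exceeds `timeout`."""
    sessions = []
    last_time = None
    for rec in sorted(user_records, key=lambda r: r["time"]):
        if last_time is None or rec["time"] - last_time > timeout:
            sessions.append([])  # start a new session
        sessions[-1].append(rec["url"])
        last_time = rec["time"]
    return sessions
```

Two visitors behind the same proxy with identical browsers would still be merged by this heuristic, which is the kind of weakness the identification debate in the literature is about.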

In contrast to that algorithm, grouping customer transactions by clustering is favored because clustering is intended to establish homogeneity within the data, so that users become identifiable by their unique behavior. While this approach is more personalized, with the risk of excessive personalization, other studies favor clustering purely as a grouping technique. (Huiying & Wei.An, 2004)

Further research on clustering calls for different kinds of clustering on a matrix

2 An approach for improving search engine performance through static caching of search results, and for helping users find interesting web pages by recommending news articles and blog posts. A query covering approach was used to search the web pages from the cache and web logs, and searching time, recall and precision were calculated on that basis.


scale, for instance clustering of web pages and clustering of users, their behavior and their browsing patterns. Such findings later pushed the agenda towards a divisive hierarchical clustering algorithm whose output can then be governed by mining principles. (Yang & Padmanabhan, 2005)

Another approach to the same problem calls for rough set theory, where unwanted or abundant non-useful data can be handled with the quick reduct algorithm and variable precision. This helps to identify the important qualified data in the logs and process them for finer feature selection. When tested, the results indicate that rough set theory can be applied only within the same data set and the same cluster; it does not help in cross-cluster mining for pattern derivation. (Jiao, Zhang, & Dong, 2006)

Researchers have also focused on the algorithms used by NASA in their research, where logs are filtered to help the processing; this work does not follow a specific algorithm set but customizes the preprocessing. (H., K., & A, 2007)

Our review of the contemporary literature on web data mining for user browsing patterns also includes the findings on the sequential association rule algorithm, where different sequences with terminal constraints applied lead to better predictability. (Yang & Li, 2005)

IV. DISCUSSION AND ANALYSIS

From the previous details and our review findings, we can see that clustering and analysis algorithms receive much attention with regard to discovery and analysis. It is now important to examine and discuss both aspects in depth. First, we need to look at the proposition for the entire process, shown below:

(Cooley, Srivastava, & Mobasher, 1999)

The roles of the web content elements are as follows:

Page Element    Physical Characteristics              Behavioral Characteristics
Header          Compliance with search engines        Page impression
Navigation      Use of menu, links                    Average engagement
Search          On-site inbound and outbound links    Low engagement, specific forward referencing
Customization   User friendliness                     Low
Content         Text to graphic ratio, less content   Better reference

This can be inferred from the user transactions that the content generates on a site, as shown below:


(Cooley, Srivastava, & Mobasher, 1999)

This takes us back to the larger diagram we used at the beginning of this section, and we get the following:

(Cooley, Srivastava, & Mobasher, 1999)

This is what we get for the mining of browsing patterns, but not without glitches. The issue of user and session identification remains as debated as it was in the literature we reviewed. We therefore take a closer look at the internal aspects, the algorithms, in detail.

A general-purpose search algorithm proceeds as shown below:

(Langhnoja, Barot, & Mehta, 2013)

An interpretation is essential here. Users mostly enter their data as keywords. If there is no keyword, the search halts automatically; with a qualified keyword in the input box, the search continues and establishes a connection with the indexed database of domains. The database connection can fail to return any result if the index does not contain any information relevant to the keyword. The next step, before retrieving the results for the keyword, is to look at the record set; that is, the database itself serves as a reference record set, and at this point the process enters the scope of our paper. Here the algorithm fetches the relevant records, incorporates them into the dataset item and moves those records to the search server repository. The retrieved data are then checked for compatibility with the web client or browser; if they are compatible, the algorithm prints the result and users are able to view it. The process thus ends successfully for the session. This is a very simple mechanism, which poses certain limitations for clustering in our context, because for users the algorithm is highly likely to generate results that are mostly visited or indexed for relevancy


because web logs contain the information of how many times a user has visited the page.

At this level it is important to understand how the user, session and web log identification algorithms work, in contrast to the general search algorithm. The session identification algorithm contains:

“Algorithm Name: Session Identification from Web Log File
Input: Web Server Log file
Output: Number of Session
Steps:
SessionSet = {}
UserSet = {}
K = 0
While not EOF(LogFile) DO
    LogRecord = Read(LogFile)
    If (LogRecord.TimeTaken > 30) OR (LogRecord.UserId not in UserSet) then
        k = k+1
        SK = LogRecord.URL
        SessionSet = SessionSet U {SK}
        Write(SessionFile, SessionSet)
    End If
End While.

Algorithm Name: User Identification.
Input: Processed Web Log File.
Output: Number of Distinct User.
Step1: Read records from Web Log File.
Step2: User’s IP addresses of two consecutive entries are compared.
Step3: If (IP address is same) then
           check user’s browser and operating system
           if both are same then
               consider same user.
           else
               consider new user.
           end if
       end if
Step 4: Repeat above 2 steps until EOF (Web Log File).

Algorithm Name: Data Cleansing of Web Log File
Input: Web Server Log file
Output: Log Database
Step1: Read Log Record from Web Log File
Step2: If (Log Record.url contains (gif, jpeg, jpg, css)) AND (different error like HTTP 404 or more) found then
           Remove from web log file.
       End of If condition.
Step3: Repeat the above two steps until EOF (Web Log File)
Step4: Stop the process.” (Langhnoja, Barot, & Mehta, 2013) (Raheja & Katiyar, 2014)

Clustering calls for dividing the resources in the log. Whereas the popularity rank is based on a single user's behavior, clustering requires the resources to be weighed on the indications of multiple users. A modification is therefore necessary, and it can be assessed on two scales: relevancy and clustering. For relevancy the algorithm can be written as follows, with ‘w’ denoting the number of web pages:

I.   For r=1; r<=w; r++: visit[r] = 0
II.  IF (nth web page is visited) then visit[n] = visit[n]+1
III. For r=1; r<=w; r++: Sort(visit[r])
IV.  For r=1; r<=w; r++: Rank(r) = r
V.   o = n/g
VI.  For i=1; i<=g; i++:
         For P=(o*(i-1))+1; P<=o*i; P++:
             (Cluster[i], rank[P])   (cluster[i] is of o pages per rank)
VII. IF (nth page is visited by the user) goto step II
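The ranking steps above can be sketched in Python as follows. This is one possible reading under stated assumptions: visits are counted per page, pages are ranked by descending count, and the ranked list is cut into at most `g` consecutive groups of equal target size. The function names and the tie-breaking rule are illustrative.

```python
from collections import Counter


def rank_pages(visits):
    """Return pages ordered from most to least visited."""
    counts = Counter(visits)
    # Sort by descending count, then by name so ties are deterministic.
    return sorted(counts, key=lambda page: (-counts[page], page))


def cluster_by_rank(ranked_pages, g):
    """Split the ranked list into at most g consecutive groups."""
    if not ranked_pages:
        return []
    size = -(-len(ranked_pages) // g)  # ceiling division: pages per group
    return [ranked_pages[i:i + size] for i in range(0, len(ranked_pages), size)]
```

Because the counts here come from whatever visit list is passed in, the same code serves a single user or many users; the limitation lies in the popularity criterion itself, not in the bookkeeping.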

Such a modification can serve a small number of users or an individual one, but it cannot be applied to the mass of users. This is a technical limit. Our focus is now on the


APRIORI³ and APRIORIALL⁴ algorithms, since both use user data at an optimum level; the former's emphasis is on web log mining and the latter's is on web usage mining.

(Rajagopalan & Shanthi, 2013)

In this context Shanthi et al. put forward the algorithm as

“Input: U = {U1, U2, …, Ui} // The set of users
D = {t1, t2, …, tk} // Database of sessions with UserID
S // Support
Output: sequential patterns Ck
D` = sort D` on UserID and time of first page reference in each session;
L1 with UserID = {large 1-itemsets};
For (k=2; Lk-1 != null; k++) do
Begin
    Ck = Apriori-gen(Lk-1, U); // new candidate set
    For all transaction ti ∈ D` do
    Begin
        Ci = subset(Ck, ti);
        For all candidate c ∈ Ci do
            c.count++;
    End
    Lk = {c ∈ Ck, c.count > S}; // S: support
End
Find maximal reference sequences from L;

Procedure Apriori-gen(Lk-1, S, U)
Ck = null;
For each itemset Li ∈ Lk-1
    For each itemset Lj ∈ Lk-1
    Begin
        If Li and Lj has same U
        begin
            C = Li join Lj;
            If has infrequent-subset(c, Lk-1) Delete c;
            Else Add c to Ck;
        End
    End
Return Ck;

Procedure has infrequent-subset(c, Lk-1)
For each (k-1) subset s of c
    If s ∈ Lk-1 then returns False;
    Else True.” (Rajagopalan & Shanthi, 2013)

3 “Apriori is a classic algorithm for frequent item set mining and association rule learning over transactional databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those item sets appear sufficiently often in the database. The frequent item sets determined by Apriori can be used to determine association rules which highlight general trends in the database. This has applications in domains such as market basket analysis.” (Gao, 2010)

4 “The algorithm happens to be a modification of Apriori Algorithm. The modification allows to put the data in correct order by using User-ID and time-stamp sort. The major difference between Apriori and Apriori-All is that Apriori-All makes use of full join for candidate sets. In case of Apriori, it is only forth joined. Thus, Apriori-All is more appropriate for web usage mining rather than Apriori. Apriori is found suitable for web log mining. The sorting of candidate sets identifies the sequential patterns that are complete reference sequence for a user across various transactions. It is iterative in the senses that first scan finds large 1-itemset. Initially, a frequent 1-itemset is the same as a frequent 1-sequence. The subsequent scan divulges more candidate sets from this larger item sets of the previous scan and it may be counted for reference. The counting indicates support.” (Dunham, 2003)

The same algorithm was later enhanced and modified to bring more accuracy, relevancy and precision. The modified version reads:

“Input: U = {U1, U2, …, Ui} // the set of users
D = {t1, t2, …, tk} // Database of sessions with UserID
S // Support
Output: sequential patterns Ck
D` = sort D` on UserID and time of first page reference in each session;
L1 with UserID = {large 1-itemsets};
For (k=2; Lk-1 != null; k++) do
Begin
    Ck = Apriori-gen(Lk-1, U); // new candidate set
    For all transaction ti ∈ D` do
    Begin
        Ci = subset(Ck, ti);
        For all candidate c ∈ Ci do
            c.count++;
    End
    Lk = {c ∈ Ck, c.count > S}; // S: support
End
Find maximal reference sequences from L;

Procedure Apriori-gen(Lk-1, S, U)
Ck = null;
For each itemset Li ∈ Lk-1
    For each itemset Lj ∈ Lk-1
    Begin
        If Li and Lj has same U
        begin
            C = Li join Lj;
            If has infrequent-subset(c, Lk-1) Delete c;
            Else Add c to Ck;
        End
    End
Return Ck;

Procedure has infrequent-subset(c, Lk-1)
For each (k-1) subset s of c
    If s ∈ Lk-1 then return False;
    Else True”. (Wang & He, 2005)
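The candidate generation and pruning loop quoted above can be illustrated with a compact Python sketch. Note that this is plain Apriori over unordered page sets, not the AprioriAll variant with UserID sorting and sequence ordering; it also uses "at least `min_count` sessions" as the threshold, whereas the quoted pseudocode uses a strict inequality. All names are illustrative.

```python
from itertools import combinations


def has_infrequent_subset(candidate, prev_frequent):
    """True if any (k-1)-subset of `candidate` is not a frequent itemset."""
    k = len(candidate)
    return any(
        frozenset(s) not in prev_frequent for s in combinations(candidate, k - 1)
    )


def apriori_gen(prev_frequent):
    """Join frequent (k-1)-itemsets into k-itemset candidates, then prune."""
    candidates = set()
    for a in prev_frequent:
        for b in prev_frequent:
            union = a | b
            # Keep only joins that grow the itemset by exactly one page.
            if len(union) == len(a) + 1 and not has_infrequent_subset(
                union, prev_frequent
            ):
                candidates.add(union)
    return candidates


def apriori(sessions, min_count):
    """Return {itemset: count} for page sets found in at least `min_count` sessions."""
    sessions = [frozenset(s) for s in sessions]

    def count(itemset):
        return sum(itemset <= s for s in sessions)

    items = {frozenset([page]) for s in sessions for page in s}
    current = {c for c in items if count(c) >= min_count}
    frequent = {}
    while current:
        for c in current:
            frequent[c] = count(c)
        current = {c for c in apriori_gen(current) if count(c) >= min_count}
    return frequent
```

The pruning step is what keeps the candidate sets small: a three-page set is never counted if one of its two-page subsets already failed the threshold.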

V. LIMITATIONS

For now we have a workable solution, but the question is whether it is sufficient. With regard to the reference length approach, the time a user spends on a correlated page that may be auxiliary in nature does not give us a forward link.

Similarly, we are unable to get a good result with regard to maximal forward reference, where the focus is on the number of pages in a user session reached from a backward page. (Yu, Park, & Chen, 1996)
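For reference, one common formulation of maximal forward reference extraction is sketched below: a click sequence is cut whenever the user returns to a page already on the current path, and the forward path built up to that point is recorded. The page labels are single letters chosen purely for illustration.

```python
def maximal_forward_references(clicks):
    """Split a click sequence into maximal forward references."""
    path, results = [], []
    moving_forward = True
    for page in clicks:
        if page in path:
            # Backward reference: record the path only if we were moving forward.
            if moving_forward:
                results.append(list(path))
            # Rewind the path to the revisited page.
            path = path[: path.index(page) + 1]
            moving_forward = False
        else:
            path.append(page)
            moving_forward = True
    if moving_forward and path:
        results.append(path)
    return results
```

For the click sequence A, B, C, D, C, B, E the function yields the paths A-B-C-D and A-B-E; the backward moves themselves are discarded, which is why this approach says nothing about how long a user stayed on each page.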

Another problem we can identify concerns the time stamp or time window, where user sessions are considered according to their time interval. The above approach gives us no clue about it. (Cooley, Srivastava, & Mobasher, 1999)

The above discussion and analysis is based on association rule mining⁵, and the analysis ends with some limitations in our findings:

- It produces too many rules
- The analysis does not guarantee relevancy
- Parameters are set on minimum confidence with minimum support
- Higher chances of false discovery
- Increasing possibility of wrong prediction

VI. RECOMMENDATION

We need an algorithm that has the innate capability to recognize a cluster within an even larger spatial data set and that examines the database's density for the elements and for usable input or output parameters. At the same time it must help users select the parameters. The DBSCAN clustering algorithm can deliver such precision, and it is also known to work as a classifier between qualified data and distorted data.

DBSCAN keeps the focus on density and its distribution across the nodes, works fast and can be scaled up. Other major advantages of the DBSCAN clustering algorithm include:

- It does not need the number of clusters to be specified a priori, in contrast to k-means
- It can find arbitrarily shaped clusters
- Its MinPts parameter reduces the single-link effect
- It has a notion of noise, that is, of distorted data
- It needs only a small number of parameters

5 ARM was conceived to group objects with similar behavior, requiring less input data while inferring more relevant results.

At first an arbitrary point that has not yet been visited is chosen, and from there its neighborhood is formed using both Eps and MinPts, where Eps defines the radius of the neighborhood from which clustering starts and MinPts is the minimum number of members needed to form a cluster. In the algorithm, noise or distortion refers to points whose neighborhood does not reach MinPts, although such a point can later be discovered to be part of another cluster. The distinctive aspect of DBSCAN is that it continues in the same way, checking intra-cluster and inter-cluster similarity and density, retrieving the neighborhood and allowing for a different Eps. What does this mean? The baseline for forming a cluster gets redefined: a single cluster can have multiple Eps values, and changing the start point for clustering will not only improve the browsing patterns we obtain from the mined data but also ensure more flexibility and control. The whole idea can be pictured with the array concept, where you apply one setting, get your results and then move on to another. Here the database is assessed with varying capabilities and capacities with regard to our pattern analysis. Our recommendation can be summarized as follows:

- Web logs are collected, preprocessed and transferred to a database.
- Any algorithm of our choice can be applied, depending on the purpose of pattern discovery; the algorithms can also be combined.
- Lastly, the patterns appear from the data we analysed.

The logical sequence is shown below:

(Langhnoja, Barot, & Mehta, 2013)

Putting the whole process in graphic form may make it clearer. At first you clean your log files:

(Langhnoja, Barot, & Mehta, 2013)

Then the process for session and user identification follows:



After data cleaning is done, it is time for clustering, which follows the process we have described so far:


And

(Langhnoja, Barot, & Mehta, 2013)

The last step is the algorithmic pattern mining:


Now we get closer to our pattern mining from the logs, having identified the users and their sessions.
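The density-based clustering recommended in this section can be sketched in a few lines of plain Python. This is a textbook-style DBSCAN with a single fixed `eps`, not the varying-Eps refinement discussed above, and the idea of representing each session as a numeric point (for example pages per session and minutes per page) is an assumption made only for illustration.

```python
import math

NOISE = -1


def region_query(points, idx, eps):
    """Indices of all points within distance eps of points[idx], itself included."""
    return [j for j, q in enumerate(points) if math.dist(points[idx], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks points in no cluster."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned or marked as noise
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels
```

The linear scan in `region_query` keeps the sketch short but makes it quadratic in the number of points; a spatial index would be the usual remedy for large log databases.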

VII. CONCLUSION

Our findings from the above analysis and discussion lead to the conclusion that browsing pattern mining is growing into a big area in computing, and that if data cleansing and preprocessing can be ensured, it is very much possible to apply a stronger algorithm to reach our research objectives. In this report, which was prepared in partial fulfillment of the requirements of the course XXX, we covered every aspect in detail and put forward DBSCAN as a strong consideration.

ACKNOWLEDGMENT

I would like to express my deepest gratitude to my professor XXXX for giving me the wonderful opportunity to look deeply into modern web computing and to become acquainted with the practical perspectives of assessing browsing patterns mined from user-generated data. I must admit that preparing this paper has greatly enriched my knowledge and has made me more interested in my course.


VIII. WORKS CITED

Cooley, R., Srivastava, J., & Mobasher, B. (1999). Data Preparation for Mining World Wide Web Browsing Patterns. Minneapolis: University of Minnesota.

Dunham, M. H. (2003). Data Mining - Introductory and Advanced Topics. Beijing: Tsinghua University Press.

Gao, W.-H. (2010). Client Behavior Pattern Recognition System Based on Web Log Mining. 9th International Conference on Machine Learning and Cybernetics (pp. 466 - 480). Qingdao: IEEE.

H., H. I., K., T., & A, P. (2007). Rough Set based Feature Selection for Web Usage Mining. International Conference on Computational Intelligence and Multimedia Applications.

Huiying, Z., & Wei.An, L. (2004). Intelligent Algorithm of Data Preprocessing in Web Usage Mining. 5th World Congress on Intelligent Control and Automation (pp. 15 - 19). Hangzhou: Intelligent Control and Automation.

Jain, R., & Purohit, D. G. (2011). Page Ranking Algorithms for Web Mining. International Journal of Computer Applications, 22 - 25.

Jiao, L., Zhang, H., & Dong, Y. (2006). Research on Application of User Navigation Pattern Mining Recommendation. 6th World Congress on Intelligent Control and Automation. Dalian: Intelligent Control and Automation.

Jose, J., & Lal, P. S. (2013). Extracting Extended Web Logs to Identify the Origin of Visits and Search Keywords. Intelligent Informatics, Advances in Intelligent Systems and Computing, 435 - 441.

Lalithadevi, B., Ida, A. M., & Breen, W. A. (2013). A new approach for improving world wide web techniques in data mining. International Journal of Advanced Research in Computer Science and Software Engineering, 243 - 251.

Langhnoja, S. G., Barot, M. P., & Mehta, D. B. (2013). Web Usage Mining Using Association Rule Mining on Clustered Data for Pattern Discovery. International Journal of Data Mining Techniques and Applications, 141 - 150.

Levene, M., & Broges, J. (n.d.). Data Mining of User Navigation Patterns. Retrieved October 11, 2014, from www.dcs.bbk.ac.uk: http://www.dcs.bbk.ac.uk/~mark/download/web_mining.pdf

Lorenzo, J., & Domenechl, J. M. (2007). A Tool for Web Usage Mining. Intelligent Data Engineering and Automated Learning - IDEAL, 695 - 704.

Mele, I. (2013). Web usage Mining for Enhancing Search Result Delivery and Helping Users to Find Interesting Web Content. ACM WSDM, 765 - 769.

Raheja, N., & Katiyar, V. K. (2014). Efficient web data extraction using clustering approach in web usage. International Journal of Computer Sciences, 216 - 225.

Rajagopalan, D. S., & Shanthi, R. (2013). An Efficient Web Mining Algorithm to Mine Web Log Information. International Journal of Innovative Research in Computer and Communication Engineering, 1490 - 1500.

Thiyagarajan, V. S., & Venkatchalapathy, D. K. (2013). Web data mining - a research area in web usage mining. Journal of Computer Engineering, 22 - 26.

Tomai, N., & Mican, D. (n.d.). Association rules based recommender system for personalization in adaptive web based applications. Retrieved October 11, 2014, from Babes-Bolyai University: http://gplsi.dlsi.ua.es/congresos/qwe10/fitxers/QWE10_Mican.pdf

Vajk, I., & Ivancsy, R. (2006). Frequent Patterns Mining in Web Log Data. Acta Polytechnica Hungarica, 77 - 90.

Wang, T., & He, P.-l. (2005). Web Log Mining by an Improved AprioriAll Algorithm. World Academy of Science, Engineering and Technology, 97 - 100.

Yang, Y., & Padmanabhan, B. (2005). A Hierarchical Pattern Based Clustering Algorithm for Grouping Web Transactions. IEEE Transactions on Knowledge and Data Engineering, Vol. 17, No. 9.

Yang, Z. Z., & Li, W. Y. (2005). Mining Sequential Association Rule for Improving Web Document Prediction. Sixth International Conference on Computational Intelligence and Multimedia Applications. ICCIMA.

Yu, P. S., Park, J. S., & Chen, M. S. (1996). Data mining for path traversal patterns in a Web environment. International Conference on Distributed Computing Systems (pp. 385 - 392). ICDCS.
