Travel mode detection method based on big smartphone ...

10
Special Issue Article Advances in Mechanical Engineering 2017, Vol. 9(6) 1–10 Ó The Author(s) 2017 DOI: 10.1177/1687814017708134 journals.sagepub.com/home/ade Travel mode detection method based on big smartphone global positioning system tracking data Chaoran Zhou 1 , Hongfei Jia 1 , Jingxin Gao 2 , Lili Yang 1 , Yixiong Feng 3 and Guangdong Tian 1 Abstract This article proposes a machine learning–based travel mode detection method using urban residents’ travel routes as the data source, collected via smartphone global positioning system modules. A data-driven machine learning strategy was chosen in the model construction. This study performed data cleaning and mining on over 4400 pieces of urban resident travel records containing several millions of global positioning system tracking points. Series of characteristic values of speed, travel distance, and direction are calculated, which reflect the travel mode of smartphone holders. In travel mode identification, first, the transition regions of travel segments of different travel modes are effectively distinguished; then, continuous tracking points for single-mode travel are connected into single-mode travel segments. The travel mode of the surveyed subjects is identified based on the calculated features of average speed, average acceleration, and average change of direction within each single-mode segment. The random forest method is chosen as the basis model to classify travel mode. Three-quarters of the travel records were used to construct the random forest classifier, and the detection accuracy of the established model for the remaining ¼ of the travel record reached 94.4%. The proposed method uses massive smartphone global positioning system tracking points as the basis; the detection results are consistent with manually collected prompted recall survey records. Keywords Global positioning system tracking data processing, travel mode detection, automotive traffic, machine learning, random forest Date received: 3 March 2017; accepted: 6 April 2017 Academic Editor: Tao Feng Introduction The popularity of smartphones laid the foundation for the development of cell phone global positioning system (GPS) traveling surveys. By utilizing cell phone GPS for travel surveys, huge amounts of travel data can be obtained that realistically reflect people’s increasingly complex travel patterns and satisfy the demands of complex models in the big data era. Compared with tra- ditional travel surveys, with or without computer assis- tance, GPS-collected data are more complete and comprehensive, with low misreporting and false- reporting rates, 1–7 and they can, to a great extent, lower the burden on the survey subject and survey cost. 1,8–14 Many achievements have been made using GPS track- ing data thus far. However, because of the limitations 1 School of Transportation, Jilin University, Changchun, China 2 Antai College of Economics & Management, Shanghai Jiao Tong University, Shanghai, China 3 State Key Laboratory of Fluid Power & Mechatronic Systems, Zhejiang University, Hangzhou, China Corresponding author: Hongfei Jia, School of Transportation, Jilin University, Changchun 130022, China. Email: [email protected] Creative Commons CC-BY: This article is distributed under the terms of the Creative Commons Attribution 4.0 License (http://www.creativecommons.org/licenses/by/4.0/) which permits any use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access pages (https://us.sagepub.com/en-us/nam/ open-access-at-sage).

Transcript of Travel mode detection method based on big smartphone ...

Special Issue Article

Advances in Mechanical Engineering2017, Vol. 9(6) 1–10� The Author(s) 2017DOI: 10.1177/1687814017708134journals.sagepub.com/home/ade

Travel mode detection method basedon big smartphone global positioningsystem tracking data

Chaoran Zhou1, Hongfei Jia1, Jingxin Gao2, Lili Yang1, Yixiong Feng3 andGuangdong Tian1

AbstractThis article proposes a machine learning–based travel mode detection method using urban residents’ travel routes as thedata source, collected via smartphone global positioning system modules. A data-driven machine learning strategy waschosen in the model construction. This study performed data cleaning and mining on over 4400 pieces of urban residenttravel records containing several millions of global positioning system tracking points. Series of characteristic values ofspeed, travel distance, and direction are calculated, which reflect the travel mode of smartphone holders. In travel modeidentification, first, the transition regions of travel segments of different travel modes are effectively distinguished; then,continuous tracking points for single-mode travel are connected into single-mode travel segments. The travel mode ofthe surveyed subjects is identified based on the calculated features of average speed, average acceleration, and averagechange of direction within each single-mode segment. The random forest method is chosen as the basis model to classifytravel mode. Three-quarters of the travel records were used to construct the random forest classifier, and the detectionaccuracy of the established model for the remaining ¼ of the travel record reached 94.4%. The proposed method usesmassive smartphone global positioning system tracking points as the basis; the detection results are consistent withmanually collected prompted recall survey records.

KeywordsGlobal positioning system tracking data processing, travel mode detection, automotive traffic, machine learning, randomforest

Date received: 3 March 2017; accepted: 6 April 2017

Academic Editor: Tao Feng

Introduction

The popularity of smartphones laid the foundation forthe development of cell phone global positioning system(GPS) traveling surveys. By utilizing cell phone GPSfor travel surveys, huge amounts of travel data can beobtained that realistically reflect people’s increasinglycomplex travel patterns and satisfy the demands ofcomplex models in the big data era. Compared with tra-ditional travel surveys, with or without computer assis-tance, GPS-collected data are more complete andcomprehensive, with low misreporting and false-reporting rates,1–7 and they can, to a great extent, lower

the burden on the survey subject and survey cost.1,8–14

Many achievements have been made using GPS track-ing data thus far. However, because of the limitations

1School of Transportation, Jilin University, Changchun, China2Antai College of Economics & Management, Shanghai Jiao Tong

University, Shanghai, China3State Key Laboratory of Fluid Power & Mechatronic Systems, Zhejiang

University, Hangzhou, China

Corresponding author:

Hongfei Jia, School of Transportation, Jilin University, Changchun 130022,

China.

Email: [email protected]

Creative Commons CC-BY: This article is distributed under the terms of the Creative Commons Attribution 4.0 License

(http://www.creativecommons.org/licenses/by/4.0/) which permits any use, reproduction and distribution of the work without

further permission provided the original work is attributed as specified on the SAGE and Open Access pages (https://us.sagepub.com/en-us/nam/

open-access-at-sage).

of earlier GPS devices, the sample size is usually small,and the sample density is usually sparse.15

In 2004, Asakura and Hato16 first used cell phonepositioning functions to collect individual travel pathdata, obtained high-quality positioning data, andproved the feasibility of using smartphones for travelsurveying. Thereafter, further studies were conducted,represented by Naphade, Douma, Gonzalez, andStenneth in the United States and by Nitsche andBierlaire in Europe, and the smartphone-based travelsurvey method became more mature.10,17–19 WiFi,Global System for Mobile (GSM), and accelerometerdata collected by other cell phone sensors are alsobroadly used.20,21 L Shen and PR Stopher analyzed theimpact on trip-end identification of different GPS sig-nal sampling intervals and signal disappearance times.The results indicated that the accuracy of trip-end iden-tification was highest when the GPS signal samplinginterval was 5 s and the signal dwell time was 60 s.22 HSafi et al. used a smartphone application, Atlas, fortravel surveying and compared the survey process andresults against online travel surveying and handheldGPS travel surveying. They discovered thatsmartphone-based travel surveying places the least bur-den on the survey subjects, the quality of the surveydata is better, and the survey completion rate ishigher.23

The GPS devices, including GPS modules of smart-phones, can only record a series GPS positioning pointsduring the travel, and other meaningful informationsuch as trip ends, travel modes, and purposes cannot beobtained directly from the GPS data.24,25 To obtain theabove information, prompted recall (PR) survey is usu-ally conducted. However, the smartphone-based GPSdata can easily collect thousands of or even more tripsdata in a very short period, while the manual PR surveycannot fully cover the more and more GPS data.

Thus, identifying travel modes solely from the large-scale GPS data becomes essential technique in currenttravel behavior studies. After data cleaning and pri-mary analysis such as trip-end identification,26 most ofprevious studies set certain conditions and rules to iden-tify the travel modes. These rules and conditions areusually based on transportation surveys accumulated inthe past and are closely related to geographic location,local city road conditions, transportation regulations,and resident living habits for the obtained data, in otherwords, subjectivity and arbitrariness. Lacking of uni-versality, these rule-based methods are difficult to beapplied to other research or practice scenarios.

To resolve the abovementioned problems, this articleproposes a new data-driven method for travel modedetection. Specifically, the travel mode detection con-tains five major steps: (1) features of GPS points suchas speed, acceleration, and direction are calculated; (2)PR survey data are input as a reference into the

machine learning model to train the classifier, enablingit to recognize the transition points between differenttravel modes; (3) the continuous GPS tracking pointsare divided into single-mode travel segments;26 (4) fea-tures of the single-mode travel segments are calculated;and (5) a travel mode classifier is constructed fromtraining samples. The proposed method can reducesubjectivity interference as much as possible and sum-marize/reflect the intrinsic features of the massive data.This article uses the random forest model as a basisclassifier.27 The model is a data-driven and non-parameterized method that has good robustness, avoidsover-fitting, and satisfies the accuracy and performancerequirements of this study. The method relies more onthe knowledge mined from the data, rather than on tra-ditional experiences, which is consistent with the trendsof the big data era.28

Data description

Data collection and format

In this study, the GPS travel survey data are collectedin Shanghai, China from 2014 to 2015. The trackingdata are recorded by GPS modules in volunteers’ smart-phones. In preparation, all the volunteers are requestedto fill in a short questionnaire, which includes somequestions about their socio-economic information. TheGPS tracking is started before the volunteers leavinghome in the morning and ended after the volunteersfinally arrives at home. In this way, all the GPS track-ing points of volunteers within the whole day are col-lected. After one day’s data collection of a volunteer, aninterview on the telephone is conducted by trainedinterviewers to verify the derived travel information ofthe day. Each volunteer is requested to take part in thesurvey for at least 5 days.

Continuous GPS tracking data of smartphone hold-ers is recorded based on self-developed Android/iOSsmartphone GPS software (Figure 1). A Web-basedsurvey platform is designed to store GPS data and col-lect static data such as the socio-economic attributes,demographic features, and recurring locations of thesurveyed subjects.

The tracking collection software installed on thesmartphones of the survey subjects collects GPS track-ing information including user ID number, positioningtime, latitude, longitude, and altitude. The user IDnumber is the serial number of the smartphone ownercorresponding to the relevant sampling tracking points,and positioning time/coordinates is the time and spatialcoordinates of a GPS tracking point, which are thebasic information in GPS data collection.

This study recruited 459 volunteers by online andon-site methods and collected a total of over 17millionpieces of GPS tracking data for 3766 total days. Among

2 Advances in Mechanical Engineering

the volunteers, 360 provided socio-economic informa-tion. A total of 318 volunteers participated in the PRsurvey.29 As an important supplement to the GPStracking data automatically collected via smartphonesoftware, the PR survey data, including starting/endingtime, starting/ending point, travel mode, and travel pur-pose, were manually verified based on the GPS trackingdata. Since in this study, the PR survey records is usedas the ground truth in model training, the manual cura-tion is able to correct misreporting and false-reportingby the surveyed subjects.

A total of 288 volunteers both provided their familysocio-economic attributes and participated in the PRsurvey. The PR survey of these volunteers recorded atotal of 14,702 ‘‘travel/activities,’’ where travel refers towhen a surveyed subject is in a mobile state, and thecorresponding tracking points are called ‘‘mobilepoints’’; activity refers to when a surveyed subject is ina state of working or resting, and the correspondingtracking points are called ‘‘stationary points.’’

Data cleaning

In the continuous process of collecting position infor-mation on the surveyed subjects based on the GPS

model installed in smartphones, the following factorsmight induce interruption or error in the GPS collec-tion process and generate invalid data records: unstableGPS signal, surveyed subjects entering no-signal orweak-signal regions (such as underground), and lowphone power or high load. Thus, the raw data mustundergo a cleaning process to filter out incomplete anderroneous data records before the GPS tracking datacan be used for any study.1,2,26,29,30

Based on the actual condition of the GPS trackingdata collection and raw data analysis, combined withthe practical requirements of the travel mode detection,this article cleaned the raw data by deleting the follow-ing: (1) GPS tracking points of the surveyed subjectsnot in a traveling state; (2) GPS tracking points missingspatio-temporal positioning information; (3) GPStracking points with no or incomplete PR survey recordinformation; (4) GPS tracking points that were notlocated in the area of Shanghai, China; (5) GPS track-ing points with an instantaneous speed that is not con-sistent with common sense; (6) GPS tracking points inlow sampling rate regions; and (7) GPS tracking pointscorresponding to a traveling period shorter than 240 s.

The valid data after cleaning include 271 surveyedsubjects, a total of 1630days (average of 5.8 days per

Figure 1. GPS positioning software for different smartphone operating systems: (a) a screenshot of the android application and(b) a screenshot of the iOS application.

Zhou et al. 3

person), and a total of 3593 travels (average of 2.2 tra-vels per day). These travels include 4454 single-modetravel segments. There are in total 2,617,789 trackingpoints; each single-mode travel segment contains anaverage of 588 tracking points, and the average traveltime is 1421 s. These single-mode travel segmentsinclude six travel modes, namely, walk, bicycle,e-bicycle, car, bus, and subway. The number of seg-ments, average number of tracking points, and averagetravel duration for each mode are listed in Table 1.

Methodology

Basic strategy

This article treats the PR-surveyed travel informationas manually verified reliable results, but not as the start-ing point of data analysis. The raw GPS tracking datado not contain the starting and ending point informa-tion of each travel or information on the transitionbetween different travel modes.30 Thus, travel modedetection includes two main parts: single-mode travelsegmentation and travel mode identification.

As shown in Figure 2, this article first calculates 24traveling tracking point features from the cleaned data,

including collection time, coordinates of collected track-ing points, instantaneous speed, and instantaneousacceleration. For every tracking point, this article calcu-lates the spatio-temporal relationship among the fewtracking points before and after the point in the tem-poral sequence to assess whether the features of thesetracking points are similar. A machine learning modelis used to identify the tracking points with large differ-ences in features from the points before and after themas transition points between different travel modes.

After the identification of transition points betweentravel modes, the continuous GPS tracking data can bedivided into different single-mode travel segments. Thefeatures such as the speed, acceleration, direction, andactivity range of the surveyed subjects in each single-mode travel segment were calculated to re-train themachine learning model and recognize each travelmode.

Every single-mode travel segment is subdivided; thisarticle treats the first and last 10 tracking points as thetravel mode transition region and treats other points asnon-transition regions, together forming the trainingsample set. In the travel mode identification stage, ł ofthe single-mode travel segments are used as the trainingsample set and the other ¼ as the testing sample set.

Feature selection and calculation

In a field-oriented application of machine learningmethod, feature selection is more important than spe-cific model choice or some slight model improvements,because the selected features must reflect the essentiallaws of the field.

In this study, the proposed travel mode identificationmethod tries to avoid setting subjective rules and condi-tions that could interfere with the results. However, thisavoidance does not mean the negation of commonsense, experience, and basic traffic theories. On the con-trary, the machine learning model–based identificationmethod fuses common sense, experience, and basic traf-fic theories in the construction process of the classifiermore objectively and scientifically. The trained machinelearning model not only can reflect properties and

Table 1. Basic information on travel mode data.

Travel mode Number oftravel segments

Averageduration (s)

Standarddeviation (s)

Average trackingpoints contained

Standard deviation(points)

Walk 1784 776 624 413 313Bicycle 237 1203 860 743 515e-Bicycle 141 1167 967 702 566Car 840 1931 1585 1023 797Bus 825 1627 1158 794 556Subway 627 2442 1760 144 311Total 4454 1421 1325 588 582

Figure 2. Global strategies.

4 Advances in Mechanical Engineering

information mined from the training sample data butalso is consistent with common sense, experience, andtraffic research results to date, as well as basic traffictheories and knowledge. During this process, the contri-bution of common sense, experience, current researchresults, and basic traffic theories to the construction ofthe model is mainly concentrated in the aspects of fea-ture selection and calculation. The rationality of featureselection and feature calculation efficiency determinesthe accuracy and performance of the model.

The information of the average speed, average accel-eration, direction, and tracking-point congregationdegree of a surveyed subject for a segment of continu-ous tracking points represents the movement state ofthe surveyed subject in the corresponding trackingregion, providing important features for judging thetravel mode.26 In this study, the search for differentsingle-mode travel segmentation points requires judgingwhether the segments before and after a candidatetracking point in a temporal sequence belong to differ-ent travel modes; this judgment also relies on the above-mentioned features.

The GPS tracking data record is a series of time–space coordinates of the surveyed subjects; every recordonly includes the tracking point attributer’s coordinatesat that particular moment. The features of the surveyedsubject in any single-mode travel segment must be cal-culated from the time and coordinate information ofevery point in the segment. Among these features, thefive basic features of distance between two trackingpoints, time difference, direction, speed, and accelera-tion are the foundation of each single-mode travel seg-ment feature, and the calculation method is as follows.

Let Ti be the sampling time of tracking point pi inthe single-mode travel segment, Ei be the longitude ofpi, and Ni be the latitude of pi. Let R be the radius ofthe earth, R=6,371,000m; the above are known data.

Then, denote the distance between tracking points piand pj as di,j, in units of meters31

di, j =R � arccos cosu1 � cosu2 + sinu1 � sinu2 � cosDlð Þð1Þ

in which u1= (90�2Ni)�p/180�, u2= (90�2Nj)�p/180�,and Dl=(Ej2Ei)�p/180�.

Denote the time difference between tracking pointspi and pj as ti,j, in units of seconds

ti, j = Ti � Tj

�� �� ð2Þ

Denote the instantaneous speed of tracking points pias vi, in units of meters per second

vi =di�1, i + di, i+ 1ð Þ

ti�1, i+ 1

ð3Þ

Denote the direction between pi and pj as the acuteangle formed with the north/south direction, Ai,j, inunits of degrees31

Ai, j =1808

p� arctan sinDl

cosu1 � tanu2 � sinu1 � cosDl

� ���������ð4Þ

in which u1= (90�2Ni)� p/180�, u2= (90�2Nj)� p/180�, and Dl=(Ej2Ei)�p/180�.

Denote the direction between pi and pj as the angleformed with the north direction, A0i, j, in units of degrees

A0i, j =

Ai, j if Ej � Ei, Nj.Ni

1808� Ai, j if Ej.Ei, Nj�Ni

1808+Ai, j if Ej�Ei, Nj\Ni

3608� Ai, j if Ej\Ei, Nj � Ni

8>><>>:

ð5Þ

when Ei=Ej and Ni=Nj, the surveyed subject isnot moved, let A0i, j =A0i�1, j.

Denote the acceleration of tracking point pi as ai, inunits of meters per second squared

ai =

di, i+ 1

ti, i+ 1� di�1, i

ti�1, i

ti, i+ 1

ð6Þ

Based on the above basic features and according tothe requirements of different travel modes, this articleuses five sets of 24 features to depict the different travelmodes of the surveyed subjects in single-mode travelsegments. These features include the following: (1) thespeed distribution features, (2) the global and extremevalue features, (3) the acceleration distribution features,(4) the stoppage features, and (5) the direction chang-ing features of the travel segment.

The first set of features consists of speed-related fea-tures, which is the most obvious feature for distinguish-ing travel mode. In practice, this study uses 95%quantile speed in the travel segment instead of the glo-bal maximum instantaneous speed, which helps toreduce the impact of data error on the classificationaccuracy. Average speed, speed standard deviation,and each speed quantile reflect the speed distribution ofa surveyed subject in each travel segment, including95%, 75%, 50%, and 25% quantile speeds.

The second set of features consists of global andextreme value features. These features reflect the tem-poral and spatial sampling distribution of the GPStracking points in the entire travel segment. Amongthese features, the maximum sampling time intervaland distance usually reflect the signal-loss properties ofsubway travel or the urban canyon zone. The stoppingpoint ratio can reflect the time proportion the surveyedsubject spends in a stationary state. The low speed ratioreflects the time proportion of the surveyed subject in a

Zhou et al. 5

low-speed state during the travel segment. The maxi-mum travel distance reflects the activity area of the sur-veyed subject during the travel segment. Table 2 liststhe meaning and calculation method of each feature inthe global and extreme value feature set.

The third set of features used in this article consistsof acceleration-related features, including average accel-eration and acceleration at the 95%, 75%, and 50%quantile points. Acceleration features reflect the degreeof change in instantaneous speed, which can effectivelydistinguish a tracking point state in an automobiletravel region from a non-automobile travel region. Thisability is especially useful in a traffic jam situation,where the surveyed subject would have a low speed andtime spent stopped in a relatively small region. Fromthe perspectives of average speed, low-speed-pointratio, and total travel distance, a traffic jam is difficultto distinguish from a walking travel region, but theacceleration features of starting and stopping an auto-mobile can indicate the difference quite well. Moreover,this article only considers the absolute value of accelera-tion in practice, that is, it does not consider the differ-ence between acceleration and deceleration, only themagnitude of acceleration/deceleration.

The fourth set of features comprises the stoppagefeatures. Regardless of traveling in automobile or non-automobile mode, urban travelers will inevitablyencounter situations of random stoppage such as beingstuck in a traffic jam, waiting for traffic-light signals atintersections, or waiting at bus stops. Different travelmodes typically have different stoppage frequenciesand durations. This article uses the statistics of the

occurrence frequency of different stoppage durations asthe stoppage features of a travel segment, includingnumber of stoppages within 0–5, 5–15, 15–30, 30–60,and above 60 s, per kilometer. These stoppages describethe frequency of short, medium, and long stops in asingle-mode travel segment. The total numbers of themare obtained from the statistics of the stoppage time inthe travel segment and then divide by the travel dis-tance of the segment for normalization.

The fifth set of features in this article, the directionchange features, only considers the features of averagedirection change in the travel segment to reflect howoften the direction changed in the travel segment of thesurveyed subject. Generally, changes of direction inwalking or bicycling are more random and frequent,while automobile travel must follow routes, and theaverage direction change is smaller. To obtain the aver-age direction change feature, the instantaneous direc-tions formed by neighboring tracking points arecalculated first using equations (4) and (5). The differ-ences between directions of a neighboring pair of pointsare the direction change. Then, we take the averagevalue of the direction change in the entire segment.

To sum up, although there is no explicit rule appliedin the modeling process, a total of 24 features coveredthe speed distribution features, the global and extremevalue features, the acceleration distribution features,the stoppage features, and the direction changing fea-tures of a travel segment will be integrated in the con-structed model by training. Thus, the data-drivenmethod is not only consistent with common sense,experience, current research results, and basic traffic

Table 2. Global and extreme value features.

Feature name Unit Physical meaning Calculation method

Total travel distance m Distance in the entire travel segment Accumulation of line distance betweenevery two neighboring points

Total time s Total travel time Difference between travel ending-pointand starting-point sampling times

Maximum sampling time interval s Longest time period without signal Maximum value of sampling timedifference between neighboring points

Maximum sampling distance interval m Longest distance without signal Maximum value of sampling linedistance between neighboring points

Distance to subway entrance m Shortest distance from travel starting/ending point to subway entrance

Line distance to the closest subwayentrance from travel starting/endingpoint

Maximum travel distance m Maximum distance between two pointsin the travel segment

Line distance between the two farthestpoints

Stationary-point ratio – Proportion of points whoseinstantaneous speed is less than 0.1 m/s

Ratio of number of points withinstantaneous speed less than 0.1 m/sto total sampling points

Low-speed-point ratio – Proportion of points whoseinstantaneous speed is less than 3 m/s

Ratio of number of points withinstantaneous speed less than 3 m/s tototal sampling points

6 Advances in Mechanical Engineering

theories but also more powerful, more comprehensivethan the existing rule-based methods.

Selection and application of classifier

Although in this article, we focus on the global strategyto develop the machine learning-based framework fortravel mode identification, and do not intend to discussclassifier selection in detail. The random forest27 modelis a data-driven, non-parametric classification method,with very few parameters necessarily to be assumedbased on prior experience. The model obtained fromtraining has good interpretability, which is consistentwith the purpose of this article in discovering the travelfeatures and modes of residents based on massive GPStracking data.

Besides, this article calculates five sets of 24 featuresfor each single-mode travel segment as the input to theclassifier to obtain the six classification results of walk,bicycle, e-bicycle, car, bus, and subway travel. Thereremain a few data points containing errors after thedata cleaning. Due to the limitations of the PR survey,there are also some erroneous registered records in thetraining sample set. Moreover, to describe the speed/acceleration distribution in the travel segment, this arti-cle used multiple speed/acceleration quantile pointssimultaneously as sample features; obviously, theselected features show a certain correlation. The ran-dom forest method can also well address the abovesample noise, incorrect registration and feature correla-tion problems, which are very difficult to be solved bythe rule-based methods.

The random forest method was proposed byBreiman et al.32 in 2001 and is an integrated learningmodel using a decision tree as the basic classifier. Theintegrated learning strategy of random forest is to selecta random training sample for each decision tree, givingthe method a good tolerance for abnormal values andnoise. It can effectively resolve the problem of over-fitting and is insensitive to data error and classificationregistration error. Through random selection of thefeature set, the random forest method is also insensitiveto the correlation between features.

In a nutshell, comparing to other machine learningmethods, such as linear discrimination method,Bayesian methods, simple decision tree, artificial neuralnetwork (ANN), or support vector machine (SVM),the random forest model is able to solve non-linearproblem, requires few prior knowledge, is insensitive tonoise, abnormal values, and correlated features, and iswell interpretable. These advantages make the randomforest model very suitable for our study. Besides, theparallel computation can be easily applied to the ran-dom forest model, which make it provides a better per-formance than other methods.

Results and discussion

Single-mode travel segmentation results

The samples in this article contain 3593 travels, ofwhich 687 are multi-mode travels, including 1548single-mode travel segments. Every multi-mode travelcontains two-to-five single-mode travel segments; theremaining 2906 travels are single-mode travels.

This study uses travel transition points containingthe travel mode registered in the PR survey and a fewneighboring tracking points as positive training sam-ples (transition points) and uses a few random trackingpoints far from the transition point in each single-modesegment as negative samples (non-transition points) totrain the random forest model. Then, this study usesthe trained model to recognize transition points fortravel modes and divide single-mode travel segments.

The proposed method can correctly recognize 97.9%of the travel mode transition points (sensitivity). Due tothe continuity of GPS tracking points, the region near atravel mode transition point can be recognized as tran-sition point in practice. Among all travel mode transi-tion points recognized by the proposed method, 73.2%are located within 15 s of the transition point registeredin the PR survey, while 85.3% are within 30 s. In furtherexperiments, we tried to reduce the area of the recog-nized travel mode transition region and to decrease thefalse-positive rate. When the travel mode transitionregion recognized by the random forest model wasmodified to 60% around the center, over 90% of recog-nized travel mode transition points were within 30 s ofPR survey–registered transition points. However, only88.3% of the (PR survey–registered) travel mode transi-tion points could be correctly recognized.

The single-mode travel boundary registered in thePR survey might not be completely accurate; thus, thisstudy not only selected accurate single-mode travel seg-ment boundaries as the positive training sample butalso allowed certain errors in assessing detection accu-racy. First, the transition of the travel mode of sur-veyed objects is a process; in the follow-up telephoneinterview, the surveyed subject could identify any track-ing point between ‘‘arrived at platform’’ and ‘‘got onthe bus’’ as the travel mode transition point. Second,the surveyed subject might not be able to accuratelyremember the time he or she got on the bus to the sec-ond. In other words, generally speaking, we can viewthe PR survey registration as accurate; but in practice,the transition point registered in the PR survey alongwith a few neighboring tracking points could all possi-bly be the actual travel mode transition point.

Travel mode detection results

This study randomly selected ł (3340) of the single-mode travel segments as the random forest model

Zhou et al. 7

training sample set, and the other ¼ (1114) were usedas the testing set. As shown in Table 3, this study ran-domly selected the training set and testing set 10 timesaccording to the above method. The detection accuracyof the trained random forest model for the testing setcan exceed 92%, with the highest reaching 94.4%.

This study uses the 10th detection result for furtheranalysis. Figure 3 shows the confusion matrix of thetravel mode detection results. Due to the obvious dis-tinction in speed and maximum time interval features,the detection accuracies of subway and walking travelare the highest, reaching 98.3% and 97.8%, respec-tively. The car travel mode and bus travel mode remaineasily confused; the detection accuracies of the two are90.2% and 84.4%, respectively. Approximately 8.3%of car travel was identified as bus travel, and approxi-mately 11.1% of bus travel was recognized as car travel.At the same time, 9.3% of e-bicycle (four instances)travel was wrongly recognized as bus travel. This typeof erroneous detection usually occurs in a single-mode

travel segment with a high low-speed-point ratio and ahigh stoppage frequency; it is speculated that the travelmay have occurred during the high-traffic period andalong routes with traffic jams, leading to indistinguish-able features.

Detection results and contribution of each feature set

This study used five sets of 24 features in total as theinput to construct the travel mode detection modelbased on the random forest method. To examine thedetection results of each feature set, this study first usedan individual feature set to identify the travel mode. Inidentifying the state of each individual tracking point,the best results are from set 2 (global and extreme valuefeatures) and set 1 (speed distribution features); thedetection accuracy from both groups reached approxi-mately 87%. Set 4 (stoppage features) can achieve 72%accuracy in travel mode detection. The individual detec-tion abilities of set 3 (acceleration distribution features)and set 5 (direction change features) are quite low, butthe detection accuracy in terms of the six travel modeclassification can still reach 58.4% and 52.4%, respec-tively. The above results proved that the five sets of 24features selected in this study can all effectively recog-nize different travel mode to a certain degree.

To further examine the role of each feature set intravel mode detection when they are simultaneouslyentered into the model, this study uses the meandecrease in the Gini coefficient in the random forestmodel to assess the contribution of each feature set in

Figure 3. Confusion matrix of travel mode detection results.

Table 3. Test accuracy of 10 randomly selected sets of trainingset and testing set.

Time Accuracy (%) Time Accuracy (%)

1 93.4 6 93.62 93.3 7 93.63 94.2 8 94.44 92.1 9 92.65 93.7 10 93.4

8 Advances in Mechanical Engineering

the model construction. The mean decrease in the Ginicoefficient is used to calculate the influence of each fea-ture on the heterogeneity of the observed values ofevery node in a decision tree through calculation of theGini coefficient and thus to compare the importance ofthe features. In the training process, when using a fea-ture to group the node data, a larger decrease in theheterogeneity of the observed values means a betterresult of that feature in distinguishing sample classifica-tion, indicating that the feature is more important.

Figure 4 shows the contribution rates of each featureset in the model construction, calculated according tothe mean decrease in the Gini coefficient. Among allfeature sets, the speed distribution feature set has thelargest contribution, followed by the global andextreme value feature sets as well as the stoppage fea-ture set. The contribution of the acceleration distribu-tion feature to the model construction is relatively low,and the contribution of the direction change feature iseven lower. The results in Figure 4 are basically consis-tent with the results from using an individual featureset to recognize the states of tracking points. This find-ing proves that each feature set contributed to modelconstruction and increased the accuracy of travel modedetection to a certain degree; among all feature sets, thespeed distribution set and global/extreme feature setcontributions are the highest.

Conclusion

This study used massive GPS tracking data to examineand achieve a random forest–based travel mode detec-tion method, obtained as high as 94.4% overall detec-tion accuracy, and conducted an in-depth discussion offeature selection. The main conclusions are as follows.

Based on the analysis of the data collection processand the raw data, this article proposes seven data clean-ing rules: the massive smartphone GPS tracking data

were cleaned according to the research requirements oftravel mode detection.

This study takes the random forest model as the coreand proposes a data-driven method for travel modedetection. Based on existing study results and the litera-ture, this study proposed five sets of 24 identificationfeatures for travel mode detection, assessed the individ-ual performance of each feature set in travel modeidentification, and assessed the contribution ratio ofeach feature set in the construction process of the ran-dom forest-based model. Among all feature sets, thecontribution of the speed distribution feature set is thehighest, followed by the global and extreme value set,stoppage set, acceleration set, and, finally, directionchange set. However, lacking of any set of featureswill lead to an accuracy reduction, which indicates allsets of features do contribute to the travel modeidentification.

This study used the machine learning model to iden-tify travel modes. The results indicate that the big data–based model construction not only achieved a detectionaccuracy up to 94.4% but also maintained basic stabi-lity in multiple random selections of training and testingsets. This finding indicates that the proposed methodhas good consistency among different travels and differ-ent surveyed subjects. Compared with traditional rule-based methods for travel mode detection, the proposedmethod is more objective, intuitive, and convenient.And in contrast to the rule-based methods previouslyused to identify travel mode, the random forest–basedmethod requires no empirical classification criteria, butrather, the self-learning and construction of the modelare data driven.

Smartphones become more and more prevalent inrecent years, enabling large-scale transportation surveysbased on GPS modules of smartphones. These surveyswill produce huge amount of data. To apply these datain residents’ travel behavior studies, manual PR verifi-cation is no longer capable. Machine learning basedmethods which can automatically mine travel informa-tion such as travel mode solely from the GPS data areessential basis in future transportation research. Thisstudy creates a paradigm of the new methodology.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest withrespect to the research, authorship, and/or publication of thisarticle.

Funding

The author(s) disclosed receipt of the following financial sup-port for the research, authorship, and/or publication of thisarticle: This research was supported by National NaturalScience Foundation of China (51278301, 51478266, and51405075).

Figure 4. Contribution of each feature set to travel modedetection.

Zhou et al. 9

References

1. Bohte W and Maat K. Deriving and validating trip pur-

poses and travel modes for multi-day GPS-based travel

surveys: a large-scale application in the Netherlands.

Transport Res C: Emer 2009; 17: 285–297.2. Auld J, Williams C, Mohammadian A, et al. An automated

GPS-based prompted recall survey with learning algo-

rithms. Transport Lett: Int J Transport Res 2009; 1: 59–79.3. Yao B, Chen C, Cao Q, et al. Short-term traffic speed

prediction for an urban corridor. Comput-Aided Civ Inf

2017; 32: 154–169.4. Forrest T and Pearson D. Comparison of trip determina-

tion methods in household travel surveys enhanced by a

global positioning system. Transport Res Rec 2005; 1917:

63–71.5. Zeng J, Li M and Cai Y. A tracking system supporting

large-scale users based on GPS and G-sensor. Int J Dis-

trib Sens N 2015; 11: 862184.6. Weng JC, Wang C, Huang HN, et al. Real-time bus

travel speed estimation model based on bus GPS data.

Adv Mech Eng 2016; 8: 10.7. Tang WY and Cheng L. Analyzing multiday route choice

behavior of commuters using GPS data. Adv Mech Eng

2016; 8: 11.8. Yu B, Song XL, Guan F, et al. k-nearest neighbor model

for multiple-time-step prediction of short-term traffic

condition. J Transp Eng: ASCE 2016; 142: 04016018.9. Barbeau SJ, Labrador MA, Georggi NL, et al. TRAC-IT:

software architecture supporting simultaneous travel beha-

vior data collection and real-time location-based services

for GPS-enabled mobile phones. In: Proceedings of the

transportation research board 88th annual meeting, Washing-

ton, DC, 11–15 January 2009. Washington, DC: TRB.10. Gonzalez PA, Weinstein JS, Barbeau SJ, et al. Automat-

ing mode detection for travel behaviour analysis by using

global positioning systems enabled mobile phones and

neural networks. IET Intell Transp Sy 2010; 4: 37–49.11. Nitsche P, Widhalm P, Breuss S, et al. A strategy on how to

utilize smartphones for automatically reconstructing trips in

travel surveys. Procd Soc Behv 2012; 48: 1033–1046.12. Calabrese F, Mi D, Lorenzo GD, et al. Understanding

individual mobility patterns from urban sensing data: a

mobile phone trace example. Transport Res C: Emer

2013; 26: 301–313.13. Zhang L, Wang YP, Sun J, et al. The sightseeing bus

schedule optimization under park and ride system in

tourist attractions. Ann Oper Res. Epub ahead of print 7

November 2016. DOI: 10.1007/s10479-016-2364-4.14. Lanza J, Sanchez L, Munoz L, et al. Large-scale mobile

sensing enabled internet-of-things testbed for smart city

services. Int J Distrib Sens N 2015; 2015: 157.15. Bolbol A, Cheng T, Tsapakis I, et al. Inferring hybrid

transportation modes from sparse GPS data using a mov-

ing window SVM classification. Comput Environ Urban

2012; 36: 526–537.16. Asakura Y and Hato E. Tracking survey for individual

travel behaviour using mobile communication instru-

ments. Transport Res C: Emer 2004; 12: 273–291.17. Stenneth L, Wolfson O, Yu PS, et al. Transportation

mode detection using mobile phones and GIS

information. In: Proceedings of the ACM SIGSPATIAL

international symposium on advances in geographic infor-

mation systems (GIS), Chicago, IL, 1–4 November 2011,pp.54–63. New York: ACM.

18. Bierlaire M, Chen J and Newman J. A probabilistic mapmatching method for smartphone GPS data. Transport

Res C: Emer 2013; 26: 78–98.19. Nitsche P, Widhalm P, Breuss S, et al. Supporting large-

scale travel surveys with smartphones—a practicalapproach. Transport Res C: Emer 2014; 43: 212–221.

20. Reddy S, Mun M, Burke J, et al. Using mobile phones todetermine transportation modes. ACM T Sensor Network

2010; 6: 1–27.

21. Hemminki S, Nurmi P and Tarkoma S. Accelerometer-based transportation mode detection on smartphones. In:Proceedings of the 11th ACM conference on embedded net-

worked sensor systems, Roma, 11–15 November 2013,pp.1–14. New York: ACM.

22. Shen L and Stopher PR. Should we change the rules fortrip identification for GPS travel records? In: Proceedingsof the 36th Australasian transport research forum

(ATRF), Brisbane, QLD, Australia, 2–4 October 2013.Washington, DC: TRB.

23. Safi H, Assemi B, Mesbah M, et al. A framework forsmartphone-based travel surveys: an empirical compari-son with alternative methods in New Zealand. In: Pro-ceedings of the 10th international conference on transport

survey methods, Leura, NSW, Australia, 16–21 Novem-ber 2014. Leura: Transportation Research Procedia.

24. Shen Y, Dong H, Jia L, et al. A method of traffic travelstatus segmentation based on position trajectories. In:Proceedings of the IEEE 18th international conference on

intelligent transportation systems, Las Palmas, 15–18 Sep-tember 2015, pp.2877–2882. New York: IEEE.

25. Vij A and Shankari K. When is big data big enough?Implications of using GPS-based surveys for traveldemand analysis. Transport Res C: Emer 2015; 56:

446–462.26. Shen L and Stopher PR. Review of GPS travel survey

and GPS data-processing methods. Transport Rev 2014;34: 316–334.

27. Breiman L. Random forests. Mach Learn 2001; 45: 5–32.28. Feng T and Timmermans HJ. Comparison of advanced

GPS data imputation algorithms for detection of transpor-tation mode and activity episode. In: Proceedings of the

transportation research board 93rd annual meeting, Washing-ton, DC, 12–16 January 2014. Washington, DC: TRB.

29. Xiao G, Juan Z and Zhang C. Travel mode detectionbased on GPS track data and Bayesian networks. Com-

put Environ Urban 2015; 54: 14–22.30. Zhou C, Jia H, Juan Z, et al. A data-driven method for

trip ends identification using large-scale smartphone-based GPS tracking data. IEEE T Intell Transp. Epubahead of print 8 December 2016. DOI: 10.1109/TITS.2016.2630733.

31. McGovern A. Geographic distance and azimuth calcula-tions, http://www.codeguru.com/cpp/cpp/algorithms/arti-cle.php/c5115/Geographic-Distance-and-Azimuth-Calcu-lations.htm (accessed 14 June 2005).

32. Breiman L, Friedman J, Stone CJ, et al. Classificationand regression trees. Boca Raton, FL: CRC Press, 1984.

10 Advances in Mechanical Engineering