U.U.D.M. Project Report 2020:50

Degree project in mathematics, 15 credits. Supervisor: Robin Eriksson, Department of Information Technology. Examiner: Erik Ekström. November 2020

Department of Mathematics
Uppsala University

Predicting Stock Market Price Direction with Uncertainty Using Quantile Regression Forest

Minna Castoe

Abstract

The ability to successfully and accurately forecast the trend in stock market price movements is crucial for both traders and investors, since it influences their future decisions to buy or sell an underlying asset and can therefore yield significant profit. In recent years, Machine Learning algorithms in general, and ensemble learning algorithms in particular, have been shown to generate high prediction accuracy for stock price direction.

However, predictions are commonly made in the form of a conditional mean, but since the market can be seen as stochastic, there is an underlying uncertainty that should be accounted for. The literature shows that Random Forest, which generates the mean prediction, has proved effective in stock price forecasting among all other ensemble learning methods. Hence, we use Random Forest to address the stock prediction problem, as well as a generalization of this model, called Quantile Regression Forest, whose output is not only the mean but the conditional quantiles.

The main contribution of this paper is the study of the Random Forest classifier and the Quantile Regression Forest predictor on the direction of the AAPL stock price over the next 30, 60 and 90 days. The stock prediction problem is formulated both as a classification problem and as a regression problem. The forecasting ability of the Random Forest classifier is assessed using the confusion matrix, from which four parameters are computed: accuracy, precision, sensitivity and specificity. The forecasting ability of Quantile Regression Forest, on the other hand, is assessed using standard indicators such as RMSE and MAPE. Using seven technical indicators and the historical AAPL time series, where all the data available for this company has been used from the day it went public, spanning December 12, 1980 to August 1, 2020, experimental results show that both Random Forest and Quantile Regression Forest accurately predict the direction of the stock market price, with accuracy over 90% for Random Forest and a small error, MAPE between 0.03% and 0.05%, for Quantile Regression Forest.

Keywords: Random Forest Classifier, Quantile Regression Forest, Stock price prediction, Ensemble learning algorithms, Technical indicators, Prediction intervals.

Acknowledgement

My sincerest gratitude goes to my supervisor, Robin Eriksson, who provided me with continual support, guidance and recommendations regarding the topic of my interest. His great effort helped me overcome many of the difficulties that I encountered throughout the progression of this thesis.

I would also like to acknowledge with gratitude the love and support of my family and all my friends; this journey would not have been possible without them.


Contents

List of Figures
List of Tables

1 Introduction
    1.1 Background
    1.2 Problem Formulation
    1.3 Aim
    1.4 Outline
2 Literature Review
3 Data and methodology
    3.1 Experimental design
        3.1.1 Data Description
        3.1.2 Exponential Smoothing
        3.1.3 Technical Indicators
    3.2 Prediction Models
        3.2.1 Decision trees
        3.2.2 Random Forest
        3.2.3 Quantile Regression Forest
    3.3 Data Labelling
    3.4 Model Evaluation Criteria
4 Experimental Results
    4.1 Random Forest Classifier
        4.1.1 Random Forest with α = 0.0095
        4.1.2 Random Forest with α = 0.2
        4.1.3 Random Forest with α = 0.95
    4.2 Quantile Regression Forest
        4.2.1 Prediction Intervals
    4.3 Comparison between RFC and QRF
5 Discussion and Conclusion
    5.1 Conclusion
    5.2 Further Research

Bibliography


List of Figures

1  Daily AAPL closing prices from December 1980 to August 2020
2  Target 30 days
3  Target 60 days
4  Target 90 days
5  Prediction Models
6  Daily exponentially smoothed AAPL closing prices from 1980 to 2020, α = 0.0095
7  OOB error rate, α = 0.0095
8  Daily exponentially smoothed AAPL closing prices from 1980 to 2020, α = 0.2
9  OOB error rate, α = 0.2
10 Daily exponentially smoothed AAPL closing prices from 1980 to 2020, α = 0.95
11 OOB error rate, α = 0.95
12 Mean Square Error, α = 0.0095
13 50% Prediction Interval for 30 days
14 95% Prediction Interval for 30 days
15 50% Prediction Interval for 60 days
16 95% Prediction Interval for 60 days
17 50% Prediction Interval for 90 days
18 95% Prediction Interval for 90 days
19 Variable Importance for RFC and QRF


List of Tables

1  Summary of the literature survey
2  Descriptive statistics of the response variable, α = 0.0095
3  Confusion Matrix for α = 0.0095
4  Results of random forest classifier, α = 0.0095
5  Confusion Matrix for α = 0.2
6  Results of random forest classifier, α = 0.2
7  Confusion Matrix for α = 0.95
8  Results of random forest classifier, α = 0.95
9  Results of Quantile Regression Forest
10 Accuracy with 3 different random splits of the AAPL data set
11 Results from Di (2014)
12 Results from Khaidem et al. (2016)
13 Results obtained using RFC with different alpha
14 Results from Vijh et al. (2020)
15 Results obtained using our model


List of Acronyms

EMH Efficient market hypothesis

ML Machine Learning

SVM Support Vector Machine

MA Moving Average

SVR Support Vector Regression

ANN Artificial Neural Network

LR Logistic Regression

RF Random Forest

RFC Random Forest Classifier

QRF Quantile Regression Forest

AUC-ROC curve Area under the receiver operating characteristics curve

CART Classification and Regression Trees

MACD Moving Average Convergence Divergence

A/D OSC Accumulation/distribution Oscillator

RSI Relative Strength Index

ROC Price Rate of Change

OSCP Price Oscillator

CCI Commodity Channel Index

RMSE Root Mean Squared Error

MAPE Mean Absolute Percentage Error

MAE Mean Absolute Error


1 Introduction

This section provides a short background on the topic, discusses the problem that is investigated in this paper, states the aim, and describes the structure of the paper.

1.1 Background

The trends in stock market price refer to the future upward or downward movements of the price series, also called bull and bear trends. Attempts to successfully and accurately predict the trends in a stock market price or index have spawned a variety of models and methods, two of which have been commonly used: technical and fundamental analysis (Malkiel, 1999).

Fundamental analysis is based on the study of demand and supply. A decrease in demand or an increase in supply tends to reduce the price, while an increase in demand or a decrease in supply will lead to a rise in stock prices (Atiya & Abu-Mostafa, 1996). Technical analysis, however, is performed mainly on a chart; its basis is the pattern in the data. The past pattern of stock price behaviour is assumed to be rich enough in information to determine the future behaviour of a security (Malkiel, 1999; Fama, 1965). This is the main assumption behind several technical theories, also called chartist theories: the past behaviour of stock market prices tends to recur, and hence the history of the stock market price can be used to predict future trends in the price. On the contrary, the random walk theory, which supports the fundamental analysis, contradicts the chartist theories: it holds that the movements of stock prices are random, i.e. a series of independent and identically distributed random variables, and thus the past cannot be used to make a meaningful prediction of the future (Fama, 1965; 1995).

In the 1960s, Fama propounded the theory of the efficient market, which later resulted in him being awarded the Nobel Prize in economic sciences in 2013. The efficient market theory, or hypothesis (EMH), has been one of the most debated investment theories. It is linked with the idea of the random walk, and it states that stock market prices, at any point in time, fully reflect all the available information; since prices fully reflect all known information, changes in stock prices are random and thus unpredictable (Malkiel, 2003). In other words, asset price movements or fluctuations are unpredictable since all the sellers and buyers in the markets have the same information available.

Moreover, a review of several theoretical and empirical studies on the EMH has been provided by Fama (1970), which supported the random walk theory and showed that the financial market is random, and therefore it is hard to predict future trends in the stock market price. Neither technical analysis, which uses past stock prices to predict future prices, nor fundamental analysis, which attempts to analyze financial information, would give higher returns than could be obtained from a randomly selected portfolio of individual stocks (Malkiel, 2003). However, just as there have been many studies supporting this theory, there have also been many that rejected it: stock prices do not follow random walks, and the random walk theory is strongly rejected (MacKinlay & Lo, 1988; 2011).

The rejection of the EMH and the random walk theory, and the lack of consensus on the validity of the EMH, have made the analysis of stock market price movements a challenging and disputed task. This has led to different approaches being used for financial forecasting and stock direction prediction. The approaches commonly used in stock analysis and direction prediction have been classified by Shah et al. (2019) into four categories: statistical, pattern recognition, machine learning (ML) and sentiment analysis. Ballings et al. (2015), however, grouped the methodologies used to predict the behaviour of the stock price into three categories: ML and data mining, technical analysis, and time series forecasting. Khaidem et al. (2016) classified the most used methodologies into four categories, adding the modeling and prediction of stock volatility using differential equations to the three previously mentioned. Also, Masoud (2014) shed light on four approaches that can be used to forecast price trends: fundamental analysis, technical analysis, ML and time series forecasting.

Aside from technical and fundamental analysis, ML and its applications have come to play an integral role in financial analysis. During recent years, ML has had fruitful applications in finance and has become a useful tool for handling the stock market direction prediction problem. ML algorithms became prominent in finance and started to be used for financial forecasting and stock price analysis in the early 1980s (Vachhani et al., 2019). This approach is a broad subgroup of Artificial Intelligence, and besides stock price behaviour prediction and stock analysis, these algorithms have also been used in portfolio optimization, stock betting and credit lending (Vachhani et al., 2019).

ML algorithms, in turn, can be divided into four broad groups: supervised, unsupervised, semi-supervised and reinforcement learning algorithms. Supervised ML consists of two groups of algorithms, regression and classification, whereas unsupervised ML consists of clustering and association algorithms. Unsupervised ML algorithms, particularly clustering algorithms, have been used in finance mainly for financial risk analysis, covering any form of financial risk, e.g. credit risk, investment risk, business risk and operational risk (Kou et al., 2014).

Unlike the unsupervised learning algorithms, which have been used mostly to determine connections in an unconnected or uncorrelated data set, supervised learning algorithms have become useful in providing efficient analysis of stock market prices and trends using historical data (Shah et al., 2019). Supervised ML algorithms have played an important role in financial forecasting and proved effective in predicting future trends of the stock market price. This type of ML algorithm is used to make predictions of many phenomena, ranging from simple to complicated.

1.2 Problem Formulation

The prediction of the direction of stock market prices has become a crucial, highly challenging and controversial task in financial forecasting and analysis. Predicting the trends in stock market prices, i.e. whether a stock price will rise or fall, has been an area of interest for many researchers and investors due to its importance in influencing traders' future decisions to buy or sell an instrument, which could yield significant profit.

The fact that the stock market is fundamentally random, dynamic, nonlinear, complicated, nonparametric and chaotic in nature (MacKinlay & Lo, 1988; Atiya & Abu-Mostafa, 1996; Masoud, 2014; Khaidem et al., 2016) makes the prediction of stock market movements a difficult task for researchers. However, accurate prediction of the trends in stock market prices can help traders and investors adjust their strategies for better trading in the future, increasing the opportunity of gaining profits and reducing the chances of losses, i.e. maximizing profit and minimizing loss. On the other hand, stock market price behaviour is assumed to be affected by different factors, such as economic, political and natural factors, movements of other stock markets, market psychology, traders' expectations and choices, and other unexpected events. Hence, all these factors should be taken into account in financial forecasting and analysis.

Furthermore, the problem of stock direction prediction has been studied both as a regression and as a classification problem. However, treating it as a classification problem gives more accurate results than treating it as a regression problem, because the outcome or response variable in classification is arranged to be binary, i.e. two-class: either upward trends (bull trends) or downward trends (bear trends).
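Framed as code, the binary labelling described above can be sketched as follows. This is an illustrative Python sketch, not the thesis's implementation; the `horizon` parameter stands in for the 30-, 60- or 90-day prediction horizons.

```python
import numpy as np

def label_direction(close, horizon=30):
    """Label each day +1 (up) or -1 (down) depending on whether the
    closing price `horizon` trading days ahead is higher or lower.
    Days whose horizon extends past the end of the series get no label."""
    close = np.asarray(close, dtype=float)
    future = close[horizon:]   # price `horizon` days ahead
    today = close[:-horizon]   # price today
    return np.where(future > today, 1, -1)

# Toy series: mostly rising, so most labels are +1.
prices = [10, 11, 12, 11, 13, 14, 15, 16]
print(label_direction(prices, horizon=2))  # one label per day with a 2-day-ahead price
```

The resulting +1/-1 vector is exactly the two-class response a classifier such as Random Forest expects.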

Classification algorithms, which fall under the supervised machine learning algorithms, have received considerable attention in recent years as efficient predictive techniques and are among the top ML algorithms that can significantly predict future trends of stock market prices. Among the major classification algorithms used to predict the direction of the stock market price, the following are worthy of attention: Support Vector Machine (SVM), Artificial Neural Network (ANN), Logistic Regression (LR), Decision trees and Random Forest (RF).

1.3 Aim

The paper aims to propose and design an efficient method to predict the trend in stock market price movements. Two ML algorithms are used for the purpose of making a significant prediction, with seven technical indicators serving as input variables. The main focus of this paper is long-term prediction, representing 30, 60 and 90 days ahead, respectively. As RF has been shown to generate high forecasting accuracy, we first consider the Random Forest Classifier (RFC), which focuses on the conditional mean of the response variable, and then generalize this model to one whose output is not only the mean but the conditional quantiles; this generalization of the Random Forest model is called Quantile Regression Forest (QRF). In addition, the prediction uncertainty in RF is taken into account via prediction intervals, and the robustness of the two prediction models is evaluated using different measures and parameters.

To the best knowledge of the author, there is no study that deals with the prediction of the stock market price using QRF. RF, however, has been widely used to capture both short- and long-term predictions of stock market price movements.
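The mean-versus-quantiles distinction can be illustrated with a small sketch. This is an assumption-laden toy, not the thesis's code: scikit-learn has no built-in QRF, so the sketch approximates quantiles by the empirical quantiles of the per-tree predictions, whereas Meinshausen's QRF uses the full weighted distribution of training responses in the leaves.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=500)  # noisy target

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

X_new = np.array([[2.0], [5.0]])
# Collect each individual tree's prediction: shape (n_trees, n_points).
per_tree = np.stack([t.predict(X_new) for t in forest.estimators_])

mean = per_tree.mean(axis=0)                 # usual RF point prediction
lo, hi = np.quantile(per_tree, [0.025, 0.975], axis=0)  # crude 95% interval
print(mean, lo, hi)
```

The interval `[lo, hi]` is the kind of prediction interval used later to quantify uncertainty, while `mean` corresponds to the RFC/RF point forecast.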

1.4 Outline

The remainder of the paper is organized as follows: Section 2 contains a literature review of various classification and regression ML algorithms that are related to time series forecasting and have been used to predict trends in stock market prices. Section 3 describes the data set, the pre-processing of the data, the computation of the technical indicators which serve as input variables, the methods and algorithms employed, and the model evaluation criteria. Empirical results from the real data sets are shown in Section 4. Finally, Section 5 discusses and compares the results obtained in this paper with results obtained in other papers, followed by some concluding remarks.


2 Literature Review

This section contains a literature survey of various classification and regression ML algorithms and data mining techniques that have been utilized to predict the trend in stock market price movements. The section also emphasizes the importance of technical indicators in machine learning.

Since the predictive model used for the empirical study in this paper is a type of ML algorithm, reviewing existing related work on ML approaches that have proved effective in stock market price forecasting allows us to conclude that stock market prices are to some extent predictable, and that ML is a useful technique for making significant predictions of stock market price trends, both short- and long-term. The aim of the literature survey is to consider the algorithms most commonly applied to predict the trend of stock market prices, and to justify the selection of the variables used as predictors or input variables in the empirical analysis, while also taking into account the metrics and parameters used to evaluate the robustness and accuracy of these models, as well as the time span of the stock data used for the direction prediction.

There is considerable evidence showing that supervised algorithms, among all others, are effective tools for forecasting stock market trends. Several studies have focused on comparing different prediction algorithms in order to determine the superior algorithm. Ballings et al. (2015) used data from 5767 publicly listed European companies to forecast long-term stock price movements using several models. Ensemble methods, including RF, AdaBoost and Kernel Factory, as well as single-classifier models such as ANN, LR, SVM and K-Nearest Neighbor, were compared, and the results showed that RF was one of the top algorithms, preceded by SVM; LR, in turn, was the inferior algorithm. This study aimed to use different predictor variables to forecast stock price movements one year ahead. These predictor variables were selected based on prior studies, such as cash flow yield, book-to-market ratio and size, stock price index, price-to-earnings ratio, inflation rate and money supply, as well as financial indicators including liquidity indicators (current ratio, collection period of receivables), profitability indicators (ROA, ROE, ROCE) and solvency indicators (gearing ratio, solvency ratio). In addition, the area under the receiver operating characteristics curve (AUC-ROC curve) was used as a performance measurement to check the performance of the models.

Kumar & Thenmozhi (2006) attempted to predict the daily movement direction of the S&P CNX NIFTY Market Index of the National Stock Exchange by applying two different supervised learning algorithms, RF and SVM, to a sample of 1360 trading days. The results obtained in this study were compared to other classification models used in prior studies to predict the direction of stock market prices, e.g. ANN, the Logit Model and Linear Discriminant Analysis.

This study used 12 different technical indicators for short-term prediction: Relative Strength Index (RSI) and stochastic %K, Momentum, Commodity Channel Index (CCI), Price Oscillator (OSCP), 5- and 10-day disparity, Accumulation/distribution Oscillator (A/D OSC), Larry William's %R (William's %R), Price rate-of-change (ROC), Moving Average (MA) and MA of Stochastic (%D), and slow stochastic (Slow %D). The hit ratio was used as a measure of the performance of the models. The experimental results showed that SVM, with a hit ratio of 68.44%, outperforms both RF, with a hit ratio of 67.40%, and all other classification models from the other studies.
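As an example of how such indicators are computed from the raw price series, here is a minimal RSI sketch. This is illustrative only; the exact smoothing (e.g. Wilder's exponential smoothing versus a plain mean) varies between sources and charting packages.

```python
import numpy as np

def rsi(close, period=14):
    """Simple-average RSI over the last `period` days.
    RSI = 100 - 100 / (1 + RS), where RS = average gain / average loss."""
    close = np.asarray(close, dtype=float)
    diff = np.diff(close)
    gains = np.where(diff > 0, diff, 0.0)
    losses = np.where(diff < 0, -diff, 0.0)
    avg_gain = gains[-period:].mean()
    avg_loss = losses[-period:].mean()
    if avg_loss == 0:          # only up-moves: RSI saturates at 100
        return 100.0
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)

# A steadily rising series gives RSI near 100, a falling one near 0.
print(rsi(np.linspace(1, 30, 30)))   # → 100.0
print(rsi(np.linspace(30, 1, 30)))   # → 0.0
```

Values above roughly 70 are conventionally read as overbought and below 30 as oversold, which is what makes RSI useful as a direction feature.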

Another study, by Ou & Wang (2009), explored the predictive ability of ten different data mining and ML algorithms in predicting stock price movements of the Hang Seng index of the Hong Kong stock market. Among the ten approaches, which included Linear and Quadratic Discriminant Analysis, K-Nearest Neighbor, Naïve Bayes, the Logit model, tree-based classification, Neural Networks, SVM, Least Squares SVM and Bayesian classification with a Gaussian process, the results showed that SVM produces the best predictive performance for in-sample prediction, whereas LS-SVM is best for out-of-sample prediction, in terms of the hit ratio and error rate criteria, both of which were used as performance measurements. Instead of technical indicators, stock trading data were used as predictors, including high and low price, the closing price of the S&P 500 index and the exchange rate between the Hong Kong dollar and the US dollar, with a sample of 1732 trading days.

Similar work has been done by Zhang & Dai (2013), where four ML algorithms were compared in order to determine the most effective algorithm for predicting both short- and long-term stock price trends. Using 16 features, including PE ratio, PX Ebitda, current enterprise value, PX volume, 10-day volatility, 2-day net price change, 10- and 50-day MA, alpha overridable, quick ratio, alpha for beta pm, beta row overridable, IS EPS, risk premium and the corresponding S&P 500 index, the results showed that SVM has the highest prediction accuracy (79.3%) in the long-term case (44 days), compared to LR, Quadratic Discriminant Analysis and Gaussian Discriminant Analysis, with a data sample of 1471 trading days for 3M stock. However, the short-term prediction, representing a one-day or one-week horizon, showed very low accuracy.

Rodriguez, P.N. & Rodriguez, A. (2004) predicted the short-term movement of stock market prices. A comparison of different ML algorithms was performed: seven different classification algorithms were applied to predict the daily movements of the stock prices of three large emerging-market stock indices, the IPC (Mexico), Bovespa (Brazil) and KLSE Composite (Malaysia), over a sample period from January 1990 to December 2003. The technical indicators used in this paper are: 1- and 2-day ROC, 4-day Momentum, 14-day RSI and Stochastic, OSCP, and 14- and 21-day Disparity. RF was one of the best classification models for predicting the direction of the stock market among all the models considered, including the LR model, a Neural Network model, the Gradient Boosting Machine, a tree-based model and PolyClass; the AUC-ROC curve was used to evaluate the performance of the models.

The short-term prediction of stock price trends has also been the main focus of Di (2014). Di focuses on the stock price trend in the near future, 1 to 10 days, by applying an SVM classifier to three well-known stocks, AAPL, MSFT and AMZN, and two market indexes, NASDAQ and S&P 500. In addition, 12 main technical indicators were used as features, applied to time series data covering a set of 4 years, from January 2010 to December 2014. Using 5-fold cross-validation, the results showed an accuracy over 70% in predicting the 3- to 10-day average price trend and 56% in predicting the next-day price trend.


Milosevic (2016), in turn, applied ML algorithms to predict the long-term movement of stock market prices. The long-term prediction in this paper represented a one-year-ahead prediction for a total of 1739 stocks selected from various indexes such as the S&P 1000, FTSE 100 and S&P Europe 350. Some of the stocks were discarded in order to balance the data set; thus, the data set ended up with 1298 data rows, 649 labeled as good stocks and the other 649 labeled as bad stocks. Good stocks are those whose price is at least 10% higher over a one-year period; the rest are classified as bad stocks. Using eight different ML algorithms (C4.5 decision trees, SVM with sequential minimal optimization, JRip, LR, Naive Bayes, Bayesian Networks, Random trees and RF) and 28 different financial indicators, the results showed that the best-performing algorithm is RF, with precision, recall and F-score of 75.1%. However, the performance of RF increased when the number of features was reduced from 28 to 11 financial indicators; the precision, recall and F-score obtained with 11 financial indicators is 76.5%.

The effectiveness of RF in forecasting the direction of stock market prices has also been confirmed by Khaidem et al. (2016). RF was used as the main predictive model in this paper and was applied to different stocks: AAPL and GE, which are both listed on Nasdaq, and Samsung Electronics Co. Ltd., which is traded on the Korean Stock Exchange. The robustness of the model was evaluated by considering four parameters (accuracy, precision, recall and specificity) and by plotting the ROC curve. With a high accuracy, in the range 85%-95% for long-term prediction, the authors presented RF as one of the most effective prediction models for predicting the movement of stock market prices. The technical indicators used in this study are: RSI, Stochastic Oscillator %K, Williams %R, Moving Average Convergence Divergence (MACD), ROC and On Balance Volume. In addition, Khaidem et al. (2016) suggested exponentially smoothing the historical stock data to improve the model's capability to produce better results and higher accuracy.
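Exponential smoothing of the price series, as suggested by Khaidem et al. (2016) and used later in this thesis with α = 0.0095, 0.2 and 0.95, can be sketched directly from its recursion (illustrative code, not the thesis's implementation):

```python
import numpy as np

def exp_smooth(prices, alpha):
    """Exponential smoothing: s[0] = p[0]; s[t] = alpha*p[t] + (1-alpha)*s[t-1].
    A small alpha (e.g. 0.0095) smooths heavily, suppressing day-to-day noise;
    alpha near 1 leaves the series almost unchanged."""
    prices = np.asarray(prices, dtype=float)
    s = np.empty(len(prices))
    s[0] = prices[0]
    for t in range(1, len(prices)):
        s[t] = alpha * prices[t] + (1 - alpha) * s[t - 1]
    return s

noisy = [10, 14, 9, 15, 8, 16]
print(exp_smooth(noisy, 0.2))    # heavily smoothed
print(exp_smooth(noisy, 0.95))   # tracks the raw series closely
```

Smoothing before computing the technical indicators reduces the influence of one-day spikes on the labels the classifier has to learn.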

On the contrary, Vijh et al. (2020) showed that a comparison between two ML techniques, ANN and RF, indicates that ANN is the better technique for predicting the next-day closing price of a stock. This conclusion was drawn based on the Root Mean Square Error (RMSE), Mean Absolute Percentage Error (MAPE) and Mean Bias Error (MBE), three parameters used to evaluate the robustness of the models.
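These three evaluation measures follow directly from their definitions; a small sketch (illustrative code; note that the sign convention for MBE, here prediction minus truth, varies between papers):

```python
import numpy as np

def rmse(y, yhat):
    """Root Mean Square Error: sqrt of the mean squared residual."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

def mape(y, yhat):
    """Mean Absolute Percentage Error, in percent (undefined if any y == 0)."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    return float(np.mean(np.abs((y - yhat) / y)) * 100)

def mbe(y, yhat):
    """Mean Bias Error: average signed error (prediction minus truth)."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    return float(np.mean(yhat - y))

y_true = [100.0, 102.0, 101.0, 105.0]
y_pred = [101.0, 101.0, 102.0, 104.0]
print(rmse(y_true, y_pred), mape(y_true, y_pred), mbe(y_true, y_pred))
```

RMSE and MAPE measure the size of the errors, while MBE reveals whether a model systematically over- or under-predicts.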

The two models were applied to five companies from different sectors: Nike, JP Morgan Chase and Co, Johnson and Johnson, Goldman Sachs and Pfizer, with a data set covering ten years, from 4-5-2009 to 4-5-2019. The prediction of the stock closing price was performed using six different variables, including some technical indicators: stock high minus low price (H-L), stock close minus open price (C-O), 7-, 14- and 21-day MA, and the past 7-day standard deviation (7-day STD DEV).

In addition to RF, another ensemble learning algorithm that has been commonly used to predict the direction of the stock market price is Xtreme Gradient Boosting, which has been classified as a non-metric classifier model. Dey et al. (2016) applied this model to the historical data of two stocks, Apple Inc. and Yahoo! Inc., to examine short- and long-term prediction of the direction of the stock market price. Using six technical indicators (RSI, Stochastic Oscillator %K, Williams %R, MACD, ROC and On Balance Volume), the results showed that the Xtreme Gradient Boosting model gives better results than other forecasting models used in the literature, with accuracy over 87% for long-term prediction. This model proved to be much better than traditional non-ensemble algorithms, and also than the metric and non-metric classifiers, in terms of accuracy. The historical data was also smoothed exponentially, and accuracy, precision, recall and specificity, as well as the RMSE and the ROC curve, were used to evaluate the robustness of the model.

Concerning ensemble learning algorithms, two tree-based classifiers have also been compared to non-ensemble models by Basak et al. (2019). In that paper, Xtreme Gradient Boosting and RF were used for the purpose of predicting the direction of the stock market price. The same technical indicators were used in this study, and the results, based on ten different companies (AAPL, AMS, AMZN, FB, MSFT, NKE, SNE, TATA, TWTR and TYO), showed that both models give an effective prediction of the direction of the stock market price, with RF outperforming Xtreme Gradient Boosting with extremely high accuracy, over 90%, for medium- to long-term prediction. In addition, exponential smoothing was also used here, and the efficacy of the two models was evaluated using different parameters, including F-score, accuracy, precision, recall and specificity.

The prediction of the stock market index has also been studied by Patel et al. (2014a, 2014b) in two different papers. The first paper focused on predicting the direction of movement of stocks and stock market indices. Ten years of historical data, from Jan 2003 to Dec 2012, were used for two stocks chosen from the Indian stock markets, Infosys Ltd. and Reliance Industries, and two stock price indices, CNX Nifty and S&P Bombay Stock Exchange (BSE) Sensex. Ten different technical indicators were used as predictors: simple 10-day MA, weighted 10-day MA, momentum, Stochastic %K, Stochastic %D, RSI, MACD, Larry William's %R, A/D OSC and CCI. The study used two different approaches to the input variables: the first approach used the technical indicators computed from the stock trading data, while the second approach treated these technical indicators as trend deterministic data. A comparison of four prediction algorithms, ANN, SVM, RF and Naïve-Bayes, under these two approaches showed that RF provided better predictions than the other models in the first approach, whereas the second approach improved the results obtained in the first approach for all four prediction models. The performance of each model was evaluated using accuracy and F-measure.

The second paper used the same ten years of historical data for the same two indices from the Indian stock markets, CNX Nifty and S&P Bombay Stock Exchange (BSE) Sensex. The same technical indicators were used, and the prediction models were divided into two approaches: a single-stage approach, where each of the three models is used on its own, and a two-stage fusion approach, which combines ANN, RF and Support Vector Regression (SVR) into the SVR-ANN, SVR-RF and SVR-SVR fusion prediction models. The evaluation measures used to assess the performance of these prediction models were MAPE, Mean Absolute Error (MAE), Mean Squared Error (MSE) and relative Root Mean Squared Error (rRMSE). The predictions were made for 1-10, 15 and 30 days respectively, and the results showed that SVR-ANN performs best overall for both stock market indices.

On the other hand, ANN has received much attention in forecasting the direction of stock market prices and has been used widely for predicting stock price movements. Senol & Ozturan (2009) applied the ANN model to historical data of 27 different stocks from the Istanbul Stock Exchange (ISE), with an average of 2250 trading days, and with five different technical indicators: 14-day Stochastic %K, 14- and 37-day MA, Stochastic MA %D and 14-day RSI. Different configurations of the ANN model with different technical indicators were tested, where the technical indicators were divided into seven different prediction systems. The prediction model was also compared to the LR model, and the results showed that the ANN model outperforms the LR model. Moreover, the ANN model with three technical indicators, 14-day RSI, 14-day Stochastic and Stochastic MA, gives the best results with the lowest average MSE.

While the study by Senol & Ozturan (2009) focused on the movements of the Turkish stock market, a similar study by Masoud (2014) focused on the Libyan stock market. The statistical performance of the ANN model, using measures such as MAE, MSE, RMSE, R-squared (R2) and MAPE, as well as its financial performance, using the prediction rate (PR), was estimated in order to evaluate the forecasting ability and the accuracy of the model. Using a sample of 763 trading days and a mixture of 12 different technical and fundamental indicators based on previous studies, including A/D OSC, CCI, Larry William's %R, MACD, Momentum, ROC, RSI, 10-day simple and weighted MA, Stochastic %K, MA of %K (Stochastic %D) and MA of %D (Stochastic slow %D), the results showed that the ANN model provides a significant prediction of the movements of the stock market price, with an average prediction rate of 91%.

A third study, by Qiu & Song (2016), also confirmed that the ANN model is effective in predicting stock price direction. This study optimized the ANN model using genetic algorithms (GA) to obtain better results and higher prediction accuracy. The model was applied to the most widely used market index (Nikkei 225) in order to predict the direction of the next day's price of the Japanese stock market index, using a sample of 1707 trading days. In addition, two different sets of predictor variables were compared: the first set included 13 technical indicators: Momentum, Larry William's %R, RSI, A/D OSC, CCI, ROC, Stochastic %K, MA of %K (Stochastic %D), MA of %D (Stochastic slow %D), OSCP and 5- and 10-day Disparity, whereas the second set included only 9 of these 13 technical indicators. The results, using the hit ratio to evaluate the prediction performance of the model, showed that the second set of technical indicators gives better results than the first, with a hit ratio of 81.27%. The ANN model was also compared to different models from previous studies, and it was shown that the ANN model in this paper had higher prediction accuracy than the other models. Below we summarize the literature survey in a table that lists the work done by each author.


Table 1: Summary of the literature survey

Author | Prediction Method | Features | Performance Measurements
Rodriguez, PN. & Rodriguez, A. (2004) | RF, LR, NN, Gradient Boosting Machine, tree-based model and PolyClass | 8 lagged technical indicators | AUC-ROC curve
Kumar & Thenmozhi (2006) | RF and SVM | 12 technical indicators | Hit Ratio (68.44)
Senol & Ozturan (2009) | ANN | 5 technical indicators | MSE
Ou & Wang (2009) | Linear and Quadratic discriminant analysis, K-Nearest Neighbor, Naïve Bayes, LM, tree-based classification, NN, SVM, LS-SVM and Bayesian with Gaussian process | 5 stock trading data | Hit Ratio and Error Rate Criteria
Zhang & Dai (2013) | Gaussian Discriminant Analysis, Quadratic discriminant analysis, LR and SVM | 16 features | Accuracy (SVM = 79.3%)
Masoud (2014) | ANN | 12 technical and fundamental indicators | MAE, MSE, RMSE, R-squared and MAPE
Patel et al. (2014a) | ANN, SVM, RF and Naïve Bayes | 10 technical indicators | Accuracy and F-measure
Patel et al. (2014b) | ANN, RF and SVR | 10 technical indicators | MAPE, MAE, MSE and rRMSE
Di (2014) | SVM | 12 technical indicators | 5-fold cross-validation (Accuracy)
Ballings et al. (2015) | RF, AdaBoost, KF, NN, LR, SVM, K-Nearest Neighbor | Financial, profitability and solvency indicators | AUC-ROC curve
Milosevic (2016) | RF, C4.5, SVM, JRip, LR, Naive Bayes, Bayesian Networks, Random trees | 28 and 11 financial indicators | Precision, Recall and F-score
Dey et al. (2016) | Xtreme Gradient Boosting | 6 technical indicators | Accuracy, Precision, Recall, Specificity, RMSE and ROC curve
Qiu & Song (2016) | ANN | 13 and 9 technical indicators | Hit Ratio (81.27)
Khaidem et al. (2016) | RF | 6 technical indicators | Accuracy, Precision, Recall, Specificity and ROC curve
Basak et al. (2019) | RF and Xtreme Gradient Boosting | 6 technical indicators | Accuracy, Precision, Recall, Specificity and F-score
Vijh et al. (2020) | ANN and RF | 6 features | RMSE, MAPE and MBE


3 Data and methodology

This section describes the data set used, presents the technical indicators, with their formulas, which serve as input variables, shows how the data is labeled, and finally discusses the prediction models and their evaluation criteria.

3.1 Experimental design

3.1.1 Data Description

This study is based on the historical data of one company, Apple Inc. All of the data available for this company has been used, starting from the day it went public, with a time span ranging from December 12, 1980 to August 1, 2020. The data sample is obtained from Yahoo Finance and consists of the daily closing index levels, with a total of 9,993 trading days. The data was split into two sets: 80% of the data is used as in-sample data, i.e., the training data set, and the remaining 20% is considered out-of-sample data, or the testing data set. The training data set is used to train the prediction model, whereas the testing data set is used for the evaluation of the trained model. Moreover, the historical data is first exponentially smoothed, and then the technical indicators are extracted from the daily closing index levels. The AAPL daily closing prices are plotted in Fig 1 below.


Figure 1: Daily AAPL closing prices from December 1980 to August 2020


3.1.2 Exponential Smoothing

Time series data is first exponentially smoothed. Exponential smoothing is a way to smooth out time series data; it is used to smooth univariate data, which contains a single variable. The main purpose of using this technique in this paper is to remove the noise and random variation from the historical data, allowing the prediction model to more easily determine the price trend in stock market price behaviour, for short-term as well as longer-term prediction.

Unlike averaging methods, e.g. simple averages and moving averages, which apply equal weights to the historical data, exponential smoothing methods apply an unequal set of weights to the historical data. These weights decrease exponentially from the most recent to the most distant observations, which is why these methods are known as exponential smoothing methods. The most recent observations are assumed to be more relevant and are therefore given more priority and larger weights.

Exponential smoothing in turn includes different methods, such as Holt's linear method, Holt-Winters' method and Pegels' classification. The simplest exponential smoothing method, however, is single exponential smoothing (SES), which can be applied as soon as two observations are available. This type of exponential smoothing is used in this paper. The smoothed statistic for the next period of a series Y is calculated using the following formula:

S_{t+1} = S_t + α(Y_t − S_t)

which can also be written as

S_{t+1} = αY_t + (1 − α)S_t

with initial value

S_0 = Y_0

where
t is the time period (t > 0)
Y_t is the actual observation
S_t is the smoothed statistic
α is the smoothing factor, a constant between 0 and 1. The closer α is to zero, the slower the smoothing; a larger α reduces the level of smoothing, and α = 1 means that the smoothed statistic equals the actual observation.
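The recursion above can be sketched in a few lines of Python (a minimal illustration under our own naming; the thesis itself carries out its computations in R):

```python
def exponential_smoothing(series, alpha):
    """Single exponential smoothing: S_{t+1} = alpha*Y_t + (1 - alpha)*S_t,
    with S_0 = Y_0. Returns the smoothed series, same length as the input."""
    smoothed = [series[0]]                      # S_0 = Y_0
    for t in range(len(series) - 1):
        smoothed.append(alpha * series[t] + (1 - alpha) * smoothed[t])
    return smoothed
```

With α close to 1 the smoothed series closely tracks the (lagged) observations, while α close to 0 keeps it near the initial value.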

3.1.3 Technical Indicators

Technical indicators are useful tools that support technical analysis. They help investors make decisions about buying and selling stocks, and thus create a better understanding of the price action by indicating which stocks to buy, which to sell and, more importantly, when to do so.

The efficiency of technical indicators in analyzing future trends has been agreed upon by many investors and financial managers. Technical indicators and their corresponding parameters are exploited by investors to check for bullish and bearish signals, which can further help investors make decisions regarding entry to and exit from the market.

The two main types of technical indicators are lagging and leading indicators. Lagging indicators, also called trend-following indicators, are those that follow the price action and thus move after prices move, whereas leading indicators are those that change before prices change and therefore lead price movements (Larson, 2012). Technical indicators can also be grouped by function into four important types: trend, momentum, volume and volatility indicators.

In light of prior studies, technical indicators have been used as input variables in the construction of the prediction model to predict the direction of stock market prices. The feature selection in this paper is thus based on the most used technical indicators that produced significant results with the RF technique and the other ensemble learning techniques used as prediction models in prior studies. Descriptions of the technical indicators used in this paper, as well as their formulas, are given below.

• Moving Average Convergence Divergence
The moving average convergence divergence (MACD) is a trend-following momentum indicator that helps investors understand whether a bearish or bullish movement in prices is becoming stronger or weaker. This indicator was developed by Gerald Appel; it turns two moving averages of prices into a momentum indicator by comparing them and subtracting one from the other, thus showing the relationship between them. It is computed by subtracting the 26-day exponential moving average of a security's prices (the longer moving average) from the 12-day exponential moving average (the shorter one). The line obtained from this calculation is called the MACD line, and the 9-day exponential moving average of the MACD line is called the signal line, which can act as a trigger for buy and sell signals. MACD indicates a buy signal whenever it is above the signal line and a sell signal whenever it is below the signal line. The formula for calculating MACD is as follows:

MACD = EMA12(C) − EMA26(C)

SL = EMA9(MACD)

where
MACD stands for moving average convergence divergence, or the MACD line, and SL stands for the signal line
EMAn = n-day exponential moving average
C = closing price

• Relative Strength Index
The relative strength index (RSI) is a popular momentum oscillator that was developed by J. Welles Wilder. It evaluates overbought and oversold conditions in stock prices by measuring the extent of recent price changes. The RSI compares a stock's average gains and losses over a specific period of time, typically 14 trading days. RSI ranges between 0 and 100; traditionally, an RSI above 70 indicates that the stock is overbought, while an RSI below 30 indicates that the stock is oversold. In this paper, we use a 27-day time-frame to calculate the RSI. The formula for calculating RSI is:

RSI = 100 − 100 / (1 + RS)

RS = (average gain over past 27 days) / (average loss over past 27 days)

where RSI stands for relative strength index and RS stands for relative strength.

• Price Rate of Change
The price rate of change (ROC) is another momentum oscillator; it calculates the percent change in price between the current price and the price n periods ago. In other words, ROC measures the change of the current price with respect to the closing price n days ago. It moves from positive to negative, fluctuating above and below the zero line. This oscillator can be used for determining overbought and oversold conditions, divergences and zero-line crossovers. We use a 21-day time-frame to calculate the ROC. The formula for calculating ROC is as follows:

ROC = ((C_t − C_{t−21}) / C_{t−21}) × 100

where
ROC stands for the price rate of change at time t
C_t = closing price at time t
C_{t−21} = closing price 21 periods ago

• Stochastic Oscillator
The stochastic oscillator, often denoted %K, is a momentum oscillator that was developed by George Lane. It identifies the location of the stock's closing price relative to the high-low range of the stock's price over a period of time, typically 14 trading days. The stochastic oscillator varies from 0 to 100; a reading above 80 generally represents overbought, while a reading below 20 represents oversold. We use a 14-day time-frame for %K. The formula for calculating the stochastic oscillator is given below:

%K = ((C_t − L14) / (H14 − L14)) × 100

where
C_t = the current closing price
L14 = lowest low over the past 14 days
H14 = highest high over the past 14 days


• Williams Percentage Range
The Williams percentage range, also called Williams %R, is a common indicator developed by Larry Williams. Often denoted %R, it measures overbought and oversold levels and works inversely to %K: whilst %K ranges between 0 and 100, %R ranges between 0 and -100. A Williams %R below -80 indicates a buy signal, whereas a Williams %R above -20 indicates a sell signal. We also use a 14-day time-frame for %R. The formula used to calculate the Williams %R is:

%R = ((H14 − C_t) / (H14 − L14)) × (−100)

where
C_t = the current closing price
L14 = lowest low over the past 14 days
H14 = highest high over the past 14 days

• Commodity Channel Index
The Commodity Channel Index (CCI) was developed by Donald Lambert. It is a useful oscillator that is used to estimate the direction and strength of the stock price trend, and also to determine when stock prices reach overbought or oversold conditions. The CCI is calculated by first determining the difference between the typical (mean) price of a stock and the average of these typical prices, and then comparing this difference to the average deviation over a period of time, typically 20 days. The CCI is scaled by an inverse factor of 0.015. The formula used to calculate the CCI is:

CCI = (Typical price − MA20) / (0.015 × D)

where
Typical price = average of the high, low and close prices: (H + L + C) / 3
MA20 = 20-day simple moving average of the typical price
D = mean deviation

• Disparity Index
The Disparity Index (DIX) is another useful indicator commonly used in technical analysis. Developed by Steve Nison, it is a momentum indicator that compares the stock's current price with its moving average (MA) over a particular time period. A DIX below 0 indicates that the stock's current price is below the n-day MA, a DIX above 0 indicates that it is above the n-day MA, and a DIX equal to 0 indicates that the current price equals the n-day MA. A 14-day MA is used in this paper. The formula for calculating the DIX with a 14-day MA is as follows:

DIX = ((C_t − MA14) / MA14) × 100

where
C_t = current stock price
MA14 = moving average over 14 days
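The indicator formulas above translate directly into code. The sketch below implements all seven in plain Python (function names, list-based inputs and the common EMA smoothing factor 2/(n+1) are our own choices; the thesis computes the indicators in R):

```python
def ema(prices, n):
    """n-day exponential moving average, smoothing factor 2/(n+1)."""
    k = 2 / (n + 1)
    out = [prices[0]]
    for p in prices[1:]:
        out.append(k * p + (1 - k) * out[-1])
    return out

def macd(prices):
    """MACD line = EMA12 - EMA26; signal line = 9-day EMA of the MACD line."""
    line = [a - b for a, b in zip(ema(prices, 12), ema(prices, 26))]
    return line, ema(line, 9)

def rsi(prices, n=27):
    """RSI = 100 - 100/(1 + RS), RS = average gain / average loss over n days."""
    changes = [b - a for a, b in zip(prices[:-1], prices[1:])]
    gains = [max(c, 0.0) for c in changes[-n:]]
    losses = [max(-c, 0.0) for c in changes[-n:]]
    avg_gain, avg_loss = sum(gains) / n, sum(losses) / n
    return 100.0 if avg_loss == 0 else 100 - 100 / (1 + avg_gain / avg_loss)

def roc(prices, n=21):
    """Percent change between the current close and the close n periods ago."""
    return (prices[-1] - prices[-1 - n]) / prices[-1 - n] * 100

def stochastic_k(closes, highs, lows, n=14):
    """%K: position of the latest close within the n-day high-low range."""
    h, l = max(highs[-n:]), min(lows[-n:])
    return (closes[-1] - l) / (h - l) * 100

def williams_r(closes, highs, lows, n=14):
    """%R: like %K but inverted, ranging from 0 to -100."""
    h, l = max(highs[-n:]), min(lows[-n:])
    return (h - closes[-1]) / (h - l) * -100

def cci(highs, lows, closes, n=20):
    """CCI: deviation of the typical price from its n-day mean, scaled by 0.015."""
    tp = [(h + l + c) / 3 for h, l, c in zip(highs, lows, closes)][-n:]
    ma = sum(tp) / n
    d = sum(abs(x - ma) for x in tp) / n          # mean deviation
    return (tp[-1] - ma) / (0.015 * d)

def disparity_index(closes, n=14):
    """DIX: percent distance of the current close from its n-day simple MA."""
    ma = sum(closes[-n:]) / n
    return (closes[-1] - ma) / ma * 100
```

Each function consumes daily series ordered oldest first and returns the indicator value for the most recent day; the MACD helper returns full series.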


3.2 Prediction Models

In order to predict the trend in stock market price movements, we use ensemble learning algorithms. Ensemble learning algorithms, or techniques, combine several machine learning algorithms into one predictive model in order to produce better predictive performance than could be obtained from any single model. The main goal of using these techniques is to deal with modeling issues related to time series forecasting and hence improve the stability and accuracy of the machine learning algorithm, producing better results as well. Furthermore, the main factors assumed to cause error in machine learning models are variance, bias and noise; ensemble learning algorithms can be used to handle the over-fitting issue and thus improve the algorithm by minimizing all of these factors.

Time series forecasting problems can be formulated as classification problems or regression problems. The difference between the two is that in regression predictive modeling a quantity is predicted, i.e. the regression algorithm produces a numerical or continuous output variable, whereas in classification predictive modeling a category is predicted, i.e. the output variable is discrete or categorical.

We focus on long-term rather than short-term prediction, since higher predictive accuracy for long-term prediction has been obtained in prior studies. Two ensemble learning algorithms are used in order to make a significant long-term forecast of the direction of the stock market price: RFC and QRF.

Since QRF is a generalization of RF, and RF in its turn is an ensemble learning algorithmthat is constructed from decision trees and used to improve the accuracy of the decision trees,it is worth considering the framework under which both decision trees and RF operate. Theprediction models employed in this paper are described in the following subsections.

3.2.1 Decision trees

Classification and regression problems can be represented in the form of a tree structure called a decision tree. The decision tree algorithm (Quinlan, 1986), also called classification and regression tree (CART) in computer science, is a type of machine learning algorithm used for classification problems as well as regression problems, both of which belong to the family of supervised learning algorithms.

The basic idea behind CART is to start with a root node containing the entire data set. The data is then split into two or more mutually exclusive child nodes depending on different classes, each child node is in turn split into grandchild nodes, and so on.

The trees descended from the root node are called sub-trees. Each node in the decision tree acts as a test case for some attribute, and each sub-node descending from a node corresponds to a possible response to the test case. A node that provides the classification of the output variable and does not split further is called a leaf or terminal node, while a node that is split into further sub-nodes is called an internal or decision node. CART deals with a number of parameters, called predictors or input variables; at each node, the final decision is reached by splitting the data depending on the response to the particular question asked about each parameter.

Furthermore, decision tree learning can fit the training data to such an extent that the performance of the model is affected negatively, producing insignificant results and causing over-fitting. The more the data is split, the higher the risk of over-fitting, which explains why the accuracy of the decision tree algorithm is quite low. Hence, RF was introduced as an ensemble learning extension of the decision tree to deal with this problem and to give more effective results than those obtained from a single tree.

3.2.2 Random Forest

The fact that decision trees have high variance, which causes over-fitting, motivated the search for a model in which the variance can be reduced. Random forest (Breiman, 2001) falls under the category of ensemble learning algorithms; it builds multiple decision trees, collectively called a forest. The name Random Forest comes from the fact that the model is a forest of randomly created decision trees, used to overcome the over-fitting that often occurs when using a single decision tree.

The primary difference between a decision tree and RF is that in decision tree learning the entire data set is used to construct a single tree containing all the parameters, or predictors; that is, the entire training data set is considered the root. RF instead selects a random subset of the predictors and builds a decision tree for each selected subset, i.e. RF is not built on the entire data at once; each decision tree is built on a part of the data, which is recursively split into partitions.

The final outcome of the RF is reached by combining the outcomes of the multiple randomly created decision trees. For regression, the average of the outcomes of all the decision trees in the forest is taken, based on the respective parameters used in each tree. For classification, each tree votes for a class, and the class that receives the most votes is defined as the predicted class. Furthermore, the more trees in the forest, the higher the accuracy and the more effective the results obtained.
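The two aggregation rules can be sketched as follows (a toy illustration, under our own naming, of only the voting and averaging step; it does not cover tree construction):

```python
from collections import Counter

def rf_classify(tree_votes):
    """Classification forest: the class receiving the most tree votes wins."""
    return Counter(tree_votes).most_common(1)[0][0]

def rf_regress(tree_predictions):
    """Regression forest: the final prediction is the mean of the tree outputs."""
    return sum(tree_predictions) / len(tree_predictions)
```

For example, a forest whose trees vote [1, -1, 1, 1, -1] predicts the "up" class 1.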

3.2.3 Quantile Regression Forest

Quantiles, in general, refer to dividing a sample or probability distribution into equal-sized subgroups, where each subgroup contains the same fraction of the total population. In other words, they divide the range of a continuous random variable into subgroups of equal probability; quantiles can therefore be considered the points or values that describe the location of the distribution. The most used quantile is the median, which corresponds to the 50th percentile or the 0.5-quantile. Similarly, the 0.25 and 0.75 quantiles, or 25th and 75th percentiles, are called the first and third quartiles respectively; the 0.2, 0.4, 0.6 and 0.8-quantiles, corresponding to the 20th, 40th, 60th and 80th percentiles, are called quintiles; and the 0.1, 0.2, . . . , 0.9 quantiles, corresponding to the 10th, 20th, . . . , 90th percentiles, are known as deciles.

Quantile regression (Koenker & Bassett, 1978; Koenker & Hallock, 2001) is an extension of the classical least-squares model. It is used whenever simple linear regression cannot be applied to study the effect of the predictor variables on a specific response variable, or, in other words, whenever the response variable has a non-linear relationship with the predictor variables. It also helps to build an understanding of outcomes that do not follow a normal distribution, i.e. when the response variable is not normally distributed.

Quantile regression gives a more detailed overview of the relationship between the dependent variable and the independent variables. It approximates the whole conditional distribution of the response variable, and thus allows us to understand the relationship between these variables from another perspective than the conditional mean of the outcomes, which simple linear regression focuses on.

Turning our attention to quantile regression models that have recently been used for classification and regression problems based on decision trees and RF, quantile regression forest (Meinshausen, 2006) is a nonparametric regression method that generalizes the RF algorithm and estimates the conditional median and other quantiles of the outcome instead of restricting attention to the conditional mean only.

Quantile regression forest operates in the same manner as RF. However, QRF differs from RF in the final outcome reached from the multiple decision trees: QRF keeps the full distribution of the dependent variable at each terminal node and, instead of focusing only on the conditional mean as in the RF case, i.e. the center of the conditional distribution, deals with the center of the conditional distribution as well as its upper and lower tails.

Letting Y be a real-valued response variable and X a predictor variable, the prediction of a single tree of RF for a new data point X = x is the estimate μ̂(x) of the conditional mean of the response variable Y given X = x, which is defined as:

E(Y | X = x)

Random forest then estimates the conditional mean by averaging the predictions of m single trees, each constructed with an independent and identically distributed set of the independent variables. The prediction of random forest is given by:

μ̂(x) = ∑_{i=1}^{n} w_i(x) Y_i

However, for X = x, the conditional distribution function F(y | X = x) is given by the probability that Y is smaller than or equal to y ∈ R, i.e.

F(y | X = x) = P(Y ≤ y | X = x)

Quantile regression forest then estimates the conditional distribution function of the response variable Y, given X = x, by:

F̂(y | X = x) = ∑_{i=1}^{n} w_i(x) 1{Y_i ≤ y}

Now the τ-th quantile Q_τ(x), i.e. the estimate of the conditional quantile for any τ with 0 < τ < 1, can be calculated as:

Q_τ(x) = inf{y : F̂(y | X = x) ≥ τ}


In addition to reporting the whole distribution of the response variable, QRF can also be used to build prediction intervals. Prediction intervals can be constructed from the conditional quantiles of the response variable predicted by QRF. More specifically, the (1 − τ) × 100% prediction interval is constructed using the following formula:

I(x) = [q_{τ/2}(Y | X = x), q_{1−τ/2}(Y | X = x)]
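Given the forest weights w_i(x), the quantile and interval estimates above reduce to reading off a weighted empirical CDF. A minimal sketch (function names are our own; this illustrates the definitions only, not Meinshausen's full algorithm):

```python
def qrf_quantile(ys, weights, tau):
    """tau-quantile estimate: smallest y with weighted CDF F_hat(y) >= tau.
    weights[i] is the forest weight w_i(x) of training response ys[i] for the
    query point x; the weights are assumed to sum to one."""
    cdf = 0.0
    for y, w in sorted(zip(ys, weights)):
        cdf += w
        if cdf >= tau:
            return y
    return max(ys)

def qrf_interval(ys, weights, tau):
    """(1 - tau) * 100% prediction interval [Q_{tau/2}(x), Q_{1-tau/2}(x)]."""
    return (qrf_quantile(ys, weights, tau / 2),
            qrf_quantile(ys, weights, 1 - tau / 2))
```

With equal weights over the responses [1, 2, 3, 4], the 0.5-quantile is 2 and the 95% interval spans from the lowest to the highest response.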

3.3 Data Labelling

Depending on the prediction model used, the target to be predicted after d days is labeled differently in each model. In RFC, the target to be predicted on day i is calculated using the daily closing prices of AAPL with the following formula:

Target_i = Sign(Close_{i+d} − Close_i)

where d is the number of days after which the prediction is to be made, i.e. d represents the time windows of 30, 60 and 90 days respectively. The target is in this case labeled with either −1 or 1; a negative value of the target indicates that the price of the stock has fallen after d days, whereas a positive value indicates that it has risen.
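The labeling rule can be sketched directly (our own naming; mapping a zero difference to −1 is an assumption on our part, since the text does not state how ties are labeled):

```python
def label_direction(closes, d):
    """RFC target: Sign(Close_{i+d} - Close_i); +1 for a rise after d days,
    -1 otherwise (zero change treated as -1 by assumption)."""
    return [1 if closes[i + d] - closes[i] > 0 else -1
            for i in range(len(closes) - d)]
```

For example, with closing prices [1, 2, 3, 2, 1] and d = 1 the labels are [1, 1, -1, -1].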

In QRF, however, the time series of historical stock data is grouped into smaller groups, or intervals. As QRF deals with regression problems, i.e. the response variable must in this case be continuous rather than a discrete class label, we use a technique called "data binning". Data binning, also called discrete binning, is a pre-processing step used to reduce the effects of minor observation errors. First, the target is calculated using the following formula:

Target_i = Close_{i+d} − Close_i

where d is the number of days after which the prediction is to be made, i.e. d represents the time windows of 30, 60 and 90 days respectively. The values of the target are then replaced by intervals, each containing a range of values, either positive or negative. The negative target values belong to the groups containing all values less than zero, and the positive target values belong to the groups containing all values above zero. This technique is used when the variable is continuous, to make the distinction between values easier; since we are predicting the direction of the stock market price, which can be either up or down, this helps us easily identify the groups that represent a positive shift in prices and the groups that represent a negative shift.
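An equal-width binning step of this kind can be sketched as follows (our own naming and group numbering; the thesis performs the binning in R):

```python
def bin_target(values, n_bins=10):
    """Data binning: assign each continuous target value to one of n_bins
    equal-width groups (numbered 1..n_bins) spanning the sample range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    groups = []
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)   # clamp the maximum value
        groups.append(idx + 1)
    return groups
```

Groups whose interval lies below zero then collect the negative shifts, and the remaining groups the positive shifts.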

Table 2 below presents the descriptive statistics of the target for 30, 60 and 90 days re-spectively.


Table 2: Descriptive statistics of the response variable, α = 0.0095

         30-day      60-day     90-day
Min     -2.378699   -3.54087   -3.70988
Max      6.290644   10.42402   12.98259
Mean     0.217078    0.41927    0.61424
Median   0.009730    0.01211    0.02915

Table 2 shows that most of the target values are located around zero, since the mean and the median are both quite close to zero. We now bin the data so that, instead of the raw target values, we get smaller groups or intervals of data. The figures below show the histograms of the binned target; each target has a different set of groups, generated in R using 10 bins.

Figure 2: Target 30 days

Figure 3: Target 60 days

Figure 4: Target 90 days

The target for 30 days has been divided into 10 groups: values between −3 and −2 are labeled as group 1, −2 to −1 as group 2, −1 to 0 as group 3, and so on. The first three groups contain the negative values and thus represent the negative shift in stock prices, or the downward trend, whereas the remaining groups (4 to 10) represent the positive shift in the stock price, or the upward trend, after 30 days.

Figures 3 and 4 can be interpreted in the same way. The target for 60 days has been binned into 15 different groups and the target for 90 days into 9 different groups, each representing either a positive or a negative shift in the stock prices after 60 or 90 days.

Fig 5 below shows the framework under which the prediction of the direction of the stock market price is made using RFC and QRF.


[Flowcharts: (a) Predicting with RFC — data collection and exponential smoothing feed the input variables (RSI, ROC, Williams %R, MACD, Stochastic %K, CCI, DIX) into the Random Forest Classifier, whose model output is the direction prediction (up/down), with positive returns labeled Target = 1 and negative returns Target = -1. (b) Predicting with QRF — data collection, exponential smoothing and data binning feed the same input variables into the Quantile Regression Forest, whose model output is the direction prediction (up/down).]

Figure 5: Prediction Models


3.4 Model Evaluation Criteria

The prediction of the direction of stock market price, whether it will go up or down, affects the trader's decision to buy or sell a stock. Hence the effectiveness of the models used should be evaluated accurately, and to do that, different measures and parameters are used for both RFC and QRF.

In the case of RF, the performance of a binary classifier is usually described in the form of a matrix called the confusion matrix. The binary classifier is then evaluated using four parameters calculated from the confusion matrix: accuracy, precision, sensitivity and specificity.

Accuracy is the ratio of correctly predicted observations to the total number of observations. Precision is the ratio of correctly predicted positive observations to the total predicted positives. Sensitivity, also called recall, is the ratio of correctly predicted positives to the total number of actual positives, and specificity is the ratio of correctly predicted negatives to the total number of actual negatives. The formulas for computing these parameters are as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision = TP / (TP + FP)

Sensitivity = TP / (TP + FN)

Specificity = TN / (TN + FP)

where
TP = number of true positive values
TN = number of true negative values
FP = number of false positive values
FN = number of false negative values
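These four formulas can be checked directly against the 30-day confusion matrix for α = 0.0095 reported later (Table 3); a small Python sketch:

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the four confusion-matrix metrics used in the text."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),   # also called recall
        "specificity": tn / (tn + fp),
    }

# 30-day confusion matrix for alpha = 0.0095 (Table 3):
# predicted up / actual up = 1214, predicted up / actual down = 55,
# predicted down / actual up = 52, predicted down / actual down = 654
m = classification_metrics(tp=1214, tn=654, fp=55, fn=52)
print({k: round(v, 4) for k, v in m.items()})
# accuracy 0.9458, precision 0.9567, sensitivity 0.9589,
# specificity 0.9224 — matching the 30-day row of Table 4
```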

For QRF, however, we use standard statistical indicators to assess our model. The most commonly used metrics to evaluate a regression model's accuracy are RMSE, MAE and MAPE. We focus on RMSE and MAPE to evaluate the robustness of QRF. These two metrics express the accuracy of the regression model and are defined by the following formulas:


RMSE = sqrt( (1/n) Σ_{i=1}^{n} (C_i − P_i)^2 )

MAPE = (1/n) Σ_{i=1}^{n} | (C_i − P_i) / C_i |

where
n = the total window size
C_i = actual closing price
P_i = predicted closing price

RMSE is the standard deviation of the prediction errors: the square root of the mean squared error, which in turn is the average of the squared differences between the actual and predicted values over the data set. MAPE, which is commonly used as a loss function for regression models, is the average over the data set of the absolute difference between the actual and predicted values divided by the actual values.
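Both metrics are a few lines of NumPy; the values below are illustrative only, not the thesis data:

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return np.sqrt(np.mean((actual - predicted) ** 2))

def mape(actual, predicted):
    """Mean absolute percentage error (as a fraction, not in %)."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return np.mean(np.abs((actual - predicted) / actual))

# Illustrative closing prices and predictions:
c = [100.0, 102.0, 105.0, 103.0]
p = [101.0, 101.5, 104.0, 103.5]
print(rmse(c, p), mape(c, p))
```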


4 Experimental Results

This section covers the implementation of the two ML algorithms used as prediction models. It provides the results of the binary classifier used when the stock prediction is treated as a classification problem, as well as the results of the regressor used when it is treated as a regression problem.

With the aim of predicting the stock price direction using machine learning algorithms, time series data is used, focusing only on the daily closing prices of the stock. Our results consist of two parts, classification and regression. In the first part, we treat stock price prediction as a classification problem using RFC. The second part deals with it as a regression problem using QRF. For RFC, we investigate the effect of the smoothing factor (α), considering three different values, whereas for QRF we focus only on the value of (α) that gives the best results in the RFC model.

Using the historical data of AAPL Inc., we focus our attention on long-term rather than short-term prediction. We evaluate the robustness of each of the models using the parameters discussed in the methodology, and ultimately we provide a reflective comparison between the two models by examining the importance of the seven technical indicators which serve as input variables in both.

4.1 Random Forest Classifier

The response variable is first treated as a binary variable, and RFC is used to predict the direction of the AAPL stock price when we have a binary outcome. Exponential smoothing has been recommended in prior studies to increase the accuracy of ensemble learning algorithms. Hence, since the smoothing factor is assumed to have a significant effect on the performance of the model, the results of RFC are divided into three parts exploring the effect of the smoothing factor (α) with three different values.
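The exponential smoothing recurrence S_t = α·x_t + (1 − α)·S_{t−1} applied throughout this section can be sketched as follows (illustrative prices, not the AAPL series):

```python
import numpy as np

def exponential_smoothing(prices, alpha):
    """Simple exponential smoothing: S_t = alpha*x_t + (1 - alpha)*S_{t-1},
    with S_1 initialized to the first observation."""
    smoothed = np.empty(len(prices))
    smoothed[0] = prices[0]
    for t in range(1, len(prices)):
        smoothed[t] = alpha * prices[t] + (1 - alpha) * smoothed[t - 1]
    return smoothed

# Illustrative series; the thesis smooths the AAPL closing prices.
prices = np.array([10.0, 12.0, 11.0, 13.0, 12.5])
print(exponential_smoothing(prices, alpha=0.0095))
```

A small α weights the long history heavily and barely reacts to new observations, which is why it suits the long-term prediction studied below.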

A total of 100 trees is used in this model, and some of the parameters have been tuned to achieve better results, such as mtry, which defines the number of variables randomly sampled as candidates at each split. Furthermore, we evaluate the performance of the RFC by considering the confusion matrix and the four parameters calculated from it, and also by plotting the OOB error rate.
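The thesis fits this model with R's randomForest; a rough scikit-learn equivalent on synthetic features, with max_features playing the role of mtry, looks like this:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the seven technical indicators and the binary
# up/down target; the thesis uses the real AAPL features instead.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 7))                  # RSI, ROC, Williams %R, ...
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # 1 = up, 0 = down

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# 100 trees as in the thesis; max_features is the mtry analogue.
clf = RandomForestClassifier(n_estimators=100, max_features=3, random_state=1)
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```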

4.1.1 Random Forest with α = 0.0095

We choose the first value of (α) to be very small (α = 0.0095), since a small value of (α) is assumed to benefit long-term prediction. The daily exponentially smoothed closing prices of AAPL with smoothing factor (α = 0.0095) from 1980 to 2020 are plotted in Fig 6 below.


[Time series plot of the smoothed closing price (y-axis: closing price, x-axis: Jan 1980 – Jan 2020)]

Figure 6: Daily exponentially smoothed AAPL closing prices from 1980 to 2020, α = 0.0095

The results of RFC are presented in terms of the confusion matrix containing the combinations of the actual and predicted values for one, two and three months respectively, as illustrated in Table 3 below.

Table 3: Confusion Matrix for α = 0.0095

                          Actual up   Actual down
30-day   Predicted up        1214          55
         Predicted down        52         654
60-day   Predicted up        1202          71
         Predicted down        46         656
90-day   Predicted up        1192          65
         Predicted down        51         667

The four parameters, accuracy, sensitivity, precision and specificity, are then calculated from the confusion matrix. Results of RFC for 30, 60 and 90 days with smoothing factor (α = 0.0095) are illustrated in Table 4 below.

Table 4: Results of random forest classifier, α = 0.0095

Trading Window   Accuracy   Sensitivity   Precision   Specificity
30-day           94.58%     0.9589        0.9567      0.9224
60-day           94.08%     0.9631        0.9442      0.9023
90-day           94.13%     0.9590        0.9483      0.9112


It can be seen that the accuracy of the RFC is quite high when the smoothing factor is very small. The 30-day prediction has the highest accuracy, but generally there is no big difference in the accuracy of the RFC between the one-, two- and three-month predictions.

The OOB error rate is also plotted in Fig 7 for 30, 60 and 90 days respectively. The OOB error rate is the average prediction error on each training sample, computed using only the trees whose bootstrap sample did not include that sample; it also shows how random forest performs as the number of trees in the forest increases.

From the plot, we can see that as the number of trees increases, random forest converges and the OOB error rate decreases. However, the OOB error rate increases with the number of days after which the prediction is made: the very long-term prediction (90 days) has the highest OOB error rate, followed by the 60-day prediction, while the 30-day prediction has the lowest.
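Curves like those in Fig 7 can be reproduced by growing the forest incrementally and reading off the OOB error at each size; a scikit-learn sketch on synthetic data (warm_start reuses the already-grown trees):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the indicator features and binary target.
rng = np.random.default_rng(2)
X = rng.normal(size=(400, 7))
y = (X[:, 0] > 0).astype(int)

# Track the OOB error as trees are added to the forest.
clf = RandomForestClassifier(oob_score=True, warm_start=True, random_state=2)
oob_errors = []
for n in range(25, 101, 25):
    clf.set_params(n_estimators=n)
    clf.fit(X, y)                       # grows only the new trees
    oob_errors.append(1.0 - clf.oob_score_)

print(oob_errors)  # typically decreasing as the forest grows
```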

[Plot of the OOB error rate against the number of trees (0–100) for the 30-, 60- and 90-day predictions]

Figure 7: OOB error rate, α = 0.0095


4.1.2 Random Forest with α = 0.2

A commonly used value of the smoothing factor is between 0.1 and 0.2, and by default (α) is often taken to be 0.2. Hence we choose the second value of the smoothing constant to be 0.2. Fig 8 shows what the daily closing price of AAPL looks like when we smooth our time series data with an exponential factor of 0.2.

[Time series plot of the smoothed closing price (y-axis: closing price, x-axis: Jan 1980 – Jan 2020)]

Figure 8: Daily exponentially smoothed AAPL closing prices from 1980 to 2020, α = 0.2

The confusion matrix and the four parameters are illustrated in Table 5 and Table 6. The accuracy of the RFC for the 30-, 60- and 90-day predictions is lower when we choose (α) to be 0.2. However, it can be seen clearly that the accuracy increases as we predict the stock price direction over a longer time period: the 60-day prediction has higher accuracy than the 30-day prediction, and the 90-day prediction is higher still. The trend in this case is consistent, since the gain in accuracy is clearly visible as we move to longer-term prediction.

Table 5: Confusion Matrix for α = 0.2

                          Actual up   Actual down
30-day   Predicted up         983         300
         Predicted down       187         505
60-day   Predicted up        1086         286
         Predicted down       140         463
90-day   Predicted up        1124         261
         Predicted down       132         458


Table 6: Results of random forest classifier, α = 0.2

Trading Window   Accuracy   Sensitivity   Precision   Specificity
30-day           75.34%     0.8402        0.7662      0.6273
60-day           78.43%     0.8858        0.7915      0.6182
90-day           80.10%     0.8949        0.8116      0.6370

Considering the OOB error rate with smoothing factor (α = 0.2), we see a big difference from the OOB error rate plotted with (α = 0.0095). For (α = 0.2), the OOB error rate for 30, 60 and 90 days ranges between 0.25 and 0.38, whereas for (α = 0.0095) it lies between 0.05 and 0.13. It also decreases as the number of trees in the forest increases. The 90-day prediction has the lowest OOB error rate and the 30-day prediction the highest, which is the opposite of what we saw for (α = 0.0095). The OOB error rate for the 60-day prediction initially oscillates above and below that of the 90-day prediction when the number of trees is between 0 and 15, and thereafter settles between the 30- and 90-day curves.

[Plot of the OOB error rate against the number of trees (0–100) for the 30-, 60- and 90-day predictions]

Figure 9: OOB error rate, α = 0.2


4.1.3 Random Forest with α = 0.95

The third value of the exponential smoothing factor is chosen close to 1 in order to see whether such large values have an effect on the long-term prediction. The daily exponentially smoothed AAPL closing prices with smoothing factor (α = 0.95) are plotted in Fig 10.

[Time series plot of the smoothed closing price (y-axis: closing price, x-axis: Jan 1980 – Jan 2020)]

Figure 10: Daily exponentially smoothed AAPL closing prices from 1980 to 2020, α = 0.95

Following the same procedure, the confusion matrix for the 30-, 60- and 90-day predictions with the actual and predicted values is illustrated in Table 7 below.

Table 7: Confusion Matrix for α = 0.95

                          Actual up   Actual down
30-day   Predicted up         926         416
         Predicted down       241         392
60-day   Predicted up        1023         423
         Predicted down       207         322
90-day   Predicted up        1050         424
         Predicted down       198         302

The four parameters for the 30-, 60- and 90-day predictions are illustrated in Table 8. Accuracy is at its lowest when (α) is close to 1, with only small differences as we increase the number of days after which the prediction is made. The 90-day prediction has the highest accuracy and the 30-day prediction the lowest.


Table 8: Results of random forest classifier, α = 0.95

Trading Window   Accuracy   Sensitivity   Precision   Specificity
30-day           66.73%     0.7935        0.6981      0.4851
60-day           68.10%     0.8317        0.7075      0.4322
90-day           68.51%     0.8415        0.7123      0.4160

The OOB error rate again decreases as the number of trees in the forest increases. The 90-day prediction, which has the highest accuracy, has the lowest OOB error rate, and the 30-day prediction, which has the lowest accuracy, has the highest; the 60-day prediction falls in between, as shown in Fig 11 below.

[Plot of the OOB error rate against the number of trees (0–100) for the 30-, 60- and 90-day predictions]

Figure 11: OOB error rate, α = 0.95


4.2 Quantile Regression Forest

After binning our data and obtaining different intervals using the data binning technique, QRF is now used with the value of the exponential factor that gave the best results in RFC, i.e. (α = 0.0095).

The same technical indicators used in RFC are used in QRF. We again use a total of 100 trees in this model and tune some of the parameters to achieve better results. The quantiles predicted with this model are the 0.025-quantile, 0.25-quantile, 0.5-quantile, 0.75-quantile and 0.975-quantile.

The robustness of the model is evaluated using the standard indicators RMSE and MAPE. The results of QRF are illustrated in Table 9 for the 30-, 60- and 90-day predictions respectively.

Table 9: Results of Quantile Regression Forest

                0.025q   0.25q   0.5q    0.75q   0.975q   Mean
30-day  RMSE    0.78     0.40    0.31    0.36    0.80     0.30
        MAPE    0.11%    0.03%   0.03%   0.04%   0.13%    0.04%
60-day  RMSE    1.41     0.77    0.45    0.58    1.58     0.49
        MAPE    0.14%    0.04%   0.03%   0.06%   0.2%     0.06%
90-day  RMSE    1.04     0.65    0.39    0.53    1.29     0.42
        MAPE    0.17%    0.06%   0.05%   0.09%   0.33%    0.09%

As can be observed from Table 9, RMSE is quite low: below 1 for the mean and the central quantiles, although the extreme quantiles exceed 1 for the 60- and 90-day predictions. The mean and the median have the smallest RMSE values, and the median also has the lowest MAPE for all three predictions (30, 60 and 90 days). This indicates that QRF is better than RF, since the MAPE for the median is lower than that for the mean. In addition, what can be observed from this table in general, and from MAPE in particular, is that the 0.025-, 0.25-, 0.75- and 0.975-quantiles are to some extent wide.
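The quantile predictions behind Table 9 come from the conditional distribution that QRF records in the tree leaves. Below is a rough Python sketch of Meinshausen's idea on synthetic data, simplified by pooling co-leaf training responses with equal weight (the proper method weights each response by the inverse leaf size in every tree):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def qrf_quantiles(forest, X_train, y_train, x, quantiles):
    """Sketch of QRF: pool the training responses that share a leaf
    with the query point x in each tree, then take empirical quantiles
    of the pooled conditional distribution."""
    train_leaves = forest.apply(X_train)           # (n_train, n_trees)
    query_leaves = forest.apply(x.reshape(1, -1))  # (1, n_trees)
    pooled = []
    for t in range(train_leaves.shape[1]):
        mask = train_leaves[:, t] == query_leaves[0, t]
        pooled.extend(y_train[mask])
    return np.quantile(pooled, quantiles)

# Synthetic regression problem with seven features.
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 7))
y = X[:, 0] + rng.normal(scale=0.5, size=300)

forest = RandomForestRegressor(n_estimators=100, random_state=3).fit(X, y)
q = qrf_quantiles(forest, X, y, X[0], [0.025, 0.25, 0.5, 0.75, 0.975])
print(q)  # five conditional quantiles for the first observation
```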

The MSE is also plotted against the number of trees in Fig 12 below.


[Plot of the mean squared error against the number of trees (0–100) for the 30-, 60- and 90-day predictions]

Figure 12: Mean Square Error, α = 0.0095

The plot above shows that the MSE decreases dramatically as the number of trees in the forest increases, and that QRF converges as more trees are added to the model. The 30-day prediction has the lowest MSE and the 60-day prediction the highest, with the 90-day prediction falling in between.


4.2.1 Prediction Intervals

Making predictions with QRF returns not only the conditional mean but the full conditional distribution of the response variable. Prediction intervals can then be constructed from the distribution recorded by QRF in the tree leaves of the forest. In this paper, prediction intervals are used to quantify the confidence, or certainty, in RF. We focus on building 50% and 95% prediction intervals.

The 95% prediction interval is estimated using the following formula:

I_0.95(x) = [ q_0.025(Y | X = x), q_0.975(Y | X = x) ]

And the 50% prediction interval is estimated by the following formula:

I_0.5(x) = [ q_0.25(Y | X = x), q_0.75(Y | X = x) ]
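Given arrays of quantile predictions (hypothetical values below, not thesis output), the two intervals are just the corresponding quantile pairs, and empirical coverage can be checked directly:

```python
import numpy as np

# Hypothetical per-observation quantile predictions.
q025 = np.array([-1.2, -0.8, 0.1])
q250 = np.array([-0.4, -0.1, 0.6])
q750 = np.array([ 0.5,  0.9, 1.4])
q975 = np.array([ 1.3,  1.8, 2.2])

# Each row is [lower bound, upper bound] for one observation.
interval_50 = np.stack([q250, q750], axis=1)  # [q_0.25, q_0.75]
interval_95 = np.stack([q025, q975], axis=1)  # [q_0.025, q_0.975]

# Fraction of true values falling inside the 95% interval.
y_true = np.array([0.0, 1.0, 2.0])
coverage_95 = np.mean((y_true >= interval_95[:, 0]) & (y_true <= interval_95[:, 1]))
print(coverage_95)  # → 1.0
```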

The figures below show the 50% prediction interval and the corresponding 95% prediction interval for the 30-, 60- and 90-day predictions respectively. The light brown points represent the lower prediction bound, the brown points the upper prediction bound, and the black diamond the mean prediction obtained using RF.

From the figures below, we can see that the lower and upper bounds of the 50% prediction interval are significantly tighter than those of the 95% prediction interval. In the 50% case, however, the upper and lower bounds hardly widen when the target falls in the interval between 3 and 5 in the 30-day prediction, or between 6 and 8 in the 60-day prediction, whereas the tightest intervals in the 90-day prediction occur when the target falls between −4 and −2.


[Scatter plot of predicted vs. true binned targets, with the 0.25- and 0.75-quantile predictions and the mean prediction]

Figure 13: 50% Prediction Interval for 30 days

[Scatter plot of predicted vs. true binned targets, with the 0.025- and 0.975-quantile predictions and the mean prediction]

Figure 14: 95% Prediction Interval for 30 days

[Scatter plot of predicted vs. true binned targets, with the 0.25- and 0.75-quantile predictions and the mean prediction]

Figure 15: 50% Prediction Interval for 60 days

[Scatter plot of predicted vs. true binned targets, with the 0.025- and 0.975-quantile predictions and the mean prediction]

Figure 16: 95% Prediction Interval for 60 days

[Scatter plot of predicted vs. true binned targets, with the 0.25- and 0.75-quantile predictions and the mean prediction]

Figure 17: 50% Prediction Interval for 90 days

[Scatter plot of predicted vs. true binned targets, with the 0.025- and 0.975-quantile predictions and the mean prediction]

Figure 18: 95% Prediction interval for 90 days


4.3 Comparison between RFC and QRF

The last part of our results is a comparison between RFC and QRF. The comparison between the two models is made through the importance of the technical indicators that served as features, or input variables, for both RFC and QRF. The variable importance is illustrated in the bar chart below for the 30-, 60- and 90-day predictions.

[Bar chart of variable importance for MACD, RSI, ROC, Stochastic %K, Williams %R, CCI and DIX, shown for QRF and RFC at 30, 60 and 90 days]

Figure 19: Variable Importance for RFC and QRF

From the figure above, we can see that the stochastic oscillator (%K) and Williams (%R) show less importance in the one-, two- and three-month predictions in both QRF and RFC. The top feature, however, differs between the QRF and RFC models. For the 90-day prediction, RSI is the top feature in both RFC and QRF, and it is also the top feature among all seven variables for the 60-day prediction in both models. For the 30-day prediction, RSI and CCI are close contenders for the top feature in QRF, with CCI slightly more important than RSI, while for RFC the importance of RSI remains on top.


5 Discussion and Conclusion

The robustness of the models proposed in this paper is discussed through a comparative analysis between the results obtained here and the results obtained in other papers. For RFC we compare our results with those found in Khaidem et al. (2016) and Di (2014), and for QRF with the results obtained in Vijh et al. (2020).

The results obtained in Khaidem et al. (2016) show a very high accuracy for the RFC algorithm. However, the authors did not give any guideline for the smoothing factor (α), nor for the data partition (training and testing sets). It is also not clear which data set was used to obtain an accuracy close to 90%.

Varying the smoothing factor (α) in our results yielded large differences in accuracy, which can also be confirmed from the plots of the daily exponentially smoothed AAPL closing prices (Fig 6, 8 and 10), where we see that the AAPL closing prices smoothed with an exponential factor of (α = 0.0095) differ substantially from Fig 5, which includes the original AAPL closing prices. Thus, the selection of the smoothing factor can affect the performance of the algorithm used, and it is one of the factors that should be taken into account when predicting the stock price direction.

The data partition is another factor assumed to affect the accuracy of the model used to predict trends in stock prices. We show the effect of three different random samples of the data in Table 10 below.

Table 10: Accuracy with 3 different random splits of the AAPL data set

          α = 0.0095   α = 0.2   α = 0.95
30-day    95.04%       74.48%    65.57%
          94.68%       75.09%    65.72%
          95.44%       77.06%    67.14%
60-day    93.67%       76.35%    67.54%
          93.87%       76.66%    68.05%
          94.13%       77.42%    70.78%
90-day    93.62%       78.43%    68.35%
          93.52%       77.62%    69.06%
          93.72%       79.54%    69.32%

From Table 10 above, it can be seen that different splits of the data can also affect the performance of the model, but the change in the accuracy of RFC is small compared to the change from using a different smoothing factor (α).

A third factor assumed to affect the performance of the algorithm is the value of n in the technical indicators, which is also not clear in Khaidem et al. (2016): the trading days for most of the technical indicators are not specified. This has also been discussed on GitHub, where a reproduction of the results from that paper has been provided by José Martínez Heras. The reproduction used the same data set as Di (2014), since there is no information about the data set used in the paper. Di (2014) chose three stocks for the study, including AAPL, with a time span covering 4 years of the stock's historical data. The results from the reproduction showed that there is data leakage in the training set, since there is a big difference between the accuracy obtained for the 10-day prediction and the accuracy reported in the paper (58% compared to 92%). Data leakage refers to a mistake made by the creator of the machine learning algorithm when splitting the data set into training and testing sets, whereby some of the data is shared between the two sets.
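The leakage issue above is typically avoided by splitting time-series data chronologically rather than by random shuffling; a minimal sketch on illustrative data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical daily observations, ordered in time.
n = 1000
X = np.arange(n).reshape(-1, 1)
y = np.zeros(n)

# A shuffled split interleaves past and future rows; with overlapping
# windowed targets this can leak future information into training.
X_tr_bad, X_te_bad, *_ = train_test_split(X, y, test_size=0.25,
                                          shuffle=True, random_state=0)

# A chronological split keeps the test set strictly after the training set.
cut = int(0.75 * n)
X_tr, X_te = X[:cut], X[cut:]
print(X_tr.max() < X_te.min())  # → True: no temporal overlap
```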

The data set used in Di (2014) contained the daily closing prices from 2010-01-04 to 2014-12-10. Using our binary classifier model, i.e. RFC, with the same data set, we compute the accuracy for the same time windows used in that paper. The results from Di (2014) are illustrated in Table 11, the results of Khaidem et al. (2016) in Table 12, and the results obtained from our model in Table 13 for different values of (α).

Table 11: Results from Di (2014)

Trading Window   Accuracy
Next day         56.00%
Next 3-day       73.40%
Next 5-day       71.41%
Next 7-day       70.25%
Next 10-day      71.13%

Table 12: Results from Khaidem et al. (2016)

Trading Window   Accuracy
Next 3-day       85.20%
Next 5-day       83.88%
Next 7-day       88.11%
Next 10-day      92.08%

Table 13: Results obtained using RFC with different alpha

             Next day   Next 3-day   Next 5-day   Next 7-day   Next 10-day
α = 0.0095   96.04%     98.68%       99.01%       98.68%       98.34%
α = 0.2      70.96%     79.21%       83.17%       83.44%       86.71%
α = 0.95     49.50%     65.68%       70.30%       70.86%       79.04%

Comparing the results in the three tables above, our results show very high accuracy when the smoothing factor is very small (α = 0.0095). These can be seen as the best results, better than those found in Di (2014) and Khaidem et al. (2016).


Vijh et al. (2020) used the historical data of five companies, covering 10 years of data from 4-5-2009 to 4-5-2019, in order to predict the next-day stock closing price using ANN and RF. Stock price prediction is treated as a regression problem in that paper, and the performance of the models was evaluated using RMSE, MAPE and MBE. We use the same data set and randomly pick two of the five companies used in that paper for our prediction model (QRF). We compare the results obtained by our prediction model with the results found in Vijh et al. (2020) by considering RMSE and MAPE. The results found in Vijh et al. (2020) are illustrated in Table 14, whereas the results obtained by our model (QRF) are illustrated in Table 15.

Table 14: Results from Vijh et al. (2020)

                        ANN                RF
                        RMSE    MAPE       RMSE    MAPE
Nike                    1.10    1.07%      1.29    1.14%
JP Morgan and Co.       1.28    0.89%      1.41    0.93%

Table 15: Results obtained using our model

                             Median    Mean
Nike                RMSE     0.01      0.01
                    MAPE     0.49%     0.68%
JP Morgan and Co.   RMSE     0.01      0.01
                    MAPE     0.87%     1.04%

As we can see from Table 15, the RMSE is very low compared to that obtained with ANN and RF for both companies. Moreover, MAPE lets us conclude that QRF outperforms both ANN and RF, since it gives better results: a lower MAPE for the median prediction than for the mean prediction, which is what RF produces.

5.1 Conclusion

This empirical paper examined the extent to which the trends of the AAPL stock price are predictable. A binary classifier as well as a regression model have been used to predict the long-term movements of the AAPL stock price over 30, 60 and 90 days respectively. In addition, to improve the model performance and the prediction accuracy, we use the exponential smoothing method recommended in prior studies, and we label the target differently in each model. The random forest classifier results showed that exponential smoothing has a significant effect on stock price direction prediction, and that the selection of the smoothing factor can significantly change the performance of the model. For long-term prediction, the very small smoothing factor gave the best results, with an accuracy over 90%.


Furthermore, the main purpose of using a regression model is, for one, to obtain a complete picture of the distribution of the response variable, where the whole conditional distribution is considered rather than only the conditional mean. Another reason is to quantify the uncertainty in RF via prediction intervals.

The uncertainty in model predictions is very important, yet it is often overlooked in applied machine learning, and we are not aware of a prior study that deals with the uncertainty of RF in this setting. However, we have successfully quantified this uncertainty in RF predictions via 50% and 95% prediction intervals.

What has been done in QRF can be called "classification by regression", to make our QRF results comparable with those obtained by RFC. The fact that QRF deals with continuous rather than discrete or categorical response variables led us to use data binning to push our data as close as possible to classifiable targets, so as to obtain results similar to those of a classifier model.

QRF shows that resolving values around zero is difficult, as the quantile intervals there are wide, and the data binning technique can be harmful for the analysis. However, as the model deals with continuous response variables, more significant results can be obtained by treating the stock prediction problem as a regression problem rather than a classification problem. This was also shown when we compared QRF with RF and ANN without binning the data, where the median prediction produced better results than the mean prediction. Thus, the median prediction obtained from QRF leads us to conclude that this model outperforms RF, which generates the mean prediction, and it therefore proves effective in predicting the trend in the stock market price whenever we treat stock prediction as a regression problem. On the other hand, as prediction uncertainty in RF has also been taken into account via prediction intervals, the quantiles obtained by QRF are to some extent wide, and hence the variance is high, meaning that the actual prediction is uncertain even though the median prediction is good.

5.2 Further Research

The limitations of this paper suggest some directions for future research: a) Further studies on this topic may include looking at different companies rather than focusing on one company; in addition to decision trees, other ML algorithms can also be examined. b) Another area for further research would be a solution that better resolves the "around zero" values; it might be training on more data, using feature engineering, or even using other technical indicators. c) Finally, it is worth constructing different prediction intervals related to the Random Forest algorithm and comparing them with quantile regression intervals, e.g. bag-of-observations intervals, split-conformal intervals, one-step boosted RF intervals, bias-corrected intervals, out-of-bag prediction intervals and high-density intervals.


Bibliography

[1] Yaser S Abu-Mostafa and Amir F Atiya. Introduction to financial forecasting. Appliedintelligence, 6(3):205–213, 1996.

[2] Michel Ballings, Dirk Van den Poel, Nathalie Hespeels, and Ruben Gryp. Evaluatingmultiple classifiers for stock price direction prediction. Expert Systems with Applica-tions, 42(20):7046–7056, 2015.

[3] Suryoday Basak, Saibal Kar, Snehanshu Saha, Luckyson Khaidem, and Sudeepa R Dey.Predicting the direction of stock market prices using tree-based classifiers. The NorthAmerican Journal of Economics and Finance, 47:552–567, 2019.

[4] Richard A Berk. Statistical learning from a regression perspective, volume 14. Springer,2008.

[5] Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.

[6] Yuqing Dai and Yuning Zhang. Machine learning in stock price trend forecasting. Stan-ford University Stanford, 2013.

[7] Shubharthi Dey, Yash Kumar, Snehanshu Saha, and Suryoday Basak. Forecasting to clas-sification: Predicting the direction of stock market price using xtreme gradient boosting.PESIT, Bengaluru, India, Working Paper, 2016.

[8] Xinjie Di. Stock trend prediction with technical indicators using svm. Independent WorkReport, Stanford Univ., 2014.

[9] Eugene F Fama. The behavior of stock-market prices. The journal of Business, 38(1):34–105, 1965.

[10] Eugene F Fama. Efficient capital markets: A review of theory and empirical work. Thejournal of Finance, 25(2):383–417, 1970.

[11] Eugene F Fama. Random walks in stock market prices. Financial analysts journal,51(1):75–80, 1995.

[12] David Forsyth. Applied Machine Learning. Springer, 2019.

I

[13] Luckyson Khaidem, Snehanshu Saha, and Sudeepa R Dey. Predicting the direction ofstock market prices using random forest. CoRR-arXiv preprint arXiv, abs/1605.00003,2016.

[14] Roger Koenker and Gilbert Bassett Jr. Regression quantiles. Econometrica: journal ofthe Econometric Society, 46(1):33–50, 1978.

[15] Roger Koenker and Kevin F Hallock. Quantile regression. Journal of economic per-spectives, 15(4):143–156, 2001.

[16] Gang Kou, Yi Peng, and Guoxun Wang. Evaluation of clustering algorithms for financialrisk analysis using mcdm methods. Information Sciences, 275:1–12, 2014.

[17] Max Kuhn, Kjell Johnson, et al. Applied predictive modeling, volume 26. Springer,2013.

[18] Manish Kumar and M Thenmozhi. Forecasting stock index movement: A comparisonof support vector machines and random forest. SSRN Scholarly Paper. Rochester, NY:Social Science Research Network, 2006.

[19] Brett Lantz. Machine learning with R. Packt publishing ltd, 2013.

[20] Mark Larson. 12 Simple Technical Indicators: That Really Work, volume 69. John Wiley& Sons, 2012.

[21] Andy Liaw, Matthew Wiener, et al. Classification and regression by randomforest. Rnews, 2(3):18–22, 2002.

[22] Andrew W Lo and A Craig MacKinlay. Stock market prices do not follow random walks:Evidence from a simple specification test. The review of financial studies, 1(1):41–66,1988.

[23] Andrew W Lo and A Craig MacKinlay. A Non-Random Walk Down Wall Street. Princeton University Press, 2011.

[24] Burton G Malkiel. A Random Walk Down Wall Street: Including a Life-Cycle Guide to Personal Investing. W. W. Norton & Company, 1999.

[25] Burton G Malkiel. The efficient market hypothesis and its critics. Journal of Economic Perspectives, 17(1):59–82, 2003.

[26] José H Martínez. reproduce-stock-market-direction-random-forests. https://github.com/jmartinezheras/reproduce-stock-market-direction-random-forests/blob/master/2016_StockDirection_RF.ipynb, 2018.

[27] Najeb Masoud. Predicting direction of stock prices index movement using artificial neural networks: The case of Libyan financial market. Journal of Economics, Management and Trade, 4(4):597–619, 2014.


[28] Nicolai Meinshausen. Quantile regression forests. Journal of Machine Learning Research, 7(Jun):983–999, 2006.

[29] Nikola Milosevic. Equity forecast: Predicting long term stock price movement using machine learning. Journal of Economics Library, 3(2), 2016.

[30] Phichhang Ou and Hengshan Wang. Prediction of stock market index movement by ten data mining techniques. Modern Applied Science, 3(12):28–42, 2009.

[31] Jigar Patel, Sahil Shah, Priyank Thakkar, and Ketan Kotecha. Predicting stock and stock price index movement using trend deterministic data preparation and machine learning techniques. Expert Systems with Applications, 42(1):259–268, 2015.

[32] Mingyue Qiu and Yu Song. Predicting the direction of stock market index movement using an optimized artificial neural network model. PLoS ONE, 11(5):e0155133, 2016.

[33] J. Ross Quinlan. Induction of decision trees. Machine learning, 1(1):81–106, 1986.

[34] Pedro N Rodriguez and Arnulfo Rodriguez. Predicting stock market indices movements. WIT Transactions on Modelling and Simulation, 38, 2004.

[35] Dogac Senol and Meltem Ozturan. Stock price direction prediction using artificial neural network approach: The case of Turkey. Journal of Artificial Intelligence, 1(2):70–77, 2009.

[36] Dev Shah, Haruna Isah, and Farhana Zulkernine. Stock market analysis: A review and taxonomy of prediction techniques. International Journal of Financial Studies, 7(2):26, 2019.

[37] Hrishikesh Vachhani, Mohammad S Obaidat, Arkesh Thakkar, Vyom Shah, Raj Sojitra, Jitendra Bhatia, and Sudeep Tanwar. Machine learning based stock market analysis: A short survey. International Conference on Innovative Data Communication Technologies and Application, 46:12–26, 2019.

[38] Mehar Vijh, Deeksha Chandola, Vinay Anand Tikkiwal, and Arun Kumar. Stock closing price prediction using machine learning techniques. Procedia Computer Science, 167:599–606, 2020.

[39] Haozhe Zhang, Joshua Zimmerman, Dan Nettleton, and Daniel J Nordman. Random forest prediction intervals. The American Statistician, pages 1–15, 2019.
