
Gradient Boosted Regression Trees for Forecasting Daily Solar Irradiance from a Numerical Weather Prediction Grid Interpolated with Ordinary Kriging

Jonathan Miles Davis
[email protected]
March 28, 2014

Capstone paper submitted to the Graduate Faculty of the University of Alabama in Huntsville in partial fulfilment of the requirements for the Degree of Master of Science in Operations Research

Abstract

This paper presents an approach to daily solar irradiance forecasting using numerical weather predictions archived in the Earth System Research Laboratory's global medium-range ensemble reforecast version 2 dataset. The dataset includes 15 variables available on a uniform latitude-longitude grid with one degree of resolution. The University of Oklahoma's Mesonet monitoring stations provide empirical solar irradiance data at 98 locations across the state. I interpolated the numerical weather prediction variables to the Mesonet locations using ordinary kriging. The interpolated variables provide regressors with which to predict empirical solar irradiance. I developed four predictive models using an open source implementation of Friedman's stochastic gradient boosting machine. Each of these models is an additive series of regression trees, or gradient boosted regression tree, identically parameterized but generated with a different random seed. Predictions from each model were aggregated by simple averaging to produce a final forecast. This approach performs among the top 5% of submissions to the American Meteorological Society's 2013-2014 Solar Energy Prediction Contest, hosted on the Kaggle contest platform.


1 Introduction

The integration of solar energy sources into the global energy portfolio is desirable for many economic and environmental reasons. Difficulties in predicting the near-future capacity of solar power plants deter large-scale adoption [1, 2, 3, 4] because forecast errors result in increased operating costs and instabilities in the electrical grid [5]. The American Meteorological Society (AMS) organized the 2013-2014 Solar Energy Prediction Contest, sponsored by EarthRisk Technologies, to foster interest in this problem among the broader data analysis community. The contest was hosted by Kaggle (www.kaggle.com), an internet platform for data analysis competitions that attracts international participation, and concluded on November 15, 2013.

The specific problem presented by the contest organizers was to develop predictive models of solar irradiance measured at monitoring stations in Oklahoma, based on numerical weather predictions provided by the Earth System Research Laboratory (ESRL) at the National Oceanic and Atmospheric Administration (NOAA). The contest problem is challenging in three respects. First, simply managing and processing the available data is nontrivial, with over 800 million values across 32 files. Second, the data are not well suited to traditional linear modeling techniques. Many values are replications of the same data point, resulting from perturbed initial conditions of the numerical weather prediction model, or the same variable estimated at different times of the day. Many of the variables are correlated. The data are also highly seasonal and geospatially correlated. The upshot is 825 potential regressors in the numerical weather predictions exhibiting multicollinearity with seasonal and geographic trends. Finally, the independent data are not available at the locations where the response data are available. The numerical weather prediction data exist along a latitude-longitude grid with one degree of resolution. The response data, solar irradiance, exist at 98 physical locations in the Oklahoma Mesonet, a networked series of environmental monitoring stations covering the state [6]. These locations are interspersed among the numerical weather prediction grid.

The approach presented here interpolates the numerical weather prediction grid with ordinary kriging to estimate the independent data at each Mesonet station's location. Gradient boosted regression trees (GBRTs) then use the interpolated data as regressors to forecast empirical solar irradiance. The rest of this paper is organized as follows. Section 2 catalogues the data available for analysis. Section 3 describes the analysis environment and provides instructions for reproducing this work. Section 4 reviews ordinary kriging and describes procedures for its application to this problem. Section 5 provides an overview of regression trees and gradient boosting, and describes the implementation of GBRTs in this analysis. Section 6 reviews the performance of this approach in comparison to other contest submissions. Finally, Section 7 discusses interesting observations and lessons learned.

2 Data

The independent data are numerical weather predictions from the 2nd-generation NOAA Global Ensemble Reforecast data set, a post-processed set of runs from the Global Ensemble Forecast System (GEFS) [7]. This data set includes 15 variables, each estimated at five times per day and replicated eleven times. The replications reflect perturbed initial conditions of the numerical weather prediction models and are called ensemble members. Table 1 describes the variables. The data are at a resolution of one degree and cover the entire planet. The contest organizers extracted the grid from -106° to -91° east longitude and 31° to 39° north latitude (Figure 1). The GEFS data are contained in Network Common Data Form (NetCDF) files. NetCDF is a compressed binary format for storing data and metadata [8]. Extracting a NetCDF file returns a five-dimensional array, where the dimensions index latitude, longitude, time of day, ensemble member, and date.

The GEFS data are divided into a training set and a test set. There are 15 files containing the training data, one for each variable, from January 1994 through December 2007. The test data are in 15 different files, from January 2008 through November 2012. Viewing all raw values for one day leaves 825 potential regressors (15 variables × 5 time steps × 11 ensemble members). I pre-processed the data to generate the ensemble mean at each time step and used those ensemble means as raw data (75 potential regressors) for all further procedures described in this paper. A number appended to a variable name, like apcp_sfc1, indicates the time step.
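As an illustration, the following minimal R sketch, using the ncdf4 package cited in Section 3, shows how one training file might be read and reduced to ensemble means. It is a sketch rather than the paper's actual script; the file path is an assumption for illustration.

```r
library(ncdf4)

# Open one GEFS training file and pull its five-dimensional array.
# The path is a hypothetical example, not a confirmed contest filename.
nc  <- nc_open("train/apcp_sfc_latlon_subset_19940101_20071231.nc")
arr <- ncvar_get(nc, names(nc$var)[1])  # the single data variable in the file
nc_close(nc)

# Per the text, dimensions index latitude, longitude, time of day,
# ensemble member, and date; averaging out dimension 4 (ensemble member)
# yields the ensemble means used as raw regressors.
ens_mean <- apply(arr, c(1, 2, 3, 5), mean)
```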

Table 1 Numerical weather prediction variables.

Variable     Description                                                     Units
apcp_sfc     3-hour accumulated precipitation at the surface                 kg∙m-2
dlwrf_sfc    Downward long-wave radiative flux average at the surface       W∙m-2
dswrf_sfc    Downward short-wave radiative flux average at the surface      W∙m-2
pres_msl     Air pressure at mean sea level                                  Pa
pwat_eatm    Precipitable water over the entire depth of the atmosphere     kg∙m-2
spfh_2m      Specific humidity at 2 m above ground                           kg∙kg-1
tcdc_eatm    Total cloud cover over the entire depth of the atmosphere      %
tcolc_eatm   Total column-integrated condensate over the entire atmosphere  kg∙m-2
tmax_2m      Maximum temperature over the past 3 hours at 2 m above ground  K
tmin_2m      Minimum temperature over the past 3 hours at 2 m above ground  K
tmp_2m       Current temperature at 2 m above ground                         K
tmp_sfc      Temperature of the surface                                      K
ulwrf_sfc    Upward long-wave radiation at the surface                       W∙m-2
ulwrf_tatm   Upward long-wave radiation at the top of the atmosphere        W∙m-2
uswrf_sfc    Upward short-wave radiation at the surface                      W∙m-2

The response data are measurements of total daily solar irradiance in J∙m-2 at 98 monitoring stations in the Oklahoma Mesonet. Mesonet stations record irradiance with a Li-Cor pyranometer every five minutes [9]. The response data are the sum of those recordings over the entire day. Responses are publicly available for the period from January 1994 through December 2007 (the training period) and consist of a single .csv file with 99 columns and 5,113 rows. The first column is the date of measurement. Columns two through 99 are total daily solar irradiance in J∙m-2 for the 98 Mesonet stations. Table 2 illustrates the format of the response data. Response data for the period from January 2008 through November 2012 are not publicly available; submissions to the Kaggle contest web page are evaluated against these data. All contest sites continue to function after the contest concludes, so old contests remain available for practice, education, and fun. Figure 1 illustrates the locations of the Mesonet stations in relation to the GEFS grid. Figure 2 illustrates the response data for two Mesonet stations. One final .csv file contains information for the Mesonet stations, including a four-letter identification code, latitude, longitude, and elevation. In summary, there are 15 NetCDF files containing the GEFS data for the training period, 15 NetCDF files containing the GEFS data for the test period, and two .csv files containing the response data for the training period and the Mesonet station information.

Table 2 The first three rows and nine columns of the response data file. The date column is formatted as yyyymmdd. Each column to the right of the date column contains data for a different Mesonet station, identified by a four-letter abbreviation. The data are total daily solar irradiance in J∙m-2.

Date      ACME      ADAX      ALTU      APAC      ARNE      BEAV      BESS      BIXB
19940101  12384900  11930700  12116700  12301200  10706100  10116900  11487900  11182800
19940102  11908500   9778500  10862700  11666400   8062500   9262800   9235200   3963300
19940103  12470700   9771900  12627300  12782700  11618400  10789800  11895900   4512600


Figure 1. Relative positions of the Mesonet stations to the numerical weather prediction grid from the GEFS data. The grid in the figure includes all GEFS data locations made available through the contest website but the actual reforecast data set is global. The grid is 16 degrees longitude by 9 degrees latitude.

Figure 2. Daily solar irradiance at two different Mesonet stations for the first 2,000 days in the training period. Mean irradiance across all data is approximately 16.5 MJ∙m-2. Colors differentiate the two stations.

3 Analysis Environment and Reproducibility

I performed all analysis in the R environment version 3.0.2 [10] on two personal computers: a 64-bit AMD A8 3870K desktop and a 64-bit Intel Core i5 3337U laptop. The desktop ran Windows 7 and Ubuntu 13.10; for some processes I found parallelization to be unstable in the Windows 7 environment. The laptop ran Windows 8.1. R is free and open source software for statistics and data analysis that evolved from the S language developed at Bell Labs [11], and is widely used in the scientific community. A search on Google Scholar at the time of this writing indicated 12,387 citations for reference [10] above, compared, for perspective, to 9,537 citations for Watson and Crick's [12] seminal paper on the "Molecular Structure of Nucleic Acids". (This paper does not affect the relative tallies.) Using R for research is appealing because it enhances reproducibility, accountability, and transition: sharing code that does not require licensed software allows other researchers to check work and rapidly expand on new ideas. It is appealing for education because students who learn applications on proprietary software lose the ability to apply those techniques once they no longer have access to an institution's computer lab. Part of the motivation for undertaking this topic was the opportunity to demonstrate the value of R as a supplement to Minitab® exercises in engineering courses at the University of Alabama in Huntsville.

All code required to reproduce this work is available through the GitHub repository linked below. The source data are hosted on the Kaggle competition website. The code consists of six short scripts described in Table 3. R loads all objects into memory when executing code, and there are several memory-intensive procedures in these scripts; a system with 12 GB of memory is ideal.

Table 3 Files required to reproduce the results of this paper.

File                       Action
Prepare_Data-1.R           Generates filename lists and loads .csv source data
Perform_Kriging-2.R        Performs kriging on the training data
Perform_KrigingTest-3.R    Performs kriging on the test data
GBRT_Tuning-4.R            Performs an experiment testing tuning parameters of GBRTs
Calculate_Predictions-5.R  Builds GBRTs and creates predictions over the test period
Generate_Figures-6.R       Produces the figures contained in this paper

GitHub repository: https://github.com/jmdavis0352/UAHCapstone
Kaggle competition page for data download: http://www.kaggle.com/c/ams-2014-solar-energy-prediction-contest/data

The core distribution of R includes a broad and deep set of functions for analysis, but the real power of the language lies in the community of users who extend its capabilities through downloadable packages. Citations for packages implementing statistical methods appear in the relevant sections of this paper; names of R packages are indicated with bold font. Other packages providing convenient functionality used in this analysis include reshape2 [13], ggplot2 [14], RColorBrewer [15], rpart [16], rpart.plot [17], scales [18], maps [19], and rgl [20] for data manipulation and plotting. The foreach [21] and doParallel [22] packages support simple constructs for parallel processing, which were vital for timely results. The ncdf4 [23] package provides libraries for interfacing with NetCDF version 4 files; this analysis would not have been possible in R without it. The RStudio [24] development environment provides a full-featured IDE for scripting and debugging; RStudio is also free and open source. Without this IDE, R is barely more than a terminal, and very difficult for beginners to penetrate.
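For illustration, a minimal parallel skeleton of the kind these packages enable might look as follows; the worker count and loop body are placeholders, not the paper's actual scripts.

```r
library(foreach)
library(doParallel)

# Register a parallel backend; four workers is an arbitrary example.
cl <- makeCluster(4)
registerDoParallel(cl)

# Days are independent, so per-day kriging can be farmed out to workers
# and the per-day results bound back into one matrix.
daily <- foreach(d = 1:365, .combine = rbind) %dopar% {
  # ... interpolate day d's grid here; a placeholder row stands in ...
  rep(d, 98)
}

stopCluster(cl)
```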

4 Interpolating the Numerical Weather Prediction Grid

The numerical weather predictions in the GEFS data are not available at the Mesonet station locations, so some form of interpolation or weighting is necessary for accurate forecasts. I applied the geoR [25] implementation of ordinary kriging to interpolate the GEFS grid to the Mesonet locations. Kriging is a geostatistical method for interpolating spatially correlated data. The method was formally developed by Matheron in 1963, based on the 1951 Master's thesis of Krige [26]. Kriging is similar to other interpolation methods in that the estimate of a response is a weighted sum of other observed values. Also like other interpolation methods, the weights are a function of the distances between the prediction location and the locations of the other observations. Kriging is distinct in that it is a statistical estimate of the response, following from the assumption that the data are a single realization of a random field, like cloud cover over a small region at a given time. Denoting the random field as $Z$, $Z(\vec{x}_i)$ represents the realization of the random field at the location $\vec{x}_i$; in this case, the value of one of the GEFS variables on one day, at one time of the day, at one grid point. The kriging estimate at a new location $\vec{x}^*$ takes the form:

$$\hat{Z}(\vec{x}^*) = \sum_{i=1}^{n} \lambda_i Z(\vec{x}_i) \qquad (1)$$

where $\lambda_i$ is the weight applied to observation $Z(\vec{x}_i)$, and the set of $Z(\vec{x}_i)$ is the neighbourhood of $n$ observations used in the estimate, in this case the 16 × 9 GEFS grid. In ordinary kriging these weights are constrained to sum to one.

Kriging weights for calculating $\hat{Z}(\vec{x}^*)$ result from the solution of the system of equations in (2). Notation in this section is adopted from Barnes [27] and the geoR documentation [25].

Define:

$h_{ij}$ = the distance or "lag" between observations at locations $\vec{x}_i$ and $\vec{x}_j$
$\gamma(h)$ = a positive definite covariance function, called a variogram model
$\omega$ = a Lagrange multiplier, necessary in ordinary kriging to enforce the constraint that the weights sum to one while minimizing estimate variance

$$
\begin{bmatrix}
\gamma(0) & \gamma(h_{12}) & \cdots & \gamma(h_{1n}) & 1 \\
\gamma(h_{21}) & \gamma(0) & \cdots & \gamma(h_{2n}) & 1 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
\gamma(h_{n1}) & \gamma(h_{n2}) & \cdots & \gamma(0) & 1 \\
1 & 1 & \cdots & 1 & \gamma(0)
\end{bmatrix}
\begin{bmatrix} \lambda_1 \\ \lambda_2 \\ \vdots \\ \lambda_n \\ \omega \end{bmatrix}
=
\begin{bmatrix} \gamma(h_{1*}) \\ \gamma(h_{2*}) \\ \vdots \\ \gamma(h_{n*}) \\ 1 \end{bmatrix}
\qquad (2)
$$
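To make the system concrete, here is a small self-contained R sketch that builds and solves (2) for the weights at one target location. The exponential variogram and its parameters are illustrative assumptions; with the zero-nugget choice below, $\gamma(0) = 0$, so the corner entry of the matrix is simply zero.

```r
# An exponential variogram with illustrative parameters (tau2 = 0 nugget).
gamma_exp <- function(h, tau2 = 0, sigma2 = 1, phi = 2) {
  tau2 + sigma2 * (1 - exp(-h / phi))
}

# Build and solve system (2) for the ordinary kriging weights at 'target';
# 'coords' is an n x 2 matrix of observation locations.
krige_weights <- function(coords, target, gamma = gamma_exp) {
  n <- nrow(coords)
  H <- as.matrix(dist(coords))                      # lags h_ij
  A <- rbind(cbind(gamma(H), 1), c(rep(1, n), 0))   # bordered matrix of (2)
  h_star <- sqrt(rowSums((coords - matrix(target, n, 2, byrow = TRUE))^2))
  b <- c(gamma(h_star), 1)                          # right-hand side of (2)
  sol <- solve(A, b)
  list(lambda = sol[1:n], omega = sol[n + 1])       # weights sum to one
}

# Example: weights for a point interior to a small 3 x 3 grid.
grid <- as.matrix(expand.grid(lon = 0:2, lat = 0:2))
w <- krige_weights(grid, c(1.25, 0.75))
sum(w$lambda)  # 1, up to floating point
```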

The specific definition of $\gamma(h)$ is arbitrary, but the function must be positive definite to ensure the system in (2) is non-singular [27]. Since $\gamma(h)$ defines the system of equations that generates the kriging weights, the selection of $\gamma(h)$ can greatly impact results. Generally, a practitioner chooses a variogram model by optimizing the parameters of several established formulas against an empirical variogram and choosing the one with the best fit. The empirical variogram measures the degree of spatial correlation, sometimes called semivariance, between observations a given distance apart [28]. The empirical variogram ordinate at lag $h$, denoted $\gamma_e(h)$, is defined in equation (3):

$$\gamma_e(h) = \frac{1}{2N_h} \sum_{k=1}^{N_h} \left[ Z(\vec{x}_{k+h}) - Z(\vec{x}_k) \right]^2 \qquad (3)$$

where $N_h$ is the number of observation pairs in the data set $h$ distance units apart, and $[Z(\vec{x}_{k+h}), Z(\vec{x}_k)]$ is one such pair. There are many established variogram models; geoR offers 14 alternatives. Many of these produced nearly identical results on the same data set, so I selected four models to evaluate against the empirical variograms of the GEFS grids. Figure 3A illustrates these functions fit to an empirical variogram. Table 4 provides function definitions with the parameterization below.

Parameter definitions:

$\phi$ = practical range: the distance beyond which correlation between observations of $Z$ is theoretically less than a small tolerance (geoR uses 0.05)
$\sigma^2$ = partial sill: the variance of the random field $Z$
$\tau^2$ = nugget variance: the intercept of the variogram model at $h = 0$; the nugget generally represents the potential for error in the value of the observations
$\kappa$ = smoothness parameter; $K_\kappa(x)$ is the modified Bessel function of the third kind of order $\kappa$, and $\Gamma(x)$ is the gamma function

Table 4 Variogram models evaluated for fit to empirical variograms of GEFS grids, as defined in geoR [25].

Exponential:
$$\gamma(h) = \tau^2 + \sigma^2\left[1 - e^{-h/\phi}\right]$$

Matern:
$$\gamma(h) = \tau^2 + \sigma^2\left[1 - \frac{1}{2^{\kappa-1}\Gamma(\kappa)}\left(\frac{h}{\phi}\right)^{\kappa} K_\kappa\!\left(\frac{h}{\phi}\right)\right]$$

Spherical:
$$\gamma(h) = \tau^2 + \sigma^2 \cdot \begin{cases} 1.5\left(\frac{h}{\phi}\right) - 0.5\left(\frac{h}{\phi}\right)^3, & h < \phi \\ 1, & \text{otherwise} \end{cases}$$

Cubic:
$$\gamma(h) = \tau^2 + \sigma^2 \cdot \begin{cases} 7\left(\frac{h}{\phi}\right)^2 - 8.75\left(\frac{h}{\phi}\right)^3 + 3.5\left(\frac{h}{\phi}\right)^5 - 0.75\left(\frac{h}{\phi}\right)^7, & h < \phi \\ 1, & \text{otherwise} \end{cases}$$

Applying kriging to interpolate all of the GEFS regressors to the Mesonet locations is computationally expensive. The GEFS grid must be interpolated for all 75 regressors on each day of both the training period and the test period, a total of 6,909 days. So there are 518,175 individual grids to interpolate to the 98 Mesonet stations. For each of those grids I performed the following procedure:

1. Computed the empirical variogram.
2. Fitted Exponential, Matern, Spherical, and Cubic variogram models to the empirical variogram (Figure 3A).
3. Randomly removed 10 data points from the region of the grid over Oklahoma to be used for cross validating estimates with each of the four variogram models (Figure 3B).
4. Computed kriging estimates at the 10 locations removed from the grid using each of the variogram models fitted in step 2, and recorded the sum of squared residual errors of the predictions.
5. Computed kriging predictions at the 98 Mesonet stations using the variogram model yielding the minimum sum of squared residuals at the 10 cross validation locations.

Variogram model fitting in step 2 is achieved through the R implementation of Nelder and Mead [29], which may not always converge completely. In some instances the nugget variance would converge to a very small negative value, such as -2 × 10-17. The nugget is greater than or equal to zero by definition, and this circumstance caused errors in the kriging function. I inserted an if-statement in my code to replace negative nugget values with zero before passing the parameters to the kriging function, and have logged this issue with the authors of geoR. A minimal sketch of this guard appears below.
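The following is a hedged sketch of the fit-and-clamp step using geoR's variog() and variofit(), assuming a hypothetical geodata object 'gdata' for one day's grid and a 98 x 2 coordinate matrix 'mesonet_xy'; it is illustrative, not the paper's actual script.

```r
library(geoR)

# Fit one variogram model to one day's grid, clamping a nugget that
# converged to a tiny negative value back to zero before kriging.
fit_one_model <- function(gdata) {
  ev  <- variog(gdata, messages = FALSE)            # empirical variogram
  fit <- variofit(ev, ini.cov.pars = c(1, 2),       # initial sill and range
                  cov.model = "exponential", messages = FALSE)
  if (fit$nugget < 0) fit$nugget <- 0               # guard against -2e-17 etc.
  fit
}

# krige.conv() then interpolates to the station locations:
# kc <- krige.conv(gdata, locations = mesonet_xy,
#                  krige = krige.control(obj.model = fit_one_model(gdata)))
```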


Figure 3. A) The left plot illustrates an empirical variogram with four parametrically optimized variogram models. Kriging at the Mesonet stations was performed with the function exhibiting the smallest sum of squared residual errors for predictions at the cross validation points illustrated in B). The right plot exemplifies a Latin square design for data points to remove for cross validating the variogram models.

The random data removal in step 3 above results from the column-wise pair-wise algorithm of Stocki [30], as implemented by Carnell [31]. The algorithm optimizes a Latin hypercube design such that the mean distance between all design points is maximized. The design returns (x, y) pairs with x and y on the interval [0, 1]. I transformed these pairs to integer values of grid points covering Oklahoma, since that is the area of interest for prediction accuracy. Unfortunately this transformation requires binning, which results in two mildly undesirable circumstances. First, the design is no longer guaranteed to be optimal with respect to mean distance between design points. Second, it is possible to assign design points to the same bin, so that only 9 or even 8 data points might be removed from the grid in some cases. While undesirable from an aesthetic point of view, neither circumstance poses any real issue in this context. A sketch of the design-and-binning step follows.
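A minimal sketch of the design and binning, assuming the lhs package's optimized Latin hypercube routine; the grid dimensions used for binning here are illustrative assumptions.

```r
library(lhs)

# An optimized 10-point Latin hypercube on the unit square.
set.seed(1)
design <- optimumLHS(n = 10, k = 2)

# Bin the unit-square design onto integer grid indices over Oklahoma;
# duplicates after binning are dropped, so fewer than 10 points may remain.
lon_idx <- ceiling(design[, 1] * 9)   # e.g. 9 grid columns of interest
lat_idx <- ceiling(design[, 2] * 4)   # e.g. 4 grid rows of interest
holdout <- unique(cbind(lon = lon_idx, lat = lat_idx))
```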

Interpolation performance is generally good, as one would expect with a reasonably dense, uniform grid of available data. Figure 4 illustrates a kriging surface for downward shortwave radiative flux. The x-y plane is latitude and longitude. The dots indicate the GEFS grid (black) and kriging predictions at Mesonet stations (red). The smooth surface represents the kriging predictions on a fine mesh. The result of the kriging process is a set of 75 regressors for each day, estimated at the Mesonet station locations. This data set supports the model development and predictions in the next section.

Figure 4. Kriging surface and data points for downward shortwave radiative flux. Red data points indicate predictions at the Mesonet Stations. Black points indicate the GEFS grid locations.


5 Gradient Boosted Regression Trees for Irradiance Forecasting

The term boosting represents the idea of taking individually weak models, like small regression trees, and boosting them into a strong predictor by combining them intelligently. Kuhn and Johnson [32] assert that Friedman's [33] "stochastic gradient boosting machine" is presently the most widely preferred boosting algorithm in practice. The top four submissions to the AMS Kaggle contest described in this paper all used variations on the gradient boosting machine, and indeed motivated its use here. Friedman's algorithm uses regression trees as basis functions in an additive model with the form in equation (4), where $c$ is a constant, $\beta_t$ is a coefficient, and each $g_t(x)$ is a simple regression tree on a vector of regressors $x$:

$$f(x) = c + \sum_{t=1}^{T} \beta_t g_t(x) \qquad (4)$$

A simple regression tree is essentially a decision tree used in the context of response prediction. The branches of the tree represent splits on the value of a regressor, creating partitions in the independent data where the dependent data are more homogeneous, like a form of clustering. The terminal nodes contain the prediction model for that branch of the tree. In this application the prediction models are constants, specifically the medians of observations within the partition defined by that branch of the tree. The number of terminal nodes in a tree is called the interaction depth. Figure 5 demonstrates a simple regression tree of interaction depth six predicting solar irradiance at the ACME Mesonet station. Values of dswrf_sfc4 < 749 indicate the left path of the first split. In Friedman's algorithm, the constant $c$ in equation (4) is initialized to a predictor of the response, such as the median of all values. The trees in the summation are defined in a stagewise procedure: each tree is a model of the negative gradient of residual errors from the previous stage's predictions, hence the term gradient boosted regression trees. The actual algorithm includes extra steps for regularization, described in detail in the next few paragraphs.

Figure 5. A simple regression tree of solar irradiance at the ACME Mesonet station with six terminal nodes. This tree partitions the data five times based on splits of the variables dswrf_sfc4, dswrf_sfc5, uswrf_sfc3, and uswrf_sfc5.
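A tree like the one in Figure 5 can be sketched with the rpart package cited in Section 3; 'acme' is a hypothetical data frame of daily irradiance and the kriged regressors for one station, and note that rpart's anova trees use leaf means rather than the leaf medians used in the boosted fits.

```r
library(rpart)
library(rpart.plot)

# Fit a small regression tree of irradiance on the kriged regressors;
# control settings are illustrative, chosen to keep the tree small.
tree <- rpart(irradiance ~ ., data = acme, method = "anova",
              control = rpart.control(maxdepth = 5, cp = 0.005))
rpart.plot(tree)   # visualizes the splits, as in Figure 5
```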

Figure 6 outlines the basic procedure of the gradient boosting machine, adapting notation from Ridgeway's [34] tutorial for gbm, an R package implementing this method, and from Friedman [33]. The objective of the algorithm is to determine a function $f(x)$ that minimizes the expected value of a loss function $L(y, f(x))$, where $y$ represents observations in the training data, as in equation (5):

$$f(x) = \arg\min_{f(x)} E_{y,x}\left[L(y, f(x))\right] \qquad (5)$$

The notation $\arg\min_{f(x)}$ denotes the $f(x)$ that minimizes the expression. The indexing variable $T$ in Figure 6 indicates the number of trees used in the definition of $f(x)$. The learning rate $\lambda$ and the bag fraction $p$ are mechanisms to prevent overfitting.

Select the number of trees $T$ to include in the additive model.
Select a shrinkage parameter, or learning rate, $\lambda$.
Select a random subsampling proportion, or bag fraction, $p$.
Select the interaction depth $d$ of each tree.
Initialize $f(x)$ to a constant $c$ such that $c = \arg\min_c \sum_{i=1}^{N} L(y_i, c)$.
Then for $t$ in $1, \ldots, T$ do:

1. Compute the negative gradient of the loss function, evaluated at the current fit, as the working response:
$$z_i = -\frac{\partial}{\partial f(x_i)} L(y_i, f(x_i)) \qquad (6)$$
2. Randomly select a subset of the data of size $p \times N$.
3. Fit a regression tree $g_t(x)$ with $d$ terminal nodes predicting $z_i$ from the covariates $x_i$, using only the randomly selected subset of data.
4. Choose a gradient descent step size as:
$$\rho = \arg\min_{\rho} \sum_{i=1}^{N} L(y_i, f(x_i) + \rho g_t(x_i)) \qquad (7)$$
5. Update the estimate of $f(x)$ as:
$$f(x) \leftarrow f(x) + \lambda \rho g_t(x) \qquad (8)$$

Figure 6. Friedman's stochastic gradient boosting machine procedures (reproduced from Ridgeway [34] with minor edits and notation changes).
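To make the loop concrete, here is a compact illustrative implementation of Figure 6 for least absolute deviation loss, using rpart trees as base learners. It is a sketch, not the gbm implementation: the per-tree line search of step 4 is folded into the shrinkage step for brevity, and 'X' is assumed to be a data frame of regressors.

```r
library(rpart)

boost_lad <- function(X, y, n_trees = 100, lambda = 0.05, p = 0.5, depth = 2) {
  f     <- rep(median(y), length(y))   # c = argmin_c of sum |y_i - c|
  trees <- vector("list", n_trees)
  for (t in seq_len(n_trees)) {
    z   <- sign(y - f)                 # negative gradient of |y - f|, as in (6)
    sub <- sample(length(y), floor(p * length(y)))   # bag fraction p
    dat <- data.frame(z = z, X)[sub, ]
    trees[[t]] <- rpart(z ~ ., data = dat, method = "anova",
                        control = rpart.control(maxdepth = depth, cp = 0))
    g <- predict(trees[[t]], newdata = data.frame(X))
    f <- f + lambda * g                # shrunken update, as in (8)
  }
  list(init = median(y), lambda = lambda, trees = trees)
}
```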

The beauty of the gradient boosting machine is that the method is generalizable through the selection of an appropriate loss function. Friedman [33] includes four loss criteria: least squares, least absolute deviation, the Huber loss function for M-regression, and logistic negative binomial log-likelihood. Ridgeway's [35] implementation in gbm also includes loss functions for binary classification, quantile regression, ranking problems, and Cox proportional hazards. The practical value of this flexibility is that one function call, with one set of parameters and syntax to remember, supports a vast array of the modeling problems practitioners encounter. The Kaggle contest specified mean absolute error (MAE) as the metric for performance evaluation, so I used the least absolute deviation loss criterion in all cases.

The tuning parameters of the algorithm have significant influence over the results. I left the bag fraction at the default value of 0.5 per Ridgeway's [34] recommendation. Generally a smaller learning rate produces better results, but consequently requires more trees and more computing time. Cross validation provides a mechanism for selecting the number of trees that produces the most robust out-of-sample predictions for a given learning rate. K-fold cross validation is implemented in gbm by dividing the data into k folds, building k models, and estimating out-of-sample predictive performance with each of the k models. The function call to gbm() executes the algorithm, returns a model object, and also returns the optimal number of trees. I call the object returned from the function call a GBRT model; a sketch of such a call appears below. The results of an experiment in Figure 7 demonstrate how the tuning parameters influence predictive capability, in terms of MAE, and run time. For the data in this experiment, learning rates of 0.05, 0.03, and 0.01 required 300, 800, and 2,000 trees, respectively, to ensure that the optimal number was contained in the model. De'ath [36] reports that aggregating smaller GBRT models with faster learning rates, by averaging their predictions, produces better results with less computational burden. The x-axis of each panel in Figure 7 indicates the number of GBRT model predictions aggregated. Each model aggregated is identical in terms of parameterization and data, but is built with a different random seed.
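A hedged sketch of one such fit with gbm; 'train' is a hypothetical data frame of daily irradiance plus the kriged regressors and month, and the settings shown are examples from the tuning ranges discussed above.

```r
library(gbm)

set.seed(7)
fit <- gbm(irradiance ~ ., data = train,
           distribution = "laplace",   # least absolute deviation loss
           n.trees = 2000,             # enough to contain the CV optimum
           shrinkage = 0.01,           # learning rate
           interaction.depth = 15,
           bag.fraction = 0.5,
           cv.folds = 3)
best <- gbm.perf(fit, method = "cv")   # optimal tree count from cross validation
# preds <- predict(fit, newdata = test, n.trees = best)
```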

The examination of tuning parameters in Figure 7 generally confirms De'ath's [36] assertion that aggregating models with faster (larger) learning rates produces a given level of performance more efficiently than a single, larger model. Each of these models uses all 75 raw regressors described in Section 2, plus a variable indicating the month. Other derived regressors I tried provided no benefit, including the sum of each variable over the day, the variance of each variable over the day, an indicator of moderate atmospheric pressure, and the standard errors of the kriging estimates. I also tried removing regressors, but this invariably degraded predictive performance. In early exploratory analysis with linear models built on the principal components of the independent data, using only the variables at time steps three, four, and five prior to model selection with stepwise regression algorithms had produced better results.

Figure 7. Each panel illustrates predictive performance at the ACME Mesonet station as the number of models aggregated increased from one to eight (x axis), for learning rates of 0.05, 0.03, and 0.01 (indicated by shape). The color scale indicates run time in seconds to predict irradiance for this station; total solution generation run time is approximately 98 times the run time reported here. So a learning rate of 0.01 with six cross validation folds and eight aggregated models would require about 68 hours to generate a complete solution to the Kaggle problem.


Based on the results in Figure 7, I aggregated six GBRT models with a learning rate of 0.05, an interaction depth of 15, and three cross validation folds for each Mesonet station. This procedure achieved an MAE of 2.28 MJ∙m-2, among the top 20% of contest submissions and an improvement of 12.6% over the contest benchmark. Some contest participants indicated that a single model provided better results than building models for each station; I found this to be true as well. The single model for all stations included additional regressors for station latitude, longitude, and elevation, for a total of 79. Unfortunately the relationship between the learning rate and the number of trees required is different with the full data set, and run times were too long for a full experiment. Ultimately I used a learning rate of 0.1, 900 trees, a bag fraction of 0.5, an interaction depth of 15, and three cross validation folds, and averaged the predictions of four such models (sketched below). The single-model approach achieved MAE = 2.17 MJ∙m-2.
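A hedged sketch of that final recipe: four identically parameterized fits with different seeds, their predictions simply averaged. 'train' and 'test' are the same hypothetical data frames as in the earlier sketch.

```r
preds <- sapply(1:4, function(s) {
  set.seed(s)   # a different random seed per model
  m <- gbm(irradiance ~ ., data = train, distribution = "laplace",
           n.trees = 900, shrinkage = 0.1, interaction.depth = 15,
           bag.fraction = 0.5, cv.folds = 3)
  predict(m, newdata = test, n.trees = gbm.perf(m, method = "cv"))
})
forecast <- rowMeans(preds)   # simple average produces the final forecast
```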

Interpreting a GBRT model is somewhat daunting, given that the predictor in this case is the sum of 900 trees that are actually modeling gradients of the residual errors. One of the convenient outcomes in linear modeling is the direct and obvious way the model coefficients describe the relationship between the regressors and the response. Friedman [33] anticipates this problem and proposes a measure called relative influence, which indicates how much each variable improved model performance when trees split on it. Figure 8 illustrates the relative influence of the top 10 regressors in the single model described above. Downward shortwave radiative flux dominates the prediction; the benchmark code used only the mean total dswrf_sfc to predict solar irradiance.

Figure 8. Relative influence of each regressor in the overall prediction. The prediction is dominated by dswrf_sfc4, which is downward shortwave radiative flux at Earth’s surface, forecast at 3pm from the GEFS models. The contest benchmark code used only ensemble mean total dswrf_sfc.
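As a usage note, the influence values plotted in Figure 8 can be extracted from a fitted model with summary.gbm(); 'fit' and 'best' are the hypothetical objects from the earlier sketch.

```r
# Friedman's relative influence for each regressor, ranked.
ri <- summary(fit, n.trees = best, plotit = FALSE)
head(ri, 10)   # the top ten regressors, as plotted in Figure 8
```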

6 Results

Over two thirds of contest participants improved on the benchmark MAE of 2.61 MJ∙m-2. The approach described in this paper achieved MAE = 2.17 MJ∙m-2 (13.2% mean absolute percentage error) across the full test set, an improvement over the contest benchmark of 16.9%. This result is among the top 5% of contest submissions, falling between the 4th- and 5th-placed results. David Gagne of the University of Oklahoma, one of the contest administrators, indicated that the improvement realized in this competition would yield multi-million dollar savings when applied to actual solar farms. Though the statement was informal and did not identify a time frame or scope of application, it does indicate that the contest administrators regard improvements on this scale as significant achievements. Table 5 provides comparisons between approaches.

Table 5 Comparative performance of the contest benchmark, the two approaches examined in this paper, and the contest winner's solution.

Method                                                                        MAE (MJ∙m-2)  Mean absolute % error  % improvement in MAE over benchmark
Contest benchmark                                                             2.61          15.8%                  --
Six aggregated GBRTs trained at each Mesonet station                          2.28          13.8%                  12.6%
Four aggregated GBRTs trained for the whole dataset, with spatial regressors  2.17          13.2%                  16.9%
Best proposed solution in the contest                                         2.11          12.8%                  19.2%

The best proposed solution came from a team of two in Brazil [37], using a Python implementation of GBRTs. They did not perform interpolation. Instead they used all 75 regressors from the four nearest GEFS grid points, plus a variable for the month, the distances to the four nearest GEFS grid points, and the latitude differences between the Mesonet station and the four nearest GEFS grid points, so these GBRT models contained 309 regressors. They trained 11 such models using the raw data from each of the 11 GEFS ensemble members, rather than taking the ensemble means of the 75 GEFS regressors. They then built two additional GBRT models based on daily totals of each of the 15 GEFS variables (Table 1), one using the ensemble median and the other using the ensemble maximum. Finally, they ensembled these 13 GBRT models using the same Nelder and Mead [29] algorithm discussed in Section 4 to minimize MAE, then performed a small bias correction.

7 Discussion

The purpose of the AMS contest was to bring a narrow topic, solar irradiance prediction, to the attention of the more general data analysis community to determine whether current best practices could be improved. The top solutions from contest participants significantly improved on the baseline values, and all employed various implementations of Friedman's gradient boosting machine. The authors of references [1, 3, 5] each conducted literature reviews of methods related to solar power forecasting. Gradient boosting did not appear once; artificial neural networks seem to be the preferred technique among solar energy forecasters. It seems plausible that within a particular field of study, certain methods rise in popularity as researchers build on the work of others in the field. If that is the case, and this competition illuminated a method that solar irradiance forecast researchers had simply been unaware of or uninterested in, then the endeavour is a fantastic success. Kaggle contests include participants with diverse backgrounds and experience, and may represent a fundamentally different way to advance the state of the art in many narrow analysis fields. NASA and the Royal Astronomical Society sponsored a Kaggle competition last year requiring image analysis to determine the peak positions of dark matter halos. The contest winners improved on the baseline code's performance by 30% [38], and the sponsors are currently running another competition for classifying the morphology of galaxies.

The Kaggle platform also provides immense value for education. All contests and data remain available for people to use at no cost. Several other participants indicated through the contest forum that they are using this contest problem for Master's-level work. I encourage University of Alabama in Huntsville faculty to examine the potential for developing challenging course projects and assignments based on some of these problems and data. My secondary goal in this analysis was to demonstrate the value of R in an educational context. I have only been using it for about a year and a half, but it is easily the most important tool I possess in the workplace. I was vaguely aware of R as an undergraduate at Georgia Tech, mostly through friends complaining when they happened to get the professor who required it. It may be ambitious to require its use in course work, but simply making students aware of it through discussion and example would, I believe, be immensely valuable to students who choose to learn it.

Finally, I present a cautionary tale for any who might benefit from it. The most challenging aspect of this analysis was data processing and management, and I made a critical mistake in that respect. I wrote and tested code for each function necessary to generate the complete solution early in the project; this was a good approach. My mistake was that I did not actually try to generate the full solution until I was satisfied with a method that seemed to produce good results on a few test cases. Scaling up to the full solution revealed numerous bugs that were very difficult to track down, like the nugget value that converged to a negative number discussed in Section 4. A better approach would have been to first generate the full solution at scale with a simple method.

Acknowledgements

I wish to extend my deep appreciation to the American Meteorological Society, the University of Oklahoma faculty and researchers, and others for administering the contest and making the data available; to the contest participants from whom I came to know of gradient boosting (my original approach was a principal components regression, which turned out not to work at all); to Dr. Jeremy Barnes for challenging me; to Dr. Ray Mitchell for so many years of guidance and friendship; to the UAH faculty and staff for all of your instruction and support; and finally to my parents, for whom all words of appreciation seem woefully inadequate.

8 References

[1] R. Marquez and C. F. Coimbra, "Forecasting of global and direct solar irradiance using stochastic learning methods, ground experiments and the NWS database," Solar Energy, vol. 85, pp. 746-756, 2011.
[2] J. Zhang, B.-M. Hodge, A. Florita, S. Lu, H. F. Hamann and V. Banunarayanan, "Metrics for Evaluating the Accuracy of Solar Power Forecasting," in 3rd International Workshop on Integration of Solar Power into Power Systems, London, 2013.
[3] H. T. Pedro and C. F. Coimbra, "Assessment of forecasting techniques for solar power production with no exogenous inputs," Solar Energy, vol. 86, pp. 2017-2028, 2012.
[4] IBM Corporation, "Solar Analytics - Forecasting and Load Scheduling," 2011. [Online]. Available: http://researcher.watson.ibm.com/researcher/files/us-kleinl/Solar_analytics-Jun.pdf.
[5] C. Chen, S. Duan, T. Cai and B. Liu, "Online 24-h solar power forecasting based on weather type classification using artificial neural network," Solar Energy, vol. 85, pp. 2856-2870, 2011.
[6] F. V. Brock, K. C. Crawford, R. L. Elliott, G. W. Cuperus, S. J. Stadler, H. L. Johnson and M. D. Eilts, "The Oklahoma Mesonet: A Technical Overview," Journal of Atmospheric and Oceanic Technology, vol. 12, pp. 5-19, 1995.
[7] T. M. Hamill, G. T. Bates, J. S. Whitaker, D. R. Murray, M. Fiorino, T. J. Galarneau, Y. Zhu and W. Lapenta, "NOAA's Second-Generation Global Medium-Range Ensemble Reforecast Dataset," Bulletin of the American Meteorological Society, pp. 1553-1566, October 2013.
[8] R. Rew and G. Davis, "NetCDF: an interface for scientific data access," IEEE Computer Graphics and Applications, vol. 10, no. 4, pp. 76-82, 1990.
[9] "Mesonet," University of Oklahoma. [Online]. Available: https://www.mesonet.org/index.php/site/about/other_measurements.
[10] R Core Team, R: A language and environment for statistical computing, Vienna, 2013.
[11] J. Chambers, Software for Data Analysis: Programming with R, New York: Springer, 2008.
[12] J. Watson and F. H. C. Crick, "Molecular Structure of Nucleic Acids," Nature, vol. 171, pp. 737-738, 1953.
[13] H. Wickham, "Reshaping Data with the reshape Package," Journal of Statistical Software, vol. 21, no. 12, pp. 1-20, 2007.
[14] H. Wickham, ggplot2: elegant graphics for data analysis, New York: Springer, 2009.
[15] E. Neuwirth, RColorBrewer: ColorBrewer palettes, R package version 1.0-5, 2011. URL: http://CRAN.R-project.org/package=RColorBrewer.
[16] T. Therneau, B. Atkinson and B. Ripley, rpart: Recursive Partitioning, R package version 4.1-7, 2014. URL: http://CRAN.R-project.org/package=rpart.
[17] S. Milborrow, rpart.plot: Plot rpart models. An enhanced version of plot.rpart, R package version 1.4-4, 2014. URL: http://CRAN.R-project.org/package=rpart.plot.
[18] H. Wickham, scales: Scale functions for graphics, R package version 0.2.3, 2012. URL: http://CRAN.R-project.org/package=scales.
[19] R. A. Becker and A. R. Wilks (original S code), R version by R. Brownrigg, enhancements by T. P. Minka, maps: Draw Geographical Maps, R package version 2.3-6, 2013. URL: http://CRAN.R-project.org/package=maps.
[20] D. Adler, D. Murdoch and others, rgl: 3D visualization device system (OpenGL), R package version 0.93.991, 2013. URL: http://CRAN.R-project.org/package=rgl.
[21] Revolution Analytics and S. Weston, foreach: Foreach looping construct for R, R package version 1.4.1, 2013. URL: http://CRAN.R-project.org/package=foreach.
[22] Revolution Analytics and S. Weston, doParallel: Foreach parallel adaptor for the parallel package, R package version 1.0.6, 2013. URL: http://CRAN.R-project.org/package=doParallel.
[23] D. Pierce, ncdf4: Interface to Unidata netCDF (version 4 or earlier) format data files, R package version 1.9, 2013. URL: http://dwpierce.com/software.
[24] RStudio, Inc., RStudio [Software], version 0.98.501, 2013.
[25] P. J. Ribeiro Jr and P. J. Diggle, "geoR: a package for geostatistical analysis," R-NEWS, vol. 1, no. 2, pp. 15-18, June 2001.
[26] G. M. Laslett, "Kriging and Splines: An Empirical Comparison of Their Predictive Performance in Some Applications," Journal of the American Statistical Association, vol. 89, no. 426, pp. 391-400, 1994.
[27] J. L. Barnes, "Test Planning and Validation through an Optimized Kriging Interpolation Process in a Sequential Sampling Adaptive Computer Learning Environment," doctoral dissertation, Auburn University, Auburn, 2012.
[28] P. J. Diggle and P. J. Ribeiro Jr., Model Based Geostatistics, New York: Springer, 2007.
[29] J. A. Nelder and R. Mead, "A simplex method for function minimization," The Computer Journal, vol. 7, pp. 308-313, 1965.
[30] R. Stocki, "A Method to Improve Design Reliability Using Optimal Latin Hypercube Sampling," Computer Assisted Mechanics and Engineering Sciences, vol. 12, pp. 87-105, 2005.
[31] R. Carnell, lhs: Latin Hypercube Samples, R package version 0.10, 2012. URL: http://CRAN.R-project.org/package=lhs.
[32] M. Kuhn and K. Johnson, Applied Predictive Modeling, New York: Springer, 2013.
[33] J. H. Friedman, "Greedy Function Approximation: A Gradient Boosting Machine," The Annals of Statistics, vol. 29, no. 5, pp. 1189-1232, 2001.
[34] G. Ridgeway, Generalized Boosted Models: A guide to the gbm package, 2007. URL: http://gradientboostedmodels.googlecode.com/git/gbm/inst/doc/gbm.pdf.
[35] G. Ridgeway with contributions from others, gbm: Generalized Boosted Regression Models, R package version 2.1, 2013. URL: http://CRAN.R-project.org/package=gbm.
[36] G. De'ath, "Boosted Trees for Ecological Modeling and Prediction," Ecology, vol. 88, no. 1, pp. 243-251, 2007.
[37] L. Eustaquio and G. Titericz Jr., "A Blending Approach to Solar Energy Prediction," in 12th Conference on Artificial and Computational Intelligence and its Applications to the Environmental Sciences, Atlanta, 2014.
[38] D. Harvey, T. D. Kitching, J. Noah-Vanhoucke, B. Hamner and T. Salimans, "Observing Dark Worlds: A crowdsourcing experiment for dark matter mapping," arXiv:1311.0704, 2013.