Exploratory landscape analysis

Exploratory Landscape Analysis

Olaf Mersmann∗, Bernd Bischl, Heike Trautmann, Mike Preuss, Claus Weihs, Günter Rudolph
TU Dortmund, Germany

∗ Corresponding author

ABSTRACT

Exploratory Landscape Analysis (ELA) subsumes a number of techniques employed to obtain knowledge about the properties of an unknown optimization problem, especially insofar as these properties are important for the performance of optimization algorithms. Whereas in a first attempt one could rely on high-level features designed by experts, we approach the problem from a different angle here, namely by using relatively cheap, low-level, computer-generated features. Interestingly, very few features are needed to separate the BBOB problem groups and also to relate a problem to the high-level, expert-designed features, paving the way for automatic algorithm selection.

Categories and Subject Descriptors

G.1.6 [Optimization]: Global Optimization, Unconstrained Optimization; G.3 [Probability and Statistics]: Statistical Computing; I.2.6 [Learning]: Knowledge Acquisition

General Terms

Experimentation, Algorithms, Performance

Keywords

evolutionary optimization, fitness landscape, exploratory landscape analysis, BBOB test set, benchmarking

1. INTRODUCTION

The development of evolutionary and related optimization methods for real-valued problems is still a dynamic research area, as it sees a large number of newly designed algorithms each year. These are usually evaluated on sets of benchmark problems such as the CEC’05 (Suganthan et al., 2005) and the BBOB’09/’10 sets (Hansen et al., 2009), and conclusions are drawn based on majorities (algorithm A outperforms algorithm B on more problems, and vice versa) or via consensus rankings (Mersmann et al., 2010). However, in many cases, not much effort is put into detecting the strengths and weaknesses of such algorithms in terms of problem properties, with the notable exception of dimensionality. The CEC’05 benchmark, and to an even larger extent the BBOB benchmark, have been designed to enable exactly that: to find out which algorithm is good on which problem group. Meanwhile, the naive belief in a single optimization algorithm that dominates all others even for a certain domain (e.g. multi-modal problems) has lost ground, and the evidence gathered from experimental analysis leads to another approach: given that we possess a good overview of the properties that make a problem easy or hard for a specific algorithm, we may choose the ‘right’ algorithm from an ensemble to solve it efficiently.

However, since knowledge about the interaction of problem properties and algorithm efficiency is incomplete even for constructed benchmark problems, it is even more difficult to choose an algorithm for a real-world problem, since for these even less is known about the function. In order to find the ‘right’ optimization algorithm for such an optimization problem, we thus need two things: improved knowledge about problem properties which capture the difficulty of an optimization problem for a specific algorithm, and a method to estimate these properties relatively cheaply.

In this work, we concentrate on the second task, following the path termed exploratory landscape analysis (ELA) as suggested in Mersmann et al. (2010) and Preuss and Bartz-Beielstein (2011). In addition to the expert-designed features of Mersmann et al. (2010), we suggest a number of low-level features of functions that can be estimated from relatively few function evaluations. Both feature groups are introduced in section 2. The latter is first used to predict to which of the five BBOB problem groups each of the 24 BBOB test functions belongs. In order to keep the number of features, or rather the number of necessary function evaluations, as low as possible, we employ a multi-objective evolutionary algorithm (EA) to simultaneously optimize feature sets according to quality and cost, with exciting results as presented in section 4. The required machine learning methods are described briefly in section 3.

In a second step, we try to relate the low-level features to the expert-designed high-level features of Mersmann et al. (2010), which poses a much more demanding task. Again, a multi-objective EA is used to determine sets of good and not too expensive features, as demonstrated in section 5. Our reasoning is that if a human expert is able to select or design an algorithm suitable for a fixed function given the high-level features of that function, and if the high-level features can be predicted by evaluating the function a few times for well chosen parameter settings to estimate the value of the low-level features, then it should be possible to automatically select a good algorithm for a fixed function. However, we have yet to achieve the last step, and therefore it is more of a vision to be worked on further in upcoming works.

To obtain a better idea of the usefulness of the low-level features, we lastly compare the feature sets resulting from the two experiments in section 5.2 before concluding.

2. EXPLORATORY LANDSCAPE ANALYSIS

Exploratory Landscape Analysis originates from the observation that once a problem is well known, one can employ a matching optimization algorithm to solve it. However, most problems encountered in practice (e.g. from the engineering domain) are poorly understood. It is common practice to either run some initial tests with one or more algorithms and then adapt the algorithm to the problem according to the obtained results, or to apply a ‘blind’ parameter tuning process to do the same automatically. If computing one fitness evaluation is costly, both approaches are problematic. This often results in insufficient algorithm comparisons in practice. We follow a different reasoning by finding interactions between problem properties and algorithms, so that the problem features defining a specific algorithm’s performance can be gathered without actually running it. Instead, these properties are estimated using a small sample of function values combined with statistical and machine learning techniques. These provide us with insights into the nature of the problem and enable us to select a good candidate optimization algorithm.

This automated knowledge acquisition approach of course heavily depends on a good feature set. We therefore start with a short summary of already known expert-designed high-level features, followed by the suggestion of a new low-level feature set which constitutes the basis of the experimental investigation in this work.

2.1 Properties based on Expert Knowledge

Mersmann et al. (2010) discuss eight properties to characterize the complexity of an optimization problem which can be specified based on expert knowledge of the problem at hand. However, other interpretations are possible and we do not assume incontrovertible general acceptance. These properties are used to characterize the Black-Box Optimization Benchmarking (Hansen et al., 2009, BBOB) test functions (see Table 1) and are then compared and contrasted to the performance of the competing optimization algorithms in the contest.

In addition to the self-explanatory aspect of multi-modality, the existence of a global structure is an indicator for the challenges of a given optimization problem. A lack of global basin structure severely increases the search effort of an algorithm. This holds as well for non-separable problems which cannot be partitioned into lower-dimensional sub-problems. Thus, problem separability is an important property as well.

Variable scaling may lead to non-spherical basins of attraction which require a different behavior of the algorithm in each dimension, especially with respect to the algorithm’s step size.

Search space and basin size homogeneity both relate to the overall appearance of the problem. Problem hardness surely depends on the existence of phase transitions in the search space as well as on the variation of the sizes of all basins of attraction. In contrast, a high global to local optima contrast, which refers to the difference between global and local peaks in comparison to the average fitness level, facilitates the recognition of excellent peaks in the objective space. The feature characterizing the function landscape’s plateaus is omitted in this study because the test function set lacks variation with respect to this feature.

Function                       multim.  gl.-struc.  separ.  scaling  homog.  basins  gl.-loc.
 1 Sphere                      none     none        high    none     high    none    none
 2 Ellipsoidal separable       none     none        high    high     high    none    none
 3 Rastrigin separable         high     strong      none    low      high    low     low
 4 Bueche-Rastrigin            high     strong      high    low      high    med.    low
 5 Linear Slope                none     none        high    none     high    none    none

 6 Attractive Sector           none     none        high    low      med.    none    none
 7 Step Ellipsoidal            none     none        high    low      high    none    none
 8 Rosenbrock                  low      none        none    none     med.    low     low
 9 Rosenbrock rotated          low      none        none    none     med.    low     low

10 Ellipsoidal high-cond.      none     none        none    high     high    none    none
11 Discus                      none     none        none    high     high    none    none
12 Bent Cigar                  none     none        none    high     high    none    none
13 Sharp Ridge                 none     none        none    low      med.    none    none
14 Different Powers            none     none        none    low      med.    none    none

15 Rastrigin multi-modal       high     strong      none    low      high    low     low
16 Weierstrass                 high     med.        none    med.     high    med.    low
17 Schaffer F7                 high     med.        none    low      med.    med.    high
18 Schaffer F7 mod. ill-cond.  high     med.        none    high     med.    med.    high
19 Griewank-Rosenbrock         high     strong      none    none     high    low     low

20 Schwefel                    med.     deceptive   none    none     high    low     low
21 Gallagher 101 Peaks         med.     none        none    med.     high    med.    low
22 Gallagher 21 Peaks          low      none        none    med.     high    med.    med.
23 Katsuura                    high     none        none    none     high    low     low
24 Lunacek bi-Rastrigin        high     weak        none    low      high    low     low

Table 1: Classification of the noiseless BBOB functions based on their properties (multi-modality, global structure, separability, variable scaling, homogeneity, basin sizes, global to local contrast). Predefined groups are separated by blank lines.

2.2 Low-Level Features

We consider six low-level feature classes in this work to characterize the structure of an unknown fitness landscape. The starting point is a data set of s different decision variable settings Xs with respective objective values Ys, generated by a random Latin hypercube (LH) design covering the decision space. We will denote the combination of parameter settings and corresponding function values by Ds = [Xs, Ys]. The size of the LH design increases linearly with the dimension d of the problem in order to account for the increasing problem complexity, i.e. the number of initial points is chosen as s = c · d, where c is a predefined constant.
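For illustration, the following R sketch shows how such an initial design Ds could be generated. It is not the authors' published ela.R code; the helper name make_initial_design and the default c = 50 are our own assumptions.

    ## Minimal sketch: build D_s = [X_s, Y_s] from a random Latin hypercube
    ## with s = c * d points, scaled to the box constraints of the problem.
    library(lhs)   # provides randomLHS()

    make_initial_design <- function(f, lower, upper, c_mult = 50) {
      d <- length(lower)                  # problem dimension
      s <- c_mult * d                     # design size grows linearly with d
      X <- randomLHS(s, d)                # s x d matrix with entries in [0, 1]
      X <- sweep(sweep(X, 2, upper - lower, "*"), 2, lower, "+")  # rescale to [lower, upper]
      list(X = X, y = apply(X, 1, f))     # decision variables and objective values
    }

    ## example: 5-dimensional sphere function on [-5, 5]^5
    D <- make_initial_design(function(x) sum(x^2), rep(-5, 5), rep(5, 5))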

The basic concepts of the feature classes are described in the following. It has to be noted that each class contains several lower-level sub-features which, in each case, can be generated using the same experimental data pool. All of these 50 sub-features are only briefly summarized here due to lack of space (see Table 2 for details).1

1 R source code to calculate the features for a given function can be obtained from http://ptr.p-value.net/gco10/ela.R. Additional plots and the raw data generated by these experiments are available from http://ptr.p-value.net/gco10/suppl.zip.

Convexity: Two random points from Xs are selected and a linear combination with random weights is formed. The difference of the objective value of the new point and the convex combination of the objective values of the two sampled points is computed. This is repeated 1000 times, and the average number of instances where the calculated difference is less than a predefined negative threshold (here −10^(−10)), related to the number of repetitions, estimates the probability of convexity. Linearity of the function is assumed in case all absolute differences are smaller than the absolute value of the specified threshold.

y-Distribution: The distribution of Ys is characterized by the skewness and kurtosis of the objective function values. The latter measures the degree of peakedness of the distribution, i.e. reflects whether the distribution is rather flat or peaked when compared to a normal distribution. The number of peaks of the distribution is estimated as well, as an indicator for multi-modality.
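A minimal sketch of the convexity estimate described above (our own illustrative code, not the referenced ela.R; the function name and its defaults are assumptions):

    ## Convexity features: compare f at a random convex combination of two
    ## design points with the convex combination of their objective values.
    estimate_convexity <- function(f, X, y, reps = 1000, eps = 1e-10) {
      deltas <- replicate(reps, {
        ij <- sample(nrow(X), 2)                         # two random points from X_s
        w  <- runif(1)                                   # random weight
        xnew <- w * X[ij[1], ] + (1 - w) * X[ij[2], ]
        f(xnew) - (w * y[ij[1]] + (1 - w) * y[ij[2]])    # difference described above
      })
      c(convex_p   = mean(deltas < -eps),                # share of clearly convex cases
        linear_p   = mean(abs(deltas) < eps),            # share of (near-)linear cases
        linear_dev = mean(abs(deltas)))                  # mean deviation from linearity
    }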

Levelset: The initial data set Ds is split into two classes by a specific objective level which works as a threshold. One possibility is to use the median for this, which will result in equally sized classes. Other choices studied are the upper and lower quartiles of the distribution of y. Linear (LDA), quadratic (QDA) and mixture discriminant analysis (MDA) are used to predict whether the objective values Ys fall below or exceed the calculated threshold. Multi-modal functions should result in several unconnected sublevel sets for the quantile of lower values, which can only be modeled by MDA, but not by LDA or QDA. The extracted low-level features are based on the distribution of the resulting cross-validated mean misclassification errors of each classifier.
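The sketch below (assumed names; a simplified stand-in for the actual feature code) estimates the cross-validated misclassification errors of LDA, QDA and MDA for one quantile threshold:

    ## Levelset features: split y at a quantile and cross-validate three
    ## discriminant classifiers on the resulting two-class problem.
    library(MASS)   # lda(), qda()
    library(mda)    # mda()

    levelset_mmce <- function(X, y, q = 0.5, folds = 10) {
      dat  <- data.frame(X, cls = factor(y < quantile(y, q)))
      fold <- sample(rep(seq_len(folds), length.out = nrow(dat)))
      err  <- matrix(NA, folds, 3, dimnames = list(NULL, c("lda", "qda", "mda")))
      for (k in seq_len(folds)) {
        train <- dat[fold != k, ]
        test  <- dat[fold == k, ]
        for (m in colnames(err)) {
          fit  <- switch(m,
                         lda = lda(cls ~ ., data = train),
                         qda = qda(cls ~ ., data = train),
                         mda = mda(cls ~ ., data = train))
          pred <- predict(fit, test)
          if (m != "mda") pred <- pred$class              # lda/qda return a list
          err[k, m] <- mean(pred != test$cls)             # misclassification error
        }
      }
      colMeans(err)   # e.g. levelset.lda_mmce_50 for q = 0.5
    }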

Meta-Model: Linear and quadratic regression models with or without interactions are fitted to the initial data Ds. The adjusted coefficient of determination R² is returned in each case as an indicator of model accuracy. Functions with variable scaling will not allow a good fit of regression models without interaction effects, and simple unimodal functions might be approximated by using a quadratic model. In addition, features are extracted which reflect the size relations of the model coefficients.
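A compact sketch of these fits in R (illustrative only; the exact set of extracted coefficients in the original code may differ):

    ## Meta-model features: adjusted R^2 of linear/quadratic fits to D_s and
    ## simple size relations of the model coefficients.
    metamodel_features <- function(X, y) {
      Xsq <- X^2
      colnames(Xsq) <- paste0("sq_", seq_len(ncol(X)))
      dat_lin  <- data.frame(X, y = y)
      dat_quad <- data.frame(X, Xsq, y = y)
      lin  <- lm(y ~ .,   data = dat_lin)        # linear model without interactions
      lini <- lm(y ~ .^2, data = dat_lin)        # linear model with two-way interactions
      quad <- lm(y ~ .,   data = dat_quad)       # adds pure quadratic terms
      cf <- abs(coef(lin)[-1])                                # linear coefficients
      qc <- abs(coef(quad)[grep("^sq_", names(coef(quad)))])  # quadratic term coefficients
      c(approx.linear_ar2      = summary(lin)$adj.r.squared,
        approx.lineari_ar2     = summary(lini)$adj.r.squared,
        approx.quadratic_ar2   = summary(quad)$adj.r.squared,
        approx.linear_min_coef = min(cf),
        approx.linear_max_coef = max(cf),
        approx.quadratic_cond  = max(qc) / min(qc))
    }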

Local search: A local search algorithm (Nelder-Mead) is started from a random sample of size N = 50 · d from Ds. The solutions are hierarchically clustered in order to identify the local optima of the function. The basin size is approximated by the number of local searches which terminate in each identified local optimum. Extracted features are the number of identified clusters, as an indicator for multi-modality, as well as characteristics related to the basin sizes around the identified local optima, reflected by the sizes of the particular clusters. In addition, summary statistics like the extrema and specific quartiles of the distribution of the number of function evaluations spent by the local search runs until termination are computed.
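A rough sketch of this procedure using base R (optim with Nelder-Mead, hclust); the cut height and all helper names are our own choices:

    ## Local search features: Nelder-Mead runs from sampled design points,
    ## followed by hierarchical clustering of the found solutions.
    local_search_features <- function(f, X, starts_per_dim = 50, cut_height = 0.1) {
      n_starts <- min(starts_per_dim * ncol(X), nrow(X))
      starts   <- X[sample(nrow(X), n_starts), , drop = FALSE]
      runs   <- apply(starts, 1, function(x0)
        optim(x0, f, method = "Nelder-Mead", control = list(maxit = 1000)))
      sols   <- t(sapply(runs, `[[`, "par"))               # local search end points
      vals   <- sapply(runs, `[[`, "value")
      fevals <- sapply(runs, function(r) r$counts["function"])
      cl     <- cutree(hclust(dist(sols)), h = cut_height) # one cluster per local optimum
      basins <- table(cl) / length(cl)                     # relative basin sizes
      best   <- cl[which.min(vals)]                        # cluster of the best optimum
      list(ls.n_local_optima  = length(unique(cl)),
           ls.best_basin_size = unname(basins[as.character(best)]),
           ls.feval_quantiles = quantile(fevals, c(0, .25, .5, .75, 1)))
    }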

Curvature: For 100 · d points from Ds, the gradient is numerically approximated at each point using Richardson’s extrapolation method (Linfield and Penny, 1989). Resulting features are summary statistics of the Euclidean norm of the gradient as well as of the ratios of the maximum and minimum values of the respective partial derivatives. Furthermore, the same statistics are applied to the condition number of the numerical approximation of the Hessian (Linfield and Penny, 1989).
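A sketch using the numDeriv package, whose grad() and hessian() functions default to Richardson extrapolation (the sample size handling and helper name are our own assumptions):

    ## Curvature features: numerical gradients and Hessian condition numbers
    ## at a subsample of the design points.
    library(numDeriv)   # grad() and hessian(), Richardson extrapolation by default

    curvature_features <- function(f, X, points_per_dim = 100) {
      idx   <- sample(nrow(X), min(points_per_dim * ncol(X), nrow(X)))
      grads <- t(apply(X[idx, , drop = FALSE], 1, function(x) grad(f, x)))
      hess_cond <- apply(X[idx, , drop = FALSE], 1, function(x) {
        ev <- eigen(hessian(f, x), symmetric = TRUE, only.values = TRUE)$values
        max(ev) / min(ev)        # maximum divided by minimum eigenvalue (cf. Table 2)
      })
      qs <- c(0, .25, .5, .75, 1)
      list(grad_norm  = quantile(sqrt(rowSums(grads^2)), qs),
           grad_scale = quantile(apply(abs(grads), 1, max) / apply(abs(grads), 1, min), qs),
           hess_cond  = quantile(hess_cond, qs))
    }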

As visualized in Fig. 1, the low-level feature classes cover different aspects of the defined high-level features.


Figure 1: Relationships between high-level features (grey) and low-level feature classes (white).

To empirically justify these relationships, the high-level features are compared to the computed low-level features on the BBOB test functions. Statistical classification techniques are used to predict the level of each high-level feature (see Table 1), e.g. low, medium or high multi-modality, from the values of the low-level features. Representative results are presented in section 5.

In addition, in section 4, the five predefined function groups of the BBOB contest (1. separable, 2. low or moderate conditioned, 3. high conditioned and unimodal, 4. multi-modal with adequate global structure, 5. multi-modal with weak global structure), as listed in Hansen et al. (2009), are predicted based on the low-level features.

3. MACHINE LEARNING METHODS

Modeling: In order to solve the various classification tasks posed in this article we will always use a random forest as proposed by Breiman (2001). This model creates an ensemble of unpruned classification trees based on bootstrapped versions of the training data. Predictions of the individual trees are aggregated by majority voting. As the following experiments will be complex and time-consuming enough, we do not want to burden ourselves with model selection and therefore choose a classifier which is reliable in the following sense: it can capture nonlinear relationships between features and classes and is known to exhibit strong performance in many applications. Although it might be tunable for specific problems, its performance usually does not degrade drastically if the settings of its hyperparameters are not optimal. This is in strong contrast to other competitive classification methods, e.g. SVMs. Finally, it is invariant under all monotone transformations of the features, and therefore different or irregular scalings constitute no problem for the forest.
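The paper performs these steps through mlr (Bischl, 2011); the snippet below is only a stripped-down illustration using the randomForest package, and feature_df with its factor column bbob_group is an assumed data layout, not the authors' actual data structure.

    ## Classification model: a random forest predicting the BBOB group of a
    ## function instance from its low-level feature vector.
    library(randomForest)

    ## feature_df: one row per (function, instance, dimension, repetition),
    ## low-level features as columns plus a factor column bbob_group.
    fit_group_model <- function(feature_df) {
      randomForest(bbob_group ~ ., data = feature_df, ntree = 500)
    }
    ## rf <- fit_group_model(feature_df)
    ## rf$err.rate[rf$ntree, "OOB"]   # internal out-of-bag error mentioned below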

Performance estimation: Although the misclassification error of the forest can usually be assessed in a computationally cheap and statistically unbiased way by using its internal out-of-bag bootstrap estimator, we opt here for a different approach. Since for each instance of a BBOB test function five identical replications exist, we want to avoid that the classifier simply memorizes the functions it has already seen. We therefore choose a specific 5-fold cross-validation strategy, which forces the classifier to generalize to function instances not seen in the training set. This is ensured by having all replications of a parametrized function completely in either the training or the test set (stated differently, we “cross-validate on the BBOB function instances”). Later on we study an even harder setting by calculating the performance of the classifier using “leave-one-function-out” resampling, which partitions the observations into 24 test sets corresponding to the BBOB functions.
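The following sketch shows one way to realize this instance-blocked cross-validation (the data frame feature_df and its columns instance and bbob_group are again assumed names):

    ## 5-fold cross-validation on BBOB function instances: all replications of
    ## a parametrized instance fall jointly into the training or the test set.
    cv_on_instances <- function(feature_df, folds = 5) {
      inst <- unique(feature_df$instance)
      fold_of_inst <- sample(rep(seq_len(folds), length.out = length(inst)))
      errs <- sapply(seq_len(folds), function(k) {
        test_inst <- inst[fold_of_inst == k]
        is_test   <- feature_df$instance %in% test_inst
        train <- feature_df[!is_test, names(feature_df) != "instance"]
        test  <- feature_df[ is_test, names(feature_df) != "instance"]
        rf <- randomForest(bbob_group ~ ., data = train)
        mean(predict(rf, test) != test$bbob_group)     # fold-wise MCE
      })
      mean(errs)                                       # cross-validated MCE
    }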

Group  Feature name                                Description

Meta-model features:
 1  approx.{linear,lineari}_ar2                    adjusted R² of the estimated linear regression model without and with interaction
 1  approx.linear_{min,max}_coef                   minimum and maximum value of the absolute values of the linear model coefficients
 2  approx.{quadratic,quadratici}_ar2              adjusted R² of the estimated quadratic regression model without or with interaction
 2  approx.quadratic_cond                          maximum absolute value divided by minimum absolute value of the coefficients of the quadratic terms in the quadratic model

Convexity features:
 3  convex.{linear,convex}_p                       estimated probability of linearity and convexity
 3  convex.linear_dev                              mean deviation from linearity

y-distribution features:
 4  distr.skewness_y                               skewness of the distribution of the function values
 4  distr.kurtosis_y                               kurtosis of the distribution of the function values
 4  distr.n_peaks                                  estimation of the number of peaks in the distribution of the function values

Levelset features:
 5  levelset.lda_mmce_{10,25,50}                   mean LDA misclassification error for function values split by the 0.1, 0.25, 0.5 quantile (estimated by CV)
 5  levelset.lda_vs_qda_{10,25,50}                 levelset.lda_mmce_{10,25,50} divided by levelset.qda_mmce_{10,25,50}
 6  levelset.qda_mmce_{10,25,50}                   mean QDA misclassification error for function values split by the 0.1, 0.25, 0.5 quantile (estimated by CV)
 7  levelset.mda_mmce_{10,25,50}                   mean MDA misclassification error for function values split by the 0.1, 0.25, 0.5 quantile (estimated by CV)

Local search features:
 8  ls.n_local_optima                              number of local optima estimated by the number of identified clusters
 8  ls.best_to_mean_contrast                       minimum value of the cluster centers divided by the mean value of the cluster centers
 8  ls.{best,worst}_basin_size                     proportion of points in the best and worst cluster
 8  ls.mean_other_basin_size                       mean proportion of points in all clusters but the cluster with the best cluster center
 8  ls.{min,lq,med,uq,max}_feval                   0, 0.25, 0.5, 0.75 and 1 quantile of the distribution of the number of function evaluations performed during a single local search

Curvature features:
 9  numderiv.grad_norm_{min,lq,med,uq,max}         minimum, lower quartile, median, upper quartile and maximum of the Euclidean norm of the estimated numerical gradient
 9  numderiv.grad_scale_{min,lq,med,uq,max}        minimum, lower quartile, median, upper quartile and maximum of the maximum divided by the minimum of the absolute values of the estimated partial gradients
10  numderiv.hessian_cond_{min,lq,med,uq,max}      minimum, lower quartile, median, upper quartile and maximum of the maximum divided by the minimum eigenvalue of the estimated Hessian matrix

Table 2: Low-level features; summary of sub-features within the feature classes and assignment to feature groups used for predicting BBOB function groups and high-level function properties by means of the SMS-GA (cf. sections 4 and 5).


Feature selection: Forward search adds features iteratively to the current set using a resampled estimate of a single performance measure, e.g. the misclassification error. We initially considered this technique, but abandoned it very quickly, as features will only be chosen w.r.t. their predictive power, without regard for their computational cost. Instead, we opt to employ a multi-objective optimization approach, in which we try to minimize the misclassification rate while also minimizing the cost of calculating the active features and the total number of feature groups used. We use an adapted SMS-EMOA, as proposed by Beume et al. (2007), for our discrete parameter space. By replacing the simulated binary crossover and polynomial mutation operators with the well known uniform binary crossover and uniform mutation operators we obtain the resulting algorithm, which we call SMS-GA.2 It generates not just one set of features, but many diverse ones with different trade-offs w.r.t. the above three criteria. The features are combined into feature groups (see Table 2), and each bit in the genome of the GA represents one such feature group. The rationale for this is that there are usually several ELA features which can be derived from one set of calculations. Therefore the optimization process can view them as one “feature”, since adding further features from the feature group will not increase the computational cost. This also reduces the size of the discrete search space.

2 An R implementation of the SMS-GA can be obtained from http://ptr.p-value.net/gco10/sms_ga.R.

Feature selection and all required machine learning algorithms were implemented in R using mlr (Bischl, 2011).

4. FEATURE SELECTION FOR BBOB GROUPS

4.1 Experimental Setup (Problem 1)

A random forest (RF) is used to classify the 24 × 5 × 5 × 5 = 3000 function instances (# functions × # function instances × # dimensions × # repetitions) of the BBOB contest (cf. Hansen et al. (2009) for details) into the five predefined BBOB groups based on the low-level features described in section 2.2. Features are selected with the purpose of extracting the most relevant ones for characterizing the fitness landscape of a given problem. In order to investigate the influence of the size s of the initial LH design, nine parallel computations of the low-level features have been conducted for all elements of s = {5000, 2500, 1250, 625, 500, 400, 300, 200, 100}. Each low-level feature group based on Ds is encoded by one bit which can be active or switched off, resulting in a bit string of length 90. Each group name is supplemented by the elements of s; for example, the first nine elements of the bit string encode the features (approx.linear_5000, ..., approx.linear_100).

The three quality criteria for the SMS-GA optimization are specified as follows: (1) The random forest is cross-validated on the function instances as described in section 3, and the MCE is used to evaluate its predictive quality. (2) In case of extremely time-consuming fitness function evaluations, it is impossible to evaluate a huge number of sample points, which would rule out the use of, e.g., the local search features. Thus, the computational complexity of the low-level features is considered as well. It is reflected by the number of fitness function evaluations (NFE) required for the computation of the features. Details are provided in Table 3. (3) Model complexity is addressed by the number of selected features (NSF).

All three criteria are linearly transformed to the interval [0, 1] by dividing each of them by its maximal value in order to ensure comparable scales and not to favor a criterion during the multi-objective optimization.

The SMS-GA is utilized to simultaneously minimize the three criteria MCE, NFE and NSF, operating on the 90-dimensional bit string as described above. This Pareto front approximation facilitates the selection of a solution with respect to available computational resources and time constraints.

Ten runs of the SMS-GA are executed in parallel using a population size of µ = 20, 2000 fitness evaluations, nadir point nd = (1, 1, 1), uniform binary crossover with probability pc = 0.25 and a mutation rate of pm = 0.05. The latter two parameters are set in accordance with usual recommendations in the GA literature. All members of the archives of non-dominated solutions are merged, and the non-dominated solutions within this set of points are determined; these represent the final approximation of the Pareto front of the problem.

Feature group        Cost (NFE) of the selected bits

convex               Σ_{i=1..l} s̃_i + l · 1000
ls                   Σ_{i=1..l} (s̃_i + 50 · d · lsF̄E(s_i))
numderiv.grad        Σ_{i=1..l} s̃_i + l · 100 · d²
numderiv.hessian     Σ_{i=1..l} s̃_i + l · 100 · d³
all others           Σ_{i=1..l} s̃_i

Table 3: Cost (i.e. NFE) of the selected bits. The vector s̃ = (s̃_1, ..., s̃_l) contains the different initial design sizes within the selected features; lsF̄E reflects the mean number of FE of all local searches for the respective initial design; d equals the dimensionality of the test function.
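As a worked illustration of Table 3, the NFE cost of a selected bit string could be computed as follows (nfe_cost, the bits data frame and ls_mean_fe are hypothetical names; the formulas are taken directly from the table):

    ## NFE cost of the active bits: one row of `bits` per active bit, holding
    ## its feature group and initial design size s; d is the problem dimension;
    ## ls_mean_fe maps a design size to the mean local search length lsFE(s).
    nfe_cost <- function(bits, d, ls_mean_fe = NULL) {
      cost_one <- function(group, s) {
        group <- as.character(group)
        s + switch(group,
                   convex           = 1000,                          # 1000 extra evaluations
                   ls               = 50 * d * ls_mean_fe[as.character(s)],
                   numderiv.grad    = 100 * d^2,                     # gradients at 100*d points
                   numderiv.hessian = 100 * d^3,                     # Hessians at 100*d points
                   0)                                                # all other groups: design only
      }
      sum(mapply(cost_one, bits$group, bits$s))
    }

    ## example: convex_625 and distr_500 selected for a 10-dimensional problem
    ## nfe_cost(data.frame(group = c("convex", "distr"), s = c(625, 500)), d = 10)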

Figure 2: Plot of the hypervolume dominated by the active population after each function evaluation for the two optimization problems (10D; panels: Problem 1 and Problem 2; x-axis: # function evaluations, y-axis: dominated hypervolume). The colors denote the different runs of the SMS-GA.


In Mersmann et al. (2010) it was shown that the performance of the competing algorithms in the BBOB contest differs widely for lower (2–3) and higher dimensions (5–20). In order to analyze the effect of increasing dimension on the selection of the low-level features, the analysis is performed separately for the decision space dimensions d = (5, 10, 20), since these are of primary interest for practical applications.

4.2 Results

The combined non-dominated solutions of the 10 runs for the 10D case are shown in Fig. 4. Looking at the plot, we see that for the most part features based on small initial designs are chosen. This alone is an encouraging result because it implies that using just a few function evaluations we can characterize the 24 BBOB test functions to such an extent that we can, in the extreme case, perfectly predict to which BBOB function group they belong. One surprising aspect is that for some individuals the chosen feature set seems redundant. For example, the first solution, with a misclassification rate of 0 and a cost of 10^−2.4, uses both the convex.(...)_625 and the convex.(...)_5000 features. We believe that, given more time, the SMS-GA would likely eliminate one of the two. From Fig. 2 it becomes obvious that the SMS-GA has not finally converged with respect to the obtained hypervolume. We plan to analyze this further. Another important thing to note is that there appears to be no feature which is required regardless of the chosen trade-off. Therefore the optimization process seems to have generated truly diverse solutions, not just in the objective space but also in the parameter space.

Figure 3: Scatter plot of two features (convex.linear_dev_5000 and distr.kurtosis_y_5000), selected via forward selection and cross-validation on the function instances (10D), used to predict the BBOB function group. The color encodes the function group (labeled by function_instance) and the dark circles mark misclassified function instances during the cross-validation.


Fig. 5 visualizes the normalized ranks of important features within the BBOB groups, i.e. those features for which the feature groups are active in at least ten of the solutions generated by the SMS-GA. Though the rank structures are not clearly distinguishable in most cases, the largest differences can be detected for the “levelset” and “approx.linear” features, especially with respect to the fifth BBOB group. However, an effect of the problem dimension on the feature values is observable, which justifies the separate analysis for the different dimensions.

Finally, let us point out that the chosen scenario might result in an optimistically biased misclassification rate. Given that we have differently parametrized instances of the same test function in both the training and the test set, the job of extrapolating from the “known” behavior of the training set of test functions to the “unknown” test functions is much easier than if we had opted to leave out the whole set of instances of a test function in each cross-validation fold. To illustrate this effect, we can consider a simplification of the optimization done in this section. We can use simple forward selection, as described in section 3, to find two relevant features to predict the BBOB function group of a test function. Fig. 3 shows such a pair of features for the 10D test functions. We clearly observe that the only misclassified instances, those with a dark circle behind them, are functions which are close to a cluster of other test function instances. But looking at the bottom left of the plot, we see an isolated cluster of test function instances all belonging to test function 2. If we had completely removed function 2 in the training phase, how could the classifier possibly know that functions in this region belong to BBOB function group 1? A more likely scenario is that functions in this region of the plot would be assigned to class 2 or 3. Results in the 5D and 20D cases are similar, and the same plots as shown here are provided for these two cases in the supplementary material.

Figure 4: Bitplot of the overall best solutions (10D). Feature groups which are never selected are not shown to save space. The feature groups are ordered, from bottom to top, according to how frequently they occur in the solution sets. Each column represents one individual in the final set of non-dominated solutions. These solutions are again ordered according to the first two objectives. The axis label shows the MCE (in percent) and the cost (on log10 scale) of the solution. Black dots denote active features; grey dots mean the feature is not part of the solution vector.

Figure 5: Parallel coordinate plot of the 31 features present in at least 10 of the final non-dominated solutions to the problem studied in section 4. Each line represents one calculation of the features on one BBOB test function instance (red: 5D, green: 10D, blue: 20D). The x-axis represents the ranks of the feature values within the groups, scaled by the overall maximum rank.

Figure 6: Probabilities of belonging to a final solution of problem 1 and 2 for all feature groups of Fig. 4.

5. FEATURE SELECTION FOR HIGH-LEVEL FEATURES

In general, knowledge of specific characteristics of the fitness landscape of a given problem is desired independently of the BBOB function grouping. More specifically, a direct prediction of function properties as presented in Table 1, such as the level of multi-modality, has to be provided. Especially the ability to differentiate between multi-modal problems with and without global structure will facilitate the selection of an appropriate algorithm for the optimization problem at hand.

5.1 Experimental Setup (Problem 2)

The experimental setup is very similar to the one of section 4.1. The main difference is that in this case seven different classification problems are solved in parallel, i.e. the level of each high-level feature is predicted based on the low-level feature groups encoded in the 90-dimensional bit string.

The optimization criteria for the SMS-GA remain the same except for the computation of the misclassification error. The cross-validated MCE is computed for each of the seven classification problems, and the maximum of these errors (MaxMCE) is used as the quality criterion reflecting model accuracy in the multi-objective optimization problem. Therefore the optimization goal is to find sets of low-level features which offer optimal trade-offs for the simultaneous minimization of MaxMCE as well as NFE and NSF. Six runs of the SMS-GA with 1000 evaluations are conducted, and the analysis is again carried out separately for 5, 10 and 20 dimensions.
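A minimal sketch of this aggregation, reusing the instance-blocked CV helper sketched in section 3 (the property names and the feature_df layout are again our own illustrative assumptions):

    ## MaxMCE: the worst cross-validated MCE over the seven high-level
    ## property classification tasks for one candidate feature set.
    properties <- c("multim", "gl_struc", "separ", "scaling", "homog", "basins", "gl_loc")
    max_mce <- max(sapply(properties, function(p) {
      df <- feature_df[, !(names(feature_df) %in% properties)]
      df$bbob_group <- factor(feature_df[[p]])    # reuse cv_on_instances() unchanged
      cv_on_instances(df)
    }))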

5.2 Results and Comparison of Feature Sets

The optimization task posed in this section is much harder, since we are optimizing several classification problems simultaneously. This is reflected by the higher MaxMCE of the final non-dominated individuals. At the same time, by looking at Fig. 4, we see that the number of feature groups per individual is higher when compared to the problem in the previous section. Maybe even more surprising is the fact that the feature groups used are not necessarily the same ones used in the other optimization problem.

Figure 7: Scatter plot of the rank transform of two variables (numderiv.grad_norm_max_5000 and ls.best_basin_size_5000) selected to predict the multi-modality property using leave-one-function-out resampling. Each test function instance is represented by its BBOB test function number, and its color encodes the multi-modality class from Tab. 1. The colored circles mark instances which are misclassified (red: incorrect in the two-class setting, blue: incorrect in the four-class setting, green: incorrect in both settings). Because the local search feature is equal to one for many of the functions, the first ≈ 350 ranks should be treated as equal. They are randomly assigned here to aid the visualization.

Since, given a function’s high-level features, we can deduce the BBOB function group it belongs to, we would have expected the two solution sets to be more similar. This can also be seen from Fig. 6, in which the probabilities of belonging to a final solution for each feature group of Fig. 4 are visualized for both problems. Only few feature groups, located close to the main diagonal, have roughly equal probability of appearance in the solutions of both problems. The unbalanced number of active feature groups for the two problems becomes obvious as well, as much fewer points are located below the diagonal than above. One reason for the large number of active bits in the solutions of problem 2 might be that, again, the SMS-GA has not converged within the afforded number of function evaluations. Another possibility to further reduce the active bits of the solutions might be some kind of backward search at the end, or even the inclusion of such an operator in the GA. One should also keep in mind that possibly very different feature groups are helpful in predicting the seven properties.

Additionally, the feature groups with extreme probabilities are labeled for each problem. For problem 1, the features of the y-distribution group are of higher importance than the convexity features, which is contrary to the situation in the second problem.

6. DISCUSSION AND OUTLOOK

A crucial aspect in optimization is the ability to identify the most suitable optimization algorithm for a (black-box) optimization problem at hand. This procedure can be based on key characteristics of the fitness landscape, such as the degree of global structure or the number of local optima, both of which have proven to have a strong influence on the performance of different optimizers and allow us to differentiate between them. As those properties are usually unknown for practical applications, we present an approach to automatically extract low-level features from systematically sampled points in the decision space. Examples of these are the probability of convexity, the estimated number of local optima based on local search techniques, or the accuracy of fitted meta-models.

The relevance of the proposed low-level features is demonstrated by successfully predicting the predefined function groups of the BBOB’09/’10 contest for all function instances based on the estimated low-level features with only marginal classification errors. Next to the accuracy of the predictions, the cost of calculating the features is considered, as well as the model complexity measured by the number of chosen feature groups. Using multi-objective optimization techniques, the Pareto front of this problem can be approximated, offering different trade-off solutions to choose from. In addition, and much more relevant for practical applications, a highly accurate prediction of high-level function features derived from expert knowledge is achieved independently of the BBOB function groupings. Contrary to our initial expectation, far fewer sample points than expected are required to calculate the features needed for highly accurate classifications, making this approach, to a certain extent, applicable even for expensive real-world optimization problems.

In future work we will study how the performance of the optimization algorithms which took part in the BBOB contests relates to the proposed low-level features. The aim is to identify algorithms which perform well on different subsets of functions or function groups from the test set, which in turn can be characterized by the computed low-level features. From there, a generalization to a new optimization problem should be possible by computing its low-level features and selecting the best performing algorithm for these characteristics in the BBOB competition.

As discussed in section 5, the scenario chosen for our optimization does not necessarily have to hold. It may in fact be desirable to extrapolate to completely new functions instead of just new instances of the same test function. We have done some preliminary work in this direction, using leave-one-function-out cross-validation. As an example of the results one can expect here, we again chose the two best variables to predict the multi-modality feature using a random forest with forward feature selection. Only the 50 features with s = 5000 are used, as we disregard their costs in this small experiment. We consider both the original 4-class problem and a 2-class version, where we try to predict whether the multi-modality property is “none” or not. The results for the 5D case are shown in Fig. 7. While we only achieve an accuracy of ≈ 73% on the 4-class problem, we can offer some insights into why this is the case. Consider, for example, function 5. It lies isolated in the bottom left corner of the plot. We conjecture that there are not enough “similar” functions in the BBOB test set for the classifier to correctly assign this function to the “no multi-modality” class. On the other hand, the 2-class problem can be solved with an error of less than 4%. Surprisingly, using all 50 features in the random forest does not really improve the performance for the 4-class problem and even leads to a degradation in the 2-class task.

While it is desirable for a test set used in a benchmarking setting to be as small as possible, it is a necessary requirement to have a few generally similar functions in the set from a machine learning perspective. We feel that finding such a “space filling” set of test functions is one of the great challenges of benchmarking. Using the low-level features proposed in this article as a tool to judge the distribution of the chosen test functions in the space of all possible test functions might be a viable way to approach this problem.

A promising perspective would therefore be the extension of the BBOB test set by functions with specific characteristics, so that a more balanced distribution with respect to the levels of the high-level features is provided. The introduced low-level features could be used as the basis for a system that automatically generates test functions with certain desired properties, e.g. by means of genetic programming. Using such a system, methods from design of experiments could be used to systematically sample functions for predefined levels of the low-level features, resulting in a well balanced set of test functions. Lastly, the inclusion of real-world benchmarking scenarios is also highly desirable in order to minimize the gap between artificial test functions and the challenges of practical optimization tasks.

Acknowledgements

This work was partly supported by the Collaborative Research Center SFB 823 (B4), the Graduate School of Energy Efficient Production and Logistics and the Research Training Group “Statistical Modelling” of the German Research Foundation.

References

BEUME, N., NAUJOKS, B., AND EMMERICH, M. 2007. SMS-EMOA: Multiobjective selection based on dominated hypervolume. European Journal of Operational Research 181, 3 (September), 1653–1669.

BISCHL, B. 2011. mlr: Machine learning in R. http://mlr.r-forge.r-project.org.

BREIMAN, L. 2001. Random forests. Machine Learning 45, 1, 5–32.

HANSEN, N., AUGER, A., FINCK, S., AND ROS, R. 2009. Real-parameter black-box optimization benchmarking 2009: Experimental setup. Tech. Rep. RR-6828, INRIA.

HANSEN, N., FINCK, S., ROS, R., AND AUGER, A. 2009. Real-parameter black-box optimization benchmarking 2009: Noiseless functions definitions. Research Report RR-6829, INRIA.

LINFIELD, G. R. AND PENNY, J. E. T. 1989. Microcomputers in Numerical Analysis. Halsted Press, New York.

MERSMANN, O., PREUSS, M., AND TRAUTMANN, H. 2010. Benchmarking evolutionary algorithms: Towards exploratory landscape analysis. In PPSN XI: Proceedings of the 11th International Conference on Parallel Problem Solving from Nature, R. Schaefer et al., Eds. Lecture Notes in Computer Science 6238. Springer, 71–80.

PREUSS, M. AND BARTZ-BEIELSTEIN, T. 2011. Experimental analysis of optimization algorithms: Tuning and beyond. In Theory and Principled Methods for Designing Metaheuristics, Y. Borenstein and A. Moraglio, Eds. Springer.

SUGANTHAN, P. N., HANSEN, N., LIANG, J. J., DEB, K., CHEN, Y.-P., AUGER, A., AND TIWARI, S. 2005. Problem definitions and evaluation criteria for the CEC 2005 special session on real-parameter optimization. Tech. Rep., Nanyang Technological University, Singapore. http://www.ntu.edu.sg/home/EPNSugan.