Fast estimation of Gaussian mixture models for image segmentation

Machine Vision And Applications manuscript No.(will be inserted by the editor)

Fast Estimation of Gaussian Mixture Models for Image Segmentation

Nicola Greggio · Alexandre Bernardino · Cecilia Laschi · Paolo Dario · JoseSantos-Victor

Received: date / Accepted: date

Abstract The Expectation-Maximization algorithm has beenclassically used to find the maximum likelihood estimates ofparameters in probabilistic models with unobserved data,for instance, mixture models. A key issue in such problemsis the choice of the model complexity. The higher the num-ber of components in the mixture, the higher will be thedata likelihood, but also the higher will be the computa-tional burden and data overfitting. In this work we proposea clustering method based on the expectation maximizationalgorithm that adapts on-line the number of componentsof a finite Gaussian mixture model from multivariate data.Or method estimates the number of components and theirmeans and covariances sequentially, without requiring anycareful initialization. Our methodology starts from a singlemixture component covering the whole data set and sequen-tially splits it incrementally during expectation maximiza-tion steps. The coarse to fine nature of the algorithm reducethe overall number of computations to achieve a solution,which makes the method particularly suited to image seg-mentation applications whenever computational time is anissue. We show the effectiveness of the method in a seriesof experiments and compare it with a state-of-the-art alter-native technique both with synthetic data and real images,

N. Greggio⋆ , C. Laschi, P. DarioARTS Lab - Scuola Superiore S.Anna, Polo S.Anna ValderaViale R. Piaggio, 34 - 56025 Pontedera, ItalyTel.: +39-049-8687385Fax: +39-049-8687385E-mail: [email protected]

N. Greggio, Alexandre Bernardino, Jose Santos-VictorInstituto de Sistemas e Robotica, Instituto Superior Tecnico1049-001 Lisboa, Portugal

Paolo DarioCRIM Lab - Scuola Superiore S.Anna, Polo S.Anna ValderaViale R. Piaggio, 34 - 56025 Pontedera, Italy

including experiments with images acquired from the iCubhumanoid robot.

Keywords Image Processing· Unsupervised Learning·Self-Adapting Gaussians Mixtures· Expectation Maximiza-tion ·Machine Learning· Clustering

1 Introduction

Image segmentation is a key low level perceptual capabilityin many robotics related applications, as a support functionfor the detection and representation of objects and regionswith similar photometric properties. Several applications inhumanoid robots [27], rescue robots [3], or soccer robots[12], [14] rely on some sort on image segmentation. Addi-tionally, many other fields of image analysis depend on theperformance and limitations of existing image segmentationalgorithms: video surveillance [6], medical imaging [8], facerecognition [23], and database retrieval [31] are some exam-ples. In this work we study and develop a fast method forfitting gaussian mixtures to a set of data, with applicationsto image segmentation.

Unsupervised image segmentation is a typical cluster-ing problem: Image pixels must be grouped into ”classes”according to some similarity criteria (pixel color, proxim-ity, etc.). The classes, also calledclusters, are detected auto-matically, i.e. without any input by an external agent. Thismethod finds applications in many fields, such as in im-age processing (e.g. face-color segmentation [11] or posturerecognition [2]) , sound analysis [18], segmentation of mul-tivariate medical images (a wide overview can be fond here[20]).

The techniques for unsupervised learning range from Ko-honen maps [21] [22], Growing Neural gas [10], [15], k-means [25], to Independent component analysis [5], [16],

https://www.researchgate.net/publication/222489418_Independent_Component_Analysis_a_New_Concept?el=1_x_8&enrichId=rgreq-0f7733d6-66e7-42c5-8c4c-029715e86dd1&enrichSource=Y292ZXJQYWdlOzIyNjM2ODQwNjtBUzoxMDM3NDYyMzk0Njc1MjRAMTQwMTc0NjQzOTcwMQ==

https://www.researchgate.net/publication/224839970_Some_Methods_for_Classification_and_Analysis_of_MultiVariate_Observations?el=1_x_8&enrichId=rgreq-0f7733d6-66e7-42c5-8c4c-029715e86dd1&enrichSource=Y292ZXJQYWdlOzIyNjM2ODQwNjtBUzoxMDM3NDYyMzk0Njc1MjRAMTQwMTc0NjQzOTcwMQ==

https://www.researchgate.net/publication/6749335_Robust_segmentation_of_cerebral_arterial_segments_by_a_sequential_Monte_Carlo_method_Particle_filtering?el=1_x_8&enrichId=rgreq-0f7733d6-66e7-42c5-8c4c-029715e86dd1&enrichSource=Y292ZXJQYWdlOzIyNjM2ODQwNjtBUzoxMDM3NDYyMzk0Njc1MjRAMTQwMTc0NjQzOTcwMQ==

https://www.researchgate.net/publication/222701793_Simulation_of_small_humanoid_robots_for_soccer_domain?el=1_x_8&enrichId=rgreq-0f7733d6-66e7-42c5-8c4c-029715e86dd1&enrichSource=Y292ZXJQYWdlOzIyNjM2ODQwNjtBUzoxMDM3NDYyMzk0Njc1MjRAMTQwMTc0NjQzOTcwMQ==

https://www.researchgate.net/publication/220464836_An_adaptive_focus-of-attention_model_for_video_surveillance_and_monitoring?el=1_x_8&enrichId=rgreq-0f7733d6-66e7-42c5-8c4c-029715e86dd1&enrichSource=Y292ZXJQYWdlOzIyNjM2ODQwNjtBUzoxMDM3NDYyMzk0Njc1MjRAMTQwMTc0NjQzOTcwMQ==

https://www.researchgate.net/publication/2428368_A_Growing_Neural_Gas_Network_Learns_Topologies?el=1_x_8&enrichId=rgreq-0f7733d6-66e7-42c5-8c4c-029715e86dd1&enrichSource=Y292ZXJQYWdlOzIyNjM2ODQwNjtBUzoxMDM3NDYyMzk0Njc1MjRAMTQwMTc0NjQzOTcwMQ==

https://www.researchgate.net/publication/14034187_The_EM_algorithm_in_medical_imaging?el=1_x_8&enrichId=rgreq-0f7733d6-66e7-42c5-8c4c-029715e86dd1&enrichSource=Y292ZXJQYWdlOzIyNjM2ODQwNjtBUzoxMDM3NDYyMzk0Njc1MjRAMTQwMTc0NjQzOTcwMQ==

https://www.researchgate.net/publication/225335260_Classification_of_face_images_using_local_iterated_function_systems?el=1_x_8&enrichId=rgreq-0f7733d6-66e7-42c5-8c4c-029715e86dd1&enrichSource=Y292ZXJQYWdlOzIyNjM2ODQwNjtBUzoxMDM3NDYyMzk0Njc1MjRAMTQwMTc0NjQzOTcwMQ==

https://www.researchgate.net/publication/225267314_Analysis_of_a_Simple_Self-Organizing_Process?el=1_x_8&enrichId=rgreq-0f7733d6-66e7-42c5-8c4c-029715e86dd1&enrichSource=Y292ZXJQYWdlOzIyNjM2ODQwNjtBUzoxMDM3NDYyMzk0Njc1MjRAMTQwMTc0NjQzOTcwMQ==

https://www.researchgate.net/publication/3450542_Learning_Object_Affordances_From_Sensory--Motor_Coordination_to_Imitation?el=1_x_8&enrichId=rgreq-0f7733d6-66e7-42c5-8c4c-029715e86dd1&enrichSource=Y292ZXJQYWdlOzIyNjM2ODQwNjtBUzoxMDM3NDYyMzk0Njc1MjRAMTQwMTc0NjQzOTcwMQ==

https://www.researchgate.net/publication/222570068_Mixture_model_for_face-color_modeling_and_segmentation?el=1_x_8&enrichId=rgreq-0f7733d6-66e7-42c5-8c4c-029715e86dd1&enrichSource=Y292ZXJQYWdlOzIyNjM2ODQwNjtBUzoxMDM3NDYyMzk0Njc1MjRAMTQwMTc0NjQzOTcwMQ==

https://www.researchgate.net/publication/232650500_Constrained_EM_estimates_for_harmonic_source_separation?el=1_x_8&enrichId=rgreq-0f7733d6-66e7-42c5-8c4c-029715e86dd1&enrichSource=Y292ZXJQYWdlOzIyNjM2ODQwNjtBUzoxMDM3NDYyMzk0Njc1MjRAMTQwMTc0NjQzOTcwMQ==

https://www.researchgate.net/publication/10976779_Measurement_of_the_Distribution_of_Red_Blood_Cell_Deformability_Using_an_Automated_Rheoscope?el=1_x_8&enrichId=rgreq-0f7733d6-66e7-42c5-8c4c-029715e86dd1&enrichSource=Y292ZXJQYWdlOzIyNjM2ODQwNjtBUzoxMDM3NDYyMzk0Njc1MjRAMTQwMTc0NjQzOTcwMQ==

https://www.researchgate.net/publication/220797763_An_Application_of_Gaussian_Mixtures_Colour_Segmenting_for_the_Four_Legged_League_Using_HSI_Colour_Space?el=1_x_8&enrichId=rgreq-0f7733d6-66e7-42c5-8c4c-029715e86dd1&enrichSource=Y292ZXJQYWdlOzIyNjM2ODQwNjtBUzoxMDM3NDYyMzk0Njc1MjRAMTQwMTc0NjQzOTcwMQ==

https://www.researchgate.net/publication/224058237_Integration_of_Tracking_and_Adaptive_Gaussian_Mixture_Models_for_Posture_Recognition?el=1_x_8&enrichId=rgreq-0f7733d6-66e7-42c5-8c4c-029715e86dd1&enrichSource=Y292ZXJQYWdlOzIyNjM2ODQwNjtBUzoxMDM3NDYyMzk0Njc1MjRAMTQwMTc0NjQzOTcwMQ==

https://www.researchgate.net/publication/221669764_Self-organised_formation_of_topologically_correct_feature_map?el=1_x_8&enrichId=rgreq-0f7733d6-66e7-42c5-8c4c-029715e86dd1&enrichSource=Y292ZXJQYWdlOzIyNjM2ODQwNjtBUzoxMDM3NDYyMzk0Njc1MjRAMTQwMTc0NjQzOTcwMQ==

https://www.researchgate.net/publication/237067978_Independent_Component_Analysis?el=1_x_8&enrichId=rgreq-0f7733d6-66e7-42c5-8c4c-029715e86dd1&enrichSource=Y292ZXJQYWdlOzIyNjM2ODQwNjtBUzoxMDM3NDYyMzk0Njc1MjRAMTQwMTc0NjQzOTcwMQ==

https://www.researchgate.net/publication/220797294_Bridging_the_Gap_Between_Simulation_and_Reality_in_Urban_Search_and_Rescue?el=1_x_8&enrichId=rgreq-0f7733d6-66e7-42c5-8c4c-029715e86dd1&enrichSource=Y292ZXJQYWdlOzIyNjM2ODQwNjtBUzoxMDM3NDYyMzk0Njc1MjRAMTQwMTc0NjQzOTcwMQ==

2

etc. Particularly interesting is the Expectation Maximiza-tion algorithm applied to Gaussian mixtures which allowsto model complex probability distribution functions. Fittinga mixture model to the distribution of the data is equivalent,in some applications, to the identification of the clusters withthe mixture components [26].

1.1 Related Work

Expectation-Maximization (EM) algorithm is the standardapproach for learning the parameters of the mixture model[13]. It is demonstrated that it always converges to a localoptimum [7]. However, it also presents some drawbacks.For instance, EM requires ana-priori selection of modelorder, namely, the number of components to be incorpo-rated into the model, and its results depend on initialization.The higher the number of components within the mixture,the higher will be the total log-likelihood. Unfortunately, in-creasing the number of gaussians will lead to overfitting andto an increase of the computational burden.

Particularly in image segmentation applications, wherethe number of points is in the order of several hundred thou-sand, finding the best compromise between precision, gener-alization and speed is a must. A common approach to choosethe number of components is trying different configurationsbefore determining the optimal solution, e.g. by applying thealgorithm for a different number of components, and select-ing the best model according to appropriate criteria.

Different approaches can be used to select the best num-ber of components. These can be divided into two main classes:off-lineandon-linetechniques.

The first ones evaluate the best model by executing in-dependent runs of the EM algorithm for many different ini-tializations and number of components, and evaluating eachestimate with criteria that penalize complex models (e.g. theAkaike Information Criterion (AIC) [30] and the RissanenMinimum Description Length (MDL) [29]). These, in orderto be effective, have to be evaluated for every possible num-ber of models under comparison. Therefore, it is clear that,for having a sufficient search range the complexity goes withthe number of tested models as well as the model parame-ters.

The second ones start with a fixed set of models andsequentially adjust their configuration (including the num-ber of components) based on different evaluation criteria.Pernkopf and Bouchaffra proposed a Genetic-Based EM Al-gorithm capable of learning gaussians mixture models [28].They first selected the number of components by means ofthe minimum description length (MDL) criterion. A combi-nation of genetic algorithms with the EM has been explored.

Uedaet Al. proposed a split-and-merge EM algorithmto alleviate the problem of local convergence of the EM

method [32]. Subsequently, Zhanget Al. introduced anothersplit-and-merge technique [34]. Merge an split criterion isefficient in reducing number of model hypothesis, and it isoften more efficient than exhaustive, random or genetic al-gorithm approaches. To this aim, particularly interestingisthe method proposed by Figueiredo and Jain, which goes onstep by step until convergence using only merge operations[9].

A different approach has been explored by Ketchantanget Al., by using Pearson mixture model (PMM) in conjunc-tion with Gaussian copula instead of pure Gaussian mixturedistributions [19] The model is combined with a Kalman fil-ter in order to predict the object’s position within the nextframe. However, despite its validity in tracking multiple ob-jects and its robustness against different illumination con-texts, the required computational burden is higher than thecorresponding Gaussian mixture based models.

1.2 Our contribution

We propose an algorithm that simultaneously determines thenumber of components and the parameters of the mixturemodel with only split operations. The particularity of ourmodel is that it starts from only one mixture componentprogressively adapting the mixture by splitting componentswhen necessary.

Our formulation guarantees the following advantages.First, the initialization is very simple. Diversely from thestandard EM algorithm, or any EM technique that starts withmore than one component, where initialization is often ran-dom, in our case the initialization is deterministic: The ini-tial component is a single gaussian that best fits the wholedata set. Moreover, the splitting criterion is also determinis-tic, as it will be explained later. Hence, the whole algorithmis deterministic: By applying the same algorithm to the sameinput data we will always get the same results. Second, it isa technique with computational cost lower than other ap-proaches. The algorithm complexity will be analyzed laterin the paper and compared to the alternatives.

In a sense, we approach the problem in a different wayto Figueiredo and Jain. They start the computation with themaximum possible number of mixture components. Althoughthat work is among the most effective to date, it becomestoo computationally expensive for image segmentation ap-plications, especially during the first iterations. It starts withthe maximum number of components, decreasing it progres-sively until the whole space of possibilities has been ex-plored, whereas our method starts with a single componentand increases its number until a good performance is at-tained.

https://www.researchgate.net/publication/242567994_Maximum_Likelihood_Estimation_from_Incomplete_Data?el=1_x_8&enrichId=rgreq-0f7733d6-66e7-42c5-8c4c-029715e86dd1&enrichSource=Y292ZXJQYWdlOzIyNjM2ODQwNjtBUzoxMDM3NDYyMzk0Njc1MjRAMTQwMTc0NjQzOTcwMQ==

https://www.researchgate.net/publication/225922941_Pearson-based_Mixture_Model_for_Color_Object_Tracking?el=1_x_8&enrichId=rgreq-0f7733d6-66e7-42c5-8c4c-029715e86dd1&enrichSource=Y292ZXJQYWdlOzIyNjM2ODQwNjtBUzoxMDM3NDYyMzk0Njc1MjRAMTQwMTc0NjQzOTcwMQ==

https://www.researchgate.net/publication/229193262_Akaike_Information_Criterion_Statistics_D?el=1_x_8&enrichId=rgreq-0f7733d6-66e7-42c5-8c4c-029715e86dd1&enrichSource=Y292ZXJQYWdlOzIyNjM2ODQwNjtBUzoxMDM3NDYyMzk0Njc1MjRAMTQwMTc0NjQzOTcwMQ==

https://www.researchgate.net/publication/222559855_EM_algorithms_for_Gaussian_mixtures_with_split-and-merge_operation?el=1_x_8&enrichId=rgreq-0f7733d6-66e7-42c5-8c4c-029715e86dd1&enrichSource=Y292ZXJQYWdlOzIyNjM2ODQwNjtBUzoxMDM3NDYyMzk0Njc1MjRAMTQwMTc0NjQzOTcwMQ==

https://www.researchgate.net/publication/220183069_Unsupervised_Learning_of_Finite_Mixture_Models?el=1_x_8&enrichId=rgreq-0f7733d6-66e7-42c5-8c4c-029715e86dd1&enrichSource=Y292ZXJQYWdlOzIyNjM2ODQwNjtBUzoxMDM3NDYyMzk0Njc1MjRAMTQwMTc0NjQzOTcwMQ==

https://www.researchgate.net/publication/230675486_Stochastic_Complexity_in_Statistical_Inquiry?el=1_x_8&enrichId=rgreq-0f7733d6-66e7-42c5-8c4c-029715e86dd1&enrichSource=Y292ZXJQYWdlOzIyNjM2ODQwNjtBUzoxMDM3NDYyMzk0Njc1MjRAMTQwMTc0NjQzOTcwMQ==

https://www.researchgate.net/publication/7641987_Genetic-based_EM_algorithm_for_learning_Gaussian_mixture_model?el=1_x_8&enrichId=rgreq-0f7733d6-66e7-42c5-8c4c-029715e86dd1&enrichSource=Y292ZXJQYWdlOzIyNjM2ODQwNjtBUzoxMDM3NDYyMzk0Njc1MjRAMTQwMTc0NjQzOTcwMQ==

https://www.researchgate.net/publication/221995817_Maximum_Likelihood_from_Incomplete_Data_Via_EM_Algorithm?el=1_x_8&enrichId=rgreq-0f7733d6-66e7-42c5-8c4c-029715e86dd1&enrichSource=Y292ZXJQYWdlOzIyNjM2ODQwNjtBUzoxMDM3NDYyMzk0Njc1MjRAMTQwMTc0NjQzOTcwMQ==

https://www.researchgate.net/publication/291214882_SMEM_algorithm_for_mixture_models?el=1_x_8&enrichId=rgreq-0f7733d6-66e7-42c5-8c4c-029715e86dd1&enrichSource=Y292ZXJQYWdlOzIyNjM2ODQwNjtBUzoxMDM3NDYyMzk0Njc1MjRAMTQwMTc0NjQzOTcwMQ==

3

1.3 Real-Time Applications

The algorithm described in this paper was developed keep-ing in mind the need for real-time operation, having imagesegmentation for robots as our first objective. Due to thelarge resolution of modern frame grabbers and cameras, im-age segmentation requires fast techniques in order to satisfythe real-time demands of robotic applications, ranging fromsimple tracking to autonomous vehicle guidance and videosurveillance. Since there are no efficient global solutionstothe general problem of unsupervised estimation of mixturemodels, we have developed a greedy approach where thedesign choices are taken to specifically address the imagesegmentation problem achieving simultaneously fast perfor-mance and results competitive with the state-of-the-art. Forevaluation purposes, we also perform experiments on 2Dsynthetic data, where the method’s performance can be moreeasily visualized and compared. However, we stress that themain aspect considered in writing the algorithm’s specifica-tion is a fast performance in image segmentation. Severalresults on this domain, both with generic images and withimages taken in our robotic platform, are presented and com-pared with the state-of-the-art.

1.4 Outline

The paper is organized as follows. In sec. 2 we summarizethe main results of the classical Expectation Maximizationalgorithm. In sec. 3 we introduce the proposed algorithm.Specifically, we describe its formulation in sec. 3.1, the ini-tialization in sec. 3.2, the component split operation in sec.3.4, and the decision thresholds update rules in sec. 3.5. Fur-thermore, in sec. 5 we describe our experimental set-up fortesting the validity of our new technique and in sec. 6 wediscuss our results. Finally, in sec. 7 we draw the main con-clusions of this work.

2 Expectation Maximization Algorithm

The Expectation-Maximization algorithm serves to find themaximum likelihood estimates of a probabilistic model withunobserved data. A common usage of the EM algorithm isto identify the”incomplete, or unobserved data”:

Y =(

y1, y2

, . . . , y j) (1)

given the couple(X ,Y ) - with X defined as:

X = {x1, x2

, . . . , xN} (2)

also called”complete data”, which has a probability density(or joint distribution)p

(

X ,Y |ϑ)

= pϑ (X ,Y ) dependingon the parameterϑ . More specifically, the”complete data”

are the given input data setX to be classified, while the”incomplete data”are a series of auxiliary variables in thesetY indicating for each input sample which mixture com-ponent it comes from. We defineE

′(·) the expected value

of a random variable, computed with respect to the densitypϑ (X ,Y ).

We define:

Q(

ϑ (n), ϑ (n−1)

)

= E′L(

ϑ (n−1))

(3)

with L(

ϑ (n−1))

being the log-likelihood of the observed

data at stepn−1:

L(

ϑ (n−1))

= logp(

X ,Y |ϑ (n−1))

(4)

The EM procedure repeats the two following steps untilconvergence, iteratively:

– E-step: It computes the expectation of the joint probabil-ity density:

Q(

ϑ (n), ϑ (n−1)

)

= E′[

logp(

X ,Y |ϑ (n−1))]

(5)

– M-step: It evaluates the new parameters that maximizeQ; this, according to the ML estimation, is:

ϑ n = argmaxϑ

Q(

ϑ n, ϑ (n−1)

)

(6)

The convergence to a local maxima is guaranteed. How-ever, the obtained parameter estimates, and therefore, theac-curacy of the method greatly depend on the initial parame-tersϑ 0.

2.1 EM Algorithm: Application to a gaussians Mixture

When applied to a gaussian mixture density we assume thefollowing model:

p(x) =nc

∑c=1

wc · pc(x)

pc (x) =1

(2π)d2 |Σc|

12

e−12(x−µc)

T |Σc|−1(x−µc)

(7)

wherepc(x) is the component prior distribution for the classc, and withd, µc andΣc being the input dimension, the meanand covariance matrix of the gaussians componentc, andncthe total number of components, respectively.

Let us considerncclassesCnc, with p(x|Cc) = pc (x) andP(Cc) = wc being the density and thea-priori probability ofthe data of the classCc, respectively. Then:

p(x) =nc

∑c=1

P(Cc) · p(x|Cc) =nc

∑c=1

wc · pc(x) (8)

4

In this setting, the unobserved data setY =(

y1, y2, . . . , yN)

contains as many elements as data samples, and each vectoryi =

[

yi1,y

i2, · · · ,y

ic, · · ·y

inc

]Tis such thatyi

c = 1 if the datasamplexi belongs to the classCc andyi

c = 0 otherwise. Theexpected value of thecth component of the random vector ¯yis the classCc prior probability:

E′(yc) = wc (9)

The algorithm main stages are:- a) consider the whole data setD = (X ,Y ), where ¯xi areknown butyi are unknown;- b) E-step:for each data sample evaluate its class posteriorprobabilitiesP

(

yic = 1|xi

)

,c = 1· · ·nc:

P(

yic = 1|xi) = P

(

Cc|xi)

=p(

xi |Cc)

·P(Cc)

p(xi)=

wc · pc(

xi)

∑ncc=1wc · pc(xi)

, π ic

(10)

For simplicity of notation, from now on we will refer toE′ (

yc|xi)

asπ ic. This is probability that ¯xi belongs to class

Cc.- c) M-step: re-estimate the parameter vectorϑ , which at

the n+ 1 iteration will beϑ (n+1). This, in case of a gaus-sians mixture distribution, i.e.pc (x) is a gaussians density,we haveϑ = (wc, µc,Σc). Then, we evaluate the means andthe covariances by weighting each data sample by the degreein which it belongs to the class as:

µ (n+1)c =

∑Ni=1 π i

cxi

∑ii=1 π i

c

Σ (n+1)c =

∑Ni=1 π i

c

(

xi− µ (n+1)c

)(

xi− µ (n+1)c

)T

∑Ni=1 π i

c

(11)

Finally, we re-estimate thea-priori probabilities of the classes,i.e. the probability that the data belongs to the classc as:

w(n+1)c =

1N

N

∑i=1

π ic, with c= {1,2, . . . ,nc} (12)

2.2 EM Algorithm: Model Selection Criteria

Deciding the best number of components in a mixture is acritical issue. The more components there are, the higherthe log-likelihood will be. However a high number of com-ponents increases the computational cost of the algorihtmand the risk of a data overfitting. On the other side, a lownumber of components may not allow for a good enough de-scription of the data. Several information criteria have beendeveloped in order to evaluate the best mixture complex-ity (number of components) subject to the specific inputdata. They evaluate the best model by executing independent

runs of the EM algorithm for many different initializationsand number of components. Then, after each run they pro-duce a number, the value of the information criterion itself,characterizing the overall description. Each criterion has itsown characteristics. For a detailed overview about the mostused ones, refer to [24]. A common feature is that all ofthem tend to penalize complex models. Some examples arethe Akaike Information Criterion (AIC) [30], the RissanenMinimum Description Length (MDL) [29], and the WallaceMinimum Message Length [33]. However, in order to be ef-fective, they have to be evaluated for every possible num-ber of models under comparison. Therefore, it is clear that,for wide enough search range, the complexity goes with thenumber of tested models as well as the model parameters.The following section describes our approach to perform anefficient exploration of the model space, not only avoidingexaustive search but also reducing as most as possible themodel computational costs.

3 FASTGMM: Fast Gaussian Mixture Modeling

In this section we describe the rational of our algorithm. Ithas been mainly designed to perform an efficient search forthe number of mixture components. Whereas the classicalapproach is to perform an exhaustive search of the numberof components, doing independent EM runs with differentinitializations, a new class of algorithms has been developedin the last decade in the attempt to speed up the computa-tions. The basic idea is to incrementally estimate the mixtureparameters and the number of components simultaneously.The number of components is incremented or decrementedat certain stages of the optimization procedure but the val-ues of the mixture parameters are incrementally changedand not reinitialized. In this class of algorithms we refer toFigueiredo and Jain [9] (only decrement the number of com-ponents) and to Ueda [32] (increment/decrement). Our al-gorithm starts with a single component and only incrementsits number as the optimization procedure progresses. Withrespect to the other approaches, our is the one with the min-imal computational cost.

We distinguish two main features in our algorithm: Thesplitting and the stopping criteria. The key issue of our tech-nique is looking whether one or more gaussians are not in-creasing their own likelihood during optimization. Our al-gorithm evaluates the current likelihood of each single com-ponentc as (13):

Λcurr(c) (ϑ) =k

∑i=1

log(wc · pc(xi)) (13)

In other words, if their likelihood has stabilized they willbe split into two new ones and check if this move improvesthe likelihood in the long run. To each Gaussian component


https://www.researchgate.net/publication/2460476_Schwarz_Wallace_and_Rissanen_Intertwining_Themes_in_Theories_of_Model_Order_Estimation?el=1_x_8&enrichId=rgreq-0f7733d6-66e7-42c5-8c4c-029715e86dd1&enrichSource=Y292ZXJQYWdlOzIyNjM2ODQwNjtBUzoxMDM3NDYyMzk0Njc1MjRAMTQwMTc0NjQzOTcwMQ==

https://www.researchgate.net/publication/239032578_Estimation_and_inference_by_compact_coding?el=1_x_8&enrichId=rgreq-0f7733d6-66e7-42c5-8c4c-029715e86dd1&enrichSource=Y292ZXJQYWdlOzIyNjM2ODQwNjtBUzoxMDM3NDYyMzk0Njc1MjRAMTQwMTc0NjQzOTcwMQ==




5

we also associate aagevariable in order to control ow longthe component’s own likelihood does not increase signifi-cantly (see sec. 3.1). The split process is controlled by thefollowing adaptive decision thresholds:

– One adaptive thresholdΛTH for determining a signifi-cant increase in likelihood (see sec. 3.5);

– One adaptive thresholdATH for triggering the split pro-cess based on the component’s own age (see sec. 3.5);

– One adaptive thresholdξTH for deciding to split a gaus-sian based on its area (see sec. 3.4).

It is worth noticing that even though we consider threethresholds to tune, all of them are adaptive, and only requirea coarse initialization.

These parameters will be fully detailed within the nextsections.

3.1 FASTGMM Formulation

Our algorithm’s formulation can be summarized within threesteps:

– Initializing the parameters;– Splitting a gaussian;– Updating decision thresholds.

Each mixture componentc is represented as follows:

ϑc = ρ(

wc, µc,Σc,ξc,Λlast(c),Λcurr(c),ac)

(14)

where each element is described in tab. 3.1. In the rest of thepaper the index notation described in tab. 3.1 will be used.

Two important elements are the area (namely, the covari-ance matrix determinant) and the age of a Gaussian compo-nent, respectivellyξc andac.

Symbol Element

wc a-priori probabilities of the classcµc mean of the gaussian componentcΣc covariance matrix of the gaussian componentcξc area of the gaussian componentc

Λlast(c) log-likelihood at iterationt−1 of the gaussian componentcΛcurr (c) log-likelihood at iterationt of the gaussian componentc

ac ageof the gaussian componentcc single mixture componentnc total number of mixture componentsi single input pointk total number input pointsd single data dimensionD input dimensionality

Table 1 Symbol notation used in this paper

During each iteration, the algorithm keeps in memorythe previous likelihood (Λlast(c)). Once the re-estimation ofthe vector parameterϑ has been computed in the EM step, a

key decision is taken: If a component’s own likelihood doesnot increase more thanΛTH for a predetermined number oftimes and its area exceedsξTH, this component will be splitin two. Both thresholdsΛTH andξTH are time varying, likean annealing schedule, to promote gradual splits. If the splitsdo not improve the whole log-likelihood significantly, thealgorithm stops.

The whole algorithm pseudocode is shown in Algorithm3.1.

Algorithm 3.1 FASTGMM: Pseudocode1: - Parameter initialization2: while (stopping criterion is not met)do3: Λcurr (c), evaluation, forc = 0,1, . . .,nc4: Whole mixture log-likelihoodL

(

ϑ)

evaluation5: Re-estimate priorswc, for c = 0,1, . . .,nc

6: Recompute centerµ (n+1)c and covariancesΣ (n+1)

c , for c =0,1, . . .,nc

7: - Evaluation whether changing the gaussians distribution struc-ture -

8: for (c = 0 tonc) do9: if ](ac > ATH) then

10: if ((Λcurr (c)−Λlast(c)) < ΛTH) then11: ac+ = 112: - General condition for changing satisfied; now check-

ing those for each component -13: if (Σc > ξTH) then14: if (c < maxNumComponents)then15: split gaussians→ split16: nc+ = 117: resetξTH←

ξTH−INITnc

18: resetΛT H← LTH−INIT

19: reset aA,aB← 0, with A, B being the new twogaussians

20: return21: end if22: end if

23: ΛTH = ΛTH ·(

1− λnc2

)

24: ξTH = ξTH ·(

1− αMAXnc2

)

25: end if26: end if27: end for28: end while29: Optional: Optimizing selected mixture

3.2 Parameters initialization

The decision thresholds(·)INIT will be initialized as follows:

ξTH−INIT = ξdata

LTH−INIT = kLTH

ATH−INIT = kATH

(15)

with kLTH andkATH (namely, the minimum amount of like-lihood difference between two iterations and the number of

6

iterations required for taking into account the lack of a like-lihood consistent variation) relatively low (i.e. both in theorder of 10, or 20). Of course, higher values forkLTH andsmaller forkATH give rise to a faster adaptation, howeveradding instabilities.

At the beginning, before starting with the iterations,ξTH

will be automatically initialized to the Area of the wholedata set - i.e. the determinant of the covariance matrix rela-tive to all points, as follows:

µdata,d =1k

k

∑i

xid

Σdata,i = 〈xi− µdata〉〈xi− µdata〉T

(16)

wherek is the number of input data vectors ¯x.

3.3 Gaussian components initialization

The algorithm starts with a single gaussian. Its mean will bethe whole data mean, as well as its covariance matrix will bethat of the whole data set.

That leads to a unique starting configuration.

3.4 Splitting a gaussian

When a component’s covariance matrix area overcomes themaximum area thresholdξTH it will split. As a measure ofthe area we adopt the matrix’s determinant. This, in fact,describes the area of the ellipse represented by a gaussiancomponent in 2D, or the volume of the ellipsoid representedby the same component in 3D.

It is worth noticing that the way the component is splitgreatly affects further computations. For instance, consider a2-dimensional case, in which anelongatedgaussian is present.Depending on the problem at hand, this component may beapproximating two components with diverse configurations:Either covering two smaller data distribution sets, placedalong the longer axis, or two overlapped sets of data withdifferent covariances, etc. So, splitting a component is a illposed problem and the best way depends on the problemat hand. In this paper we make a choice suited to applica-tions in color image segmentation whose purpose is to ob-tain components with lower overlap. For this case a reason-able way of splitting is to put the new means at the two ma-jor semi-axis’ middle point. Doing so, the new componentswill promote non overlapping components and, if the ac-tual data set reflects this assumption, it will result in fasterconvergence. In fact, in image segmentation applications,points are composed of the R,G,B color values (or othercolor space) and spatial (x,y) coordinates. Whereas the colorcomponents can overlap (there may be many pixels with

the same color), spatial components are intrinsically non-overlapping ((x,y) coordinates are unique). Thus, in the con-text of image segmentation applications it makes sense tofavor non-overlapping components, thus empirically justi-fying the proposed rule.

To implement this split operation we make use of the sin-gular value decomposition. A rectangularn x p matrixA canbe decomposed asA = USVT , where the columns ofU arethe left singular vectors,S(which has the same dimension asA) is a diagonal matrix with the singular values arranged indescending order, andVT has rows that are the right singularvectors. However, we are not interested in the whole set ofeigenvalues, but only the biggest one, therefore we can savesome computation by evaluating only the first column ofUand the first element ofS.

More precisely, A gaussian with parametersϑOLD willbe split in two new gaussiansA andB, with means:

ΣOLD = USVT

uMAX = U∗,1; sMAX = S1,1

µA = µOLD +12

sMAXuMAX

µB = µOLD−12

sMAXuMAX

(17)

whereuMAX is the first column ofU , andsMAX the first ele-ment ofS.

The covariance matrices will then be updated as:

S1,1 =14

sMAX

ΣA = ΣB = USVT(18)

while the newa-priori probabilities will be:

wA =12

wOLD wB =12

wOLD (19)

It is worth noticing that the first split occurs immediatelyafter the initialization. Otherwise the algorithm would haveto wait until the first Gaussian gets old before adding thenew components, which is clearly unnecessary.

The decision thresholds will be updated as explained insec. 3.5. Finally, their ages,aA andaB, will be reset to zero.

3.5 Updating decision thresholds

The decision thresholds are updated in two situations:

A. When a mixture component is split;B. When each iteration is concluded.

These two procedures will be explained in the following.

7

- Single iteration.The thresholdsΛTH, andξTH vary at each step with the fol-lowing rules:

ΛTH = ΛTH−λ

nc2 ·ΛTH

= ΛTH ·

(

1−λ

nc2

)

ξTH = ξTH−αMAX

nc2 ·ξTH

= ξTH ·(

1−αMAX

nc2

)

(20)

with nc is the number of current gaussians,λ , andαMAX

are the coefficients for the likelihood and area change eval-uation, respectively. Using high values forλ and forαMAX

the corresponding thresholdsΛTH andξTH decrease. There-fore, since a component splits when its covariance matrixarea overcomes the thresholdξTH, decreasingαMAX will in-crease the split frequency and promote a faster convergence.In an analogous form,λ controls the minimum increase inlog-likelihood a component must have to be split. There-fore, the higherλ is, the easier the system will promote asplit. In tandem,αMAX andλ control the convergence speedtogether. However, fast convergence is often associated toinstability around the optimal point, and may even lead to adivergence from the local optimum.

Notice thatΛTH andξTH decrements are inversely pro-portional to the square of the number of components. Therationale is to promote splits in models with low complexityand prevent the fast growing of the number of componentsin models with high complexity.

Finally, every time a gaussians is added these thresholdswill be reset to their initial value (see next section).

- After gaussian splitting.The decision thresholds will be updated as follows:

ξTH =ξTH−INIT

ncΛTH = LTH−INIT

(21)

wherencOLD andnc are the previous and the current num-ber of mixture components, respectively. Substantially, thisupdates the splitting threshold to a value that goes linearlywith the initial value and the actual number of componentsused for the computation.

All these rules, although empirical, have proven success-ful in the estimation of mixture components, as will be shownin the results. The sole parameters to cotrol the process areαMAX and λ . A practical rule to tune these parameters isgiven in the following section.

3.5.1Tuning αMAX and λ

Tuning paramentersαMAX andλ depend mostly on the di-mensionality and type of data and the application require-ments for precisionvs computation time. According to our

experience, these parameters can be tuned only once foreach specific type of data. Particular instantiations of datawith the same characteristics do not require retuning. For in-stance, the 2D input data of sec. 5 uses the same paramenterand the same happens for the image data sets. Therefore wepropose a methodology to tune the parameters depending onthe demands of the application: focussing more on the sta-bility or on the accuracy. If precision is preferred, startingwith low values assures a low frequency of split and a betterconvergence to a local optimum. Then,αMAX andλ shouldbe gradually increased until a satisfactory performance. Thebest way is to use a test input set, and comparing the out-put of our algorithm with “ground truth” for different runs,increasingαMAX andλ each time. In the second case, it isbetter to follow the opposite way: starting with higher valuesof αMAX andλ than before, and then reducing them at eachrun. It is not possible to establish a priori the best startingvalues, because they depend on the type of data (dimension-ality, scale). In our experiments we usedλ = 20 for all thetests, whileαMAX = 1.5 for the 2D input data, andαMAX = 1for the images.

3.6 Ill-Conditioned components

We experienced that during the EM steps, the computationsometimes leads to a ill-conditioned component. This hap-pens when a Gaussian becomes too “enlongated”, whichcan be tested through its conditioning number (i.e. the ra-tio between the biggest and smallest eigenvalue) becomingtoo large. It may happen in several situations with differentkinds of input data, from the classical 2-dimensional points,to 3 dimensional (e.g. xyz cartesian points, or RGB colorimages), and this may “crash” the algorithm. However, thisproblem happens more frequently with images than withsimple input points. Many images we tried with the Figueiredoand Jain’s algorithm cannot have been segmented due to thisproblem.

To address this problem we evaluate the class’ covari-ance matrix conditioning number, and if it is higher thana limit (e.g. 10e+10) we stop the EM computation. In fact,since that component does not contribute enough to the in-put data description, we directly reject this component.

3.7 Optimizing the selected mixture

After FASTGMM algorithm stops, we may keep the chosenmixture as the final result or we may perform an aditionalEM step to refine the solution. This is an optional proce-dure. The former choice is the fastest but less accurate, whilethe latter one introduces new computations but ensures moreprecision.

8

Why to choose the second possibility?It may happen that FASTGMM decides to increase the num-ber of components even though the EM has not reached itslocal maximum, due to the splitting rule. In this case currentmixture can still be improved by running the EM until itachieves its best configuration (the log-likelihood no longerincreases).

Whether applying the first or second procedure is a mat-ter of what predominates in the”number of iterations vs.solution precision”compromise at each time.

3.8 Stopping criterion

The algorithm stops when the log-likelihood does not in-crease over a minimum value. This is a common used ap-proach [9] [32]. However, rather than consider the abso-lute value of the log-likelihood, as in the two previous tech-niques, we considered a percentage variation. In fact, de-scribing different input data sets with theirbestmixture giverise to different values for the final log-likelihood. There-fore, instead of fixating an absolute value as the minimumincrement, a percentage amount allows a more general ap-proach. In this work, we stop the EM computation when thelog-likelihood does not increase more than 0.1%.

3.9 Computational complexity evaluation

We refer to the pseudocode in algorithm 3.1, and to the nota-tion presented in sec. 3.1.The computational burden of eachiteration is:

– the original EM algorithm (steps 3 to 6) takesO(N ·D ·nc)for each step, for a total ofO(4 ·N ·D ·nc) operations;

– our algorithm takesO(nc) for evaluating all the gaus-sians (step 8 to 27);

– our split (step 15) operation requiresO(D).– the others takeO(1).– the optional procedure of optimizing the selected mix-

ture (step 29) takesO(4 ·N ·D ·nc), being the originalEM.

Therefore, the original EM algorithm takes:

– O(4 ·N ·D ·nc), while our algorithm addsO(D ·nc) onthe whole, orO(4 ·N ·D ·nc), giving rise toO(4 ·N ·D ·nc)+ O(D ·nc) = O(4 ·N ·D ·nc+D ·nc)= (nc·D · (4N+1))in the first case;

– 2·O(4 ·N ·D ·nc) + O(D ·nc) = O(8 ·N ·D ·nc+D ·nc)= (nc·D · (8N+1)) in the second case, with the opti-mization procedure.

Considering that usuallyD << N andnc << N, and thatthe optimization procedure is not essential, our procedure

does not add a considerable burden, while giving an impor-tant improvement to the original computation in terms ofself-adapting to the data input configuration at best. More-over, it is worth noticing that even though the optimizationprocedure is performed, this starts very close to the optimalmixture configuration. In fact, the input mixture is the resultof the FASTGMM computation, rather than a generic ran-dom or k-means initialization (as it happens with the simpleEM algorithm, generally).

4 Application to Color Image Segmentation

This section mainly focuses on the application of our ap-proach for segmenting real colored images. Since the algo-rithm is suitable for this kind of task, it is relevant to dedicatea brief section about it.

Each pixel of the image is represented as a 5-dimensionalvector, where the dimensions are the(R,G,B) color spaceand the(x,y) pixel coordinates. The input data set is com-posed by all the pixels of the image. After the Gaussianmixture estimation process, we will have several compo-nents, each one representing a Gaussian distribution in the5D joint (R,G,B,x,y) space, represented by a mean color, amean spacial coordinate and associated covariances. Eachpixel vector is assigned a likelihood of belonging to eachof these distributions. We choose the maximum likelihoodassignment. In the experimental results we show segmenta-tion results (see Fig. 2). In those images, each pixel is col-ored with the mean RGB value of its maximum likelihoodcluster.

5 Experimental Validation

The experiments performed in this paper aim at a carefulcomparison with a state-of-the-art unsupervised learningtech-nique [9] which is based on similar principles (mixture modelestimation) and makes publicly available the correspond-ing source code. We compare both approaches in syntheticdata (artificially generated with a known mixture), and tosome real images (taken by a webcam or by our robotic plat-form). The comparison criteria include the computationalcost (both algorithms have MATLAB implementations), thenumber of components, and robustness to operating condi-tions. In the experiments performed with real images we alsoshow the output of a successful image segmentation algo-rithm based on themeanshift algorithm[4], using the EDI-SON binary package1. Although based on different princi-ples, this method is one the best with respect to image seg-mentation. However it is a non adaptive method: it relies ontwo main parameters, the spatial and the color bandwidth,

1 http://coewww.rutgers.edu/riul/research/code/EDISON/




https://www.researchgate.net/publication/284040965_Mean_shift_A_robust_approach_toward_feature_space_analysis?el=1_x_8&enrichId=rgreq-0f7733d6-66e7-42c5-8c4c-029715e86dd1&enrichSource=Y292ZXJQYWdlOzIyNjM2ODQwNjtBUzoxMDM3NDYyMzk0Njc1MjRAMTQwMTc0NjQzOTcwMQ==

9

that rougly define the neighborhood in space and color to de-fine clusters [4], and that we have manually tuned to achive anumber of clusters similar to ours. The available implemen-tation is coded in C and therefore we cannot compare theperformance in terms of the computation time. We used itas an example of one of the best quality image segmentationmethods currently available.

5.1 Synthetic data

In order to evaluate our algorithm’s performance with groundtruth data, we tested it by classifying different input datasets randomly generated by a known gaussians mixture. Thesame input sets have been proposed in [9]. Each distributionhas a total of 2000 points, but arranged with different mix-ture distributions. Even though our procedure can be appliedto any input dimensionality, we choose to show the resultsfor 2-dimensional input because they are easier to represent.As the ratio(# gaussians)/(# data points)increases, it be-comes harder to reach a good solution in a reasonable num-ber of steps. Therefore, we are interested in evaluating howour algorithm behaves when the model complexity gradu-ally increases.

Since [9] depends on the initialization, to have a faircomparison, we adopted a commonly used approach: se-lect 10 different initial random conditions and report thosegiving the highest likelihood. The output of the two algo-rithms is shown in Fig. 2. Each subplot set is composed bythe graphical output representation for the 2-D point distri-bution (top) and the 3-D estimation mixture histogram (bot-tom). Here we show the representation for different mixturesof 3, 4, 5, 8, 12, 14, and 16 gaussian components. The dataplots show the generated mixture (blue) and the evaluatedone (red). The data on the left results from our approach,while on the right we show the results of [9], relative to thesame input data set. The 3D plots at the bottom in each sub-figure represent, respectivelly, the generated mixture, our al-gorithm’s estimation, and the estimation given by the algo-rithm in [9].

We can see that our algorithm is capable of identifyingthe input data mixture starting from only one componentwith an accuracy comparable with that of [9].

Here we performa a detailed quantitative comparison with[9]. In table 2 we show a detaile quantitative comparisonwith [9] (from now on denoted FIGJ). The table contains:

– The number of initial mixture components;– The number of detected components;– The actual number of components, i.e. that of the gener-

ation mixture;– The number of total iterations;– The elapsed time;

– The percentage difference in time for our algorithm withthe optimization process;

– The final log-likelihood;– The percentage difference in final log-likelihood for our

algorithm with the optimization process;– The normalized L2 distance to the generation mixture

without optimization;– The normalized L2 distance to the generation mixture

with optimization (only for FASTGMM).

In the following we discuss the main differences be-tween the two algorithms.

5.1.1 Evaluated number of components

There are substantially no differences in the selected numberof components. Both our approach and [9] perform well onlow number of mixture components, while having the ten-dency of underestimating them when the number increases.An exception is our approach that overestimates the 8-componentscase but correctly estimates the 16-components case. How-ever, it is worth considering that even though it estimateswell the number of components, the parameters may differ.For instance, two components may be regarded as only one,while a single one can be considered as a multiple one. Thisbehavior is present in both algorithm (see Fig. 2), suggestingthat a perfect algorithm is hard to find.

5.1.2 Elapsed time

It is important to distinguish the required number of itera-tions from the elapsed time. FASTGMM employs fewer it-erations than FIGJ without making use of the optimizationprocess, while more in the other case. At a first glance, thismay suggest a whole FASTGMM slower computation thanFIGJ. However, the whole elapsed time that occurs for run-ning our procedure is generally less than FIGJ’s, becausemost iterations are performed with lower number of compo-nents. The elapsed time of the algorithm in [9] strongly de-pends on the initial number of components. The more theyare, the slower it is. [9], on the other side, starts in the op-posite way, i.e. with the maximum allowed components. Therule to define this number relies on a desired minimum prob-ability of successful initialization of 1− ε, with the limit ofhaving at least the following amount of starting componentscmin (sec. 6.1 of [9]):

cmin >lnε

ln(1−αmin)(22)

whereαmin = min{α1,α2, · · · ,αc} is the probability of thecomponent that will be more probability left out of the ini-tialization, for its excessively small probability. Then,claim-ing a probability of successful initialization of 90%=⇒ 1−








10

3-Gaussians 4-GaussiansFSAEM FIGJ FSAEM FIGJ

−6 −4 −2 0 2 4 6 8

−6

−4

−2

0

2

4

6

Reference (blue) and estimated (red) covariances

−6 −4 −2 0 2 4 6 8

−6

−4

−2

0

2

4

6


−10 −8 −6 −4 −2 0 2 4 6

−6

−4

−2

0

2

4

6


−10 −8 −6 −4 −2 0 2 4 6

−6

−4

−2

0

2

4

6



−10 −8 −6 −4 −2 0 2 4 6

−8

−6

−4

−2

0

2

4

6Reference (blue) and estimated (red) covariances

−10 −8 −6 −4 −2 0 2 4 6

−8

−6

−4

−2

0

2

4


−6 −4 −2 0 2 4 6

−4

−3

−2

−1

0

1

2

3

4


−6 −4 −2 0 2 4 6

−4

−3

−2

−1

0

1

2

3

4



−4 −2 0 2 4 6

−4

−3

−2

−1

0

1

2

3

4

5

6


−4 −2 0 2 4 6

−4

−3

−2

−1

0

1

2

3

4

5

6


−4 −2 0 2 4 6

−4

−3

−2

−1

0

1

2

3

4

5


−4 −2 0 2 4 6

−4

−3

−2

−1

0

1

2

3

4

5


Fig. 1 For each plot set: Generation mixture (blue) and the evaluated one (red) for FASTGMM and FIGJ on the same input sets.

Input Algorithm# Initial # Detected Actual gaussian

# IterationsElapsed Time Diff time FASTGMM

Log-likelihoodDiff lik FASTGMM Normalized L2 Distance Normalized L2 Distance

Crashedgaussians gaussians number [s] with opt vs FIGJ% with opt vs FIGJ % without optimization with optimization

3-gau:

FASTGMM 3

3

76 3.99716

-53.89844289

-8420.917867

0.495867477Optimization 1 3 130 6.151567 -8379.161274 5.770135 3.918034 no

FASTGMM + Opt. 3 206 10.148727 -8379.161274FIGJ 16 3 277 29.433288 -9524.692099 3.670464 3.670464 no

4-gau:

FASTGMM 4

4

101 5.615204

-123.166389

-7573.101881

2.218687212Optimization 1 3 186 12.531248 -7405.078438 10.670613 0.07519 no


8-gau:

FASTGMM 9

8

276 5.750431

22.99575458

-8599.51

0.015582283Optimization 1 8 199 4.428076 -8598.17 0.196817 1.971166 no


12-gau:

FASTGMM 11

12

376 18.496127

22.99575458

-7536.534442

0.049105156Optimization 1 12 117 5.593468 -7532.833615 1.069315 0.284287 no


14-gau:

FASTGMM 12

14

351 17.651956

16.07843346

-8008.490211

0.157717218Optimization 1 14 280 14.813798 -7995.859443 5.906032 1.884532 no


16-gau:

FASTGMM 16

16

501 26.667825

51.82061529

-8165.436422

0.057038433Optimization 1 16 202 12.848394 -8160.778985 0.251515 1.033934 no


Table 2 Experimental results on synthetic data.

11

ε = 0.9 =⇒ ε = 0.1 we getcmin = 26,4∼= 27 as the mini-mum number of component, i.e. more than double the actualnecessary components.

Nevertheless, we made FIGJ starting with a reasonablenumber of components, just a few more than the optimum,so that they do not affect its performance negatively. Thisis 16 for the 3 to 12 mixture components, and 20 for the 14and 16 ones (see tab. 2). FASTGMM’s better performance isdue to the fact that our approach, growing in the number ofcomponents, computes more iterations than FIGJ but with asmall number of components per iteration. Therefore it runseach iteration faster, while slowing only at the end due to theaugmented number of components.

Finally, FASTGMM stops when the best amount of com-ponents to cover the whole data distribution is found. FIGJ,instead, once the optimum has been found, continues to tryall the remaining configurations, to that with just one com-ponent (which is our starting mixture), so that exploiting allthe possible solutions. That contributes to slow the FIGJ ’sperformance. Generally, our approach performs faster than[9].

5.1.3 Mixture precision estimation

It is possible to see that FASTGMM usually achieves a higherfinal log-likelihood than FIGJ. This suggests a better ap-proximation of the data mixture. However, a higher log-likelihood does not strictly imply that the extracted mixturecovers the data better than another one. A deterministic ap-proach is to adopt a unique distance measure between thegenerated mixture and the evaluated one. In [17] JensenetAl. exposed three different strategies for computing such dis-tance: The Kullback-Leibler, the Earth Mover, and the Nor-malized L2 distance. The first one is not symmetric, and canonlyy be evaluated in closed form for unidimensional gaus-sians. The second one suffers from analogous problems. Thethird choice is symmetric, obeys to the triangle inequalityand it is easy to compute. We ue the latter to perform thecomparison. Its expression states [1]:

zcNx(

µc, Σc)

= Nx(

µa, Σa)

·Nx(

µb, Σb)

where

Σc =(

Σ−1a + Σ−1

b

)−1and µc = Σc

(

Σ−1a µa + Σ−1

b µb)

zc = |2πΣaΣbΣ−1c |

12 e−

12 (µa−µb)

T Σ−1a ΣcΣ−1

b (µa−µb)

= |2π(

Σa + Σb)

|12 e−

12 (µa−µb)

T(Σa+Σb)−1

(µa−µb)

(23)

5.2 Colored real images

Here we evaluate the performance of our algorithm when ap-plied to the image segmentation problem. We perform a de-

tailed comparison with [9] (to our knowledge this is the firsttime the algorithm of [9] is applied to image segmentation)and a qualitative comparison with an implementation of themeanshiftalgorithm [4]. Given the different principle andimplementation of this algorithm it is difficult to performa quantitative comparison in terms of computation time ormodel complexity. Anyway, since this is one the most suc-cessfull image segmentation algorithms, it provides a qualitycriteria benchmark to take into account. We segmented theimages as 5-dimensional input in the(R,G,B,x,y) space.The color image segmentation results are shown in Fig. 2.The set of images is divided into two groups: Some gen-eral images (from (1) to (3)) and some images taken by theiCub’s cameras (from (4) to (6)). For each group we showthe original images, the segmentation provided by the EDI-SON implementation of themeanshiftalgorithm, those ob-tained with the [9], and those obtained with our algorithm,from left to right, respectively.

From a visual inspection of the figures it is hard to saywhat is the best segmentation. Themeanshiftoutput is prob-ably more visually appealing. This method has been spe-cially designed for image segmentation and contains a fewpost-processing steps that improve the overall quality of theresults. Using raw mixture model estimation, as our methodand FIGJ, there are no clear qualitative advantages of oneover the other.

Table 3 shows quantitative results of the image segmen-tation runs using FASTGMM, FIGJ and FASTGMM withthe final optimization step. If computational speed if an im-portant requirement in the application at hand, we may saythat our algorithm is better for image segmentation, becauseit is fast, its initialization is simple and produces compara-ble results. On the negative side on theαMAX parameter (seesec. 3.5). [9] instead, it requires thea-priori definition ofthe maximum number of mixture components, that is thestarting number of components. The higher it is the longerit will take for the input segmentation. If it is set too small,the solution space may not be appropriatelly explored, giv-ing rise to undersegmentation. In addition, some problemsarise with the evaluation of the covariances matrix for somereal images. It happens that some of them may become ill-conditioned, making FIGJ “crash” when too many compo-nents are computed.

6 Discussion

In this section we highlight other aspects of the proposedalgorithm and provide an overall comparison of the advan-tages and disadvantages with respect to FIGJ.

https://www.researchgate.net/publication/220723359_Evaluation_of_Distance_Measures_Between_Gaussian_Mixture_Models_of_MFCCs?el=1_x_8&enrichId=rgreq-0f7733d6-66e7-42c5-8c4c-029715e86dd1&enrichSource=Y292ZXJQYWdlOzIyNjM2ODQwNjtBUzoxMDM3NDYyMzk0Njc1MjRAMTQwMTc0NjQzOTcwMQ==

https://www.researchgate.net/publication/260387385_The_Multivariate_Gaussian_Probability_Distribution?el=1_x_8&enrichId=rgreq-0f7733d6-66e7-42c5-8c4c-029715e86dd1&enrichSource=Y292ZXJQYWdlOzIyNjM2ODQwNjtBUzoxMDM3NDYyMzk0Njc1MjRAMTQwMTc0NjQzOTcwMQ==


12

Original Image EDISON FIGJ FASTGMM

(1)

(2)

(3)

(4)

(5)

(6)

Fig. 2 Color image segmentation results. We divide these images into two groups: Some general images, on the top (lines (1) to (3)), and someimages taken by the iCub’s cameras, on the bottom (lines (4) to (6)). For each group we show the original images, those obtained with EDISON,those obtained with FIGJ, and those obtained with our algorithm, from left to right, respectively.

6.1 FASTGMM Optimization Procedure

We reported our results with and without the optimizationprocedure. Since one of the most prominent key feature ofour approach is its fast computation, together with its simpleimplementation, the optimization process may seem worth-less or too computational demanding. However, by compar-ing its performance against those of [9], our algorithm stillremains faster (see sec. 5.1.2). The difference in terms of fi-nal mixture precision is not so evident at a first glance, both

in the final log-likelihood and the normalized L2 distance,but allows some improvements in the final segmentation. Ifone claims for the fastest algorithm it is advisable to not usethe optimization, even though it may lead to some improve-ments to the final mixture. Otherwise, FASTGMM gives agood precision and a better computational cost.

13

Input Algorithm# Initial # Detected

# IterationsElapsed Time Diff time FASTGMM

Log-likelihoodDiff lik FASTGMM Diff lik FASTGMM

Crashedgaussians gaussians [s] with opt vs FIGJ % with opt vs FIGJ with opt vs FIGJ

1

FASTGMM 9 551 71.460507

130.9699636

-235130.6216

0.186374662 17.03330409Optimization 1 9 23 22.131293 -234692.3977 no

FASTGMM + Opt. 9 700 93.5918 -234692.3977FIGJ 16 16 422 307.454885 -274668.2675 yes, 17

2

FASTGMM 14 426 80.365553

109.5710584

-314931.44

0.120630946 11.16531396Optimization 1 14 374 88.057387 -314551.5352 no

FASTGMM + Opt. 14 800 168.422941 -314551.5352FIGJ 30 25 572 1611.816189 -349672.2017 no

3

FASTGMM 7 276 34.977055

16.7543208

-226138.1494

3.34203E-05 17.57803149Optimization 1 7 36 5.860168 -226138.0738 no


4

FASTGMM 7 576 72.744922

3.588007146

-281176.5478



5

FASTGMM 3 151 16.899695

13.4646217

-218447.7912



6

FASTGMM 11 451 67.534808

47.35502469

-210899.0185

0.114549212 17.68706256Optimization 1 10 180 31.981125 -210657.4353 no

FASTGMM + Opt. 10 631 99.515933 -210657.4353FIGJ 24 22 514 624.913104 -247916.5477 no

Table 3 Experimental results on real images segmentation.

0 10 20 30 40 50 60 70 80−1.4

−1.3

−1.2

−1.1

−1

−0.9

−0.8

−0.7x 10

4 LogLikelihood

Iteration number0 50 100 150 200 250 300 350 400 450

−11500

−11000

−10500

−10000

−9500

−9000

−8500

−8000

−7500

−7000LogLikelihood

Iteration number0 100 200 300 400 500 600

−6.4

−6.2

−6

−5.8

−5.6

−5.4

−5.2

−5x 10

5 LogLikelihood

Iteration number

LogL

ikel

ihoo

d

0 100 200 300 400 500 600−10

−9

−8

−7

−6

−5

−4

−3x 10

5 LogLikelihood

Iteration number

LogL

ikel

ihoo

d

(a) (b) (c) (d)

0 50 100 150 200 250 300 350 400 450 500−5

−4.5

−4

−3.5

−3

−2.5

−2

−1.5

−1

−0.5

0x 10

5 LogLikelihood


−1.8

−1.6

−1.4

−1.2

−1

−0.8

−0.6x 10

4 LogLikelihood


−6

−5.9

−5.8

−5.7

−5.6

−5.5

−5.4

−5.3x 10

5 LogLikelihood

Iteration number

LogL

ikel

ihoo

d

0 50 100 150 200 250 300 350 400 450 500−4.6

−4.5

−4.4

−4.3

−4.2

−4.1

−4x 10

5 LogLikelihood

Iteration number

LogL

ikel

ihoo

d

(e) (f) (g) (h)

Fig. 3 The final log-likelihood evolution as function of the numberof iterations of two different kinds of input data used within the experiments:the 4 mixture components (a) - FIGJ and (e) FASTGMM, the 12 mixture components - (b) FIGJ and (f) FASTGMM, the image (4) - (c)FIGJ and(g) FASTGMM, and the image (5) - (d) FIGJ and (h) FASTGMM.

6.2 Log-Likelihood

We approach the problem conversely than in [9]. Instead ofstarting with the maximum number of mixture components,we start with just only one. This will generate two oppositeevolutions of the cost functions. FIGJ is characterized by adecreasing exponential curve, that raises at the end. Our ap-proach, instead, is characterized by an increasing exponen-tial curve. Fig. 3 shows four examples of these output foreach algorithm.

Here it is possible to notice spikes in FIGJ curve, corre-sponding to component annihilation. When the log-likelihoodincreases, it means that the current mixture description isworse than the previous one.

Similarly, when our algorithm splits a gaussian, i.e. addinga component, a spike that decreases the log-likelihood ap-pears on the curve. Finally, when the log-likelihood does notincrease significantly the computation stops.

6.3 Overall comparison: Which algorithm is the bestone?

There’s not a unique answer to this question. Both approacheshas advantages and disadvantages. These are summarized intab. 4.

Generally, our algorithm is faster than the [9]. However,both techniques rely on somea-priori parameters: FAST-GMM αMAX andλ , and FIGJ the maximum number of com-

14

Algorithm Pro Contro Application Domain

1. Deterministic. Independent runs of the1. Requires tuning parameters for 1. High dimensional data (e.g. Imagealgorithm will produce the same result. controlling the convergence. Segmentation, Robotics)

FASTGMM 2. Initialization is unique. No parameters 2. On-line and real-time applicationsrequired. 3. Applications requiring a low number3. Low computational burden. of components (e.g. low bandwidth coding).

1. Does not require parameter tuning for 1. Stochastic. Different runs will lead to 1. Low-dimensional data.controlling convergence. different results. 2. Off-line applications.2. Better results on average in synthetic 2. Requires the specification of the initial 3. Applications requiring a high number

FIGJ data generated by gaussian mixtures number of components. of components (e.g. high quality coding).3. High computational cost.4. Showed unreliable performance inimage segmentation.

Table 4 Comparison between FASTGMM and FIGJ: Advantages and drawbacks of both algorithms.

ponents. These greatly affect the algorithms, the first for theprecision most, and the second for the computational burdenmost.

Our approach is better when the time computation is astrict requirement. If the computation time is not a problem,FIGJ can be applied to synthetic data with more precise re-sults than FASTGMM.

FASTGMM performed especially well on image seg-mentation. FIGJ, on the other side, did well in synthetic data,but not for real images. This is because of three main points:its low robustness to numerical errors produced crashes dueto singular matrices when a certain number of componentsis analyzed (usually from 18, 20) as reported in tab. 3; highercomputation time; and inaccurate selection of number ofcomponents.

7 Conclusion

In this paper we proposed a unsupervised algorithm that es-timates on-line the parameters and number of componentsof a finite mixture model from multivariate data. The mainfocus was on computational speed. Our methodology startswith a single mixture component and incrementally addsnew components until a good compromise between preci-sion and speed is achieved. This promotes a faster compu-tation that alternative methods that start with many compo-nents and gradually refine them. From this class is the algo-rithm of [9], for which we performed a detailed comparisonin the application to simple 2D mixtures or to more com-plex image segmentation scenarios. Our methos has severaladvantages: is fast, its initialization is unique, a rough tun-ing of design parameters is sufficient for good performance.We describe in detail the operation of the algorithm and itsvariants, provide rules for tuning the important parametersand show, through simulated and real data sets, the advan-

tages of our algorithm in what regards computation time andfitting score to ground truth.

Acknowledgements This work was supported by the European Com-mission, Project IST-004370 RobotCub and FP7-231640 Handle, andby the Portuguese Government - Fundacao para a Ciencia e Tecnolo-gia (ISR/IST pluriannual funding) through the PIDDAC program fundsand through project BIO-LOOK, PTDC / EEA-ACR / 71032 / 2006.

References

1. Ahrendt, P.: The multivariate gaussian probability distribution.Tech. rep., http://www2.imm.dtu.dk/pubdb/p.php?3312 (2005)

2. Bueno, J.I., Kragic, D.: Integration of tracking and adaptive gaus-sian mixture models for posture recognition. The 15th IEEE In-ternational Symposium on Robot and Human Interactive Commu-nication (RO-MAN06), Hatfield, UK (2006)

3. Carpin, S., Lewis, M., Wang, J., Balakirsky, S., Scrapper, C.:Bridging the gap between simulation and reality in urban searchand rescue”. In: Robocup 2006: Robot Soccer World Cup X(2006)

4. Comaniciu, D., Meer, P.: Mean shift: A robust approach towardfeature space analysis. IEEE Trans. Pattern Anal. Machine Intell24, 603–619 (2002)

5. Comon, P.: Independent component analysis: a new concept? Sig-nal Processing, Elsevier36(3), 287–314 (1994)

6. Davis, J.W., Morison, A.W., Woods, D.D.: An adaptive focus-of-attention model for video surveillance and monitoring. MachineVision and Applications18(1), 41–64 (2006)

7. Dempster, A., Laird, N., Rubin, D.: Maximum likelihood estima-tion from incomplete data via the em algorithm. J. Royal StatisticSoc.30(B), 1–38 (1977)

8. Dobbe, J.G.G., Streekstra, G.J., Hardeman, M.R., Ince, C., Grim-bergen, C.A.: Measurement of the distribution of red blood celldeformability using an automated rheoscope. Cytometry (ClinicalCytometry)50, 313–325 (2002)

9. Figueiredo, A., Jain, A.: Unsupervised learning of finitemixturemodels. IEEE Trans. Patt. Anal. Mach. Intell.24(3) (2002)

10. Fritzke, B.: A growing neural gas network learns topologies. Advances in Neural Inform ation Processing Systems 7 (NIPS’94),MIT Press, Cambridge MA pp. 625–632 (1995)

11. Greenspan, H., Goldberger, J., Eshet, I.: Mixture modelfor face-color modeling and segmentation. Pattern Recognition Letters(22), 1525–1536 (2001)


































15

12. Greggio, N., Silvestri, G., Menegatti, E., Pagello, E.:Simulation ofsmall humanoid robots for soccer domain. Journal of The FranklinInstitute - Engineering and Applied Mathematics346(5), 500–519(2009)

13. Hartley, H.: Maximum likelihood estimation from incompletedata. Biometrics14, 174–194 (1958)

14. Henderson, N., King, R., Middleton, R.H.: An application of gaus-sian mixtures: Colour segmenting for the four legged leagueusinghsi colour space. Proceedings of the 2008 Australasian Confer-ence on Robotics & Automation - Canberra, Australia(ISBN 978-0-646-50643-2)(2008)

15. Holmstrom, J.: Growing neural gas - experiments with gng, gngwith utility and supervised gng (2002)

16. Hyvarinen, A., Karhunen, J., Oja, E.: Independent componentanalysis. New York: John Wiley and SonsISBN 978-0-471-40540-5(2001)

17. Jensen, J.H., Ellis, D., Christensen, M.G., Jensen, S.H.: Evaluationdistance measures between gaussian mixture models of mfccs.Proc. Int. Conf. on Music Info. Retrieval ISMIR-07 Vienna, Aus-tria pp. 107–108 (October, 2007)

18. Jinachitra, P.: Constrained em estimates for harmonic source sepa-ration. Proceedings of the 2003 International Conference on Mul-timedia and Expo - Volume 1 pp. 617 – 620 (2003)

19. Ketchantang, W., Derrode, S., Martin, L., Bourennane, S.:Pearson-based mixture model for color object tracking. MachineVision and Applications19(5-6), 457–466 (2008)

20. Key, J.: The em algorithm in medical imaging. Stat Methods MedRes6(55), 55–75 (1997)

21. Kohonen, T.: Analysis of a simple self-organizing process. Bio-logical Cybernetics44(2), 135–140 (1982)

22. Kohonen, T.: Self-organizing formation of topologically correctfeature maps. Biological Cybernetics43(1), 59–69 (1982)

23. Kouzani, A.Z.: Classification of face images using localiteratedfunction system. Machine Vision and Application19(4), 223–248(2008)

24. Lanterman, A.: Schwarz, wallace and rissanen: Intertwiningthemes in theories of model order estimation. Int’l Statistical Rev.69, 185–212 (2001)

25. MacQueen, J.B.: Some methods for classification and analysisof multivariate observations. Proceedings of 5th BerkeleySym-posium on Mathematical Statistics and Probability. pp. 281–297(1967)

26. McLachlan, G., Peel, D.: Finite mixture models. John Wiley andSons (2000)

27. Montesano, L., Lopes, M., Bernardino, A., Santos-Victor, J.:Learning object affordances: From sensory motor maps to imi-tation. IEEE Trans. on Robotics24(1) (2008)

28. Pernkopf, F., Bouchaffra, D.: Genetic-based em algorithm forlearning gaussian mixture models. IEEE Trans. Patt. Anal. Mach.Intell. 27(8), 1344–1348 (2005)

29. Rissanen, J.: Stochastic complexity in statistical inquiry. WoldScientific Publishing Co. USA (1989)

30. Sakimoto, Y., Iahiguro, M., Kitagawa, G.: Akaike information cri-terion statistics. KTK Scientific Publisher, Tokio (1986)

31. Shim, H., Kwon, D., Yun, I., Lee, S.: Robust segmentationof cere-bral arterial segments by a sequential monte carlo method: Particlefiltering. Computer Methods and Programs in Biomedicine84(2-3), 135–145 (2006)

32. Ueda, N., Nakano, R., Ghahramani, Y., Hiton, G.: Smem algo-rithm for mixture models. Neural Comput12(10), 2109–2128(2000)

33. Wallace, C., Freeman, P.: Estimation and inference by compactcoding. J. Royal Statistic Soc. B49(3), 241–252 (1987)

34. Zhang, Z., Chen, C., Sun, J., Chan, K.: Em algorithms for gaussianmixtures with split-and-merge operation. Pattern Recognition 36,1973 – 1983 (2003)































































Fast estimation of Gaussian mixture models for image segmentation

Documents

Transcript of Fast estimation of Gaussian mixture models for image segmentation