
Expert Systems with Applications 41 (2014) 1218–1235


Object clustering and recognition using multi-finite mixtures for semantic classes and hierarchy modeling


* Corresponding author. Tel.: +1 514 848 2424; fax: +1 514 848 3171. E-mail addresses: [email protected] (T. Bdiri), [email protected] (N. Bouguila), [email protected] (D. Ziou).

Taoufik Bdiri a, Nizar Bouguila b,*, Djemel Ziou c

a Department of Electrical and Computer Engineering, Concordia University, Montreal, QC H3G 1T7, Canada
b The Concordia Institute for Information Systems Engineering (CIISE), Concordia University, Montreal, QC H3G 1T7, Canada
c DI, Faculté des Sciences, Université de Sherbrooke, Sherbrooke, QC J1K 2R1, Canada


Keywords: Data clustering, Mixture models, Hierarchical models, Semantic clustering, Inverted Dirichlet distribution, Maximum likelihood, Visual objects

Abstract

Model-based approaches have become important tools to model data and infer knowledge. Such approaches are often used for clustering and object recognition, which are crucial steps in many applications, including but not limited to recommendation systems, search engines, cyber security, surveillance and object tracking. Many of these applications have an urgent need to reduce the semantic gap of data representation between the system level and the level understandable by human beings. Indeed, the low-level features extracted to represent a given object can be confusing to machines, which cannot differentiate between very similar objects that are trivially distinguishable by human beings (e.g. apple vs. tomato). In this paper, we propose a novel hierarchical methodology for data representation using a hierarchical mixture model. The proposed approach allows modeling a given object class by a set of modes deduced by the system and grouped according to labeled training data representing the human-level semantics. We have used the inverted Dirichlet distribution to build our statistical framework. The proposed approach has been validated using both synthetic data and a challenging application, namely visual object clustering and recognition. The presented model is shown to have a flexible hierarchy that can be changed on the fly at negligible computational cost.


1. Introduction

Available digital data has increased significantly in recent years with the intensive use of technological devices and the Internet. Due to the huge amount of such heterogeneous data, an urgent need has arisen to automate its analysis and modeling for different purposes and applications. One challenging, crucial aspect of data analysis is clustering, a form of unsupervised learning, which is defined as the process of assigning observations sharing similar characteristics to subgroups, such that their heterogeneity is minimized within a given subgroup and maximized between the subgroups. Such an assignment is not trivial, especially when we deal with high-dimensional data. Indeed, clustering is considered one of the most important aspects of artificial intelligence and data mining (Jain, Murty, & Flynn, 1999; Fisher, 1996). Given a data set from which we need to extract knowledge, the ultimate goal is to construct consistent, high-quality clusters in a computationally inexpensive way. Statistical approaches to data clustering have recently become an interesting and attractive research domain with the advancement of computational power that enables researchers to implement complex algorithms and deploy them in real-time applications. One major approach based on statistics is model-based clustering using finite mixture models. A finite mixture model can be defined as a weighted sum of probability distributions where each distribution represents the population of a given subgroup. The authors in Fraley and Raftery (2002) traced the use of finite mixture models back to the 1960s and 1970s, citing amongst others the works in Edwards and Cavalli-Sforza (1965), Day (1969) and Binder (1978). Although their use dates back at least as far as 1963, it is only in recent decades that mixture model applications started to cover many fields including, but not limited to, digital image processing and computer vision (Sefidpour & Bouguila, 2012; Stauffer & Grimson, 2000; Allili, Ziou, Bouguila, & Boutemedjet, 2010), social networks (Couronne, Stoica, & Beuscart, 2010; Handcock, Raftery, & Tantrum, 2007; Morinaga & Yamanishi, 2004), medicine (Koestler et al., 2010; Tao, Cheng, & Basu, 2010; Neelon, Swamy, Burgette, & Miranda, 2011; Schlattmann, 2009; Rattanasiri, Böhning, Rojanavipart, & Athipanyakom, 2004), and bioinformatics (Kim, Cho, & Kim, 2010; Meinicke, Ahauer, & Lingner, 2011; Ji, Wu, Liu, Wang, & Coombes, 2005).

The consideration of mixture models is practical for many applications. In many cases, however, the complexity of the observed data may render the use of one single distribution to represent a given class insufficient for inference. Many techniques have been proposed to select the number of mixture components that best represents the data. Examples include the Bayesian inference criterion (BIC) (Schwarz, 1978), minimum description length (MDL) (Rissanen, 1999) and minimum message length (MML) (Wallace, 2005) criteria. These criteria are mainly used in unsupervised algorithms where the system handles data modeling without any intervention during the learning process. This approach has a serious drawback in many applications, as the semantic meaning of the mixture modes in the selected model does not necessarily fit a human-comprehensible semantic. Consider for instance an object recognition application where the system has to recognize different objects according to the user's needs. An MML, BIC or MDL criterion would in most cases consider classes with important visual similarities (e.g., apple and tomato) as classes that should be represented by a single mode in the mixture. This is not always the case in real applications, where a human being or an application may need to differentiate between classes even when they have close visual properties and similarities; we therefore talk here about a semantic meaning of the mixture modes. Therefore, the gap between the system representation and the human representation of data is still high when using these methods. A possible solution is to form a hierarchical model based on some ontology, as in Ziou, Hamri, and Boutemedjet (2009), Bakhtiari and Bouguila (2010), Bakhtiari and Bouguila (2011) and Bakhtiari and Bouguila (2012), where the data is grouped into clusters and sub clusters (i.e., tree-structured clustering). Yet, this approach still forms the model using the visual similarities and groups data according to the system's choice. Furthermore, since the model is based on estimating the model's parameters as the algorithm goes deeper in the hierarchy (the distributions' parameters of the children clusters depend on the parameters of the parent clusters), changing the hierarchical model requires a whole new estimation, which increases the computational cost. User intervention to build a hierarchical mixture was introduced in Bishop and Tipping (1998), which developed the concept of hierarchical visualization, where the construction of the hierarchical tree proceeds top-down and, at each level, the user decides on the suitable number of models to fit at the next level down. This interaction may indeed yield an optimal number of clusters for each level according to the user, but it does not permit the user to assign any semantic meaning to the clusters or to group the clusters as he/she needs. Moreover, the user cannot define any ontological model for the data, and a new estimation of the parameters has to be calculated at each level as the model goes deeper in the tree.

In this work, we present a novel way to model data and assign a semantic meaning to clusters according to the user's needs, which can significantly reduce the gap between the system representation and the user-level representation. We tackle the challenging problem of object clustering, and the recognition of new unseen data in terms of assignment to the appropriate clusters forming the object class. Naturally, the choice of the distribution forming the mixture model is crucial in terms of clustering efficiency and accuracy of the classification of unseen data. Indeed, many works have focused on Gaussian mixture models (GMM) to build their applications, such as in Permuter, Francos, and Jermyn (2003), Zivkovic (2004), Yang and Ahuja (1999) and Weiling, Lei, and Yang (2010), but recent research has shown that it is not appropriate to always assume that data follow a normal distribution. For instance, the works in Boutemedjet, Bouguila, and Ziou (2009), Bouguila, Ziou, and Vaillancourt (2004), Bouguila and Ziou (2006) and Bouguila, Ziou, and Hammoud (2009) have considered the Dirichlet and generalized Dirichlet mixture models to model proportional data, and these have been shown to outperform the GMM. In our previous work we developed the inverted Dirichlet mixture model (IDMM), which has better capabilities than the GMM when modeling positive data, which occurs naturally in many real applications (Bdiri & Bouguila, 2012; Bdiri & Bouguila, 2011). Hence, we propose our new methodology using the IDMM, although it is noteworthy to bear in mind that any other distribution can be used, as the presented framework is general.

The rest of this paper is organized as follows. In Section 2, we present our statistical framework by considering a two-level hierarchy, for ease of presentation and understanding of the general methodology. In Section 3, we propose a detailed approach to learn the proposed statistical model. In Section 4, we propose a generalization of our modeling framework to cover many hierarchical levels. Section 5 is devoted to presenting the experimental results using both synthetic data and a real-life application concerning object recognition. Finally, Section 6 gives a conclusion and future perspectives for research.

2. Statistical framework: the model

We propose to develop a statistical framework that can model data in a hierarchical fashion. The attribution of a semantic meaning to the model is discussed in Section 5.2.1. In this section, we consider a two-level hierarchy where we have a set of super classes, each composed of a set of classes. The generalization of the model is discussed in Section 4. Let us consider a set $\mathcal{X}$ of $N$ $D$-dimensional vectors, such that $\mathcal{X} = (\vec{X}_1, \vec{X}_2, \ldots, \vec{X}_N)$. Let $M$ denote the number of different super classes and $K_j$ the number of classes forming the super class $j$. We assume that $\mathcal{X}$ is controlled by a mixture of mixtures, such that each super class $j$ is represented by a mixture of $K_j$ components and the parent mixture is composed of $M$ mixtures representing the super classes. Thus, we consider two views or levels for the statistical model: the first view focuses on the super classes and the second one zooms in on the classes (see Fig. 1).

Fig. 1. Two-level hierarchical model.

We suppose that the vectors follow a common but unknown probability density function $p(\vec{X}_n \mid \Xi)$, where $\Xi$ is the set of its parameters. Let $\mathcal{Z} = \{\vec{Z}_1, \vec{Z}_2, \ldots, \vec{Z}_N\}$ denote the missing group indicators, where $\vec{Z}_n = (z_{n1}, z_{n2}, \ldots, z_{nM})$ is the label of $\vec{X}_n$, such that $z_{nj} \in \{0, 1\}$, $\sum_{j=1}^{M} z_{nj} = 1$, and $z_{nj}$ is equal to one if $\vec{X}_n$ belongs to super class $j$ and zero otherwise. Then, the distribution of $\vec{X}_n$ given the super class label $\vec{Z}_n$ is:

$$p(\vec{X}_n \mid \vec{Z}_n, \Theta) = \prod_{j=1}^{M} p(\vec{X}_n \mid \theta_j)^{z_{nj}} \qquad (1)$$

where $\Theta = \{\theta_1, \theta_2, \ldots, \theta_M\}$ and $\theta_j$ is the set of parameters of the super class $j$. In practice, $p(\vec{X}_n \mid \Theta)$ can be obtained by marginalizing the complete likelihood $p(\vec{X}_n, \vec{Z}_n \mid \Theta)$ over the hidden variables. We define the prior distribution of $\vec{Z}_n$ as follows:

$$p(\vec{Z}_n \mid \vec{\pi}) = \prod_{j=1}^{M} \pi_j^{z_{nj}} \qquad (2)$$

where $\vec{\pi} = (\pi_1, \ldots, \pi_M)$, $\pi_j > 0$ and $\sum_{j=1}^{M} \pi_j = 1$. Then we have:

$$p(\vec{X}_n, \vec{Z}_n \mid \Theta, \vec{\pi}) = p(\vec{X}_n \mid \vec{Z}_n, \Theta)\, p(\vec{Z}_n \mid \vec{\pi}) = \prod_{j=1}^{M} \left( p(\vec{X}_n \mid \theta_j)\, \pi_j \right)^{z_{nj}} \qquad (3)$$

We proceed by the marginalization of Eq. (3) over the hidden variable (see Appendix A), so the first level of our mixture for a given vector $\vec{X}_n$ can be written as follows:

$$p(\vec{X}_n \mid \Theta, \vec{\pi}) = \sum_{j=1}^{M} p(\vec{X}_n \mid \theta_j)\, \pi_j \qquad (4)$$

Thus, according to the previous equation, the set of parameters $\Xi$ corresponding to the first level is $\Xi = (\Theta, \vec{\pi})$. When we examine the second level, which considers the classes, given that $\vec{X}_n$ is generated from the mixture $j$, we suppose that it is also generated from one of the $K_j$ modes of the mixture $j$. Thus, we consider $\mathcal{Y}_j = \{\vec{Y}_{1j}, \vec{Y}_{2j}, \ldots, \vec{Y}_{Nj}\}$, which denotes the missing group indicators, where the $i$th element of $\vec{Y}_{nj}$, denoted $y_{nji}$, is equal to one if $\vec{X}_n$ belongs to the class $i$ of the super class $j$ and zero otherwise. Let $\{\vec{Y}_{nj}\} = \{\vec{Y}_{n1}, \vec{Y}_{n2}, \ldots, \vec{Y}_{nM}\}$ denote the class labels of $\vec{X}_n$. Then, the distribution of $\vec{X}_n$ given the super class label $\vec{Z}_n$ and the class labels $\{\vec{Y}_{nj}\}$ is

$$p(\vec{X}_n \mid \vec{Z}_n, \{\vec{Y}_{nj}\}, \Theta, \alpha) = p(\vec{X}_n \mid \vec{Z}_n, \{\vec{Y}_{nj}\}, \alpha) = \prod_{j=1}^{M} \left( \prod_{i=1}^{K_j} p(\vec{X}_n \mid \vec{\alpha}_{ji})^{y_{nji}} \right)^{z_{nj}} \qquad (5)$$

where $\alpha = \{\vec{\alpha}_{ji}\}$, $j = 1, \ldots, M$, $i = 1, \ldots, K_j$, is the set of parameters of the modes representing the different classes. It is noteworthy to mention that $K_j$ depends on the number of classes that the super class $j$ contains. We define the prior distributions of $\vec{Y}_{nj}$ and $\{\vec{Y}_{nj}\}$ as follows:

$$p(\vec{Y}_{nj} \mid \vec{Z}_n, \{w_{ji}\}, \vec{\pi}) = p(\vec{Y}_{nj} \mid \vec{Z}_n, \{w_{ji}\}) = \prod_{i=1}^{K_j} w_{ji}^{y_{nji}} \qquad (6)$$

$$p(\{\vec{Y}_{nj}\} \mid \vec{Z}_n, \{w_{ji}\}, \vec{\pi}) = p(\{\vec{Y}_{nj}\} \mid \vec{Z}_n, \{w_{ji}\}) = \prod_{j=1}^{M} \left( \prod_{i=1}^{K_j} w_{ji}^{y_{nji}} \right)^{z_{nj}} \qquad (7)$$

where $w_{ji} > 0$, $\sum_{j=1}^{M} \sum_{i=1}^{K_j} w_{ji} = 1$, and $\{w_{ji}\}$ is the set of the modes' mixing weights with $j = 1, \ldots, M$ and $i = 1, \ldots, K_j$. Then we have:

$$p(\vec{X}_n, \vec{Z}_n, \{\vec{Y}_{nj}\} \mid \alpha, \{w_{ji}\}, \vec{\pi}) = p(\vec{X}_n \mid \vec{Z}_n, \{\vec{Y}_{nj}\}, \alpha)\, p(\{\vec{Y}_{nj}\} \mid \vec{Z}_n, \{w_{ji}\})\, p(\vec{Z}_n \mid \vec{\pi}) = \prod_{j=1}^{M} \pi_j^{z_{nj}} \left( \prod_{i=1}^{K_j} w_{ji}^{y_{nji}}\, p(\vec{X}_n \mid \vec{\alpha}_{ji})^{y_{nji}} \right)^{z_{nj}} \qquad (8)$$

We proceed by the marginalization of Eq. (8) over the hidden variables (see Appendix A), so the second-level mixture can be written as:

$$p(\vec{X}_n \mid \alpha, \{w_{ji}\}, \vec{\pi}) = \sum_{j=1}^{M} \pi_j \left( \sum_{i=1}^{K_j} p(\vec{X}_n \mid \vec{\alpha}_{ji})\, w_{ji} \right) \qquad (9)$$

Thus, according to the previous equation, the set of parameters $\Xi$ corresponding to the second level is $\Xi = (\alpha, \{w_{ji}\}, \vec{\pi})$.
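To make the two-level construction concrete, the short sketch below (illustrative code, not the authors' implementation; numpy is assumed, and the arrays `dens`, `w`, `pi` and the index map `class_to_super` are names introduced here) evaluates the second-level mixture density of Eq. (9) from precomputed per-mode densities $p(\vec{X}_n \mid \vec{\alpha}_{ji})$.

```python
import numpy as np

def mixture_density(dens, w, pi, class_to_super):
    """Evaluate the second-level mixture of Eq. (9).

    dens           : (N, T) array with p(X_n | alpha_ji) for the T = sum_j K_j modes.
    w              : (T,) mode weights w_ji, flattened over (j, i).
    pi             : (M,) super class weights pi_j.
    class_to_super : (T,) index of the super class j owning each mode.
    """
    N, T = dens.shape
    M = pi.shape[0]
    inner = np.zeros((N, M))
    for j in range(M):
        cols = class_to_super == j
        inner[:, j] = dens[:, cols] @ w[cols]   # sum_i w_ji p(X_n | alpha_ji)
    return inner @ pi                           # sum_j pi_j (sum_i w_ji p(X_n | alpha_ji)), Eq. (9)
```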

3. Model learning

3.1. Log-likelihood of the complete data

The model for the complete data $\langle \mathcal{X}, \mathcal{Z}, \{\mathcal{Y}_j\} \rangle$ is given by:

$$p(\mathcal{X}, \mathcal{Z} \mid \Theta, \vec{\pi}) = \underbrace{\prod_{n=1}^{N} \prod_{j=1}^{M} p(\vec{X}_n \mid \theta_j)^{z_{nj}}\, \pi_j^{z_{nj}}}_{\text{First level}} = \underbrace{\prod_{n=1}^{N} \prod_{j=1}^{M} \left( \prod_{i=1}^{K_j} \left( p(\vec{X}_n \mid \vec{\alpha}_{ji})\, w_{ji} \right)^{y_{nji} z_{nj}} \right) \pi_j^{z_{nj}}}_{\text{Second level}} = p(\mathcal{X}, \mathcal{Z}, \{\mathcal{Y}_j\} \mid \alpha, \{w_{ji}\}, \vec{\pi}) \qquad (10)$$

We maximize the log-likelihood instead of the likelihood. The log-likelihood is given by:

$$\Phi_c(\mathcal{X}, \mathcal{Z}, \{\mathcal{Y}_j\} \mid \alpha, \{w_{ji}\}, \vec{\pi}) = \log\left( p(\mathcal{X}, \mathcal{Z}, \{\mathcal{Y}_j\} \mid \alpha, \{w_{ji}\}, \vec{\pi}) \right) = \sum_{n=1}^{N} \sum_{j=1}^{M} \sum_{i=1}^{K_j} z_{nj} \left[ \log(\pi_j) + y_{nji} \left( \log(w_{ji}) + \log(p(\vec{X}_n \mid \vec{\alpha}_{ji})) \right) \right] \qquad (11)$$

Let us recall that $\alpha = \{\vec{\alpha}_{ji}\}$, $j = 1, \ldots, M$, $i = 1, \ldots, K_j$, such that $\vec{\alpha}_{ji}$ is the set of parameters of the inverted Dirichlet distribution for the class $i$ of the super class $j$. In order to estimate the parameters, we use the expectation maximization (EM) algorithm, which proceeds iteratively in two steps: the expectation (E) step and the maximization (M) step. In the E-step, we compute the conditional expectation of $\Phi_c(\mathcal{X}, \mathcal{Z}, \{\mathcal{Y}_j\} \mid \alpha, \{w_{ji}\}, \vec{\pi})$, which reduces to the computation of the posterior probabilities (i.e. the probability that a vector $\vec{X}_n$ is assigned to a mixture $j$, and the probability that $\vec{X}_n$ is assigned to the mode $i$ of $j$), such that (see Appendix B):

$$p(j, i \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi}) = \frac{w_{ji}\, \pi_j\, p(\vec{X}_n \mid \vec{\alpha}_{ji})}{\sum_{j=1}^{M} \sum_{i=1}^{K_j} w_{ji}\, \pi_j\, p(\vec{X}_n \mid \vec{\alpha}_{ji})} \qquad (12)$$

and

$$p(j \mid \vec{X}_n, \Theta, \vec{\pi}) = \frac{\pi_j\, p(\vec{X}_n \mid \theta_j)}{\sum_{j=1}^{M} \pi_j\, p(\vec{X}_n \mid \theta_j)} = \frac{\pi_j \sum_{i=1}^{K_j} w_{ji}\, p(\vec{X}_n \mid \vec{\alpha}_{ji})}{\sum_{j=1}^{M} \pi_j \sum_{i=1}^{K_j} w_{ji}\, p(\vec{X}_n \mid \vec{\alpha}_{ji})} = p(j \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi}) \qquad (13)$$

Thus, we have:

$$\log p(\mathcal{X} \mid \alpha, \{w_{ji}\}, \vec{\pi}) = \sum_{n=1}^{N} \sum_{j=1}^{M} \sum_{i=1}^{K_j} \left[ p(j \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi}) \log(\pi_j) + p(j, i \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi}) \left( \log(w_{ji}) + \log(p(\vec{X}_n \mid \vec{\alpha}_{ji})) \right) \right] \qquad (14)$$

Then, the conditional expectation of the complete-data log-likelihood, using Lagrange multipliers to introduce the constraints on the mixture proportions $\{\pi_j\}$ and $\{w_{ji}\}$, is given by:

$$Q(\mathcal{X}, \alpha, \{w_{ji}\}, \vec{\pi}, \Lambda) = \log p(\mathcal{X} \mid \alpha, \{w_{ji}\}, \vec{\pi}) + \lambda_1 \left( 1 - \sum_{j=1}^{M} \pi_j \right) + \lambda_2 \left( 1 - \sum_{j=1}^{M} \sum_{i=1}^{K_j} w_{ji} \right) \qquad (15)$$

where $\Lambda = \{\lambda_1, \lambda_2\}$ is the set of Lagrange multipliers.
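As an illustration of the E-step, the following sketch (again assuming numpy and the flattened one-column-per-mode layout used above) computes the posteriors of Eqs. (12) and (13) in the log domain for numerical stability; the super class posterior is obtained by summing the mode posteriors within each super class, which follows directly from Eqs. (12) and (13).

```python
import numpy as np

def e_step(log_dens, w, pi, class_to_super):
    """Posterior responsibilities of Eqs. (12) and (13).

    log_dens       : (N, T) array of log p(X_n | alpha_ji) for the T = sum_j K_j modes.
    w              : (T,) mode weights w_ji.
    pi             : (M,) super class weights pi_j.
    class_to_super : (T,) super class index j of each mode.
    Returns (r_mode, r_super): p(j, i | X_n, ...) and p(j | X_n, ...).
    """
    log_joint = log_dens + np.log(w) + np.log(pi[class_to_super])   # numerator of Eq. (12)
    log_joint -= log_joint.max(axis=1, keepdims=True)               # stabilize the exponentials
    joint = np.exp(log_joint)
    r_mode = joint / joint.sum(axis=1, keepdims=True)               # Eq. (12)
    M = pi.shape[0]
    r_super = np.zeros((log_dens.shape[0], M))
    for j in range(M):
        r_super[:, j] = r_mode[:, class_to_super == j].sum(axis=1)  # Eq. (13)
    return r_mode, r_super
```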

3.2. Parameters estimation

Although our model is general and can therefore adopt any possible probability density function, we use here the inverted Dirichlet distribution as the density model. This distribution permits multiple symmetric and asymmetric modes, and it may be skewed to the right, skewed to the left or symmetric, which gives it suitable properties to model different forms of positive data. If a $D$-dimensional positive vector $\vec{X} = (X_1, X_2, \ldots, X_D)$ follows an inverted Dirichlet distribution, the joint density function is given by Bdiri and Bouguila (2012), Bdiri and Bouguila (2011) and Tiao and Guttman (1965):

$$p(\vec{X} \mid \vec{\alpha}) = \frac{\Gamma(|\vec{\alpha}|)}{\prod_{d=1}^{D+1} \Gamma(\alpha(d))} \prod_{d=1}^{D} X_d^{\alpha(d) - 1} \left( 1 + \sum_{d=1}^{D} X_d \right)^{-|\vec{\alpha}|} \qquad (16)$$

where $\Gamma(\cdot)$ is the gamma function, $X_d > 0$ for $d = 1, 2, \ldots, D$, $\vec{\alpha} = (\alpha(1), \ldots, \alpha(D+1))$ is the vector of parameters and $|\vec{\alpha}| = \sum_{d=1}^{D+1} \alpha(d)$, with $\alpha(d) > 0$, $d = 1, 2, \ldots, D+1$.
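For reference, Eq. (16) can be transcribed in log form as follows (a sketch assuming numpy/scipy, vectorized over an (N, D) matrix of strictly positive vectors):

```python
import numpy as np
from scipy.special import gammaln

def inverted_dirichlet_logpdf(X, alpha):
    """Log of the inverted Dirichlet density of Eq. (16).

    X     : (N, D) array of strictly positive vectors.
    alpha : (D+1,) parameter vector with alpha(d) > 0.
    """
    a_sum = alpha.sum()                                   # |alpha|
    log_norm = gammaln(a_sum) - gammaln(alpha).sum()      # log Gamma(|alpha|) - sum_d log Gamma(alpha(d))
    return (log_norm
            + (np.log(X) * (alpha[:-1] - 1.0)).sum(axis=1)
            - a_sum * np.log1p(X.sum(axis=1)))
```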

Parameter estimation is based on the maximization of the log-likelihood (Eq. (15)). The maximization gives the following for the mixture weights (see Appendix B):

$$\pi_j = \frac{\sum_{n=1}^{N} p(j \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi})}{N} \qquad (17)$$

$$w_{ji} = \frac{\sum_{n=1}^{N} p(j, i \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi})}{N} \qquad (18)$$

Calculating the derivative with respect to $\alpha_{ji}(d)$, $d = 1, \ldots, D$, and using the inverted Dirichlet distribution, we obtain:

$$\frac{\partial Q(\mathcal{X}, \alpha, \{w_{ji}\}, \vec{\pi}, \Lambda)}{\partial \alpha_{ji}(d)} = \sum_{n=1}^{N} p(j \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi})\, p(j, i \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi}) \frac{\partial \log(p(\vec{X}_n \mid \vec{\alpha}_{ji}))}{\partial \alpha_{ji}(d)} = \sum_{n=1}^{N} p(j \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi})\, p(j, i \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi}) \left( \Psi(|\vec{\alpha}_{ji}|) - \Psi(\alpha_{ji}(d)) + \log \frac{X_{nd}}{1 + \sum_{d=1}^{D} X_{nd}} \right) \qquad (19)$$

where $\Psi(\cdot)$ is the digamma function. The derivative with respect to $\alpha_{ji}(D+1)$ is given by:

$$\frac{\partial Q(\mathcal{X}, \alpha, \{w_{ji}\}, \vec{\pi}, \Lambda)}{\partial \alpha_{ji}(D+1)} = \sum_{n=1}^{N} p(j \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi})\, p(j, i \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi}) \left( \Psi(|\vec{\alpha}_{ji}|) - \Psi(\alpha_{ji}(D+1)) + \log \frac{1}{1 + \sum_{d=1}^{D} X_{nd}} \right) \qquad (20)$$

Looking at the previous two equations, it is clear that an explicit form of the solution to estimate $\vec{\alpha}_{ji}$ does not exist. Thus, we resort to the Newton-Raphson method, expressed as:

$$\vec{\alpha}_{ji}^{\,\text{new}} = \vec{\alpha}_{ji}^{\,\text{old}} - H_{ji}^{-1} G_{ji}, \quad j = 1, \ldots, M, \; i = 1, \ldots, K_j \qquad (21)$$

where $H_{ji}$ is the Hessian matrix associated with $Q(\mathcal{X}, \alpha, \{w_{ji}\}, \vec{\pi}, \Lambda)$ and $G_{ji}$ is the vector of first derivatives,

$$G_{ji} = \left( \frac{\partial Q(\mathcal{X}, \alpha, \{w_{ji}\}, \vec{\pi}, \Lambda)}{\partial \alpha_{ji}(1)}, \ldots, \frac{\partial Q(\mathcal{X}, \alpha, \{w_{ji}\}, \vec{\pi}, \Lambda)}{\partial \alpha_{ji}(D+1)} \right)^{T}$$

To calculate the Hessian of $Q(\mathcal{X}, \alpha, \{w_{ji}\}, \vec{\pi}, \Lambda)$ we have to compute the second and mixed derivatives:

$$\frac{\partial^2 Q(\mathcal{X}, \alpha, \{w_{ji}\}, \vec{\pi}, \Lambda)}{\partial \alpha_{ji}(d)^2} = \left( \Psi'(|\vec{\alpha}_{ji}|) - \Psi'(\alpha_{ji}(d)) \right) \sum_{n=1}^{N} p(j \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi})\, p(j, i \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi}) \qquad (22)$$

$$\frac{\partial^2 Q(\mathcal{X}, \alpha, \{w_{ji}\}, \vec{\pi}, \Lambda)}{\partial \alpha_{ji}(d_1)\, \partial \alpha_{ji}(d_2)} = \Psi'(|\vec{\alpha}_{ji}|) \sum_{n=1}^{N} p(j \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi})\, p(j, i \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi}) \qquad (23)$$

where $\Psi'(\cdot)$ is the trigamma function. Thus,

$$H_{ji} = \sum_{n=1}^{N} p(j \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi})\, p(j, i \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi}) \begin{pmatrix} \Psi'(|\vec{\alpha}_{ji}|) - \Psi'(\alpha_{ji}(1)) & \Psi'(|\vec{\alpha}_{ji}|) & \cdots & \Psi'(|\vec{\alpha}_{ji}|) \\ \Psi'(|\vec{\alpha}_{ji}|) & \Psi'(|\vec{\alpha}_{ji}|) - \Psi'(\alpha_{ji}(2)) & \cdots & \Psi'(|\vec{\alpha}_{ji}|) \\ \vdots & \vdots & \ddots & \vdots \\ \Psi'(|\vec{\alpha}_{ji}|) & \Psi'(|\vec{\alpha}_{ji}|) & \cdots & \Psi'(|\vec{\alpha}_{ji}|) - \Psi'(\alpha_{ji}(D+1)) \end{pmatrix} \qquad (24)$$

We can write:

$$H_{ji} = D_{ji} + a_{ji} A_{ji}^{T} A_{ji} \qquad (25)$$

where

$$D_{ji} = \operatorname{diag}\!\left[ -\sum_{n=1}^{N} p(j \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi})\, p(j, i \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi})\, \Psi'(\alpha_{ji}(1)),\; \ldots,\; -\sum_{n=1}^{N} p(j \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi})\, p(j, i \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi})\, \Psi'(\alpha_{ji}(D+1)) \right] \qquad (26)$$

$D_{ji}$ is a diagonal matrix, $a_{ji} = \sum_{n=1}^{N} p(j \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi})\, p(j, i \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi})\, \Psi'(|\vec{\alpha}_{ji}|)$ and $A_{ji}^{T} = (a_1, \ldots, a_{D+1})$ with $a_d = 1$, $d = 1, \ldots, D+1$. Then, using Theorem 8.3.3 in Graybill (1983), the inverse matrix is given by:

$$H_{ji}^{-1} = D_{ji}^{-1} + a_{ji}^{*} A_{ji}^{*T} A_{ji}^{*} \qquad (27)$$

$D_{ji}^{-1}$ can be easily computed, since $D_{ji}$ is diagonal, and:

$$A_{ji}^{*} = \frac{1}{\sum_{n=1}^{N} p(j \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi})\, p(j, i \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi})} \left( \frac{1}{\Psi'(\alpha_{ji}(1))}, \ldots, \frac{1}{\Psi'(\alpha_{ji}(D+1))} \right) \qquad (28)$$

and

$$a_{ji}^{*} = \left[ \Psi'(|\vec{\alpha}_{ji}|) \left( \sum_{d=1}^{D+1} \frac{1}{\Psi'(\alpha_{ji}(d))} \right) - 1 \right]^{-1} \Psi'(|\vec{\alpha}_{ji}|) \sum_{n=1}^{N} p(j \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi})\, p(j, i \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi}) \qquad (29)$$
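The update of Eq. (21) can be sketched as follows for a single mode (j, i) (illustrative numpy/scipy code; `resp` holds the product p(j | X_n, ·) p(j, i | X_n, ·) from Eqs. (12)-(13), and the closed-form inverse of Eqs. (25)-(29) is replaced by a direct linear solve, which is mathematically equivalent but less efficient):

```python
import numpy as np
from scipy.special import psi, polygamma   # psi = digamma, polygamma(1, .) = trigamma

def newton_step_alpha(X, resp, alpha):
    """One Newton-Raphson update (Eq. 21) of the inverted Dirichlet parameters of mode (j, i).

    X     : (N, D) array of positive feature vectors.
    resp  : (N,) array with p(j | X_n, ...) * p(j, i | X_n, ...) for this mode.
    alpha : (D+1,) current parameter vector alpha_ji.
    """
    N, D = X.shape
    S = resp.sum()
    a_sum = alpha.sum()                                                # |alpha_ji|
    log_ratio = np.log(np.column_stack([X, np.ones(N)])
                       / (1.0 + X.sum(axis=1, keepdims=True)))         # log terms of Eqs. (19)-(20)
    G = resp @ (psi(a_sum) - psi(alpha) + log_ratio)                   # gradient vector G_ji
    a = S * polygamma(1, a_sum)
    H = np.full((D + 1, D + 1), a) - np.diag(S * polygamma(1, alpha))  # Hessian of Eq. (24)
    return alpha - np.linalg.solve(H, G)                               # Eq. (21)
```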

4. Generalization of the model

4.1. A flexible hierarchical model

In the previous sections, two levels of the hierarchy were modeled with prior knowledge about how to group the classes within the super classes. In many real-life applications, however, we do not have this prior knowledge. Thus, we propose to overcome this problem by generalizing our model to several hierarchical levels. Indeed, when we change the strategy of how we look at the model, we can overcome the model selection problem at each super class level (i.e. prior knowledge about the number of components that each super class must have).¹ Indeed, the second level can be viewed as a mixture of $\sum_{j=1}^{M} K_j$ components with mixture weights $w_{ji}$ such that $\sum_{j=1}^{M} \sum_{i=1}^{K_j} w_{ji} = 1$. According to Eqs. (12) and (13), we have:

$$p(j \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi}) = \sum_{i=1}^{K_j} p(j, i \mid \vec{X}_n, \alpha, \{w_{ji}\}, \vec{\pi}) \qquad (30)$$

Thus, Eqs. (17), (18) and (30) give us:

$$\pi_j = \sum_{i=1}^{K_j} w_{ji} \qquad (31)$$

¹ Many applications might find the prior specification of the number of components forming each super class useful. In our case, we want the system to form the classes and then group them dynamically according to the user's needs. In this paper, both approaches are proposed.

When considering a hierarchical model with several levels, we propose to estimate the probability density parameters and the weights of the bottom level using the algorithm that we proposed in Bdiri and Bouguila (2012), and then to estimate the weight of each upper level by summing the weights of its children. This approach can be practical in real-life applications when we do not have prior knowledge about the hierarchy, as we will show in the experimental part. For instance, an object class can be modeled by a $K_j$-component mixture without prior knowledge about $K_j$. To define a super class of object classes, we simply sum their corresponding weights and move to an upper level in the hierarchy. Thus, we mainly develop two approaches: the first one is used when we have prior knowledge about the number of clusters/classes forming a super class, and the second one when we do not have such knowledge, so that we can group the clusters/classes according to a given hierarchy that can change on the fly, which is the purpose of this work.
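A minimal sketch of this bottom-up grouping (Eq. (31)): regrouping the bottom-level modes under another ontology only changes an index map, not the estimated densities or their weights. The function and variable names are illustrative.

```python
import numpy as np

def super_class_weights(w, class_to_super, M):
    """Eq. (31): the weight of each super class is the sum of its children's weights."""
    pi = np.zeros(M)
    np.add.at(pi, class_to_super, w)
    return pi

# Example: 5 bottom-level modes regrouped under two different 2-super-class hierarchies.
w = np.array([0.10, 0.20, 0.30, 0.25, 0.15])
print(super_class_weights(w, np.array([0, 0, 1, 1, 1]), M=2))   # [0.3, 0.7]
print(super_class_weights(w, np.array([0, 1, 1, 0, 0]), M=2))   # [0.5, 0.5]
```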

4.2. Learning algorithm

To initialize the EM algorithm, we use the approach proposed in Bdiri and Bouguila (2012), based on the K-Means algorithm and the method of moments. As we have two approaches for building the hierarchical model, we present two algorithms: the first is used when we have the required prior knowledge about the super classes, and the second when we do not have such knowledge.

4.2.1. Learning in the presence of prior knowledge

When we know in advance the number of components forming each super class, we have two ways to initialize the parameters. The first way is to let the system group the clusters based on similarities: we use the K-Means algorithm to obtain M clusters representing the super classes and then reuse K-Means to determine the appropriate classes $K_j$ of each super class. The second way is to use K-Means to obtain $\sum_{j=1}^{M} K_j$ clusters and then group them within each super class $j$.

• Initialization algorithm
  1. Grouping by the system: apply K-Means on the N D-dimensional vectors to obtain M initial super classes and reapply it on each super class $j$ to obtain $K_j$ clusters.
     Grouping by the user: apply K-Means on the N D-dimensional vectors to obtain $\sum_{j=1}^{M} K_j$ initial clusters, and then group the chosen clusters $K_j$ for each super class $j$.
  2. Weights initialization:
     - Calculate $w_{ji} = \frac{\text{number of elements in cluster } ji}{N}$
     - Calculate $\pi_j = \frac{\text{number of elements in super class } j}{N}$
  3. Apply the moments method described in Bdiri and Bouguila (2012) to each component $ji$ to obtain a vector of parameters corresponding to the given cluster $ji$.

Then, the estimation algorithm can be summarized as follows.

• Estimation algorithm
  1. INPUT: D-dimensional data $\vec{X}_n$, $n = 1, \ldots, N$, a specified number of super classes M, and the number of classes $K_j$ in each super class $j$.
  2. Initialization algorithm.
  3. E-Step: compute the posterior probabilities using Eqs. (12) and (13).
  4. M-Step:
     - Update $\vec{\alpha}_{ji}$ using Eq. (21), $j = 1, \ldots, M$, $i = 1, \ldots, K_j$.
     - Update $w_{ji}$ using Eq. (18) and $\pi_j$ using Eq. (31), $j = 1, \ldots, M$, $i = 1, \ldots, K_j$.
  5. If the convergence test $\Delta p(\mathcal{X} \mid \alpha, \{w_{ji}\}, \vec{\pi}) < \epsilon$ is passed, terminate; else go to 3.

where $\Delta p(\mathcal{X} \mid \alpha, \{w_{ji}\}, \vec{\pi})$ is the difference between the likelihoods calculated in two consecutive iterations.

4.2.2. Dynamic hierarchical grouping without prior knowledge about the super classes

For many applications, the hierarchical model can be built on the fly according to the user's needs or depending on some circumstances. Here, we present the algorithm proposed in this case; a skeleton is sketched after this subsection.

• Initialization algorithm
  1. Apply K-Means on the N D-dimensional vectors to obtain $\sum_{j=1}^{M} K_j$ initial clusters.
  2. Calculate $w_{ji} = \frac{\text{number of elements in cluster } ji}{N}$.
  3. Apply the moments method to each component $ji$ to obtain the initial parameters corresponding to each cluster.

Then, the estimation algorithm can be summarized as follows.

• Estimation algorithm
  1. INPUT: D-dimensional data $\vec{X}_n$, $n = 1, \ldots, N$, and a specified total number of clusters $\sum_{j=1}^{M} K_j$.
  2. Initialization algorithm.
  3. E-Step: compute the posterior probabilities for the $\sum_{j=1}^{M} K_j$ components of an IDMM using the algorithm in Bdiri and Bouguila (2012).
  4. M-Step:
     - Update the parameters of the mixture at the lowest level according to Bdiri and Bouguila (2012), considering a finite mixture of IDM with $\sum_{j=1}^{M} K_j$ components.
     - Update $w_t = w_{ji}$ using Eq. (18) with the posterior probabilities calculated in step 3, $j = 1, \ldots, M$, $i = 1, \ldots, K_j$, $t = 1, \ldots, \sum_{j=1}^{M} K_j$.
  5. If the convergence test $\Delta p(\mathcal{X} \mid \alpha, \{w_t\}) < \epsilon$ is passed, go to 6; else go to 3. Here $p(\mathcal{X} \mid \alpha, \{w_t\})$ is the likelihood of the considered IDMM having $\sum_{j=1}^{M} K_j$ components, with the set of parameters $\{\alpha, \{w_t\}\}$, $t = 1, \ldots, \sum_{j=1}^{M} K_j$.
  6. For each level of the hierarchy: compute $\pi_j = \sum_{i=1}^{K_j} w_{ji}$ according to a given ontological model.

Using this algorithm, we do not have to specify $K_j$ for each super class $j$; we rather specify the total number $\sum_{j=1}^{M} K_j$ of modelled classes at the lowest level of the hierarchy. We then specify $K_j$ for each super class $j$ based on a given ontological model and the obtained results. It is noteworthy that any number of levels in the hierarchy can be built by simply constructing their weights as the sums of their respective children's weights. In our application and experiments we use this algorithm.
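The following skeleton summarizes this algorithm (a sketch, assuming numpy, scipy and scikit-learn, and reusing the `inverted_dirichlet_logpdf` and `newton_step_alpha` sketches given earlier; `moments_init` is a placeholder name for the method-of-moments initialization of Bdiri and Bouguila (2012)). In this flat bottom-level fit, the ordinary IDMM responsibilities play the role of the posterior product used in the hierarchical M-step.

```python
import numpy as np
from scipy.special import logsumexp
from sklearn.cluster import KMeans

def fit_bottom_level_idmm(X, T, n_iter=200, tol=1e-5):
    """Fit a flat T-component inverted Dirichlet mixture at the lowest level (Section 4.2.2)."""
    labels = KMeans(n_clusters=T, n_init=10).fit_predict(X)          # step 1: K-Means initialization
    w = np.bincount(labels, minlength=T) / len(X)                    # step 2: initial weights
    alphas = [moments_init(X[labels == t]) for t in range(T)]        # step 3: method of moments (placeholder)
    prev_ll = -np.inf
    for _ in range(n_iter):
        log_dens = np.column_stack([inverted_dirichlet_logpdf(X, a) for a in alphas])
        log_joint = log_dens + np.log(w)
        ll = logsumexp(log_joint, axis=1).sum()                      # log p(X | alpha, {w_t})
        if ll - prev_ll < tol:                                       # step 5: convergence test
            break
        prev_ll = ll
        resp = np.exp(log_joint - logsumexp(log_joint, axis=1, keepdims=True))      # E-step
        w = resp.mean(axis=0)                                        # M-step, Eq. (18)
        alphas = [newton_step_alpha(X, resp[:, t], alphas[t]) for t in range(T)]    # M-step, Eq. (21)
    # Step 6: for any chosen ontology, pi_j is the sum of the children's weights (Eq. 31).
    return alphas, w
```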

5. Experimental results

5.1. Synthetic data

In this section, an evaluation of our proposed algorithm is performed using synthetic data. We report results on one-dimensional and multi-dimensional synthetic data sets. We also analyze the performance of our estimation approach in the presence of outliers, and finally we analyze the capabilities of our approach in modeling overlapping classes. It is noteworthy that we suppose here that we have a certain knowledge about the grouping of the different classes into super classes based on a given semantic model. The generation of inverted Dirichlet densities according to given parameters is discussed in detail in Bdiri and Bouguila (2012) and Bdiri and Bouguila (2011).

Fig. 2. Different representations of the model in Table 1: (a) classes labeled in ji; (b) super classes; (c) the whole mixture.

Table 1. Real and estimated parameters in the case of a one-dimensional dataset generated from a 4-components mixture model (left block: real, right block: estimated).

j  i   πj    wji   αji(1)  αji(2)   |   πj      wji     αji(1)   αji(2)
1  1   0.5   0.3   50      50       |   0.4997  0.2993  50.5311  50.4579
1  2         0.2   3       23       |           0.2004  2.9847   22.8295
2  1   0.5   0.2   85      30       |   0.5003  0.2000  85.0137  29.9959
2  2         0.3   20      50       |           0.3003  20.0125  49.9427

5.1.1. Results

We performed all the following experiments with a Newton-Raphson convergence precision of ε = 10⁻⁵. We generated artificial histograms from artificial inverted Beta mixture distributions² and estimated their related parameters according to the proposed algorithm. For the first histogram we considered two super classes with two classes each, as shown in Fig. 2, which displays different representations of the considered model. Fig. 2(a) shows the different classes of the model, whereas Fig. 2(b) shows the considered super classes. Note that a different hierarchy of super classes can be chosen by grouping different classes. Fig. 2(c) shows the whole mixture. The obtained results are reported in Table 1, where we show the real and estimated parameters. The maximum detected percent error of relative change between real and estimated parameters, among all the set of parameters, is 1.06%, which reflects a good estimation. Thus, the estimated histogram and the real one are indistinguishable. Our algorithm is shown to perform well for one-dimensional data.

² The inverted Beta is the one-dimensional special case of the inverted Dirichlet obtained when D = 1.

We also performed a set of experiments on multi-dimensional generated data. We generated two different 2-dimensional data sets (see Figs. 3 and 4). In the first experiment, associated with Fig. 3, we generated three super classes, the first one grouping two classes and the others having one class each. The results are reported in Table 2. In the second experiment, associated with Fig. 4, we generated two super classes, where the first one groups two classes and the second one groups four classes. The results are reported in Table 3. The maximum detected percent errors of relative change between real and estimated parameters for the two experiments are respectively 0.69% and 1.19%, which reflects a good estimation as well, given that we have considered overlapping densities. Naturally, the more the densities overlap, the harder it is to obtain good estimates of their parameters. It is needless to say that the representations of the estimated probability mixtures and the real probability mixtures are also indistinguishable. Finally, we considered 6-dimensional data where we generated 2 super classes, one having 2 classes and the other having one single class. The results are reported in Tables 4 and 5, where we respectively report the real and estimated parameters. The maximum detected percent error of relative change between real and estimated parameters is 0.44%, which shows again that our algorithm performs well when dealing with multi-dimensional data.

Fig. 3. Different perspectives for the representation of the model in Table 2.

5.1.2. The estimator robustness

The purpose of this experiment is to empirically evaluate a breakdown point, which gives an idea about the robustness of our estimator and how our algorithm is affected by outliers, as in the previous subsections we used perfect data that was generated entirely from inverted Dirichlet distributions. The breakdown point is the smallest fraction of data contamination needed to cause an arbitrarily large change in the estimate (He, Simpson, & Portnoy, 1990; Hennig, 2004). We propose to generate two super classes, one with two classes and the other with one single class, as shown in Table 6, and then contaminate the data with a certain percentage of outliers that do not follow any of the previously used distributions. The outliers can be generated from any other distribution (e.g., normal distribution, uniform distribution, etc.). We model the outliers by a normal distribution with mean μ = 0 and standard deviation σ = 4, as the histogram is located between 0 and 4. We substitute the perfect generated data by the outliers according to the contamination level, choosing the positions uniformly at random. Naturally the weights can be significantly affected, but we focus more on the change of the distributions' shapes. The obtained results are reported in Fig. 5, where we plot the different histograms, including the real one and the estimated histograms of the data contaminated with different percentages of outliers. As shown in the previous subsections, when using perfect data with 0% contamination the real and estimated histograms are indistinguishable. As we increase the contamination level, the shape of the histograms starts to differ from the real one. The shape does not change significantly until a certain level of contamination, and then it naturally starts to deviate from the original shape. The aspect of the histogram with a very high contamination reaching 50% remains acceptable, as we can still distinguish the three clusters with their respective shapes. These results show that our estimator is robust when affected by outliers. Nevertheless, the choice and nature of the outliers themselves can significantly affect these results. For instance, if we use outliers whose values overpass certain boundaries (extreme outliers), these results might be altered. A possible solution to the outlier problem is to model the outliers themselves within a super class grouping eventual outliers, as done, for example, in Bouguila, Almakadmeh, and Boutemedjet (2012).
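A possible implementation of the contamination procedure described above is sketched below (the text does not specify how negative normal draws are handled, so taking absolute values to keep the data positive is an assumption made here):

```python
import numpy as np

def contaminate(X, rate, rng=None):
    """Replace a fraction `rate` of the sample by outliers, as in Section 5.1.2.

    Outliers are drawn from a normal distribution (mean 0, std 4); the positions to
    overwrite are chosen uniformly at random.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    Xc = np.array(X, copy=True)
    n_out = int(rate * len(Xc))
    idx = rng.choice(len(Xc), size=n_out, replace=False)             # uniform choice of positions
    Xc[idx] = np.abs(rng.normal(0.0, 4.0, size=(n_out,) + Xc.shape[1:])) + 1e-6
    return Xc
```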

Fig. 4. Different perspectives (perspectives 1-5) for the representation of the model in Table 3.

Table 2. Real and estimated parameters in the case of a two-dimensional dataset generated from a 4-components mixture model (left block: real, right block: estimated).

j  i   πj    wji   αji(1)  αji(2)  αji(3)  |  πj      wji     αji(1)   αji(2)   αji(3)
1  1   0.4   0.2   5       53      50      |  0.4000  0.2003  4.9778   52.6717  49.7877
1  2         0.2   10      30      50      |          0.1997  9.9563   29.8248  49.7256
2  1   0.3   0.3   32      23      40      |  0.3000  0.3000  32.2221  23.1393  40.2735
3  1   0.3   0.3   80      10      60      |  0.3000  0.3000  80.3250  10.0611  60.3193

Table 3. Real and estimated parameters in the case of a two-dimensional dataset generated from a 6-components mixture model (left block: real, right block: estimated).

j  i   πj    wji    αji(1)  αji(2)  αji(3)  |  πj      wji     αji(1)   αji(2)   αji(3)
1  1   0.3   0.15   18      60      60      |  0.2997  0.1500  18.1848  60.6012  60.6201
1  2         0.15   18      20      34      |          0.1497  17.8870  19.9435  33.9275
2  1   0.7   0.10   53      9       39      |  0.7003  0.1001  52.5474  8.9642   38.6597
2  2         0.20   50      50      50      |          0.2007  49.4120  49.4034  49.4537
2  3         0.20   5       40      23      |          0.1997  5.0077   40.1441  23.0689
2  4         0.20   20      8       40      |          0.1998  20.0005  7.9958   39.9639

Table 4. Real parameters in the case of a 6-dimensional dataset generated from a 3-components mixture model.

j  i   πj    wji   αji(1)  αji(2)  αji(3)  αji(4)  αji(5)  αji(6)  αji(7)
1  1   0.8   0.4   50      39      34      22      56      3       41
1  2         0.4   18      19      29      39      49      32      95
2  1   0.2   0.2   43      56      90      93      94      95      32

Table 6. Real parameters in the case of a one-dimensional dataset generated from a 3-components mixture model.

j  i   πj    wji   αji(1)  αji(2)
1  1   0.8   0.4   8       10
1  2         0.4   90      40
2  1   0.2   0.2   10      60

Fig. 5. Real and estimated histograms using different data contamination levels (0%, 2%, 5%, 10%, 20%, 30%, 40%, 50%).

5.1.3. Classic model vs multi-clusters model

When we encounter a clustering problem, one important aspect is to specify the number of classes representing the data. Nevertheless, in many applications, the number of classes is already known (e.g. binary classification: an image contains an object or not).

The classical approach, when we have a fixed number of classes, is to model the data by a mixture whose number of components is equal to the number of classes. However, when using the classical approach, the modeling process can face two major problems. The first problem is that one single density component does not necessarily fit the class data. Even though we use flexible probability densities in terms of shape and symmetry or asymmetry, in many cases the data points of certain classes cannot be modeled by a single mixture component. The second major problem is the overlapping between the classes when using a single distribution to model each class. Consider for instance that the class elements of a given class X are distributed between the class elements of another class Y; adequate data modeling would then be unlikely when using a single density component to represent each class. In this subsection, we compare the classic data modeling approach with our proposed one in terms of data fitting and adequate representations. Let us reconsider the model in Table 1 displayed in Fig. 2. We generate a set of points according to these distributions and use them to estimate their related distributions' parameters, assuming that they are generated from inverted Dirichlet distributions. Knowing that we have two super classes, thus two categories, the classical approach consists of representing every super class with one single density. The second approach, which is ours, is to represent every super class with a set of classes or clusters. We report the resulting histograms in Fig. 6.

We established several experiments using different numbers of clusters. The exact number of model clusters, which is four, was already experimented with in Section 5.1.1 and it gives very good results, as it generates histograms that are indistinguishable from the real ones. When using two clusters only, the results do not reflect the real histograms and the algorithm fails to find the real shape of the mixture, which is expected as we have two super classes whose parts are within each other. So, the classical approach fails to find the real shape of the histograms. When using three clusters, we have better results as we approach the real shape of the super class 2. When using four clusters and more, the algorithm succeeds in finding the real shape of the super classes and the mixture. It is noteworthy that we assume the presence of knowledge about how to group the clusters into super classes. When using ten clusters, we distinguished the classes in Fig. 6(m) (for ease of presentation, the classes belonging to the super class 2 were plotted in dashed lines). Increasing the number of classes within a super class does not affect the shape of the super class, as we just distinguish more between the different classes at a lower level, but this does not affect their membership to their respective super classes. This observation will be discussed in detail in Section 5.2.2. As for now, it is clear that the classic model gives an inadequate representation, as the two super classes overlap and the classic approach fails to distinguish them. When considering our algorithm, the results are much closer to reality, since we clearly overcome the overlapping problem.

Table 5. Estimated parameters in the case of a 6-dimensional dataset generated from a 3-components mixture model.

j  i   πj       wji      αji(1)   αji(2)   αji(3)   αji(4)   αji(5)   αji(6)   αji(7)
1  1   0.8000   0.4000   49.9276  38.9625  33.9762  21.9628  55.9363  2.9990   40.9625
1  2            0.4000   18.0368  18.9945  29.0011  39.0077  49.0462  32.0068  95.0483
2  1   0.2000   0.2000   43.1571  56.2434  90.3963  93.2768  94.2857  95.2934  32.1198

Fig. 6. Different generated models and their representations as a function of the number of used classes: (a)-(c) 2 classes, (d)-(f) 3 classes, (g)-(i) 5 classes, (j)-(l) 6 classes and (m)-(o) 10 classes; for each case, the three panels show the classes labeled in ji, the two super classes, and the resulting mixture.

Fig. 7. 10 objects in 8 classes.

Fig. 8. 41 different views of the object 'Dog'.

³ http://www.d2.mpi-inf.mpg.de/Datasets/ETH80

5.2. Real data

In this section we investigate the performance of our algorithm using real-life data extracted when tackling the challenging problems of object modeling and recognition. The first goal of this application is to analyze the benefits of our approach compared to the commonly used method where every object class is represented by one single cluster, thus one single probability density. We also discuss how to attribute different semantic meanings to the clusters according to a hierarchy chosen during the training phase. It is noteworthy that the proposed hierarchical grouping approach is simple and does not need extra computational time when changing the hierarchical model. We have considered the publicly available ETH-80³ dataset, which is composed of 3280 images grouped into eight object classes having ten unique objects each, as shown in Fig. 7. Each object has 41 perspectives. Fig. 8 shows the different perspectives of the object 'Dog' located on the extreme left of the 'Dog' class in Fig. 7. We propose to apply our algorithm on this database, which is interesting in terms of hierarchical classification as the object classes can be grouped into different hierarchical levels. Indeed, different super classes can be considered, such as objects (Car, Cup) vs non objects, or animals vs fruits vs objects, or fruits vs non fruits, etc. The features of each image have been extracted using the local Histogram of Oriented Gradients (HOG) descriptor proposed in Dalal and Triggs (2005). The HOG descriptor is efficient in detecting local characteristics of images and in object detection. It generates descriptors having positive values, which justifies the use of the inverted Dirichlet distribution. In our experiment each image was represented by an 81-dimensional feature vector. We considered a 10-fold cross validation (Kohavi, 1995) for the evaluation of our algorithm. The training data is used to learn the mixture model at the bottom level and to build the hierarchy according to the model that we shall discuss in Section 5.2.1. The testing data is used to validate the statistical model and to analyze its capability to recognize different testing objects and assign the unseen data to their related clusters properly. The results are illustrated in Section 5.2.2.
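For illustration, one way to obtain an 81-dimensional positive HOG vector per image with scikit-image is sketched below; the paper does not give the exact HOG configuration, so the resize, cell and block settings (a single 3 x 3-cell block with 9 orientations, hence 3 x 3 x 9 = 81 values) are assumptions that merely reproduce the stated dimensionality:

```python
from skimage import color, io, transform
from skimage.feature import hog

def hog_81(path):
    """Hypothetical 81-dimensional HOG descriptor of one ETH-80 image."""
    img = color.rgb2gray(io.imread(path))
    img = transform.resize(img, (48, 48))                  # 3 x 3 cells of 16 x 16 pixels
    feat = hog(img, orientations=9, pixels_per_cell=(16, 16),
               cells_per_block=(3, 3), block_norm='L2-Hys')
    return feat + 1e-6                                     # strictly positive, as required by the inverted Dirichlet
```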

Fig. 9. 8 Objects hierarchical model of ETH-80.

Fig. 10. Fruits vs non fruits hierarchical model of ETH-80.


5.2.1. Clusters assignment and semantic grouping of data

Our approach may generate many clusters that need to be assigned to a super class and a label. In order to properly attribute a cluster to a specific object class, we label each cluster according to the images that have been grouped into it. Indeed, we attribute each cluster $k$ to the super class $j$ whose elements are the most present in cluster $k$, such that:

$$\text{label}_{\text{cluster}_k} = \arg\max_{j} \frac{\text{number of elements of super class } j \text{ in cluster}_k}{\text{number of elements in cluster}_k} \qquad (32)$$
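Eq. (32) amounts to a majority vote over the training labels falling in each cluster, as in the short sketch below (illustrative names):

```python
import numpy as np

def label_clusters(cluster_assign, train_labels, T):
    """Assign each bottom-level cluster to the super class most represented in it (Eq. 32).

    cluster_assign : (N,) cluster index of each training vector.
    train_labels   : (N,) user-provided super class label of each training vector.
    """
    labels = {}
    for k in range(T):
        members = train_labels[cluster_assign == k]
        if members.size:
            classes, counts = np.unique(members, return_counts=True)
            labels[k] = classes[np.argmax(counts)]          # arg max_j of Eq. (32)
    return labels
```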

The training data labels serve to specify the membership of each element present in a given cluster to the appropriate super class. Hence, the labeled training data can be seen as the user's semantic model of object classes, serving to reduce the distance between the system representation and the user's comprehension; e.g. a user can specify a set of training data labeled as 'Tomato' and 'Apple' (binary model), then perform a clustering of the data into $\sum_{j=1}^{2} K_j$ clusters, say 10 clusters. The framework would label each cluster model as being a representation of 'Tomato' or 'Apple' according to the training labels and using Eq. (32). Indeed, one major aspect of this work is the semantic grouping of clusters, as the proposed approach can fit any hierarchical model using a bottom-up methodology. The proposed method can change the hierarchy model at any time with negligible computational cost, as it does not need to recompute the different density parameters once we have them (see Section 4.1). Indeed, the labeled data used represent a key component to set up the hierarchy, as we can assign a semantic meaning to system-level clusters using Eq. (32). We adopt a bottom-up approach, where we estimate the density parameters and their related weights at the lowest hierarchy level, then we construct the super classes by grouping these densities into sub-mixtures representing the object classes or the semantics that we want to consider. In the following, we discuss three possible hierarchical models for our application using the ETH-80 dataset. It is important to bear in mind that other models may be considered.

• Class to object: As shown in Fig. 9, the super class is considered to be an object class, e.g. 'Tomato', 'Pear', 'Cow', etc. Thus, M = 8 and each super class is composed of $K_j$ clusters that are assigned according to Eq. (32). When we consider the model from this perspective, we consider that each object is formed by a set of probability densities, i.e. a probabilistic mixture. The fact that in ETH-80 every object class is formed by 10 unique objects, each having 41 views, makes this choice practical, as the diversity of visual features within a given object class is considerable. Semantically, we can explain such a choice by the fact that an object can be formed by different low-level features that describe that class; e.g. an 'Apple' may have different colors, different shapes, different types, etc., so the diversity might be important in terms of the system features, but expresses the same meaning to a human being.

• Class to fruits vs non fruits: Fig. 10 considers a model at a higher level in the ontology, where the super classes are the fruits and non fruits. In this case M = 2 and $K_j$ is again deduced from Eq. (32). Indeed, we simply change the grouping of the different clusters, so we keep the same system clustering at the lowest level but change the user's semantic perspective on the data. We just group the clusters according to the given training label vector, which in this case is composed of binary-valued elements telling whether an image is a fruit or not.

• Class to object to fruits vs non fruits: When we merge the model of Fig. 10 with the model of Fig. 9, we can form a model where three levels are considered (see Fig. 11). The first level is the system level, where the clusters are generated using the system's lowest-level features, which do not necessarily have a comprehensible human semantic. The second level is a human semantic level where we construct the objects, and the third level is where we group the objects into a final constructed level, which is fruits vs non fruits. It is noteworthy that we do not need any extra calculation in order to consider this hierarchy. This is a very interesting way to construct the model, as we can construct two human-comprehensible semantic levels, and we can easily move from the system representation of the features to the human representation.

Fig. 11. 3 Levels hierarchical model of ETH-80.

Fig. 12. Classification accuracy of object classes (%) as a function of the total number of clusters used.

Fig. 13. Classification accuracy of fruits vs non fruits classes (%) as a function of the total number of clusters used.

5.2.2. ResultsWe report on the accuracy of object classes categorization in

Fig. 12, and the classification of the two super classes fruits and

non fruits in Fig. 13. Recall that we used a bottom-top methodol-ogy to establish the classification using the model in Fig. 11 men-tioned previously. Both figures show clearly that the best result isobtained when considering 90 clusters. The objects classificationaccuracy starts with a very poor performance that equals 44.27%when using a number of clusters equal to the number of objectclasses, and ends up to be 71.80% when using 90 clusters. However,the accuracy of the fruits vs non fruits already starts with arespectable value that equals 87.87% when using a number of

T. Bdiri et al. / Expert Systems with Applications 41 (2014) 1218–1235 1231

clusters equal to the number of objects classes (8 clusters) and itends up to have a close accuracy value equal to 96.19% when using90 clusters. It is needless to say that using two clusters only tomodel fruits vs non fruits gives very bad results (i.e. the accuracygoes under 40%), so using 8 clusters at the beginning of the exper-iment is already considered as a practice to our proposed ap-proach.To better interpret those results and to explain why thedifference of accuracy between the different models is huge for ob-jects but not as much important when dealing with the fruits vsnon fruits model we report on the detailed classification resultsusing different numbers of total used clusters reported in Tables7–12. For the easiness of lecture, we wiped all the misclassifiedpercentage values that are under 10%, we considered two decimalpoints for the percentages, and colored the good classified cells onthe confusion matrices diagonals with a blue color. When we lookat the object classification accuracy we notice that using a numberof clusters equal to the number of object classes gives poor resultsas we have 44.27% accuracy. The corresponding confusion matrixin Table 7 clearly shows that 8 clusters are unable to adequatelymodel the data as the results are confusing and the model cannotdifferentiate between different objects whose features are cer-tainly overlapping in the features space when considering thatfew number of clusters. A small increase in the number of the usedclusters to 15 gives better results achieving 55.39%. The corre-sponding confusing matrix in Table 8 informs us that some objectsstarts to be distinguished from the other objects and form theirown clusters like the ‘Cow’ class.We notice that the objects thatshare some common visual characteristics are grouped together.For instance, we notice that most of the ‘Cow’ class images goeswhether to the ‘Dog’ class or the ‘Horse’ class. The same behaviorhappens for the ‘Tomato’ and ‘Apple’ classes. This explains whythe accuracy of the fruits vs non fruits model is high at the begin-ning when we use 8 clusters. Indeed, if a ‘Tomato’ object is consid-ered as an ‘Apple’, it remains within the fruits true positives and isnot considered as a wrong classification for the super class fruits. Ahigh increase in the number of clusters at the end does not havemuch effect on the classification results as the used features arenot discriminative anymore. Hence, the difference in accuracywhen using 20 clusters and when considering 60 clusters is almost

Table 7
8 clusters confusion matrix.

         Apple    Tomato   Pear     Cow      Dog      Horse    Car      Cup
Apple    49.51%   20.24%   –        –        –        –        –        13.90%
Tomato   42.19%   33.41%   –        –        11.21%   –        –        –
Pear     16.34%   11.95%   30.97%   –        12.68%   27.80%   –        –
Cow      –        –        –        0%       50%      37.56%   11.46%   –
Dog      –        –        –        –        61.70%   31.46%   –        –
Horse    –        –        –        –        40.48%   49.75%   –        –
Car      –        –        –        –        10.73%   –        78.04%   –
Cup      10.97%   –        –        –        –        –        30.48%   50.73%

Table 8
15 clusters confusion matrix.

         Apple    Tomato   Pear     Cow      Dog      Horse    Car      Cup
Apple    57.07%   35.36%   –        –        –        –        –        –
Tomato   10.73%   81.95%   –        –        –        –        –        –
Pear     –        10%      64.87%   –        11.46%   –        –        –
Cow      –        –        –        8.29%    36.58%   30.48%   19.02%   –
Dog      –        –        –        –        57.07%   21.70%   –        –
Horse    –        –        –        10.24%   37.07%   36.09%   –        –
Car      –        –        –        –        –        –        85.85%   –
Cup      13.17%   –        –        –        –        –        26.09%   51.95%

Hence, the difference in accuracy between using 20 clusters and 60 clusters is almost 8%, and the difference between using 80 and 90 clusters is only about 1.5%. This is expected, as the algorithm starts to deal with hard cases where the dissimilarities between objects are no longer detected, and it fails to construct more distinguishable clusters. Nevertheless, the confusion matrix in Table 12, where we considered 90 clusters, clearly shows that the algorithm distinguishes well between animals, objects and fruits, and considerably outperforms the classical model.
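To make the bottom-top evaluation concrete, the short Python sketch below (an illustration only; the class-to-super-class mapping and the assumption that the confusion matrices above are available as row-wise percentage arrays are ours) shows how object-level predictions are mapped to super classes before scoring, which is why a 'Tomato' predicted as an 'Apple' still counts as a correct fruits decision:

import numpy as np

# Object classes in the order used by the confusion matrices above.
CLASSES = ["Apple", "Tomato", "Pear", "Cow", "Dog", "Horse", "Car", "Cup"]
# Hypothetical mapping to the two super classes of the experiment.
SUPER = {"Apple": "fruits", "Tomato": "fruits", "Pear": "fruits",
         "Cow": "non fruits", "Dog": "non fruits", "Horse": "non fruits",
         "Car": "non fruits", "Cup": "non fruits"}

def super_class_accuracy(confusion, classes=CLASSES, mapping=SUPER):
    """Super-class accuracy from an object-level confusion matrix whose
    rows are the true classes (entries are percentages of each class)."""
    total, correct = 0.0, 0.0
    for t, true_cls in enumerate(classes):
        for p, pred_cls in enumerate(classes):
            mass = confusion[t, p]
            total += mass
            # A 'Tomato' predicted as 'Apple' still maps to 'fruits',
            # so it is not penalized at the super-class level.
            if mapping[true_cls] == mapping[pred_cls]:
                correct += mass
    return correct / total

# Usage: super_class_accuracy(np.array(table_7_rows)) with the rows of Table 7.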

5.3. Discussion

It is clear that the results obtained when using multiple clusters to represent an object class outperform those obtained when modeling each object class by a single cluster. The more we increased the number of clusters, the better the results we obtained, as overlapping feature vectors start to be assigned to different clusters and visually similar objects become more distinguishable. We notice that, at the beginning, increasing the number of modeled clusters improves the classification accuracy tremendously; it then starts to stabilize when the maximum possible number of modeled clusters is reached, as the algorithm has more and more difficulty distinguishing the features that differentiate certain images, and it ends up reaching the maximum number of clusters that can represent the data. One possible way to improve these results is to perform feature selection, so that the visual similarities between different objects are discarded and the discriminating features at the lowest level of the hierarchy are emphasized. Still, our proposed approach outperforms the classic one. Another interesting point is that the hierarchical model can be changed easily once we have the clusters representing the lowest level of the hierarchy. This is useful when we have a dynamic hierarchy driven by user needs. Consider, for example, search engines or recommendation systems that have to represent a given hierarchy according to the needs of a user who might change his/her mind over time. Also, a single computation of the bottom representation of the data can serve many users according to their requests in a hierarchical fashion. The user can build any model on the fly, computed at negligible cost in terms of resources and time.


Table 9
20 clusters confusion matrix.

         Apple    Tomato   Pear     Cow      Dog      Horse    Car      Cup
Apple    67.80%   25.36%   –        –        –        –        –        –
Tomato   10.73%   86.34%   –        –        –        –        –        –
Pear     –        –        70%      –        10.97%   –        –        –
Cow      –        –        –        23.65%   34.14%   31.21%   –        –
Dog      –        –        –        16.34%   51.70%   25.60%   –        –
Horse    –        –        –        13.90%   35.12%   38.29%   –        –
Car      –        –        –        –        –        –        78.78%   –
Cup      –        –        –        –        –        –        16.82%   66.82%

Table 10
60 clusters confusion matrix.

         Apple    Tomato   Pear     Cow      Dog      Horse    Car      Cup
Apple    77.56%   19.02%   –        –        –        –        –        –
Tomato   11.70%   83.17%   –        –        –        –        –        –
Pear     –        –        86.34%   –        –        –        –        –
Cow      –        –        –        44.39%   26.09%   19.02%   –        –
Dog      –        –        –        24.63%   47.80%   18.04%   –        –
Horse    –        –        –        27.31%   28.29%   35.60%   –        –
Car      –        –        –        –        –        –        87.56%   –
Cup      –        –        –        –        –        –        –        85.36%

Table 11
80 clusters confusion matrix.

         Apple    Tomato   Pear     Cow      Dog      Horse    Car      Cup
Apple    74.63%   20.73%   –        –        –        –        –        –
Tomato   –        87.31%   –        –        –        –        –        –
Pear     –        –        84.87%   –        –        –        –        –
Cow      –        –        –        41.95%   25.85%   24.87%   –        –
Dog      –        –        –        22.68%   49.51%   21.95%   –        –
Horse    –        –        –        27.31%   25.60%   41.46%   –        –
Car      –        –        –        –        –        –        92.68%   –
Cup      –        –        –        –        –        –        –        91.46%

Table 12
90 clusters confusion matrix.

         Apple    Tomato   Pear     Cow      Dog      Horse    Car      Cup
Apple    78.53%   18.53%   –        –        –        –        –        –
Tomato   –        85.36%   –        –        –        –        –        –
Pear     –        –        84.87%   –        –        –        –        –
Cow      –        –        –        43.90%   24.39%   27.56%   –        –
Dog      –        –        –        24.63%   48.04%   22.92%   –        –
Horse    –        –        –        25.60%   24.14%   46.58%   –        –
Car      –        –        –        –        –        –        93.41%   –
Cup      –        –        –        –        –        –        –        93.65%


Note that for any hierarchy we only estimate the parameters of $\sum_{j=1}^{M} K_j$ densities; it is only the weights of the higher levels that change according to the ontology or hierarchical model. This is very practical, as this approach helps to reduce the distance between the system-level features and the human semantic and understanding. The system-level feature similarities are not necessarily the same as what a user might have in mind or comprehend. Thus, the first generation of the model at its bottom level can be done offline, and the representation of the data is then adapted to the users' needs at negligible computational cost. Another interesting point to discuss is the role of the features in the hierarchy. Indeed, the adoption of a bottom-top strategy to form the hierarchy enables the system to have very discriminative features at the lowest level of the hierarchy that become less and less selective as we move to the upper levels. For instance, consider a feature which is discriminative for 'Tomato' and 'Pear'. When we move to the super class fruits, that feature is no longer discriminative, as the super class fruits contains both 'Tomato' and 'Pear'. On the other hand, when we move from an upper level to the bottom of the hierarchy, a feature might be discriminative between fruits and non fruits but lose its importance when we want to distinguish between 'Tomato' and 'Apple' once we zoom in on fruits. Thus, we can build a hierarchical feature importance/contribution model for the classification, which defines, for each level, the features that are the most discriminative. This can improve the classification results when we are interested in one specific level or a specific super class.
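As a minimal illustration of this flexibility (a sketch under the assumption that the bottom-level posteriors of Eq. (43) have already been computed and stored; the function name and ontology format below are hypothetical), switching to a new hierarchy amounts to regrouping the same component posteriors, with no re-estimation of the $\sum_{j=1}^{M} K_j$ densities:

import numpy as np

def superclass_posteriors(component_post, ontology):
    """Regroup fixed bottom-level posteriors under a (possibly new) ontology.

    component_post : (N, total_K) array of posteriors over all sum_j K_j
                     bottom-level components, one row per observation.
    ontology       : dict mapping a super-class name to the list of
                     component indices it groups (hypothetical format).
    The component densities and their parameters are untouched; only the
    grouping, i.e. the higher-level weights, changes, so switching
    ontologies is essentially free.
    """
    return {name: component_post[:, idx].sum(axis=1)
            for name, idx in ontology.items()}

# Example: the same 90 components served under two different hierarchies.
# post = ...  # (N, 90) posteriors computed once, offline
# fruits_view = superclass_posteriors(post, {"fruits": list(range(30)),
#                                            "non fruits": list(range(30, 90))})

In other words, the expensive offline step is shared, and each user-specific hierarchy only costs a few sums over already-computed posteriors.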

6. Conclusion

The aim of this paper was to provide a methodology to model data in a flexible hierarchical way that can be altered on the fly according to a given ontology. The proposed model reduces the gap between the system representation of the data in terms of clustering and the human comprehension level that groups these clusters to form semantically understandable object representations. The work addresses the parameter estimation of the different inverted Dirichlet distributions within a hierarchical model using a bottom-top strategy. The proposed methodology has been shown to outperform the classical approach where each class is represented by a single mixture component in the training phase. We also proposed a voting strategy to assign semantic meanings to the system-level clusters in order to represent a given object. Model-based clustering is an active area of research, and a possible future work to extend this framework is to include a feature selection strategy in order to distinguish the objects that the algorithm fails to recognize. Another possible future work is the introduction of infinite mixture models in order to make the hierarchical model more expandable when new data are introduced.

Acknowledgment

The completion of this research was made possible thanks to the Natural Sciences and Engineering Research Council of Canada (NSERC).

Appendix A. The marginalization

Let the notation $\vec{Z}_n = z_{nj}$ denote that $\vec{Z}_n(j) = z_{nj} = 1$ and that all the other elements of $\vec{Z}_n$ are equal to zero. Let also the notation $\{\vec{Y}_{nj}\} = y_{nji}$ denote that all the elements of $\{\vec{Y}_{nj}\}$ are equal to zero, except the element $y_{nji} = 1$.

A.1. First level

The first level of the mixture can be deduced by marginalizing Eq. (3) over the hidden variables:
$$p(\vec{X}_n|\Theta,\vec{\pi}) = \sum_{j=1}^{M} p(\vec{X}_n, \vec{Z}_n = z_{nj}\,|\,\Theta,\vec{\pi}) = \sum_{j=1}^{M} p(\vec{X}_n|\vec{Z}_n = z_{nj},\Theta,\vec{\pi})\, p(\vec{Z}_n = z_{nj}|\vec{\pi}) \quad (33)$$
From Eq. (2) we have:
$$p(\vec{Z}_n = z_{nj}|\vec{\pi}) = p(z_{nj}=1|\vec{\pi}) = \pi_1^{0} \cdots \pi_{j-1}^{0}\,\pi_j^{1}\,\pi_{j+1}^{0} \cdots \pi_M^{0} = \pi_j \quad (34)$$
and from Eq. (1) we have:
$$p(\vec{X}_n|\vec{Z}_n = z_{nj},\Theta,\vec{\pi}) = p(\vec{X}_n|\vec{Z}_n = z_{nj},\Theta) = p(\vec{X}_n|z_{nj}=1,\Theta) = p(\vec{X}_n|\theta_1)^{0} \cdots p(\vec{X}_n|\theta_{j-1})^{0}\, p(\vec{X}_n|\theta_j)^{1}\, p(\vec{X}_n|\theta_{j+1})^{0} \cdots p(\vec{X}_n|\theta_M)^{0} = p(\vec{X}_n|\theta_j) \quad (35)$$
By substituting Eqs. (34) and (35) into Eq. (33), the first level of the mixture for a given vector $\vec{X}_n$ can be obtained:
$$p(\vec{X}_n|\Theta,\vec{\pi}) = \sum_{j=1}^{M} p(\vec{X}_n|\theta_j)\,\pi_j \quad (36)$$

A.2. Second level

We proceed by marginalizing over the hidden variables, so the second level of our mixture can be deduced by the marginalization of Eq. (8):
$$p(\vec{X}_n|u,\{w_{ji}\},\vec{\pi}) = \sum_{j=1}^{M}\sum_{i=1}^{K_j} p(\vec{X}_n,\vec{Z}_n = z_{nj},\{\vec{Y}_{nj}\} = y_{nji}\,|\,u,\{w_{ji}\},\vec{\pi}) = \sum_{j=1}^{M}\sum_{i=1}^{K_j} p(\vec{X}_n|\vec{Z}_n = z_{nj},\{\vec{Y}_{nj}\} = y_{nji},u)\, p(\{\vec{Y}_{nj}\} = y_{nji}|\vec{Z}_n = z_{nj},\{w_{ji}\},\vec{\pi})\, p(\vec{Z}_n = z_{nj}|\vec{\pi}) \quad (37)$$
Eq. (7) gives us:
$$P(\{\vec{Y}_{nj}\} = y_{nji}|\vec{Z}_n = z_{nj},\{w_{ji}\},\vec{\pi}) = P(y_{nji}=1|z_{nj}=1,\{w_{ji}\}) = w_{j1}^{0} \cdots w_{j(i-1)}^{0}\, w_{ji}^{1}\, w_{j(i+1)}^{0} \cdots w_{jK_j}^{0} = w_{ji} \quad (38)$$
and we can obtain the following from Eq. (5):
$$p(\vec{X}_n|\vec{Z}_n = z_{nj},\{\vec{Y}_{nj}\} = y_{nji},u) = p(\vec{X}_n|z_{nj}=1,y_{nji}=1,u) = \left(p(\vec{X}_n|\vec{u}_{11})^{0} \cdots p(\vec{X}_n|\vec{u}_{1K_1})^{0}\right) \cdots \left(p(\vec{X}_n|\vec{u}_{(j-1)1})^{0} \cdots p(\vec{X}_n|\vec{u}_{(j-1)K_{j-1}})^{0}\right)\left(p(\vec{X}_n|\vec{u}_{j1})^{0} \cdots p(\vec{X}_n|\vec{u}_{j(i-1)})^{0}\, p(\vec{X}_n|\vec{u}_{ji})^{1}\, p(\vec{X}_n|\vec{u}_{j(i+1)})^{0} \cdots p(\vec{X}_n|\vec{u}_{jK_j})^{0}\right)\left(p(\vec{X}_n|\vec{u}_{(j+1)1})^{0} \cdots p(\vec{X}_n|\vec{u}_{(j+1)K_{j+1}})^{0}\right) \cdots \left(p(\vec{X}_n|\vec{u}_{M1})^{0} \cdots p(\vec{X}_n|\vec{u}_{MK_M})^{0}\right) = p(\vec{X}_n|\vec{u}_{ji}) \quad (39)$$
By substituting Eqs. (34), (38) and (39) into Eq. (37), the second level mixture can be written as:
$$p(\vec{X}_n|u,\{w_{ji}\},\vec{\pi}) = \sum_{j=1}^{M}\sum_{i=1}^{K_j} p(\vec{X}_n|\vec{u}_{ji})\, w_{ji}\,\pi_j = \sum_{j=1}^{M} \pi_j\left(\sum_{i=1}^{K_j} p(\vec{X}_n|\vec{u}_{ji})\, w_{ji}\right) \quad (40)$$
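For illustration only, the following small Python sketch evaluates Eq. (40) numerically; component_logpdf is an assumed callable returning $\log p(\vec{X}_n|\vec{u}_{ji})$ for each observation (e.g. an inverted Dirichlet log-density), and is not part of the derivation above:

import numpy as np

def log_mixture_density(X, pi, w, component_logpdf, params):
    """Log of Eq. (40): sum_j pi_j * sum_i w_ji * p(X | u_ji).

    pi     : (M,) first-level weights.
    w      : list of length M; w[j] is the (K_j,) vector of weights w_ji.
    params : list of lists; params[j][i] holds the parameters u_ji of
             component (j, i), passed to the assumed component_logpdf.
    Returns one log-density value per row of X.
    """
    logs = []
    for j in range(len(pi)):
        for i in range(len(w[j])):
            # log(pi_j) + log(w_ji) + log p(X | u_ji) for component (j, i)
            logs.append(np.log(pi[j]) + np.log(w[j][i])
                        + component_logpdf(X, params[j][i]))
    logs = np.stack(logs, axis=0)          # shape: (total components, N)
    m = logs.max(axis=0)
    # Stable log-sum-exp over all (j, i), i.e. the double sum of Eq. (40).
    return m + np.log(np.exp(logs - m).sum(axis=0))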

Appendix B. $w_{ji}$ and $\pi_j$ estimation

B.1. Estimation of $w_{ji}$

We have:
$$\frac{\partial Q(\mathcal{X},u,\{w_{ji}\},\vec{\pi},\Lambda)}{\partial w_{ji}} = \frac{\partial}{\partial w_{ji}}\left[\log p(\mathcal{X}|u,\{w_{ji}\},\vec{\pi}) + \lambda_1\left(1-\sum_{j=1}^{M}\pi_j\right) + \lambda_2\left(1-\sum_{j=1}^{M}\sum_{i=1}^{K_j} w_{ji}\right)\right] = \frac{\partial \log p(\mathcal{X}|u,\{w_{ji}\},\vec{\pi})}{\partial w_{ji}} - \lambda_2 \quad (41)$$
Using Eq. (9):
$$\frac{\partial \log p(\mathcal{X}|u,\{w_{ji}\},\vec{\pi})}{\partial w_{ji}} = \sum_{n=1}^{N}\frac{\partial}{\partial w_{ji}}\log\left(\sum_{j=1}^{M}\pi_j\left(\sum_{i=1}^{K_j} p(\vec{X}_n|\vec{u}_{ji})\,w_{ji}\right)\right) = \sum_{n=1}^{N}\frac{\pi_j\, p(\vec{X}_n|\vec{u}_{ji})}{\sum_{j=1}^{M}\pi_j\sum_{i=1}^{K_j}p(\vec{X}_n|\vec{u}_{ji})\,w_{ji}}$$
Multiplying the numerator and the denominator by $w_{ji}$ gives:
$$= \sum_{n=1}^{N}\frac{w_{ji}\,\pi_j\, p(\vec{X}_n|\vec{u}_{ji})}{w_{ji}\sum_{j=1}^{M}\sum_{i=1}^{K_j} w_{ji}\,\pi_j\, p(\vec{X}_n|\vec{u}_{ji})} \quad (42)$$
The posterior probability $p(j,i|\vec{X}_n,u,\{w_{ji}\},\vec{\pi})$ is given by:
$$p(j,i|\vec{X}_n,u,\{w_{ji}\},\vec{\pi}) = \frac{w_{ji}\,\pi_j\, p(\vec{X}_n|\vec{u}_{ji})}{\sum_{j=1}^{M}\sum_{i=1}^{K_j} w_{ji}\,\pi_j\, p(\vec{X}_n|\vec{u}_{ji})} \quad (43)$$


Then, Eq. (42) becomes:
$$\frac{\partial \log p(\mathcal{X}|u,\{w_{ji}\},\vec{\pi})}{\partial w_{ji}} = \frac{\sum_{n=1}^{N} p(j,i|\vec{X}_n,u,\{w_{ji}\},\vec{\pi})}{w_{ji}} \quad (44)$$
So we have:
$$\frac{\partial Q(\mathcal{X},u,\{w_{ji}\},\vec{\pi},\Lambda)}{\partial w_{ji}} = \frac{\sum_{n=1}^{N} p(j,i|\vec{X}_n,u,\{w_{ji}\},\vec{\pi})}{w_{ji}} - \lambda_2 = 0 \quad (45)$$
Now we differentiate with respect to $\lambda_2$ and find:
$$\frac{\partial Q(\mathcal{X},u,\{w_{ji}\},\vec{\pi},\Lambda)}{\partial \lambda_2} = 1 - \sum_{j=1}^{M}\sum_{i=1}^{K_j} w_{ji} = 0 \quad (46)$$
From Eq. (45) we have:
$$w_{ji} = \frac{1}{\lambda_2}\sum_{n=1}^{N} p(j,i|\vec{X}_n,u,\{w_{ji}\},\vec{\pi}) \quad (47)$$
$$\sum_{j=1}^{M}\sum_{i=1}^{K_j} w_{ji} = \frac{1}{\lambda_2}\sum_{j=1}^{M}\sum_{i=1}^{K_j}\sum_{n=1}^{N} p(j,i|\vec{X}_n,u,\{w_{ji}\},\vec{\pi}) = 1 \quad (48)$$
Since
$$\sum_{j=1}^{M}\sum_{i=1}^{K_j} p(j,i|\vec{X}_n,u,\{w_{ji}\},\vec{\pi}) = 1 \quad (49)$$
it follows that
$$\sum_{j=1}^{M}\sum_{i=1}^{K_j}\sum_{n=1}^{N} p(j,i|\vec{X}_n,u,\{w_{ji}\},\vec{\pi}) = N \quad (50)$$
So,
$$\lambda_2 = N \quad (51)$$
$$w_{ji} = \frac{\sum_{n=1}^{N} p(j,i|\vec{X}_n,u,\{w_{ji}\},\vec{\pi})}{N} \quad (52)$$

B.2. Estimation of $\pi_j$

We have:
$$\frac{\partial Q(\mathcal{X},u,\{w_{ji}\},\vec{\pi},\Lambda)}{\partial \pi_j} = \frac{\partial}{\partial \pi_j}\left[\log p(\mathcal{X}|u,\{w_{ji}\},\vec{\pi}) + \lambda_1\left(1-\sum_{j=1}^{M}\pi_j\right) + \lambda_2\left(1-\sum_{j=1}^{M}\sum_{i=1}^{K_j} w_{ji}\right)\right] = \frac{\partial \log p(\mathcal{X}|u,\{w_{ji}\},\vec{\pi})}{\partial \pi_j} - \lambda_1$$
Using Eq. (9):
$$\frac{\partial \log p(\mathcal{X}|u,\{w_{ji}\},\vec{\pi})}{\partial \pi_j} = \sum_{n=1}^{N}\frac{\partial}{\partial \pi_j}\log\left(\sum_{j=1}^{M}\pi_j\left(\sum_{i=1}^{K_j} p(\vec{X}_n|\vec{u}_{ji})\,w_{ji}\right)\right) = \sum_{n=1}^{N}\frac{\sum_{i=1}^{K_j} p(\vec{X}_n|\vec{u}_{ji})\,w_{ji}}{\sum_{j=1}^{M}\pi_j\sum_{i=1}^{K_j} p(\vec{X}_n|\vec{u}_{ji})\,w_{ji}}$$
Multiplying the numerator and the denominator by $\pi_j$ gives:
$$= \sum_{n=1}^{N}\frac{\pi_j\sum_{i=1}^{K_j} p(\vec{X}_n|\vec{u}_{ji})\,w_{ji}}{\pi_j\sum_{j=1}^{M}\pi_j\sum_{i=1}^{K_j} p(\vec{X}_n|\vec{u}_{ji})\,w_{ji}} \quad (53)$$
The posterior probability $p(j|\vec{X}_n,u,\{w_{ji}\},\vec{\pi})$ is given by:
$$p(j|\vec{X}_n,u,\{w_{ji}\},\vec{\pi}) = \frac{\pi_j\sum_{i=1}^{K_j} p(\vec{X}_n|\vec{u}_{ji})\,w_{ji}}{\sum_{j=1}^{M}\pi_j\sum_{i=1}^{K_j} p(\vec{X}_n|\vec{u}_{ji})\,w_{ji}} \quad (54)$$
Then Eq. (53) becomes:
$$\frac{\partial \log p(\mathcal{X}|u,\{w_{ji}\},\vec{\pi})}{\partial \pi_j} = \frac{\sum_{n=1}^{N} p(j|\vec{X}_n,u,\{w_{ji}\},\vec{\pi})}{\pi_j} \quad (55)$$
So we have:
$$\frac{\partial Q(\mathcal{X},u,\{w_{ji}\},\vec{\pi},\Lambda)}{\partial \pi_j} = \frac{\sum_{n=1}^{N} p(j|\vec{X}_n,u,\{w_{ji}\},\vec{\pi})}{\pi_j} - \lambda_1 = 0 \quad (56)$$
Thus,
$$\pi_j = \frac{1}{\lambda_1}\sum_{n=1}^{N} p(j|\vec{X}_n,u,\{w_{ji}\},\vec{\pi}) \quad (57)$$
We have
$$\sum_{j=1}^{M}\pi_j = \frac{1}{\lambda_1}\sum_{j=1}^{M}\sum_{n=1}^{N} p(j|\vec{X}_n,u,\{w_{ji}\},\vec{\pi}) = 1 \quad (58)$$
Since
$$\sum_{j=1}^{M} p(j|\vec{X}_n,u,\{w_{ji}\},\vec{\pi}) = 1 \quad (59)$$
then
$$\sum_{j=1}^{M}\sum_{n=1}^{N} p(j|\vec{X}_n,u,\{w_{ji}\},\vec{\pi}) = N \quad (60)$$
so we have:
$$\lambda_1 = N \quad (61)$$
and finally,
$$\pi_j = \frac{\sum_{n=1}^{N} p(j|\vec{X}_n,u,\{w_{ji}\},\vec{\pi})}{N} \quad (62)$$
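For completeness, a minimal Python sketch (our illustration, assuming the per-component log-densities $\log p(\vec{X}_n|\vec{u}_{ji})$ are precomputed; exponentiation is used for brevity, a log-domain version would be numerically safer) of how the posteriors of Eqs. (43) and (54) produce the weight updates of Eqs. (52) and (62):

import numpy as np

def update_mixing_weights(log_dens, pi, w):
    """One update of w_ji and pi_j following Eqs. (43)/(52) and (54)/(62).

    log_dens : list of length M; log_dens[j] is an (N, K_j) array holding
               log p(X_n | u_ji) for every observation and component.
    pi, w    : current weights, with the same shapes as in Eq. (40).
    """
    N = log_dens[0].shape[0]
    # Joint responsibilities proportional to w_ji * pi_j * p(X_n | u_ji), Eq. (43)
    joint = [np.exp(ld) * w[j] * pi[j] for j, ld in enumerate(log_dens)]
    norm = np.sum([r.sum(axis=1) for r in joint], axis=0)   # denominator of Eq. (43)
    resp = [r / norm[:, None] for r in joint]
    # Eq. (52): w_ji = sum_n p(j, i | X_n) / N
    new_w = [r.sum(axis=0) / N for r in resp]
    # Eq. (62): pi_j = sum_n p(j | X_n) / N, with p(j | X_n) = sum_i p(j, i | X_n)
    new_pi = np.array([r.sum() for r in resp]) / N
    return new_pi, new_w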

References

Allili, M., Ziou, D., Bouguila, N., & Boutemedjet, S. (2010). Image and video segmentation by combining unsupervised generalized Gaussian mixture modeling and feature selection. IEEE Transactions on Circuits and Systems for Video Technology, 20, 1373–1377.

Bakhtiari, A. S., & Bouguila, N. (2010). A hierarchical statistical model for object classification. In Proceedings of the IEEE international workshop on multimedia signal processing (MMSP) (pp. 493–498).

Bakhtiari, A. S., & Bouguila, N. (2011). An expandable hierarchical statistical framework for count data modeling and its application to object classification. In Proceedings of the 23rd IEEE international conference on tools with artificial intelligence (ICTAI) (pp. 817–824).

Bakhtiari, A. S., & Bouguila, N. (2012). A novel hierarchical statistical model for count data modeling and its application in image classification. In Proceedings of the 19th international conference on neural information processing (ICONIP) (pp. 332–340).

Bdiri, T., & Bouguila, N. (2011). Learning inverted Dirichlet mixtures for positive data clustering. In Proceedings of the 13th international conference on rough sets, fuzzy sets, data mining and granular computing (RSFDGrC) (pp. 265–272).

Bdiri, T., & Bouguila, N. (2012). Positive vectors clustering using inverted Dirichlet finite mixture models. Expert Systems with Applications, 39, 1869–1882.

Binder, D. A. (1978). Bayesian cluster analysis. Biometrika, 65, 31–38.

Bishop, C. M., & Tipping, M. E. (1998). A hierarchical latent variable model for data visualization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 281–293.

Bouguila, N., Almakadmeh, K., & Boutemedjet, S. (2012). A finite mixture model for simultaneous high-dimensional clustering, localized feature selection and outlier rejection. Expert Systems with Applications, 39, 6641–6656.

Bouguila, N., & Ziou, D. (2006). A hybrid SEM algorithm for high-dimensional unsupervised learning using a finite generalized Dirichlet mixture. IEEE Transactions on Image Processing, 15, 2657–2668.

Bouguila, N., Ziou, D., & Hammoud, R. I. (2009). On Bayesian analysis of a finite generalized Dirichlet mixture via a Metropolis-within-Gibbs sampling. Pattern Analysis and Applications, 12, 151–166.

Bouguila, N., Ziou, D., & Vaillancourt, J. (2004). Unsupervised learning of a finite mixture model based on the Dirichlet distribution and its application. IEEE Transactions on Image Processing, 13, 1533–1543.

Boutemedjet, S., Bouguila, N., & Ziou, D. (2009). A hybrid feature extraction selection approach for high-dimensional non-Gaussian data clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31, 1429–1443.


Couronne, T., Stoica, A., & Beuscart, J. S. (2010). Online social network popularity evolution: An additive mixture model. In Proceedings of the international conference on advances in social networks analysis and mining (ASONAM) (pp. 346–350).

Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 886–893). IEEE Computer Society.

Day, N. E. (1969). Estimating the components of a mixture of normal distributions. Biometrika, 56, 463–474.

Edwards, A., & Cavalli-Sforza, L. (1965). A method of cluster analysis. Biometrics, 21, 362–375.

Fisher, D. H. (1996). Iterative optimization and simplification of hierarchical clusterings. Journal of Artificial Intelligence Research, 4, 147–178.

Fraley, C., & Raftery, A. E. (2002). Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97, 611–631.

Graybill, F. A. (1983). Matrices with applications in statistics. Wadsworth statistics/probability series. Wadsworth International Group.

Handcock, M. S., Raftery, A. E., & Tantrum, J. M. (2007). Model-based clustering for social networks. Journal of the Royal Statistical Society: Series A (Statistics in Society), 170, 301–354.

Hennig, C. (2004). Breakdown points for maximum likelihood estimators of location-scale mixtures. Annals of Statistics, 32, 1313–1340.

He, X., Simpson, D., & Portnoy, S. (1990). Breakdown robustness of tests. Journal of the American Statistical Association, 85, 446–452.

Jain, A. K., Murty, M., & Flynn, P. (1999). Data clustering: A review. ACM Computing Surveys, 31, 264–323.

Ji, Y., Wu, C., Liu, P., Wang, J., & Coombes, K. R. (2005). Applications of beta-mixture models in bioinformatics. Bioinformatics, 21, 2118–2122.

Kim, M., Cho, S. B., & Kim, J. H. (2010). Mixture-model based estimation of gene expression variance from public database improves identification of differentially expressed genes in small sized microarray data. Bioinformatics, 26, 486–492.

Koestler, D. C., Marsit, C. J., Christensen, B. C., Karagas, M. R., Bueno, R., Sugarbaker, D. J., et al. (2010). Semi-supervised recursively partitioned mixture models for identifying cancer subtypes. Bioinformatics, 26, 2578–2585.

Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the 14th international joint conference on artificial intelligence (Vol. 2, pp. 1137–1143). Morgan Kaufmann Publishers Inc.

Meinicke, P., Ahauer, K. P., & Lingner, T. (2011). Mixture models for analysis of the taxonomic composition of metagenomes. Bioinformatics, 27, 1618–1624.

Morinaga, S., & Yamanishi, K. (2004). Tracking dynamics of topic trends using a finite mixture model. In Proceedings of the 10th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 811–816).

Neelon, B., Swamy, G. K., Burgette, L. F., & Miranda, M. L. (2011). A Bayesian growth mixture model to examine maternal hypertension and birth outcomes. Statistics in Medicine, 30, 2721–2735.

Permuter, H., Francos, J., & Jermyn, H. (2003). Gaussian mixture models of texture and colour for image database retrieval. In Proceedings of the IEEE international conference on acoustics, speech, and signal processing (ICASSP) (Vol. 3, pp. 569–572).

Rattanasiri, S., Böhning, D., Rojanavipart, P., & Athipanyakom, S. (2004). A mixture model application in disease mapping of malaria. The Southeast Asian Journal of Tropical Medicine and Public Health, 35, 38–47.

Rissanen, J. (1999). Hypothesis selection and testing by the MDL principle. The Computer Journal, 42, 260–269.

Schlattmann, P. (2009). Medical applications of finite mixture models. Statistics for biology and health. Berlin Heidelberg: Springer-Verlag.

Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6, 461–464.

Sefidpour, A., & Bouguila, N. (2012). Spatial color image segmentation based on finite non-Gaussian mixture models. Expert Systems with Applications, 39, 8993–9001.

Stauffer, C., & Grimson, W. E. L. (2000). Learning patterns of activity using real-time tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 747–757.

Tao, W., Cheng, I., & Basu, A. (2010). Fully automatic brain tumor segmentation using a normalized Gaussian Bayesian classifier and 3D fluid vector flow. In Proceedings of the 17th IEEE international conference on image processing (ICIP) (pp. 2553–2556).

Tiao, G. G., & Cuttman, I. (1965). The inverted Dirichlet distribution with applications. Journal of the American Statistical Association, 60, 793–805.

Wallace, C. S. (2005). Statistical and inductive inference by minimum message length. Springer-Verlag.

Weiling, C., Lei, L., & Yang, M. (2010). A Gaussian mixture model-based clustering algorithm for image segmentation using dependable spatial constraints. In Proceedings of the 3rd international congress on image and signal processing (CISP) (Vol. 3, pp. 1268–1272).

Yang, M., & Ahuja, N. (1999). Gaussian mixture model for human skin color and its applications in image and video databases. In Proceedings of the international society for optics and photonics (SPIE) (pp. 458–466).

Ziou, D., Hamri, T., & Boutemedjet, S. (2009). A hybrid probabilistic framework for content-based image retrieval with feature weighting. Pattern Recognition, 42, 1511–1519.

Zivkovic, Z. (2004). Improved adaptive Gaussian mixture model for background subtraction. In Proceedings of the 17th international conference on pattern recognition (ICPR) (Vol. 2, pp. 28–31).