Object Categorization Using Hierarchical Wavelet Packet Texture Descriptors

8
Object Categorization Using Hierarchical Wavelet Packet Texture Descriptors Xueming Qian, Guizhong Liu, Danping Guo, Zhi Li, Zhe Wang, Huan Wang Suchool of Electronic and Information Engineering Xian Jiaotong University {qianxm,liugz}@mail.xjtu.edu.cn Abstract Object categorization plays an important role in computer vision, semantic based image content understanding, and image retrieval. Wavelet packet transform provides a very good observation for the images by sub-band filtering. Different objects have distinctive characteristics in the sub-bands of wavelet packets, which should be discriminative for objects classification. In this paper, an object categorization method using hierarchical wavelet packet texture descriptors is proposed. Comparisons between Gabor texture descriptor, pyramid of histograms of orientation gradients (PHOG) and the proposed hierarchical wavelet packet texture descriptors on the widely used OT, Scene-13 and Sport event datasets are also given. Experimental results show that object categorization performances of the proposed texture descriptors are better than that of Gabor texture descriptor and as good as that of PHOG shape descriptor. Object categorization performances of the texture descriptors under various decomposition levels and wavelet bases are discussed. Performances of texture descriptors of global and local images with different partition patterns are also analyzed. 1. Introduction Nowadays we face an information explosion age. There are millions of images on the Internet. Image search engines can easily gather huge amount of images by worldwide users. However, finding the appropriate image resource from Internet is a challenging problem. Usually content-based and text- based approaches are utilized in image retrieval [1]. The text based image retrieval (TBIR) is performed by matching the query texts and the auxiliary descriptive texts of the images. The image searching performances are inevitably influenced by the subjectivity, in-completeness and ambiguity of the annotation texts. In the content-based image retrieval (CBIR), images are returned for users according to the similarities of low level visual features. Due to the semantic gaps, the performances of TBIR and CBIR are not satisfactory. To bridge the semantic gaps, multimodal source information are often fused. Object categorization is one of the promising ways to bridge the semantic gaps in image retrieval [1]-[7]. Clustering images into semantic categories can facilitate image retrieval and personalized image browsing. Object categorization can be performed by global features and local features. Although the methods based on the global features are successful in some coarse classifications, they are not robust to luminance, scale and orientation variations. Local features represented by scale invariant feature transform, salient region and local patches are far more robust in object categorization. Bag-of-words (BOW) models such as the probabilistic latent semantic analysis (pLSA) [8] and latent Dirichlet allocation [9] have been widely adopted in object categorization and impressive results are achieved [10]-[18]. The BOW models objects as geometric-free structures, which are represented by the spatial constraints of local patches. Usually, local patches are often described by scale invariant regions and scale invariant feature transform [19]. Thus the BOW models are robust to the illumination, occlusion, and scale variations. Statistical learning based methods are often utilized to improve object categorization performance by discovering the salient structures of objects [20]. Hence, the local appearance, shape and texture information are usually fused by generative and discriminative models [17], [18], [20], [21]. Objects usually have salient local structure, shape and texture distributions which are helpful for their discrimination. The BOW and shape based methods show their effectiveness in object classification [18]. However, texture information usually plays assistant roles [18], [22]. Images with different categories have distinctive texture information. For example, the texture of tree looks so different from that of face. Objects in the same class may have different visual appearances and arbitrary shapes, and objects in

Transcript of Object Categorization Using Hierarchical Wavelet Packet Texture Descriptors

Object Categorization Using Hierarchical Wavelet Packet Texture Descriptors

Xueming Qian, Guizhong Liu, Danping Guo, Zhi Li, Zhe Wang, Huan Wang Suchool of Electronic and Information Engineering Xi’an Jiaotong University

{qianxm,liugz}@mail.xjtu.edu.cn

Abstract

Object categorization plays an important role in

computer vision, semantic based image content understanding, and image retrieval. Wavelet packet transform provides a very good observation for the images by sub-band filtering. Different objects have distinctive characteristics in the sub-bands of wavelet packets, which should be discriminative for objects classification. In this paper, an object categorization method using hierarchical wavelet packet texture descriptors is proposed. Comparisons between Gabor texture descriptor, pyramid of histograms of orientation gradients (PHOG) and the proposed hierarchical wavelet packet texture descriptors on the widely used OT, Scene-13 and Sport event datasets are also given. Experimental results show that object categorization performances of the proposed texture descriptors are better than that of Gabor texture descriptor and as good as that of PHOG shape descriptor. Object categorization performances of the texture descriptors under various decomposition levels and wavelet bases are discussed. Performances of texture descriptors of global and local images with different partition patterns are also analyzed. 1. Introduction

Nowadays we face an information explosion age. There are millions of images on the Internet. Image search engines can easily gather huge amount of images by worldwide users. However, finding the appropriate image resource from Internet is a challenging problem. Usually content-based and text-based approaches are utilized in image retrieval [1]. The text based image retrieval (TBIR) is performed by matching the query texts and the auxiliary descriptive texts of the images. The image searching performances are inevitably influenced by the subjectivity, in-completeness and ambiguity of the annotation texts. In the content-based image retrieval (CBIR), images are returned for users according to the

similarities of low level visual features. Due to the semantic gaps, the performances of TBIR and CBIR are not satisfactory.

To bridge the semantic gaps, multimodal source information are often fused. Object categorization is one of the promising ways to bridge the semantic gaps in image retrieval [1]-[7]. Clustering images into semantic categories can facilitate image retrieval and personalized image browsing. Object categorization can be performed by global features and local features. Although the methods based on the global features are successful in some coarse classifications, they are not robust to luminance, scale and orientation variations. Local features represented by scale invariant feature transform, salient region and local patches are far more robust in object categorization. Bag-of-words (BOW) models such as the probabilistic latent semantic analysis (pLSA) [8] and latent Dirichlet allocation [9] have been widely adopted in object categorization and impressive results are achieved [10]-[18]. The BOW models objects as geometric-free structures, which are represented by the spatial constraints of local patches. Usually, local patches are often described by scale invariant regions and scale invariant feature transform [19]. Thus the BOW models are robust to the illumination, occlusion, and scale variations.

Statistical learning based methods are often utilized to improve object categorization performance by discovering the salient structures of objects [20]. Hence, the local appearance, shape and texture information are usually fused by generative and discriminative models [17], [18], [20], [21].

Objects usually have salient local structure, shape and texture distributions which are helpful for their discrimination. The BOW and shape based methods show their effectiveness in object classification [18]. However, texture information usually plays assistant roles [18], [22]. Images with different categories have distinctive texture information. For example, the texture of tree looks so different from that of face. Objects in the same class may have different visual appearances and arbitrary shapes, and objects in

different classes may share similar appearances and shapes. Sometimes the texture information is most discriminative. In our opinion, texture information should be as effective as the local appearance and shape features.

The main goal of this paper is to show how object classification can benefit from the texture descriptors of hierarchical wavelet packet transform. As we known that Gabor texture descriptor is nominated in MPEG-7 standard, due to its good performance in extracting salient texture information by scale and oriented filtering. The proposed hierarchical wavelet packet texture descriptors based object categorization method is mainly inspired by the following three aspects: 1) shape descriptors based on the pyramid of histograms of orientation gradients (PHOG) [23],[24], 2) excellent characteristics of wavelet packet transform for texture classification [25], 3) good performance of Gabor texture descriptors in texture image classification and retrieval [27], [28].

The rest of this paper is organized as follows. In Section 2 related works on object categorization are briefly overviewed. In Section 3 object categorization method based on global and local texture descriptors is described in detail. The test datasets are given in Section 4. Experimental results and discussions are given in Section 5. And finally conclusions are drawn in Section 6.

2. Related works on object categorization

Bag-of-Words models have been paid much attention by many researches in the newly studies of object categorization due to their simplicities and good performances. There are also some discriminative part-based models [29], [30], which are effective in representing objects with rigorous geometric structures by modeling the relationships between different parts. Usually each image is represented by a set of local patches. And each patch is illustrated by local descriptors which are robust to illumination, scale, orientation and transform variations. In some of the related works, the local patches of an image are assumed to be independent from each other [11], [12], [31]. This assumption simplifies the computations for the ignorance of the spatial co-occurrences and dependences of local patches. However, one of the shortcomings of BOW models is that objects with different appearances may have similar statistics of visual words which tend to be confused during classification.

To tackle this problem, some approaches are carried out by modeling the co-occurrences, dependences or linkages of the salient parts of images [10], [32], [33].

For example, Berg et al. utilize the relationship of local shape correspondences for object categorization [33], while others make full use of the co-occurrence and dependence of the BOW to improve the performance of image classification [34]. Probabilities of co-occurrence of visual words are also taken into consideration in the training of BOW models [12], [35]-[38]. Despite of using the co-occurrences, some of models employ the spatial relationships of the local patches for object categorization [16], [18], [34], [36]-[41]. Hierarchical Dirichlet process (HDP) is a nonparametric Bayesian model that infers latent themes from the training samples under the assumption that a hierarchical structure in different groups should share the same themes [39]. Extensions of the HDP have been proposed by modeling the relative spatial locations of local patches [40] and using dependent hierarchical Dirichlet process (DHDP) [34]. The DHDP is performed by introducing a linkage structure over the latent themes to encode the dependencies of the patches. The linkage structure enforces the semantic connections among the patches by facilitating better clustering of the themes. A visual language modeling method is utilized to incorporate the spatial context of the local appearance features into statistical language model [42]. The visual language models capture both co-occurrence and spatial proximity of local image features [42]-[44].

The spatial dependency between neighboring patches is modeled by a two-dimensional multi-resolution hidden Markov model during image classification [20]. Markov random fields [48] and conditional random fields (CRFs) are adopted to model the dependencies of local patches. Statistical learning models maximize contextual constraints over the object labels and reduce the ambiguities during object categorization [44]-[46]. A generative model is utilized to determine object categories and carry out object segmentation in a unified framework [47]. Zhang et al. utilize SVM classifiers to integrate BOW features for image classification [17]. Pyramid histogram of oriented gradients (PHOG) is good at representing the shapes and spatial layouts of objects [23]. SVM classifiers with spatial pyramid kernels are utilized to improve the object classification performance [23].

3. Object categorization using hierarchical wavelet packet texture descriptors

Wavelet packet analysis has been successfully used for data compression, texture classification [4] and face recognition [25], [26]. Wavelet packet decomposition is performed by filtering the original signal with a set

of sub-band filters. Images of different categories have different properties to the wavelet packet filters. Texture information of the sub-bands should be effective for object categorization. The aim of this paper is to show the effectiveness of the proposed hierarchical wavelet packet texture descriptors for object categorization.

3.1. Brief overview of wavelet packet

Wavelet packets consist of orthonormal and compactly supported wavelets. Wavelet packets represent a generalization of multi-resolution decomposition, and comprise the entire family of sub-band tree decompositions which permit the choice of any decomposition topological structures [4]. In the standard wavelet decomposition, only the leftmost node has two children, as shown in Fig.1 (d). The difference between wavelet packet decomposition and the standard wavelet transform is that the former also recursively decomposes the high-frequency components. Wavelet packet transform constructs a tree-structured and multi-band extension of the wavelet transform. Wavelet packets are described by the collection of functions as follows [4]:

(1)

(2)

where s and t denote the scale and translation index, , . is a scaling function

and is a basic wavelet. hk and gk are the low-pass and high-pass filters. The basis functions are obtained by scale and translation changes. They remain well localized in both spatial and frequency domains. The inverse relationship between wavelet packets of different scales are as follows:

(3)

3.2. 2-D wavelet packet transform

The extension of wavelet packet transform from 1-D signal to 2-D image is straight forward by using separable 2-D wavelet packets. Generally, a low-pass filter (denoted H) and a high-pass filter (denoted G) are used. The convolutions of original image with the low pass filter results in an approximation and the convolutions with the high-pass filter in specific directions result in details [25]. In wavelet decomposition, an image is firstly split into an approximation and three details. And then only the approximation is continuously carried out

decomposition in the next level. In wavelet packet transform, the approximation and the details are further split into a second level of approximation and details respectively. For an L-level decomposition, the 2-D wavelet transform is carried out as follows:

(4)

(5)

(6)

(7) Where * is the convolution operator, denotes sub-sampling along the rows (or columns) and FL-1 represents one of the wavelet packet images at level (L-1) , Hx and Hy (or Gx and Gy) denote the separated low (or high) pass filters of H (or G) in x- and y- directions respectively. F0 is the original image. AA

LF is obtained by low pass filtering with sub-band image FL-1. The details AD

LF , DALF and DD

LF are obtained by band-pass filtering in vertical, horizontal and diagonal sub-bands respectively.

By hierarchical wavelet packet decomposition, the original image is thus represented by a complete set of sub-band images. There are 4L sub-band images at level L of the regular wavelet decomposition tree topology. Fig.1 shows the sub-band images of the regular wavelet packet transform under decomposition level L = 0 (i.e. original image), 1 and 2 respectively. The texture features of the sub-bands of wavelet packet transform under several decomposition levels are all collected and we call them hierarchical wavelet packet texture descriptors. Hereinafter, we call the sub-band of the top-left corner of a decomposition level as approximation and the others as details. There are 3 and 15 details for wavelet packet transform under L=1 and L=2 as shown in Fig.1(b) and (c) respectively.

3.3. Hierarchical Wavelet Packet Texture Descriptors for Object Categorization

In this paper, the mean and standard deviation of each sub-band are used for texture descriptions. Let and denote the mean and standard deviation of a sub-band image under a specified decomposition level l respectively, which are calculated as follows.

(8)

(9)

(a) original image (b) L=1 (c)L=2 Fig.1. Wavelet packet decomposition for an image under different levels.

where is the coefficient of the point (s,t) of the b-th sub-band of the l-th level, S and T are the height and width of the sub-band image .

In the following part of this paper, the texture feature is extracted from all the sub-bands of regular wavelet tree topology. We denote the texture feature of each level l as . Four different hierarchical wavelet packet texture descriptors are used and compared in this paper: 1) texture information extracted from the sub-bands of the last level (denoted WVPK), 2) texture information of the details of WVPK (denoted NoDC), 3) texture information of all the details of the decomposition level

(denoted HIGH), 4) texture information of all the sub-bands of the decomposition level

(denoted HWVP). Let XWVPK(L), XNoDC(L), XHIGH(L) and XHWVP(L) denote the hierarchical wavelet texture descriptors of WVPK, NoDC, HIGH and HWVP at decomposition level L. we have: ,

, , and

. In order to show the effectiveness of the proposed

hierarchical texture descriptors for object categorization, a simple feature centroid based classified approach is utilized, which consists of following two steps: 1) feature centroids of the predefined categories are determined from training samples; 2) the category of a given test image is classified according to the distances of the feature to the centroids of the categories.

Let denote the corresponding texture feature of the i-th image of the k-th category, be the feature centroid of a category obtained by N training samples per category as follow

(10)

where is a d-dimensional vector (d is related to the texture descriptors), K is the number of categories. Each element is normalized as follows

(11)

(12)

(13)

where I is the set of all images. For a given testing image with the feature , we classify it into the k0-th category according to the minimum distance as follow

(14) where is the Euclidean distance of and

, which is calculated as follow (15)

3.4. Local Hierarchical Wavelet Packet Texture Descriptors

The local texture descriptors can capture the salient texture information and can decrease the influence of complex background. Usually, local features are robust for object recognition, in this paper local and global hierarchical wavelet packet texture descriptors are compared in object categorization. Texture descriptors of four portioning patterns are compared. The four patterns are Global, Local4 and Local5 which are shown in Fig.2(a)~(c) respectively. In Local4 each image is equally segmented into 2*2 grids. Local5 consists of Local4 and a centralized sub-image, which is with the same sizes of the four patches in Local4. Fig.3 (a), (b) and (c) show the local hierarchical wavelet packet transform of Local4 at decomposition level: L=0, 1 and 2 respectively. Similar to the global hierarchical texture descriptors, the mean and standard deviation of each sub-band of the local regions are used. Hence, in a specified decomposition level L, the dimensions of Local4 and Local5 are four and five times of the corresponding Global texture descriptors. For example, under wavelet packet decomposition level L=4, the dimensions of the texture descriptor NoDC of Global, Local4 and Local5 are 510, 2040 and 2550 respectively.

4. Datasets for Object Categorization

We evaluate object categorization performance on four widely used datasets:

(a) Global (b) Local4 (c) Local5

Fig.2 Different partitioning patterns.

(a) L=0 (b) L=1 (c) L=2

Fig.3 Local hierarchical wavelet packet transform for the sub-images of Local4 under L=0, 1, and 2.

1) Oliva and Torralba dataset (denoted OT) [6]. OT has 2,688 images with eight categories: 360 coast, 328 forest, 374 mountain, 260 highway, 308 insidecity, 410 open country, 292 street, and 356 tallbuilding. Each image in this dataset are with the same sizes 256*256.

2) Scene-13 dataset [11]. This dataset consists of the 2688 images of the eight categories of the OT dataset and another five categories with 1071 images: 241 suburb, 174 bedroom, 151 kitchen, 289 living room, and 216 office. Totally there are 3759 images in this dataset.

3) Sport event dataset [2]. This dataset contains 1579 images of eight Sport event classes: 200 badminton, 137 bocce, 236 croquet, 182 polo, 194 rock climbing, 250 rowing, 190 sailing, and 190 snowboarding.

In the following parts of this paper, N ( ) images per category are randomly selected from the dataset and used as training set. The test set consists of all the remaining images of each category. In order to make objective comparisons for PHOG, Gabor and the proposed hierarchical wavelet packet texture descriptors, we try to make the testing conditions the same. Hereinafter, the average recognition rate is utilized for object categorization performance evaluation. The average recognition rate is obtained by running the classification process Q (in this paper Q is set to be 30) times and averaging the recognition rates. The main goal of this paper is to show the effectiveness of the proposed hierarchical wavelet packet texture descriptors, a simple distance measure method is used in this paper. The statistic learning methods can improve the object categorization performances compared to the centroid based methods [8], [23]. However, this is not helpful for showing the effectiveness of the features. Thus, no learning approaches such as SVM [5], [23], pLSA [8],[18], LDA[9] etc are adopted and no comparisons are made with the existing methods.

5. Experimental results and discussions

The image classification task is to assign each test image to one of the pre-defined categories. In this section, both the global and local texture descriptors on the OT, Scene-13, and Sport event datasets are evaluated. Comparisons with the Gabor and PHOG on object categorization are given. The mean and standard deviation of each sub-band of Gabor transform are used for texture description [27], [28]. The Gobr texture based method is similar to that of the proposed wavelet packet texture descriptors as described in Eq.(8)~Eq.(15). The PHOG based method is carried out similarly.

5.1. Comparisons with PHOG and Gabor

In TABLE 1, we compare the performances of Gabor texture descriptors with five scales and various orientations (the orientation number is denoted as Ort), PHOG descriptors (under decomposition level L=1, 2 and 3) and hierarchical wavelet packet texture descriptors under L=3 (under portioning patterns: Global, Local4 and Local5) with various training images on the Sport event dataset. We find that the increase of orientation does not improve the object categorization performances of Gabor texture descriptors. However, the proposed hierarchical wavelet packet texture descriptors are more effective.

The recognition rates of Gabor (under Ort=12), PHOG (under L=3) and our HIGH descriptors (under Global, Local4, and Local5 with L=4) on Sport event, Scene13 and OT datasets are shown in Fig.4 - Fig.6 respectively. From Table 1, Fig.4 - Fig.6, we find that the proposed hierarchical wavelet packet texture descriptors are as effective as the PHOG shape descriptors. Compared with Gabor, the proposed hierarchical wavelet packet descriptors are more robust and effective. The average recognition rates of the proposed texture descriptors WVPK, NoDC, HIGH and HWVP outperform that of Gabor very much.

Table 1. Comparisons for Gabor (with five scales and orientation number Ort is set to be 6, 12,18, and 36 ), PHOG (under decomposition level L=1,2,and 3), and hierarchical wavelet packet texture descriptors (WVPK, NoDC, HIGH and HWVP under decomposition level L=3) on Sport event dataset.

d N=1 N=10 N=20 N=30 Ort=6 60 15.6 20.3 20.7 21.6

Ort=12 120 15.7 20.4 21.0 21.8 Ort=18 180 15.7 20.3 20.8 21.6

Garbor

Ort=36 360 15.6 20.2 20.5 21.6 L=1 50 26.9 43.1 45.9 48.8 L=2 210 26.4 46.6 50.1 51.3 PHOG L=3 850 24.4 46.9 50.9 52.8

WVPK 128 26.8 44.0 46.7 48.7 NoDC 126 27.0 44.8 47.2 48.8 HIGH 162 27.0 44.6 47.2 48.8

Global (L=3)

HWVP 170 25.7 41.2 43.4 45.0 WVPK 512 28.0 47.3 50.6 52.1 NoDC 504 26.9 45.2 48.4 49.6 HIGH 648 27.1 45.1 48.5 49.5

Loal4 (L=3)

HWVP 680 28.7 48.3 51.3 52.8 WVPK 640 28.8 50.2 53.6 55.1 NoDC 630 27.6 48.2 51.7 52.8 HIGH 810 27.9 48.4 51.7 52.8

Loal5 (L=3)

HWVP 850 29.4 50.6 53.6 55.3

5.2. Wavelet packet bases and object categorization performances

The data of HWVP provided in above tables and figures are obtained under the wavelet bases DB5. Different wavelet packet bases may influence the object category performances. In this section, discussions on wavelet bases and the performances of object categorization are given. Different Daubechies wavelet bases are compared in performance evaluation as shown in Table 2. Let DBk denote the wavelet bases with filter length 2k. As we known, the wavelet filters with short length take less spatial information, while longer filters have more spatial relations during filtering, thus achieving smooth results. However, when the filters are longer enough, the details of the image will be smoothed. Table 2 shows the recognition rates of WVPK, NoDC, NoDC and HWVP based object categorization method under wavelet packet bases DB1, DB3, DB5, DB7 and DB9. The training image number is set to be 20 per category. Good performances are achieved by the DB5 which outperforms DB1 by about 2% in average.

Fig.4. Comparisons on Sport event dataset.

Fig.5. Comparisons on Scene-13 dataset.

Table 2. Recognition rates of the texture descriptors WVPK, NoDC, HIGH and HWVP under the wavelet packet bases with decomposition L=3 on Sport event dataset. Texture descriptors are extracted under Global pattern.

DB1 DB3 DB5 DB7 DB9 WVPK 44.2 45.5 46.4 46.1 46.3 NoDC 44.6 45.7 46.7 46.2 46.4 HIGH 44.7 45.8 46.7 46.3 46.5 HWVP 40.3 41.2 42.5 42.0 42.1

Fig.6. Comparisons on OT dataset.

Fig.7. Comparison for HWVP, NoDC, WVPK and HIGH on OT dataset

Fig.8. Comparisons of HIGH under Global and Local5 on Scene-13.

5.3. Discussion of L and performances

In order to select the appropriate texture descriptors of the proposed hierarchical wavelet packet transform, we compare WVPK, NoDC, HIGH and HWVP under different training images with a specified different decomposition level L. The average recognition rates of WVPK, NoDC, HIGH and HWVP (with L=4) under Global and Local5 on OT dataset with N (N =1, 5, 10, 15, 20, 25 and 30) training images per category are shown in Fig.7.

Fig.8 shows the comparisons of HIGH under Global and Local5 with various decomposition levels L (L=1, 2, 3, and 4) and training image numbers N (N =1, 5, 10, 15, 20, 25, and 30) per category on Scene-13 dataset. We find that with the increment of decomposition depth L, the performances improve obviously. From L=1 to L=2, the average recognition rate improves by about 5%. However, the improvement from L=2 to L=3 is about 3% in average. The improvements of L=3 to L=4 are very small for Global. In our opinion, this is caused by the fact that the image response to the subtle sub-band filters very weakly when the decomposition depth is more than three. Thus, their contributions are comparatively smaller. From this point of view, L=4 is enough for hierarchical wavelet packet based object categorization. Good performances are achieved by local texture descriptors.

From above Tables and figures, it is clear that, HIGH outperforms WVPK, NoDC and HWVP. This is the reason why details are some robust to the variations of luminance. HIGH outperforms NoDC, which is based on the fact that details of any other decomposition level have positive contributions to object categorization. The recognition rate of NoDC is better than that of WVPK by 0.5% in average. This is the reason why the texture information from the details of the sub-band low-pass is negative for object categorization.

6. Conclusion

In this paper, an effective object categorization method is proposed based on hierarchical wavelet packet texture descriptors. Compared with Gabor texture descriptors and PHOG based shape descriptor on the widely used OT, Scene-13 and Sport event datasets, the proposed hierarchical wavelet packet texture descriptors are more effective and robust for object categorization.

Details of hierarchical wavelet packet provide positive contribution to object discrimination. When the decomposition level is appropriately selected, satisfactory object categorization results can be achieved. With the increase of decomposition level,

object categorization performance improved. However, the computation cost and feature dimension increased dramatically. Four level wavelet packet transform is optimal for both the computation and effectiveness of the proposed hierarchical wavelet packet texture descriptors. Wavelet packet bases are also important for texture descriptors. Wavelet packet bases with appropriate lengths, such as DB5, can improve object categorization performance. Local texture descriptors extracted from the sub-images can improve the classification performance. The influence of wavelet packet bases of texture descriptors to the object categorization performances is discussed. This gives us some guidance to select appropriate wavelet packet bases.

The contribution of each sub-band of the hierarchical wavelet packet transform is assumed to be the same during object categorization. Hence, we think object categorization performance can be further improved in our future works from the following three aspects: 1) fusing the proposed texture descriptors with appearance and shape features; 2) embedding the texture descriptors into statistical learning framework; 3) learning salient sub-bands for each category and giving them stronger weights.

ACKNOWLEDGMENT

This work is supported in part by National 973 Project No.2007CB311002 and National Natural Science Foundation of China (CNSF) Project No.60572045. . References [1] A. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain, “Content based image retrieval at the end of the early years,” IEEE Trans. Pattern Anal. Mach. Intell., vol.22, no.12, 2000, pp.1349-1380. [2] L. Li, and F. Li, “What, where and who? Classifying events by scene and object recognition,” in Proc. ICCV, 2007. [3] A. Holub, M. Welling, and P. Perona, “Combining generative models and fisher kernels for object class recognition,” in Proc. CVPR, 2005. [4] E. Sudderth, and J. Fan, “Texture classification by wavelet packet signatures,” IEEE Trans. Pattern Anal. Mach. Intell., vol.15, no.11, 1993, pp.1186-1193. [5] S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories,” in Proc.CVPR, 2006. [6] A. Oliva, and A. Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope,” IJCV, vol. 42, no. 3, pp. 145-175, 2001. [7] X. Zhou, M. Wang, Q. Zhang, J. Zhang, and B. Shi, “Automatic image annotation by an iterative approach incorporating keyword correlations and region matching,” in Proc. CIVR 2007, pp. 25-32. [8] T. Hofmann, “Unsupervised learning by probabilistic latent semantic analysis,” Machine Learning, vol.42, no.1,

pp.177–196, 2001. [9] D. Blei, A. Ng, and M. Jordan, “Latent dirichlet allocation,” Journal of Machine Learning Research, no.3, pp.993–1022, 2003. [10] F. Li, R. Fergus, and P. Perona, “One-Shot learning of object categories,” IEEE Trans. Pattern Anal. Mach. Intell., vol.28, no.4, 2006, pp. 594-611. [11] F. Li, and P. Perona, “A Bayesian hierarchy model for learning natural scene categories,” in Proc. CVPR, 2005. [12] J. Sivic, B. Russell, A. Efros, A. Zisserman, and W. Freeman, “Discovering object categories in image collections,” in Proc. ICCV, 2005. [13] B. Russell, A. Efros, J. Sivic, et al., “Using multiple segmentations to discover objects and their extent in image collections,” in Proc. CVPR, 2006. [14] D. Liu, and T. Chen, “Semantic-shift for unsupervised object detection,” in Proc. CVPR workshop, 2006. [15] F. Monay, and D. Gatica-Perez, “PLSA-based image auto-annotation:constraining the latent space,” in Proc. ACM Multimedia, 2004. [16]Y. Zheng, M. Zhao, S. Neo, T. Chua, and Q. Tian, “Visual synset: towards a higher-level visual representation,” in Proc. CVPR, 2008. [17] J. Zhang, M. MarszaÃlek, S. Lazebnik, and C. Schmid, “Local features and kernels for classification of texture and object categories: a comprehensive study,” International Journal of Computer Vision, 2007. [18] A. Bosch, A. Zisserman, and X. Munoz, “Scene classification using a hybrid generative/discriminative approach,” IEEE Trans. Pattern Anal. Mach. Intell., vol.30, no.4, 2008, pp. 712-727. [19] T. Kadir, and M. Brady, “Saliency, scale and image description”, International Journal of Computer Vision, vol.45, no.2, 2001, pp.83-105. [20] J. Li, and J. Wang, “Automatic linguistic indexing of pictures by a statistical modeling approach,” IEEE Trans. Pattern Anal. Mach. Intell., vol.25, no.9, 2003, pp.1075-1088. [21] J. Bi, Y. Chen, and J. Wang, “A Sparse Support Vector Machine Approach to Region-Based Image Categorization,” in Proc. CVPR, 2005. [22] H. Zhang, A. Berg, M. Maire, and J. Malik, “Svm-knn: Discriminative nearest neighbor classification for visual category recognition,” in Proc. CVPR, 2006. [23]A. Bosch, A. Zisserman, and X. Munoz, “Representing shape with a spatial pyramid kernel,” in Proc. CIVR, 2007. [24] S. Fidler, M. Boben, and A. Leonardis, “Similarity-based cross-layered hierarchical representation for object categorization,” in Proc. CVPR, 2008. [25] C. Garcia, G. Zikos, and G. Tziritas, “Wavelet packet analysis for face recognition,” Image and Vision Computing, vol. 18, 2000, pp.289–297. [26] C. Garcia, and G. Tziritas, “Face detection using quantized skin color regions merging and wavelet packet analysis,” IEEE Trans. Multimedia, vol.1, no.3, 1999, pp.264-277. [27] Y. Ro, M. Kim, H. Kang, B. Manjunath, and J. Kim, “MPEG-7 Homogeneous Texture Descriptor,” ETRI Journal, vol.23, no.2, 2001, pp. 41-51. [28] D. Zhang, and G. Lu, “Content-based image retrieval using Gabor texture features,” in Proc. PCM, 2000, pp.1139-

1142. [29] A. Quattoni, M. Collins, and T. Darrell, “Conditional random fields for object recognition,” in NIPS, 2004. [30] A. Holub and P. Perona, “A discriminative framework for modeling object classes,” in Proc. ICCV, 2005. [31] G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray, “Visual categorization with bags of keypoints,” in Proc. ECCV, 2004. [32] D. Crandall, P. Felzenszwalb, and D. Huttenlocher, “Spatial priors for part-based recognition using statistical models,” in Proc. CVPR, 2005. [33] A. Berg, T. Berg, and J. Malik, “Shape matching and object recognition using low distortion correspondences,” in Proc. CVPR, 2005. [34] G. Wang, Y. Zhang, and F. Li, “Using dependent regions for object categorization in a generative framework,” in Proc. CVPR 2006. [35] S. Lazebnik, C. Schmid, and J. Ponce, “A maximum entropy framework for part-based texture and object recognition,” in Proc. ICCV, vol. 1, 2005, pp. 832–838. [36] S. Savarese, J. Winn, and A. Criminisi, “Discriminative object class models of appearance and shape by correlations,” in Proc. CVPR, 2006, pp. 2033–2040. [37] A. Agarwal, and B. Triggs, “Hyperfeatures-multilevel local coding for visual recognition,” in Proc. ECCV, 2006. [38] J. Yuan, Y. Wu, and M. Yang, “Discovery of collocation patterns: From visual words to visual phrases,” in Proc. CVPR, 2007. [39] Y. Teh, M. Jordan, M. Beal, and D. Blei, “Hierarchical Dirichlet processes,” Journal of the American Statistical Association, 2006. [40] E. Sudderth, A. Torralba, W. Freeman, and A. Willsky, “Describing visual scenes using transformed dirichlet processes,” in NIPS, 2005. [41] P. Gosselin, M. Cord, and S. Philipp-Foliguet, “Combining visual dictionary, kernel-based similarity and learning strategy for image category retrieval,” Computer Vision and Image Understanding, vol.110, 2008, pp. 403-417. [42] L. Wu, Y. Hu, , M. Li, N. Yu, and X. Hua, “Scale-invariant visual language modeling for object categorization,” IEEE Trans. Multimedia, vol.11, no.2, 2009, pp.286-294. [43] L. Wu, M. Li, Z. Li, W. Ma, and N. Yu, “ Visual language modeling for image classification,” in Proc. MIR, 2007, pp.115-124. [44] C. Galleguillos, A. Rabinovich, and S. Belongie, “Object categorization using co-occurrence, location and appearance,” in Proc. CVPR, 2008. [45] S. Kumar, and M. Hebert, “A hierarchical field framework for unified context-based classification,” in Proc. ICCV, 2005. [46] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. Belongie, “Object in context,” in Proc. ICCV, 2007. [47] L. Cao, and F. Li, “Spatially coherent latent topic model for concurrent object segmentation and classification,” in Proc. ICCV, 2007. [48] D. Larlus, and F. Jurie, “Combining appearance models and markov random fields for category level object segmentation,” in Proc. CVPR, 2008.