Outdoor scene image segmentation based on background recognition and perceptual organization

13
IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 21, NO. 3, MARCH 2012 1007 Outdoor Scene Image Segmentation Based on Background Recognition and Perceptual Organization Chang Cheng, Andreas Koschan, Member, IEEE, Chung-Hao Chen, David L. Page, and Mongi A. Abidi Abstract—In this paper, we propose a novel outdoor scene image segmentation algorithm based on background recognition and perceptual organization. We recognize the background objects such as the sky, the ground, and vegetation based on the color and texture information. For the structurally challenging objects, which usually consist of multiple constituent parts, we developed a perceptual organization model that can capture the nonacci- dental structural relationships among the constituent parts of the structured objects and, hence, group them together accordingly without depending on a priori knowledge of the specic objects. Our experimental results show that our proposed method outper- formed two state-of-the-art image segmentation approaches on two challenging outdoor databases (Gould data set and Berkeley segmentation data set) and achieved accurate segmentation quality on various outdoor natural scene environments. Index Terms—Boundary energy, image segmentation, percep- tual organization. I. INTRODUCTION I MAGE segmentation is considered to be one of the funda- mental problems for computer vision. A primary goal of image segmentation is to partition an image into regions of co- herent properties so that each region corresponds to an object or area of interest [30]. In general, objects in outdoor scenes can be divided into two categories, namely, unstructured objects (e.g., sky, roads, trees, grass, etc.) and structured objects (e.g., cars, buildings, people, etc.). Unstructured objects usually comprise the backgrounds of images. The background objects usually have nearly homogenous surfaces and are distinct from the structured objects in images. Many recent appearance-based methods have achieved high accuracy in recognizing these background object classes [40], [41], [53]. The challenge for Manuscript received August 17, 2010; revised April 16, 2011 and July 22, 2011; accepted August 16, 2011 Date of publication September 22, 2011; date of current version February 17, 2012. This work was sup- ported in part by the University Research Program in Robotics under Grant DOE-DE-FG52-2004NA25589 and in part by the U.S. Air Force under Grant FA8650-10-1-5902. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Xilin Chen. C. Cheng is with Riverbed Technology, Sunnyvale, CA 94085 USA (e-mail: [email protected]). A. Koschan and M. A. Abidi are with the Imaging, Robotics, and Intelligent Systems Laboratory, Department of Electrical Engineering and Computer Science, The University of Tennessee, Knoxville, TN 37996 USA (e-mail: [email protected]; [email protected]). C.-H. Chen is with the Department of Electrical and Computer Engineering, Old Dominion University, Norfolk, VA 23529 USA (e-mail: [email protected]). D. L. Page is with Third Dimension Technologies LLC, Knoxville, TN 37931 USA (e-mail: [email protected]). Color versions of one or more of the gures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identier 10.1109/TIP.2011.2169268 outdoor segmentation comes from the structured objects that are often composed of multiple parts, with each part having distinct surface characteristics (e.g., colors, textures, etc.). Without certain knowledge about an object, it is difcult to group these parts together. Some studies [2], [4], [40], [41], [53] tackle this difculty by using object-specic models. However, these methods do not perform well when the images contain objects that have not been seen before. Different from these studies, in this paper, our research objective is to explore detecting object boundaries in outdoor scene images solely based on some general properties of the real-world objects, such as perceptual organization laws, without depending on a priori knowledge of the specic objects. It has been long known that perceptual organization plays a powerful role in human visual perception. Perceptual organ- ization, in general, refers to a basic capability of the human visual system to derive relevant groupings and structures from an image without prior knowledge of its contents. The Gestalt psychologists summarized some underlying principles (e.g., proximity, similarity, continuity, symmetry, etc.) that lead to human perceptual grouping. They believed that these laws capture some basic abilities of the human mind to proceed from the part to the whole [8]. In addition to the classic Gestalt laws, recently, Jacobs [20] and Jacobs et al. [32] have pointed out that convexity also plays an important role on perceptual organization because many real-world objects such as build- ings, vehicles, and furniture tend to have convex shapes. These Gestalt laws can be summarized by a single principle, i.e., the principle of nonaccidentalness, which means that these structures are most likely produced by an object or process, and are unlikely to arise at random [11], [33]. In other words, the validation of Gestalt laws originates from the fact that these laws reect the general properties of the man-made and biological objects in the world [34]. However, there are several challenges for applying Gestalt laws to real-world applications. One challenge is to nd quan- titative and objective measures of these grouping laws. The Gestalt laws are in descriptive forms. Therefore, one needs to quantify them for scientic use. Another challenge consists of nding a way to combine the various grouping factors since object parts can be attached in many different ways. Under different situations, different laws may be applied. Therefore, a perceptual organization system requires combining as many Gestalt laws as possible. The greater the number of Gestalt laws incorporated, the better chance the perceptual organiza- tion systems may apply appropriate Gestalt laws in practices. Previous studies do not nd elegant solutions for handling these two challenges. 1057-7149/$26.00 © 2011 IEEE

Transcript of Outdoor scene image segmentation based on background recognition and perceptual organization

IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 21, NO. 3, MARCH 2012 1007

Outdoor Scene Image Segmentation Based onBackground Recognition and Perceptual Organization

Chang Cheng, Andreas Koschan, Member, IEEE, Chung-Hao Chen, David L. Page, and Mongi A. Abidi

Abstract—In this paper, we propose a novel outdoor scene imagesegmentation algorithm based on background recognition andperceptual organization. We recognize the background objectssuch as the sky, the ground, and vegetation based on the colorand texture information. For the structurally challenging objects,which usually consist of multiple constituent parts, we developeda perceptual organization model that can capture the nonacci-dental structural relationships among the constituent parts of thestructured objects and, hence, group them together accordinglywithout depending on a priori knowledge of the specific objects.Our experimental results show that our proposed method outper-formed two state-of-the-art image segmentation approaches ontwo challenging outdoor databases (Gould data set and Berkeleysegmentation data set) and achieved accurate segmentationquality on various outdoor natural scene environments.

Index Terms—Boundary energy, image segmentation, percep-tual organization.

I. INTRODUCTION

I MAGE segmentation is considered to be one of the funda-mental problems for computer vision. A primary goal of

image segmentation is to partition an image into regions of co-herent properties so that each region corresponds to an object orarea of interest [30]. In general, objects in outdoor scenes can bedivided into two categories, namely, unstructured objects (e.g.,sky, roads, trees, grass, etc.) and structured objects (e.g., cars,buildings, people, etc.). Unstructured objects usually comprisethe backgrounds of images. The background objects usuallyhave nearly homogenous surfaces and are distinct from thestructured objects in images. Many recent appearance-basedmethods have achieved high accuracy in recognizing thesebackground object classes [40], [41], [53]. The challenge for

Manuscript received August 17, 2010; revised April 16, 2011 and July22, 2011; accepted August 16, 2011 Date of publication September 22,2011; date of current version February 17, 2012. This work was sup-ported in part by the University Research Program in Robotics under GrantDOE-DE-FG52-2004NA25589 and in part by the U.S. Air Force under GrantFA8650-10-1-5902. The associate editor coordinating the review of thismanuscript and approving it for publication was Prof. Xilin Chen.C. Cheng is with Riverbed Technology, Sunnyvale, CA 94085 USA (e-mail:

[email protected]).A. Koschan and M. A. Abidi are with the Imaging, Robotics, and Intelligent

Systems Laboratory, Department of Electrical Engineering and ComputerScience, The University of Tennessee, Knoxville, TN 37996 USA (e-mail:[email protected]; [email protected]).C.-H. Chen is with the Department of Electrical and Computer Engineering,

Old Dominion University, Norfolk, VA 23529 USA (e-mail: [email protected]).D. L. Page is with Third Dimension Technologies LLC, Knoxville, TN 37931

USA (e-mail: [email protected]).Color versions of one or more of the figures in this paper are available online

at http://ieeexplore.ieee.org.Digital Object Identifier 10.1109/TIP.2011.2169268

outdoor segmentation comes from the structured objects thatare often composed of multiple parts, with each part havingdistinct surface characteristics (e.g., colors, textures, etc.).Without certain knowledge about an object, it is difficult togroup these parts together. Some studies [2], [4], [40], [41],[53] tackle this difficulty by using object-specific models.However, these methods do not perform well when the imagescontain objects that have not been seen before. Different fromthese studies, in this paper, our research objective is to exploredetecting object boundaries in outdoor scene images solelybased on some general properties of the real-world objects,such as perceptual organization laws, without depending on apriori knowledge of the specific objects.It has been long known that perceptual organization plays a

powerful role in human visual perception. Perceptual organ-ization, in general, refers to a basic capability of the humanvisual system to derive relevant groupings and structures froman image without prior knowledge of its contents. The Gestaltpsychologists summarized some underlying principles (e.g.,proximity, similarity, continuity, symmetry, etc.) that lead tohuman perceptual grouping. They believed that these lawscapture some basic abilities of the human mind to proceedfrom the part to the whole [8]. In addition to the classic Gestaltlaws, recently, Jacobs [20] and Jacobs et al. [32] have pointedout that convexity also plays an important role on perceptualorganization because many real-world objects such as build-ings, vehicles, and furniture tend to have convex shapes. TheseGestalt laws can be summarized by a single principle, i.e.,the principle of nonaccidentalness, which means that thesestructures are most likely produced by an object or process,and are unlikely to arise at random [11], [33]. In other words,the validation of Gestalt laws originates from the fact thatthese laws reflect the general properties of the man-made andbiological objects in the world [34].However, there are several challenges for applying Gestalt

laws to real-world applications. One challenge is to find quan-titative and objective measures of these grouping laws. TheGestalt laws are in descriptive forms. Therefore, one needs toquantify them for scientific use. Another challenge consists offinding a way to combine the various grouping factors sinceobject parts can be attached in many different ways. Underdifferent situations, different laws may be applied. Therefore,a perceptual organization system requires combining as manyGestalt laws as possible. The greater the number of Gestaltlaws incorporated, the better chance the perceptual organiza-tion systems may apply appropriate Gestalt laws in practices.Previous studies do not find elegant solutions for handling thesetwo challenges.

1057-7149/$26.00 © 2011 IEEE

1008 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 21, NO. 3, MARCH 2012

The main contribution of this paper is a developed perceptualorganization model (POM) for boundary detection. The POMquantitatively incorporates a list of Gestalt laws and therefore isable to capture the nonaccidental structural relationships amongthe constituent parts of a structured object. With this model,we are able to detect the boundaries of various salient struc-tured objects under different outdoor environments. The exper-imental results show that our proposed method outperformedtwo state-of-the-art studies [49], [61] on two challenging imagedatabases consisting of a wide variety of outdoor scenes and ob-ject classes. An earlier version of this paper appeared in [56].The remainder of this paper is organized as follows: In

Section II, we discuss some related studies, including imagesegmentation and boundary detection methods. In Section III,we describe our POM and scene image segmentation algo-rithm. The experimental results are presented in Section IV,and Section V concludes this paper.

II. RELATED WORK

Bottom-up image segmentation methods only utilizelow-level features such as colors, textures, and edges todecompose an image into uniform regions. Bottom-up methodscan be divided into two categories, namely, region-basedand contour-based approaches. A group of approaches treatsimage segmentation as a graph cut problem. Shi and Malik[6] proposed the normalized cut criterion that removes thetrivial solutions of cutting small sets of isolated nodes in thegraph. Felzenszwalb and Huttenlocher [5] proposed an efficientgraph-based generic image segmentation algorithm. As with thenormalized cut method, this method also tries to capture non-local image characteristics. Comaniciu and Meer [48] treatedimage segmentation as a cluster problem in a spatial-rangefeature space. Their mean-shift segmentation algorithm hasillustrated excellent performance on different image data setsand has been considered as one of the best bottom-up imagesegmentation methods. Some of these region-based methodshave been widely used to generate coherent regions calledsuperpixels for many applications [1], [42], [51], [53].Contour closure is one of the important grouping factors

identified by Gestalt psychologists. Early contour-based studiessuch as active contour methods only utilize boundary propertiessuch as intensity gradients. Zhu and Yuille [29] first used bothboundary and region information within an energy optimizationmodel. For their method to achieve good performance, a set ofinitial seeds needs to be placed inside each homogenous region.Jermyn and Ishikawa [21] proposed a new form of energy

function. Their energy function is defined on the space of bound-aries in the image domain by a ratio of two integrals aroundthe boundary. The numerator of the energy function is a mea-sure of the “flow” of some quantity into or out of the region.The denominator is a generalized measure of the length of theboundary. The main contribution of this energy function is to in-corporate the general types of region information. Our method isbased on this form of energy function, which is addressed in de-tail in Section III. In addition to the above energy function, somestudies [45], [46] are built on different energy functions suchas the Mumford–Shah segmentation model [47].Recently, var-

ious boundary detection methods based on statistical learninghave been proposed in technical literature. Martin et al. [24]treated boundary detection as a supervised learning problem.They used a large data set of human-labeled boundaries in nat-ural images to train a boundary model. Their model can thenpredict the possibility of a pixel being a boundary pixel basedon a set of low-level cues such as brightness, color, and textureextracted from local image patches. Dollar et al. [36] and Hoiemet al. [13] followed a similar idea. Noticing the importance ofcontext information, Dollar et al. [36] designed their boundarydetection algorithm based on a large number of generic featurescalculated over a large image patch. This algorithm expects thecontext information to be provided by a large aperture. Hoiem etal. [13] estimated occlusion boundaries based on both 2-D per-ceptual cues and 3-D cues such as surface orientation and depthestimates.Multiclass image segmentation (or semantic segmentation)

has become an active research area in recent years. The goal hereis to label each pixel in the image with one of a set of predefinedobject class labels. Many studies operate on pixel level. Shottonet al. [40] assigned a class label to a pixel based on a joint ap-pearance, shape, and context model. In [50], Shotton et al. pro-posed the use of semantic texton forests for fast classification. Anumber of studies utilize superpixels as a starting point for theirtask. Gould et al. [54] proposed a superpixel-based conditionalrandom field to learn the relative location offsets of categories.In their recent work [52], they developed a classification modelthat is defined in terms of a unified energy function over sceneappearance and scene geometry structure. Other notable studiesin this area include Micusik and Kosecka [42], Yang et al. [53],and He et al. [55].Finally, we review some previous efforts attempting to apply

Gestalt laws to guide image segmentation. A number of studies[8], [20], [37], [38] only applied one or two Gestalt laws (e.g.,proximity, curvilinear, continuity, closure, or convexity, etc.) on1-D image features (e.g., lines, curves, and edges) to find closedcontours in images. Lowe [8] and Mahamud et al. [37] inte-grated proximity and continuity laws to detect smooth closedcontour bounding unknown objects in real images. Ren et al.[38] developed a probabilistic model of continuity and closurebuilt on a scale-invariant geometric structure to estimate objectboundaries. Jacobs [20] emphasized that convexity plays an im-portant role in perceptual organization and, in many cases, over-rules other laws such as closure. Mohan and Nevatia [10] incor-porated several Gestalt laws to detect a group of collated fea-tures describing objects. Their segmentation algorithm is basedon a set of ad hoc geometric relationships among these collatedfeatures and is not based on the optimization of a measure ofthe value of a group. McCafferty [9] formulated the groupingproblem in perceptual organization as an energy minimizationproblem where the energy of a grouping is defined as a func-tion of how well it obeys the Gestalt laws. The total energy ofa grouping is treated as the linear combination of the individualgrouping energy values corresponding to the Gestalt laws. Des-olneux et al. [39] studied four Gestalt laws, namely, similarityin colors, similarity in sizes, alignments and proximity in point,and line and curve image features. They proposed the corre-sponding quantitative measurements for the significance of the

CHENG et al.: IMAGE SEGMENTATION BASED ON BACKGROUND RECOGNITION AND PERCEPTUAL ORGANIZATION 1009

four Gestalt laws and also showed the importance of the collab-oration of Gestalt laws in the perceptual organization process.

III. IMAGE SEGMENTATION ALGORITHM

Here, we present a novel image segmentation algorithm foroutdoor scenes. Our research objective here is to explore de-tecting object boundaries solely based on some general prop-erties of the real-world objects, such as perceptual organiza-tion laws, without depending on object-specific knowledge. Ourimage segmentation algorithm is inspired by a POM, which isthe main contribution of this paper. The POM quantitatively in-corporates a list of Gestalt cues. By doing this, the POM candetect many structured object boundaries without having anyobject-specific knowledge of these objects.Most studies to date apply Gestalt laws on zero- or 1-D

image features (e.g., points, lines, curves, etc.). Different tothese studies, our method applies Gestalt laws on 2-D imagefeatures, i.e., object parts. We first give the formal definition ofsalient structured objects and object parts in images:Definition 1: A salient structured object refers to a structured

object with an independent and detectable physical boundary.An independent physical boundary means that the boundary

of the object should not be contained in another structured ob-ject. For example, the window of a building should be treatedas a part of the building because the whole physical boundaryof the window is contained in the building’s physical boundary.In addition, the physical boundary of a salient object should bedetectable by state-of-the-art computer vision algorithms. For agroup of people, if each individual is too small or several peoplewear the same color clothes, making it difficult to clearly detecteach individual’s boundary with today’s computer vision tech-nology, then the whole group of people should be treated as asalient object.Definition 2: An object part refers to a homogenous portion

of a salient structured object surface in an image.Based on our empirical observation, most object parts have

approximately homogenous surfaces (e.g., color, texture, etc.).Therefore, the homogenous patches in an image approximatelycorrespond to the parts of the objects in the image. Throughoutthis paper, we use this definition for object parts.In the remainder of this section, we first introduce how to

recognize the common background objects such as skies, roads,and vegetation in outdoor natural scenes. Then, we present ourPOM and the boundary detection algorithm. Finally, we de-scribe our image segmentation algorithm based on the POM.

A. Background Identification in Outdoor Natural Scenes

According to [40], objects appearing in natural scenes can beroughly divided into two categories, namely, unstructured andstructured objects. Unstructured objects typically have nearlyhomogenous surfaces, whereas structured objects typically con-sist of multiple constituent parts, with each part having dis-tinct appearances (e.g., color, texture, etc.). The common back-grounds in outdoor natural scenes are those unstructured ob-jects such as skies, roads, trees, and grasses. These background

objects have low visual variability and, in most cases, are dis-tinguishable from other structured objects in an image. For in-stance, a sky usually has a uniform appearance with blue orwhite colors; a tree or a grass usually has a textured appearancewith green colors. Therefore, these background objects can beaccurately recognized solely based on appearance information.Supposewe can use a bottom-up segmentationmethod to seg-

ment an outdoor image into uniform regions. Then, some of theregions must belong to the background objects. To recognizethese background regions, we use a technique similar to [40].The key for this method is to use textons to represent objectappearance information. The term texton is first presented in[44] for describing human textural perception. The whole tex-tonization process proceeds as follows: First, the training im-ages are converted to the perceptually uniform CIE colorspace. Then, the training images are convolved with a 17-Dfilter bank. We use the same filter bank as that in [41], whichconsists of Gaussians at scales 1, 2, and 4; the and deriva-tives of Gaussians at scales 2 and 4; and Laplacians of Gaus-sians at scales 1, 2, 4, and 8. The Gaussians are applied to allthree color channels, whereas the other filters are applied onlyto the luminance channel. By doing so, we obtain a 17-D re-sponse for each training pixel. The 17-D response is then aug-mented with the CIE , , channels to form a 20-D vector.This is different from [41] because we found that, after aug-menting the three color channels, we can achieve slightly higherclassification accuracy. Then, the Euclidean-distance -meansclustering algorithm is performed on the 20-D vectors collectedfrom the training images to generate cluster centers. Thesecluster centers are called textons. Finally, each pixel in eachimage is assigned to the nearest cluster center, producing thetexton map. More details about the textonization process are re-ferred to [41]. After this textonization process, each image re-gion of the training images is represented by a histogram of tex-tons. We then use these training data to train a set of binary Ad-aboost classifiers [43] to classify the unstructured objects (e.g.,skies, roads, trees, grasses, etc.). Similar to the result in [40], ourclassifiers also achieve high accuracy on classifying these back-ground objects in outdoor images. An example of backgroundidentification is illustrated in Fig. 4(b).

B. POM

Most images consist of background and foreground objects.Most foreground objects are structured objects that are oftencomposed of multiple parts, with each part having distinct sur-face characteristics (e.g., color, texture, etc.). Assume that wecan use a bottom-up method to segment an image into uniformpatches, then most structured objects should be oversegmentedto multiple patches (parts). After the background patches areidentified in the image, the majority of the remaining imagepatches correspond to the constituent parts of structured ob-jects [see Fig. 4(b)] for an example). The challenge here is howto piece the set of constituted parts of a structured object to-gether to form a region that corresponds to the structured ob-ject without any object-specific knowledge of the object. Totackle this problem, we develop a POM. Accordingly, our image

1010 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 21, NO. 3, MARCH 2012

segmentation algorithm can be divided into the following threesteps.1) Given an image, use a bottom-up method to segment it intouniform patches.

2) Use background classifiers to identify background patches.3) Use POM to group the remaining patches (parts) to largerregions that correspond to structured objects or semanti-cally meaningful parts of structured objects.

We now go through the details of our POM. Even after back-ground identification, there is still a large number of patches(parts) remaining. Different combinations of the parts form dif-ferent regions. How can we find a region that corresponds to astructured object? We want to use the Gestalt laws to guide usto find and group these kinds of regions. Our strategy is that,since there always exist some special structural relationshipsthat obey the principle of nonaccidentalness among the con-stituent parts of a structured object, we may be able to piecethe set of parts together by capturing these special structural re-lationships. The whole process works as follows: We first pickone part and then keep growing the region by trying to group itsneighbors with the region. The process stops when none of theregion’s neighbors can be grouped with the region. To achievethis, we develop a measurement to measure how accurately a re-gion is grouped. The region goodness directly depends on howwell the structural relationships of parts contained in the re-gion obey Gestalt laws. In other words, the region goodness isdefined from perceptual organization perspective. With the re-gion measurement, we can go find the best region that containsthe initial part. In most cases, the best region corresponds to asingle structured object or the semantically meaningful part ofthe structured object. This problem is formalized as follows.Problem Definition: Let represent a whole image that con-

sists of the regions that belong to backgrounds and the re-gions that belong to structured objects , . Afterthe background identification, we know that most of the struc-tured objects in the image are contained in a subregion . Letbe the initial partition of from a bottom-up segmentation

method. Let denote a uniform patch from initial partition .For , is one of the constituent parts ofan unknown structured object. Based on initial part , we wantto find the maximum region so that the initial part

and for any uniform patch , where ,should have some special structural relationships that obey thenonaccidentalness principle with the remaining patches in .This is formulated as follows:

with (1)

where is a region in , is the boundary of , andis a boundary energy function. The boundary energy functionprovides a tool for measuring how good a region is. The goalhere is to find the best region in that contains initial part. The boundary energy function is defined as follows [21]:

(2)

where is the boundary length of . is the weightfunction in region . The criterion of region goodness depends

on how weight function is defined and also boundarylength . One can use to encode any information(e.g., color or texture) over region . In our case, we wantto use to encode the local structural relationships be-tween neighboring parts contained in region . Boundary length

, on the other hand, reflects the global property of region. In the following, the boundary energy function has been de-

signed in a way that the four Gestalt laws, i.e., similarity, sym-metry, alignment, and proximity, affect boundary energyby affecting weight function . The convexity law, whichis a global property, affects boundary energy based onthe characteristic of boundary length . First, we definethe weight function in patch as follows:

with (3)

where is a weight vector. We empirically set inour implementation. Vector , which is a point inthe structural context space encoding the structure informationof image patch . is a reference point in the structure’s contextspaces, which encodes the structural information of initial part. Since is the only known information that we have aboutthe unknown structured object, we use as a reference point.Notice that, at the beginning of the grouping process, regiononly contains part . In this case, , , whichmeans that we always set the largest weighted element to initialpart . Then, we try to grow region by including some of theneighbors of region and assign weights to the included neigh-bors accordingly so that we can measure if the growing region isbetter than the old one. Function calculates the absolute valuefor each vector element. Weight function having a largevalue inside a newly included patch means that current imagepatch has a strong structural relationship with the constituentparts of the unknown structured object that contains initial part. is the cohesiveness strength, which we will define later.is the boundary complexity of image patch , which can be

measured as [15]

(4)

(5)

(6)

where is the number of pixels of the boundary of imagepatch , is the length of a sliding window over the entireboundary of patch . and are the respectivestrength and frequency of the singularity at scale (step) .and are the two end pixels of a segment of the boundaryin the window. and are the pixels betweenand . is the number of notches in the window. A notchmeans a nonconvex portion of a polygon, which is defined as avertex with an inner angle larger than 180 (see details in [15]and [18]). Examples of the boundary complexity of regular andirregular shapes are shown in Fig. 1. Based on the similarityof the boundary complexity, we can distinguish man-made ob-ject parts, which usually have regular shapes, from vegetations,

CHENG et al.: IMAGE SEGMENTATION BASED ON BACKGROUND RECOGNITION AND PERCEPTUAL ORGANIZATION 1011

Fig. 1. Examples of shape regularity. First row: regular shapes. Second row:irregular shapes. Notice that regular shape objects have smaller values thanirregular shape objects.

which usually have irregular shapes. This is especially useful fordistinguishing the vegetation that may not be recognized solelybased on appearance. Therefore, the first Gestalt law we encodein the POM is the similarity law.After obtaining the boundary complexity of image patch ,

we then measure how tightly image patch is attached to theparts of the unknown structured object that contains initial part. is the cohesiveness strength and is calculated as

forfor neighbors

(7)where is a neighboring patch of patch . The maximum valueof cohesiveness strength is one. For patch , the cohesivenessis always set as the maximum value since we know for surethat the patch belongs to the unknown structured object. Assumethe cohesiveness strength of patch is known. measures thesymmetry of and along a vertical axis and is defined as

(8)

where is the Kronecker delta function; and are thecolumn coordinates of the centroids of and . If andare very close, this means that patches and are approxi-mately symmetric along a vertical axis. If patch has a strongcohesiveness strength to the unknown object containing patch, then patch also has a strong cohesiveness strength to theunknown object containing patch . This means that patchis tightly attached to some parts of the unknown object. Thisis because parts that are approximately symmetric along avertical axis are very likely belonging to the same object. Thus,this test encodes the symmetry law. Examples of symmetricrelationships are shown in Fig. 2(a). measures the alignmentof patches and

(9)

where and are the boundaries of and , respectively, andis the extension of the common boundary between and

. denotes the empty set. This alignment test encodes the con-tinuity law. If two object parts are strictly aligned along a direc-tion, then the boundary of the union of the two components willhave a good continuation. Accordingly, alignment is a strongindication that these two parts may belong to the same object.Examples of alignment relationships are shown in Fig. 2(b).

Fig. 2. (a) Symmetry relationships: the red dots indicate the centroids of thecomponents. (b) Alignment relationships. (c) Two components are strongly at-tached. (d) Two components are weakly attached.

If and are neither symmetric nor aligned, then the cohe-siveness strength of image patch depends on how it is attachedto . measures the attachment strength between and . It isdefined as

(10)

where and are constants (we empirically set andin our implementation). The attachment strength de-

pends on the ratio of the common boundary length betweenand to the sum of the boundary lengths of and . If there isa large size difference between and (i.e., or

), it usually means that the large one belongsto the background such as a wall or a big vegetation. If andhave similar sizes and share a long common boundary, this

means that and might be adjacent in the 3-D world. In otherwords, is considered to be in close proximity to . For this case,and are considered to be strongly attached. is the angle be-tween the line connecting the two ends of and the horizontalline starting from one end of . Many objects may be locatednext to each other in natural outdoor scenes. Therefore, even ifpatches and are tightly attached along the horizontal direc-tion, patches and may still belong to two neighboring objects.We use “ ” in (10) to control the cohesiveness strengthof two attaching patches according to the attachment orienta-tion. In general, vertically attached parts have larger attachmentstrength than those horizontally attached. Examples of strongand weak attachments are shown in Fig. 2(c) and (d), respec-tively.We have explicitly encoded four Gestalt laws (i.e., similarity,

symmetry, continuity, and proximity) into our POM. The fourGestalt laws affect weight function assigned to differentparts and hence affect the boundary energy for differentregions . The convexity law also affects the boundary energy

for different regions . Since convexity is a global prop-erty, it affects the boundary energy in a different way. Asshown in Fig. 3, patch is embedded into patch , which causesa concavity on patch . The boundary length of the region thatcontains and is shorter than that of the region that containsonly patch due to the concavity on patch . As a result, theboundary energy of the region that contains patches and issmaller than that of the region that only contains patch . There-fore, patches and are treated as one entity by our POM. Insummary, any parts that are embedded into a big entity will be

1012 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 21, NO. 3, MARCH 2012

Fig. 3. Example of convexity relationship. The boundary energy of(middle) region is larger than that of the (right) region that contains and. Thus, our POM will group and together.

grouped together with the big entity due to their contribution ofdecreasing the boundary length of the new-formed big entity.This is because the embedded components increase the degreeof convexity of the big entity they are embedded in.Similar to the human visual system, our POM can “perceive”

a list of special structural relationships that obey the principle ofnonaccidentalness such as similarity in shape regularity, sym-metry, alignment, adjacency, and embedment. The “perception”is quantified by boundary energy whenever a new member isadded to a group. If the new member has some structural rela-tionships obeying the principle of nonaccidentalness with othermembers in the group, then the boundary energy of the new-formed group is smaller than the one of the old group. Other-wise, the boundary energy of the new-formed group is largerthan the one of the old group.The remaining task is to find the region in (1) that has the

minimum boundary energy among all the regions that containimage patch . In other words, we want to find the best regionthat contains image patch and all image patches containedin the region that have some special structural relationshipsobeying the principle of nonaccidentalness with each other.This region often corresponds to the whole structured object orthe semantically meaningful portion of the structured object.The challenge is that there may exist a large number of possibleregions that contain image patch in , and it is compu-tationally expensive to search all the possible regions to findthe one with the global minimum boundary energy. Therefore,we develop an efficient boundary detection algorithm basedon a breadth-first search strategy. Instead of finding the regionwith the global minimum boundary energy, the algorithm triesto find a region with the local minimum boundary energy.Although it is not guaranteed that the algorithm is always ableto find the region with the optimal boundary energy, we havefound that it works quite well in practice.

Algorithm 1 Boundary detection based on perceptualorganization

INPUT: , , and reference region

OUTPUT: region that contains with the minimalboundary energy in a local area of neighbors

1. Let .

2. Letneighbors .

3. Repeat steps 4–7 for .

4. Select a subset of with regions: ,so that , , there exists a path in connectingto .

5. Measure the boundary energy of with (2).

6. If , set , GOTOstep 2.

7. Otherwise select the next set of from and repeatsteps 4–7 until all possible have been tested.

8. Return .

At the beginning, only contains image patch . Then, thealgorithm measures the boundary energy of the combinations ofand its immediate neighbors. The algorithm stops when no

combination of and its immediate neighbors have smallerboundary energy than that of . The “ ” in step 3 teststhe combination of and a single neighboring region of .The “ ” tests the combination of and a pair of connectedneighboring regions of , and so on. In practice, we have foundthat even when , the algorithm performs well in general.

C. Image Segmentation Algorithm

The POM introduced in Section III-B can capture the specialstructural relationships that obey the principle of nonacciden-talness among the constituent parts of a structured object. Toapply the proposed POM to real-world natural scene images,we need to first segment an image into regions so that each re-gion approximately corresponds to an object part. In our im-plementation, we make use of Felzenszwalb and Huttenlocher’s[5] approach to generate initial superpixels for an outdoor sceneimage.We choose this method because it is very efficient and theresult of the method is comparable to the mean-shift [48] algo-rithm. However, the initial superpixels are, in many cases, stilltoo noisy. To further improve the segmentation quality, we applya segment-merge method on the initial superpixels to merge thesmall size regions (i.e., region size 0.03% of the image size)with their neighbors. These small size regions are often causedby the texture of surfaces or by the inhomogeneous portions ofsome part surfaces. Since these small size image regions con-tribute little to the structure information (shape and size) of ob-ject parts, we merge them together with their larger neighbors toimprove the performance of our POM. In addition, if two adja-cent regions have similar colors, we also merge them together.By doing so, we obtain a set of improved superpixels. Most ofthese improved superpixels approximately correspond to objectparts.We now turn to the image segmentation algorithm. Given

an outdoor scene image, we first apply the segment-mergetechnique described above to generate a set of improved su-perpixels. Most of the superpixels approximately correspondto object parts in that scene. We build a graph to representthese superpixels: Let be an undirected graph.Each vertex corresponds to a superpixel, and each edge

corresponds to a pair of neighboring vertices. Wethen use our background classifiers described in Section III-Ato divide into two parts: backgrounds such as sky, roads,grasses, and trees and structured parts . Most of the

CHENG et al.: IMAGE SEGMENTATION BASED ON BACKGROUND RECOGNITION AND PERCEPTUAL ORGANIZATION 1013

Fig. 4. Illustration of our segmentation pipeline. (a) Bottom: input images. Top: initial superpixels from [5]. (b) Top: improved superpixels with backgroundobjects identified. Sky is labeled as blue, ground is labeled as yellow, and vegetations (tree or grass) are labeled as green. Bottom: an example of perceptualorganization process. stands for boundary energy. First, the bottom white part of the white car is selected. The for the part is measured as 5.06.Then, our POM groups the two pieces of front windows with the white part based on convexity laws. The for these two regions are measured as 5.39 and5.46, respectively. Except for these two parts and the small segment of the front wheel, other parts do not have special geometric relations with the white part,

such as the bottom part of the white sign behind the front part of the car. The for the region containing the sign part is measured as 5.09. Therefore, theregion with of 5.46 is detected as the best region for the white part of the car. (c) Top: result of the first round perceptual organization. Notice that thedifferent parts of the white car now have been grouped into two big pieces. These two big pieces are aligned. Bottom: final segmentation result after second roundperceptual organization. Notice that the different parts of the white car are grouped together as a single object. (This figure is best viewed in color.)

structured objects in the scene are therefore contained in .We then apply our perceptual organization algorithm on .At the beginning, all the components in are marked asunprocessed. Then, for each unprocessed component in, we use the boundary detection algorithm described in

Section III-B to detect the best region that contains vertex. Region may correspond to a single structured object

or the semantically meaningful part of a structured object. Wemark all the components comprising as processed. Thealgorithm gradually moves from the ground plane up to the skyuntil all the components in are processed. Then, we finishone round of perceptual organization procedure and use thegrouped regions in this round as inputs for the next round ofperceptual organization on . At the beginning of a new roundof perceptual organization, we merge the adjacent componentsif they have similar colors and build a new graph for the newcomponents in . This perceptual organization procedure isrepeated for multiple rounds until no components in can begrouped with other components. In practice, we find that theresult of two rounds of grouping is good enough in most cases.At last, in a postprocess procedure, we merge all the adjacentsky and ground objects together to generate final segmentation.An illustration of the algorithm’s pipeline is shown in Fig. 4.

IV. EXPERIMENTAL RESULTS

A. Gould Database

We first test our image segmentation algorithm on a recentlyreleased Gould image data set (GDS) [52]. This data set contains

715 images of urban and rural scenes assembled from a collec-tion of public image data sets: LabelMe [57], MSRC-21 [40],PASCAL [58], and geometric context [12]. The images on thisdata set are downsampled to approximately 320 pixels 240pixels. The images contain a wide variety of man-made and bi-ological objects such as buildings, signs, cars, people, cows, andsheep. This data set provides ground truth object class segmen-tations that associate each region with one of eight semanticclasses (sky, tree, road, grass, water, building, mountain, or fore-ground). In addition, the object class labels, the ground truth ob-ject segmentations that associate each segment with one phys-ical object, are also provided. The data set is publically availablefrom the first author’s [52] website. Following the same setupin [52], we randomly split the data set into 572 training imagesand 143 testing images.We benchmarked a state-of-the-art class segmentation

method Gould09 [49] for reference on this data set. Like ourmethod, Gould09 also used superpixels as a starting point. Weused the normalized cut algorithm [7] to generate 400 super-pixels (per image) for use in the Gould09 method. The Gould09method is a slight variant of the baseline method described in[54]. The baseline method in [54] achieved comparable resultagainst the relative location prior method in [54], Shotton’smethod [40], and Yang’s method [53] on the MSRC-21 [40]data set. Gould09 is trained on the training set and tested onthe testing set. We first use the training images to train fivebackground classifiers for background identification. Then, wetest our POM method on both the testing set and the full GDSdata set.

1014 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 21, NO. 3, MARCH 2012

TABLE ISEGMENTATION ACCURACY SCORE ON GDS

We choose the method proposed by Martin [25] as the mea-surement for segmentation accuracy. A segmentation accuracyscore is defined as

(11)

where and represent the set of pixels in the groundtruth segment of an object and the machine-generated objectsegment, respectively.Because all the images in this data set are downsized to 320

pixels 240 pixels, we set the parameters of Felzenszwalb’s al-gorithm [29] to small values ( , , and )to generate the initial superpixels from the input images. Wefound that Felzenszwalb’s algorithm with this set of parametersworks well for small size images (320 240). We set parame-ters , , and for our POM. We usedthe 572 training images to learn five binary Adaboost [59] clas-sifiers to identify five background object classes (i.e., sky, road,grass, trees, and water). The software of our work will be avail-able from the first author’s website.Table I compares the performance of our method with that of

the baseline method (Gould09) on the GDS. The table presentsthe segmentation accuracy on 8-class and overall objects (av-erage), respectively . The segmentation accuracy measurementis based on (11). For each class, the score is averaged over allthe salient object segments in the class. For overall objects, thescore is averaged over all the detected salient object segments.If the size of a ground truth object segment is smaller than 0.5%of the image size, it is not a salient object and will not be ac-counted for in the segmentation accuracy. In total, we detected2757 salient objects from 143 testing images and, on average, 19objects per image. We are able to achieve an average improve-ment of 16.2% over the performance of the Gould09 method.Among 2757 salient objects detected in the testing images, thestructured objects (buildings foregrounds) account for 52.6%.Our method significantly outperforms the Gould09 method onsegmenting the structured objects. For the full data set, we de-tected 13 430 salient objects from 715 images and, on average,18.8 objects per image. Structured objects account for 54.8%of the total detected salient objects. For the structured objects,POM does not gain any prior knowledge from training images.Our POM achieves very stable performance on segmenting thedifficultly structured objects on the full data set. This shows thatour POM can successfully handle various structured objects ap-pearing in outdoor scenes. The last column in Table I presentsthe pixel-level accuracy. Pixel-level accuracy reflects how ac-curate the classification is for multiclass segmentation methods.Pixel-level accuracy is computed as the percentage of image

pixels assigned to correct class label (see [40] for detailed def-inition). Our POM is not a multiclass segmentation method be-cause it does not label each pixel of an image with one of eightsemantic classes as Gould09 (see the last row in Fig. 5). There-fore, our POM does not have pixel-level accuracy.Gould09 seems to be adaptable to the variation of the number

of semantic classes. The method achieved 70.1% pixel-levelaccuracy on the 21-class MSRC database according to [54] andachieved impressive 75.4% pixel-level accuracy on the 8-classGDS. However, the foreground class in GDS includes a widevariety of structured object classes such as cars, buses, people,signs, sheep, cows, bicycles, and motorcycles, which have to-tally different appearance and shape characteristics. This makestraining an accurate classifier for classifying the foregroundclasses difficult. As a result, the Gould09 method cannot handlecomplicated environments where multiple foreground objectsmay appear close to each other. In such cases, the Gould09method often labeled the whole group of physically differentobject instances such as people, car, and sign as one continuousforeground class region (see Fig. 5 for examples). This affectsthe performance of Gould09 on the object-level segmentation.If the foreground class can be further divided into more se-mantic object classes, the performance of the Gould09 methodcan be expected to improve on the GDS.The small number of semantic classes does not affect our

method. Our method only requires identifying five backgroundobject classes (i.e., sky, trees, road, grass, and water). The re-maining object classes are treated as structured objects. OurPOM can handle many structured objects without recognizingthem. From this perspective, our method is easy to train com-pared with the class segmentation methods in the literature.To gain a qualitative perspective of the performance of the

two testing methods on this data set, we present several rep-resentative images, along with the ground truth segmentations,our method’s results, and the results of the Gould09 method inFig. 5.The first example (see the first column in Fig. 5) contains

a nearly centered person with vegetation backgrounds. TheGould09 method classified some part of the centered person asforeground. Our method pieced the major portion of the persontogether. The second example is a typical street scene thatcontains several structured objects (e.g., vehicle and building)and background objects. These structured objects are physicallyseparated in the image. Our method segmented most of thevehicles and the buildings, whereas the Gould09 method er-roneously merged the physically separated buildings together.The third image is a cluttered street scene that contains severalvehicles parked in front of a building. These vehicles closely

CHENG et al.: IMAGE SEGMENTATION BASED ON BACKGROUND RECOGNITION AND PERCEPTUAL ORGANIZATION 1015

Fig. 5. Examples of our POM segmentation algorithm on the GDS. (Row 1) Input images. (Row 2) Ground truth segmentations. (Row 3) POM (ours) results.(Row 4) Gould09’s results. (Row 5) Gould09’s class segmentation results. (This figure is best viewed in color.)

stay to each other. Our method segmented most of the vehiclesand the background buildings well. Gould09 misclassifiedsome parts of the buildings as foreground and merges the wholegroup of vehicles together.

B. Berkeley Segmentation Data set

Furthermore, we evaluate our POM image segmentationmethod on the Berkeley segmentation data set (BSDS) [60].BSDS contains a training set of 200 images and a test set of100 images. For each image, BSDS provides a collection ofhand-labeled segmentations from multiple human subjects as

ground truth. BSDS has been widely used as a benchmarkfor many boundary detection and segmentation algorithms intechnical literature.We directly evaluate our POM method on the test set of

BSDS. The sizes of images in this data set are 481 321 ,which are larger than the sizes of images in GDS. We use largerparameters ( , , and ) for Felzen-szwalb’s algorithm [29] to generate the initial superpixels for aninput image. We still set parameters , , and

for our POM. We use the same background classifierstrained in the GDS data set to identify background objects inthis data set.

1016 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 21, NO. 3, MARCH 2012

Fig. 6. Examples of our POM segmentation algorithm on the BSDS data set. (This figure is best viewed in color.)

We apply both region-based and boundary-based measure-ments to evaluate our POM method on the test set of BSDS.The region-based segmentation accuracy measurement is stillbased on (11). For each image, BSDS provides a collection ofmultiple human-labeled segmentations. For simplicity, we onlyselect the first human-labeled segmentation of the collection asground truth for the image. The score is averaged over all thesalient object segments. If the size of a ground truth segmentsize is smaller than % 0.5 of the image size, it is not a salientobject and will not be accounted for segmentation accuracy. Intotal, we detect 681 salient objects from 100 images and, on av-erage, 6.8 objects per image. Our POM achieved an averagedsegmentation accuracy score of 53% on the test set of BSDS.For the boundary-based measurement, we use the preci-

sion–recall framework recommended by BSDS. A preci-sion–recall curve is a parameterized curve that captures thetradeoff between accuracy and noise. Precision is the fractionof detections that are true boundaries, whereas recall is thefraction of true boundaries that are detected. Thus, precisionis the probability that the segmentation algorithm’s signal isvalid, and recall is the probability that the ground truth datais detected. These two quantities can be combined in a singlequality measure, i.e., F-measure, defined as the weightedharmonic mean of precision and recall. Boundary detectionalgorithms usually generate a soft boundary map for an image.A soft boundary map consists of one-pixel-wide boundariesvalued from zero to one where high values signify greater con-fidence in the existence of a boundary. By thresholding the softboundary map with multilevel and computing precision andrecall at each level, a precision–recall curve can be generatedfor a boundary detection algorithm. Different from boundary

detection algorithms, our POM segmentation method gener-ates a binary boundary map. Therefore, the precision–recallcurve for POM degenerates into a single point. POM achieves

with Precision and Recallon the BSDS data set. Our POM outperforms the global proba-bility of boundary [61], which is a boundary detection algorithmachieving the best performance with onthe BSDS data set. The boundary detection algorithm rankingand the software generating precision–recall curve on BSDScan be found on the web page of BSDS [60]. Some examplesof our method’s results on BSDS are presented in Fig. 6.

C. Limitation of Our Method

Fig. 7 shows some bad examples of our method’s results.There are still some mistakes that we would like to address inthe future. The segmentation of our POM is mainly based onthe geometric relationships between different object parts. Thisrequires obtaining the geometric properties (e.g., shape, size,etc.) of object parts. We assume that object parts have nearly ho-mogenous surfaces, and hence, the uniform regions in an imagecorrespond to object parts. Although this assumption holds inmost cases, there are still some exceptions. For example, inFig. 7(a), the black car body is painted into different patterns.As a result, the car body is oversegmented to many small parts.Under this situation, our POM could not detect any special re-lationships between the small parts and hence could not piecethem together. Similar situations can be found on the woman’sclothing in Fig. 7(b) and the leopard in Fig. 7(c).Some object classes such as bicycles, motorcycles, or build-

ings have very complex structures. Some parts of the objectsare not strongly attached to other parts. For these object classes,

CHENG et al.: IMAGE SEGMENTATION BASED ON BACKGROUND RECOGNITION AND PERCEPTUAL ORGANIZATION 1017

Fig. 7. Examples of where our POM segmentation algorithm makes mistakes. (This figure is best viewed in color.)

our POM may not be able to piece the whole object together.Instead, it may only piece some semantically meaningful partsof the objects together [see Fig. 7(d)]. For these objects, higherlevel object-specific knowledge is still required to segment theentire objects.Another problem is caused by strong reflection. An example

is shown in Fig. 7(e). Due to strong reflection, the upper rearpart of the blue bus shows extremely bright white color. Ourmethod identified the region as sky and hence did not piece thepart with the bus. In some cluttered environments, one struc-tured object may stand in front of another structured object.From some viewpoints, some parts of the front object may coin-cidently have special geometric relationships with some parts ofthe background objects . Under these situations, our POM maybe confused and merge these parts together [see the car in themiddle in Fig. 7(f)]). This problem can be addressed by recog-nizing the background structured objects. Currently, our methodcan only identify five homogenous background object classes(i.e., sky, road, trees, grass, and water). The mountains, build-ings, and walls are also common background objects in outdoorscenes. We plan to enhance the background identification capa-bility of our method in the future by training more classifiersto identify mountains, buildings, walls, etc. With the ability ofidentifying more background object classes, the performance ofour method can be expected to be further improved.

V. CONCLUSION AND DISCUSSION

We have presented a novel image segmentation algorithmfor outdoor natural scenes. Our main contribution is that we de-velop a POM. Our experimental results show that our proposed

method outperformed two competing state-of-the-art imagesegmentation approaches (Gould09 [49] and global probabilityof boundary [61]) and achieved good segmentation quality ontwo challenging outdoor scene image data sets (GDS [52] andBSDS [60]).It is well accepted that segmentation and recognition should

not be separated and should be treated as an interleaving pro-cedure. Our method basically follows this scheme and requiresidentifying some background objects as a starting point. Com-pared to the large number of structured object classes, thereare only a few common background objects in outdoor scenes.These background objects have low visual variety and hencecan be reliably recognized. After background objects are iden-tified, we roughly know where the structured objects are anddelimit perceptual organization in certain areas of an image.For many objects with polygonal shapes, such as the major ob-ject classes appearing in the streets (e.g., buildings, vehicles,signs, people, etc.) and many other objects, our method canpiece the whole object or the main portions of the objects to-gether without requiring recognition of the individual objectparts. In other words, for these object classes, our method pro-vides a way to separate segmentation and recognition. This isthe major difference between our method and other class seg-mentation methods that require recognizing an object in orderto segment it. This paper shows that, for many fairly articulatedobjects, recognition may not be a requirement for segmentation.The geometric relationships of the constituent parts of the ob-jects provide useful cues indicating the memberships of theseparts.

1018 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 21, NO. 3, MARCH 2012

REFERENCES

[1] T. Malisiewicz and A. A. Efros, “Improving spatial support for objectsvia multiple segmentations,” in Proc. BMVC, 2007.

[2] E. Borenstein and E. Sharon, “Combining top-down and bottom-upsegmentation,” in Proc. IEEEWorkshop Perceptual Org. Comput. Vis.,CVPR, 2004, pp. 46–53.

[3] X. Ren, “Learning a classification model for segmentation,” in Proc.IEEE ICCV, 2003, vol. 1, pp. 10–17.

[4] U. Rutishauser and D. Walther, “Is bottom-up attention useful for ob-ject recognition?,” in Proc. IEEE CVPR, 2004, vol. 2, pp. 37–44.

[5] P. Felzenszwalb and D. Huttenlocher, “Efficient graph-based imagesegmentation,” Int. J. Comput. Vis., vol. 59, no. 2, pp. 167–181, Sep.2004.

[6] J. B. Shi and J. Malik, “Normalized cuts and image segmentation,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 888–905,Aug. 2000.

[7] T. Cour, F. Benezit, and J. B. Shi, “Spectral segmentation with mul-tiscale graph decomposition,” in Proc. IEEE CVPR, 2005, vol. 2, pp.1124–1131.

[8] D. Lowe, Perceptual Organization and Visual Recognition. Dor-drecht, The Netherlands: Kluwer, 1985.

[9] J. D. McCafferty, Human and Machine Vision: Computing PerceptualOrganization. Chichester, U.K.: Ellis Horwood, 1990.

[10] R. Mohan and R. Nevatia, “Perceptual organization for scene segmen-tation and description,” IEEE Trans. Pattern Anal. Mach. Intell., vol.14, no. 6, pp. 616–635, Jun. 1992.

[11] A. Witkin and J. Tenenbaum, “On the role of structure in vision,” inHuman and Machine Vision, J. Beck, B. Hope, and A. Rosenfeld,Eds. New York: Academic, 1983.

[12] D. Hoiem, A. A. Efros, and M. Hebert, “Geometric context from asingle image,” in Proc. IEEE ICCV, 2005, vol. 1, pp. 654–661.

[13] D. Hoiem, A. N. Stein, A. A. Efros, and M. Hebert, “Recovering oc-clusion boundaries from a single image,” in Proc. IEEE ICCV, 2007,pp. 1–8.

[14] D. Hoiem, A. A. Efros, and M. Hebert, “Recovering surface layoutfrom an image,” Int. J. Comput. Vis., vol. 75, no. 1, pp. 151–172, Oct.2007.

[15] H. Su, A. Bouridane, and D. Crookes, “Scale adaptive complexity mea-sure of 2-D shapes,” in Proc. IEEE ICPR, 2006, pp. 134–137.

[16] B. Vasselle and G. Giraudon, “2-D digital curve analysis: A regularitymeasure,” in Proc. IEEE ICCV, 1993, pp. 556–561.

[17] K. Plataniotis and A. Venetsanopoulos, Color Image Processing andApplications. Berlin, Germany: Springer-Verlag, 2000, ch. 1, pp.268–269.

[18] T. Brinkhoff, H. P. Kriegel, and R. Schneider, “Measuring the com-plexity of polygonal objects,” in Proc. 3rd ACM Int. Workshop, 1995,pp. 109–117.

[19] D. D. Hoffman and M. Singh, “Salience of visual parts,” Cognition,vol. 63, no. 1, pp. 29–78, Apr. 1997.

[20] D. W. Jacobs, “Robust and efficient detection of salient convexgroups,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 18, no. 1, pp.23–37, Jan. 1996.

[21] I. H. Jermyn and H. Ishikawa, “Globally optimal regions and bound-aries as minimum ratio weight cycles,” IEEE Trans. Pattern Anal.Mach. Intell., vol. 23, no. 10, pp. 1075–1088, Oct. 2001.

[22] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman, La-belMe: A database and web-based tool for image annotation MIT,Cambridge, MA, Tech. Rep. MIT-CSAIL-TR-2005-056, 2005.

[23] B. C. Russell, “Using multiple segmentations to discover objects andtheir extent in image collections,” in Proc. IEEE CVPR, 2006, vol. 2,pp. 1605–1614.

[24] D. R. Martin, C. C. Fowlkes, and J. Malik, “Learning to detect naturalimage boundaries using local brightness, color, and texture cues,” IEEETrans. Pattern Anal. Mach. Intell., vol. 26, no. 5, pp. 530–549, May2004.

[25] D. R. Martin, “An empirical approach to grouping and segmentation,”Ph.D. dissertation, U.C. Berkeley, Berkeley, CA, 2002.

[26] R. Unnikrishnan, C. Pantofaru, and M. Hebert, “A measure for ob-jective evaluation of image segmentation algorithm,” in Proc. IEEECVPR, 2005, vol. 3, pp. 34–41.

[27] H. Zhang, S. Cholleti, S. A. Goldman, and J. E. Fritts, “Meta-evaluationof image segmentation using machine learning,” in Proc. IEEE CVPR,2006, vol. 1, pp. 1138–1145.

[28] F. J. Estrada and A. D. Jepson, “Quantitative evaluation of a novelimage segmentation algorithm,” in Proc. IEEE CVPR, 2005, vol. 2,pp. 1132–1139.

[29] S. C. Zhu and A. Yuille, “Region competition: Unifying snakes, regiongrowing and Bayes/MDL for multi-band image segmentation,” IEEETrans. Pattern Anal. Mach. Intell., vol. 18, no. 9, pp. 884–900, Sep.1996.

[30] S. K. Shah, “Performance modeling and algorithm characterization forrobust image segmentation,” Int. J. Comput. Vis., vol. 80, no. 1, pp.92–103, Oct. 2008.

[31] G. Griffin, A. Holub, and P. Perona, Caltech-256 object categorydataset California Inst. Technol., Pasadena, CA, Tech. Rep. 7694,2007.

[32] Z. L. Liu, D. W. Jacobs, and R. Basri, “The role of convexity in per-ceptual completion: Beyond good continuation,” Vis. Res., vol. 39, no.25, pp. 4244–4257, Dec. 1999.

[33] D. W. Jacobs, “What makes viewpoint-invariant properties perceptu-ally salient?,” J. Opt. Soc. Amer. A, Opt. Image Sci., vol. 20, no. 7, pp.1304–1320, Jul. 2003.

[34] V. Bruce and P. Green, Visual Perception: Physiology, Psychology andEcology. Hillsdale, NJ: Lawrence Erlbaum Associates Ltd., 1990.

[35] Z. Wu and R. Leahy, “An optimal graph theoretic approach to dataclustering: Theory and its application to image segmentation,” IEEETrans. Pattern Anal. Mach. Intell., vol. 15, no. 11, pp. 1101–1113, Nov.1993.

[36] P. Dollar, Z.W. Tu, and S. Belongie, “Supervised learning of edges andobject boundaries,” in Proc. IEEE CVPR, 2006, vol. 2, pp. 1964–1971.

[37] S. Mahamud, L. R. Williams, K. K. Thornber, and K. Xu, “Segmenta-tion of multiple salient closed contours from real images,” IEEE Trans.Pattern Anal. Mach. Intell., vol. 25, no. 4, pp. 433–444, Apr. 2003.

[38] X. F. Ren, C. C. Fowlkes, and J. Malik, “Learning probabilistic modelsfor contour completion in natural images,” Int. J. Comput. Vis., vol. 77,no. 1–3, pp. 47–63, May 2008.

[39] A. Desolneux, L. Moisan, and J. M. Morel, “A grouping principle andfour applications,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 25,no. 4, pp. 508–513, Apr. 2003.

[40] J. Shotton, J. Winn, C. Rother, and A. Criminisi, “Textonboost forimage understanding: Multi-class object recognition and segmentationby jointly modeling texture, layout, and context,” Int. J. Comput. Vis.,vol. 81, no. 1, pp. 2–23, Jan. 2009.

[41] J. Winn, A. Criminisi, and T. Minka, “Categorization by learneduniversal visual dictionary,” in Proc. IEEE ICCV, 2005, vol. 2, pp.1800–1807.

[42] B. Micusik and J. Kosecka, “Semantic segmentation of street scenes bysuperpixel co-occurrence and 3-D geometry,” in Proc. IEEE WorkshopVOEC, 2009.

[43] J. Friedman, T. Hastie, and R. Tibshirani, “Additive logistic regres-sion: A statistical view of boosting,” Ann. Statist., vol. 38, no. 2, pp.337–374, Apr. 2000.

[44] J. Malik, S. Belongie, T. Leung, and J. Shi, “Contour and texture anal-ysis for image segmentation,” Int. J. Comput. Vis., vol. 43, no. 1, pp.7–27, Jun. 2001.

[45] T. Chan and L. Vese, “Active contours without edges,” IEEE Trans.Image Process., vol. 10, no. 2, pp. 266–277, Feb. 2001.

[46] L. A. Vese and T. F. Chan, “Amultiphase level set framework for imagesegmentation using the Mumford and Shah model,” Int. J. Comput.Vis., vol. 50, no. 3, pp. 271–293, Dec. 2002.

[47] D. Mumford and J. Shah, “Optimal approximations by piecewisesmooth functions and associated variational problems,” Commun.Pure Appl. Math., vol. 42, no. 5, pp. 577–685, Jul. 1989.

[48] D. Comaniciu and P. Meer, “Mean shift: A robust approach towardfeature space analysis,” IEEE Trans. Pattern Anal. Mach. Intell., vol.24, no. 5, pp. 603–619, May 2002.

[49] S. Gould, O. Russakovsky, I. Goodfellow, P. Baumstarck, A. Y. Ng,and D. Koller, The STAIR Vision Library (v2.3) 2009 [Online]. Avail-able: http://ai.stanford.edu/~sgould/svl

[50] J. Shotton, M. Johnson, and R. Cipolla, “Semantic texton forests forimage categorization and segmentation,” in Proc. IEEE CVPR, 2008,pp. 1–8.

[51] C. Pantofaru, C. Schmid, and M. Hebert, “Object recognition by in-tegrating multiple image segmentations,” in Proc. ECCV, 2008, pp.481–494.

[52] S. Gould, R. Fulton, and D. Koller, “Decomposing a scene intogeometric and semantically consistent regions,” in Proc. IEEE ICCV,2009, pp. 1–8.

[53] L. Yang, P. Meer, and D. J. Foran, “Multiple class segmentation usinga unified framework over man-shift patches,” in Proc. IEEE CVPR,2007, pp. 1–8.

[54] S. Gould, J. Rodgers, D. Cohen, G. Elidan, and D. Koller, “Multi-classsegmentation with relative location prior,” Int. J. Comput. Vis., vol. 80,no. 3, pp. 300–316, Dec. 2008.

CHENG et al.: IMAGE SEGMENTATION BASED ON BACKGROUND RECOGNITION AND PERCEPTUAL ORGANIZATION 1019

[55] X. He, R. Zemel, and M. Carreira-Perpinan, “Multiscale CRFs forimage labeling,” in Proc. IEEE CVPR, 2004, pp. 695–702.

[56] C. Cheng, A. Koschan, D. L. Page, and M. A. Abidi, “Scene imagesegmentation based on perception organization,” in Proc. IEEE ICIP,2009, pp. 1801–1804.

[57] B. C. Russell, A. B. Torralba, K. P. Murphy, and W. T. Freeman, “La-belMe: A database and web-based tool for image annotation,” Int. J.Comput. Vis., vol. 77, no. 1–3, pp. 157–173, May 2008.

[58] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A.Zisserman, The PASCAL Visual Object Classes Challenge 2007(VOC2007) Results. 2007.

[59] [Online]. Available: http://graphics.cs.msu.ru/ru/science/research/ma-chinelearning/adaboosttoolbox

[60] D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human seg-mented natural images and its application to evaluating segmentationalgorithms and measuring ecological statistics,” in Proc. IEEE ICCV,2001, vol. 2, pp. 416–423.

[61] M.Maire, P. Arbelaez, C. C. Fowlkes, and J. Malik, “Using contours todetect and localize junctions in natural images,” in Proc. IEEE CVPR,2008, pp. 1–8.

Chang Cheng received the B.E. degree in electron-ical machinery from Hangzhou Dianzi University,Hangzhou, China, in 1995, the M.E. degree in com-puter science from Southeast University, Nanjing,China, in 2001, and the Ph.D. degree in electricalengineering from The University of Tennessee,Knoxville, in 2010.He is currently a Technical Staff with Riverbed

Technology, Sunnyvale, CA. His research interestsare in image processing, computer vision, andmobile robotics.

Andreas Koschan (M’90) received the Diplom(M.S.) degree in computer science and the Dr.-Ing.(Ph.D.) degree in computer engineering from Tech-nical University Berlin, Berlin, Germany, in 1985and 1991, respectively.Currently, he is a Research Associate Professor

with the Department of Electrical and Computer En-gineering, The University of Tennessee, Knoxville.His work focused on color image processing and3-D computer vision, including stereo vision andlaser range-finding techniques. He is a coauthor of

four textbooks on color and 3-D image processing.Dr. Koschan is a member of the Society for Imaging Science and Technology

(IS&T).

Chung-Hao Chen received the B.S. and M.S.degrees in computer science and information engi-neering from Fu Jen Catholic University, New TaipeiCity, Taiwan, in 1997 and 2001, respectively, andthe Ph.D. degree in electrical engineering from TheUniversity of Tennessee, Knoxville, in 2009.In 2009, he joined the Department of Mathematics

and Computer Science, North Carolina Central Uni-versity, Durham, NC, as an Assistant Professor andretained this position until 2011. He is currently anAssistant Professor with the Department of Electrical

and Computer Engineering, Old Dominion University, Norfolk, VA. His re-search interests include object tracking, robotics and image processing.

David L. Page received the B.S. and M.S. degrees inelectrical engineering from Tennessee TechnologicalUniversity, Cookeville, in 1993 and 1995, respec-tively, and the Ph.D. degree from The University ofTennessee, Knoxville, in 2003, through the Imaging,Robotics, and Intelligent Systems Laboratory.He was a Civilian Research Engineer with the

Naval Surface Warfare Center, Dahlgren, VA. Hehas also served as a Research Assistant Professorwith the Imaging, Robotics, and Intelligent SystemsLaboratory, The University of Tennessee, until 2008

. Within this laboratory, he was involved with a variety of research topicsranging from robotic vision systems to multivideo security systems. He is cur-rently a Partner and Chief 3-D Architect with Third Dimension TechnologiesLLC, Knoxville, TN, a technology start-up company developing revolutionary3-D displays. In this capacity, David serves as the Lead Scientist for algorithmdevelopment of 3-D rendering and computer-vision-based calibration.

Mongi A. Abidi received the Principal Engineeringdegree in electrical engineering from the NationalEngineering School of Tunis, Tunisia, in 1981 andthe M.S. and Ph.D. degrees in electrical engineeringfrom The University of Tennessee, Knoxville, in1985 and 1987, respectively.He is a Professor with the Department of Elec-

trical and Computer Engineering, The University ofTennessee, where he directs activities in the Imaging,Robotics, and Intelligent Systems Laboratory as anAssociate Department Head . He conducts research

in the field of 3-D imaging, specifically in the areas of scene building, scenedescription, and data visualization.