
Journal of Machine Learning Research 22 (2021) 1-37 Submitted 2/18; Revised 7/19; Published 4/21

As You Like It: Localization via Paired Comparisons

Andrew K. Massimino [email protected]
Mark A. Davenport [email protected]
School of Electrical and Computer Engineering
Georgia Institute of Technology
777 Atlantic Dr NW
Atlanta, GA 30332 USA

Editor: Francis Bach

Abstract

Suppose that we wish to estimate a vector x from a set of binary paired comparisons of the form β€œx is closer to p than to q” for various choices of vectors p and q. The problem of estimating x from this type of observation arises in a variety of contexts, including nonmetric multidimensional scaling, β€œunfolding,” and ranking problems, often because it provides a powerful and flexible model of preference. We describe theoretical bounds for how well we can expect to estimate x under a randomized model for p and q. We also present results for the case where the comparisons are noisy and subject to some degree of error. Additionally, we show that under a randomized model for p and q, a suitable number of binary paired comparisons yield a stable embedding of the space of target vectors. Finally, we also show that we can achieve significant gains by adaptively changing the distribution used for choosing p and q.

Keywords: paired comparisons, ideal point models, recommendation systems, 1-bit sensing, binary embedding

1. Introduction

In this paper we consider the problem of determining the location of a point in Euclidean space based on distance comparisons to a set of known points, where our observations are nonmetric. In particular, let x ∈ R𝑛 be the true position of the point that we are trying to estimate, and let (p₁, q₁), . . . , (pπ‘š, qπ‘š) be pairs of β€œlandmark” points in R𝑛 which we assume to be known a priori. Rather than directly observing the raw distances from x, i.e., β€–x βˆ’ p𝑖‖ and β€–x βˆ’ q𝑖‖, we instead obtain only paired comparisons of the form β€–x βˆ’ p𝑖‖ < β€–x βˆ’ q𝑖‖. Our goal is to estimate x from a set of such inequalities. Nonmetric observations of this type arise in numerous applications and have seen considerable interest in recent literature, e.g., (Ailon, 2011; Davenport, 2013; Eriksson, 2013; Shah et al., 2016). These methods are often applied in situations where we have a collection of items and hypothesize that it is possible to embed the items in R𝑛 in such a way that the Euclidean distance between points corresponds to their β€œdissimilarity,” with small distances corresponding to similar items.

Here, we focus on the sub-problem of adding a new point to a known (or previously learned) configuration of landmark points.

Β© 2021 Andrew K. Massimino and Mark A. Davenport.
License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v22/18-105.html.

Figure 1: An illustration of the localization problem from paired comparisons. The information that x is closer to p𝑖 than q𝑖 tells us which side of a hyperplane x lies on. Through many such comparisons we can hope to localize x to a high degree of accuracy.

As a motivating example, we consider the problem of estimating a user’s preferences from limited response data. This is useful, for instance, in recommender systems, information retrieval, targeted advertising, and psychological studies. A common and intuitively appealing way to model preferences is via the ideal point model, which supposes preference for a particular item varies inversely with Euclidean distance in a feature space (Coombs, 1950). We assume that the items to be rated are represented by points p𝑖 and q𝑖 in an 𝑛-dimensional Euclidean space. A user’s preference is modeled as an additional point x in this space (called the individual’s β€œideal point”). This represents a hypothetical β€œperfect” item satisfying all of the user’s criteria for evaluating items.

Using response data consisting of paired comparisons between items (e.g., β€œuser x prefers item p𝑖 to item q𝑖”) is a natural approach when dealing with human subjects since it avoids requiring people to assign precise numerical scores to different items, which is generally a quite difficult task, especially when preferences may depend on multiple factors (Miller, 1956). In contrast, human subjects often find pairwise judgements much easier to make (David, 1963). Data consisting of paired comparisons is often generated implicitly in contexts where the user has the option to act on two (or more) alternatives; for instance they may choose to watch a particular movie, or click a particular advertisement, out of those displayed to them (Radlinski and Joachims, 2007). In such contexts, the β€œtrue distances” in the ideal point model’s preference space are generally inaccessible directly, but it is nevertheless still possible to obtain an estimate of a user’s ideal point.

1.1 Main Results

The fundamental question which interests us in this paper is how many comparisons we need (and how should we choose them) to estimate x to a desired degree of accuracy. Thus, we consider the case where we are given an existing embedding of the items (as in a mature recommender system) and focus on the on-line problem of locating a single new user from their feedback (consisting of binary data generated from paired comparisons). The item embedding could be generated using various methods, such as multidimensional scaling applied to a set of item features, or even using the results of previous paired comparisons via an approach like that in (Agarwal et al., 2007). Given such an embedding of β„“ items, there are a total of (β„“ choose 2) = Θ(β„“Β²) possible paired comparisons. Clearly, in a system with thousands (or more) items, it will be prohibitive to acquire this many comparisons as a typical user will likely only provide comparisons for a handful of items. Fortunately, in general we can expect that many, if not most, of the possible comparisons are actually redundant. For example, of the comparisons illustrated in Fig. 1, all but four are redundant andβ€”at least in the absence of noiseβ€”add no additional information.

Any precise answer to this question would depend on the underlying geometry of the item embedding. Each comparison essentially divides R𝑛 in two, indicating on which side of a hyperplane x lies, and some arrangements of hyperplanes will yield better tessellations of the preference space than others. Thus, to gain some intuition on this problem without reference to the geometry of a particular embedding, we will instead consider a probabilistic model where the items are generated at random from a particular distribution. In this case we show that under certain natural assumptions on the distribution, it is possible to estimate the location of any x to within an error of πœ– using a number of comparisons which, up to log factors, is proportional to 𝑛/πœ–. This is essentially optimal, so that no set of comparisons can provide a uniform guarantee with significantly fewer comparisons. We then describe several stability and robustness guarantees for various settings in which the comparisons are subject to noise or errors. Finally, we describe a simple extension to an adaptive scheme where we adaptively select the comparisons (manifested here in adaptively altering the mean and variance of the distribution generating the items) to substantially reduce the required number of comparisons.

1.2 Related Work

It is important to note that the ideal point model, while similar, is distinct from the low-rank model used in matrix completion (Rennie and Srebro, 2005; CandΓ¨s and Recht, 2009). Although both models suppose user choices are guided by a number of attributes, the ideal point model leads to preferences that are non-monotonic functions of those attributes. The ideal point model suggests that each feature has an ideal level; too much of a feature can be just as undesirable as too little. It is not possible to obtain this kind of performance with a traditional low-rank model, though if points are limited to the sphere, then the ideal point model can duplicate the performance of a low-rank factorization. There is also empirical evidence that the ideal point model captures behavior more accurately than factorization based approaches do (Dubois, 1975; Maydeu-Olivares and BΓΆckenholt, 2009).

There is a large body of work that studies the problem of learning to rank items from various sources of data, including paired comparisons of the sort we consider in this paper. See, for example, (Jamieson and Nowak, 2011a,b; Wauthier et al., 2013) and references therein. We first note that in most work on rankings, the central focus is on learning a correct rank-ordered list for a particular user, without providing any guarantees on recovering a correct parameterization for the user’s preferences as we do here. While these two problems are related, there are natural settings where it might be desirable to guarantee an accurate recovery of the underlying parameterization (x in our model). For example, one could exploit these guarantees in the context of an iterative algorithm for nonmetric multidimensional scaling which aims to refine the underlying embedding by updating each user and item one at a time (e.g., see O’Shaughnessy and Davenport, 2016), in which case an understanding of the error in the estimate of x is crucial. Moreover, we believe that our approach provides an interesting alternative perspective as it yields natural robustness guarantees and suggests simple adaptive schemes.

Also closely related is the work in (Lu and Negahban, 2015; Park et al., 2015; Oh et al., 2014) which consider paired comparisons and more general ordinal measurements in the similar (but as discussed above, subtly different) context of low-rank factorizations. Perhaps most closely related to our work is that of (Jamieson and Nowak, 2011a), which examines the problem of learning a rank ordering using the same ideal point model considered in this paper. The message in this work is broadly consistent with ours, in that the number of comparisons required should scale with the dimension of the preference space (not the total number of items) and can be significantly improved via a clever adaptive scheme. However, this work does not bound the estimation error in terms of the Euclidean distance, which is our central concern. (Jamieson and Nowak, 2011b) also incorporates adaptivity, but seeks to embed a set of points in Euclidean space (as opposed to estimating a single user’s ideal point) and relies on paired comparisons involving three arbitrarily selected points (rather than a user’s ideal point and two items). Our dyadic adaptive strategy is also similar in spirit to a higher-dimensional form of binary search, as in (Nowak, 2009), however here we consider estimating a continuous ideal point x rather than choosing from a finite set of possible hypotheses.

Constructing an embedding of items given comparison measurements is sometimes referred to as ordinal embedding and is studied in (Kleindessner and Luxburg, 2014), (Jain et al., 2016), and (Arias-Castro et al., 2017). As mentioned, in this work we assume the presence of an item embedding created by, e.g., those methods. Our work could then be used to create a corresponding embedding of users based on response data to perform, e.g., personalization and customer segmentation.

Finally, while seemingly unrelated, we note that our work builds on the growing body of literature of 1-bit compressive sensing. In particular, our results are largely inspired by those in (Knudson et al., 2016; Baraniuk et al., 2017), and borrow techniques from (Jacques et al., 2013) in the proofs of some of our main results. Our embedding result of Section 4.1 is most directly related to the work of (Plan and Vershynin, 2014), which studies a similar problem to (2) but under a different, non-pairwise model. Because of this distinction, their results are not directly applicable. However, due to our particular probabilistic model, our results are also in some sense stronger. We will expand further with a direct comparison to this work during the presentation of our result. The 1-bit sensing problem is also considered in (Plan and Vershynin, 2013b) and (Plan and Vershynin, 2013a), but these works handle only queries involving homogeneous hyperplanes, without offset, which is not applicable to our pairwise setting. Ideas similar to 1-bit sensing appear in the field of locality sensitive hashing, e.g., (Konoshima and Noma, 2012), which does treat inhomogeneous hyperplanes but does not provide theory.

Note that in this work we extend preliminary results first presented in (Massimino and Davenport, 2016, 2017).


2. A Randomized Observation Model

For the moment we will consider the β€œnoise-free” setting where each comparison between x and q𝑖 versus p𝑖 results in assigning the point which is truly closest to x with probability 1. In this case we can represent the observed comparisons mathematically by letting π’œπ‘–(x) denote the 𝑖th observation, which consists of comparisons between p𝑖 and q𝑖, and setting

$$
\mathcal{A}_i(x) := \operatorname{sign}\left(\|x - q_i\|^2 - \|x - p_i\|^2\right) = \begin{cases} +1 & \text{if } x \text{ is closer to } p_i \\ -1 & \text{if } x \text{ is closer to } q_i. \end{cases} \tag{1}
$$

We will also use π’œ(x) := [π’œβ‚(x), Β· Β· Β· , π’œπ‘š(x)]α΅€ to denote the vector of all observations resulting from π‘š comparisons. Note that since

$$
\|x - q_i\|^2 - \|x - p_i\|^2 = 2(p_i - q_i)^T x + \|q_i\|^2 - \|p_i\|^2,
$$

if we set a𝑖 = (p𝑖 βˆ’ q𝑖) and πœπ‘– = (β€–p𝑖‖² βˆ’ β€–q𝑖‖²)/2, then we can re-write our observation model as

$$
\mathcal{A}_i(x) = \operatorname{sign}\left(2a_i^T x - 2\tau_i\right) = \operatorname{sign}\left(a_i^T x - \tau_i\right). \tag{2}
$$

This is reminiscent of the standard setup in one-bit compressive sensing (with dithers) (Knudson et al., 2016; Baraniuk et al., 2017) with the important differences that: (i) we have not made any kind of sparsity or other structural assumption on x and, (ii) the β€œdithers” πœπ‘–, at least in this formulation, are dependent on the a𝑖, which results in difficulty applying standard results from this theory to the present setting.

However, many of the techniques from this literature will nevertheless be helpful in analyzing this problem. To see this, we consider a randomized observation model where the pairs (p𝑖, q𝑖) are chosen independently with i.i.d. entries drawn according to a normal distribution, i.e., p𝑖, q𝑖 ∼ 𝒩(0, 𝜎²I). In this case, we have that the entries of our sensing vectors are i.i.d. with a𝑖(𝑗) ∼ 𝒩(0, 2𝜎²). Moreover, if we define b𝑖 = p𝑖 + q𝑖, then we also have that b𝑖 ∼ 𝒩(0, 2𝜎²I), and

$$
\frac{1}{2}a_i^T b_i = \frac{1}{2}\sum_j (p_i(j) - q_i(j))(p_i(j) + q_i(j)) = \frac{1}{2}\sum_j p_i(j)^2 - q_i(j)^2 = \frac{1}{2}\left(\|p_i\|^2 - \|q_i\|^2\right) = \tau_i.
$$

Note that while πœπ‘– = Β½ a𝑖ᡀb𝑖 is clearly dependent on a𝑖, we do have that a𝑖 and b𝑖 are independent.

To simplify, we re-normalize by dividing by β€–a𝑖‖, i.e., setting a𝑖 := a𝑖/β€–a𝑖‖ and πœπ‘– := πœπ‘–/β€–a𝑖‖, in which case we can write

$$
\mathcal{A}_i(x) = \operatorname{sign}\left(a_i^T x - \tau_i\right). \tag{3}
$$

It is easy to see that a𝑖 is distributed uniformly on the sphere Sπ‘›βˆ’ΒΉ = {a ∈ R𝑛 : β€–aβ€– = 1}. Note that throughout our analysis we will exploit the fact that a𝑖 is uniform on Sπ‘›βˆ’ΒΉ and will let 𝜈 denote the uniform measure on the sphere. Note also that

$$
\tau_i = \tfrac{1}{2}a_i^T b_i.
$$


Since the original a𝑖 and b𝑖 are independent, the re-normalized a𝑖 and b𝑖 are also independent. Moreover, for any unit-vector a𝑖, if b𝑖 ∼ 𝒩(0, 2𝜎²I) then a𝑖ᡀb𝑖 ∼ 𝒩(0, 2𝜎²). Thus, we must have πœπ‘– ∼ 𝒩(0, 𝜎²/2), independent of a𝑖, which is the key insight that enables the analysis below.
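To make the observation model concrete, the following Python sketch (an illustration of the model above, not code from the paper) generates random item pairs, forms the normalized hyperplane parameters, and checks empirically that πœπ‘– behaves like 𝒩(0, 𝜎²/2):

```python
import numpy as np

def generate_comparisons(x, m, sigma2, rng):
    """Simulate the randomized observation model of Section 2.

    Each pair (p_i, q_i) has i.i.d. N(0, sigma2) entries; the comparison is
    A_i(x) = sign(||x - q_i||^2 - ||x - p_i||^2) = sign(a_i^T x - tau_i)
    after normalizing a_i = (p_i - q_i)/||p_i - q_i|| and tau_i accordingly.
    """
    n = x.size
    P = rng.normal(0.0, np.sqrt(sigma2), size=(m, n))
    Q = rng.normal(0.0, np.sqrt(sigma2), size=(m, n))
    A = P - Q                                   # un-normalized normal vectors
    tau = 0.5 * (np.sum(P**2, axis=1) - np.sum(Q**2, axis=1))
    norms = np.linalg.norm(A, axis=1)
    A, tau = A / norms[:, None], tau / norms    # re-normalize so ||a_i|| = 1
    y = np.sign(A @ x - tau)                    # observed comparisons in {-1, +1}
    return A, tau, y

rng = np.random.default_rng(0)
n, R = 8, 1.0
sigma2 = 2 * R**2 / n                           # the variance suggested by Theorem 1
x = rng.normal(size=n); x *= R / np.linalg.norm(x)
A, tau, y = generate_comparisons(x, 100000, sigma2, rng)
print(np.var(tau), sigma2 / 2)                  # tau_i is approximately N(0, sigma^2/2)
```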

3. Guarantees in the Noise-Free Setting

We now state a result concerning localization under the noise-free random model from Section 2. Extensions to noisy comparisons and efficient estimation in practice are discussed in Sections 4 and 5, respectively. Let B^n_R denote the 𝑛-dimensional, radius 𝑅 Euclidean ball.

Theorem 1 Let πœ–, πœ‚ > 0 be given. Let π’œπ‘–(Β·) be defined as in (1), and suppose that π‘š pairs{(p𝑖, q𝑖)}π‘šπ‘–=1 are generated by drawing each p𝑖 and q𝑖 independently from 𝒩 (0, 𝜎2𝐼) where𝜎2 = 2𝑅2/𝑛. There exists a constant 𝐢 such that if

π‘š β‰₯ 𝐢𝑅

πœ–

(𝑛 log 𝑅

βˆšπ‘›

πœ–+ log 1

πœ‚

), (4)

then with probability at least 1βˆ’ πœ‚, for all x, y ∈ B𝑛𝑅 such that π’œ(x) = π’œ(y),

β€–xβˆ’ yβ€– ≀ πœ–.

The result follows from applying Lemma 2 below to pairs of points in a covering set of B𝑛𝑅.

The key message of this theorem is that if one chooses the variance 𝜎² of the distribution generating the items appropriately, then it is possible to estimate x to within πœ– using a number of comparisons that is nearly linear in 𝑛/πœ–. As we will show in Theorem 3, this result is optimal, ignoring log factors, in terms of the scaling of π‘š with 𝑛 and πœ–. This also makes intuitive sense; if the hyperplanes were all axis-aligned, one would require the number of hyperplanes to be at least proportional to the number of dimensions, 𝑛, otherwise some direction would be unconstrained. Theorem 1 is also sensible in terms of the ratio of the initial uncertainty 𝑅 to the target uncertainty πœ– since 𝑅/πœ– hyperplanes would be required to uniformly localize a point along a single dimension.
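As a rough illustration of how the bound in (4) scales (treating the unspecified constant 𝐢 as 1, purely for intuition), the following snippet evaluates the right-hand side for a few target accuracies:

```python
import numpy as np

def m_bound(n, R, eps, eta, C=1.0):
    # Right-hand side of (4) with an assumed constant C; only the scaling is meaningful.
    return C * (R / eps) * (n * np.log(R * np.sqrt(n) / eps) + np.log(1.0 / eta))

for eps in [0.5, 0.1, 0.01]:
    print(eps, m_bound(n=10, R=1.0, eps=eps, eta=0.01))
```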

A natural question is what would happen with a different choice of 𝜎². In fact, this assumption is criticalβ€”if 𝜎² is substantially smaller the bound quickly becomes vacuous, and as 𝜎² grows much past 𝑅²/𝑛 the bound begins to become steadily worse.ΒΉ As we will see in Section 6, this is in fact observed in practice. It should also be somewhat intuitive: if 𝜎² is too small, then nearly all the hyperplanes induced by the comparisons will pass very close to the origin, so that accurate estimation of even β€–xβ€– becomes impossible. On the other hand, if 𝜎² is too large, then an increasing number of these hyperplanes will not even intersect the ball of radius 𝑅 in which x is presumed to lie, thus yielding no new information.

ΒΉ We note that it is possible to try to optimize 𝜎² by setting 𝜎² = 𝑐𝑅²/𝑛 for some constant 𝑐 and then selecting 𝑐 so as to minimize the constant 𝐢 in (4). We believe this would yield limited insight since, in order to obtain a result which is valid uniformly for all possible 𝑛, we use certain bounds which for general 𝑛 can be somewhat loose and would skew the resulting 𝑐. We instead simply select 𝑐 = 2 for simplicity in our analysis (as it results in πœπ‘– ∼ 𝒩(0, 𝑅²/𝑛)) and because it aligns well with simulations.

Lemma 2 Let w, z ∈ B^n_R be distinct and fixed, and let 𝛿 > 0 be given. Define

$$
B_\delta(w) := \{u \in B^n_R : \|u - w\| \le \delta\}.
$$

Let π’œπ‘– be defined as in Theorem 1. Denote by 𝑃sep the probability that 𝐡_𝛿(w) and 𝐡_𝛿(z) are separated by hyperplane 𝑖, i.e.,

$$
P_{\mathrm{sep}} := \mathbb{P}\left[\forall u \in B_\delta(w), \forall v \in B_\delta(z) : \mathcal{A}_i(u) \neq \mathcal{A}_i(v)\right].
$$

For any πœ–β‚€ ≀ β€–w βˆ’ zβ€– we have

$$
P_{\mathrm{sep}} \ge \frac{\epsilon_0 - \delta\sqrt{2n}}{22\sqrt{\pi}\,e^{5/2}R}.
$$

Proof Let πœ– = β€–wβˆ’ zβ€–. Here, we denote the normal vector and threshold of hyperplane 𝑖by a and 𝜏 respectively. It is easy to show that 𝑃sep can be expressed as

𝑃sep = P[a𝑇 z + 𝛿 ≀ 𝜏 ≀ a𝑇 wβˆ’ 𝛿 or a𝑇 w + 𝛿 ≀ 𝜏 ≀ a𝑇 zβˆ’ 𝛿

]= 2P

[a𝑇 z + 𝛿 ≀ 𝜏 ≀ a𝑇 wβˆ’ 𝛿

], (5)

where the second equality follows from the symmetry of the distributions of a and 𝜏 .Define 𝐢𝛼 := {a ∈ Sπ‘›βˆ’1 : a𝑇 (wβˆ’ z) β‰₯ 𝛼}. Note that the probability in (5) is zero unless

a ∈ 𝐢2𝛿. Thus, recalling that πœπ‘– ∼ 𝒩 (0, 𝜎2/2) we have

𝑃sep = 2∫

𝐢2𝛿

Ξ¦(

a𝑇 wβˆ’ 𝛿

𝜎/√

2

)βˆ’ Ξ¦

(a𝑇 z + 𝛿

𝜎/√

2

) 𝜈(da)

β‰₯ 2∫

𝐢′

Ξ¦(

a𝑇 wβˆ’ 𝛿

𝜎/√

2

)βˆ’ Ξ¦

(a𝑇 z + 𝛿

𝜎/√

2

) 𝜈(da) (6)

for any 𝐢 β€² βŠ† 𝐢2𝛿. To obtain a lower bound on (6), we will consider a carefully chosen subset𝐢 β€² βŠ† 𝐢2𝛿 and then simply multiply the area of 𝐢 β€² by the minimum value 𝛾 of the integrandover that set, yielding a bound of the form

𝑃sep β‰₯ 2π›Ύπœˆ(𝐢 β€²).

We construct the set 𝐢 β€² as follows. Let π‘Š := {a : a𝑇 w ≀ πœ‰/√

𝑛 β€–wβ€–}, 𝑍 := {a : a𝑇 z β‰₯βˆ’πœ‰/√

𝑛 β€–zβ€–}, and set 𝐢 β€² := 𝐢𝛼 ∩ π‘Š ∩ 𝑍 for some 𝛼 β‰₯ 2𝛿. Note that for any a ∈ 𝐢 β€²,since a𝑇 (w βˆ’ z) β‰₯ 𝛼 β‰₯ 2𝛿, we have βˆ’π‘…πœ‰/

βˆšπ‘› ≀ a𝑇 z + 𝛿 ≀ a𝑇 w βˆ’ 𝛿 ≀ π‘…πœ‰/

βˆšπ‘›. Thus, by

Lemma 11,

𝛾 = infaβˆˆπΆβ€²

Ξ¦(

a𝑇 wβˆ’ 𝛿

𝜎/√

2

)βˆ’ Ξ¦

(a𝑇 z + 𝛿

𝜎/√

2

) β‰₯√

2𝜎

(π›Όβˆ’ 2𝛿)πœ‘(√2π‘…πœ‰

𝜎√

𝑛

).

Recall by assumption we have that 𝜎 =√

2𝑅/√

𝑛, thus we obtain by setting πœ‰ =√

5,

𝛾 β‰₯√

𝑛

𝑅(π›Όβˆ’ 2𝛿)πœ‘(πœ‰) =

βˆšπ‘›(π›Όβˆ’ 2𝛿)√2πœ‹π‘’5/2𝑅

. (7)

Next note that 𝐢 β€² = 𝐢𝛼 βˆ©π‘Š βˆ©π‘ = 𝐢𝛼 βˆ–π‘Š 𝑐 βˆ–π‘π‘ is a difference of a set of hypersphericalcaps. To obtain a lower bound on 𝜈(𝐢 β€²) we use the upper and lower bounds on the measureof hyperspherical caps given in Lemma 2.1 of (Brieden et al., 2001).


Case 𝑛 β‰₯ 6. Provided that 𝛼/πœ– < √(2/𝑛) we can bound 𝜈(𝐢′) as

$$
\nu(C') \ge \nu(C_\alpha) - \nu(W^c) - \nu(Z^c) \ge \frac{1}{12} - 2\cdot\frac{1}{2\xi}\left(1 - \xi^2/n\right)^{(n-1)/2} \ge \frac{1}{12} - \frac{1}{\sqrt{5}\,e^{5/2}},
$$

where the last inequality follows from the fact that (1 βˆ’ π‘₯/𝑛)^(π‘›βˆ’1) ≀ 𝑒^(βˆ’π‘₯) for 𝑛 β‰₯ π‘₯ β‰₯ 2. Combining this with lower estimate (7),

$$
P_{\mathrm{sep}} \ge 2\gamma\,\nu(C') \ge \frac{2\sqrt{n}(\alpha - 2\delta)}{\sqrt{2\pi}\,e^{5/2}R}\cdot\frac{1 - 12e^{-5/2}/\sqrt{5}}{12}.
$$

Setting 𝛼 = 𝛿 + πœ–/√(2𝑛), since 1 βˆ’ 12𝑒^(βˆ’5/2)/√5 > 5/9, we have that

$$
P_{\mathrm{sep}} \ge \frac{2\sqrt{n}\left(\epsilon/\sqrt{2n} - \delta\right)\left(1 - 12e^{-5/2}/\sqrt{5}\right)}{12\sqrt{2\pi}\,e^{5/2}R} \ge \frac{\epsilon - \delta\sqrt{2n}}{22\sqrt{\pi}\,e^{5/2}R}.
$$

Note that this bound holds under the assumption that 𝛼/πœ– < √(2/𝑛), which for our choice of 𝛼 is equivalent to the assumption that πœ– > π›Ώβˆš(2𝑛). However, this bound also holds trivially for all πœ– ≀ π›Ώβˆš(2𝑛), and thus in fact holds for all πœ– β‰₯ 0.

Case 𝑛 ≀ 5. In this case, note that πœ‰/βˆšπ‘› β‰₯ 1, so the sets π‘Š and 𝑍 are the entire sphere. Hence, 𝜈(π‘ŠαΆœ) = 𝜈(π‘αΆœ) = 0 and 𝜈(𝐢′) = 𝜈(𝐢_𝛼) β‰₯ 1/12. Thus,

$$
P_{\mathrm{sep}} \ge 2\gamma\,\nu(C') \ge \frac{\epsilon - \delta\sqrt{2n}}{12\sqrt{\pi}\,e^{5/2}R}.
$$

We obtain the stated lemma by noting πœ–β‚€ ≀ πœ–.

Proof of Theorem 1 Let 𝑃e denote the probability that there exists some x, y ∈ B^n_R with β€–x βˆ’ yβ€– > πœ– and π’œ(x) = π’œ(y). Our goal is to show that 𝑃e ≀ πœ‚. Towards this end, let π‘ˆ be a 𝛿-covering set for B^n_R with |π‘ˆ| ≀ (3𝑅/𝛿)ⁿ. By construction, for any x, y ∈ B^n_R, there exist some w, z ∈ π‘ˆ satisfying β€–x βˆ’ wβ€– ≀ 𝛿 and β€–y βˆ’ zβ€– ≀ 𝛿. In this case, if β€–x βˆ’ yβ€– > πœ– then

$$
\|w - z\| \ge \|x - y\| - 2\delta > \epsilon - 2\delta.
$$

Our goal is to upper bound the probability that there exists some w, z ∈ π‘ˆ with β€–w βˆ’ zβ€– β‰₯ πœ–β‚€ = πœ– βˆ’ 2𝛿 and π’œ(u) = π’œ(v) for some u ∈ 𝐡_𝛿(w) and v ∈ 𝐡_𝛿(z). Said differently, we would like to bound the probability that there exists a w, z ∈ π‘ˆ with β€–w βˆ’ zβ€– β‰₯ πœ–β‚€ for which 𝐡_𝛿(w) and 𝐡_𝛿(z) are not separated by any of the π‘š hyperplanes.

Let π‘ƒπ‘š(w, z) denote the probability that 𝐡𝛿(w) and 𝐡𝛿(z) are not separated by anyof the π‘š hyperplanes for a fixed w, z ∈ π‘ˆ with β€–wβˆ’ zβ€– β‰₯ πœ–0. Lemma 2 controls thisprobability for a single hyperplane, yielding a bound of

1βˆ’ 𝑃sep ≀ 1βˆ’ πœ–0 βˆ’ π›Ώβˆš

2𝑛

22√

πœ‹π‘’5/2𝑅.

Since the (p𝑖, q𝑖) are independent, we obtain

π‘ƒπ‘š(w, z) ≀(

1βˆ’ πœ–0 βˆ’ π›Ώβˆš

2𝑛

22√

πœ‹π‘’5/2𝑅

)π‘š

. (8)


Since we are interested in the event that there exists any w, z ∈ π‘ˆ with β€–w βˆ’ zβ€– β‰₯ πœ–β‚€ for which 𝐡_𝛿(w) and 𝐡_𝛿(z) are separated by none of the π‘š hyperplanes, we use the fact that there are at most (3𝑅/𝛿)Β²ⁿ such pairs w, z and combine a union bound with (8) to obtain

$$
P_e \le \left(\frac{3R}{\delta}\right)^{2n}\left(1 - \frac{\epsilon_0 - \delta\sqrt{2n}}{22\sqrt{\pi}\,e^{5/2}R}\right)^m \le \exp\left(2n\log\frac{3R}{\delta} - \frac{\left(\epsilon_0 - \delta\sqrt{2n}\right)m}{22\sqrt{\pi}\,e^{5/2}R}\right), \tag{9}
$$

which follows from (1 βˆ’ π‘₯) ≀ 𝑒^(βˆ’π‘₯). Bounding the right-hand side of (9) by πœ‚, we obtain

$$
2n\log\frac{3R}{\delta} - \frac{\left(\epsilon_0 - \delta\sqrt{2n}\right)m}{22\sqrt{\pi}\,e^{5/2}R} \le \log\eta. \tag{10}
$$

If we now make the substitutions πœ–β‚€ = πœ– βˆ’ 2𝛿 and 𝛿 = πœ–/(4 + √(8𝑛)), then we have that πœ–β‚€ βˆ’ π›Ώβˆš(2𝑛) = πœ–/2 and thus we can reduce (10) to

$$
2n\log\frac{3R(4 + \sqrt{8n})}{\epsilon} - \frac{\epsilon m}{44\sqrt{\pi}\,e^{5/2}R} \le \log\eta.
$$

By rearranging, we see that this is equivalent to

$$
m \ge 44\sqrt{\pi}\,e^{5/2}\,\frac{R}{\epsilon}\left(2n\log\frac{3R(4 + \sqrt{8n})}{\epsilon} + \log\frac{1}{\eta}\right). \tag{11}
$$

One can easily show that (4) implies (11) for an appropriate choice of 𝐢.

We now show that the result in Theorem 1 is optimal in the sense that any set of comparisons which can guarantee a uniform recovery of all x ∈ B^n_R to accuracy πœ– will require a number of comparisons on the same order as that required in Theorem 1 (up to log factors).

Theorem 3 For any configuration of π‘š (inhomogeneous) hyperplanes in R𝑛 dividing B^n_R into cells, if π‘š < 2𝑅𝑛/(π‘’πœ–), then there exist two points x, y ∈ B^n_R in the same cell such that β€–x βˆ’ yβ€– β‰₯ πœ–.

Proof We will use two facts. First, the number of cells (both bounded and unbounded) defined by π‘š hyperplanes in R𝑛 in general positionΒ² is given by

$$
F_n(m) = \sum_{i=0}^{n}\binom{m}{i} \le \left(\frac{em}{n}\right)^n < \left(\frac{2R}{\epsilon}\right)^n, \tag{12}
$$

where the second inequality follows from the assumption that π‘š < 2𝑅𝑛/(π‘’πœ–).

Β² For non-general position, this is an upper bound (Buck, 1943).

Second, for any convex set 𝐾 we have the isodiametric inequality (Giaquinta and Modica, 2010), where Diam(𝐾) = sup over x, y ∈ 𝐾 of β€–x βˆ’ yβ€–:

$$
\left(\frac{\operatorname{Diam}(K)}{2}\right)^n \frac{\pi^{n/2}}{\Gamma(n/2 + 1)} \ge \operatorname{Vol}(K), \tag{13}
$$

with equality when 𝐾 is a ball. Since the entire volume of B^n_R, denoted Vol(B^n_R), is filled by at most 𝐹_𝑛(π‘š) non-overlapping cells, there must exist at least one such cell 𝐾₀ with

$$
\operatorname{Vol}(K_0) \ge \frac{\operatorname{Vol}(B^n_R)}{F_n(m)} = \frac{\pi^{n/2}}{\Gamma(n/2 + 1)}\,\frac{R^n}{F_n(m)}. \tag{14}
$$

Combining (13) with (14), we obtain

$$
\left(\frac{\operatorname{Diam}(K_0)}{2}\right)^n \ge \frac{R^n}{F_n(m)},
$$

which, together with (12), implies that

$$
\operatorname{Diam}(K_0) \ge \frac{2R}{\sqrt[n]{F_n(m)}} > \epsilon.
$$

Thus there are vectors x, y ∈ 𝐾₀ such that β€–x βˆ’ yβ€– > πœ–.

4. Stability in Noise

So far, we have only considered the noise-free case. In most practical applications, observations may be corrupted by noise. We consider two scenarios; in the first, we make no assumption on the source of the errors and instead show the paired comparison observations are stable with respect to Euclidean distance. That is, two signals that have similar sign patterns are also nearby (and vice-versa). One can view this as a strengthening of the result in Theorem 1. In the second case, Gaussian noise is added prior to the sign(Β·) function in (3). This is equivalent to the Thurstone model (Thurstone, 1927) with the Probit (normal) link.

The results of this section apply without considering any particular recovery method or algorithm and relate to the number of paired comparisons which may be flipped due to the noise, not from reconstruction. We address these concerns by introducing a practical algorithm and associated guarantees in Section 5.

Throughout the following, we denote by 𝑑_𝐻 the Hamming distance, i.e., 𝑑_𝐻 counts the fraction of comparisons which differ between two sets of observations, here denoted π’œ(x) and π’œ(y):

$$
d_H(\mathcal{A}(x), \mathcal{A}(y)) := \frac{1}{m}\sum_{i=1}^{m}\frac{1}{2}\left|\mathcal{A}_i(x) - \mathcal{A}_i(y)\right|. \tag{15}
$$

4.1 Stable Embedding

Here we show that given enough comparisons there is an approximate embedding of the preference space into {βˆ’1, 1}ᡐ via our model. Theorem 4 states that if x and y are sufficiently close, then the respective comparison patterns π’œ(x) and π’œ(y) closely align. In contrast with Theorem 8, Theorem 4 is a purely geometric statement which makes no assumptions on any particular noise model. Note also that Theorem 4 applies uniformly for all x and y.


Theorem 4 Let πœ‚, 𝜁 > 0 be given. Let π’œ(x) denote the collection of π‘š observations definedas in Theorem 1. There exist constants 𝐢1, 𝑐1, 𝐢2, 𝑐2 such that if

π‘š β‰₯ 12𝜁2

(2𝑛 log 3

βˆšπ‘›

𝜁+ log 2

πœ‚

), (16)

then with probability at least 1βˆ’ πœ‚, for all x, y ∈ B𝑛𝑅 we have

𝐢1β€–xβˆ’ yβ€–

π‘…βˆ’ 𝑐1𝜁 ≀ 𝑑𝐻(π’œ(x),π’œ(y)) ≀ 𝐢2

β€–xβˆ’ y‖𝑅

+ 𝑐2𝜁. (17)

This result implies that the fraction of differences in the set of observed comparisons between x and y will be constrained to within a constant factor of the Euclidean distance, plus an additive error approximately proportional to 1/βˆšπ‘š. At first glance, this seems worse than the result of Theorem 1, which suggests the rate 1/π‘š. However, Theorem 4 comes with much greater flexibility in that Theorem 1 only concerns the case where 𝑑_𝐻(π’œ(x), π’œ(y)) = 0. As in Theorem 1, this result applies for all x on the same randomly drawn set of items.

This result is very reminiscent of Theorem 1.10 in (Plan and Vershynin, 2014) which concerns tessellations under uniform random affine hyperplanes generated according to the Haar measure, rather than the particular Gaussian model we study. Compared to that work, our result is much better in terms of the scaling of the lower bound on π‘š with respect to distortion 𝜁 (they predict 1/΢¹² while ours is 1/𝜁²). This is due to both our specific probabilistic model and because we do not use the technique of β€œlifting” the problem to dimension 𝑛 + 1 in our analysis because it would incur additional distortion. On the other hand, the result of (Plan and Vershynin, 2014) is applicable to x lying in an arbitrary convex body whereas our result considers only x within a radius 𝑅 ball.

Unlike in Theorem 1, we are not aware whether the relationship between 𝜁 and π‘š given in Theorem 4 is optimal. This result is related to open questions concerning Dvoretzky’s theorem for embedding β„“β‚‚ into ℓ₁. See the discussion of optimality in Section 1.7 of (Plan and Vershynin, 2014) and Remark 1.6 of (Plan and Vershynin, 2013b).

In the context of a hypothetical recovery problem, suppose x is a parameter of interest and y is an estimate produced by any algorithm. Then, (17) says that if we want to recover x to within error πœ–, the algorithm should look for vectors y which have up to 𝑂(πœ–) incorrect comparisons. Likewise, if a y can be found having up to 𝑂(πœ–) comparison errors, we have the same 𝑂(πœ–) guarantee on the Euclidean error of the estimate. In many cases, such as when errors are generated randomly, Theorem 1 would be inappropriate because finding a y such that 𝑑_𝐻(π’œ(x), π’œ(y)) = 0 is likely to be impossible.

To prove Theorem 4 we will require the following Lemmas 5 and 6.

Lemma 5 Let w, z ∈ B^n_R be distinct and fixed, and let 𝛿 > 0 be given. Let π’œ(x) denote the collection of π‘š observations defined as in Theorem 1, and let 𝐡_𝛿(Β·) be defined as in Lemma 2. Denote by 𝑃₀ the probability that 𝐡_𝛿(w) and 𝐡_𝛿(z) are not separated by hyperplane 𝑖, i.e.,

$$
P_0 = \mathbb{P}\left[\forall u \in B_\delta(w),\ \forall v \in B_\delta(z) : \mathcal{A}_i(u) = \mathcal{A}_i(v)\right].
$$

Then

$$
1 - P_0 \le \sqrt{\frac{2}{\pi}}\left(\frac{\|w - z\|}{R} + \frac{\delta\sqrt{n}}{R}\right).
$$


Proof We need an upper bound on

$$
1 - P_0 = \mathbb{P}\left[\mathcal{A}_i(u) \neq \mathcal{A}_i(v)\ \text{for some } u \in B_\delta(w),\ v \in B_\delta(z)\right].
$$

Suppose for now that a is fixed and without loss of generality that aα΅€w > aα΅€z. Then this probability is simply

$$
\mathbb{P}\left[a^T v < \tau < a^T u\ \text{for some } u \in B_\delta(w), v \in B_\delta(z)\right]
= \mathbb{P}\left[\min_{v \in B_\delta(z)} a^T v < \tau < \max_{u \in B_\delta(w)} a^T u\right]
\le \mathbb{P}\left[a^T z - \delta < \tau < a^T w + \delta\right],
$$

since by Cauchy–Schwarz we have

$$
\min_{v \in B_\delta(z)} a^T v \ge a^T z - \delta \quad \text{and} \quad \max_{u \in B_\delta(w)} a^T u \le a^T w + \delta.
$$

Thus, recalling that πœπ‘– ∼ 𝒩(0, 𝑅²/𝑛), from Lemma 11 we have

$$
\mathbb{P}\left[a^T z - \delta < \tau < a^T w + \delta\right] = \Phi\!\left(\frac{a^T w + \delta}{R/\sqrt{n}}\right) - \Phi\!\left(\frac{a^T z - \delta}{R/\sqrt{n}}\right)
\le \frac{1}{R}\sqrt{\frac{n}{2\pi}}\left(a^T(w - z) + 2\delta\right).
$$

Similarly, for aα΅€w < aα΅€z we have

$$
\mathbb{P}\left[a^T w - \delta < \tau < a^T z + \delta\right] \le \frac{1}{R}\sqrt{\frac{n}{2\pi}}\left(a^T(z - w) + 2\delta\right).
$$

Combining these we have

$$
1 - P_0 \le \int_{S^{n-1}} \frac{1}{R}\sqrt{\frac{n}{2\pi}}\left(|a^T(w - z)| + 2\delta\right)\nu(da)
= \frac{1}{R}\sqrt{\frac{n}{2\pi}}\int_{S^{n-1}} |a^T(w - z)|\,\nu(da) + \frac{2\delta}{R}\sqrt{\frac{n}{2\pi}}
= \frac{\sqrt{2n}}{R\pi}\,\frac{\Gamma\!\left(\frac{n}{2}\right)}{\Gamma\!\left(\frac{n+1}{2}\right)}\,\|w - z\| + \frac{\delta}{R}\sqrt{\frac{2n}{\pi}},
$$

where the last equality is proven in Lemma 12. The lemma then follows from the facts that Ξ“(1/2)/Ξ“(1) = βˆšπœ‹ and Ξ“(𝑛/2)/Ξ“((𝑛 + 1)/2) ≀ 2/√(2𝑛 βˆ’ 1) ≀ √(πœ‹/𝑛) for 𝑛 β‰₯ 2 (Qi, 2010, (2.20)).

Lemma 6 Let w, z ∈ B^n_R be distinct and fixed, and let 𝛿, 𝜁 > 0 be given. Let π’œ(x) denote the collection of π‘š observations defined as in Theorem 1, and let 𝐡_𝛿(Β·) be defined as in Lemma 2. Then for all u ∈ 𝐡_𝛿(w) and v ∈ 𝐡_𝛿(z),

$$
\frac{1}{22e^{5/2}\sqrt{\pi}}\left(\frac{\|w - z\|}{R} - \frac{\delta\sqrt{2n}}{R}\right) - \zeta \ \le\ d_H(\mathcal{A}(u), \mathcal{A}(v)) \ \le\ \sqrt{\frac{2}{\pi}}\left(\frac{\|w - z\|}{R} + \frac{\delta\sqrt{n}}{R}\right) + \zeta,
$$

with probability at least 1 βˆ’ 2 exp(βˆ’2𝜁²π‘š).


Proof Fix 𝛿 > 0 and let u ∈ 𝐡_𝛿(w), v ∈ 𝐡_𝛿(z). Recall that the Hamming distance 𝑑_𝐻 is a sum of independent and identically distributed Bernoulli random variables and we may bound it using Hoeffding’s inequality. Since our probabilistic upper and lower bounds must hold for all u, v as described above, we introduce quantities 𝐿₀ and 𝐿₁ which represent two β€œextreme cases” of the Bernoulli variables:

$$
L_0 := \sup_{u \in B_\delta(w),\, v \in B_\delta(z)} \frac{1}{2m}\sum_{i=1}^{m}|\mathcal{A}_i(u) - \mathcal{A}_i(v)|, \qquad
L_1 := \inf_{u \in B_\delta(w),\, v \in B_\delta(z)} \frac{1}{2m}\sum_{i=1}^{m}|\mathcal{A}_i(u) - \mathcal{A}_i(v)|.
$$

Then we have

$$
L_1 \le d_H(\mathcal{A}(u), \mathcal{A}(v)) \le L_0.
$$

Denote 𝑃₀ = 1 βˆ’ E𝐿₀ and 𝑃₁ = E𝐿₁, i.e.,

$$
P_0 = \mathbb{P}\left[\forall u \in B_\delta(w), \forall v \in B_\delta(z) : \mathcal{A}_i(u) = \mathcal{A}_i(v)\right], \qquad
P_1 = \mathbb{P}\left[\forall u \in B_\delta(w), \forall v \in B_\delta(z) : \mathcal{A}_i(u) \neq \mathcal{A}_i(v)\right].
$$

By Hoeffding’s inequality,

$$
\mathbb{P}\left[L_0 > (1 - P_0) + \zeta\right] \le \exp(-2m\zeta^2), \qquad
\mathbb{P}\left[L_1 < P_1 - \zeta\right] \le \exp(-2m\zeta^2).
$$

Hence, with probability at least 1 βˆ’ 2 exp(βˆ’2π‘šπœΒ²),

$$
P_1 - \zeta \le d_H(\mathcal{A}(u), \mathcal{A}(v)) \le (1 - P_0) + \zeta.
$$

The result follows directly from this combined with the facts that from Lemma 2 we have

$$
P_1 \ge \frac{1}{22e^{5/2}\sqrt{\pi}}\left(\frac{\|w - z\|}{R} - \frac{\delta\sqrt{2n}}{R}\right),
$$

and from Lemma 5 we have

$$
1 - P_0 \le \sqrt{\frac{2}{\pi}}\left(\frac{\|w - z\|}{R} + \frac{\delta\sqrt{n}}{R}\right).
$$

Proof of Theorem 4 By Lemma 6, for any fixed pair w, z ∈ B^n_R we have bounds on the Hamming distance that hold with probability at least 1 βˆ’ 2 exp(βˆ’2𝜁²π‘š), for all u ∈ 𝐡_𝛿(w) and v ∈ 𝐡_𝛿(z). Recall that the radius 𝑅 ball can be covered with a set π‘ˆ of radius 𝛿 balls with |π‘ˆ| ≀ (3𝑅/𝛿)ⁿ. Thus, by a union bound we have that with probability at least 1 βˆ’ 2(3𝑅/𝛿)Β²ⁿ exp(βˆ’2𝜁²π‘š), for any w, z ∈ π‘ˆ,

$$
\frac{1}{22e^{5/2}\sqrt{\pi}}\left(\frac{\|w - z\|}{R} - \frac{\delta\sqrt{2n}}{R}\right) - \zeta \ \le\ d_H(\mathcal{A}(u), \mathcal{A}(v)) \ \le\ \sqrt{\frac{2}{\pi}}\left(\frac{\|w - z\|}{R} + \frac{\delta\sqrt{n}}{R}\right) + \zeta,
$$

for all u ∈ 𝐡_𝛿(w) and v ∈ 𝐡_𝛿(z). Since β€–x βˆ’ yβ€– βˆ’ 2𝛿 ≀ β€–w βˆ’ zβ€– ≀ β€–x βˆ’ yβ€– + 2𝛿, this implies that

$$
\frac{1}{22e^{5/2}\sqrt{\pi}}\left(\frac{\|x - y\| - 2\delta}{R} - \frac{\delta\sqrt{2n}}{R}\right) - \zeta \ \le\ d_H(\mathcal{A}(x), \mathcal{A}(y)) \ \le\ \sqrt{\frac{2}{\pi}}\left(\frac{\|x - y\| + 2\delta}{R} + \frac{\delta\sqrt{n}}{R}\right) + \zeta.
$$

Letting 𝛿 = πœπ‘…/βˆšπ‘› and setting 𝐢₁, 𝑐₁, 𝐢₂, 𝑐₂ appropriatelyΒ³ this reduces to (17). Lower bounding the probability by 1 βˆ’ πœ‚, we obtain

$$
2\left(3\sqrt{n}/\zeta\right)^{2n}\exp(-2\zeta^2 m) \le \eta.
$$

Rearranging yields (16).

4.2 Gaussian Noise

Here we aim to understand how the paired comparisons change with the introduction of β€œpre-quantization” Gaussian noise. This will have the effect of causing some comparisons to be erroneous, where the probability of an error will be largest when x is equidistant from p𝑖 and q𝑖 and will decay as x moves away from this boundary. Towards this end, recall that the observation model in (1) can be reduced to the form

$$
\mathcal{A}_i(x) = \operatorname{sign}(q_i), \qquad q_i := a_i^T x - \tau_i. \tag{18}
$$

In the noisy case, we will consider the observations

$$
\widetilde{\mathcal{A}}_i(x) = \operatorname{sign}(\widetilde{q}_i), \qquad \widetilde{q}_i := a_i^T x - \tau_i + z_i = q_i + z_i, \tag{19}
$$

where 𝑧𝑖 ∼ 𝒩(0, 𝜎_𝑧²). Note that since β€–a𝑖‖ = 1, this model is equivalent to adding multivariate Gaussian noise directly to x with covariance 𝜎_𝑧²I. For a fixed x, we can then quantify the probability that 𝑑_𝐻(π’œ(x), π’œΜƒ(x)) is large via our next two results. We first consider the case of a single comparison in Lemma 7, then extend this to an arbitrary number of comparisons in Theorem 8.
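As a small illustration of the noisy model (19) (assuming hyperplanes already generated as in Section 2; parameters are purely illustrative):

```python
import numpy as np

def noisy_comparisons(x, A, tau, sigma_z, rng):
    """Noisy observations as in (19): sign(a_i^T x - tau_i + z_i), z_i ~ N(0, sigma_z^2)."""
    z = rng.normal(0.0, sigma_z, size=tau.shape)
    return np.sign(A @ x - tau + z)

rng = np.random.default_rng(1)
n, m, R = 8, 2000, 1.0
A = rng.normal(size=(m, n)); A /= np.linalg.norm(A, axis=1, keepdims=True)  # a_i uniform on the sphere
tau = rng.normal(0.0, R / np.sqrt(n), size=m)                               # tau_i ~ N(0, R^2/n)
x = rng.normal(size=n); x *= R / np.linalg.norm(x)
clean = np.sign(A @ x - tau)
noisy = noisy_comparisons(x, A, tau, 0.1 * R / np.sqrt(n), rng)
print(np.mean(clean != noisy))   # empirical fraction of flipped comparisons
```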

Lemma 7 Suppose 𝑛 β‰₯ 4. Then P[π’œΜƒπ‘–(x) β‰  π’œπ‘–(x)] ≀ πœ…_𝑛(𝜎_𝑧²) where πœ…_𝑛 is defined by

$$
\kappa_n(\sigma_z^2) := \frac{1}{2}\sqrt{\frac{\sigma_z^2}{\sigma_z^2 + 2R^2/n + 4\|x\|^2/n}} \ \le\ \frac{1}{2}\sqrt{\frac{1}{1 + 2R^2/(n\sigma_z^2)}}. \tag{20}
$$

For clarity, we focus here on the 𝑛 β‰₯ 4 case. We consider the 𝑛 = 2 and 𝑛 = 3 cases separately because when 𝑛 β‰₯ 4 the probability distribution function of a𝑖ᡀx is well-approximated by a Gaussian function but not for 𝑛 < 4. We give alternative expressions for πœ…_𝑛 when 𝑛 = 2 and 𝑛 = 3 in Appendix B.

Β³ We set 𝐢₁ = 1/(22𝑒^(5/2)βˆšπœ‹) and 𝐢₂ = √(2/πœ‹). We may set 𝑐₁ = 1 + 1/(11𝑒^(5/2)βˆšπœ‹) + √(2/πœ‹) and 𝑐₂ = 1 + 3√(2/πœ‹) to obtain constants that are valid for all π‘›β€”improved values are possible for large 𝑛.


Theorem 8 Suppose 𝑛 β‰₯ 4 and fix x ∈ B^n_R. Let π’œ(x) and π’œΜƒ(x) denote the collection of π‘š observations defined as in (18) and (19) respectively, where the {(p𝑖, q𝑖)} (and hence the {(a𝑖, πœπ‘–)}) are generated as in Theorem 1. Then

$$
\mathbb{E}\, d_H(\mathcal{A}(x), \widetilde{\mathcal{A}}(x)) \le \kappa_n(\sigma_z^2) \le \frac{1}{2}\sqrt{\frac{1}{1 + 2R^2/(n\sigma_z^2)}}, \tag{21}
$$

and

$$
\mathbb{P}\left[d_H(\mathcal{A}(x), \widetilde{\mathcal{A}}(x)) \ge \kappa_n(\sigma_z^2) + \zeta\right] \le \exp(-2m\zeta^2), \tag{22}
$$

where πœ…_𝑛 is defined in (20).

Proof By Lemma 7, we have that P[π’œΜƒπ‘–(x) β‰  π’œπ‘–(x)] is bounded by πœ…_𝑛(𝜎_𝑧²). Since the comparisons are independent, the expected fraction of sign mismatches is just the probability of a sign flip just computed, which establishes (21). The tail bound in (22) is a simple consequence of Hoeffding’s inequality.

The bound (21) of Theorem 8 behaves as one would expect. If the variance 𝜎_𝑧² of the added Gaussian noise is small, we predict that the expected fraction of errors is also small. To place this result in context, recall that πœπ‘– ∼ 𝒩(0, 𝑅²/𝑛). Suppose that 𝜎_𝑧² = 𝑐₀𝑅²/𝑛. In this case one can bound πœ…_𝑛(𝜎_𝑧²) in (20) as

$$
\frac{1}{2}\sqrt{\frac{c_0}{c_0 + 6}} \ \le\ \kappa_n(\sigma_z^2) \ \le\ \frac{1}{2}\sqrt{\frac{c_0}{c_0 + 2}}.
$$

Intuitively, if 𝑐₀ is close to 1, meaning the noise variance is comparable to that of πœπ‘–, then we would expect to lose a significant amount of information about x, in which case 𝑑_𝐻(π’œ(x), π’œΜƒ(x)) could potentially be quite large. In contrast, by letting 𝑐₀ grow small we can bound πœ…_𝑛(𝜎_𝑧²) ≀ √(𝑐₀/8) arbitrarily close to zero.

It is instructive to consider Theorem 8 next to Theorem 4, which also predicts the fraction of sign mismatches up to an additive constant which is proportional to 1/βˆšπ‘š. (21) and (22) provide upper estimates of the level of comparison errors which is unavoidable, regardless of recovery technique. If, in a particular application, the noise is expected to be Gaussian, the bound in (22) can be used in the lower half of (17) to create a recovery guarantee. Note that since Theorem 4 is a uniform guarantee, better estimates than this could be possible in the specific case of Gaussian noise. We explore Gaussian noise further in Section 5.1.

Proof of Lemma 7 The probability of a sign flip is given by

P [π‘žπ‘–π‘žπ‘– < 0] = P [π‘žπ‘– < 0 and π‘žπ‘– > 0] + P [π‘žπ‘– > 0 and π‘žπ‘– < 0] .

Note that if we set π‘Ÿπ‘– := a𝑇𝑖 x/ β€–xβ€– ∈ [βˆ’1, 1], then we can write π‘žπ‘– = π‘Ÿπ‘– β€–xβ€– βˆ’ πœπ‘– and

π‘žπ‘– = π‘Ÿπ‘– β€–xβ€– βˆ’ πœπ‘– + 𝑧𝑖 where the random variables π‘Ÿπ‘–, πœπ‘–, and 𝑧𝑖 are independent. Where π‘“π‘Ÿ(π‘Ÿπ‘–)denotes the probability density functions for π‘Ÿπ‘– and recalling that πœπ‘– ∼ 𝒩 (0, 2𝑅2/𝑛), weshow in Appendix B using standard Gaussian tail bounds that

P[π‘žπ‘–π‘žπ‘– < 0] ≀ 1𝑅

βˆšπ‘›

πœ‹

∫ 1

0

∫ ∞

βˆ’βˆžπ‘“π‘Ÿ(π‘Ÿπ‘–) exp

(βˆ’(π‘Ÿπ‘– β€–xβ€– βˆ’ πœπ‘–)2

2𝜎2𝑧

βˆ’ π‘›πœ2𝑖

4𝑅2

)dπœπ‘– dπ‘Ÿπ‘–. (23)


The remainder of the proof (given in Appendix B) is obtained by bounding this integral. Note that in general, we have Β½(π‘Ÿπ‘– + 1) ∼ Beta((𝑛 βˆ’ 1)/2, (𝑛 βˆ’ 1)/2), but π‘Ÿπ‘– is asymptotically normal with variance 1/𝑛 (Spruill, 2007). For 𝑛 β‰₯ 4, we use the simple upper bound

$$
\begin{aligned}
f_r(r_i) &= \frac{1}{2}\left[B\!\left(\frac{n-1}{2}, \frac{n-1}{2}\right)\right]^{-1}\left(\frac{1 + r_i}{2}\cdot\frac{1 - r_i}{2}\right)^{(n-3)/2} \\
&\le \frac{1}{2}\left[\frac{\sqrt{2\pi}\,\left(\frac{n-1}{2}\right)^{(n-2)/2}\left(\frac{n-1}{2}\right)^{(n-2)/2}}{(n-1)^{n-1-1/2}}\right]^{-1}\left(\frac{1 - r_i^2}{4}\right)^{(n-3)/2} \\
&= \frac{1}{2}\left[\frac{\sqrt{2\pi}}{2^{n-2}\sqrt{n-1}}\right]^{-1}\frac{1}{2^{n-3}}\exp\!\left(-(n-3)r_i^2/2\right) \\
&= \frac{\sqrt{n-1}}{4\sqrt{2\pi}}\exp\!\left(-(n-3)r_i^2/2\right) \ \le\ \frac{\sqrt{n}}{4\sqrt{2\pi}}\exp\!\left(-n r_i^2/8\right). \tag{24}
\end{aligned}
$$

This follows from the standard inequalities 𝐡(π‘₯, 𝑦) β‰₯ √(2πœ‹) π‘₯^(π‘₯βˆ’1/2) 𝑦^(π‘¦βˆ’1/2)/(π‘₯ + 𝑦)^(π‘₯+π‘¦βˆ’1/2) (e.g., GreniΓ© and Molteni, 2015) and 1 βˆ’ π‘₯ ≀ exp(βˆ’π‘₯).

5. Estimation Algorithm and Guarantees

In the noise-free setting, given a set of comparisons π’œ(x), we may produce an estimate x̂ by finding any x̂ ∈ B^n_R satisfying π’œ(x̂) = π’œ(x). A simple approach is the following convex program:

$$
\hat{x} = \arg\min_{w}\ \|w\|_2 \quad \text{subject to} \quad \mathcal{A}_i(x)\left(a_i^T w - \tau_i\right) \ge 0 \quad \forall i \in [m]. \tag{25}
$$

This is relatively easy to solve since the constraints are simple linear inequalities and the feasible region is convex. Note that (25) is guaranteed to satisfy x̂ ∈ B^n_R since x ∈ B^n_R and x is feasible, so that β€–x̂‖ ≀ β€–xβ€– ≀ 𝑅. In this case we may apply Theorem 1 to argue that if π‘š obeys the bound in (4), then β€–x̂ βˆ’ xβ€– ≀ πœ–.
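For concreteness, here is a small sketch of (25) using cvxpy under the randomized model of Section 2 (an illustration, not the authors' code):

```python
import cvxpy as cp
import numpy as np

def recover_noise_free(A, tau, y):
    """Solve (25): minimize ||w||_2 subject to y_i (a_i^T w - tau_i) >= 0 for all i."""
    n = A.shape[1]
    w = cp.Variable(n)
    constraints = [cp.multiply(y, A @ w - tau) >= 0]
    cp.Problem(cp.Minimize(cp.norm(w, 2)), constraints).solve()
    return w.value

rng = np.random.default_rng(0)
n, m, R = 4, 200, 1.0
P = rng.normal(0, np.sqrt(2 * R**2 / n), (m, n))
Q = rng.normal(0, np.sqrt(2 * R**2 / n), (m, n))
A, tau = P - Q, 0.5 * (np.sum(P**2, 1) - np.sum(Q**2, 1))
x = rng.normal(size=n); x *= 0.8 * R / np.linalg.norm(x)
y = np.sign(A @ x - tau)           # noise-free comparisons
x_hat = recover_noise_free(A, tau, y)
print(np.linalg.norm(x - x_hat))   # should be small for large enough m
```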

However, in most practical applications, observations are likely to be corrupted by noise, leading to inconsistencies. Any errors in the observations π’œΜƒ(x) would make strictly enforcing π’œ(x̂) = π’œΜƒ(x) a questionable goal since, among other drawbacks, x itself would become infeasible. In fact, in this case we cannot even necessarily guarantee that (25) has any feasible solutions. In the noisy case we instead use a relaxation inspired by the extended 𝜈-SVM of (Perez-Cruz et al., 2003), which introduces slack variables πœ‰π‘– β‰₯ 0 and is controlled by the parameter 𝜈. Specifically, we denote by π’œΜƒ(x) the collection of (potentially) corrupted measurements, and we solve

$$
\begin{aligned}
\underset{\widetilde{w} \in \mathbb{R}^{n+1},\ \xi \in \mathbb{R}^{m},\ \rho \in \mathbb{R}}{\text{minimize}} \quad & -\nu\rho + \frac{1}{m}\sum_{i=1}^{m}\xi_i \\
\text{subject to} \quad & \widetilde{\mathcal{A}}_i(x)\left([a_i^T, -\tau_i]\,\widetilde{w}\right) \ge \rho - \xi_i,\ \ \xi_i \ge 0, \quad \forall i \in [m], \\
& \|\widetilde{w}[1:n]\|^2 \le \frac{2R^2}{1 + R^2}, \quad \text{and} \quad \|\widetilde{w}\|^2 = 2.
\end{aligned} \tag{26}
$$

Finally, we set x̂ = wΜƒ[1, . . . , 𝑛]/wΜƒ[𝑛 + 1]. The additional constraint β€–wΜƒ[1 : 𝑛]β€–Β² ≀ 2𝑅²/(1 + 𝑅²) ensures that β€–x̂‖ ≀ 𝑅. Note that an important difference between the extended 𝜈-SVM and (26) is that there is no β€œoffset” parameter to be optimized over. That is, if we interpret [a𝑖, βˆ’πœπ‘–] as β€œtraining examples,” then wΜƒ := [x, 1] ∈ R^(𝑛+1) corresponds to a homogeneous linear classifier. Note that in the absence of comparison errors, setting 𝜈 = 0, we would have a feasible solution with πœ‰π‘– = 0.

Unfortunately, due to the norm equality constraint, (26) is not convex and a unique global minimum cannot be guaranteed, i.e., there may be multiple solutions x̂. Nevertheless, the following result shows that any local minimum will have certain desirable properties, and in the process also provides guidance on choosing the parameter 𝜈. Combined with our previous results, this also allows us to give recovery guarantees.

Proposition 9 At any local minimum x̂ of (26), we have (1/π‘š)|{𝑖 : πœ‰π‘– > 0}| ≀ 𝜈. If the corresponding 𝜌 > 0, this further implies that 𝑑_𝐻(π’œΜƒ(x), π’œ(x̂)) ≀ 𝜈.

Proof This proof follows similarly to that of Proposition 7.5 of (SchΓΆlkopf and Smola, 2001), except applied to the extended 𝜈-SVM of (Perez-Cruz et al., 2003) and with the removal of the hyperplane bias term. Specifically, we first form the Lagrangian of (26):

$$
L(\widetilde{w}, \xi, \rho, \alpha, \beta, \gamma, \delta) = -\nu\rho + \frac{1}{m}\sum_i \xi_i - \sum_i\left(\alpha_i\left(\widetilde{\mathcal{A}}_i(x)[a_i, -\tau_i]^T\widetilde{w} - \rho + \xi_i\right) + \beta_i\xi_i\right) + \gamma\left(\frac{2R^2}{1 + R^2} - \|\widetilde{w}[1:n]\|^2\right) - \delta\left(2 - \|\widetilde{w}\|^2\right).
$$

We define the functions corresponding to the equality constraint (β„Žβ‚) and inequality constraints (𝑔𝑖 for 𝑖 ∈ [2π‘š + 1]) as follows:

$$
h_1(\widetilde{w}, \xi, \rho) := 2 - \|\widetilde{w}\|^2, \qquad
g_i(\widetilde{w}, \xi, \rho) := \begin{cases}
\widetilde{\mathcal{A}}_i(x)[a_i, -\tau_i]^T\widetilde{w} - \rho + \xi_i & i \in [1, m] \\
\xi_{i-m} & i \in [m+1, 2m] \\
-\left(\frac{2R^2}{1 + R^2} - \|\widetilde{w}[1:n]\|^2\right) & i = 2m + 1.
\end{cases}
$$

Consider the 𝑛 + π‘š + 2 variables ( w, πœ‰, 𝜌). The gradient corresponding to the equalityconstraint, βˆ‡h1, involves only the first 𝑛+1 variables. Thus, there exists an π‘š+1 dimensionalsubspace π’Ÿ βŠ‚ R𝑛+π‘š+2 where for any d ∈ π’Ÿ, βˆ‡h𝑇

1 d = 0. The gradients corresponding tothe 2π‘š + 1 inequality constraints are given in the (2π‘š + 1)Γ— (𝑛 + π‘š + 2) matrix

G :=

⎑⎒⎒⎒⎒⎒⎒⎒⎒⎒⎒⎒⎒⎣

βˆ‡g𝑇1

...βˆ‡g𝑇

π‘š

βˆ‡gπ‘‡π‘š+1...

βˆ‡g𝑇2π‘š

βˆ‡g𝑇2π‘š+1

⎀βŽ₯βŽ₯βŽ₯βŽ₯βŽ₯βŽ₯βŽ₯βŽ₯βŽ₯βŽ₯βŽ₯βŽ₯⎦=

⎑⎒⎒⎒⎒⎒⎒⎒⎒⎒⎒⎒⎣

Β· Β· Β· βˆ’1 Β· Β· Β· 0 1

Β· Β· Β·... . . . ...

...← (𝑛 + 1)β†’ 0 Β· Β· Β· βˆ’1 1

irrelevant βˆ’1 Β· Β· Β· 0 0

Β· Β· Β·... . . . ...

...Β· Β· Β· 0 Β· Β· Β· βˆ’1 0

← w[1 : 𝑛] β†’ 0 0 Β· Β· Β· 0 0

⎀βŽ₯βŽ₯βŽ₯βŽ₯βŽ₯βŽ₯βŽ₯βŽ₯βŽ₯βŽ₯βŽ₯⎦.


Since there is a d ∈ π’Ÿ such that (Gd)[𝑖] < 0 for all 𝑖 (for example, d = [0, . . . , 0|1, . . . , 1,βˆ’1]),the Mangasarian–Fromovitz constraint qualifications hold and we have the following first-order necessary conditions for local minima (see e.g., Bertsekas, 1999),

πœ•πΏ

πœ•πœŒ= βˆ’πœˆ +

βˆ‘π›Όπ‘– =β‡’

βˆ‘π›Όπ‘– = 𝜈

andπœ•πΏ

πœ•πœ‰π‘–= 1

π‘šβˆ’ 𝛼𝑖 βˆ’ 𝛽𝑖 = 0 =β‡’ 𝛼𝑖 + 𝛽𝑖 = 1

π‘š.

Sinceβˆ‘π‘š

𝑖=1 𝛼𝑖 = 𝜈, at most a fraction of 𝜈 can have 𝛼𝑖 = 1/π‘š. Now, any 𝑖 such that πœ‰π‘– > 0must have 𝛼𝑖 = 1/π‘š since by complimentary slackness, 𝛽𝑖 = 0. Hence, 𝜈 is an upper boundon the fraction of πœ‰ such that πœ‰π‘– > 0.

Finally, note that if 𝜌 > 0, then πœ‰π‘– = 0 implies π’œπ‘–(x)([a𝑇𝑖 ,βˆ’πœπ‘–] w) β‰₯ πœŒβˆ’ πœ‰π‘– > 0. Hence,

the fraction of πœ‰ such that πœ‰π‘– > 0 is an upper bound for 𝑑𝐻(π’œ(x),π’œ(x)).

5.1 Estimation Guarantees

We now show how the results of Theorems 8 and 4 can be combined with Proposition 9 to give recovery guarantees on β€–x̂ βˆ’ xβ€– when (26) is used for recovery under realistic noisy observation models. We consider three basic noise models. In the first, an arbitrary (but small) fraction of comparisons are reversed. We then consider the implication of this result in the context of two other noise models, one where Gaussian noise is added to either the underlying x or to the comparisons β€œpre-quantization,” that is, directly to (a𝑖ᡀx βˆ’ πœπ‘–), and another where the observations are generated using an arbitrary (but bounded) perturbation of x. We will ultimately see that largely similar guarantees are possible in all three cases, summarized in Table 1.

Table 1: Summary of our estimation guarantees in this section, where each result holds separately with probability at least 1 βˆ’ πœ‚, and πœ‚, 𝑐₁, 𝑐₂, 𝐢₁, 𝐢₂ are constants which may differ between results.

- Adversarial noise, level πœ…:  β€–x̂ βˆ’ xβ€–/𝑅 ≀ (2/𝐢₁)πœ… + (𝑐₁/𝐢₁)√[(𝑛 log(18π‘š) + log(2/πœ‚))/(2π‘š)]  (30)
- i.i.d. Gaussian noise, variance 𝜎_𝑧²:  β€–x̂ βˆ’ xβ€–/𝑅 ≀ (√2/𝐢₁)√(π‘›πœŽ_𝑧²/𝑅²) + (1/𝐢₁)√[log(1/πœ‚)/(2π‘š)] + (𝑐₁/𝐢₁)√[(𝑛 log(18π‘š) + log(2/πœ‚))/(2π‘š)]  (31)
- Perturbations x ↦ xβ€²:  β€–x̂ βˆ’ xβ€–/𝑅 ≀ (2𝐢₂/𝐢₁)β€–x βˆ’ xβ€²β€–/𝑅 + ((𝑐₁ + 2𝑐₂)/𝐢₁)√[(𝑛 log(18π‘š) + log(2/πœ‚))/(2π‘š)]  (33)

In our analysis of all three settings, we will use the fact that from the lower bound of Theorem 4 we have

$$
\frac{\|\hat{x} - x\|}{R} \le \frac{d_H(\mathcal{A}(x), \mathcal{A}(\hat{x})) + c_1\zeta}{C_1} \tag{27}
$$

with probability at least 1 βˆ’ πœ‚ provided that π‘š is sufficiently large, e.g., by taking

$$
m = \frac{1}{2\zeta^2}\left(2n\log\frac{3\sqrt{n}}{\zeta} + \log\frac{2}{\eta}\right). \tag{28}
$$

Note that by setting 𝛽 = 9π‘›πœ2

(2πœ‚

)1/𝑛, we can rearrange (28) to be of the form

18π‘š

(2πœ‚

)1/𝑛

= 𝛽 log 𝛽,

which implies that

𝛽 = 18π‘š(2/πœ‚)1/𝑛

π‘Š(18π‘š(2/πœ‚)1/𝑛

) ,where π‘Š (Β·) denotes the Lambert π‘Š function. Using the fact that π‘Š (π‘₯) ≀ log(π‘₯) for π‘₯ β‰₯ 𝑒and substituting back in for 𝛽, we have

𝜁 ≀

βˆšπ‘› log(18π‘š) + log(2/πœ‚)

2π‘š

under the mild assumption that π‘š β‰₯ 𝑒18(πœ‚

2 )1/𝑛. Substituting this in to (27) yields

β€–xβˆ’ x‖𝑅

≀ 𝑑𝐻(π’œ(x),π’œ(x))𝐢1

+ 𝑐1𝐢1

βˆšπ‘› log(18π‘š) + log(2/πœ‚)

2π‘š. (29)

We use this bound repeatedly below.

Noise model 1. In the first noise model, we suppose that an adversary is allowed to arbitrarily flip a fraction πœ… of measurements, where we assume πœ… is known (or can be bounded). This would seem to be a challenging setting, but in fact a guarantee under this model follows immediately from the lower bound in Theorem 4. Specifically, suppose that π’œ(x) represents the noise-free comparisons, and we receive instead π’œΜƒ(x), where 𝑑_𝐻(π’œ(x), π’œΜƒ(x)) ≀ πœ….

Consider using (26) to produce an x̂ setting 𝜈 = πœ…. If x̂ is a local minimum for (26) with 𝜌 > 0, Proposition 9 implies that 𝑑_𝐻(π’œΜƒ(x), π’œ(x̂)) ≀ πœ…. Thus, by the triangle inequality,

$$
d_H(\mathcal{A}(x), \mathcal{A}(\hat{x})) \le d_H(\mathcal{A}(x), \widetilde{\mathcal{A}}(x)) + d_H(\widetilde{\mathcal{A}}(x), \mathcal{A}(\hat{x})) \le 2\kappa.
$$

Plugging this into (29), we have that with probability at least 1 βˆ’ πœ‚,

$$
\frac{\|\hat{x} - x\|}{R} \le \frac{2\kappa}{C_1} + \frac{c_1}{C_1}\sqrt{\frac{n\log(18m) + \log(2/\eta)}{2m}}. \tag{30}
$$

We emphasize the power of this resultβ€”the adversary may flip not merely a random fraction of comparisons, but an arbitrary set of comparisons. Moreover, this holds uniformly for all x and x̂ simultaneously (with high probability).


Noise model 2. Here we model errors as being generated by adding i.i.d. Gaussian noise before the sign(Β·) function, as described in Section 4.2, i.e.,

$$
\widetilde{\mathcal{A}}_i(x) = \operatorname{sign}\left(a_i^T x - \tau_i + z_i\right),
$$

where 𝑧𝑖 ∼ 𝒩(0, 𝜎_𝑧²). Note that this model is equivalent to the Thurstone model of comparative judgment (Thurstone, 1927), and causes a predictable probability of error depending on the geometry of the set of items. Specifically, comparisons which are β€œdecisive,” i.e., whose hyperplane lies far from x, are unlikely to be affected by this noise. Conversely, comparisons which are nearly even are quite likely to be affected.

Under the random observation model considered in this paper, by Theorem 8 we have that, with probability at least 1 βˆ’ πœ‚,

$$
d_H(\mathcal{A}(x), \widetilde{\mathcal{A}}(x)) \le \kappa_n(\sigma_z^2) + \sqrt{\frac{\log(1/\eta)}{2m}},
$$

where

$$
\kappa_n(\sigma_z^2) = \frac{1}{2}\sqrt{\frac{\sigma_z^2}{\sigma_z^2 + 2R^2/n + 4\|x\|^2/n}} \le \sqrt{\frac{n\sigma_z^2}{2R^2}}.
$$

We now assume that x̂ is a local minimum of (26) with 𝜈 = πœ…_𝑛(𝜎_𝑧²) such that 𝜌 > 0. By the triangle inequality and Proposition 9,

$$
d_H(\mathcal{A}(x), \mathcal{A}(\hat{x})) \le d_H(\mathcal{A}(x), \widetilde{\mathcal{A}}(x)) + d_H(\widetilde{\mathcal{A}}(x), \mathcal{A}(\hat{x})) \le 2\sqrt{\frac{n\sigma_z^2}{2R^2}} + \sqrt{\frac{\log(1/\eta)}{2m}}.
$$

Combining this with (29), we have that with probability at least 1 βˆ’ 2πœ‚,

$$
\frac{\|\hat{x} - x\|}{R} \le \frac{\sqrt{2}}{C_1}\sqrt{\frac{n\sigma_z^2}{R^2}} + \frac{1}{C_1}\sqrt{\frac{\log(1/\eta)}{2m}} + \frac{c_1}{C_1}\sqrt{\frac{n\log(18m) + \log(2/\eta)}{2m}}. \tag{31}
$$

We next consider an alternative perspective on this model. Specifically, suppose that our observations are generated via

$$
\widetilde{\mathcal{A}}_i(x) = \mathcal{A}_i(x'_i) \quad \text{where} \quad x'_i = x + z_i,
$$

where z𝑖 ∼ 𝒩(0, 𝜎_𝑧²𝐼). Note that we can write this as

$$
\mathcal{A}_i(x'_i) = \operatorname{sign}\left(a_i^T(x + z_i) - \tau_i\right) = \operatorname{sign}\left(a_i^T x - \tau_i + a_i^T z_i\right).
$$

Since β€–a𝑖‖ = 1, a𝑖ᡀz𝑖 ∼ 𝒩(0, 𝜎_𝑧²), and thus this is equivalent to the model described above. Thus, we can also interpret the above results as applying when each comparison is generated using a β€œmisspecified” version of x which has been perturbed by Gaussian noise. Moreover, note that

$$
\mathbb{E}\,\|x - x'_i\|^2 = \mathbb{E}\,\|z_i\|^2 = n\sigma_z^2,
$$

in which case we can also express the bound in (31) as

$$
\frac{\|\hat{x} - x\|}{R} \le \frac{\sqrt{2}}{C_1}\sqrt{\frac{\mathbb{E}\,\|x - x'_i\|^2}{R^2}} + \frac{1}{C_1}\sqrt{\frac{\log(1/\eta)}{2m}} + \frac{c_1}{C_1}\sqrt{\frac{n\log(18m) + \log(2/\eta)}{2m}}. \tag{32}
$$


Thus, a small Gaussian perturbation of x in the comparisons will result in an increased recovery error roughly proportional to the (average) size of the perturbation.

Note that in establishing this result we apply Theorem 8, and so in contrast to our first noise model, here the result holds with high probability for a fixed x (as opposed to being uniform over all x for a single choice of π’œ).

Noise model 3. In the third noise model, we assume the comparisons are generated according to

$$
\widetilde{\mathcal{A}}(x) = \mathcal{A}(x'),
$$

where xβ€² represents an arbitrary perturbation of x. Much like in the previous model, comparisons which are β€œdecisive” are not likely to be affected by this kind of noise, while comparisons which are nearly even are quite likely to be affected. Unlike the previous model, our results here make no assumption on the distribution of the noise and will instead use the upper bound in Theorem 4 to establish a uniform guarantee that holds (with high probability) simultaneously for all choices of x (and xβ€²). Thus, in this model our guarantees are quite a bit stronger.

Specifically, we use the fact that from the upper bound of Theorem 4, with probability at least 1 βˆ’ πœ‚ we simultaneously have (27) and

$$
d_H(\mathcal{A}(x), \widetilde{\mathcal{A}}(x)) \le C_2\frac{\|x - x'\|}{R} + c_2\zeta =: \kappa.
$$

We again use (26) with 𝜈 = πœ… and Proposition 9 to produce an estimate x satisfying𝑑𝐻(π’œ(x),π’œ(x)) ≀ πœ…. Again using the triangle inequality, we have

𝑑𝐻(π’œ(x),π’œ(x)) ≀ 𝑑𝐻(π’œ(x),π’œ(x)) + 𝑑𝐻(π’œ(x),π’œ(x)) ≀ 2πœ….

Combining this with (27) we have

β€–xβˆ’ x‖𝑅

≀ 2πœ… + 𝑐1𝜁

𝐢1= 2𝐢2

𝐢1

β€–xβˆ’ x′‖𝑅

+ 𝑐1 + 2𝑐2𝐢1

𝜁.

Substituting in for 𝜁 as in (29) yields

$$
\frac{\|\hat{x} - x\|}{R} \le \frac{2C_2}{C_1}\frac{\|x - x'\|}{R} + \frac{c_1 + 2c_2}{C_1}\sqrt{\frac{n\log(18m) + \log(2/\eta)}{2m}}. \tag{33}
$$

Contrasting the result in (33) with that in (32), we note that up to constants, the results are essentially the same. This is perhaps somewhat surprising since (33) applies to arbitrary perturbations (as opposed to only Gaussian noise), and moreover, (33) is a uniform guarantee.

5.2 Adaptive Estimation

Here we describe a simple extension to our previous (noiseless) theory and show that ifwe modify the mean and variance of the sampling distribution of items over a number ofstages, we can localize adaptively and produce an estimate with many fewer comparisonsthan possible in a non-adaptive strategy. We assume 𝑑 stages (𝑑 = 1 for the non-adaptiveapproach). At each stage β„“ ∈ [𝑑] we will attempt to produce an estimate xβ„“ such that


‖x − x̂_ℓ‖ ≤ ε_ℓ, where ε_ℓ = R_ℓ/2 = R·2^{−ℓ}, then recenter at our previous estimate and halve the problem radius. In stage ℓ, each p_i, q_i ∼ 𝒩(x̂_{ℓ−1}, (2R_ℓ²/n)I). After t stages we will have ‖x − x̂_t‖ ≤ R·2^{−t} =: ε_t with probability at least 1 − tη.

Proposition 10 Let ε_t, η > 0 be given. Suppose that x ∈ B_R^n and that m total comparisons are obtained following the adaptive scheme, where
\[
m \ge 2C \log_2\!\left(\frac{2R}{\epsilon_t}\right)\left(n\log\!\left(2\sqrt{n}\right) + \log\frac{1}{\eta}\right),
\]
where C is a constant. Then with probability at least 1 − log₂(2R/ε_t)·η, for any estimate x̂ satisfying 𝒜(x̂) = 𝒜(x),
\[
\|\mathbf{x} - \widehat{\mathbf{x}}\| \le \epsilon_t.
\]

Proof The adaptive scheme uses t = ⌈log₂(R/ε_t)⌉ ≤ log₂(2R/ε_t) stages. Assume each stage is allocated m_ℓ comparisons. By Theorem 1, localization at each stage ℓ can be accomplished with high probability when
\[
m_\ell \ge C\,\frac{R_\ell}{\epsilon_\ell}\left(n\log\frac{R_\ell\sqrt{n}}{\epsilon_\ell} + \log\frac{1}{\eta}\right) = 2C\left(n\log\!\left(2\sqrt{n}\right) + \log\frac{1}{\eta}\right).
\]
This condition is met by giving an equal number of comparisons to each stage, m_ℓ = ⌊m/t⌋. Each stage fails with probability η. By a union bound, the target localization fails with probability at most tη. Hence, localization succeeds with probability at least 1 − tη.

Proposition 10 implies that m_adapt ≍ (n log n) log₂(R/ε_t) comparisons suffice to estimate x to within ε_t. This represents an exponential improvement in the total number of comparisons as a function of the target accuracy ε_t, as compared to a lower bound on the number of required comparisons, m_lower := 2nR/(eε_t), for any non-adaptive strategy (recall Theorem 3).

Note that this result holds in the noise-free setting, but it can be generalized to handle noisy settings via the approaches discussed in Sections 4 and 5. While we cannot hope to achieve the same exponential improvement in the general case, we expect rates better than those of passive methods to be possible for certain noise levels.
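To illustrate the multi-stage procedure concretely, the following is a minimal sketch (our own, under the assumptions noted in the comments) of the adaptive scheme: pairs at stage ℓ are drawn from a Gaussian centered at the previous estimate with variance 2R_ℓ²/n, and the radius is halved after each stage. The callables `oracle` and `estimate` are placeholders for the comparison oracle and for a recovery routine such as the program (25).

import numpy as np

def adaptive_localize(R, n, m_total, t, oracle, estimate, seed=0):
    # Sketch of the adaptive scheme; `oracle(p, q)` should return +1 if the hidden
    # point is closer to p than to q (and -1 otherwise), and `estimate(p, q, labels)`
    # is any recovery routine (e.g., the program (25)); both are placeholders here.
    rng = np.random.default_rng(seed)
    x_hat = np.zeros(n)                # stage-0 estimate: the center of the radius-R ball
    m_stage = m_total // t
    for ell in range(1, t + 1):
        R_ell = R * 2.0 ** (1 - ell)               # current localization radius R_l
        scale = np.sqrt(2.0 * R_ell**2 / n)        # per-coordinate standard deviation
        p = x_hat + rng.normal(scale=scale, size=(m_stage, n))
        q = x_hat + rng.normal(scale=scale, size=(m_stage, n))
        labels = np.array([oracle(pi, qi) for pi, qi in zip(p, q)])
        x_hat = estimate(p, q, labels)             # re-estimate from this stage's comparisons
        # After stage ell we expect ||x - x_hat|| <= R * 2**(-ell) with high probability.
    return x_hat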

6. Simulations

In this section we perform a range of synthetic experiments to demonstrate our approach.

6.1 Effect of Varying σ²

In Fig. 2, we let x ∈ Rⁿ for n ∈ {2, 3, 4, 8} with ‖x‖ = R = 1. We vary σ² and perform 1500 trials, each with m = 100 or m = 200 pairs of points drawn according to 𝒩(0, σ²I). To isolate the impact of σ², we consider the case where our observations are noise-free, and use (25) to recover x. As predicted by the theory, localization accuracy depends on the parameter σ, which controls the distribution of the hyperplane thresholds.


Figure 2: Mean error norm ‖x − x̂‖ as σ² varies for dimensions (a) 2, (b) 3, (c) 4, and (d) 8, for (a–c) 100 comparisons and (d) 200 comparisons.

Intuitively, if σ is too small, the hyperplane boundaries concentrate closer to the origin and do not localize points with large norm well. On the other hand, if σ is too large, most hyperplanes lie far from the target x. The sweet spot which allows uniform localization over the radius-R ball exists around σ² ≈ 2R²/n here.
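To connect this experiment to the hyperplane picture, the sketch below (our own, not the paper's code) draws pairs from 𝒩(0, σ²I), converts each comparison into the equivalent halfspace a_i^T x ≶ τ_i, and then, as a stand-in for the program (25) (which is not reproduced here), recovers a point consistent with the noise-free comparisons via a simple max-margin linear program.

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
n, R, sigma2, m = 3, 1.0, 2.0 / 3, 100

x = rng.normal(size=n)
x *= R / np.linalg.norm(x)                               # target on the radius-R sphere
p = rng.normal(scale=np.sqrt(sigma2), size=(m, n))
q = rng.normal(scale=np.sqrt(sigma2), size=(m, n))

# Halfspace form of "x is closer to p than to q":  a^T x < tau.
a = q - p
norms = np.linalg.norm(a, axis=1, keepdims=True)
a /= norms
tau = (np.sum(q**2, axis=1) - np.sum(p**2, axis=1)) / (2 * norms.ravel())
y = np.sign(a @ x - tau)                                 # noise-free comparisons

# Stand-in recovery: maximize a margin rho with y_i (a_i^T z - tau_i) >= rho, ||z||_inf <= R.
# Variables: [z (n), rho (1)]; linprog minimizes -rho.
A_ub = np.hstack([-(y[:, None] * a), np.ones((m, 1))])
b_ub = -y * tau
res = linprog(c=np.r_[np.zeros(n), -1.0], A_ub=A_ub, b_ub=b_ub,
              bounds=[(-R, R)] * n + [(0, None)])
x_hat = res.x[:n]
print("recovery error:", np.linalg.norm(x - x_hat))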

6.2 Effect of Noise

Here we experiment with noise as discussed in Sections 4 and 5, and use the optimization program (26) for recovery. To approximately solve this non-convex problem, we use the linearization procedure described in (Perez-Cruz et al., 2003). Specifically, over a number of iterations k, we repeatedly solve the sub-problem

\[
\begin{aligned}
\underset{\rho \in \mathbb{R},\; \xi \in \mathbb{R}^m,\; \mathbf{w}^{(k)} \in \mathbb{R}^{n+1}}{\text{minimize}} \quad & -\nu\rho + \frac{1}{m}\sum_{i=1}^{m} \xi_i \\
\text{subject to} \quad & \mathcal{A}_i(\mathbf{x})\left([\mathbf{a}_i^T, -\tau_i]\,\mathbf{w}^{(k)}\right) \ge \rho - \xi_i, \quad \xi_i \ge 0, \quad \forall i \in [m], \\
& \bar{\mathbf{w}}^{(k)T}\mathbf{w}^{(k)} = 2,
\end{aligned}
\]
where w̄^{(k)} is the reference vector from the previous iteration, and we set w̄^{(k+1)} ← χ w^{(k)} + (1 − χ) w̄^{(k)} with χ = 0.7. After sufficient iterations, if w^{(k)} ≈ w̄^{(k)} then (26) is approximately solved. This is a linear program, and it can easily be verified using the KKT conditions that |{i : ξ_i > 0}| ≤ mν. Thus in practice, this property will always be satisfied after each iteration.
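For concreteness, one possible implementation of this linearized iteration with an off-the-shelf LP solver is sketched below. This is our own stand-in, not the authors' implementation: `B` stacks the rows [a_i^T, −τ_i], `y` holds the observed comparisons 𝒜_i(x), the box bounds on w are a safeguard we add, and the way the final estimate is read off from w (as a homogeneous-coordinate vector) is an assumption.

import numpy as np
from scipy.optimize import linprog

def linearized_nu_svm(B, y, nu, n_iters=30, chi=0.7):
    # Sketch of the linearized nu-SVM iteration described above (ours, not the
    # authors' code).  B is m x (n+1) with rows [a_i, -tau_i]; y in {-1,+1}^m.
    m, d = B.shape
    w_bar = np.ones(d) * np.sqrt(2.0 / d)     # initial linearization point, ||w_bar||^2 = 2
    signed_B = y[:, None] * B
    w = w_bar.copy()
    for _ in range(n_iters):
        # Variables: [w (d), rho (1), xi (m)]; minimize -nu*rho + (1/m) * sum(xi).
        c = np.r_[np.zeros(d), -nu, np.ones(m) / m]
        # y_i [a_i, -tau_i] w >= rho - xi_i   <=>   -y_i B_i w + rho - xi_i <= 0.
        A_ub = np.hstack([-signed_B, np.ones((m, 1)), -np.eye(m)])
        b_ub = np.zeros(m)
        A_eq = np.r_[w_bar, 0.0, np.zeros(m)][None, :]   # linearized constraint w_bar^T w = 2
        b_eq = np.array([2.0])
        # Box bounds on w keep this sketch's subproblem bounded in the separable case.
        bounds = [(-10, 10)] * d + [(None, None)] + [(0, None)] * m
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
        if res.status != 0:
            break
        w = res.x[:d]
        if np.linalg.norm(w - w_bar) < 1e-6:             # converged: w is close to w_bar
            break
        w_bar = chi * w + (1 - chi) * w_bar              # damped update of the reference point
    # Assumption: w encodes the estimate in homogeneous coordinates, x_hat = w[:n] / w[n].
    return w[:d - 1] / w[d - 1]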


We also emphasize that the error bounds in Section 5 rely on the fact from Proposition 9 that d_H(𝒜(x̂), 𝒜(x')) ≤ ν, provided that the solution results in a ρ > 0. Unfortunately, we cannot guarantee that this will always be the case. Empirically, we have observed that given a certain noise level quantified by d_H(𝒜(x), 𝒜(x')) = κ, we are more likely to observe ρ ≤ 0 when we aggressively set ν = κ. By increasing ν somewhat this becomes much less likely. As a rule of thumb, we set ν = 2κ. We note that while in our context this choice is purely heuristic, it has some theoretical support in the ν-SVM literature (e.g., see Proposition 5 of Chen et al., 2005).

We consider the following noise models: (i) Gaussian, where we add pre-quantization Gaussian noise as in Section 4.2, (ii) logistic, in which pre-quantization logistic noise is added, (iii) random, where a uniformly random ν/2 fraction of comparisons are flipped, and (iv) adversarial, where we flip the ν/2 fraction of comparisons whose hyperplanes lie farthest from the ideal point. The logistic noise assumption is commonly studied in the paired comparison literature and is equivalent to the Bradley–Terry model (Bradley and Terry, 1952). Although we do not have theory for this case, we expect the logistic noise case to behave similarly to the Gaussian case. In both the Gaussian and logistic cases, we sweep the added noise variance, then for each variance plot against the mean number of errors induced as the x-axis. In each case, we set n = 5 and generate m = 1000 pairs of points and a random x with ‖x‖ = 0.7.
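As an illustration of how these four corruption mechanisms act on a set of comparisons, a small sketch (ours, not the experiment code) is given below; `margins` holds a_i^T x − τ_i, `y` the noise-free signs, and `level` plays the role of the noise scale for (i)–(ii) and of the flip fraction for (iii)–(iv).

import numpy as np

def corrupt_comparisons(margins, y, model, level, rng):
    # Flip noise-free comparisons y = sign(margins) under the four models above.
    # `level` is the noise scale (std. dev. / logistic scale) for (i)-(ii) and the
    # flip fraction for (iii)-(iv).  Illustrative sketch only.
    m = len(y)
    if model == "gaussian":              # (i) pre-quantization Gaussian noise
        return np.sign(margins + rng.normal(scale=level, size=m))
    if model == "logistic":              # (ii) pre-quantization logistic noise (Bradley-Terry)
        return np.sign(margins + rng.logistic(scale=level, size=m))
    if model == "random":                # (iii) flip a uniformly random fraction
        flip = rng.choice(m, size=int(level * m), replace=False)
    elif model == "adversarial":         # (iv) flip the comparisons whose hyperplanes are farthest
        flip = np.argsort(-np.abs(margins))[: int(level * m)]
    y_noisy = y.copy()
    y_noisy[flip] *= -1
    return y_noisy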

The mean and median recovery error ‖x − x̂‖ and the fraction of violated comparisons d_H(𝒜(x̂), 𝒜(x)) are plotted over 100 independent trials with varying numbers of comparison errors in Figs. 3–6. In the Gaussian noise, logistic noise, and uniform random comparison flipping cases, the actual fraction of comparison errors of the estimate is on average much smaller than our target ν. This is also seen in the adversarial case (Fig. 6) for smaller levels of error. However, at a high fraction of error (greater than about 17%) the error (both in terms of Euclidean norm and fraction of incorrect comparisons) grows rapidly. This illustrates a limitation of the approach of using slack variables as a relaxation of the 0–1 loss. We mention that in this regime, the recovery approach of (26) frequently yields ρ ≤ 0, to which our theory does not apply. Here, the recovery error counter-intuitively decreases with increasing comparison flips. This scenario, with a large number of erroneous comparisons, represents a very difficult situation in which any tractable recovery strategy would likely struggle. A possible direction for future work would be to make (26) more robust to such large outliers.

6.3 Adaptive Comparisons

In Fig. 7, we show the effect of varying levels of adaptivity, starting with the completely non-adaptive approach up to using 10 stages where we progressively re-center and re-scale the hyperplane offsets. In each case, we generate x ∈ R³ with ‖x‖ = 0.75, choosing the direction randomly. The total number of comparisons is held fixed and is split as equally as possible among the number of stages (preferring earlier stages when rounding). We set σ² = R = 1 and plot the average over 700 independent trials. As the number of stages increases, performance worsens if the number of comparisons is kept small, due to poor localization in the earlier stages. However, if the total number of comparisons is sufficiently large, an exponential improvement over non-adaptivity is possible.


Figure 3: Mean estimation error and average fraction of comparison errors when adding pre-quantization Gaussian noise, sweeping the variance and plotting against the average number of induced errors for each variance.

Figure 4: Mean estimation error and fraction of comparison errors when adding pre-quantization logistic noise, sweeping the variance and plotting against the average number of induced errors for each variance.


Figure 5: Mean estimation error and fraction of comparison errors when introducing uniform random comparison errors.

6.4 Adaptive Comparisons with a Fixed Non-Gaussian Data Set

In Fig. 8, we demonstrate the effect of adaptively choosing item pairs from a fixed synthetic data set over four stages versus choosing items non-adaptively, i.e., without attempting to estimate the signal during the comparison collection process. We first generate 10,000 items uniformly distributed inside the 3-dimensional unit ball and a vector x ∈ R³ with ‖x‖ = 0.4. In both cases, we generate pairs of Gaussian points and choose the items from the fixed data set which lie closest to them. In the adaptive case over four stages, we progressively re-center and re-scale the generated points; the initial σ² is set to the variance of the data set and is reduced dyadically after each stage. The total number of comparisons is held fixed and is split as equally as possible among the number of stages (preferring later stages when rounding). We plot the mean error over 200 independent data set trials.
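A minimal sketch (ours, with assumed parameter choices) of the pair-selection step used here is given below: Gaussian probe points are drawn around the current center and each probe is replaced by its nearest item in the fixed data set; in the adaptive variant the center is moved to the current estimate and σ² is halved after each stage. The interpretation of "the variance of the data set" as the total variance is one possible reading.

import numpy as np

rng = np.random.default_rng(2)
n, n_items = 3, 10_000

items = rng.normal(size=(n_items, n))
items *= rng.uniform(size=(n_items, 1)) ** (1.0 / n) / np.linalg.norm(items, axis=1, keepdims=True)
# items are now uniformly distributed in the n-dimensional unit ball

def draw_item_pair(center, sigma2):
    # Draw one comparison pair: Gaussian probes around `center`, snapped to the data set.
    probes = center + rng.normal(scale=np.sqrt(sigma2), size=(2, n))
    idx = np.argmin(np.linalg.norm(items[None, :, :] - probes[:, None, :], axis=2), axis=1)
    return items[idx[0]], items[idx[1]]

# Adaptive variant: re-center at the current estimate and shrink sigma2 dyadically per stage.
x_hat = np.zeros(n)
sigma2 = items.var(axis=0).sum()      # initial variance (one reading of "variance of the data set")
for stage in range(4):
    p, q = draw_item_pair(x_hat, sigma2)
    # ... collect comparisons, re-estimate x_hat from them (estimator omitted) ...
    sigma2 /= 2.0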

7. Discussion

We have shown that given the ability to generate item pairs according to a Gaussian distribution with a particular variance, it is possible to estimate a point x satisfying ‖x‖ ≤ R to within ε with roughly nR/ε paired comparisons (ignoring log factors). This procedure is also robust to a variety of forms of noise. If one is able to shift the distribution of the items drawn, adaptive estimation gives a substantial improvement over a non-adaptive strategy. To directly implement such a scheme, one would require the ability to generate items arbitrarily in Rⁿ. While there may be some cases where this is possible (e.g., in market testing of items where the features correspond to known quantities that can be manually manipulated, such as the amount of various ingredients in a food or beverage), in many of the settings considered by recommendation systems, the only items which can be compared belong to a fixed set of points. While our theory would still provide rough guidance as to how accurate a localization is possible, many open questions in this setting remain.


Figure 6: (Top) Mean estimation error and fraction of comparison errors when flipping the farthest comparisons. (Bottom) Fraction of trials with ρ > 0.


Figure 7: Mean error norm ‖x − x̂‖ versus total comparisons for a sequence of experiments with varying numbers of adaptive stages.

Figure 8: Mean error norm ‖x − x̂‖ versus total comparisons for nonadaptive and adaptive selection. Dotted lines denote stage boundaries.


For instance, the algorithm itself needs to be adapted, as done in Section 6.4. Of course, there are many other ways that the adaptive scheme could be modified to account for this restriction. For example, one could use rejection sampling, so that although many candidate pairs would need to be drawn, only a fraction would actually need to be presented to and labeled by the user. We leave the exploration of such variations for future work.
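One speculative way to realize the rejection-sampling idea (our own sketch, not something specified in the paper) is to draw candidate pairs uniformly from the fixed item set and accept a pair with probability proportional to the Gaussian pair density the adaptive scheme would like to sample from, so that only accepted pairs are shown to the user.

import numpy as np

rng = np.random.default_rng(3)

def accept_pair(p, q, center, sigma2, max_density=1.0):
    # Accept (p, q) with probability proportional to the target Gaussian pair density;
    # the un-normalized density here is at most 1, so max_density = 1 is a valid envelope.
    log_density = -(np.sum((p - center) ** 2) + np.sum((q - center) ** 2)) / (2 * sigma2)
    return rng.uniform() < np.exp(log_density) / max_density

def sample_pair(items, center, sigma2):
    # Draw candidate pairs from the fixed item set until one is accepted;
    # only accepted pairs would be presented to (and labeled by) the user.
    while True:
        i, j = rng.choice(len(items), size=2, replace=False)
        if accept_pair(items[i], items[j], center, sigma2):
            return items[i], items[j]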

Acknowledgments

This work was supported by grants AFOSR FA9550-14-1-0342, NSF CCF-1350616, and a gift from the Alfred P. Sloan Foundation.


Appendix A. Supporting Lemmas

Lemma 11 Let b > a and let L = min{|a|, |b|} and U = max{|a|, |b|}. Then if Φ and φ respectively denote the standard normal cumulative distribution function and probability density function, we have the bounds
\[
(b - a)\,\varphi(U) \le \Phi(b) - \Phi(a) \le (b - a)\,\varphi(L) \le (b - a)\,\varphi(0).
\]

Proof By the mean value theorem, we have for some a < c < b, Φ(b) − Φ(a) = (b − a)Φ'(c) = (b − a)φ(c). Since φ(|x|) is monotonically decreasing, it is lower bounded by φ(U) and upper bounded by φ(L) (and also by φ(0)).

Lemma 12 Let x, y ∈ Rⁿ. Then,
\[
\int_{\mathbb{S}^{n-1}} \left|\mathbf{a}^T(\mathbf{x} - \mathbf{y})\right| \nu(d\mathbf{a}) = \frac{1}{\sqrt{\pi}}\,\frac{\Gamma\!\left(\frac{n}{2}\right)}{\Gamma\!\left(\frac{n+1}{2}\right)}\,\|\mathbf{x} - \mathbf{y}\|.
\]

Proof By spherical symmetry, we may assume Δ = x − y = [ε, 0, ..., 0]^T for ε > 0 without loss of generality. Then ‖x − y‖ = ε and |a^T(x − y)| = |a^{(1)}|ε = ε|cos θ|, where cos⁻¹(a^{(1)}) = θ ∈ [0, π]. We will use the fact (Gradshteyn and Ryzhik, 2007):
\[
\int_0^{\pi/2} \cos^{\mu-1}\theta\,\sin^{\omega-1}\theta \, d\theta = \frac{1}{2} B\!\left(\frac{\mu}{2}, \frac{\omega}{2}\right) = \frac{1}{2}\,\frac{\Gamma(\mu/2)\,\Gamma(\omega/2)}{\Gamma((\mu+\omega)/2)}.
\]

Integrating |cos θ| in the first spherical coordinate, since the integrand is symmetric about π/2,
\[
\int_0^{\pi} |\cos\theta|\,\sin^{n-2}\theta \, d\theta = 2\int_0^{\pi/2} \cos\theta\,\sin^{n-2}\theta \, d\theta = \frac{\Gamma(1)\,\Gamma\!\left(\frac{n-1}{2}\right)}{\Gamma\!\left(1 + \frac{n-1}{2}\right)} = \frac{2}{n-1}.
\]

Then with the appropriate normalization, we have (using Γ(1/2) = √π)
\[
\int_{\mathbb{S}^{n-1}} \left|\mathbf{a}^T(\mathbf{x} - \mathbf{y})\right| \nu(d\mathbf{a}) = \left(\int_0^{\pi} \sin^{n-2}\theta \, d\theta\right)^{-1} \int_0^{\pi} \epsilon\,|\cos\theta|\,\sin^{n-2}\theta \, d\theta
= \epsilon \left(\frac{\Gamma\!\left(\frac{1}{2}\right)\Gamma\!\left(\frac{n-1}{2}\right)}{\Gamma\!\left(\frac{1}{2} + \frac{n-1}{2}\right)}\right)^{-1} \frac{2}{n-1}
= \frac{1}{\sqrt{\pi}}\,\frac{\Gamma\!\left(\frac{n}{2}\right)}{\Gamma\!\left(\frac{n+1}{2}\right)}\,\|\mathbf{x} - \mathbf{y}\|.
\]
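As a quick numerical sanity check of this identity (our own, not part of the paper), one can average |a^T(x − y)| over random directions drawn uniformly from the sphere and compare against the closed form:

import numpy as np
from math import gamma, pi, sqrt

rng = np.random.default_rng(0)
n, trials = 4, 500_000

x, y = rng.normal(size=n), rng.normal(size=n)
a = rng.normal(size=(trials, n))
a /= np.linalg.norm(a, axis=1, keepdims=True)      # uniform directions on the unit sphere

empirical = np.mean(np.abs(a @ (x - y)))
predicted = gamma(n / 2) / (sqrt(pi) * gamma((n + 1) / 2)) * np.linalg.norm(x - y)
print(f"empirical {empirical:.4f}  vs  predicted {predicted:.4f}")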

Appendix B. Integral Calculations for Lemma 7

Bound (23) Recall that r_i := a_i^T x / ‖x‖ ∈ [−1, 1], q_i = r_i‖x‖ − τ_i, and q̃_i = r_i‖x‖ − τ_i + z_i. Thus, if f_r(r_i), f_τ(τ_i), and f_z(z_i) denote the probability density functions for r_i, τ_i, and z_i, then since these random variables are independent we can write

P [π‘žπ‘– < 0 and π‘žπ‘– > 0] = P [π‘Ÿπ‘– β€–xβ€– βˆ’ πœπ‘– < 0 and π‘Ÿπ‘– β€–xβ€– βˆ’ πœπ‘– + 𝑧𝑖 > 0]

=∫ 1

βˆ’1

∫ ∞

π‘Ÿπ‘–β€–xβ€–

∫ π‘Ÿπ‘–β€–xβ€–βˆ’πœπ‘–

βˆ’βˆžπ‘“π‘Ÿ(π‘Ÿπ‘–)π‘“πœ (πœπ‘–)𝑓𝑧(𝑧𝑖) d𝑧𝑖 dπœπ‘– dπ‘Ÿπ‘–

=∫ 1

βˆ’1

∫ ∞

π‘Ÿπ‘–β€–xβ€–π‘“π‘Ÿ(π‘Ÿπ‘–)π‘“πœ (πœπ‘–)P [𝑧𝑖 > πœπ‘– βˆ’ π‘Ÿπ‘– β€–xβ€–] dπœπ‘– dπ‘Ÿπ‘–

=∫ 1

βˆ’1

∫ ∞

π‘Ÿπ‘–β€–xβ€–π‘“π‘Ÿ(π‘Ÿπ‘–)π‘“πœ (πœπ‘–)𝑄

(πœπ‘– βˆ’ π‘Ÿπ‘– β€–xβ€–

πœŽπ‘§

)dπœπ‘– dπ‘Ÿπ‘–,


where 𝑄(π‘₯) = 1√2πœ‹

∫∞π‘₯ exp(βˆ’π‘₯2/2) dπ‘₯, i.e., the tail probability for the standard normal

distribution. Via a similar argument we have

P [π‘žπ‘– > 0 and π‘žπ‘– < 0] = P [π‘Ÿπ‘– β€–xβ€– βˆ’ πœπ‘– > 0 and π‘Ÿπ‘– β€–xβ€– βˆ’ πœπ‘– + 𝑧𝑖 < 0]

=∫ 1

βˆ’1

∫ π‘Ÿπ‘–β€–xβ€–

βˆ’βˆž

∫ ∞

π‘Ÿπ‘–β€–xβ€–βˆ’πœπ‘–

π‘“π‘Ÿ(π‘Ÿπ‘–)π‘“πœ (πœπ‘–)𝑓𝑧(𝑧𝑖) d𝑧𝑖 dπœπ‘– dπ‘Ÿπ‘–

=∫ 1

βˆ’1

∫ π‘Ÿπ‘–β€–xβ€–

βˆ’βˆžπ‘“π‘Ÿ(π‘Ÿπ‘–)π‘“πœ (πœπ‘–)P [𝑧𝑖 < πœπ‘– βˆ’ π‘Ÿπ‘– β€–xβ€–)] dπœπ‘– dπ‘Ÿπ‘–

=∫ 1

βˆ’1

∫ π‘Ÿπ‘–β€–xβ€–

βˆ’βˆžπ‘“π‘Ÿ(π‘Ÿπ‘–)π‘“πœ (πœπ‘–)𝑄

(π‘Ÿπ‘– β€–xβ€– βˆ’ πœπ‘–

πœŽπ‘§

)dπœπ‘– dπ‘Ÿπ‘–.

Combining these we obtain

P[π‘žπ‘–π‘žπ‘– < 0] = P [π‘žπ‘– < 0 and π‘žπ‘– > 0] + P [π‘žπ‘– > 0 and π‘žπ‘– < 0]

=∫ 1

βˆ’1

∫ ∞

βˆ’βˆžπ‘“π‘Ÿ(π‘Ÿπ‘–)π‘“πœ (πœπ‘–)𝑄

( |π‘Ÿπ‘– β€–xβ€– βˆ’ πœπ‘–|πœŽπ‘§

)dπœπ‘– dπ‘Ÿπ‘–

= 2∫ 1

0

∫ ∞

βˆ’βˆžπ‘“π‘Ÿ(π‘Ÿπ‘–)π‘“πœ (πœπ‘–)𝑄

( |π‘Ÿπ‘– β€–xβ€– βˆ’ πœπ‘–|πœŽπ‘§

)dπœπ‘– dπ‘Ÿπ‘–,

following from the symmetry of f_r(·). Using the bound Q(x) ≤ (1/2)exp(−x²/2) (see (13.48) of Johnson et al., 1994), and recalling that τ_i ∼ 𝒩(0, 2R²/n), we have that

P[π‘žπ‘–π‘žπ‘– < 0] ≀ 1𝑅

βˆšπ‘›

πœ‹

∫ 1

0

∫ ∞

βˆ’βˆžπ‘“π‘Ÿ(π‘Ÿπ‘–) exp

(βˆ’(π‘Ÿπ‘– β€–xβ€– βˆ’ πœπ‘–)2

2𝜎2𝑧

βˆ’ π‘›πœ2𝑖

4𝑅2

)dπœπ‘– dπ‘Ÿπ‘–.

B.1 Bounding κ_n

First, we give an expression for κ_n for all cases n ≥ 2, expanding upon that given in Theorem 8 and Lemma 7. We have
\[
\kappa_n(\sigma_z^2) :=
\begin{cases}
\dfrac{1}{2}\sqrt{\dfrac{\sigma_z^2}{\sigma_z^2 + R^2}} & n = 2, \\[2ex]
\min\left\{\sqrt{\dfrac{\sigma_z^2}{\sigma_z^2 + 2R^2/3}},\; \sqrt{\dfrac{\pi}{2}}\,\dfrac{\sigma_z}{\|\mathbf{x}\|}\right\} & n = 3, \\[2ex]
\sqrt{\dfrac{\sigma_z^2}{\sigma_z^2 + 2R^2/n + 4\|\mathbf{x}\|^2/n}} & n \ge 4.
\end{cases}
\]

Below we derive this expression for the cases n = 2, n = 3, and n ≥ 4.
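For reference, the piecewise expression above can be transcribed directly into a small helper (our own convenience function; R and ‖x‖ are passed in explicitly):

import numpy as np

def kappa_n(sigma_z2, n, R, x_norm):
    # Direct transcription of the piecewise expression for kappa_n given above.
    if n == 2:
        return 0.5 * np.sqrt(sigma_z2 / (sigma_z2 + R**2))
    if n == 3:
        return min(np.sqrt(sigma_z2 / (sigma_z2 + 2 * R**2 / 3)),
                   np.sqrt(np.pi / 2) * np.sqrt(sigma_z2) / x_norm)
    return np.sqrt(sigma_z2 / (sigma_z2 + 2 * R**2 / n + 4 * x_norm**2 / n))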

B.2 Case n = 2

For the special case n = 2, d_i = cos θ_i, where θ_i ∈ [−π, π] is distributed uniformly. In this case, (23) can be re-written as

P[π‘žπ‘–π‘žπ‘– < 0] ≀ 12𝑅

√2πœ‹

∫ πœ‹/2

βˆ’πœ‹/2

∫ ∞

βˆ’βˆž

12πœ‹

exp(βˆ’(β€–xβ€– cos πœƒπ‘– βˆ’ πœπ‘–)2

2𝜎2𝑧

βˆ’ 𝜏2𝑖

2𝑅2

)dπœπ‘– dπœƒπ‘–

= 1πœ‹π‘…

√1

2πœ‹

∫ πœ‹/2

0

∫ ∞

βˆ’βˆžexp

(βˆ’(β€–xβ€– cos πœƒπ‘– βˆ’ πœπ‘–)2

2𝜎2𝑧

βˆ’ 𝜏2𝑖

2𝑅2

)dπœπ‘– dπœƒπ‘–.


Expanding and setting α, β, and γ appropriately,

P[π‘žπ‘–π‘žπ‘– < 0] ≀ 1πœ‹π‘…

√1

2πœ‹

∫ πœ‹/2

0

∫ ∞

βˆ’βˆžexp

(βˆ’β€–xβ€–

2 cos2 πœƒπ‘–

2𝜎2𝑧

+ 2 β€–xβ€– πœπ‘– cos πœƒπ‘–

2𝜎2𝑧

βˆ’ 𝜏2𝑖

2𝜎2𝑧

βˆ’ 𝜏2𝑖

2𝑅2

)dπœπ‘– dπœƒπ‘–

= 1πœ‹π‘…

√1

2πœ‹

∫ πœ‹/2

0

∫ ∞

βˆ’βˆžexp

(βˆ’π›Ύ cos2 πœƒπ‘– + π›½πœπ‘– cos πœƒπ‘– βˆ’ π›Όπœ2

𝑖

)dπœπ‘– dπœƒπ‘–.

Completing the square for τ_i,

P[π‘žπ‘–π‘žπ‘– < 0] = 1πœ‹π‘…

√1

2πœ‹

∫ πœ‹/2

0

∫ ∞

βˆ’βˆžexp

(βˆ’π›Ό

(πœπ‘– + 𝛽 cos πœƒπ‘–

2𝛼

)2+ (𝛽 cos πœƒπ‘–)2

4π›Όβˆ’ 𝛾 cos2 πœƒπ‘–

)dπœπ‘– dπœƒπ‘–

= 1πœ‹π‘…

√1

2πœ‹

∫ πœ‹/2

0

βˆšπœ‹

𝛼exp

(βˆ’(

𝛾 βˆ’ 𝛽2

4𝛼

)cos2 πœƒπ‘–

)dπœƒπ‘–

= πœ‹

2πœ‹π‘…

√1

2𝛼exp

(βˆ’1

2

(𝛾 βˆ’ 𝛽2

4𝛼

))𝐼0

(12

(𝛾 βˆ’ 𝛽2

4𝛼

)),

where I_0(·) denotes the modified Bessel function of the first kind. Since exp(−t)I_0(t) < 1, by plugging back in for α we obtain

P[π‘žπ‘–π‘žπ‘– < 0] ≀ 12𝑅

√1

2𝛼= 1

2√

2𝑅

⎯⎸⎸⎷ 11

2𝜎2𝑧

+ 12𝑅2

= 12

√𝜎2

𝑧

𝜎2𝑧 + 𝑅2 .

We also note that since exp(−t)I_0(t) < 1/√(πt), we can obtain the bound P[q_i q̃_i < 0] ≤ (1/√π)·σ_z/‖x‖, but one can show that the previous bound will dominate this whenever ‖x‖ ≤ R.

B.3 Case n = 3

For the case n = 3, d_i is itself distributed uniformly on [−1, 1]. In this case we have

P[π‘žπ‘–π‘žπ‘– < 0] ≀ 12𝑅

√3πœ‹

∫ 1

0

∫ ∞

βˆ’βˆžexp

(βˆ’(𝑑𝑖 β€–xβ€– βˆ’ πœπ‘–)2

2𝜎2𝑧

βˆ’ 3𝜏2𝑖

4𝑅2

)dπœπ‘– d𝑑𝑖.

Expanding and setting α, β, and γ appropriately,

P[π‘žπ‘–π‘žπ‘– < 0] ≀ 12𝑅

√3πœ‹

∫ 1

0

∫ ∞

βˆ’βˆžexp

(βˆ’π‘‘2

𝑖 β€–xβ€–2

2𝜎2𝑧

+ 2𝑑𝑖 β€–xβ€– πœπ‘–

2𝜎2𝑧

βˆ’ 𝜏2𝑖

2𝜎2𝑧

βˆ’ 3𝜏2𝑖

4𝑅2

)dπœπ‘– d𝑑𝑖

= 12𝑅

√3πœ‹

∫ 1

0

∫ ∞

βˆ’βˆžexp

(βˆ’π›Ύπ‘‘2

𝑖 + π›½π‘‘π‘–πœπ‘– βˆ’ π›Όπœ2𝑖

)dπœπ‘– d𝑑𝑖.

Completing the square for τ_i,

P[π‘žπ‘–π‘žπ‘– < 0] = 12𝑅

√3πœ‹

∫ 1

0

∫ ∞

βˆ’βˆžexp

(βˆ’π›Ό

(πœπ‘– + 𝑑𝑖𝛽

2𝛼

)2+ (𝑑𝑖𝛽)2

4π›Όβˆ’ 𝛾𝑑𝑖

)dπœπ‘– d𝑑𝑖

= 12𝑅

√3πœ‹

∫ 1

0

βˆšπœ‹

𝛼exp

(βˆ’π‘‘2

𝑖

(𝛾 βˆ’ 𝛽2

4𝛼

))d𝑑𝑖

= 12𝑅

√3𝛼

βˆšπœ‹

2erf(√

𝛾 βˆ’ 𝛽2/4𝛼)

βˆšπ›Ύ βˆ’ 𝛽2/4𝛼

.


Since erf(t)/t ≤ 2/√π, by plugging back in for α we obtain
\[
\mathbb{P}[q_i\tilde{q}_i < 0] \le \frac{1}{2R}\sqrt{\frac{3}{\frac{1}{2\sigma_z^2} + \frac{3}{4R^2}}} = \sqrt{\frac{\sigma_z^2}{\sigma_z^2 + 2R^2/3}}.
\]

Additionally, since erf(t) ≤ 1,
\[
\begin{aligned}
\mathbb{P}[q_i\tilde{q}_i < 0] &\le \frac{\sqrt{3\pi}}{4R}\left(\gamma\alpha - \beta^2/4\right)^{-1/2} \\
&= \frac{\sqrt{3\pi}}{4R}\left(\frac{\|\mathbf{x}\|^2}{2\sigma_z^2}\left(\frac{1}{2\sigma_z^2} + \frac{3}{4R^2}\right) - \frac{\|\mathbf{x}\|^2}{4\sigma_z^4}\right)^{-1/2} \\
&= \frac{\sqrt{3\pi}}{4R}\left(\frac{3\|\mathbf{x}\|^2}{8\sigma_z^2 R^2}\right)^{-1/2} = \sqrt{\frac{\pi}{2}}\,\frac{\sigma_z}{\|\mathbf{x}\|},
\end{aligned}
\]
which can be tighter when σ_z is small and ‖x‖ is large.

B.4 Case n ≥ 4

Combining (23) with our upper bound (24) on f_d(d_i), we obtain

P[π‘žπ‘–π‘žπ‘– < 0] ≀ 𝑛

4√

2πœ‹π‘…

∫ 1

0

∫ ∞

βˆ’βˆžexp

(βˆ’(𝑑𝑖 β€–xβ€– βˆ’ πœπ‘–)2

2𝜎2𝑧

βˆ’ π‘›πœ2𝑖

4𝑅2 βˆ’π‘›π‘‘2

𝑖

8

)dπœπ‘– d𝑑𝑖.

Expanding and setting α, β, and γ appropriately,

P[π‘žπ‘–π‘žπ‘– < 0] ≀ 𝑛

4√

2πœ‹π‘…

∫ 1

0

∫ ∞

βˆ’βˆžexp

(βˆ’π‘‘2

𝑖

(β€–xβ€–2

2𝜎2𝑧

+ 𝑛

8

)+ 2𝑑𝑖 β€–xβ€– πœπ‘–

2𝜎2𝑧

βˆ’ 𝜏2𝑖

2𝜎2𝑧

βˆ’ π‘›πœ2𝑖

4𝑅2

)dπœπ‘– d𝑑𝑖

= 𝑛

4√

2πœ‹π‘…

∫ 1

0

∫ ∞

βˆ’βˆžexp

(βˆ’π›Ύπ‘‘2

𝑖 + π›½π‘‘π‘–πœπ‘– βˆ’ π›Όπœ2𝑖

)dπœπ‘– d𝑑𝑖.

Completing the square for τ_i,

P[π‘žπ‘–π‘žπ‘– < 0] = 𝑛

4√

2πœ‹π‘…

∫ 1

0

∫ ∞

βˆ’βˆžexp

(βˆ’π›Ό

(πœπ‘– βˆ’

𝑑𝑖𝛽

2𝛼

)2+ (𝑑𝑖𝛽)2

4π›Όβˆ’ 𝛾𝑑2

𝑖

)dπœπ‘– d𝑑𝑖

= 𝑛

4√

2πœ‹π‘…

∫ 1

0

βˆšπœ‹

𝛼exp

(βˆ’π‘‘2

𝑖

(𝛾 βˆ’ 𝛽2

4𝛼

))d𝑑𝑖

= 𝑛

4√

2πœ‹π›Όπ‘…

βˆšπœ‹

2erf(√

𝛾 βˆ’ 𝛽2/4𝛼)

βˆšπ›Ύ βˆ’ 𝛽2/4𝛼

.


Since erf(t) ≤ 1, we have

P[π‘žπ‘–π‘žπ‘– < 0] ≀ 𝑛

4√

2𝑅

(π›Ύπ›Όβˆ’ 𝛽2/4

)βˆ’1/2

= 𝑛

4√

2𝑅

((β€–xβ€–2

2𝜎2𝑧

+ 𝑛

8

)( 12𝜎2

𝑧

+ 𝑛

4𝑅2

)βˆ’ β€–xβ€–

2

4𝜎4𝑧

)βˆ’1/2

= 12

(32𝑅2

𝑛2

(β€–xβ€–2 𝑛

8𝜎2𝑧𝑅2 + 𝑛

16𝜎2𝑧

+ 𝑛2

32𝑅2

))βˆ’1/2

= 12

√𝜎2

𝑧

𝜎2𝑧 + 2𝑅2/𝑛 + 4 β€–xβ€–2 /𝑛

.

We also note that since erf(t)/t ≤ 2/√π, it is also possible to obtain the bound
\[
\mathbb{P}[q_i\tilde{q}_i < 0] \le \frac{1}{2}\sqrt{\frac{n}{2\pi}}\,\sqrt{\frac{\sigma_z^2}{\sigma_z^2 + 2R^2/n}}.
\]

However, this bound can only be tighter when ‖x‖ is small and when n/(2π) < 1 (i.e., for n ≤ 6). Given this narrow range of applicability, we omit this from the formal statement of the result.


References

S. Agarwal, J. Wills, L. Cayton, G. Lanckriet, D. Kriegman, and S. Belongie. Generalized non-metric multidimensional scaling. In Proc. Int. Conf. Art. Intell. Stat. (AISTATS), San Juan, Puerto Rico, Mar. 2007.

N. Ailon. Active learning ranking from pairwise preferences with almost optimal query complexity. In Proc. Adv. in Neural Processing Systems (NIPS), Granada, Spain, Dec. 2011.

E. Arias-Castro et al. Some theory for ordinal embedding. Bernoulli, 23(3):1663–1693, 2017.

R. Baraniuk, S. Foucart, D. Needell, Y. Plan, and M. Wootters. Exponential decay of reconstruction error from binary measurements of sparse signals. IEEE Trans. Inform. Theory, 63(6):3368–3385, 2017.

D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 1999.

R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.

A. Brieden, P. Gritzmann, R. Kannan, V. Klee, L. LovÑsz, and M. Simonovits. Deterministic and randomized polynomial-time approximation of radii. Mathematika, 48(1-2):63–105, 2001.

R. Buck. Partition of space. Amer. Math. Monthly, 50(9):541–544, 1943.

E. CandΓ¨s and B. Recht. Exact matrix completion via convex optimization. Found. Comput. Math., 9(6):717–772, 2009.

P.-H. Chen, C.-J. Lin, and B. SchΓΆlkopf. A tutorial on Ξ½-support vector machines. Appl. Stoch. Models Bus. Ind., 21(2):111–136, 2005.

C. Coombs. Psychological scaling without a unit of measurement. Psych. Rev., 57(3):145–158, 1950.

M. Davenport. Lost without a compass: Nonmetric triangulation and landmark multidimensional scaling. In Proc. IEEE Int. Work. Comput. Adv. Multi-Sensor Adaptive Processing (CAMSAP), Saint Martin, Dec. 2013.

H. David. The Method of Paired Comparisons. Charles Griffin & Company Limited, London, UK, 1963.

B. Dubois. Ideal point versus attribute models of brand preference: A comparison of predictive validity. Adv. Consumer Research, 2(1):321–333, 1975.

B. Eriksson. Learning to Top-K search using pairwise comparisons. In Proc. Int. Conf. Art. Intell. Stat. (AISTATS), Scottsdale, AZ, Apr. 2013.

M. Giaquinta and G. Modica. Mathematical Analysis: An Introduction to Functions of Several Variables. BirkhÀuser, New York, NY, 2010.


I. Gradshteyn and I. Ryzhik. Table of Integrals, Series, and Products. Academic Press, Burlington, MA, 2007.

L. GreniΓ© and G. Molteni. Inequalities for the beta function. Math. Inequal. Appl., 18(4):1427–1442, 2015.

L. Jacques, J. Laska, P. Boufounos, and R. Baraniuk. Robust 1-bit compressive sensing via binary stable embeddings of sparse vectors. IEEE Trans. Inform. Theory, 59(4):2082–2102, 2013.

L. Jain, K. G. Jamieson, and R. Nowak. Finite sample prediction and recovery bounds for ordinal embedding. In Advances in Neural Information Processing Systems, pages 2711–2719, 2016.

K. Jamieson and R. Nowak. Active ranking using pairwise comparisons. In Proc. Adv. in Neural Processing Systems (NIPS), Granada, Spain, Dec. 2011a.

K. Jamieson and R. Nowak. Low-dimensional embedding using adaptively selected ordinal data. In Proc. Allerton Conf. Communication, Control, and Computing, Monticello, IL, Sept. 2011b.

N. Johnson, S. Kotz, and N. Balakrishnan. Continuous Univariate Distributions, volume 1. Wiley, New York, NY, 1994.

M. Kleindessner and U. Luxburg. Uniqueness of ordinal embedding. In Conference on Learning Theory, pages 40–67, 2014.

K. Knudson, R. Saab, and R. Ward. One-bit compressive sensing with norm estimation. IEEE Trans. Inform. Theory, 62(5):2748–2758, 2016.

M. Konoshima and Y. Noma. Hyperplane arrangements and locality-sensitive hashing with lift. arXiv preprint arXiv:1212.6110, 2012.

Y. Lu and S. Negahban. Individualized rank aggregation using nuclear norm regularization. In Proc. Allerton Conf. Communication, Control, and Computing, Monticello, IL, Sept. 2015.

A. Massimino and M. Davenport. Binary stable embedding via paired comparisons. In Proc. IEEE Work. Stat. Signal Processing, Palma de Mallorca, Spain, June 2016.

A. Massimino and M. Davenport. The geometry of random paired comparisons. In Proc. IEEE Int. Conf. Acoust., Speech, and Signal Processing (ICASSP), New Orleans, Louisiana, Mar. 2017.

A. Maydeu-Olivares and U. BΓΆckenholt. Modeling preference data. In R. Millsap and A. Maydeu-Olivares, editors, The SAGE Handbook of Quantitative Methods in Psychology, chapter 12, pages 264–282. SAGE Publications Ltd., London, UK, 2009.

G. Miller. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psych. Rev., 63(2):81, 1956.


R. Nowak. Noisy generalized binary search. In Advances in Neural Information Processing Systems, pages 1366–1374, 2009.

S. Oh, K. Thekumparampil, and J. Xu. Collaboratively learning preferences from ordinal data. In Proc. Adv. in Neural Processing Systems (NIPS), MontrΓ©al, QuΓ©bec, Dec. 2014.

M. O’Shaughnessy and M. Davenport. Localizing users and items from paired comparisons. In Proc. IEEE Int. Work. Machine Learning for Signal Processing (MLSP), Vietri sul Mare, Salerno, Italy, Sept. 2016.

D. Park, J. Neeman, J. Zhang, S. Sanghavi, and I. Dhillon. Preference completion: Large-scale collaborative ranking from pairwise comparisons. In Proc. Int. Conf. Machine Learning (ICML), Lille, France, July 2015.

F. Perez-Cruz, J. Weston, D. Herrmann, and B. SchΓΆlkopf. Extension of the Ξ½-SVM range for classification. NATO Science Series Sub Series III Computer and Systems Sciences, 190:179–196, 2003.

Y. Plan and R. Vershynin. One-bit compressed sensing by linear programming. Communications on Pure and Applied Mathematics, 66(8):1275–1297, 2013a.

Y. Plan and R. Vershynin. Robust 1-bit compressed sensing and sparse logistic regression: A convex programming approach. IEEE Transactions on Information Theory, 59(1):482–494, 2013b.

Y. Plan and R. Vershynin. Dimension reduction by random hyperplane tessellations. Discrete & Computational Geometry, 51(2):438–461, 2014.

F. Qi. Bounds for the ratio of two gamma functions. J. Inequal. Appl., 2010(1):493058, 2010.

F. Radlinski and T. Joachims. Active exploration for learning rankings from clickthrough data. In Proc. ACM SIGKDD Int. Conf. on Knowledge, Discovery, and Data Mining (KDD), San Jose, CA, Aug. 2007.

J. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Proc. Int. Conf. Machine Learning (ICML), Bonn, Germany, Aug. 2005.

B. SchΓΆlkopf and A. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, 2001.

N. Shah, S. Balakrishnan, J. Bradley, A. Parekh, K. Ramchandran, and M. Wainwright. Estimation from pairwise comparisons: Sharp minimax bounds with topology dependence. J. Machine Learning Research, 17(58):1–47, 2016.

M. Spruill. Asymptotic distribution of coordinates on high dimensional spheres. Elec. Comm. in Prob., 12:234–247, 2007.

L. Thurstone. A law of comparative judgment. Psych. Rev., 34(4):273, 1927.

F. Wauthier, M. Jordan, and N. Jojic. Efficient ranking from pairwise comparisons. In Proc. Int. Conf. Machine Learning (ICML), Atlanta, GA, June 2013.
