As You Like It: Localization via Paired Comparisons - Journal ...
-
Upload
khangminh22 -
Category
Documents
-
view
1 -
download
0
Transcript of As You Like It: Localization via Paired Comparisons - Journal ...
Journal of Machine Learning Research 22 (2021) 1-37 Submitted 2/18; Revised 7/19; Published 4/21
As You Like It: Localization via Paired Comparisons
Andrew K. Massimino [email protected] A. Davenport [email protected] of Electrical and Computer EngineeringGeorgia Institute of Technology777 Atlantic Dr NWAtlanta, GA 30332 USA
Editor: Francis Bach
AbstractSuppose that we wish to estimate a vector x from a set of binary paired comparisons ofthe form βx is closer to p than to qβ for various choices of vectors p and q. The problemof estimating x from this type of observation arises in a variety of contexts, includingnonmetric multidimensional scaling, βunfolding,β and ranking problems, often because itprovides a powerful and flexible model of preference. We describe theoretical bounds for howwell we can expect to estimate x under a randomized model for p and q. We also presentresults for the case where the comparisons are noisy and subject to some degree of error.Additionally, we show that under a randomized model for p and q, a suitable number ofbinary paired comparisons yield a stable embedding of the space of target vectors. Finally,we also show that we can achieve significant gains by adaptively changing the distributionused for choosing p and q.Keywords: paired comparisons, ideal point models, recommendation systems, 1-bit sensing,binary embedding
1. Introduction
In this paper we consider the problem of determining the location of a point in Euclideanspace based on distance comparisons to a set of known points, where our observations arenonmetric. In particular, let x β Rπ be the true position of the point that we are trying toestimate, and let (p1, q1), . . . , (pπ, qπ) be pairs of βlandmarkβ points in Rπ which we assumeto be known a priori. Rather than directly observing the raw distances from x, i.e., βxβpπβand βxβ qπβ, we instead obtain only paired comparisons of the form βxβ pπβ < βxβ qπβ.Our goal is to estimate x from a set of such inequalities. Nonmetric observations of thistype arise in numerous applications and have seen considerable interest in recent literaturee.g., (Ailon, 2011; Davenport, 2013; Eriksson, 2013; Shah et al., 2016). These methods areoften applied in situations where we have a collection of items and hypothesize that it ispossible to embed the items in Rπ in such a way that the Euclidean distance between pointscorresponds to their βdissimilarity,β with small distances corresponding to similar items.
Here, we focus on the sub-problem of adding a new point to a known (or previouslylearned) configuration of landmark points.
As a motivating example, we consider the problem of estimating a userβs preferences fromlimited response data. This is useful, for instance, in recommender systems, information
cβ2021 Andrew K. Massimino and Mark A. Davenport.
License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are providedat http://jmlr.org/papers/v22/18-105.html.
Massimino and Davenport
qi
pi
x
Figure 1: An illustration of the localization problem from paired comparisons. The infor-mation that x is closer to pπ than qπ tells us which side of a hyperplane x lies.Through many such comparisons we can hope to localize x to a high degree ofaccuracy.
retrieval, targeted advertising, and psychological studies. A common and intuitively appealingway to model preferences is via the ideal point model, which supposes preference for aparticular item varies inversely with Euclidean distance in a feature space (Coombs, 1950).We assume that the items to be rated are represented by points pπ and qπ in an π-dimensionalEuclidean space. A userβs preference is modeled as an additional point x in this space (calledthe individualβs βideal pointβ). This represents a hypothetical βperfectβ item satisfying allof the userβs criteria for evaluating items.
Using response data consisting of paired comparisons between items (e.g., βuser x prefersitem pπ to item qπβ) is a natural approach when dealing with human subjects since it avoidsrequiring people to assign precise numerical scores to different items, which is generally aquite difficult task, especially when preferences may depend on multiple factors (Miller, 1956).In contrast, human subjects often find pairwise judgements much easier to make (David,1963). Data consisting of paired comparisons is often generated implicitly in contexts wherethe user has the option to act on two (or more) alternatives; for instance they may chooseto watch a particular movie, or click a particular advertisement, out of those displayed tothem (Radlinski and Joachims, 2007). In such contexts, the βtrue distancesβ in the idealpoint modelβs preference space are generally inaccessible directly, but it is nevertheless stillpossible to obtain an estimate of a userβs ideal point.
1.1 Main Results
The fundamental question which interests us in this paper is how many comparisons weneed (and how should we choose them) to estimate x to a desired degree of accuracy. Thus,we consider the case where we are given an existing embedding of the items (as in a maturerecommender system) and focus on the on-line problem of locating a single new user fromtheir feedback (consisting of binary data generated from paired comparisons). The itemembedding could be generated using various methods, such as multidimensional scalingapplied to a set of item features, or even using the results of previous paired comparisons viaan approach like that in (Agarwal et al., 2007). Given such an embedding of β items, there
2
Localization via Paired Comparisons
are a total of(β
2)
= Ξ(β2) possible paired comparisons. Clearly, in a system with thousands(or more) items, it will be prohibitive to acquire this many comparisons as a typical user willlikely only provide comparisons for a handful of items. Fortunately, in general we can expectthat many, if not most, of the possible comparisons are actually redundant. For example, ofthe comparisons illustrated in Fig. 1, all but four are redundant andβat least in the absenceof noiseβadd no additional information.
Any precise answer to this question would depend on the underlying geometry of theitem embedding. Each comparison essentially divides Rπ in two, indicating on which side ofa hyperplane x lies, and some arrangements of hyperplanes will yield better tessellationsof the preference space than others. Thus, to gain some intuition on this problem withoutreference to the geometry of a particular embedding, we will instead consider a probabilisticmodel where the items are generated at random from a particular distribution. In this casewe show that under certain natural assumptions on the distribution, it is possible to estimatethe location of any x to within an error of π using a number of comparisons which, up to logfactors, is proportional to π/π. This is essentially optimal, so that no set of comparisonscan provide a uniform guarantee with significantly fewer comparisons. We then describeseveral stability and robustness guarantees for various settings in which the comparisonsare subject to noise or errors. Finally, we then describe a simple extension to an adaptivescheme where we adaptively select the comparisons (manifested here in adaptively alteringthe mean and variance of the distribution generating the items) to substantially reduce therequired number of comparisons.
1.2 Related Work
It is important to note that the ideal point model, while similar, is distinct from the low-rankmodel used in matrix completion (Rennie and Srebro, 2005; Candès and Recht, 2009).Although both models suppose user choices are guided by a number of attributes, the idealpoint model leads to preferences that are non-monotonic functions of those attributes. Theideal point model suggests that each feature has an ideal level; too much of a feature can bejust as undesirable as too little. It is not possible to obtain this kind of performance with atraditional low-rank model, though if points are limited to the sphere, then the ideal pointmodel can duplicate the performance of a low-rank factorization. There is also empiricalevidence that the ideal point model captures behavior more accurately than factorizationbased approaches do (Dubois, 1975; Maydeu-Olivares and Bâckenholt, 2009).
There is a large body of work that studies the problem of learning to rank items fromvarious sources of data, including paired comparisons of the sort we consider in this paper.See, for example, (Jamieson and Nowak, 2011a,b; Wauthier et al., 2013) and referencestherein. We first note that in most work on rankings, the central focus is on learning a correctrank-ordered list for a particular user, without providing any guarantees on recovering acorrect parameterization for the userβs preferences as we do here. While these two problemsare related, there are natural settings where it might be desirable to guarantee an accuraterecovery of the underlying parameterization (x in our model). For example, one could exploitthese guarantees in the context of an iterative algorithm for nonmetric multidimensionalscaling which aims to refine the underlying embedding by updating each user and item oneat a time (e.g., see OβShaughnessy and Davenport, 2016), in which case an understanding of
3
Massimino and Davenport
the error in the estimate of x is crucial. Moreover, we believe that our approach provides aninteresting alternative perspective as it yields natural robustness guarantees and suggestssimple adaptive schemes.
Also closely related is the work in (Lu and Negahban, 2015; Park et al., 2015; Oh et al.,2014) which consider paired comparisons and more general ordinal measurements in thesimilar (but as discussed above, subtly different) context of low-rank factorizations. Perhapsmost closely related to our work is that of (Jamieson and Nowak, 2011a), which examinesthe problem of learning a rank ordering using the same ideal point model considered in thispaper. The message in this work is broadly consistent with ours, in that the number ofcomparisons required should scale with the dimension of the preference space (not the totalnumber of items) and can be significantly improved via a clever adaptive scheme. However,this work does not bound the estimation error in terms of the Euclidean distance, which isour central concern. (Jamieson and Nowak, 2011b) also incorporates adaptivity, but seeksto embed a set of points in Euclidean space (as opposed to estimating a single userβs idealpoint) and relies on paired comparisons involving three arbitrarily selected points (ratherthan a userβs ideal point and two items). Our dyadic adaptive strategy is also similar inspirit to a higher-dimensional form of binary search, as in (Nowak, 2009), however herewe consider estimating a continuous ideal point x rather than choosing from a finite set ofpossible hypotheses.
Constructing an embedding of items given comparison measurements is sometimesreferred to as ordinal embedding and is studied in (Kleindessner and Luxburg, 2014), (Jainet al., 2016), and (Arias-Castro et al., 2017). As mentioned, in this work we assume thepresence of an item embedding created by e.g., those methods. Our work could then beused to create a corresponding embedding of users based on response data to perform e.g.,personalization and customer segmentation.
Finally, while seemingly unrelated, we note that our work builds on the growing bodyof literature of 1-bit compressive sensing. In particular, our results are largely inspired bythose in (Knudson et al., 2016; Baraniuk et al., 2017), and borrow techniques from (Jacqueset al., 2013) in the proofs of some of our main results. Our embedding result of Section 4.1is most directly related to the work of (Plan and Vershynin, 2014), which studies a similarproblem to (2) but under a different, non-pairwise model. Because of this distinction, theirresults are not directly applicable. However, due to our particular probabilistic model, ourresults are also in some sense stronger. We will expand further with a direct comparison tothis work during the presentation of our result. The 1-bit sensing problem is also consideredin (Plan and Vershynin, 2013b) and (Plan and Vershynin, 2013a) but these works handleonly queries involving homogeneous hyperplanes, without offset, which is not applicableto our pairwise setting. Ideas similar to 1-bit sensing appear in field of locality sensitivehashing, e.g, (Konoshima and Noma, 2012) which does treat inhomogeneous hyperplanesbut does not provide theory.
Note that in this work we extend preliminary results first presented in (Massimino andDavenport, 2016, 2017).
4
Localization via Paired Comparisons
2. A Randomized Observation Model
For the moment we will consider the βnoise-freeβ setting where each comparison between xand qπ versus pπ results in assigning the point which is truly closest to x with probability1. In this case we can represent the observed comparisons mathematically by letting ππ(x)denote the πth observation, which consists of comparisons between pπ and qπ, and setting
ππ(x) := sign(βxβ qπβ2 β βxβ pπβ2
)={
+1 if x is closer to pπ
β1 if x is closer to qπ.(1)
We will also use π(x) := [π1(x), Β· Β· Β· ,ππ(x)]π to denote the vector of all observationsresulting from π comparisons. Note that since
βxβ qπβ2 β βxβ pπβ2 = 2(pπ β qπ)π x + βqπβ2 β βpπβ2 ,
if we set aπ = (pπ β qπ) and ππ = 12(βpπβ2 β βqπβ2), then we can re-write our observation
model asππ(x) = sign
(2aπ
π xβ 2ππ
)= sign
(aπ
π xβ ππ
). (2)
This is reminiscent of the standard setup in one-bit compressive sensing (with dithers) (Knud-son et al., 2016; Baraniuk et al., 2017) with the important differences that: (i) we have notmade any kind of sparsity or other structural assumption on x and, (ii) the βdithersβ ππ,at least in this formulation, are dependent on the aπ, which results in difficulty applyingstandard results from this theory to the present setting.
However, many of the techniques from this literature will nevertheless be helpful inanalyzing this problem. To see this, we consider a randomized observation model wherethe pairs (pπ, qπ) are chosen independently with i.i.d. entries drawn according to a normaldistribution, i.e., pπ, qπ βΌ π© (0, π2I). In this case, we have that the entries of our sensingvectors are i.i.d. with οΏ½οΏ½π(π) βΌ π© (0, 2π2). Moreover, if we define bπ = pπ + qπ, then we alsohave that bπ βΌ π© (0, 2π2I), and
12 aπ
π bπ = 12β
π
(pπ(π)β qπ(π))(pπ(π) + qπ(π))
= 12β
π
pπ(π)2 β qπ(π)2 = 12(βpπβ2 β βqπβ2) = ππ.
Note that while ππ = 12 aπ
π bπ is clearly dependent on aπ, we do have that aπ and bπ areindependent.
To simplify, we re-normalize by dividing by βaπβ, i.e., setting aπ := aπ/ βaπβ and ππ :=ππ/ βaπβ, in which case we can write
ππ(x) = sign(aππ xβ ππ). (3)
It is easy to see that aπ is distributed uniformly on the sphere Sπβ1 = {a β Rπ : βaβ = 1}.Note that throughout our analysis we will exploit the fact that aπ is uniform on Sπβ1 andwill let π denote the uniform measure on the sphere. Note also that
ππ = 12aπ
π bπ.
5
Massimino and Davenport
Since aπ and bπ are independent, aπ and bπ are also independent. Moreover, for any unit-vector aπ, if bπ βΌ π© (0, 2π2I) then aπ
π bπ βΌ π© (0, 2π2). Thus, we must have ππ βΌ π© (0, π2/2),independent of aπ, which is the key insight that enables the analysis below.
3. Guarantees in the Noise-Free Setting
We now state a result concerning localization under the noise-free random model fromSection 2. Extensions to noisy comparisons and efficient estimation in practice are discussedin Sections 4 and 5, respectively. Let Bπ
π denote the π-dimensional, radius π Euclidean ball.
Theorem 1 Let π, π > 0 be given. Let ππ(Β·) be defined as in (1), and suppose that π pairs{(pπ, qπ)}ππ=1 are generated by drawing each pπ and qπ independently from π© (0, π2πΌ) whereπ2 = 2π 2/π. There exists a constant πΆ such that if
π β₯ πΆπ
π
(π log π
βπ
π+ log 1
π
), (4)
then with probability at least 1β π, for all x, y β Bππ such that π(x) = π(y),
βxβ yβ β€ π.
The result follows from applying Lemma 2 below to pairs of points in a covering set of Bππ .
The key message of this theorem is that if one chooses the variance π2 of the distributiongenerating the items appropriately, then it is possible to estimate x to within π using anumber of comparisons that is nearly linear in π/π. As we will show in Theorem 3, thisresult is optimal, ignoring log factors, in terms of the scaling of π with π and π. This alsomakes intuitive sense; if all the hyperplanes were all axis-aligned, one would require thenumber of hyperplanes be at least proportional to the number of dimensions, π, otherwisesome direction would be unconstrained. Theorem 1 is also sensible in terms of the ratio ofthe initial uncertainty π to the target uncertainty π since π /π hyperplanes would be requiredto uniformly localize a point along a single dimension.
A natural question is what would happen with a different choice of π2. In fact, thisassumption is criticalβif π2 is substantially smaller the bound quickly becomes vacuous, andas π2 grows much past π 2/π the bound begins to become steadily worse.1 As we will see inSection 6, this is in fact observed in practice. It should also be somewhat intuitive: if π2 istoo small, then nearly all the hyperplanes induced by the comparisons will pass very close tothe origin, so that accurate estimation of even βxβ becomes impossible. On the other hand,if π2 is too large, then an increasing number of these hyperplanes will not even intersect theball of radius π in which x is presumed to lie, thus yielding no new information.
Lemma 2 Let w, z β Bππ be distinct and fixed, and let πΏ > 0 be given. Define
π΅πΏ(w) := {u β Bππ : βuβwβ β€ πΏ}.
1. We note that it is possible to try to optimize π2 by setting π2 = ππ 2/π for some constant π and thenselecting π so as to minimize the constant πΆ in (4). We believe this would yield limited insight since,in order to obtain a result which is valid uniformly for all possible π, we use certain bounds which forgeneral π can be somewhat loose and would skew the resulting π. We instead simply select π = 2 forsimplicity in our analysis (as it results in ππ βΌ π© (0, π 2/π)) and because it aligns well with simulations.
6
Localization via Paired Comparisons
Let ππ be defined as in Theorem 1. Denote by πsep the probability that π΅πΏ(w) and π΅πΏ(z) areseparated by hyperplane π, i.e.,
πsep := P [βu β π΅πΏ(w),βv β π΅πΏ(z) : ππ(u) = ππ(v)] .
For any π0 β€ βwβ zβ we have
πsep β₯π0 β πΏ
β2π
22β
ππ5/2π .
Proof Let π = βwβ zβ. Here, we denote the normal vector and threshold of hyperplane πby a and π respectively. It is easy to show that πsep can be expressed as
πsep = P[aπ z + πΏ β€ π β€ aπ wβ πΏ or aπ w + πΏ β€ π β€ aπ zβ πΏ
]= 2P
[aπ z + πΏ β€ π β€ aπ wβ πΏ
], (5)
where the second equality follows from the symmetry of the distributions of a and π .Define πΆπΌ := {a β Sπβ1 : aπ (wβ z) β₯ πΌ}. Note that the probability in (5) is zero unless
a β πΆ2πΏ. Thus, recalling that ππ βΌ π© (0, π2/2) we have
πsep = 2β«
πΆ2πΏ
Ξ¦(
aπ wβ πΏ
π/β
2
)β Ξ¦
(aπ z + πΏ
π/β
2
) π(da)
β₯ 2β«
πΆβ²
Ξ¦(
aπ wβ πΏ
π/β
2
)β Ξ¦
(aπ z + πΏ
π/β
2
) π(da) (6)
for any πΆ β² β πΆ2πΏ. To obtain a lower bound on (6), we will consider a carefully chosen subsetπΆ β² β πΆ2πΏ and then simply multiply the area of πΆ β² by the minimum value πΎ of the integrandover that set, yielding a bound of the form
πsep β₯ 2πΎπ(πΆ β²).
We construct the set πΆ β² as follows. Let π := {a : aπ w β€ π/β
π βwβ}, π := {a : aπ z β₯βπ/β
π βzβ}, and set πΆ β² := πΆπΌ β© π β© π for some πΌ β₯ 2πΏ. Note that for any a β πΆ β²,since aπ (w β z) β₯ πΌ β₯ 2πΏ, we have βπ π/
βπ β€ aπ z + πΏ β€ aπ w β πΏ β€ π π/
βπ. Thus, by
Lemma 11,
πΎ = infaβπΆβ²
Ξ¦(
aπ wβ πΏ
π/β
2
)β Ξ¦
(aπ z + πΏ
π/β
2
) β₯β
2π
(πΌβ 2πΏ)π(β2π π
πβ
π
).
Recall by assumption we have that π =β
2π /β
π, thus we obtain by setting π =β
5,
πΎ β₯β
π
π (πΌβ 2πΏ)π(π) =
βπ(πΌβ 2πΏ)β2ππ5/2π
. (7)
Next note that πΆ β² = πΆπΌ β©π β©π = πΆπΌ βπ π βππ is a difference of a set of hypersphericalcaps. To obtain a lower bound on π(πΆ β²) we use the upper and lower bounds on the measureof hyperspherical caps given in Lemma 2.1 of (Brieden et al., 2001).
7
Massimino and Davenport
Case π β₯ 6 Provided that πΌ/π <β
2/π we can bound π(πΆ β²) as
π(πΆ β²) β₯ π(πΆπΌ)β π(π π)β π(ππ) β₯ 112 β 2 1
2π(1β π2/π)(πβ1)/2 β₯ 1
12 β1β
5π5/2 ,
where the last inequality follows from the fact that (1 β π₯/π)πβ1 β€ πβπ₯ for π β₯ π₯ β₯ 2.Combining this with lower estimate (7),
πsep β₯ 2πΎπ(πΆ β²) β₯ 2β
π(πΌβ 2πΏ)β2ππ2π
1β 12πβ5/2/β
512 .
Setting πΌ = πΏ + π/β
2π, since 1β 12πβ5/2/β
5 > 5/9, we have that
πsep β₯2β
π(π/β
2πβ πΏ)(1β 12πβ5/2/β
5)12β
2ππ5/2π β₯ πβ πΏ
β2π
22β
ππ5/2π .
Note that this bound holds under the assumption that πΌ/π <β
2/π, which for our choiceof πΌ is equivalent to the assumption that π > πΏ
β2π. However, this bound also holds trivially
for all π β€ πΏβ
2π, and thus in fact holds for all π β₯ 0.
Case π β€ 5 In this case, note that π/β
π β₯ 1, so the sets π and π are the entire sphere.Hence, π(π π) = π(ππ) = 0 and π(πΆ β²) = π(πΆπΌ) β₯ 1
12 . Thus,
πsep β₯ 2πΎπ(πΆ β²) β₯ πβ πΏβ
2π
12β
ππ5/2π .
We obtain the stated lemma by noting π0 β€ π.
Proof of Theorem 1 Let πe denote the probability that there exists some x, y β Bππ with
βxβ yβ > π and π(x) = π(y). Our goal is to show that πe β€ π. Towards this end, let π bea πΏ-covering set for Bπ
π with |π | β€ (3π /πΏ)π. By construction, for any x, y β Bππ , there exist
some w, z β π satisfying βxβwβ β€ πΏ and βyβ zβ β€ πΏ. In this case, if βxβ yβ > π then
βwβ zβ β₯ βxβ yβ β 2πΏ > πβ 2πΏ.
Our goal is to upper bound the probability that there exists some w, z β π with βwβ zβ β₯π0 = πβ 2πΏ and π(u) = π(v) for some u β π΅πΏ(w) and v β π΅πΏ(z). Said differently, we wouldlike to bound the probability that there exists a w, z β π with βwβ zβ β₯ π0 for whichπ΅πΏ(w) and π΅πΏ(z) are not separated by any of the π hyperplanes.
Let ππ(w, z) denote the probability that π΅πΏ(w) and π΅πΏ(z) are not separated by anyof the π hyperplanes for a fixed w, z β π with βwβ zβ β₯ π0. Lemma 2 controls thisprobability for a single hyperplane, yielding a bound of
1β πsep β€ 1β π0 β πΏβ
2π
22β
ππ5/2π .
Since the (pπ, qπ) are independent, we obtain
ππ(w, z) β€(
1β π0 β πΏβ
2π
22β
ππ5/2π
)π
. (8)
8
Localization via Paired Comparisons
Since we are interested in the event that there exists any w, z β π with βwβ zβ β₯ π0 forwhich π΅πΏ(w) and π΅πΏ(z) are separated by none of the π hyperplanes, we use the fact thatthere are at most (3π /πΏ)2π such pairs w, z and combine a union bound with (8) to obtain
πe β€(3π
πΏ
)2π(
1β π0 β πΏβ
2π
22β
ππ5/2π
)π
β€ exp
ββ2π log 3π
πΏβ
(π0 β πΏ
β2π)
π
22β
ππ5/2π
ββ , (9)
which follows from (1β π₯) β€ πβπ₯. Bounding the right-hand side of (9) by π, we obtain
2π log 3π
πΏβ
(π0 β πΏ
β2π)
π
22β
ππ5/2π β€ log π. (10)
If we now make the substitutions π0 = π β 2πΏ and πΏ = π/(4 +β
8π), then we have thatπ0 β πΏ
βπ = π/2 and thus we can reduce (10) to
2π log 3π (4 +β
8π)π
β ππ
44β
ππ5/2π β€ log π.
By rearranging, we see that this is equivalent to
π β₯ 44β
ππ5/2 π
π
(2π log 3π (4 +
β8π)
π+ log 1
π
). (11)
One can easily show that (4) implies (11) for an appropriate choice of πΆ.
We now show that the result in Theorem 1 is optimal in the sense that any set ofcomparisons which can guarantee a uniform recovery of all x β Bπ
π to accuracy π will requirea number of comparisons on the same order as that required in Theorem 1 (up to log factors).
Theorem 3 For any configuration of π (inhomogeneous) hyperplanes in Rπ dividing Bππ
into cells, if π < 2π
π π π, then there exist two points x, y β Bπ
π in the same cell such thatβxβ yβ β₯ π.
Proof We will use two facts. First, the number of cells (both bounded and unbounded)defined by π hyperplanes in Rπ in general position2 is given by
πΉπ(π) =πβ
π=0
(π
π
)β€(
ππ
π
)π
<
(2π
π
)π
, (12)
where the second inequality follows from the assumption that π < 2π π/ππ.Second, for any convex set πΎ we have the isodiametric inequality (Giaquinta and Modica,
2010): where Diam(πΎ) = supπ₯,π¦βπΎ βπ₯β π¦β,(Diam(πΎ)
2
)π ππ/2
Ξ(π/2 + 1) β₯ Vol(πΎ), (13)
2. For non-general position, this is an upper bound (Buck, 1943).
9
Massimino and Davenport
with equality when πΎ is a ball. Since the entire volume of Bππ , denoted Vol(Bπ
π ), is filled byat most πΉπ(π) non-overlapping cells, there must exist at least one such cell πΎ0 with
Vol(πΎ0) β₯ Vol(Bππ )
πΉπ(π) = ππ/2
Ξ(π/2 + 1)π π
πΉπ(π) . (14)
Combining (13) with (14), we obtain(Diam(πΎ0)2
)π
β₯ π π
πΉπ(π) ,
which, together with (12), implies that
Diam(πΎ0) β₯ 2π πβ
πΉπ(π)> π.
Thus there are vectors x, y β πΎ0 such that βxβ yβ > π.
4. Stability in Noise
So far, we have only considered the noise-free case. In most practical applications, obser-vations may be corrupted by noise. We consider two scenarios; in the first, we make noassumption on the source of the errors and instead show the paired comparison observationsare stable with respect to Euclidean distance. That is, two signals that have similar signpatterns are also nearby (and vice-versa). One can view this as a strengthening of the resultin Theorem 1. In the second case, Gaussian noise is added prior to the sign(Β·) function in (3).This is equivalent to the Thurstone model (Thurstone, 1927) with the Probit (normal) link.
The results of this section apply without considering any particular recovery methodor algorithm and relate to the number of paired comparisons which may be flipped due tothe noise, not from reconstruction. We address these concerns by introducing a practicalalgorithm and associated guarantees in Section 5.
Throughout the following, we denote by ππ» the Hamming distance, i.e., ππ» counts thefraction of comparisons which differ between two sets of observations, here denoted π(x)and π(y):
ππ»(π(x),π(y)) := 1π
πβπ=1
12 |ππ(x)βππ(y)|. (15)
4.1 Stable Embedding
Here we show that given enough comparisons there is an approximate embedding of thepreference space into {β1, 1}π via our model. Theorem 4 states that if x and y are sufficientlyclose, then the respective comparison patterns π(x) and π(y) closely align. In contrast withTheorem 8, Theorem 4 is a purely geometric statement which makes no assumptions on anyparticular noise model. Note also that Theorem 4 applies uniformly for all x and y.
10
Localization via Paired Comparisons
Theorem 4 Let π, π > 0 be given. Let π(x) denote the collection of π observations definedas in Theorem 1. There exist constants πΆ1, π1, πΆ2, π2 such that if
π β₯ 12π2
(2π log 3
βπ
π+ log 2
π
), (16)
then with probability at least 1β π, for all x, y β Bππ we have
πΆ1βxβ yβ
π β π1π β€ ππ»(π(x),π(y)) β€ πΆ2
βxβ yβπ
+ π2π. (17)
This result implies that the fraction of differences in the set of observed comparisonsbetween x and y will be constrained to within a constant factor of the Euclidean distance,plus an additive error approximately proportional to 1/
βπ. At first glance, this seems worse
than the result of Theorem 1, which suggests the rate 1/π. However, Theorem 4 comes withmuch greater flexibility in that Theorem 1 only concerns the case where ππ»(π(x),π(y)) = 0.As in Theorem 1, this result applies for all x on the same randomly drawn set of items.
This result is very reminiscent of Theorem 1.10 in (Plan and Vershynin, 2014) whichconcerns tessellations under uniform random affine hyperplanes generated according to theHaar measure, rather than the particular Gaussian model we study. Compared to thatwork, our result is much better in terms of the scaling of the lower bound on π with respectto distortion π (they predict 1/π12 while ours is 1/π2). This is due to both our specificprobabilistic model and because we do not use the technique of βliftingβ the problem todimension π + 1 in our analysis because it would incur additional distortion. On the otherhand, the result of (Plan and Vershynin, 2014) is applicable to x lying in an arbitrary convexbody whereas our result considers only x within a radius π ball.
Unlike in Theorem 1, we are not aware whether the relationship between π and π givenin Theorem 4 is optimal. This result is related to open questions concerning Dvoretzkyβstheorem for embedding β2 into β1. See the discussion of optimality in Section 1.7 of (Planand Vershynin, 2014) and Remark 1.6 of (Plan and Vershynin, 2013b).
In the context of a hypothetical recovery problem, suppose x is a parameter of interestand y is an estimate produced by any algorithm. Then, (17) says that if we want to recoverx to within error π, the algorithm should look for vectors y which have up to π(π) incorrectcomparisons. Likewise, if a y can be found having up to π(π) comparison errors, we havethe same π(π) guarantee on the Euclidean error of the estimate. In many cases, such aswhen errors are generated randomly, Theorem 1 would be inappropriate because finding a ysuch that ππ»(π(x),π(y)) = 0 is likely to be impossible.
To prove Theorem 4 we will require the following Lemmas 5 and 6.
Lemma 5 Let w, z β Bππ be distinct and fixed, and let πΏ > 0 be given. Let π(x) denote the
collection of π observations defined as in Theorem 1, and let π΅πΏ(Β·) be defined as in Lemma 2.Denote by π0 the probability that π΅πΏ(w) and π΅πΏ(z) are not separated by hyperplane π, i.e.,
π0 = P [βu β π΅πΏ(w), βv β π΅πΏ(z) : ππ(u) = ππ(v)] .
Then1β π0 β€
β2π
(βwβ zβ
π + πΏβ
π
π
).
11
Massimino and Davenport
Proof We need an upper bound on
1β π0 = P [ππ(u) = ππ(v) for some u β π΅πΏ(w), v β π΅πΏ(z)] .
Suppose for now that a is fixed and without loss of generality that aπ w > aπ z. Then thisprobability is simply
P[aπ v < π < aπ u for some u β π΅πΏ(w), v β π΅πΏ(z)
]= P
[min
vβπ΅πΏ(z)aπ v < π < max
uβπ΅πΏ(w)aπ u
]β€ P
[aπ zβ πΏ < π < aπ w + πΏ
],
since by CauchyβSchwarz we have
minvβπ΅πΏ(z)
aπ v β₯ aπ zβ πΏ and maxuβπ΅πΏ(w)
aπ u β€ aπ w + πΏ.
Thus, recalling that ππ βΌ π© (0, π 2/π), from Lemma 11 we have
P[aπ zβ πΏ < π < aπ w + πΏ
]= Ξ¦
(aπ w + πΏ
π /β
π
)β Ξ¦
(aπ zβ πΏ
π /β
π
)
β€ 1π
βπ
2π
(aπ (wβ z) + 2πΏ
).
Similarly, for aπ w < aπ z we have
P[aπ wβ πΏ < π < aπ z + πΏ
]β€ 1
π
βπ
2π
(aπ (zβw) + 2πΏ
).
Combining these we have
1β π0 β€β«Sπβ1
1π
βπ
2π
(|aπ (wβ z)|+ 2πΏ
)π(da)
= 1π
βπ
2π
β«Sπβ1|aπ (wβ z)| π(da) + 2πΏ
π
βπ
2π
=β
2π
π π
Ξ(π2 )
Ξ(π+12 )βwβ zβ+ πΏ
π
β2π
π,
where the last equality is proven in Lemma 12. The lemma then follows from the facts thatΞ(1/2)Ξ(1) =
βπ and Ξ( π
2 )Ξ( π+1
2 ) β€2β
2πβ1 β€β
ππ for π β₯ 2 (Qi, 2010, (2.20)).
Lemma 6 Let w, z β Bππ be distinct and fixed, and let πΏ, π > 0 be given. Let π(x) denote
the collection of π observations defined as in Theorem 1, and let π΅πΏ(Β·) be defined as inLemma 2. Then for all u β π΅πΏ(w) and v β π΅πΏ(z),
122π5/2βπ
(βwβ zβ
π β πΏβ
2π
π
)β π β€ ππ»(π(u),π(v)) β€
β2π
(βwβ zβ
π + πΏβ
π
π
)+ π,
with probability at least 1β exp(β2π2π).
12
Localization via Paired Comparisons
Proof Fix πΏ > 0 and let u β π΅πΏ(w), v β π΅πΏ(z). Recall that the Hamming distance ππ» isa sum of independent and identically distributed Bernoulli random variables and we maybound it using Hoeffdingβs inequality. Since our probabilistic upper and lower bounds musthold for all u, v as described above, we introduce quantities πΏ0 and πΏ1 which represent twoβextreme casesβ of the Bernoulli variables:
πΏ0 := supuβπ΅πΏ(w),vβπ΅πΏ(z)
12π
πβπ=1|ππ(u)βππ(v)|
πΏ1 := infuβπ΅πΏ(w),vβπ΅πΏ(z)
12π
πβπ=1|ππ(u)βππ(v)|.
Then we haveπΏ1 β€ ππ»(π(u),π(v)) β€ πΏ0.
Denote π0 = 1β EπΏ0 and π1 = EπΏ1, i.e.,
π0 = P [βu β π΅πΏ(w), βv β π΅πΏ(z) : ππ(u) = ππ(v)]π1 = P [βu β π΅πΏ(w), βv β π΅πΏ(z) : ππ(u) = ππ(v)] .
By Hoeffdingβs inequality,
P [πΏ0 > (1β π0) + π] β€ exp(β2ππ2)P [πΏ1 < π1 β π] β€ exp(β2ππ2).
Hence, with probability at least 1β 2 exp(β2ππ2),
π1 β π β€ ππ»(π(u),π(v)) β€ (1β π0) + π.
The result follows directly from this combined with the facts that from Lemma 2 we have
π1 β₯1
22π5/2βπ
(βwβ zβ
π β πΏβ
2π
π
),
and from Lemma 5 we have
1β π0 β€β
2π
(βwβ zβ
π + πΏβ
π
π
).
Proof of Theorem 4 By Lemma 6, for any fixed pair w, z β Bππ we have bounds on the
Hamming distance that hold with probability at least 1β 2 exp(β2π2π), for all u β π΅πΏ(w)and v β π΅πΏ(z). Recall that the radius π ball can be covered with a set π of radius πΏballs with |π | β€ (3π /πΏ)π. Thus, by a union bound we have that with probability at least1β 2(3π /πΏ)2π exp(β2π2π), for any w, z β π ,
122π5/2βπ
(βwβ zβ
π β πΏβ
2π
π
)β π β€ ππ»(π(u),π(v)) β€
β2π
(βwβ zβ
π + πΏβ
π
π
)+ π,
13
Massimino and Davenport
for all u β π΅πΏ(w) and v β π΅πΏ(z). Since βxβ yββ2πΏ β€ βwβ zβ β€ βxβ yβ+2πΏ, this impliesthat
122π5/2βπ
(βxβ yβ β 2πΏ
π β πΏβ
2π
π
)βπ β€ ππ»(π(x),π(y)) β€
β2π
(βxβ yβ+ 2πΏ
π + πΏβ
π
π
)+π,
Letting πΏ = ππ /β
π and setting πΆ1, π1, πΆ2, π1 appropriately3 this reduces to (17). Lowerbounding the probability by 1β π, we obtain
2(3β
π/π)2π exp(β2π2π) β€ π.
Rearranging yields (16).
4.2 Gaussian Noise
Here we aim to understand how the paired comparisons change with the introduction ofβpre-quantizationβ Gaussian noise. This will have the effect of causing some comparisons tobe erroneous, where the probability of an error will be largest when x is equidistant from pπ
and qπ and will decay as x moves away from this boundary.Towards this end, recall that the observation model in (1) can be reduced to the form
ππ(x) = sign(ππ) ππ := aππ xβ ππ. (18)
In the noisy case, we will consider the observations
ππ(x) = sign(ππ) ππ := aππ xβ ππ + π§π = ππ + π§π, (19)
where π§π βΌ π© (0, π2π§). Note that since βaπβ = 1, this model is equivalent to adding multivariate
Gaussian noise directly to x with covariance π2π§I. For a fixed x, we can then quantify the
probability that ππ»(π(x),π(x)) is large via our next two results. We first consider the caseof a single comparison in Lemma 7, then extend this to an arbitrary number of comparisonsin Theorem 8.
Lemma 7 Suppose π β₯ 4. Then P[ππ(x) = ππ(x)] β€ π π(π2π§) where π π is defined by
π π(π2π§) := 1
2
βπ2
π§
π2π§ + 2π 2/π + 4 βxβ2 /π
β€ 12
β1
1 + 2π 2/(ππ2π§) . (20)
For clarity, we focus here on the π β₯ 4 case. We consider the π = 2 and π = 3 cases separatelybecause when π β₯ 4 the probability distribution function of aπ
π x is well-approximated by aGaussian function but not for π < 4. We give alternative expressions for π π when π = 2 andπ = 3 in Appendix B.
3. We set πΆ1 = 1/22π5/2βπ and πΆ2 =
β2/π. We may set π1 = 1+1/11π5/2β
π+β
2/π and π2 = 1+3β
2/πto obtain constants that are valid for all πβimproved values are possible for large π.
14
Localization via Paired Comparisons
Theorem 8 Suppose π β₯ 4 and fix x β Bππ . Let π(x) and π(x) denote the collection of π
observations defined as in (18) and (19) respectively, where the {(pπ, qπ)}ππ=1 (and hence the{(aπ, ππ)}ππ=1) are generated as in Theorem 1. Then,
E ππ»(π(x),π(x)) β€ π π(π2π§) β€ 1
2
β1
1 + 2π 2/(ππ2π§) . (21)
andP[ππ»(π(x),π(x)) β₯ π π(π2
π§) + π]β€ exp(β2ππ2). (22)
where π π is defined in (20).
Proof By Lemma 7, we have that P[ππ(x) = ππ(x)] is bounded by π π(π2π§). Since the
comparisons are independent, the expected number of sign mismatches is just the probabilityof a sign flip just computed, which establishes (21). The tail bound in (22) is a simpleconsequence of Hoeffdingβs inequality.
The bound (21) of Theorem 8 behaves as one would expect. If the variance π2π§ of the
added Gaussian noise is small, we predict that the expected fraction of errors is also small.To place this result in context, recall that ππ βΌ π© (0, π 2/π). Suppose that π2
π§ = π0π 2/π. Inthis case one can bound π π(π2
π§) in (20) as
12
βπ0
π0 + 6 β€ π π(π2π§) β€ 1
2
βπ0
π0 + 2 .
Intuitively, if π0 is close to 1, meaning the noise variance is comparable to that of ππ,then we would expect to lose a significant amount of information about x, in which caseππ»(π(x),π(x)) could potentially be quite large. In contrast, by letting π0 grow small wecan bound π π(π2
π§) β€β
π0/8 arbitrarily close to zero.It is instructive to consider Theorem 8 next to Theorem 4, which also predicts the
fraction of sign mismatches up to an additive constant which is proportional to 1/β
π. (21)and (22) provide upper estimates of the level of comparison errors which is unavoidable,regardless of recovery technique. If, in a particular application the noise is expected tobe Gaussian, the bound in (22) can be used in the lower half of (17) to create a recoveryguarantee. Note that since Theorem 4 is a uniform guarantee, better estimates than thiscould be possible in the specific case of Gaussian noise. We explore Gaussian noise furtherin Section 5.1.Proof of Lemma 7 The probability of a sign flip is given by
P [ππππ < 0] = P [ππ < 0 and ππ > 0] + P [ππ > 0 and ππ < 0] .
Note that if we set ππ := aππ x/ βxβ β [β1, 1], then we can write ππ = ππ βxβ β ππ and
ππ = ππ βxβ β ππ + π§π where the random variables ππ, ππ, and π§π are independent. Where ππ(ππ)denotes the probability density functions for ππ and recalling that ππ βΌ π© (0, 2π 2/π), weshow in Appendix B using standard Gaussian tail bounds that
P[ππππ < 0] β€ 1π
βπ
π
β« 1
0
β« β
ββππ(ππ) exp
(β(ππ βxβ β ππ)2
2π2π§
β ππ2π
4π 2
)dππ dππ. (23)
15
Massimino and Davenport
The remainder of the proof (given in Appendix B) is obtained by bounding this integral.Note that in general, we have 1
2(ππ + 1) βΌ Beta((πβ 1)/2, (πβ 1)/2), but ππ is asymptoticallynormal with variance 1/π (Spruill, 2007). For π β₯ 4, we use the simple upper bound
ππ(ππ) = 12
[π΅
(πβ 1
2 ,πβ 1
2
)]β1(1 + ππ
21β ππ
2
)(πβ3)/2
β€ 12
[β2π πβ1
2(πβ2)/2 πβ1
2(πβ2)/2
(πβ 1)πβ1β1/2
]β1(1β π2
π
4
)(πβ3)/2
= 12
[ β2π
2πβ2βπβ 1
]β1 12πβ3 exp(β(πβ 3)π2
π /2)
=β
πβ 14β
2πexp(β(πβ 3)π2
π /2) β€β
π
4β
2πexp(βππ2
π /8). (24)
This follows from the standard inequalities π΅(π₯, π¦) β₯β
2ππ₯π₯β1/2π¦π¦β1/2/(π₯ + π¦)π₯+π¦β1/2 (e.g.,GreniΓ© and Molteni, 2015) and 1β π₯ β€ exp(βπ₯).
5. Estimation Algorithm and Guarantees
In the noise-free setting, given a set of comparisons π(x), we may produce an estimate xby finding any x β Bπ
π satisfying π(x) = π(x). A simple approach is the following convexprogram:
x = arg minw
βwβ2 subject to ππ(x)(aππ wβ ππ) β₯ 0 βπ β [π]. (25)
This is relatively easy to solve since the constraints are simple linear inequalities and thefeasible region is convex. Note that (25) is guaranteed to satisfy x β Bπ
π since x β Bππ and x
is feasible, so that βxβ β€ βxβ β€ π . In this case we may apply Theorem 1 to argue that if πobeys the bound in (4), then βxβ xβ β€ π.
However, in most practical applications, observations are likely to be corrupted by noiseleading to inconsistencies. Any errors in the observations π(x) would make strictly enforcingπ(x) = π(x) a questionable goal since, among other drawbacks, x itself would becomeinfeasible. In fact, in this case we cannot even necessarily guarantee that (25) has anyfeasible solutions. In the noisy case we instead use a relaxation inspired by the extendedπ-SVM of (Perez-Cruz et al., 2003), which introduces slack variables ππ β₯ 0 and is controlledby the parameter π. Specifically, we denote by π(x) the collection of (potentially) corruptedmeasurements, and we solve
minimizewβRπ+1,πβRπ,πβRβππ + 1
π
πβπ=1
ππ
subject to ππ(x)([aππ ,βππ] w) β₯ πβ ππ, ππ β₯ 0, βπ β [π],
β w[1 : π]β2 β€ 2π 2
1+π 2 , and β wβ2 = 2.
(26)
Finally, we set x = w[1, . . . , π]/ w[π + 1]. The additional constraint β w[1 : π]β2 β€ 2π 2
1+π 2
ensures that βxβ β€ π . Note that an important difference between the extended π-SVM and
16
Localization via Paired Comparisons
(26) is that there is no βoffsetβ parameter to be optimized over. That is, if we interpret[aπ,βππ] as βtraining examples,β then w := [x, 1] β Rπ+1 corresponds to a homogeneouslinear classifier. Note that in the absence of comparison errors, setting π = 0, we would havea feasible solution with ππ = 0.
Unfortunately, due to the norm equality constraint, (26) is not convex and a uniqueglobal minimum cannot be guaranteed, i.e., there may be multiple solutions x. Nevertheless,the following result shows that any local minimum will have certain desirable properties,and in the process also provides guidance on choosing the parameter π. Combined with ourprevious results, this also allows us to give recovery guarantees.
Proposition 9 At any local minimum x of (26), we have 1π |{π : ππ > 0}| β€ π. If the
corresponding π > 0, this further implies that ππ»(π(x),π(x)) β€ π.
Proof This proof follows similarly to that of Proposition 7.5 of (SchΓΆlkopf and Smola,2001), except applied to the extended π-SVM of (Perez-Cruz et al., 2003) and with theremoval of the hyperplane bias term. Specifically, we first form the Lagrangian of (26):
πΏ( w, π, π, πΌ, π½, πΎ, πΏ) = βππ + 1π
βπ
ππ ββ
π
(πΌπ(ππ(x)[aπ,βππ]π wβ π + ππ) + π½πππ)
+ πΎ
( 2π 2
1 + π 2 β β w[1 : π]β2)β πΏ(2β β wβ2).
We define the functions corresponding to the equality constraint (β1) and inequality con-straints (ππ for π β [2π + 1]) as follows:
β1(w, π, π) := (2β βwβ2),
ππ(w, π, π) :=
β§βͺβͺβ¨βͺβͺβ©ππ(x)[aπ,βππ]π wβ π + ππ π β [1, π]ππβπ π β [π + 1, 2π]β(
2π 2
1+π 2 β βw[1 : π]β2)
π = 2π + 1.
Consider the π + π + 2 variables ( w, π, π). The gradient corresponding to the equalityconstraint, βh1, involves only the first π+1 variables. Thus, there exists an π+1 dimensionalsubspace π β Rπ+π+2 where for any d β π, βhπ
1 d = 0. The gradients corresponding tothe 2π + 1 inequality constraints are given in the (2π + 1)Γ (π + π + 2) matrix
G :=
β‘β’β’β’β’β’β’β’β’β’β’β’β’β£
βgπ1
...βgπ
π
βgππ+1...
βgπ2π
βgπ2π+1
β€β₯β₯β₯β₯β₯β₯β₯β₯β₯β₯β₯β₯β¦=
β‘β’β’β’β’β’β’β’β’β’β’β’β£
Β· Β· Β· β1 Β· Β· Β· 0 1
Β· Β· Β·... . . . ...
...β (π + 1)β 0 Β· Β· Β· β1 1
irrelevant β1 Β· Β· Β· 0 0
Β· Β· Β·... . . . ...
...Β· Β· Β· 0 Β· Β· Β· β1 0
β w[1 : π] β 0 0 Β· Β· Β· 0 0
β€β₯β₯β₯β₯β₯β₯β₯β₯β₯β₯β₯β¦.
17
Massimino and Davenport
Since there is a d β π such that (Gd)[π] < 0 for all π (for example, d = [0, . . . , 0|1, . . . , 1,β1]),the MangasarianβFromovitz constraint qualifications hold and we have the following first-order necessary conditions for local minima (see e.g., Bertsekas, 1999),
ππΏ
ππ= βπ +
βπΌπ =β
βπΌπ = π
andππΏ
πππ= 1
πβ πΌπ β π½π = 0 =β πΌπ + π½π = 1
π.
Sinceβπ
π=1 πΌπ = π, at most a fraction of π can have πΌπ = 1/π. Now, any π such that ππ > 0must have πΌπ = 1/π since by complimentary slackness, π½π = 0. Hence, π is an upper boundon the fraction of π such that ππ > 0.
Finally, note that if π > 0, then ππ = 0 implies ππ(x)([aππ ,βππ] w) β₯ πβ ππ > 0. Hence,
the fraction of π such that ππ > 0 is an upper bound for ππ»(π(x),π(x)).
5.1 Estimation Guarantees
We now show how the results of Theorems 8 and 4 can be combined with Proposition 9to give recovery guarantees on βxβ xβ when (26) is used for recovery under realistic noisyobservation models. We consider three basic noise models. In the first, an arbitrary (butsmall) fraction of comparisons are reversed. We then consider the implication of this resultin the context of two other noise models, one where Gaussian noise is added to either theunderlying x or to the comparisons βpre-quantization,β that is, directly to (aπ
π xβ ππ), andanother where the observations are generated using an arbitrary (but bounded) perturbationof x. We will ultimately see that largely similar guarantees are possible in all three cases,summarized in Table 1.
Table 1: Summary of our estimation guarantees in this section, where each result holdsseparately with probability at least 1β π where π, π1, π2, πΆ1, πΆ2 are constants whichmay differ between results.
Noise type Guarantee on βxβ xβ /π Eq.
Adversarial, level π β β€ 2πΆ1
π + π1πΆ1
βπ log(18π)+log(2/π)
2π (30)i.i.d. Gaussian, π2
π§ β β€β
2πΆ1
βππ2
π§π 2 + 1
πΆ1
βlog(1/π)
2π + π1πΆ1
βπ log(18π)+log(2/π)
2π (31)Perturbations π₯ β¦β π₯β² β β€ 2πΆ2
πΆ1βxβxβ²β
π + π1+2π2πΆ1
βπ log(18π)+log(2/π)
2π (33)
In our analysis of all three settings, we will use the fact that from the lower bound ofTheorem 4 we have
βxβ xβπ
β€ ππ»(π(x),π(x)) + π1π
πΆ1(27)
18
Localization via Paired Comparisons
with probability at least 1β π provided that π is sufficiently large, e.g., by taking
π = 12π2
(2π log 3
βπ
π+ log 2
π
). (28)
Note that by setting π½ = 9ππ2
(2π
)1/π, we can rearrange (28) to be of the form
18π
(2π
)1/π
= π½ log π½,
which implies that
π½ = 18π(2/π)1/π
π(18π(2/π)1/π
) ,where π (Β·) denotes the Lambert π function. Using the fact that π (π₯) β€ log(π₯) for π₯ β₯ πand substituting back in for π½, we have
π β€
βπ log(18π) + log(2/π)
2π
under the mild assumption that π β₯ π18(π
2 )1/π. Substituting this in to (27) yields
βxβ xβπ
β€ ππ»(π(x),π(x))πΆ1
+ π1πΆ1
βπ log(18π) + log(2/π)
2π. (29)
We use this bound repeatedly below.
Noise model 1. In the first noise model, we suppose that an adversary is allowed toarbitrarily flip a fraction π of measurements, where we assume π is known (or can be bounded).This would seem to be a challenging setting, but in fact a guarantee under this model followsimmediately from the lower bound in Theorem 4. Specifically, suppose that π(x) representsthe noise-free comparisons, and we receive instead π(x), where ππ»(π(x),π(x)) β€ π .
Consider using (26) to produce an x setting π = π . If x is a local minimum for (26) withπ > 0, Proposition 9 implies that ππ»(π(x),π(x)) β€ π . Thus, by the triangle inequality,
ππ»(π(x),π(x)) β€ ππ»(π(x),π(x)) + ππ»(π(x),π(x)) β€ 2π .
Plugging this into (29) we have that with probability at least 1β π
βxβ xβπ
β€ 2π
πΆ1+ π1
πΆ1
βπ log(18π) + log(2/π)
2π. (30)
We emphasize the power of this resultβthe adversary may flip not merely a randomfraction of comparisons, but an arbitrary set of comparisons. Moreover, this holds uniformlyfor all x and x simultaneously (with high probability).
19
Massimino and Davenport
Noise model 2. Here we model errors as being generated by adding i.i.d. Gaussian beforethe sign(Β·) function, as described in Section 4.2, i.e.,
ππ(x) = sign(aππ xβ ππ + π§π),
where π§π βΌ π© (0, π2π§). Note that this model is equivalent to the Thurstone model of
comparative judgment (Thurstone, 1927), and causes a predictable probability of errordepending the geometry of the set of items. Specifically, comparisons which are βdecisive,βi.e., whose hyperplane lies far from x, are unlikely to be affected by this noise. Conversely,comparisons which are nearly even are quite likely to be affected.
Under the random observation model considered in this paper, by Theorem 8 we havethat, with probability at least 1β π,
ππ»(π(x),π(x)) β€ π π(π2π§) +
βlog(1/π)
2π,
where
π π(π2π§) =
βπ2
π§
π2π§ + 2π 2/π + 4 βxβ2 /π
β€
βππ2
π§
2π 2 .
We now assume that x is a local minimum of (26) with π = π π(π2π§) such that π > 0. By the
triangle inequality and Proposition 9,
ππ»(π(x),π(x)) β€ ππ»(π(x),π(x)) + ππ»(π(x),π(x)) β€ 2
βππ2
π§
2π 2 +
βlog(1/π)
2π.
Combining this with (29), we have that with probability at least 1β 2π,
βxβ xβπ
β€β
2πΆ1
βππ2
π§
π 2 + 1πΆ1
βlog(1/π)
2π+ π1
πΆ1
βπ log(18π) + log(2/π)
2π. (31)
We next consider an alternative perspective on this model. Specifically, suppose thatour observations are generated via
ππ(x) = ππ(xβ²π) where xβ²
π = x + zπ,
where zπ βΌ π© (0, π2π§πΌ). Note that we can write this as
ππ(xβ²π) = aπ
π (x + zπ)β ππ = aππ xβ ππ + aπ
π zπ.
Since βaπβ = 1, aππ zπ βΌ π© (0, π2
π§), and thus this is equivalent to the model described above.Thus, we can also interpret the above results as applying when each comparison is generatedusing a βmisspecifiedβ version of x which has been perturbed by Gaussian noise. Moreover,note that
Exβ xβ²
π
2 = E βzπβ2 = ππ2π§ ,
in which case we can also express the bound in (31) as
βxβ xβπ
β€β
2πΆ1
βE βxβ xβ²
πβ2
π 2 + 1πΆ1
βlog(1/π)
2π+ π1
πΆ1
βπ log(18π) + log(2/π)
2π. (32)
20
Localization via Paired Comparisons
Thus, a small Gaussian perturbation of x in the comparisons will result in an increasedrecovery error roughly proportional to the (average) size of the perturbation.
Note that in establishing this result we apply Theorem 8, and so in contrast to our firstnoise model, here the result holds with high probability for a fixed x (as opposed to beinguniform over all x for a single choice of π).
Noise model 3. In the third noise model, we assume the comparisons are generatedaccording to
π(x) = π(xβ²),
where xβ² represents an arbitrary perturbation of x. Much like in the previous model,comparisons which are βdecisiveβ are not likely to be affected by this kind of noise, whilecomparisons which are nearly even are quite likely to be affected. Unlike the previous model,our results here make no assumption on the distribution of the noise and will instead usethe upper bound in Theorem 4 to establish a uniform guarantee that holds (with highprobability) simultaneously for all choices of x (and xβ²). Thus, in this model our guaranteesare quite a bit stronger.
Specifically, we use the fact that from the upper bound of Theorem 4, with probabilityat least 1β π we simultaneously have (27) and
ππ»(π(x),π(x)) β€ πΆ2βxβ xβ²β
π + π2π =: π .
We again use (26) with π = π and Proposition 9 to produce an estimate x satisfyingππ»(π(x),π(x)) β€ π . Again using the triangle inequality, we have
ππ»(π(x),π(x)) β€ ππ»(π(x),π(x)) + ππ»(π(x),π(x)) β€ 2π .
Combining this with (27) we have
βxβ xβπ
β€ 2π + π1π
πΆ1= 2πΆ2
πΆ1
βxβ xβ²βπ
+ π1 + 2π2πΆ1
π.
Substituting in for π as in (29) yields
βxβ xβπ
β€ 2πΆ2πΆ1
βxβ xβ²βπ
+ π1 + 2π2πΆ1
βπ log(18π) + log(2/π)
2π. (33)
Contrasting the result in (33) with that in (32), we note that up to constants, the resultsare essentially the same. This is perhaps somewhat surprising since (33) applies to arbitraryperturbations (as opposed to only Gaussian noise), and moreover, (33) is a uniform guarantee.
5.2 Adaptive Estimation
Here we describe a simple extension to our previous (noiseless) theory and show that ifwe modify the mean and variance of the sampling distribution of items over a number ofstages, we can localize adaptively and produce an estimate with many fewer comparisonsthan possible in a non-adaptive strategy. We assume π‘ stages (π‘ = 1 for the non-adaptiveapproach). At each stage β β [π‘] we will attempt to produce an estimate xβ such that
21
Massimino and Davenport
βx β xββ β€ πβ where πβ = π β/2 = π 2ββ, then recentering to our previous estimate anddividing the problem radius in half. In stage β, each pπ, qπ βΌ π© (x, 2π 2
β /πI). After π‘ stageswe will have βxβ xπ‘β β€ π 2βπ‘ =: ππ‘ with probability at least 1β π‘π.
Proposition 10 Let ππ‘, π > 0 be given. Suppose that x β Bππ and that π total comparisons
are obtained following the adaptive scheme where
π β₯ 2πΆ log2
(2π
ππ‘
)(π log 2
βπ + log 1
π
),
where πΆ is a constant. Then with probability at least 1β log2(2π /ππ‘)π, for any estimate xsatisfying π(x) = π(x),
βxβ xβ β€ ππ‘.
Proof The adaptive scheme uses π‘ = βlog2(π /ππ‘)β β€ log2(2π /ππ‘) stages. Assume each stageis allocated πβ comparisons. By Theorem 1, localization at each stage β can be accomplishedwith high probability when
πβ β₯ πΆπ β
πβ
(π log π β
βπ
πβ+ log 1
π
)= 2πΆ
(π log 2
βπ + log 1
π
).
This condition is met by giving an equal number of comparisons to each stage, πβ = βπ/π‘β.Each stage fails with probability π. By a union bound, the target localization fails withprobability at most π‘π. Hence, localization succeeds with probability at least 1β π‘π.
Proposition 10 implies πadapt β (π log π) log2(π /ππ‘) comparisons suffice to estimate xto within ππ‘. This represents an exponential improvement in terms of number of totalcomparisons as a function of the target accuracy, ππ‘, as compared to a lower bound on thenumber of required comparisons, πlower := 2ππ /(πππ‘) for any non-adaptive strategy (recallTheorem 3).
Note that this result holds in the noise-free setting, but can be generalized to handlenoisy settings via the approaches discussed in Sections 4 and 5. While we cannot hope toachieve the same exponential improvement in the general case, we expect rates better thanthose of passive methods to be possible for certain noise levels.
6. Simulations
In this section we perform a range of synthetic experiments to demonstrate our approach.
6.1 Effect of Varying π2
In Fig. 2, we let x β R{2,3,4,8} with βxβ = π = 1. We vary π2 and perform 1500 trials, eachwith π = 100 or π = 200 pairs of points drawn according to π© (0, π2πΌ). To isolate theimpact of π2, we consider the case where our observations are noise-free, and use (25) torecover x. As predicted by the theory, localization accuracy depends on the parameter π,which controls the distribution of the hyperplane thresholds. Intuitively, if π is too small,the hyperplane boundaries concentrate closer to the origin and do not localize points with
22
Localization via Paired Comparisons
(a) (b)
(c) (d)
Figure 2: Mean error norm βxβ xβ as π2 varies for dimensions (a) 2, (b) 3, (c) 4, and (d) 8for (aβc) 100 comparisons and (d) 200 comparisons.
large norm well. On the other hand, if π is too large, most hyperplanes lie far from thetarget x. The sweet spot which allows uniform localization over the radius π ball existsaround π2 β 2π 2/π here.
6.2 Effect of Noise
Here we experiment with noise as discussed in Sections 4 and 5, and use the optimizationprogram (26) for recovery. To approximately solve this non-convex problem, we use thelinearization procedure described in (Perez-Cruz et al., 2003). Specifically, over a number ofiterations π, we repeatedly solve the sub-problem
minimizeπβR,πβRπ,w(π)βRπ+1
βππ + 1π
πβπ=1
ππ
subject to ππ(x)([aππ ,βππ] w(π)) β₯ πβ ππ, ππ β₯ 0, βπ β [π],w(π)π w(π) = 2
where we set w(π+1) β π w(π) + (1 β π) w(π) with π = 0.7. After sufficient iterations, ifw(π) β w(π) then (26) is approximately solved. This is a linear program and it can be easilyverified using the KKT conditions that |{π : ππ > 0}| β€ ππ. Thus in practice, this propertywill always be satisfied after each iteration.
23
Massimino and Davenport
We also emphasize that the error bounds in Section 5 rely on the fact from Proposition 9that ππ»(π(x),π(x)) β€ π, provided that the solution results in a π > 0. Unfortunately, wecannot guarantee that this will always be the case. Empirically, we have observed that givena certain noise level quantified by ππ»(π(x),π(x)) = π , we are more likely to observe π β€ 0when we aggressively set π = π . By increasing π somewhat this becomes much less likely.As a rule of thumb, we set π = 2π . We note that while in our context this choice is purelyheuristic, it has some theoretical support in the π-SVM literature (e.g., see Proposition 5 ofChen et al., 2005).
We consider the following noise models; (i) Gaussian, where we add pre-quantizationGaussian noise as in Section 4.2, (ii) logistic, in which pre-quantization logistic noise is added,(iii) random, where a uniform random π/2 fraction of comparisons are flipped, and (iv)adversarial, where we flip the π/2 fraction of comparisons whose hyperplanes lie farthest fromthe ideal point. The logistic noise assumption is commonly studied in paired comparisonliterature and is equivalent to the BradleyβTerry model (Bradley and Terry, 1952). Althoughwe do not have theory for this case, we expect the logistic noise case to behave similarlyto the Gaussian case. In both the Gaussian and logistic cases, we sweep the added noisevariance, then for each variance plot against the mean number of errors induced as thex-axis. In each case, we set π = 5 and generate π = 1000 pairs of points and a random xwith βxβ = 0.7.
The mean and median recovery error βxβ xβ and the fraction of violated comparisonsππ»(π(x),π(x)) are plotted over 100 independent trials with varying number of comparisonerrors in Figs. 3β6. In the Gaussian noise, logistic noise, and uniform random comparisonflipping cases, the actual fraction of comparison errors of the estimate is on average muchsmaller than our target π. This is also seen in the adversarial case (Fig. 6) for smallerlevels of error. However, at a high fraction of error (greater than about 17%) the error(both in terms of Euclidean norm and fraction of incorrect comparisons) grows rapidly. Thisillustrates a limitation to the approach of using slack variables as a relaxation to the 0β1 loss.We mention that in this regime, the recovery approach of (26) frequently yields π β€ 0, towhich our theory does not apply. Here, the recovery error counter-intuitively decreases withincreasing comparison flips. This scenario, with a large number of erroneous comparisons,represents a very difficult situation in which any tractable recovery strategy would likelystruggle. A possible direction for future work would be to make (26) more robust to suchlarge outliers.
6.3 Adaptive Comparisons
In Fig. 7, we show the effect of varying levels of adaptivity, starting with the completelynon-adaptive approach up to using 10 stages where we progressively re-center and re-scalethe hyperplane offsets. In each case, we generate x β R3 where βxβ = 0.75 and choosing thedirection randomly. The total number of comparisons are held fixed and are split as equallyas possible among the number of stages (preferring earlier stages when rounding). We setπ2 = π = 1 and plot the average over 700 independent trials. As the number of stagesincreases, performance worsens if the number of comparisons are kept small due to badlocalization in the earlier stages. However, if the number of total comparisons is sufficientlylarge, an exponential improvement over non-adaptivity is possible.
Figure 3: Mean estimation error and average fraction of comparison errors when adding pre-quantization Gaussian noise, sweeping the variance and plotting against the average number of induced errors for each variance.
Figure 4: Mean estimation error and fraction of comparison errors when adding pre-quantization logistic noise, sweeping the variance and plotting against the average number of induced errors for each variance.
Figure 5: Mean estimation error and fraction of comparison errors when introducing uniform random comparison errors.
6.4 Adaptive Comparisons with a Fixed Non-Gaussian Data Set
In Fig. 8, we demonstrate the effect of adaptively choosing item pairs from a fixed synthetic data set over four stages versus choosing items non-adaptively, i.e., without attempting to estimate the signal during the comparison collection process. We first generated 10,000 items uniformly distributed inside the 3-dimensional unit ball and a vector $\mathbf{x} \in \mathbb{R}^3$ with $\|\mathbf{x}\| = 0.4$. In both cases, we generate pairs of Gaussian points and choose the items from the fixed data set which lie closest to them. In the adaptive case, over four stages we progressively re-center and re-scale the generated points; the initial $\sigma^2$ is set to the variance of the data set and is reduced dyadically after each stage. The total number of comparisons is held fixed and is split as equally as possible among the stages (preferring later stages when rounding). We plot the mean error over 200 independent data set trials.
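The mapping from generated Gaussian probe points to actual catalog items can be implemented with a nearest-neighbor lookup, as in the following sketch. The use of a k-d tree, the handling of degenerate pairs, and the placeholder estimation step are illustrative choices; only the dyadic variance reduction follows the description above.

```python
# Sketch of adaptive pair selection from a fixed catalog: draw Gaussian probe
# points, then substitute the nearest catalog items (illustrative only).
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(3)
dirs = rng.standard_normal((10000, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
items = dirs * rng.random((10000, 1)) ** (1 / 3)   # uniform in the unit ball
tree = cKDTree(items)

def draw_pairs(center, sigma, count):
    """Draw probe pairs ~ N(center, sigma^2 I) and snap each probe to its nearest item."""
    probes = center + sigma * rng.standard_normal((2 * count, 3))
    _, idx = tree.query(probes)
    p, q = items[idx[:count]], items[idx[count:]]
    keep = ~np.all(p == q, axis=1)      # drop pairs that snapped to the same item
    return p[keep], q[keep]

sigma2 = items.var()                     # initial variance set from the data set
center = np.zeros(3)
for stage in range(4):
    p, q = draw_pairs(center, np.sqrt(sigma2), 100)
    # ... collect the comparisons for these pairs and update the estimate `center` ...
    sigma2 /= 2                          # reduce the variance dyadically after each stage
```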
7. Discussion
We have shown that, given the ability to generate item pairs according to a Gaussian distribution with a particular variance, it is possible to estimate a point $\mathbf{x}$ satisfying $\|\mathbf{x}\| \le R$ to within $\epsilon$ using roughly $nR/\epsilon$ paired comparisons (ignoring log factors). This procedure is also robust to a variety of forms of noise. If one is able to shift the distribution of the items drawn, adaptive estimation gives a substantial improvement over a non-adaptive strategy. To directly implement such a scheme, one would require the ability to generate items arbitrarily in $\mathbb{R}^n$. While there may be some cases where this is possible (e.g., in market testing of items where the features correspond to known quantities that can be manually manipulated, such as the amount of various ingredients in a food or beverage), in many of the settings considered by recommendation systems, the only items which can be compared belong to a fixed set of points. While our theory would still provide rough guidance as to how accurate a localization is possible, many open questions remain in this setting.
Figure 6: (Top) Mean estimation error and fraction of comparison errors when flipping the farthest comparisons. (Bottom) Fraction of trials with $\epsilon > 0$.
Figure 7: Mean error norm $\|\hat{\mathbf{x}} - \mathbf{x}\|$ versus total comparisons for a sequence of experiments with varying numbers of adaptive stages.
Figure 8: Mean error norm $\|\hat{\mathbf{x}} - \mathbf{x}\|$ versus total comparisons for non-adaptive and adaptive selection. Dotted lines denote stage boundaries.
For instance, the algorithm itself needs to be adapted, as done in Section 6.4. Of course, there are many other ways that the adaptive scheme could be modified to account for this restriction. For example, one could use rejection sampling, so that although many candidate pairs would need to be drawn, only a fraction would actually need to be presented to and labeled by the user; a sketch of this idea is given below. We leave the exploration of such variations for future work.
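One way to make the rejection-sampling idea concrete is the following minimal sketch: candidate pairs are proposed uniformly from the fixed catalog and accepted with probability proportional to how likely both items are under the current adaptive Gaussian target, so that only accepted pairs are presented to the user. The catalog, the acceptance rule, and all names here are illustrative assumptions rather than a prescription.

```python
# Sketch of rejection sampling for adaptive pair selection from a fixed catalog
# (illustrative; the acceptance rule is one of many possible choices).
import numpy as np

rng = np.random.default_rng(4)
items = rng.standard_normal((10000, 3))          # placeholder fixed catalog

def log_density(v, center, sigma):
    return -np.sum((v - center) ** 2) / (2 * sigma ** 2)

def sample_pair(center, sigma, max_tries=10000):
    """Accept a uniformly drawn catalog pair with probability proportional to the
    target Gaussian density of both items (normalized by the catalog maximum)."""
    log_max = np.max(-np.sum((items - center) ** 2, axis=1)) / (2 * sigma ** 2)
    for _ in range(max_tries):
        i, j = rng.integers(len(items), size=2)
        if i == j:
            continue
        log_accept = (log_density(items[i], center, sigma)
                      + log_density(items[j], center, sigma) - 2 * log_max)
        if np.log(rng.random()) < log_accept:
            return items[i], items[j]            # only this pair is shown to the user
    raise RuntimeError("acceptance rate too low; increase sigma or max_tries")

p, q = sample_pair(center=np.zeros(3), sigma=1.0)
```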
Acknowledgments
This work was supported by grants AFOSR FA9550-14-1-0342, NSF CCF-1350616, and agift from the Alfred P. Sloan Foundation.
Appendix A. Supporting Lemmas
Lemma 11 Let $b > a$ and let $\delta = \min\{|a|, |b|\}$ and $M = \max\{|a|, |b|\}$. Then if $\Phi$ and $\phi$ respectively denote the standard normal cumulative distribution function and probability density function, we have the bounds
$$(b - a)\,\phi(M) \;\le\; \Phi(b) - \Phi(a) \;\le\; (b - a)\,\phi(\delta) \;\le\; (b - a)\,\phi(0).$$
Proof By the mean value theorem, we have for some $a < c < b$, $\Phi(b) - \Phi(a) = (b - a)\Phi'(c) = (b - a)\phi(c)$. Since $\phi(|t|)$ is monotonically decreasing in $|t|$, $\phi(c)$ is lower bounded by $\phi(M)$ and upper bounded by $\phi(\delta)$ (and also by $\phi(0)$).
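For example, with $a = 0.5$ and $b = 1$,
$$(b-a)\,\phi(1) \approx 0.121 \;\le\; \Phi(1) - \Phi(0.5) \approx 0.150 \;\le\; (b-a)\,\phi(0.5) \approx 0.176 \;\le\; (b-a)\,\phi(0) \approx 0.199.$$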
Lemma 12 Let $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$. Then,
$$\int_{\mathbb{S}^{n-1}} |\mathbf{a}^T(\mathbf{x} - \mathbf{y})| \, \mu(d\mathbf{a}) = \frac{2}{\sqrt{\pi}} \frac{\Gamma(\frac{n}{2})}{\Gamma(\frac{n+1}{2})} \|\mathbf{x} - \mathbf{y}\|.$$
Proof By spherical symmetry, we may assume $\Delta = \mathbf{x} - \mathbf{y} = [r, 0, \ldots, 0]$ for $r > 0$ without loss of generality. Then $\|\mathbf{x} - \mathbf{y}\| = r$ and $|\mathbf{a}^T(\mathbf{x} - \mathbf{y})| = a_{(1)} r = r|\cos\theta|$, where $\cos^{-1}(a_{(1)}) = \theta \in [0, \pi]$. We will use the fact (Gradshteyn and Ryzhik, 2007):
$$\int_0^{\pi/2} \cos^{m-1}\theta \, \sin^{n-1}\theta \, d\theta = \frac{1}{2} B\!\left(\frac{m}{2}, \frac{n}{2}\right) = \frac{1}{2} \frac{\Gamma(\frac{m}{2})\Gamma(\frac{n}{2})}{\Gamma(\frac{m+n}{2})}.$$
Integrating $|\cos\theta|$ in the first spherical coordinate, since the integrand is symmetric about $\frac{\pi}{2}$,
$$\int_0^{\pi} |\cos\theta| \sin^{n-2}\theta \, d\theta = 2\int_0^{\pi/2} \cos\theta \, \sin^{n-2}\theta \, d\theta = \frac{\Gamma(1)\Gamma(\frac{n-1}{2})}{\Gamma(1 + \frac{n-1}{2})} = \frac{2}{n-1}.$$
Then with the appropriate normalization, we have (using $\Gamma(1/2) = \sqrt{\pi}$)
$$\int_{\mathbb{S}^{n-1}} |\mathbf{a}^T(\mathbf{x} - \mathbf{y})| \, \mu(d\mathbf{a}) = \left(\int_0^{\pi} \sin^{n-2}\theta \, d\theta\right)^{-1} \int_0^{\pi} r|\cos\theta| \sin^{n-2}\theta \, d\theta = r\left(\frac{\Gamma(\frac{1}{2})\Gamma(\frac{n-1}{2})}{\Gamma(\frac{1}{2} + \frac{n-1}{2})}\right)^{-1} \frac{2}{n-1} = \frac{2}{\sqrt{\pi}} \frac{\Gamma(\frac{n}{2})}{\Gamma(\frac{n+1}{2})} \|\mathbf{x} - \mathbf{y}\|.$$
Appendix B. Integral Calculations for Lemma 7
Bound (23) Recall that $u_i := \mathbf{a}_i^T\mathbf{x}/\|\mathbf{x}\| \in [-1, 1]$, $y_i = u_i\|\mathbf{x}\| - b_i$, and $\tilde{y}_i = u_i\|\mathbf{x}\| - b_i + z_i$. Thus, if $f_u(u_i)$, $f_b(b_i)$, and $f_z(z_i)$ denote the probability density functions for $u_i$, $b_i$, and $z_i$, then since these random variables are independent we can write
$$\begin{aligned}
\mathbb{P}[y_i < 0 \text{ and } \tilde{y}_i > 0] &= \mathbb{P}[u_i\|\mathbf{x}\| - b_i < 0 \text{ and } u_i\|\mathbf{x}\| - b_i + z_i > 0] \\
&= \int_{-1}^{1}\int_{u_i\|\mathbf{x}\|}^{\infty}\int_{b_i - u_i\|\mathbf{x}\|}^{\infty} f_u(u_i) f_b(b_i) f_z(z_i) \, dz_i \, db_i \, du_i \\
&= \int_{-1}^{1}\int_{u_i\|\mathbf{x}\|}^{\infty} f_u(u_i) f_b(b_i) \, \mathbb{P}[z_i > b_i - u_i\|\mathbf{x}\|] \, db_i \, du_i \\
&= \int_{-1}^{1}\int_{u_i\|\mathbf{x}\|}^{\infty} f_u(u_i) f_b(b_i) \, Q\!\left(\frac{b_i - u_i\|\mathbf{x}\|}{\sigma_z}\right) db_i \, du_i,
\end{aligned}$$
where $Q(x) = \frac{1}{\sqrt{2\pi}}\int_x^{\infty} \exp(-t^2/2)\, dt$, i.e., the tail probability for the standard normal distribution. Via a similar argument we have
$$\begin{aligned}
\mathbb{P}[y_i > 0 \text{ and } \tilde{y}_i < 0] &= \mathbb{P}[u_i\|\mathbf{x}\| - b_i > 0 \text{ and } u_i\|\mathbf{x}\| - b_i + z_i < 0] \\
&= \int_{-1}^{1}\int_{-\infty}^{u_i\|\mathbf{x}\|}\int_{-\infty}^{b_i - u_i\|\mathbf{x}\|} f_u(u_i) f_b(b_i) f_z(z_i) \, dz_i \, db_i \, du_i \\
&= \int_{-1}^{1}\int_{-\infty}^{u_i\|\mathbf{x}\|} f_u(u_i) f_b(b_i) \, \mathbb{P}[z_i < b_i - u_i\|\mathbf{x}\|] \, db_i \, du_i \\
&= \int_{-1}^{1}\int_{-\infty}^{u_i\|\mathbf{x}\|} f_u(u_i) f_b(b_i) \, Q\!\left(\frac{u_i\|\mathbf{x}\| - b_i}{\sigma_z}\right) db_i \, du_i.
\end{aligned}$$
Combining these we obtain
$$\begin{aligned}
\mathbb{P}[y_i\tilde{y}_i < 0] &= \mathbb{P}[y_i < 0 \text{ and } \tilde{y}_i > 0] + \mathbb{P}[y_i > 0 \text{ and } \tilde{y}_i < 0] \\
&= \int_{-1}^{1}\int_{-\infty}^{\infty} f_u(u_i) f_b(b_i) \, Q\!\left(\frac{|u_i\|\mathbf{x}\| - b_i|}{\sigma_z}\right) db_i \, du_i \\
&= 2\int_{0}^{1}\int_{-\infty}^{\infty} f_u(u_i) f_b(b_i) \, Q\!\left(\frac{|u_i\|\mathbf{x}\| - b_i|}{\sigma_z}\right) db_i \, du_i,
\end{aligned}$$
following from the symmetry of $f_u(\cdot)$ and $f_b(\cdot)$. Using the bound $Q(x) \le \frac{1}{2}\exp(-x^2/2)$ (see (13.48) of Johnson et al., 1994), and recalling that $b_i \sim \mathcal{N}(0, 2R^2/n)$, we have that
$$\mathbb{P}[y_i\tilde{y}_i < 0] \le \frac{1}{R}\sqrt{\frac{n}{\pi}} \int_{0}^{1}\int_{-\infty}^{\infty} f_u(u_i) \exp\!\left(-\frac{(u_i\|\mathbf{x}\| - b_i)^2}{2\sigma_z^2} - \frac{n b_i^2}{4R^2}\right) db_i \, du_i.$$
B.1 Bounding $f_n$

First, we give an expression for $f_n$ for all cases $n \ge 2$, expanding upon that given in Theorem 8 and Lemma 7. We have
$$f_n(\sigma_z^2) := \begin{cases} \dfrac{1}{2}\sqrt{\dfrac{\sigma_z^2}{\sigma_z^2 + R^2}} & n = 2, \\[2ex] \min\left\{\sqrt{\dfrac{\sigma_z^2}{\sigma_z^2 + 2R^2/3}}, \; \sqrt{\dfrac{\pi}{2}}\,\dfrac{\sigma_z}{\|\mathbf{x}\|}\right\} & n = 3, \\[2ex] \dfrac{1}{2}\sqrt{\dfrac{\sigma_z^2}{\sigma_z^2 + 2R^2/n + 4\|\mathbf{x}\|^2/n}} & n \ge 4. \end{cases}$$
Below we derive this expression for the cases $n = 2$, $n = 3$, and $n \ge 4$.
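These expressions can also be sanity-checked numerically. The following small Monte Carlo sketch (assuming the model described above, with $\mathbf{a}_i$ uniform on the unit sphere, $b_i \sim \mathcal{N}(0, 2R^2/n)$, and $z_i \sim \mathcal{N}(0, \sigma_z^2)$, and with parameter values chosen arbitrarily) estimates $\mathbb{P}[y_i\tilde{y}_i < 0]$ so that it can be compared with the $n \ge 4$ expression for $f_n$.

```python
# Monte Carlo sketch: empirically estimate P[y_i * ytilde_i < 0] under the model
# above and compare with the n >= 4 expression for f_n (illustrative parameters).
import numpy as np

rng = np.random.default_rng(5)
n, R, sigma_z, trials = 4, 1.0, 0.25, 200000
x = np.zeros(n); x[0] = 0.7                      # any fixed x with ||x|| <= R

a = rng.standard_normal((trials, n))
a /= np.linalg.norm(a, axis=1, keepdims=True)    # a_i uniform on the unit sphere
u = a @ x / np.linalg.norm(x)                    # u_i = a_i^T x / ||x||
b = rng.normal(scale=np.sqrt(2 * R ** 2 / n), size=trials)   # b_i ~ N(0, 2R^2/n)
z = rng.normal(scale=sigma_z, size=trials)                    # z_i ~ N(0, sigma_z^2)

y = u * np.linalg.norm(x) - b                    # noiseless margin
y_noisy = y + z                                  # margin with pre-quantization noise
flip_rate = np.mean(y * y_noisy < 0)
bound = 0.5 * np.sqrt(sigma_z ** 2 /
                      (sigma_z ** 2 + 2 * R ** 2 / n + 4 * np.dot(x, x) / n))
print(f"empirical flip rate: {flip_rate:.4f}, f_n expression (n >= 4): {bound:.4f}")
```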
B.2 Case $n = 2$

For the special case $n = 2$, $u_i = \cos\theta_i$ where $\theta_i \in [-\pi, \pi]$ is distributed uniformly. In this case, (23) can be re-written as
$$\begin{aligned}
\mathbb{P}[y_i\tilde{y}_i < 0] &\le \frac{1}{2R}\sqrt{\frac{2}{\pi}} \int_{-\pi/2}^{\pi/2}\int_{-\infty}^{\infty} \frac{1}{2\pi} \exp\!\left(-\frac{(\|\mathbf{x}\|\cos\theta_i - b_i)^2}{2\sigma_z^2} - \frac{b_i^2}{2R^2}\right) db_i \, d\theta_i \\
&= \frac{1}{\pi R}\sqrt{\frac{1}{2\pi}} \int_{0}^{\pi/2}\int_{-\infty}^{\infty} \exp\!\left(-\frac{(\|\mathbf{x}\|\cos\theta_i - b_i)^2}{2\sigma_z^2} - \frac{b_i^2}{2R^2}\right) db_i \, d\theta_i.
\end{aligned}$$
Expanding and setting $\alpha$, $\beta$, and $\gamma$ appropriately,
$$\begin{aligned}
\mathbb{P}[y_i\tilde{y}_i < 0] &\le \frac{1}{\pi R}\sqrt{\frac{1}{2\pi}} \int_{0}^{\pi/2}\int_{-\infty}^{\infty} \exp\!\left(-\frac{\|\mathbf{x}\|^2\cos^2\theta_i}{2\sigma_z^2} + \frac{2\|\mathbf{x}\| b_i \cos\theta_i}{2\sigma_z^2} - \frac{b_i^2}{2\sigma_z^2} - \frac{b_i^2}{2R^2}\right) db_i \, d\theta_i \\
&= \frac{1}{\pi R}\sqrt{\frac{1}{2\pi}} \int_{0}^{\pi/2}\int_{-\infty}^{\infty} \exp\!\left(-\gamma\cos^2\theta_i + \beta b_i \cos\theta_i - \alpha b_i^2\right) db_i \, d\theta_i.
\end{aligned}$$
Completing the square for $b_i$,
$$\begin{aligned}
\mathbb{P}[y_i\tilde{y}_i < 0] &\le \frac{1}{\pi R}\sqrt{\frac{1}{2\pi}} \int_{0}^{\pi/2}\int_{-\infty}^{\infty} \exp\!\left(-\alpha\left(b_i - \frac{\beta\cos\theta_i}{2\alpha}\right)^2 + \frac{(\beta\cos\theta_i)^2}{4\alpha} - \gamma\cos^2\theta_i\right) db_i \, d\theta_i \\
&= \frac{1}{\pi R}\sqrt{\frac{1}{2\pi}} \int_{0}^{\pi/2} \sqrt{\frac{\pi}{\alpha}} \exp\!\left(-\left(\gamma - \frac{\beta^2}{4\alpha}\right)\cos^2\theta_i\right) d\theta_i \\
&= \frac{1}{2R}\sqrt{\frac{1}{2\alpha}} \exp\!\left(-\frac{1}{2}\left(\gamma - \frac{\beta^2}{4\alpha}\right)\right) I_0\!\left(\frac{1}{2}\left(\gamma - \frac{\beta^2}{4\alpha}\right)\right),
\end{aligned}$$
where $I_0(\cdot)$ denotes the modified Bessel function of the first kind. Since $\exp(-t)I_0(t) < 1$, by plugging back in for $\alpha$ we obtain
$$\mathbb{P}[y_i\tilde{y}_i < 0] \le \frac{1}{2R}\sqrt{\frac{1}{2\alpha}} = \frac{1}{2\sqrt{2}\,R}\sqrt{\frac{1}{\frac{1}{2\sigma_z^2} + \frac{1}{2R^2}}} = \frac{1}{2}\sqrt{\frac{\sigma_z^2}{\sigma_z^2 + R^2}}.$$
We also note that since $\exp(-t)I_0(t) < 1/\sqrt{\pi t}$, we can obtain the bound $\mathbb{P}[y_i\tilde{y}_i < 0] \le \frac{1}{\sqrt{\pi}}\frac{\sigma_z}{\|\mathbf{x}\|}$, but one can show that the previous bound will dominate this whenever $\|\mathbf{x}\| \le R$.
B.3 Case $n = 3$

For the case $n = 3$, $u_i$ is itself distributed uniformly on $[-1, 1]$. In this case we have
$$\mathbb{P}[y_i\tilde{y}_i < 0] \le \frac{1}{2R}\sqrt{\frac{3}{\pi}} \int_{0}^{1}\int_{-\infty}^{\infty} \exp\!\left(-\frac{(u_i\|\mathbf{x}\| - b_i)^2}{2\sigma_z^2} - \frac{3 b_i^2}{4R^2}\right) db_i \, du_i.$$
Expanding and setting $\alpha$, $\beta$, and $\gamma$ appropriately,
$$\begin{aligned}
\mathbb{P}[y_i\tilde{y}_i < 0] &\le \frac{1}{2R}\sqrt{\frac{3}{\pi}} \int_{0}^{1}\int_{-\infty}^{\infty} \exp\!\left(-\frac{u_i^2\|\mathbf{x}\|^2}{2\sigma_z^2} + \frac{2 u_i\|\mathbf{x}\| b_i}{2\sigma_z^2} - \frac{b_i^2}{2\sigma_z^2} - \frac{3 b_i^2}{4R^2}\right) db_i \, du_i \\
&= \frac{1}{2R}\sqrt{\frac{3}{\pi}} \int_{0}^{1}\int_{-\infty}^{\infty} \exp\!\left(-\gamma u_i^2 + \beta u_i b_i - \alpha b_i^2\right) db_i \, du_i.
\end{aligned}$$
Completing the square for $b_i$,
$$\begin{aligned}
\mathbb{P}[y_i\tilde{y}_i < 0] &\le \frac{1}{2R}\sqrt{\frac{3}{\pi}} \int_{0}^{1}\int_{-\infty}^{\infty} \exp\!\left(-\alpha\left(b_i - \frac{u_i\beta}{2\alpha}\right)^2 + \frac{(u_i\beta)^2}{4\alpha} - \gamma u_i^2\right) db_i \, du_i \\
&= \frac{1}{2R}\sqrt{\frac{3}{\pi}} \int_{0}^{1} \sqrt{\frac{\pi}{\alpha}} \exp\!\left(-u_i^2\left(\gamma - \frac{\beta^2}{4\alpha}\right)\right) du_i \\
&= \frac{1}{2R}\sqrt{\frac{3}{\alpha}}\,\frac{\sqrt{\pi}}{2}\,\frac{\operatorname{erf}\!\left(\sqrt{\gamma - \beta^2/4\alpha}\right)}{\sqrt{\gamma - \beta^2/4\alpha}}.
\end{aligned}$$
Since $\operatorname{erf}(t)/t \le 2/\sqrt{\pi}$, by plugging back in for $\alpha$ we obtain
$$\mathbb{P}[y_i\tilde{y}_i < 0] \le \frac{1}{2R}\sqrt{\frac{3}{\frac{1}{2\sigma_z^2} + \frac{3}{4R^2}}} = \sqrt{\frac{\sigma_z^2}{\sigma_z^2 + 2R^2/3}}.$$
Additionally, since $\operatorname{erf}(t) \le 1$,
$$\begin{aligned}
\mathbb{P}[y_i\tilde{y}_i < 0] &\le \frac{\sqrt{3\pi}}{4R}\left(\gamma\alpha - \beta^2/4\right)^{-1/2} = \frac{\sqrt{3\pi}}{4R}\left(\frac{\|\mathbf{x}\|^2}{2\sigma_z^2}\left(\frac{1}{2\sigma_z^2} + \frac{3}{4R^2}\right) - \frac{\|\mathbf{x}\|^2}{4\sigma_z^4}\right)^{-1/2} \\
&= \frac{\sqrt{3\pi}}{4R}\left(\frac{3\|\mathbf{x}\|^2}{8\sigma_z^2 R^2}\right)^{-1/2} = \sqrt{\frac{\pi}{2}}\,\frac{\sigma_z}{\|\mathbf{x}\|},
\end{aligned}$$
which can be tighter when $\sigma_z$ is small and $\|\mathbf{x}\|$ is large.
B.4 Case $n \ge 4$

Combining (23) with our upper bound (24) on $f_u(u_i)$, we obtain
$$\mathbb{P}[y_i\tilde{y}_i < 0] \le \frac{n}{4\sqrt{2}\,\pi R} \int_{0}^{1}\int_{-\infty}^{\infty} \exp\!\left(-\frac{(u_i\|\mathbf{x}\| - b_i)^2}{2\sigma_z^2} - \frac{n b_i^2}{4R^2} - \frac{n u_i^2}{8}\right) db_i \, du_i.$$
Expanding and setting $\alpha$, $\beta$, and $\gamma$ appropriately,
$$\begin{aligned}
\mathbb{P}[y_i\tilde{y}_i < 0] &\le \frac{n}{4\sqrt{2}\,\pi R} \int_{0}^{1}\int_{-\infty}^{\infty} \exp\!\left(-u_i^2\left(\frac{\|\mathbf{x}\|^2}{2\sigma_z^2} + \frac{n}{8}\right) + \frac{2 u_i\|\mathbf{x}\| b_i}{2\sigma_z^2} - \frac{b_i^2}{2\sigma_z^2} - \frac{n b_i^2}{4R^2}\right) db_i \, du_i \\
&= \frac{n}{4\sqrt{2}\,\pi R} \int_{0}^{1}\int_{-\infty}^{\infty} \exp\!\left(-\gamma u_i^2 + \beta u_i b_i - \alpha b_i^2\right) db_i \, du_i.
\end{aligned}$$
Completing the square for $b_i$,
$$\begin{aligned}
\mathbb{P}[y_i\tilde{y}_i < 0] &\le \frac{n}{4\sqrt{2}\,\pi R} \int_{0}^{1}\int_{-\infty}^{\infty} \exp\!\left(-\alpha\left(b_i - \frac{u_i\beta}{2\alpha}\right)^2 + \frac{(u_i\beta)^2}{4\alpha} - \gamma u_i^2\right) db_i \, du_i \\
&= \frac{n}{4\sqrt{2}\,\pi R} \int_{0}^{1} \sqrt{\frac{\pi}{\alpha}} \exp\!\left(-u_i^2\left(\gamma - \frac{\beta^2}{4\alpha}\right)\right) du_i \\
&= \frac{n}{4\sqrt{2\pi\alpha}\,R}\,\frac{\sqrt{\pi}}{2}\,\frac{\operatorname{erf}\!\left(\sqrt{\gamma - \beta^2/4\alpha}\right)}{\sqrt{\gamma - \beta^2/4\alpha}}.
\end{aligned}$$
Since $\operatorname{erf}(t) \le 1$, we have
$$\begin{aligned}
\mathbb{P}[y_i\tilde{y}_i < 0] &\le \frac{n}{8\sqrt{2}\,R}\left(\gamma\alpha - \beta^2/4\right)^{-1/2} = \frac{n}{8\sqrt{2}\,R}\left(\left(\frac{\|\mathbf{x}\|^2}{2\sigma_z^2} + \frac{n}{8}\right)\left(\frac{1}{2\sigma_z^2} + \frac{n}{4R^2}\right) - \frac{\|\mathbf{x}\|^2}{4\sigma_z^4}\right)^{-1/2} \\
&= \frac{1}{2}\left(\frac{32 R^2}{n^2}\left(\frac{n\|\mathbf{x}\|^2}{8\sigma_z^2 R^2} + \frac{n}{16\sigma_z^2} + \frac{n^2}{32 R^2}\right)\right)^{-1/2} \\
&= \frac{1}{2}\sqrt{\frac{\sigma_z^2}{\sigma_z^2 + 2R^2/n + 4\|\mathbf{x}\|^2/n}}.
\end{aligned}$$
We also note that since $\operatorname{erf}(t)/t \le 2/\sqrt{\pi}$, it is also possible to obtain the bound
$$\mathbb{P}[y_i\tilde{y}_i < 0] \le \frac{1}{2}\sqrt{\frac{n}{2\pi}}\sqrt{\frac{\sigma_z^2}{\sigma_z^2 + 2R^2/n}}.$$
However, this bound can only be tighter when $\|\mathbf{x}\|$ is small and when $\frac{n}{2\pi} < 1$ (i.e., for $n \le 6$). Given this narrow range of applicability, we omit this from the formal statement of the result.
References
S. Agarwal, J. Wills, L. Cayton, G. Lanckriet, D. Kriegman, and S. Belongie. Generalized non-metric multidimensional scaling. In Proc. Int. Conf. Art. Intell. Stat. (AISTATS), San Juan, Puerto Rico, Mar. 2007.

N. Ailon. Active learning ranking from pairwise preferences with almost optimal query complexity. In Proc. Adv. in Neural Processing Systems (NIPS), Granada, Spain, Dec. 2011.

E. Arias-Castro et al. Some theory for ordinal embedding. Bernoulli, 23(3):1663–1693, 2017.

R. Baraniuk, S. Foucart, D. Needell, Y. Plan, and M. Wootters. Exponential decay of reconstruction error from binary measurements of sparse signals. IEEE Trans. Inform. Theory, 63(6):3368–3385, 2017.

D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 1999.

R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.

A. Brieden, P. Gritzmann, R. Kannan, V. Klee, L. Lovász, and M. Simonovits. Deterministic and randomized polynomial-time approximation of radii. Mathematika, 48(1-2):63–105, 2001.

R. Buck. Partition of space. Amer. Math. Monthly, 50(9):541–544, 1943.

E. Candès and B. Recht. Exact matrix completion via convex optimization. Found. Comput. Math., 9(6):717–772, 2009.

P.-H. Chen, C.-J. Lin, and B. Schölkopf. A tutorial on ν-support vector machines. Appl. Stoch. Models Bus. Ind., 21(2):111–136, 2005.

C. Coombs. Psychological scaling without a unit of measurement. Psych. Rev., 57(3):145–158, 1950.

M. Davenport. Lost without a compass: Nonmetric triangulation and landmark multidimensional scaling. In Proc. IEEE Int. Work. Comput. Adv. Multi-Sensor Adaptive Processing (CAMSAP), Saint Martin, Dec. 2013.

H. David. The Method of Paired Comparisons. Charles Griffin & Company Limited, London, UK, 1963.

B. Dubois. Ideal point versus attribute models of brand preference: A comparison of predictive validity. Adv. Consumer Research, 2(1):321–333, 1975.

B. Eriksson. Learning to Top-K search using pairwise comparisons. In Proc. Int. Conf. Art. Intell. Stat. (AISTATS), Scottsdale, AZ, Apr. 2013.

M. Giaquinta and G. Modica. Mathematical Analysis: An Introduction to Functions of Several Variables. Birkhäuser, New York, NY, 2010.

I. Gradshteyn and I. Ryzhik. Table of Integrals, Series, and Products. Academic Press, Burlington, MA, 2007.

L. Grenié and G. Molteni. Inequalities for the beta function. Math. Inequal. Appl., 18(4):1427–1442, 2015.

L. Jacques, J. Laska, P. Boufounos, and R. Baraniuk. Robust 1-bit compressive sensing via binary stable embeddings of sparse vectors. IEEE Trans. Inform. Theory, 59(4):2082–2102, 2013.

L. Jain, K. G. Jamieson, and R. Nowak. Finite sample prediction and recovery bounds for ordinal embedding. In Advances in Neural Information Processing Systems, pages 2711–2719, 2016.

K. Jamieson and R. Nowak. Active ranking using pairwise comparisons. In Proc. Adv. in Neural Processing Systems (NIPS), Granada, Spain, Dec. 2011a.

K. Jamieson and R. Nowak. Low-dimensional embedding using adaptively selected ordinal data. In Proc. Allerton Conf. Communication, Control, and Computing, Monticello, IL, Sept. 2011b.

N. Johnson, S. Kotz, and N. Balakrishnan. Continuous Univariate Distributions, volume 1. Wiley, New York, NY, 1994.

M. Kleindessner and U. von Luxburg. Uniqueness of ordinal embedding. In Conference on Learning Theory, pages 40–67, 2014.

K. Knudson, R. Saab, and R. Ward. One-bit compressive sensing with norm estimation. IEEE Trans. Inform. Theory, 62(5):2748–2758, 2016.

M. Konoshima and Y. Noma. Hyperplane arrangements and locality-sensitive hashing with lift. arXiv preprint arXiv:1212.6110, 2012.

Y. Lu and S. Negahban. Individualized rank aggregation using nuclear norm regularization. In Proc. Allerton Conf. Communication, Control, and Computing, Monticello, IL, Sept. 2015.

A. Massimino and M. Davenport. Binary stable embedding via paired comparisons. In Proc. IEEE Work. Stat. Signal Processing, Palma de Mallorca, Spain, June 2016.

A. Massimino and M. Davenport. The geometry of random paired comparisons. In Proc. IEEE Int. Conf. Acoust., Speech, and Signal Processing (ICASSP), New Orleans, LA, Mar. 2017.

A. Maydeu-Olivares and U. Böckenholt. Modeling preference data. In R. Millsap and A. Maydeu-Olivares, editors, The SAGE Handbook of Quantitative Methods in Psychology, chapter 12, pages 264–282. SAGE Publications Ltd., London, UK, 2009.

G. Miller. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psych. Rev., 63(2):81, 1956.

R. Nowak. Noisy generalized binary search. In Advances in Neural Information Processing Systems, pages 1366–1374, 2009.

S. Oh, K. Thekumparampil, and J. Xu. Collaboratively learning preferences from ordinal data. In Proc. Adv. in Neural Processing Systems (NIPS), Montréal, Québec, Dec. 2014.

M. O'Shaughnessy and M. Davenport. Localizing users and items from paired comparisons. In Proc. IEEE Int. Work. Machine Learning for Signal Processing (MLSP), Vietri sul Mare, Salerno, Italy, Sept. 2016.

D. Park, J. Neeman, J. Zhang, S. Sanghavi, and I. Dhillon. Preference completion: Large-scale collaborative ranking from pairwise comparisons. In Proc. Int. Conf. Machine Learning (ICML), Lille, France, July 2015.

F. Perez-Cruz, J. Weston, D. Herrmann, and B. Schölkopf. Extension of the ν-SVM range for classification. NATO Science Series Sub Series III Computer and Systems Sciences, 190:179–196, 2003.

Y. Plan and R. Vershynin. One-bit compressed sensing by linear programming. Communications on Pure and Applied Mathematics, 66(8):1275–1297, 2013a.

Y. Plan and R. Vershynin. Robust 1-bit compressed sensing and sparse logistic regression: A convex programming approach. IEEE Transactions on Information Theory, 59(1):482–494, 2013b.

Y. Plan and R. Vershynin. Dimension reduction by random hyperplane tessellations. Discrete & Computational Geometry, 51(2):438–461, 2014.

F. Qi. Bounds for the ratio of two gamma functions. J. Inequal. Appl., 2010(1):493058, 2010.

F. Radlinski and T. Joachims. Active exploration for learning rankings from clickthrough data. In Proc. ACM SIGKDD Int. Conf. on Knowledge, Discovery, and Data Mining (KDD), San Jose, CA, Aug. 2007.

J. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Proc. Int. Conf. Machine Learning (ICML), Bonn, Germany, Aug. 2005.

B. Schölkopf and A. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, 2001.

N. Shah, S. Balakrishnan, J. Bradley, A. Parekh, K. Ramchandran, and M. Wainwright. Estimation from pairwise comparisons: Sharp minimax bounds with topology dependence. J. Machine Learning Research, 17(58):1–47, 2016.

M. Spruill. Asymptotic distribution of coordinates on high dimensional spheres. Elec. Comm. in Prob., 12:234–247, 2007.

L. Thurstone. A law of comparative judgment. Psych. Rev., 34(4):273, 1927.

F. Wauthier, M. Jordan, and N. Jojic. Efficient ranking from pairwise comparisons. In Proc. Int. Conf. Machine Learning (ICML), Atlanta, GA, June 2013.