68W01, 68W05, 68W40, 68P10
Peter N. Yianilos
July 14, 1999 and, in revised form, June 18, 2002.
Netrics Inc., 707 State Road #212, Princeton, NJ 08540
Nearest neighbor search, kd-tree, curse of dimensionality
We consider the problem of nearest neighbor search in the Euclidean hypercube with uniform distributions, and the additional natural assumption that the nearest neighbor is located within a constant fraction of the maximum interpoint distance in this space, i.e. within distance of the query.
We introduce the idea of aggressive pruning and give a family of practical algorithms, an idealized analysis, and describe experiments. Our main result is that search complexity measured in terms of -dimensional inner product operations, is i) strongly sublinear with respect to the data set size for moderate , ii) asymptotically, and as a practical matter, independent of dimension.
Given a random data set, a random query within distance of some database element, and a randomly constructed data structure, the search succeeds with a specified probability, which is a parameter of the search algorithm. On average a search performs distance computations where is the number of points in the database, and is calculated in our analysis. Linear and near-linear space structures are described, and our algorithms and analysis are free of large hidden constants, i.e. the algorithms perform far less work than exhaustive search - both in theory and practice.
2002American Mathematical Society
Finding nearest neighbors in Euclidean spaces of very low dimension is theoretically fast, and practical, using the notion of a Voronoi diagram [Aur91]. In moderate dimension, or in general metric spaces with intrinsically moderate dimension, recursive projection-decomposition techniques such as kd-trees [FBS75,FBF77,BF79,Ben80] and vantage-point techniques [Uhl91b,Uhl91a,RP92,Yia93] for general metric spaces of intrinsically low dimension, are effective.
As dimension grows large, these tree techniques perform significantly better than brute-force search only if the number of dataset points grows exponentially in . Or, in the case of Voronoi diagrams, if space increases exponentially.
The motivation for this work is the observation that, in practice, one is usually interested in nearest neighbors only if they are somewhat close to the query. The main contribution of this paper is the algorithmic idea of aggressive pruning and its analysis for the uniformly distributed hypercube. In this setting, with a natural definition of somewhat close, we show that the expected time complexity of finding nearest neighbors is invariant with respect to dimension3 -- and the space cost4 is independent of , and linear in .
In a Euclidean hypercube the maximum distance between two points grows with . By somewhat close we mean within a neighborhood whose radius is a constant fraction of this distance. A parameter controls the probability that a search will locate the nearest neighbor within this search domain. For each choice of and we calculate an exponent such that the search will perform on average distance computations, and this dominates the work performed. Notice that is independent of .
The practical significance of our work is that search time is strongly sublinear given moderately large values for and acceptable success probabilities. For example, searching points uniformly distributed in with our experimental software, given , requires on average distance computations, and succeeds with probability . In this example from our analysis. Arbitrarily high success probabilities can be obtained at the expense of distance computations.
We remark that this work was motivated by the author's recent work of [Yia99], where the objective is to build a data structure that provides worst-case sublinear-time radius-limited nearest neighbor search (independent of query) for a given dataset. With uniform distributions in Euclidean space, the resulting structures support search neighborhood of only size, in contrast with those of this paper that scale linearly with the maximum interpoint distance.
Early work in nearest neighbor search is surveyed in [Das91]. There is a large literature on the search problem, much of it elaborating on a single fact: that certain projections from a metric space to have the property that projected distances are dominated by those in the original space.
The two most important such projections are i) inner product with a unit vector in Euclidean space, and ii) distance from a chosen vantage point.5 These ideas were recognized early on in work including [BK73,Fuk75,Fuk90,FS82,Sha77].
Taking the inner product with a canonical basis element in Euclidean space leads to the well-known kd-tree of Friedman and Bentley [FBS75,FBF77,BF79,Ben80]. They recursively divide a pointset in by projecting each element onto a distinguished coordinate. Improvements, distribution adaptation, and incremental searches, are described in [EW82], [KP86], and [Bro90] respectively.
The search tree we build has essentially the same structure as a kd-tree built given randomly pretransformed data6, and constrained to use the same cut coordinate for all nodes at a given tree level. Our criterion for pruning branches during search, and its analysis, are the primary contributions of this paper and distinguish our methods from kd-tree search.
More recently, the field of computational geometry has yielded many interesting results such as those of [Vai89,Cla88,Cla87,FMT92] and earlier [DL76].
Very recently a number of papers have appeared striving for more efficient algorithms with worst-case time bounds for the approximate nearest neighbor problem [AMN$^+$94,Kle97,KOR98,IM98]. These exploit properties of random projections beyond the simple projection distance dominance fact mentioned above, and additional ideas to establish worst case bounds. Our work may be viewed as exploiting the fact that random projections of uniformly random data in a hypercube, and of neighborhood balls of radius proportional to , both have constant variance with respect to . See also [Cla97] for very recent work on search in general metric spaces.
Several of the papers mentioned above include interesting theoretical constructions that trade polynomial space, and in some cases expensive preprocessing, for fast performance in the worst case. We remark that to be useful in practice, a nearest neighbor algorithm must require very nearly linear space -- a stringent requirement. As datasets grow, even low-degree polynomial space becomes rapidly unacceptable.
For completeness, early work dealing with two special cases should be mentioned. Retrieval of similar binary keys is considered by Rivest in [Riv74] and the setting is the focus of [Yun76]. Also see [BM80] for worst case data structures for the range search problem. Finally, recent works [BOR99] and [CCGL99] establish nontrivial lower bounds for nearest neighbor search.
In the following section we give construction and search algorithms specialized for our uniform setting. Section 3 gives a concrete analysis of these algorithms that include calculations of the applicable search time complexity exponent, and failure probability. Experiments are presented in section 4, which confirm in practice the dimensional invariance established by analysis. Finally, some directions for further work are mentioned in section 5.
A search tree is built with the set of data points as its leaves. It has essentially the same structure as a kd-tree built on data that has first been transformed to a random coordinate system. Construction time is easily seen to be and space is linear.
Construction proceeds recursively. Each interior node has as input a list of points. The root's list is the entire set. Associated with each interior node is a randomly selected unit vector , where denotes level (distance from the root). The number of such vectors is then equal to the depth of tree minus one (because there is no vector associated with a leaf). This set is constructed so as to be orthonormal. First, random vectors are drawn, then they are orthonormalized in the usual way (Gram-Schmidt).
Construction of a node consists of reading its input list, computing the inner product of each element with the node's associated , and adding the element to a left or right output list depending on its sign. In general the dividing point is chosen more intelligently, e.g. by computing the median of the inner product values. But in our uniform setting we just use zero for simplicity. Left and right children are then generated recursively. If a node's input list consists of a single element, then that node becomes a leaf and stores the element within it.
The search is parameterized by a query , a value giving the proportional size of the search domain, and a probability that is related to the success rate we expect.
The inner product of and each is first computed. Next the positive threshold distance is computed, where denotes the cumulative distribution function for a normal density with the variance indicated by subscript.
Search proceeds recursively from the root. For a node at level the value is examined. If it is less than then the left child is recursively explored. If it exceeds then the right child is explored. Notice that when both children are explored. When a leaf is encountered, the distance is computed between the element it contains and .
This decision rule is centered at zero because of our particular uniform setting, but is easily translated to an arbitrary cut point, e.g. the median of the projected values.
After each distance computation is performed the proportion is computed. If smaller than , then is reduced and recomputed.
This concludes the description of our search algorithm and we now briefly discuss related issues and extensions.
An important idea in kd-tree search involves the computation of the minimum distance from the query to the subspace corresponding to a node to be explored. If this distance grows beyond the radius of interest, the node is pruned. We do not, however, include it in our analysis or experimental implementation because in our high dimensional setting, in the absence of exponentially many data elements, this idea has vanishingly little effect. Intuitively, this is because the search tree is not nearly deep enough for the minimum distance to grow larger than the search radius.
The analogue of this kd-tree idea in our setting is an alternative version of our algorithm above that slightly reduces while descending through interior nodes to reflect the fact that the distribution of data elements within a ball about the query is no longer uniform. But, again, this is a second order effect.
Finally we remark that the -cutoff approach taken above might be replaced with an entirely probabilistic pruning scheme that passes probabilities from our analysis down to each child during search. The probabilities upward to the root are then multiplied and search continues downward until the result falls below a specified threshold.
We assume both data points and queries are uniformly distributed within the hypercube . The Euclidean distance metric applies and the maximum distance between two points in this space is . We consider the problem of finding the nearest neighbor of a query within some distance that is a constant proportion of the maximum interpoint distance, i.e. with .
Any unit vector gives rise to a projection mapping into defined by . It is immediate that distances in the range of this projection are dominated by distances in the domain. It is this fact that kd-trees exploit to prune branches during branch-and-bound search.
If represents the distance to the nearest neighbor encountered so far during a search, then this fact implies that every member of the ball of radius centered at the query maps under any into the interval . So kd-tree search may confidently disregard any data points that project outside of this interval. Unlike the kd-tree, which finds nearest neighbors with certainty, we will consider randomized constructions that succeed with some specified probability.
Since scales up linearly with , the interval grows too, and soon the kd-tree can confidently prune almost nothing, and performs a nearly exhaustive search.
But in our uniformly random setting we will show that the ball about projects to a distribution about having constant variance. This is why it is possible to continue to probabilistically prune just as effectively even as . Since dimension does not affect our ability to prune, we have lifted the curse of dimensionality in our setting.
We now proceed with the description and idealized analysis of our algorithm with the following proposition, which establishes two elementary but key dimensional invariants
Now fix some unit vector and consider the query's projection . Then by Proposition 3.1, and ignoring hypercube corner effects, the projection of the data points within the domain of search are distributed about with variance no greater than .
Because distances between projected points are dominated by their original distance it is clear that for any in the search domain.
So any data points with projections farther than from can be confidently ruled out during search. It is this hard pruning that kd-trees (and many related structures) exploit to avoid considering every point during nearest neighbor search.
But notice that this pruning cutoff distance with while the projection variance remains constant. As a result hard pruning is asymptotically ineffective, and this gives rise to the curse of dimensionality for nearest neighbor search using kd-trees in our setting.
Next observe that as a consequence of the central limit theorem the distributions of Proposition 3.1 are asymptotically normal -- and we remark that as a practical matter this is a good approximation; even for moderate dimension. So from this point on in our analysis we will simply assume normality, i.e. that is sufficiently large so that its deviation from normality is negligible.
We then arrive at one of the main observations of this paper: that the nearest neighbor's projection is within constant distance of with arbitrarily high and easily calculated probability depending only on this constant, and not on .
Recall that the tree we build recursively bisects the set of data points using a randomly selected by separating its projection into left and right branches at a cut point near the median. Given a query we prune one of these branches if is at least some cutoff distance from . We refer to this idea as aggressive pruning.
Again, note that this cutoff distance is constant despite the fact that the absolute size of our search domain is expanding with as . We can now establish a single tree's expected search complexity and success probability.
Notice that the exponent of in proposition 3.3 is within . Also, the requirement that is not very restrictive since otherwise the number of data points is exponential in and earlier techniques can be used to efficiently locate nearest neighbors.
The probability estimate of Proposition 3.3 is extremely conservative. The actual failure probabilities are much smaller because: i) our analysis has implicitly assumed that the query always projects to a point exactly from the cut point. This is the worst case. If the distance is smaller, then nothing is pruned and no error can be made; and if the distance is larger, the effective cutoff distance is effectively increased; and ii) we have assumed that the nearest neighbor is always just within the domain of search. In reality queries will be distributed with this domain.
Even without considering these factors our calculations suggest that efficient search is possible for moderately large values of .
Table 3 gives exponent values for selected values of and . For example, if and then the exponent is . Given data points our results state that no more than inner product operations will be performed, and that the nearest neighbor will be located with probability . Choosing corresponds to inner products with the probability of success increasing to . All of these calculations are asymptotically independent of dimension.
We continue this example to illustrate the use of forests to improve performance in some settings. Later we will give a theoretical consequence of this idea. Consider two independent trees with . That is, the choice of random unit vectors is independent. Both trees are searched for each query. We then expect a failure rate of , and inner products are performed.8 Now this failure rate is considerably better than the resulting from a single tree using , and far fewer inner products are computed, but space consumption is doubled. This example illustrates the space-time design tradeoff that arises when applying our ideas in practice.
The natural generalization of this idea leads to a forest of independent trees. Recall that a failure rate of a single tree grows, albeit slowly, with . By using enough trees one can force the forest's failure rate to zero as increases while preserving sublinear search times; at the expense of polynomial space to pay for the additional trees. To analyze this let denote the tree's depth and observe that gives the probability of failure for a forest of distinct random trees. It can be shown that if for any , then this probability tends to zero as (and therefore ) increases9. Since the number of trees in terms of . Search time (the number of inner product operations) also increases by a factor of . Space increases from linear to . Finally, to maintain independence we must increase the lower bound on to .
It is apparent from table 3 that this idea works for many values of R and p. For example, if and then the original exponent increases by only - far less than the table's precision. But it does not work for arbitrary and . For large values of the exponent is so close to that the added complexity makes the total superlinear.
Note that space is required to store the data elements themselves, but only space is needed for the interior tree nodes, which point to data elements. So, in practice, we can afford trees while remaining linear space. In practice this means that in high dimension one can easily afford many trees without materially affecting space consumption. In theory it means that if is assumed to grow as an appropriate small power of as , then the entire forest occupies linear space.
Returning to our choice of the hypercube, we observe that i) All of our arguments apply immediately to after compensating for the shifted mean, and ii) that our arguments apply as well to the -dimensional hypersphere. To see this note that the dot product of a random unit vector and a random database element has variance , and the maximum interpoint distance is constant, so the projection of a proportional neighborhood balls also falls with .
Finally we remark that as an alternative to forests, one can build a single tree in which each level is overlapped, i.e. items near the center are stored in both the left and right branches. This effectively reduces and with it search time - at the expense of polynomial space (given a constant fraction of overlap at each level).
To confirm our analysis we wrote an ANSI-C program to build and search trees. It accepts , , , as arguments, in addition to the number of random queries to test, and a random seed.
The data points are distributed uniformly in . Queries are generated by choosing a data point at random, then a random unit vector , and forming the sum - so that the query lies just inside of the search domain. We use . This ensures that a nearest neighbor exists within the domain.
An indexed set of projection vectors is generated by starting with random vectors and orthonormalizing them in the standard way. Each level in the tree uses the corresponding vector in this set as its projector. So the size of the set is , and in practice is because the resulting trees are not perfectly balanced. For simplicity zero is always used as the cut point during tree construction.
In order to allow us to simultaneously explore large and without exceeding available RAM, we represent the data points to be searched using a pseudorandom generator seed that suffices to construct them. As a result we can easily study and higher in almost arbitrarily high dimension.
Figure 1 illustrates that search complexity measured in the number of inner product operations performed is very nearly dimension-invariant -- confirming our analysis. So total work scales up only linearly with dimension with the complexity of an inner product operation.
Note that for simplicity we regard an inner product operation and a Euclidean distance computation as having identical complexity. Also, the table's results exclude the roughly inner products required to compute the query's projection at each level of the tree.
Recall that our analysis of failure rate was quite conservative and experiments confirm this. Actual failure rates for every case in Figure 1 were somewhat lower than those from analysis. For example, in the case, analysis predicts that the search will succeed with probability , and the actual value is . Note that one reason for using the moderate value in our study is that we have confirmed that values far closer to zero result in no observable experimental error given the limited number of queries we have time to perform.
We compare performance in our setting with kd-tree search [FBS75,FBF77,BF79,Ben80]. We are interested in neighbors within a specified distance of the query. To reproduce this setting, kd-tree search is initialized as though an element at the specified distance has already been encountered.
A recent kd-tree implementation [MA99] is modified for our experiments. Its core routines are altered to support radius-limited search as described above, and the distribution's sample program is adapted to report the average number of distance computations performed in response to each query. As in our earlier experiments, queries are generated by starting at a database element, and moving the specified distance away in a random direction. So every query in our experiment locates a nearest neighbor within the radius of interest. As a self-check the sample program is further modified to verify this condition. After these modifications, a direct comparison is possible between our method and kd-tree search.
From our analysis it is clear that kd-tree search will not exhibit the same dimensional invariance, and figure 2 confirms this. Moreover, search time grows so rapidly for the cases that the tree is essentially ineffective by dimension . By contrast (figure 1) our method remains effective for arbitrarily high dimension. Recall, however, that kd-tree search is guaranteed to find the nearest neighbor within in the domain, while our method will fail with some small probability.
A primary motivation for this paper was our discovery in [Yia99] that kd-trees, and a related general structure called a vantage point forest, do exhibit search complexity that is invariant with respect to dimension for a domain of constant radius -- and substantial savings over exhaustive search are realized only for disappointingly small radii. The experiment reported in figure 3 confirms this, and illustrates the fundamentally different asymptotic behavior (with respect to ) of kd-tree search and our aggressive pruning approach. The flat graphs of figure 1 are in the presence of a search radius that grows with , where those of 3 assume a search radius that is constant with respect to . For example, by dimension a domain of unity radius generates a nearly exhaustive search. By contrast figure 1 shows that for and (corresponding to a radius of ), our search visits only roughly of the leaf nodes.
We conclude our discussion of experiments by describing a single example of our method in greater detail that pushes upward to , up to , and the number of queries up to . The dimension is . Here, analysis predicts success probability and we observe . The tree's depth is and leaves are visited on average during a search. The predicted value is . The experiment required seconds on a 400Mhz Pentium II processor.
We hopes this work takes us closer to practical and useful nearest neighbor search in high dimension. In addition to sharpening our analysis for the uniform case, future theoretical work might target distribution-independent bounds for approximate nearest neighbor search within an -domain. While generalizing we expect that the definition of an -domain will have to change, and that it may be necessary to move to the approximate nearest neighbor framework where the objective is to find neighbors within a factor of the closest.
Also, the work of this paper motivates us to consider techniques for general metric spaces and arbitrary distributions that learn the projection distributions and from examples -- adapting to the problem domain.
Ultimately we envision a high-level procedure that like the qsort() function from the standard Unix library, views its underlying database through a layer of abstraction. In the case of sorting, a comparison function is provided. For nearest neighbor search projection and distance computation must be supported.
Sorting is, however, a much simpler problem in as much as practical algorithms exist with acceptable worst case time and space complexity -- independent of the dataset. For nearest neighbor search we suggest that a general tool must be flexible and ideally support:
If the projectors are known to be independent (as in orthogonal Euclidean projections) then search should exploit this. Most generally a system might try to learn patterns of independence from examples.
Special support for discrete-valued distance functions is also desirable.
We hope that the ideas and results of this paper brings us closer to the development of such a tool.
The author thanks Joe Kilian and Warren Smith for helpful discussions.
This document was generated using the LaTeX2HTML translator Version 2K.1beta (1.47)
Copyright © 1993, 1994, 1995, 1996,
Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999, Ross Moore, Mathematics Department, Macquarie University, Sydney.
The command line arguments were:
latex2html -split 0 -image_type gif -nonext_page_in_navigation -noprevious_page_in_navigation -up_url ../main.html -up_title 'Publication homepage' -accent_images textrm -numbered_footnotes vp3.tex
The translation was initiated by Peter N. Yianilos on 2002-06-24