Mathematics Subject Classification: 68W01, 68W05, 68W40, 68P10
Peter N. Yianilos
Received by the editors July 14, 1999 and, in revised form, June 18, 2002.
Netrics Inc., 707 State Road #212, Princeton, NJ 08540
Key words and phrases: nearest neighbor search, kd-tree, curse of dimensionality
We consider the problem of nearest neighbor search in the Euclidean hypercube with uniform distributions, together with the additional natural assumption that the nearest neighbor is located within a constant fraction R of the maximum interpoint distance in this space, i.e. within a distance of the query that grows in proportion to √d.
We introduce the idea of aggressive pruning and give a family of practical algorithms, an idealized analysis, and a description of experiments. Our main result is that search complexity, measured in terms of d-dimensional inner product operations, is i) strongly sublinear with respect to the data set size n for moderate R, and ii) asymptotically, and as a practical matter, independent of dimension.
Given a random data set, a random query within the search radius of some database element, and a randomly constructed data structure, the search succeeds with a specified probability, which is a parameter of the search algorithm. On average a search performs n^ρ distance computations, where n is the number of points in the database and the exponent ρ < 1 is calculated in our analysis. Linear and near-linear space structures are described, and our algorithms and analysis are free of large hidden constants, i.e. the algorithms perform far less work than exhaustive search, both in theory and in practice.
© 2002 American Mathematical Society
Finding nearest neighbors in Euclidean spaces of very low dimension is theoretically fast, and practical, using the notion of a Voronoi diagram [Aur91]. In moderate dimension, recursive projection-decomposition techniques such as kd-trees [FBS75,FBF77,BF79,Ben80] are effective, as are vantage-point techniques [Uhl91b,Uhl91a,RP92,Yia93] for general metric spaces of intrinsically moderate dimension.
As dimension grows large, these tree techniques perform significantly better than brute-force search only if the number of dataset points n grows exponentially in d -- or, in the case of Voronoi diagrams, only if space increases exponentially.
The motivation for this work is the observation that, in practice, one is usually interested in nearest neighbors only if they are somewhat close to the query. The main contribution of this paper is the algorithmic idea of aggressive pruning and its analysis for the uniformly distributed hypercube. In this setting, with a natural definition of somewhat close, we show that the expected time complexity of finding nearest neighbors is invariant with respect to dimension, and that the space cost is independent of d and linear in n.
In a Euclidean hypercube the maximum distance between two points grows with √d. By somewhat close we mean within a neighborhood whose radius is a constant fraction R of this distance.
A parameter p controls the probability that a search will locate the nearest neighbor within this search domain. For each choice of R and p we calculate an exponent ρ such that the search will perform on average n^ρ distance computations, and this dominates the work performed. Notice that ρ is independent of d.
The practical significance of our work is that search time is strongly sublinear given moderate R and acceptable success probabilities. For example, searching a large set of uniformly distributed points with our experimental software requires, on average, only a small fraction of the distance computations performed by exhaustive search, while succeeding with high probability; the corresponding exponent ρ comes directly from our analysis. Arbitrarily high success probabilities can be obtained at the expense of additional distance computations.
We remark that this work was motivated by the author's recent work of [Yia99], where the objective is to build a data structure that provides worst-case sublinear-time radius-limited nearest neighbor search (independent of the query) for a given dataset. With uniform distributions in Euclidean space, the resulting structures support search neighborhoods of only constant size, in contrast with those of this paper, which scale linearly with the maximum interpoint distance.
Early work in nearest neighbor search is surveyed in [Das91]. There is a large literature on the search problem, much of it elaborating on a single fact: that certain projections from a metric space to the real line have the property that projected distances are dominated by those in the original space.
The two most important such projections are i) inner product with a unit vector in Euclidean space, and ii) distance from a chosen vantage point. These ideas were recognized early on in work including [BK73,Fuk75,Fuk90,FS82,Sha77].
Taking the inner product with a canonical basis element in Euclidean space leads to the well-known kd-tree of Friedman and Bentley [FBS75,FBF77,BF79,Ben80]. They recursively divide a point set in Euclidean space by projecting each element onto a distinguished coordinate. Improvements, distribution adaptation, and incremental searches are described in [EW82], [KP86], and [Bro90] respectively.
The search tree we build has essentially the same structure as a kd-tree built given randomly pretransformed data, and constrained to use the same cut coordinate for all nodes at a given tree level. Our criterion for pruning branches during search, and its analysis, are the primary contributions of this paper and distinguish our methods from kd-tree search.
More recently, the field of computational geometry has yielded many interesting results such as those of [Vai89,Cla88,Cla87,FMT92] and earlier [DL76].
Very recently a number of papers have appeared striving for more efficient algorithms with worst-case time bounds for the approximate nearest neighbor problem [AMN$^+$94,Kle97,KOR98,IM98]. These exploit properties of random projections beyond the simple projection distance-dominance fact mentioned above, along with additional ideas, to establish worst-case bounds. Our work may be viewed as exploiting the fact that random projections of uniformly random data in a hypercube, and of neighborhood balls of radius proportional to √d, both have constant variance with respect to d. See also [Cla97] for very recent work on search in general metric spaces.
Several of the papers mentioned above include interesting theoretical constructions that trade polynomial space, and in some cases expensive preprocessing, for fast performance in the worst case. We remark that to be useful in practice, a nearest neighbor algorithm must require very nearly linear space -- a stringent requirement. As datasets grow, even low-degree polynomial space becomes rapidly unacceptable.
For completeness, early work dealing with two special cases should be mentioned. Retrieval of similar binary keys is considered by Rivest in [Riv74], and a second special case is the focus of [Yun76]. Also see [BM80] for worst-case data structures for the range search problem. Finally, the recent works [BOR99] and [CCGL99] establish nontrivial lower bounds for nearest neighbor search.
In the following section we give construction and search algorithms specialized for our uniform setting. Section 3 gives a concrete analysis of these algorithms, including calculation of the applicable search-time complexity exponent and of the failure probability. Experiments are presented in Section 4, confirming in practice the dimensional invariance established by the analysis. Finally, some directions for further work are mentioned in Section 5.
A search tree is built with the set of data points as its leaves. It has essentially the same structure as a kd-tree built on data that has first been transformed to a random coordinate system. Construction requires O(n log n) inner product operations, and space is linear.
Construction proceeds recursively. Each interior node has as input a list of points; the root's list is the entire set. Associated with each interior node is a randomly selected unit vector v_l, where l denotes level (distance from the root). The number of such vectors is then equal to the depth of the tree minus one (because there is no vector associated with a leaf). This set is constructed so as to be orthonormal: first random vectors are drawn, then they are orthonormalized in the usual way (Gram-Schmidt).
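For concreteness, the following is a minimal sketch in C of this step: one random vector is drawn per level and the set is orthonormalized with classical Gram-Schmidt. The function names are ours, and drawing the initial coordinates uniformly (rather than from a Gaussian) is an illustrative choice; the paper does not prescribe a particular generator.

```c
#include <stdlib.h>
#include <math.h>

/* Sketch: draw `levels` random vectors in R^d and make them orthonormal
 * with classical Gram-Schmidt, as described above.  `vecs` must hold
 * levels*d doubles, and we assume levels <= d. */
static double dot(const double *a, const double *b, int d) {
    double s = 0.0;
    for (int i = 0; i < d; i++) s += a[i] * b[i];
    return s;
}

void random_orthonormal_vectors(double *vecs, int levels, int d) {
    for (int l = 0; l < levels; l++) {
        double *v = vecs + (size_t)l * d;
        for (int i = 0; i < d; i++)              /* random starting vector */
            v[i] = 2.0 * rand() / RAND_MAX - 1.0;
        for (int k = 0; k < l; k++) {            /* remove earlier components */
            const double *u = vecs + (size_t)k * d;
            double c = dot(v, u, d);
            for (int i = 0; i < d; i++) v[i] -= c * u[i];
        }
        double norm = sqrt(dot(v, v, d));        /* normalize to unit length */
        for (int i = 0; i < d; i++) v[i] /= norm;
    }
}
```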
Construction of a node consists of reading its input list, computing the inner product of each element with the node's associated v_l, and adding the element to a left or right output list depending on the sign of this inner product. In general the dividing point is chosen more intelligently, e.g. by computing the median of the inner product values, but in our uniform setting we just use zero for simplicity. Left and right children are then generated recursively. If a node's input list consists of a single element, then that node becomes a leaf and stores the element within it.
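A sketch of the recursive construction just described, assuming the projection vectors have already been generated (one per level, as above) and that data points are addressed by index into a flat row-major array. The node layout, and the guards for degenerate splits and for running out of projection vectors, are practical choices of our own rather than part of the paper's description.

```c
#include <stdlib.h>

/* Sketch of the recursive construction described above.  An interior node
 * splits its input list by the sign of the inner product with its level's
 * projection vector; leaves store data point indices. */
typedef struct Node {
    struct Node *left, *right;   /* NULL for leaves */
    int *points;                 /* leaf only: indices of stored elements */
    int npoints;
} Node;

static double dot(const double *a, const double *b, int d) {
    double s = 0.0;
    for (int i = 0; i < d; i++) s += a[i] * b[i];
    return s;
}

/* data: n*d doubles; vecs: one d-vector per level (max_level of them);
 * idx: indices of the points handled by this node; level: depth. */
Node *build(const double *data, const double *vecs, int d, int max_level,
            int *idx, int n, int level) {
    Node *node = calloc(1, sizeof *node);
    int *left = NULL, *right = NULL, nl = 0, nr = 0;
    if (n > 1 && level < max_level) {
        const double *v = vecs + (size_t)level * d;
        left  = malloc(n * sizeof *left);
        right = malloc(n * sizeof *right);
        for (int i = 0; i < n; i++) {
            if (dot(data + (size_t)idx[i] * d, v, d) < 0.0)
                left[nl++] = idx[i];     /* cut point is zero in this setting */
            else
                right[nr++] = idx[i];
        }
    }
    if (nl > 0 && nr > 0) {              /* interior node: recurse */
        node->left  = build(data, vecs, d, max_level, left,  nl, level + 1);
        node->right = build(data, vecs, d, max_level, right, nr, level + 1);
    } else {                             /* leaf: store the element(s) */
        node->points = malloc((n > 0 ? n : 1) * sizeof *node->points);
        for (int i = 0; i < n; i++) node->points[i] = idx[i];
        node->npoints = n;
    }
    free(left);
    free(right);
    return node;
}
```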
The search is parameterized by a query q, a value R giving the proportional size of the search domain, and a probability p that is related to the success rate we expect. The inner product of q with each v_l is first computed. Next the positive threshold distance τ is computed from R and p using Φ, the cumulative distribution function of a normal density whose variance is indicated by a subscript.
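The exact formula relating R, p, and the threshold is not reproduced above, so the sketch below shows only one plausible reading: assume the relevant projection offset is approximately normal with some constant standard deviation sigma (a multiple of R, per the analysis in the next section), and take τ to be the smallest value at which the CDF reaches p. The C99 erf function supplies the normal CDF, and the bisection is simply a convenient way to invert it; both the rule and the sigma argument are our assumptions.

```c
#include <math.h>

/* Sketch of a threshold computation, under an assumed reading: the
 * nearest neighbor's projection differs from the query's by an
 * approximately normal offset with constant standard deviation `sigma`,
 * and tau is the smallest value this offset exceeds with probability at
 * most 1 - p.  The paper's exact relation is not reproduced here. */

static double normal_cdf(double x, double sigma) {   /* Phi with variance sigma^2 */
    return 0.5 * (1.0 + erf(x / (sigma * sqrt(2.0))));
}

/* Smallest tau with normal_cdf(tau, sigma) >= p, found by bisection. */
double pruning_threshold(double sigma, double p) {
    double lo = 0.0, hi = 10.0 * sigma;               /* generous bracket */
    for (int i = 0; i < 100; i++) {
        double mid = 0.5 * (lo + hi);
        if (normal_cdf(mid, sigma) < p) lo = mid; else hi = mid;
    }
    return hi;
}
```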
Search proceeds recursively from the root. For a node at level l the value q · v_l is examined. If it is less than τ then the left child is recursively explored; if it exceeds -τ then the right child is explored. Notice that when q · v_l lies strictly between -τ and τ both children are explored. When a leaf is encountered, the distance is computed between the element it contains and q. This decision rule is centered at zero because of our particular uniform setting, but it is easily translated to an arbitrary cut point, e.g. the median of the projected values.
After each distance computation is performed, the corresponding proportion of the maximum interpoint distance is computed. If it is smaller than R, then R is reduced to this proportion and τ is recomputed.
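Putting the pieces together, here is a sketch in C of the recursive search with aggressive pruning as just described. The node layout and helpers match the construction sketch above, pruning_threshold() is the assumed threshold routine from the previous sketch, and passing the shrunken proportion directly as its sigma argument is likewise our reading rather than the paper's exact rule.

```c
#include <stdlib.h>
#include <math.h>

/* Sketch of the search with aggressive pruning described above. */
typedef struct Node {
    struct Node *left, *right;   /* NULL for leaves */
    int *points;                 /* leaf only: indices of stored elements */
    int npoints;
} Node;

double pruning_threshold(double sigma, double p);    /* earlier sketch */

static double dot(const double *a, const double *b, int d) {
    double s = 0.0;
    for (int i = 0; i < d; i++) s += a[i] * b[i];
    return s;
}

static double dist(const double *a, const double *b, int d) {
    double s = 0.0;
    for (int i = 0; i < d; i++) { double t = a[i] - b[i]; s += t * t; }
    return sqrt(s);
}

typedef struct {
    const double *data, *vecs, *q;
    int d;
    double p, dmax;            /* success parameter, max interpoint distance */
    double tau, best_dist;     /* current threshold, best distance so far */
    int best_index;            /* -1 until something within range is found */
} Search;

/* Typical use: set best_dist = R * dmax, tau = pruning_threshold(R, p),
 * best_index = -1, then call search_node(&s, root, 0). */
void search_node(Search *s, const Node *node, int level) {
    if (node->points) {                            /* leaf: compute distances */
        for (int i = 0; i < node->npoints; i++) {
            int j = node->points[i];
            double dd = dist(s->q, s->data + (size_t)j * s->d, s->d);
            if (dd < s->best_dist) {
                s->best_dist = dd;
                s->best_index = j;
                /* shrink the proportion R to the one just found and
                 * recompute the threshold (assumed rule, see above) */
                s->tau = pruning_threshold(dd / s->dmax, s->p);
            }
        }
        return;
    }
    double proj = dot(s->q, s->vecs + (size_t)level * s->d, s->d);
    if (proj <  s->tau) search_node(s, node->left,  level + 1);  /* not far right of the cut */
    if (proj > -s->tau) search_node(s, node->right, level + 1);  /* not far left of the cut */
}
```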
This concludes the description of our search algorithm and we now briefly discuss related issues and extensions.
An important idea in kd-tree search involves the computation of the minimum distance from the query to the subspace corresponding to a node to be explored. If this distance grows beyond the radius of interest, the node is pruned. We do not, however, include it in our analysis or experimental implementation because in our high dimensional setting, in the absence of exponentially many data elements, this idea has vanishingly little effect. Intuitively, this is because the search tree is not nearly deep enough for the minimum distance to grow larger than the search radius.
The analogue of this kd-tree idea in our setting is an alternative version of our algorithm above that slightly reduces the pruning threshold while descending through interior nodes, to reflect the fact that the distribution of data elements within a ball about the query is no longer uniform. But, again, this is a second-order effect.
Finally we remark that the threshold-cutoff approach taken above might be replaced with an entirely probabilistic pruning scheme that passes probabilities from our analysis down to each child during search. The probabilities along the path from the root are then multiplied, and search continues downward until the result falls below a specified threshold.
We assume both data points and queries are uniformly distributed within a hypercube centered at the origin. The Euclidean distance metric applies, and the maximum distance between two points in this space is proportional to √d.
We consider the problem of finding the nearest neighbor of a query q within some distance r that is a constant proportion R < 1 of the maximum interpoint distance.
Any unit vector v gives rise to a projection mapping the space into the real line, defined by taking the inner product x · v. It is immediate that distances in the range of this projection are dominated by distances in the domain. It is this fact that kd-trees exploit to prune branches during branch-and-bound search.
If r represents the distance to the nearest neighbor encountered so far during a search, then this fact implies that every member of the ball of radius r centered at the query q maps under any such projection into the interval [q · v - r, q · v + r]. So kd-tree search may confidently disregard any data points that project outside of this interval.
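In code this hard pruning rule is a simple interval test; the helper below is only an illustration of the statement above.

```c
/* Sketch: a data point x may be disregarded when its projection lies
 * outside [q.v - r, q.v + r].  proj_x and proj_q denote the projections
 * of x and of the query q onto the unit vector v. */
static int hard_prune(double proj_x, double proj_q, double r) {
    return proj_x < proj_q - r || proj_x > proj_q + r;
}
```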
Unlike the kd-tree, which finds nearest neighbors with certainty,
we will consider randomized constructions that succeed with
some specified probability.
Since r scales up linearly with √d in our setting, this interval grows too, and soon the kd-tree can confidently prune almost nothing and performs a nearly exhaustive search.
But in our uniformly random setting we will show that the ball about q projects to a distribution about q · v having constant variance. This is why it is possible to continue to probabilistically prune just as effectively even as d grows without bound. Since dimension does not affect our ability to prune, we have lifted the curse of dimensionality in our setting.
We now proceed with the description and idealized analysis of our algorithm, beginning with the following proposition, which establishes two elementary but key dimensional invariants: the projection of the uniformly distributed data onto a random unit vector has variance that is constant with respect to d, and so does the projection of any ball whose radius is a constant fraction of the maximum interpoint distance.
Now fix some unit vector v and consider the query's projection q · v. Then by Proposition 3.1, and ignoring hypercube corner effects, the projections of the data points within the domain of search are distributed about q · v with variance no greater than a constant that does not depend on d.
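This invariance is easy to check numerically. The following self-contained sketch is our own illustration, not the paper's program: the cube [-1,1]^d, the radius constant 0.3, and the simplification of ignoring the cube boundary are all assumptions made for the demonstration. It estimates the variance of random projections of uniform hypercube data and of points uniform in a ball of radius 0.3√d; both estimates stay roughly constant (near 1/3 and near 0.09 respectively) as d grows.

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* Monte Carlo sketch of the two invariants: as d grows, the projection
 * onto a random unit vector of (a) points uniform in [-1,1]^d and
 * (b) points uniform in a ball of radius 0.3*sqrt(d) both keep roughly
 * constant variance.  The cube and the constant 0.3 are illustrative. */

static double uniform01(void) { return (rand() + 1.0) / ((double)RAND_MAX + 2.0); }

static double gaussian(void) {                       /* Box-Muller */
    return sqrt(-2.0 * log(uniform01())) *
           cos(6.283185307179586 * uniform01());
}

static void random_unit_vector(double *v, int d) {
    double norm = 0.0;
    for (int i = 0; i < d; i++) { v[i] = gaussian(); norm += v[i] * v[i]; }
    norm = sqrt(norm);
    for (int i = 0; i < d; i++) v[i] /= norm;
}

static double var_cube_projection(const double *v, int d, int trials) {
    double s = 0.0, s2 = 0.0;
    for (int t = 0; t < trials; t++) {
        double proj = 0.0;
        for (int i = 0; i < d; i++) proj += v[i] * (2.0 * uniform01() - 1.0);
        s += proj; s2 += proj * proj;
    }
    return s2 / trials - (s / trials) * (s / trials);
}

static double var_ball_projection(const double *v, int d, double radius, int trials) {
    double s = 0.0, s2 = 0.0;
    double *x = malloc(d * sizeof *x);
    for (int t = 0; t < trials; t++) {
        random_unit_vector(x, d);                      /* random direction */
        double r = radius * pow(uniform01(), 1.0 / d); /* uniform in the ball */
        double proj = 0.0;
        for (int i = 0; i < d; i++) proj += v[i] * r * x[i];
        s += proj; s2 += proj * proj;
    }
    free(x);
    return s2 / trials - (s / trials) * (s / trials);
}

int main(void) {
    int trials = 20000;
    for (int d = 10; d <= 1000; d *= 10) {
        double *v = malloc(d * sizeof *v);
        random_unit_vector(v, d);
        printf("d=%4d  cube variance %.3f  ball variance %.3f\n", d,
               var_cube_projection(v, d, trials),
               var_ball_projection(v, d, 0.3 * sqrt(d), trials));
        free(v);
    }
    return 0;
}
```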
Because distances between projected points are dominated by their original distances, |x · v - q · v| is at most the distance from x to q for any x in the search domain. So any data points with projections farther than r from q · v can be confidently ruled out during search. It is this hard pruning that kd-trees (and many related structures) exploit to avoid considering every point during nearest neighbor search.
But notice that this pruning cutoff distance grows with √d, while the projection variance remains constant. As a result hard pruning is asymptotically ineffective, and this gives rise to the curse of dimensionality for nearest neighbor search using kd-trees in our setting.
Next observe that as a consequence of the central limit theorem the distributions of Proposition 3.1 are asymptotically normal -- and we remark that as a practical matter this is a good approximation even for moderate dimension. So from this point on in our analysis we will simply assume normality, i.e. that d is sufficiently large that the deviation from normality is negligible.
We then arrive at one of the main observations of this paper: the nearest neighbor's projection is within constant distance of q · v with arbitrarily high and easily calculated probability depending only on this constant, and not on d.
Recall that the tree we build recursively bisects the set of data points using a randomly selected unit vector, separating the projections into left and right branches at a cut point near the median. Given a query q we prune one of these branches if the query's projection is at least some cutoff distance τ from the cut point. We refer to this idea as aggressive pruning. Again, note that this cutoff distance is constant despite the fact that the absolute size of our search domain expands with √d as d grows.
We can now establish a single tree's expected search complexity
and success probability.
Notice that the exponent of n in Proposition 3.3 is less than one. Also, the requirement that the tree's depth not exceed d is not very restrictive, since otherwise the number of data points is exponential in d and earlier techniques can be used to efficiently locate nearest neighbors.
The probability estimate of Proposition 3.3 is extremely conservative. The actual failure probabilities are much smaller because: i) our analysis has implicitly assumed that the query always projects to a point exactly τ from the cut point, which is the worst case -- if the distance is smaller, then nothing is pruned and no error can be made, and if the distance is larger, the effective cutoff distance is increased; and ii) we have assumed that the nearest neighbor lies just inside the boundary of the search domain, whereas in reality nearest neighbors will be distributed throughout this domain.
Even without considering these factors, our calculations suggest that efficient search is possible for moderately large values of R.
Table 3 gives exponent values for selected values of R and p. For a given choice of R and p, the table's exponent ρ means that, given n data points, no more than roughly n^ρ inner product operations will be performed, and that the nearest neighbor will be located with the corresponding success probability. Choosing a larger p increases the number of inner products in exchange for a higher probability of success. All of these calculations are asymptotically independent of dimension.
We continue this example to illustrate the use of forests to improve performance in some settings. Later we will give a theoretical consequence of this idea. Consider two independent trees built with a smaller value of p; that is, the two choices of random unit vectors are independent. Both trees are searched for each query, so a query fails only if it fails in both trees and the expected failure rate is roughly the square of the single-tree rate. This failure rate can be considerably better than that of a single tree built with a much larger p, and far fewer inner products are computed, but space consumption is doubled. This example illustrates the space-time design tradeoff that arises when applying our ideas in practice.
The natural generalization of this idea leads to a forest of independent trees. Recall that the failure rate of a single tree grows, albeit slowly, with n. By using enough trees one can force the forest's failure rate to zero as n increases while preserving sublinear search times, at the expense of polynomial space to pay for the additional trees.
To analyze this, let t denote the number of distinct random trees and observe that the single-tree failure probability raised to the power t bounds the probability of failure for the forest. It can be shown that if t grows quickly enough with the tree's depth, then this probability tends to zero as the depth (and therefore n) increases. Since the depth is logarithmic in n, this also expresses the required number of trees t in terms of n. Search time (the number of inner product operations) also increases by a factor of t, and space grows by the same factor. Finally, to maintain independence we must increase the lower bound on d by a factor of t.
It is apparent from Table 3 that this idea works for many values of R and p. For moderate values the original exponent increases by an amount far less than the table's precision. But it does not work for arbitrary R and p: for large values of R the exponent is so close to one that the added complexity makes the total superlinear.
Note that space proportional to nd is required to store the data elements themselves, but only O(n) space is needed for the interior tree nodes, which point to data elements. So, in practice, we can afford roughly d trees while remaining linear space. In practice this means that in high dimension one can easily afford many trees without materially affecting space consumption. In theory it means that if d is assumed to grow as an appropriate small power of n as n increases, then the entire forest occupies linear space.
Returning to our choice of the hypercube, we observe i) that all of our arguments apply immediately to the unit hypercube [0,1]^d after compensating for the shifted mean, and ii) that our arguments apply as well to the d-dimensional hypersphere. To see this, note that the dot product of a random unit vector and a random database element has variance proportional to 1/d, and the maximum interpoint distance is constant, so the projection of a proportional neighborhood ball also falls with d.
Finally we remark that as an alternative to forests, one can build a single tree in which each level is overlapped, i.e. items near the center are stored in both the left and right branches. This effectively reduces the threshold τ and with it search time, at the expense of polynomial space (given a constant fraction of overlap at each level).
To confirm our analysis we wrote an ANSI-C program to build and search trees. It accepts n, d, R, and p as arguments, in addition to the number of random queries to test and a random seed.
The data points are distributed uniformly in the hypercube. Queries are generated by choosing a data point at random, then choosing a random unit vector, and stepping from the data point in that direction by a length just inside the search radius -- so that the query lies just inside of the search domain. This ensures that a nearest neighbor exists within the domain.
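A small sketch of this query-generation step, with a Gaussian-based direction sampler of our own choosing; the parameter r stands for the chosen offset just inside the search radius.

```c
#include <stdlib.h>
#include <math.h>

/* Sketch of the query generation described above: pick a random data
 * point, pick a random direction, and step a distance r away from it so
 * the chosen point lies just inside the search domain. */

static double uniform01(void) { return (rand() + 1.0) / ((double)RAND_MAX + 2.0); }

static double gaussian(void) {                       /* Box-Muller */
    return sqrt(-2.0 * log(uniform01())) *
           cos(6.283185307179586 * uniform01());
}

/* data: n*d doubles; q: output buffer of d doubles.
 * Returns the index of the data point the query was generated from. */
int make_query(const double *data, int n, int d, double r, double *q) {
    int j = rand() % n;
    const double *x = data + (size_t)j * d;
    double norm = 0.0;
    for (int i = 0; i < d; i++) { q[i] = gaussian(); norm += q[i] * q[i]; }
    norm = sqrt(norm);
    for (int i = 0; i < d; i++)
        q[i] = x[i] + r * q[i] / norm;   /* q = x + r * (random unit vector) */
    return j;
}
```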
An indexed set of projection vectors is generated by starting with random vectors and orthonormalizing them in the standard way. Each level in the tree uses the corresponding vector in this set as its projector. So the size of the set is roughly log2(n), and in practice somewhat larger because the resulting trees are not perfectly balanced. For simplicity zero is always used as the cut point during tree construction.
In order to allow us to simultaneously explore large n and d without exceeding available RAM, we represent the data points to be searched using a pseudorandom generator seed that suffices to construct them. As a result we can easily study very large datasets in almost arbitrarily high dimension.
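One way to realize such a representation is to derive each point's coordinates deterministically from the dataset seed and the point's index, so that any point can be rebuilt on demand and the dataset never needs to be held in memory. The generator below (a 64-bit linear congruential generator seeded through a SplitMix-style mixer, producing coordinates in [-1,1]) is an arbitrary illustrative choice, not the paper's generator.

```c
#include <stdint.h>

/* Sketch of regenerating data points on demand from a seed: point i's
 * coordinates are produced by a small deterministic generator keyed by
 * (seed, i), so the full dataset never needs to be stored. */

static uint64_t mix(uint64_t z) {
    z += 0x9e3779b97f4a7c15ULL;
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
    z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
    return z ^ (z >> 31);
}

/* Fill x[0..d-1] with the coordinates of point `index`, uniform in [-1,1]. */
void regenerate_point(uint64_t seed, uint64_t index, int d, double *x) {
    uint64_t state = mix(seed ^ mix(index));
    for (int i = 0; i < d; i++) {
        state = state * 6364136223846793005ULL + 1442695040888963407ULL;
        x[i] = 2.0 * ((state >> 11) * (1.0 / 9007199254740992.0)) - 1.0;
    }
}
```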
Figure 1 illustrates that search complexity, measured in the number of inner product operations performed, is very nearly dimension-invariant -- confirming our analysis. Total work therefore scales up only linearly with dimension, through the cost of each individual inner product operation.
Note that for simplicity we regard an inner product operation and a Euclidean distance computation as having identical complexity. Also, the reported results exclude the roughly log2(n) inner products required to compute the query's projections, one at each level of the tree.
Recall that our analysis of failure rate was quite conservative, and experiments confirm this. Actual failure rates for every case in Figure 1 were somewhat lower than those from analysis; that is, the observed success probabilities exceed the predicted ones. Note that one reason for using a moderate parameter value in our study is that we have confirmed that values far closer to zero result in no observable experimental error given the limited number of queries we have time to perform.
We compare performance in our setting with kd-tree search [FBS75,FBF77,BF79,Ben80]. We are interested in neighbors within a specified distance of the query. To reproduce this setting, kd-tree search is initialized as though an element at the specified distance has already been encountered.
A recent kd-tree implementation [MA99] is modified for our experiments. Its core routines are altered to support radius-limited search as described above, and the distribution's sample program is adapted to report the average number of distance computations performed in response to each query. As in our earlier experiments, queries are generated by starting at a database element, and moving the specified distance away in a random direction. So every query in our experiment locates a nearest neighbor within the radius of interest. As a self-check the sample program is further modified to verify this condition. After these modifications, a direct comparison is possible between our method and kd-tree search.
From our analysis it is clear that kd-tree search will not exhibit the same dimensional invariance, and Figure 2 confirms this. Moreover, search time grows so rapidly in the harder cases that the kd-tree is essentially ineffective by moderate dimension. By contrast (Figure 1) our method remains effective for arbitrarily high dimension. Recall, however, that kd-tree search is guaranteed to find the nearest neighbor within the domain, while our method will fail with some small probability.
A primary motivation for this paper was our discovery in [Yia99] that kd-trees, and a related general structure called a vantage point forest, do exhibit search complexity that is invariant with respect to dimension for a domain of constant radius -- and substantial savings over exhaustive search are realized only for disappointingly small radii. The experiment reported in Figure 3 confirms this, and illustrates the fundamentally different asymptotic behavior (with respect to d) of kd-tree search and our aggressive pruning approach.
The flat graphs of Figure 1 are obtained in the presence of a search radius that grows with √d, whereas those of Figure 3 assume a search radius that is constant with respect to d.
For example, by moderate dimension a domain of unit radius already generates a nearly exhaustive search. By contrast, Figure 1 shows that with a radius growing as a fixed fraction of the maximum interpoint distance, our search visits only a small fraction of the leaf nodes.
We conclude our discussion of experiments by describing in greater detail a single larger example of our method, with the data set size and the number of queries pushed well beyond the values used above, in high dimension. Here again the observed success probability exceeds the one predicted by analysis, the tree is correspondingly deeper, and the number of leaves visited on average during a search is close to the predicted value. The experiment required only modest time on a 400MHz Pentium II processor.
We hope this work takes us closer to practical and useful nearest neighbor search in high dimension. In addition to sharpening our analysis for the uniform case, future theoretical work might target distribution-independent bounds for approximate nearest neighbor search within an R-domain.
While generalizing, we expect that the definition of an R-domain will have to change, and that it may be necessary to move to the approximate nearest neighbor framework, where the objective is to find neighbors within a constant factor of the distance to the closest.
Also, the work of this paper motivates us to consider techniques for general metric spaces and arbitrary distributions that learn the projection distributions from examples -- adapting to the problem domain.
Ultimately we envision a high-level procedure that, like the qsort() function from the standard Unix library, views its underlying database through a layer of abstraction. In the case of sorting, a comparison function is provided; for nearest neighbor search, projection and distance computation must be supported.
Sorting is, however, a much simpler problem inasmuch as practical algorithms exist with acceptable worst-case time and space complexity -- independent of the dataset. For nearest neighbor search we suggest that a general tool must be flexible and should ideally support several capabilities. If the projectors are known to be independent (as in orthogonal Euclidean projections), then search should exploit this; most generally, a system might try to learn patterns of independence from examples. Special support for discrete-valued distance functions is also desirable.
We hope that the ideas and results of this paper bring us closer to the development of such a tool.
The author thanks Joe Kilian and Warren Smith for helpful discussions.