Peter N. Yianilos^{1}
July 20, 1998
Our analysis predicts vp-forest performance in simple settings such as spaces with uniform random datasets -- and experiments confirm these predictions. Another contribution of the analysis is a new perspective on the curse of dimensionality in the context of our methods and kd-trees as well. In our idealized setting the dataset is organized into a forest of trees, each of depth . Here may be viewed as depending on , the distance function, and on the dataset. The radius of interest is an input to the organization process and the result is a linear space data structure specialized to answer queries within this distance. Searches then require time, or time given processors.
Our conclusion is that these new data structures exhibit useful behavior only for small radius searches, where despite their variation in search times, conventional kd-trees perform much better.
Keywords: Nearest neighbor search, Vantage point tree (vp-tree), kd-tree, Computational geometry, Metric space.
We consider the radius-limited nearest neighbor problem in a metric space. That is, given an point dataset and a radius of interest , produce a data structure and associated search algorithm to rapidly locate the dataset point nearest to any query . Our focus is on practical data structures that provide worst-case search time bounds.
Vantage point trees (vp-trees) [32] and kd-trees [15,16,4,3] organize an point dataset so that sublinear time nearest neighbor searches may be performed on an expected basis for some fixed distribution. Performance depends on the dataset and on the assumed distribution of queries.
The excluded middle vantage point forest (vp-forest) is a new related data structure that supports worst case sublinear time searches for nearest neighbors within a fixed radius of arbitrary queries. Worst case performance depends on the dataset but is not affected by the distribution of queries.
The dataset is organized into a forest of trees, each of depth . Here may be viewed as depending on , the distance measure, and on the dataset. The radius of interest is an input to the organization process and the result is a data structure specialized to answer queries within this distance.
Each element of the dataset occurs in exactly one tree so that the entire forest remains linear space. Searches follow a single root-leaf path in each tree. There is no backtracking when the search is limited to neighbors within distance . Along its way every neighbor within is necessarily encountered. The query's effect is to guide the descent through each tree.
There are no significant ancillary computational burdens at search time. So upon creating the forest, the user simply adds the depths of each tree in the forest to arrive at the maximum number of distance evaluations each search will require. Each tree may be searched independently. Searches then require time given one processor for each tree in the forest.
We also discuss design variations that trade space for reductions in search time -- and compare forests with single trees constructed similarly.
The general idea behind vp-forests is easily understood. Both vp-trees and kd-trees recursively divide the dataset. At each node the remaining dataset elements have an associated value, and the node has a corresponding fixed threshold that is roughly central in the distribution of values. Elements below this threshold are assigned to, say, the left child, and those above to the right. For kd-trees these values are those of individual coordinates within each data vector. For vp-trees they are the distance of a metric space element to some fixed vantage point.
Elements near to the threshold lead to backtracking during search. When building a vp-forest, such elements are deleted from the tree and added instead to a bucket. Once the tree is complete, the bucket is organized into a tree in the same way, resulting in another (smaller) bucket of elements. This continues until the forest is built. This effectively eliminates backtracking. Because elements near to the threshold are recursively deleted, and this threshold lies near the middle of the distribution of values, we refer to our data structure as an excluded middle vantage point forest. Both vp-trees and kd-trees may be regarded as trivial instances of vp-forests with no excluded middle.
We present an idealized analysis that allows us to predict vp-forest performance in simple settings such as spaces with uniform random datasets. Experiments are reported that confirm these predictions. One contribution of this analysis is an interesting new perspective on the so called curse of dimensionality (that is that nearest neighbor search increases in difficulty with dimension). In Euclidean space given a uniform random dataset drawn from the hypercube, we observe that the worst-case difficulty of vp-forest search for any fixed ought to be asymptotically constant with respect to dimension -- and our experiments confirm this. Using we expect constant difficulty if is allowed to increase with ( denotes dimension), and this too is confirmed by experiment.
Our analysis also suggests that kd-trees should exhibit the same dimension invariance for fixed search radii, and experiments confirm this. For a worst case query, kd-tree search visits essentially the entire dataset, but on average performs far less work than the vp-forest.
The vp-forests described in this report stimulated the developments of [33], but do not themselves appear to have immediate practical value.
We conclude our introduction with a brief discussion of the nearest neighbor search problem and literature. See [32] for additional discussion.
Nearest neighbor search is an important task for non-parametric density estimation, pattern recognition, information retrieval, memory-based reasoning, and vector quantization. See [11] for a survey.
The notion of a mathematical metric space [20] provides a useful abstraction for nearness. Examples of metric spaces include Euclidean space, the Minkowski spaces, and many others. Exploiting the metric space triangle inequality to eliminate points during nearest neighbor search has a long history. Our work belongs to this line. This paper extends it by i) introducing structures that give worst case time bounds for limited radius searches, ii) providing analysis for them, iii) introducing a method for trading space for time, and iv) defining our data structures and algorithms in terms of abstract projectors, which combines approaches that use distance from a distinguished element with those such as kd-trees that use the value of a distinguished coordinate.
In early work Burkhard and Keller [7] describe methods for nearest neighbor retrieval by evaluating distances from distinguished elements. Their data structures are multi-way trees corresponding to integral valued metrics.
Fukunaga in [18,19] exploits clustering techniques [27] to produce a hierarchical decomposition of Euclidean Space. During a branch and bound search, the triangle inequality is used to rule out an entire cluster if the query is far enough outside of it. While exploring a cluster he observes that the triangle inequality may be used to eliminate some distance computations. A key point missed is that when the query is well inside of a cluster, the exterior need not be searched.
Collections of graphs are considered in [14] as an abstract metric space with a metric assuming discrete values only. This work is related to the constructions of [7]. In their concluding remarks the authors clearly anticipate generalization to continuous settings such as .
The idea that vantage points near the corners of the space are better than those near the center was described in [28] and much later in [32].
More recent papers describing vantage-point approaches are [30,29,25] and [32] who describe variants of what we refer to as a vantage-point tree. Also see [10] for very recent work on search in metric spaces.
The well-known kd-tree of Friedman and Bentley [15,16,4,3] recursively divides a pointset in by projecting each element onto a distinguished coordinates. Improvements, distribution adaptation, and incremental searches, are described in [13], [21], and [6] respectively. In our framework kd-trees correspond to unit vector projection with the canonical basis.
More recently, the Voronoi digram [2] has provided a useful tool in low- dimensional Euclidean settings - and the overall field and outlook of Computational Geometry has yielded many interesting results such as those of [31,9,8,17] and earlier [12]. It appears that [12] may be the first work focusing on worst case bounds.
Very recently Kleinberg [22] gives two algorithms for an approximate form of the nearest neighbor problem. The space requirements of the first are prohibitive but the second, which almost always finds approximate nearest neighbors seems to to be of more practical interest. The recent work reported in [1] also considers an approximate form of the problem. Their analysis gives exponential dependence on but the heuristic version of their approach they describe may be of practical interest.
For completeness, early work dealing with two special cases should be mentioned. Retrieval of similar binary keys is considered by Rivest in [26] and the setting is the focus of [34].
See [5] for worst case data structures for the range search problem. This problem is related to but distinct from nearest neighbor search since a neighbor can be nearby even if a single coordinate is distant. But the nearest neighbor problem may be viewed as an instance of range search. Their paper also describes a particular approach to trading space for time via an overlapping cover. Our discussion of this topic in section 5 also takes this general approach.
We begin by formalizing the ideas and construction sketched in the introduction.
That is, a balanced 3-way partition of with a central proportion of approximately .
We remark that each tree of the forest may be viewed as analogous to the Cantor set from real analysis. That is, the subset of constructed by removing the central third of the interval -- and proceeding recursively for both the left and right thirds. Our forest then corresponds to a decomposition of the space into a union of Cantor sets.
Two important examples of a suitable family of functions are vantage point projection for general metric spaces, and unit vector projection for Euclidean space.
A vantage point projector is defined for any by . The split points in tree construction correspond to abstract spheres about . It is easily verified that as required. The range of this projector is the nonnegative reals. Letting gives rise to a vantage point tree [32].
The unit vector projector is defined for any as . This is easily seen to satisfy the required inequality as well, and its range is not limited to nonnegative values. Here the split points correspond to hyperplanes. Choosing as canonical unit vectors and letting builds a form of kd-tree [15,16,4,3]. It is important to note that may be computed more rapidly for canonical unit vectors -- in constant time with respect to dimension. Also, we remark that whenever orthogonal vectors are used, projection distances are, in a sense, additive, and that this fact can be exploited (as kd-trees do) when designing solutions specialized for spaces. We do not consider either of these optimizations in this paper.
proof: At each step in the construction the central proportion of elements is removed so that the left and right subsets are each of size . The first tree's depth is then . So the number of elements left in the tree is .
The number of elements left after the first tree has been constructed is then . After the second are left, and so on. Clearly trees are required to reduce the population to any fixed size.
Once the population has been reduced to of its original size, the number of elements removed as each tree is built will have declined to . Since no fewer than are removed. So the number of steps required to reach is no more than . The number required to reach is then -- and so on forming a geometric series. So the number required to reach any fixed level is . The number of trees in the forest is then -- and the total space is clearly linear.
A search for nearest neighbors within distance is then made by following a single root-leaf path in each tree establishing the required time bound. Each tree can be considered independently and the results merged to arrive at a single nearest neighbor. This establishes the stated parallel complexity.
Finally we consider organizational time. The first level of
the first tree requires work to produce projection
values of each point in the space, and then time to
effect the split (linear time order statistics). The next
level considers a total of records, and so on,
leading to overall time from which the stated
organization time bound follows.
For example, if there are trees in the forest, and search time is . Organization requires time, but as always only space is consumed. As we expect to increase but the number of trees approaches and searches come ever closer to examining every point. As we expect to decrease while search time approaches the logarithmic ideal. The parameter then, in a sense, interpolates between exhaustive large- search, and logarithmic small- search.
Our development above is very general and describes idealized forests. We now explain in what sense they are idealized and begin our discussion of concrete settings.
Algorithm 1 removes a fixed central proportion of the points as each tree node is processed. This simplifies analysis but, because is defined as the minimum diameter of all such central cuts, might wind up too small to be of any practical value. Also, the projection function will not in general be 1-1, and this detail must be considered in any implementation. The user's objective is a satisfactory tradeoff between the conflicting goals of minimizing search time, and maximizing .
One approach is to minimize search time given a lower bound on . In contrast with the idealized case, here it makes sense to cut a variable central proportion -- one of diameter . Notice in this case that the construction recursion may terminate before reaching a single node. Since the choice of projector may affect this proportion, it may be wise to invest additional time during construction to evaluate multiple projectors. This corresponds to the idea of selecting a vantage point for a vp-tree, or a cut-dimension for a kd-tree. When fixed diameter cuts are made a more complicated stopping rule for forest construction is needed. Otherwise a tiny ball of points, smaller than in diameter, will never be assigned to any tree.
It is also important to note that while our focus is on worst case performance, and there is no backtracking required to search for neighbors within distance , that backtracking can be performed as for vp-trees or kd-trees in order to satisfy queries beyond .
Another difference between our idealized construction and practical implementations is that in the idealized case points are stored only at tree leaves.
In the vp-tree case the projector is formed by selecting an element of subspace and computing its distance to the query and to every other element of . Having computed its distance to the query, there is no need to examine again, and we therefore view it as stored at the interior tree node at which is was used. So the difference is that the set is split at each level of the recursion. When using unit vector projection it is also possible to use an element of the database. Here too, it is unnecessary to examine again since is easily computed given .
The Minkowski -norm is denoted and defined by for . A metric results from evaluating the norm of the difference of two vectors. The Euclidean metric is , the city block metric is , and by convention gives the maximum absolute value over individual coordinates.
We begin with an asymptotic statement that allows to understand the expected performance of excluded middle vantage point forests in high dimensional instances of these spaces given distributions such as the uniform one.
proof: Consider
. Because the
are i.i.d. both the mean and variance of scale linearly
with . So relative to the magnitude of the mean of , its
standard deviation shrinks as , whence the
distribution of values is increasingly concentrated
about its mean. Taking the th root to arrive at has
the effect of scaling all of these values downward by a factor of
approximately
so the variance
changes by a factor of . But the variance of
scales with , so the variance of scales with
, and its standard deviation with
.
To keep its statement simple the proposition includes an i.i.d. assumption, but less is necessary. Independence is required but the mean and variance of each variable need not be the same. It is merely necessary that the mean and variance of scales approximately linearly with .
The proposition is then relevant because of two observations about uniformly distributed high dimensional spaces. Firstly, that choosing a point at random, the distribution of over in the space exhibits the required linear scaling behavior. Secondly, that making cuts during construction (whether spherical or hyperplanar) in a high dimensional space, leaves subspaces that for the purposes of this proposition still behave like uniformly random ones. The same is true for unit vector projection.
For Euclidean space () we then expect to be asymptotically constant with dimension. Then the distribution of values the construction algorithm is presented with at each step, becomes the same for sufficiently high dimension. For we have that scales upward with , and for it scales downward with . The result is that we expect the following behavior when using excluded middle vantage point forests to search uniformly distributed point sets:
Finally we remark that the case of unit vector projection in Euclidean space directly leads to the first observation above without the proposition.
We have implemented our ideas as an ANSI-C program. Figures 1 and 2 describe experiments which confirm the behavior above for and . It is important to reiterate that searching the forest involves no decisions or backtracking, so that search time is essentially constant. So:
Other experiments confirmed the expected behavior when . Vantage point projection is used but unit vector projection gives similar results. A vantage point is selected at random from the current subset of the dataset. The implementation forces fixed cuts and stops forest construction when all remaining points are failed to be assigned to a tree after a large number of attempts. These points are then stored as a simple list. The implementation also stores points at interior nodes, not just at leaves. Finally, its splitting algorithm is general, i.e. does not assume 1-1 projection functions.
The uniform random case is, of course, of little practical interest. While we give no general worst case bounds it is important to note that performing a fixed construction will always generate a forest, and a corresponding query-independent worst case bound on search time for that particular point set.
We compare performance in our setting with kd-tree search [15,16,4,3], adapted from [23] as described in [33] for radius limited search.
Figure 3 shows the result of our experiment. The same dimensional invariance is evident, but now with respect to expected search time. As for our worst-case structures, this follows from our earlier observations about projection distributions in high dimensional spaces. The figure shows that expected search complexity is somewhat lower than the worst case results of figure 1 where saturation is evident exclusion width is approached. For kd-trees, saturation is delayed until roughly width .
The expected kd-tree search times are much lower than the worst case values of figure 1. In fact, given random queries, the probability is vanishingly small that kd-tree search will cost as much as our worst case search.
In the uniform case a query at the center of the hypercube is costly for the kd-tree, and we have confirmed that essentially the entire dataset is searched in this case. We observe that one might build a pair of kd-trees whose cut points are offset to eliminate these hot spots and perhaps even provide worst case performance bounds in the radius-limited case.
By slightly modifying the forest construction algorithm one can build a single 3-way tree. Instead of adding the central elements to a bucket for later use in building the next tree, we instead leave them in place forming a third central branch. To search such a tree one must follow two of the three branches: either the left and middle, or the middle and right. We remark that this tree structure was our first idea for enhancing vantage point trees to provide worst-case search time bounds.
Like the forest, this tree requires linear space, but a simple example and analysis illustrates that its search performance is usually much worse.
Consider the 3-way tree built by an equal 3-way division, i.e. where the central exclusion proportion is . Then the tree's depth is . The path taken by any single search consists of a binary tree embedded within this trinary tree. So nodes will be visited. But we have seen that a forest with exclusion proportion results in node visits, which is superior despite the fact that the tree was built using a smaller exclusion proportion ( vs. ).
Table 4 compares the performance of idealized trees and forests for various database sizes and central exclusion proportions. Figure 4 compares the performance of actual trees and forests in a uniform distribution Euclidean space setting. The results are consistent with our analysis and the computations of table 4. We briefly remark that for small , trees and forests with a higher branching factor make sense. Here the projected point set is partitioned into bands with alternating ones corresponding to the excluded middle of our 3-way construction.
Until now we have considered linear space structures. In this section we describe a general technique for trading space for time. For analysis we will return to the idealized setting in which construction removes fixed proportions -- not fixed diameter central subsets as in the implementation and experiments. This will allow us to make calculations intended to illuminate the engineering tradeoffs one faces in practice.
The 3-way split performed by algorithm 1 is illustrated in figure 5(a). The central proportion is imagined to correspond to some fixed radius as shown. We motivate the construction in terms of but will analyze it in terms of .
When the projection of a query lies just to the right of the dotted centerline, a search for a nearest neighbor within will explore and . Increasing by some value forces the search to explore as well.
To avoid exploring all three we store the points in interval in two places (see figure 5(b)). First, they are stored as always in . Second, they are stored with those of . As a result of this overlap only and must be explored as before. Points in interval are similarly stored twice.
Continuing this modified division throughout forest construction yields a data structure that:
For example, let and . Then space is .
From another viewpoint the same space tradeoff may be used to reduce search time for a fixed . For example, an ordinary forest provides searches. Letting and reduces the time to but space is increased from to .
Finally consider the interesting extreme case where . Here the forest consists of a single binary tree. Search time is then with space .
Excluded middle forests might be applied to problems other than nearest neighbor search. They might, for example, improve the effectiveness of decision trees, an important tool for machine learning [24]. The motivation here is that values near to the decision threshold may be more difficult to classify based on the selected decision variable.
We now discuss several possibilities related to nearest neighbor search. For each fixed our implementation generates a forest with some associated worst case query time bound. Smaller values, in general, result in smaller bounds. Now suppose that a forest is built with and the system is then presented with a query. Next suppose that early in the search a neighbor is encountered that is well within the radius, say at distance . Despite our good fortune the remaining search will take no less time. Root-leave paths in every remaining tree must be examined.
One idea is to build multiple forests. One for , and additional forests for , etc. Then upon encountering a point at distance the search algorithm might consider aborting its processing of the forest and switch over to the forest, which offers a better worst case time bound.
It is interesting to note that this decision to switch can be made very easily by considering the number of points remaining in the current forest, and comparing it to the worst case bound for the new one. If the latter is smaller it makes sense to switch over. Implementations may consider another factor in their decisions. The distance from the query to all points already visited might be remembered to avoid unnecessary metric evaluations when processing restarts at the beginning of a new forest.
Another topic for future work involves two settings in which our methods seem far from optimal: boolean vectors with Hamming distance, and the case.
The boolean Hamming distance setting corresponds to , and the number of errors we can tolerate while maintaining constant search complexity for the uniform distribution grows with . But experiments suggest that our forests do not compete in practice with a particular simple method. If is the maximum Hamming distance to be tolerated, this method divides the high-dimensional vector into segments and exploits the fact that the dataset projected onto one of these segments may be assumed to have Hamming distance zero to the query. Simple hashing may then be used to maintain these projections. We are interested in better understanding the relationship of this simple technique to the ideas of this paper. We remark that the same idea of dimensional partitioning applies to any Minkowski metric, but so far we can obtain an advantage only in the Boolean case.
As described earlier, our methods grow weaker with growing , becoming essentially useless for . This is interesting since suddenly, with , the problem becomes an instance of range search. We speculate that ideas from range search might in some form be of use before reaches -- if only as an approximate solution to the nearest neighbor problem.
Finally we observe that in storing the data vectors themselves requires space. The tree/forest we build can use space pointers to these vectors. For this reason we can tolerate super-linear space in the tree/forest, namely . We suggest that a practical implementation for Euclidean space should take advantage of this and also exploit the canonical/orthogonal basis optimizations mentioned in section 2.
Our work towards a general and practical tool for nearest neighbor search began with [32] and continues with this paper. We envision a high-level procedure that like the qsort() function from the standard Unix library, views its underlying database through a layer of abstraction. In the case of sorting, a comparison function is provided. For nearest neighbor search projection and distance computation must be supported.
Sorting is, however, a much simpler problem in as much as practical algorithms exist with acceptable worst case time and space complexity -- independent of the dataset. For nearest neighbor search we suggest that a general tool must be flexible and support:
We hope that the ideas and results of this paper brings us closer to the development of such a tool.
The author thanks Joe Kilian for helpful discussions.
This document was generated using the LaTeX2HTML translator Version 2K.1beta (1.47)
Copyright © 1993, 1994, 1995, 1996,
Nikos Drakos,
Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999,
Ross Moore,
Mathematics Department, Macquarie University, Sydney.
The command line arguments were:
latex2html -split 0 -image_type gif -nonext_page_in_navigation -noprevious_page_in_navigation -up_url ../main.html -up_title 'Publication homepage' -accent_images textrm -numbered_footnotes vp2.tex
The translation was initiated by Peter N. Yianilos on 2002-06-25