
Normalized Forms for Two Common Metrics

Peter N. Yianilos1

12/5/91 Rev 2/27/92 Rev 7/7/2002

Abstract:

We show that symmetric set difference and Euclidian distance have particular $[0,1]$ normalized forms that remain metrics.

The first of these $\vert A \triangle B\vert / \vert A \cup B\vert$ is easily established and generalizes to measure spaces.

The second applies to vectors in ${\mathbb{R}}^n$ and is given by $\Vert X-Y\Vert/(\Vert X\Vert+\Vert Y\Vert)$. That this is a metric is more difficult to demonstrate and is true for Euclidian distance (the $L_2$ norm) but for no other integral Minkowski metric. The short and elegant proof we give is due to David Robbins and Marshall Buck [1].

We also explore a number of variations.

Keywords: Metric Space, Distance Function, Similarity Function/Coefficient, Euclidian Distance, Association, Clustering, Vector Quantization, Pattern Recognition, Statistical Methods.

Introduction

The notion of a metric space [2] is a cornerstone of mathematical topology and analysis, and is often employed in pattern recognition and clustering systems.

Definition 1  

Let $X$ be a set and $d$ a non-negative real valued function on $X \times X$ satisfying for $a,b,c \in X$:

  1. $d(a,b) = d(b,a)$

  2. $d(a,b) = 0$ iff $a=b$

  3. $d(a,b) + d(b,c) \ge d(a,c)$

Then $(X,d)$ is a metric space. Alternatively, $d$ is said to be a distance function and to impose a metric on $X$.

The third item in this definition is usually referred to as the triangle inequality. It is worthwhile noting that not all approaches to pattern recognition and clustering impose this requirement. The term similarity measure or dissimilarity measure, or some variation thereof, is then used to describe the comparison function.

Two well known examples of metric spaces are Euclidian $n$-space with:


\begin{displaymath}
d(A,B) = \Vert A-B\Vert _2 = \sqrt{\sum_{i=1}^n (A_i-B_i)^2}
\end{displaymath}

and the symmetric difference metric on members of ${\cal F}$, the set consisting of all finite sets:


\begin{displaymath}
d_\triangle(A,B) =
\vert (A \setminus B) \cup (B \setminus A) \vert =
\vert A \triangle B\vert
\end{displaymath}

$d(A,B)$ is of course geometrical length, and $d_\triangle(A,B)$ simply counts the number of elements on which $A$ and $B$ disagree.

Both of these metrics are in general unbounded, and their value is independent of the size of $A$ and $B$. I.e. two very large sets that have three points of disagreement are just as distant under $d_\triangle$ as two very small sets which also differ by three elements. In Euclidian $1$-space, values $1,000$ and $1,000.01$ are just as distant as $0.01$ and $0.02$.

Hence we may in a sense view $d$ and $d_\triangle$ as measures of absolute distance or difference. $d_\triangle$ is seen to be insensitive to $\vert A \cap B\vert$ while $d$ does not independently consider $\Vert A\Vert$ and $\Vert B\Vert$ when measuring distance.

It is therefore natural to consider forming relative distance measures for these underlying spaces, since such measures may be more effective in the solution of certain problems. Certainly the notion of relative error is an important one in numerical analysis. For sets and for Euclidian space, size might naturally be taken to mean cardinality and norm respectively. Thus we are interested in set metrics which measure relative to the empty set, and in an alternative to the Euclidian distance metric which measures relative to the origin.

With no domain assumptions, $d$ and $d_\triangle$ are unbounded. This may create algorithmic difficulty (or at a minimum inconvenience). Thus converting these metrics to bounded forms may sometimes be useful.

The simplest way in general to effect a bound is to compose the metric with another function $f$ which acts as a range compander such that the combination still satisfies definition 1. One well known example of such a function that bounds any metric to $[0,1]$ is:


\begin{displaymath}
\bar{d}(A,B) = \frac{d(A,B)}{1+d(A,B)}
\end{displaymath}

There are many such formulas for bounding metrics. It may for example be shown that the sum of metrics is a metric, and beyond this that a metric results from composition with any continuous, differentiable, strictly increasing function $f$ such that $f(0)=0$, and $f'$ is non-increasing. 2
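As a small illustrative sketch (the helper names are ours, not part of any library), the following Python code composes a metric with $f(t)=t/(1+t)$ and spot-checks the triangle inequality numerically:

    import math
    import random

    def euclidean(x, y):
        """Plain (unbounded) Euclidian distance between two equal-length tuples."""
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    def companded(d):
        """Compose a metric d with f(t) = t / (1 + t), bounding its range to [0, 1)."""
        return lambda x, y: d(x, y) / (1.0 + d(x, y))

    d_bar = companded(euclidean)

    random.seed(0)
    for _ in range(10000):
        a, b, c = (tuple(random.uniform(-5.0, 5.0) for _ in range(3)) for _ in range(3))
        # The companded form remains a metric: spot-check the triangle inequality.
        assert d_bar(a, b) + d_bar(b, c) >= d_bar(a, c) - 1e-12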

Since however any such method depends only on the value $d(A,B)$, the bounded form will inherit the absolute or relative behavior of its unbounded parent.

These metric companding methods are in contrast to normalization to which we now turn.

In the sections that follow we will introduce the metric $d_{\triangle_n}$ defined by $d_\triangle(A,B) / \vert A \cup B\vert$ and also metric $d_n$ defined 3 by $d(A,B) / (\Vert A\Vert+\Vert B\Vert)$.

In contrast to the absolute behavior of $d$ and $d_\triangle$ these functions judge distance with consideration to the relative location of the origin and empty set respectively.

That they are in fact metrics is not obvious and a considerable portion of this paper is devoted to the required proofs.

We will also present a number of alternative forms including mixed metrics that combine absolute and relative behavior.

Normalized Symmetric Difference

To normalize the symmetric difference metric we choose to divide its value by the size of the union of its arguments. More formally we have:

Definition 2   Let ${\cal F}$ be the set consisting of all finite sets. We define function $d_{\triangle_n}$ on ${\cal F} \times {\cal F}$ by:


\begin{displaymath}
d_{\triangle_n}(A,B) = \left\{
\begin{array}{ll}
\frac {\vert A \triangle B\vert} {\vert A \cup B\vert} & \mbox{if $A \cup B \neq \emptyset$}\\
0 & \mbox{otherwise}
\end{array} \right.
\end{displaymath} (1)
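For concreteness, EQ 1 admits a one-line rendering on ordinary Python sets; a minimal sketch (the function name is ours):

    def d_triangle_n(A, B):
        """Normalized symmetric difference |A ^ B| / |A | B| (EQ 1); 0 for two empty sets."""
        union = A | B
        if not union:
            return 0.0
        return len(A ^ B) / len(union)

    print(d_triangle_n({1, 2}, {2, 3}))   # two disagreements over a union of three -> 2/3
    print(d_triangle_n(set(), set()))     # 0.0 by definition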

We would point out here that not every reasonable attempt at normalization results in a metric. Dividing instead by $\vert A\vert+\vert B\vert$, an altogether sensible thing to do, fails to satisfy the triangle inequality. This may be seen by letting $A=\{x\}$, $B=\{x,y\}$, and $C=\{y\}$. The $AB$ and $BC$ distances are then both $1/3$ while the distance $AC$ is $1$. 4
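The counterexample is easily checked by hand or, as a small sketch with a hypothetical helper, in Python:

    def d_bad(A, B):
        """Symmetric difference divided by |A| + |B| -- NOT a metric."""
        return len(A ^ B) / (len(A) + len(B)) if (A or B) else 0.0

    A, B, C = {"x"}, {"x", "y"}, {"y"}
    lhs = d_bad(A, B) + d_bad(B, C)   # 1/3 + 1/3
    rhs = d_bad(A, C)                 # 1
    print(lhs, rhs, lhs >= rhs)       # 0.666..., 1.0, False: the triangle inequality fails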

While it is clear that this function is bounded by $[0,1]$, and that its behavior is more relative in that $\vert A \cap B\vert$ will affect its value, it must be demonstrated that $d_{\triangle_n}$ is in fact a metric.

Theorem 1   $({\cal F}, d_{\triangle_n})$ is a metric space.

Proof: We must show that for all finite sets $A$, $B$, and $C$, the following are true:

  i) $d_{\triangle_n}(A,B) = 0$ iff $A=B$

  ii) $d_{\triangle_n}(A,B) = d_{\triangle_n}(B,A)$

  iii) $d_{\triangle_n}(A,B) + d_{\triangle_n}(B,C) \ge d_{\triangle_n}(A,C)$

We have i) because $A \cap B = A \cup B$ only when $A=B$. The second requirement ii) is clear from the commutativity of basic set operations.

Item iii), the triangle inequality, requires more work. Using EQ 1, we must show:


\begin{displaymath}
1 - \frac {\vert A \cap B\vert} {\vert A \cup B\vert} +
1 - \frac {\vert B \cap C\vert} {\vert B \cup C\vert} \geq
1 - \frac {\vert A \cap C\vert} {\vert A \cup C\vert}
\end{displaymath} (2)

It is easy to verify that this is true if $A \cup B = \emptyset$, $B \cup C = \emptyset$, or $A \cup C = \emptyset$, so we restrict our attention to the case in which none of the denominators in EQ 2 is zero.

Now the union of sets $A$, $B$, and $C$ may be partitioned (see figure 1) into seven disjoint subsets whose orders we denote:

\begin{eqnarray*}
a & = & \vert A \setminus (B \cup C) \vert \\
b & = & \vert B \setminus (A \cup C) \vert \\
c & = & \vert C \setminus (A \cup B) \vert \\
ab & = & \vert (A \cap B) \setminus C \vert \\
ac & = & \vert (A \cap C) \setminus B \vert \\
bc & = & \vert (B \cap C) \setminus A \vert \\
abc & = & \vert A \cap B \cap C \vert
\end{eqnarray*}



Figure 1: Partitioning $A \cup B \cup C$ (Venn diagram of the seven disjoint regions).

For later convenience we also define $\gamma = ab+ac+bc+abc$.

With these definitions, EQ 2 becomes with some simplification:


\begin{displaymath}
\frac {ab + abc} {a + b + \gamma} +
\frac {bc + abc} {b + c + \gamma} \leq
\frac {ac + abc} {a + c + \gamma} + 1
\end{displaymath} (3)

Now observe that $b$ may be removed without loss of generality from the left-hand denominators, since its removal can only increase that side. It will then suffice to show that:


\begin{displaymath}
\frac {ab + abc} {a + \gamma} +
\frac {bc + abc} {c + \gamma} \leq
\frac {ac + abc} {a + c + \gamma} + 1
\end{displaymath}

Note that if $\gamma = 0$ both left-hand numerators vanish and the inequality holds trivially, so we may assume $\gamma > 0$. Now replacing $1$ with $\gamma / \gamma$ and adding fractions, we arrive at:


$\displaystyle {
\frac { (c + \gamma) (ab + abc) + (a + \gamma) (bc + abc)
} {
(a + \gamma) (c + \gamma)
} \leq
} \hspace{1in}$ (4)
    $\displaystyle \frac { (ac + abc) \gamma + (a + c + \gamma) \gamma
} {
(a + c + \gamma) \gamma
}$ (5)

The denominator of EQ 4 is equal to the denominator in EQ 5 plus $ac$, and is therefore no smaller.

It therefore suffices to show that the inequality is true for the numerators alone, i.e. that:

\begin{eqnarray*}
\lefteqn{
(c + \gamma) (ab + abc) + (a + \gamma) (bc + abc) \leq
} \hspace{1in} \\
& & (ac + abc) \gamma + (a + c + \gamma) \gamma
\end{eqnarray*}



Starting from the left side of this inequality, and using $ab + abc \leq \gamma$, $bc + abc \leq \gamma$, and $ab + bc + abc = \gamma - ac$, we have:

\begin{eqnarray*}
\lefteqn{
(c + \gamma) (ab + abc) + (a + \gamma) (bc + abc) = } \hspace{1in} \\
& & c (ab + abc) + a (bc + abc) + \gamma (ab + bc + 2\,abc) \leq \\
& & c \gamma + a \gamma + \gamma (\gamma - ac) + \gamma\, abc \leq \\
& & (ac + abc) \gamma + (a + c + \gamma) \gamma
\end{eqnarray*}



and we are done. $\Box$

Our arguments for sets in ${\cal F}$ generalize easily to measure spaces [3,4]. The case of finite non-zero measures is straightforward since these correspond well with finite sets and our earlier arguments.

To see this, remember that for any $A,B,C \in {\cal F}$, we earlier expressed the triangle inequality in terms of the orders of their natural decomposition into seven disjoint parts. Our proof of theorem 1 may then be viewed as establishing the truth of an inequality in these seven independent variables - an inequality that holds for any assignment of finite non-negative values; subject only to the restriction that the denominators may not vanish. The same decomposition applies if $A,B,C \in {\cal M}$, with the inequality relating measure instead of set order, thus motivating the connection between $d_{\triangle_n}$ and measure spaces.

As a practical matter the finite setting is the most important case. However the generalization extends fully to measures which assume value $+\infty$, and we will take the time to show this.

For elements of finite non-zero measure, the definition of our metric corresponds to a simple generalization of the notion of set order to that of measure. For elements in general, several cases are necessary to patch together a definition. While these special cases manage to define the metric for all members of the space, only the simplest discrete distance notion applies to elements with zero or infinite measure.

Definition 3   Let ${\cal M}=(X, {\cal B}, \mu)$, be a measure space, and $A, B \in {\cal B}$. We define function $d_\mu$ on ${\cal M} \times {\cal M}$ by:


\begin{displaymath}
d_\mu(A,B) = \left\{
\begin{array}{ll}
\frac {\mu (A \triangle B)} {\mu (A \cup B)} & \mbox{if $0 < \mu(A \cup B) < \infty$} \\
0 & \mbox{if $A = B$} \\
1 & \mbox{if $\mu(A \cup B) = 0$, $A \ne B$} \\
1 & \mbox{if $\mu(A \cup B)=\infty$, $A \ne B$} \\
\end{array} \right.
\end{displaymath} (6)

We now state the corollary:

Corollary 1   $({\cal M}, d_\mu )$ is a metric space.

Proof: Everything but the triangle inequality follows immediately from the definition.

Note first that if $d_\mu(A,C)=0$ or either one of $d_\mu(A,B)$, $d_\mu(B,C)$ is one, then the triangle inequality is trivially established. So in particular the inequality is satisfied if $A=B=C$.

Further observe that if $\mu(A)$ or $\mu(B)$ is infinite or zero, then $d_\mu(A,B)=0$ iff $A=B$, and one otherwise. I.e. the definition reduces to a simple equality test for these cases.

With these points in mind we distinguish three cases:

  1. $\mu(A)$, $\mu(B)$, $\mu(C)$ are finite with no two zero: Interpreting set order $\vert \cdot \vert$ instead as measure $\mu(\cdot)$, the denominators of EQ 2 are all non-zero and our proof of theorem 1 applies.

  2. $\mu(A)$, $\mu(B)$, $\mu(C)$ are finite and at least two are zero: If $\mu(A)=\mu(C)=0$ and $A = C$, then $d_\mu(A,C)=0$. On the other hand $A \ne C$ implies $A \ne B$ or $B \ne C$ so that either $d_\mu(A,B)=1$ or $d_\mu(B,C)=1$.

    Otherwise we may assume without loss of generality that $\mu(A)=\mu(B)=0$. Here $A \ne B$ implies $d_\mu(A,B)=1$ - and if $A=B$, then $d_\mu(B,C)=1$ unless $A=B=C$.

  3. At least one of $\mu(A)$, $\mu(B)$, $\mu(C)$ is infinite: If $\mu(B)=\infty$ then $d_\mu(A,B)=1$ or $d_\mu(B,C)=1$, unless $A=B=C$.

    Otherwise we may assume without loss of generality that $\mu(A)=\infty$. But then $d_\mu(A,B)=1$ unless $A=B$, in which case $\mu(B)=\infty$; a situation covered by the preceding argument. $\Box$

Proceeding further we can develop a slightly more general form for $d_\mu$. Consider some $A,B,C \in {\cal M}$ and their associated decomposition. Now for some $\alpha \ge 0$, adding $\alpha/2$ to each of $a$, $b$, and $c$, leaves them non-negative and the inequality therefore valid.

This may be thought of as extending each of $A$, $B$, and $C$, by a new region which does not intersect the others.

Equation 3 then becomes:


\begin{displaymath}
\frac {ab + abc} {a + b + \gamma + \alpha} +
\frac {bc + abc} {b + c + \gamma + \alpha} \leq
\frac {ac + abc} {a + c + \gamma + \alpha} + 1
\end{displaymath} (7)

This all motivates:

Definition 4   Let ${\cal M}=(X, {\cal B}, \mu)$ be a measure space, and $A, B \in {\cal B}$. We define function $d_{\mu_\alpha}$ on ${\cal M} \times {\cal M}$ by:


\begin{displaymath}
d_{\mu_\alpha}(A,B) = \left\{
\begin{array}{ll}
\frac {\mu (A \triangle B)} {\mu (A \cup B) + \alpha} & \mbox{if $\mu(A \cup B) < \infty$ and $\mu(A \cup B) + \alpha > 0$} \\
0 & \mbox{if $A = B$} \\
1 & \mbox{if $\mu(A \cup B) + \alpha = 0$, $A \ne B$} \\
1 & \mbox{if $\mu(A \cup B)=\infty$, $A \ne B$} \\
\end{array} \right.
\end{displaymath} (8)

for $\alpha \ge 0$.

And from our discussion above it follows that:

Corollary 2   $({\cal M}, d_{\mu_\alpha} )$ is a metric space.

Thus we see that the $d_\mu$ discontinuity when $A=B$ may be eliminated 5 by biasing the denominator term $\vert A \cup B\vert$. Function $d_{\mu_0}$ is just $d_\mu$, and as $\alpha \rightarrow \infty$, the behavior of $d_{\mu_\alpha}$ approaches that of simple symmetric difference.
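To illustrate the role of the bias $\alpha$, consider intervals $[0,a]$ and $[0,b]$ under length measure, so that $\mu(A \triangle B)=\vert a-b\vert$ and $\mu(A \cup B)=\max(a,b)$; a small Python sketch (the helper name is ours):

    def d_mu_alpha(a, b, alpha=0.0):
        """d_mu_alpha for the intervals [0, a] and [0, b] under length measure."""
        if a == b:
            return 0.0
        return abs(a - b) / (max(a, b) + alpha)

    # With alpha = 0, d([0, eps], [0, 2*eps]) is 1/2 however small eps becomes, so the
    # distance does not shrink as both sets approach the empty set; with alpha > 0 it
    # tends to 0, and large alpha approaches plain symmetric difference up to scale.
    for eps in (1.0, 0.01, 0.0001):
        print(eps, d_mu_alpha(eps, 2 * eps, 0.0), d_mu_alpha(eps, 2 * eps, 1.0))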

In a later section we will present several examples of metrics that arise from the results above. Before doing so however, we turn our attention back to Euclidian space and our goal of establishing a normalized metric there.

The Normalized Euclidian Metric

In this section we will demonstrate that Euclidian distance normalized by combined norm defines an alternative and $[0,1]$ bounded metric on ${\mathbb{R}}^n$.

Definition 5   Let $({\mathbb{R}}^n, d)$ denote Euclidian N-space (the $L_2$ norm and corresponding standard distance function). Then the following $[0,1]$ bounded function $d_n$ of any two vectors $X,Y$ is defined to be the Normalized Euclidian Distance between $X$ and $Y$:


\begin{displaymath}
d_n(X,Y) = \left\{
\begin{array}{cl}
\frac{\Vert X-Y\Vert}{\Vert X\Vert + \Vert Y\Vert} & \mbox{if $X \neq 0$ or $Y \neq 0$} \\
0 & \mbox{otherwise}
\end{array} \right.
\end{displaymath} (9)
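A direct Python rendering of EQ 9 (a sketch; the function name is ours) makes the relative behavior noted in the introduction concrete:

    import math

    def d_n(x, y):
        """Normalized Euclidian distance ||x - y|| / (||x|| + ||y||); 0 when both are the origin."""
        norm = lambda v: math.sqrt(sum(t * t for t in v))
        denom = norm(x) + norm(y)
        if denom == 0.0:
            return 0.0
        return norm([a - b for a, b in zip(x, y)]) / denom

    print(d_n((1000.0, 0.0), (1000.01, 0.0)))   # ~5e-6: a tiny relative difference
    print(d_n((0.01, 0.0), (0.02, 0.0)))        # ~0.333: same absolute gap, large relative one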

That this definition is in fact a metric will later be stated as a theorem. Its proof will require a series of Lemmas which build insight into the function.

It is worthwhile noting that in ${\mathbb{R}}^1$, an easy proof exists and that a very similar ${\mathbb{R}}^1$ metric (a special case of our definition 6) is employed in [5].

We would also observe that this definition fails to generate a metric if the $L_1$ norm is substituted. 6 For if $A=(1,0)$, $B=(1,1)$, and $C=(0,1)$, the triangle inequality does not hold. The $L_\infty$ norm also fails; consider $A=(1,-1)$, $B=(2,0)$, and $C=(1,1)$. For this example, and the $L_p$ norm, the triangle inequality becomes:


\begin{displaymath}
\frac{2 \cdot 2^{1/p}}{2^{1/p}+2} \ge \frac{1}{2^{1/p}}
\end{displaymath}

This fails for every integer $p \ge 3$. Therefore, together with the $L_1$ example above, our definition does not generate a metric for any of the integral Minkowski metrics $L_p$ where $p \ne 2$.
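These failures are easy to confirm numerically; a small sketch using a hypothetical helper for the $L_p$-normalized form:

    def lp_norm(v, p):
        if p == float("inf"):
            return max(abs(t) for t in v)
        return sum(abs(t) ** p for t in v) ** (1.0 / p)

    def d_p(x, y, p):
        """||x - y||_p / (||x||_p + ||y||_p): only p = 2 yields a metric."""
        diff = [a - b for a, b in zip(x, y)]
        return lp_norm(diff, p) / (lp_norm(x, p) + lp_norm(y, p))

    cases = {1: ((1, 0), (1, 1), (0, 1)),
             3: ((1, -1), (2, 0), (1, 1)),
             float("inf"): ((1, -1), (2, 0), (1, 1))}
    for p, (A, B, C) in cases.items():
        print(p, d_p(A, B, p) + d_p(B, C, p) >= d_p(A, C, p))   # False in every case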

Our to-be-proven positive result for $L_2$ is thus that much more interesting.

Theorem 2   $({\mathbb{R}}^n, d_n)$ is a metric space.

Proof: The first two metric conditions are immediate, and the triangle inequality is trivial whenever two of the three vectors coincide or one of them is the origin. Letting $O$ denote the origin, it therefore suffices to establish the following claim.

Let $O$, $A$, $B$, $C$ be four distinct points in Euclidean space. Then

\begin{displaymath}{AB \over OA+OB}+{BC \over OB +OC} \ge {AC \over OA + OC}. \end{displaymath}


proof: Let $OA=a$, $OB=b$, $OC=c$, $BC=\alpha$, $AC=\beta$, $AB=\gamma$. Then we need to prove that

\begin{displaymath}\alpha a^2-\beta b^2+ \gamma c^2 +(\alpha-\beta+\gamma)(ab+bc+ac)\ge 0. \end{displaymath}

Note that we have $\alpha-\beta+\gamma \ge 0$ by the triangle inequality and $ \alpha a - \beta b + \gamma c \ge 0$ by Ptolemy's inequality. We may suppose that $a\le c$ (the claim is symmetric under exchanging $A$ and $C$).

Case 1: $ b \le a$. Then we have

\begin{displaymath}\alpha a^2 - \beta b^2 + \gamma c^2 \ge (\alpha -\beta +\gamma)a^2 \ge 0\end{displaymath}

and the theorem follows immediately.

Case 2: $ a \le b \le c.$ Then we have


\begin{displaymath}
\alpha a^2-\beta b^2+ \gamma c^2 +(\alpha-\beta+\gamma)(ac)
= (a+c)(\alpha a -\beta b +\gamma c)-\beta (b-a)(b-c) \ge 0
\end{displaymath} (10, 11)

from which the result follows immediately as well.

Case 3: $ c \le b$. Observe that the theorem is true for four points $O$, $A$, $B$, $C$ if and only if it is true for $O$ and the points $A'$, $B'$, $C'$ obtained by inverting $A$, $B$ and $C$ in the sphere of radius 1 centered at $O$. (Recall that $A$ and $A'$ are on the same ray from $O$ and satisfy $OA\cdot OA'=1$. The key fact that is needed is $A'B'=AB /(OA\cdot OB) $.) This reduces case 3 to case 1. $\Box$

If we now label the origin as $D$, we may re-state the theorem as a corollary in purely geometrical terms:

Corollary 3   Let $A,B,C,D$ be four points in Euclidian Space and $ab, ac,
ad, bc, bd, cd$, denote their interpoint segment lengths. Then:


\begin{displaymath}
(bd+cd)(ad+cd)\,ab + (ad+bd)(ad+cd)\,bc \ge (ad+bd)(bd+cd)\,ac
\end{displaymath}

So despite a singularity at the origin, $d_n$ imposes a metric on ${\mathbb{R}}^n$. This amounts to a fully relative distance measure which clearly cannot cope well with the origin. By contrast, standard Euclidian distance may be viewed as a fully absolute distance measure. It is easy to imagine a continuum of intermediate metrics which are partially relative.

Furthermore, some applications, while requiring a modicum of relative behavior, may also require better behavior at the origin than $d_n$ provides.

All of this motivates the following definition:

Definition 6   Let $({\mathbb{R}}^n, d)$ denote Euclidian N-space. Then for values $\alpha,\beta \geq 0$, with $\alpha\ne0$ or $\beta \ne 0$, the following $[0,1/\beta]$ bounded function $d_{n_{\alpha\beta}}$ of any two vectors $X,Y$ is defined to be the ${\alpha\beta}$-Normalized Euclidian Distance between $X$ and $Y$:


\begin{displaymath}
d_{n_{\alpha\beta}}(X,Y) = \left\{
\begin{array}{cl}
\frac{\Vert X-Y\Vert}{\alpha + \beta (\Vert X\Vert + \Vert Y\Vert)} & \mbox{if $X \neq 0$ or $Y \neq 0$} \\
0 & \mbox{otherwise}
\end{array} \right.
\end{displaymath} (12)

The Normalized Euclidian Distance of definition 5 corresponds to $\alpha=0, \beta=1$. Note that definition 6's special condition at $X=Y=0$ is only required when $\alpha=0$, for when $\alpha > 0$, the denominator cannot vanish.

The case $\beta = 0$ merely corresponds to standard Euclidian distance scaled by the constant $1/\alpha$.

In between these two extremes lies a family of hybrid functions that combine the relative behavior of $d_n$ with the absolute behavior of $d$.
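A small Python sketch of this family (definition 6), with a function name of our own choosing; $\alpha=0,\beta=1$ recovers $d_n$ and $\beta=0$ recovers Euclidian distance up to scale:

    import math

    def d_n_alpha_beta(x, y, alpha, beta):
        """||x - y|| / (alpha + beta * (||x|| + ||y||)); assumes alpha > 0 or beta > 0."""
        norm = lambda v: math.sqrt(sum(t * t for t in v))
        denom = alpha + beta * (norm(x) + norm(y))
        if denom == 0.0:          # only possible when alpha = 0 and X = Y = 0
            return 0.0
        return norm([a - b for a, b in zip(x, y)]) / denom

    x, y = (3.0, 4.0), (6.0, 8.0)
    print(d_n_alpha_beta(x, y, 0.0, 1.0))   # fully relative: d_n = 1/3
    print(d_n_alpha_beta(x, y, 1.0, 0.0))   # fully absolute: plain Euclidian distance 5.0
    print(d_n_alpha_beta(x, y, 1.0, 1.0))   # hybrid behavior in between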

To establish that $({\mathbb{R}}^n, d_{n_{\alpha\beta}})$ is in general a metric space, we will require one lemma concerning the behavior of quasi-quadrilaterals in ${\mathbb{R}}^n$.

Lemma 1   Let $A,B,C,D$ be any four points in ${\mathbb{R}}^n$, then with respect to the six interpoint segment lengths: $ab, ac,
ad, bc, bd, cd$, the following is true:


\begin{displaymath}
ab \cdot cd + bc \cdot ad \geq bd \cdot ac
\end{displaymath}

I.e. the sum of the product of the lengths of opposite sides of an arbitrary quadrilateral, is no smaller than the product of its diagonals.

Proof: We have only to argue that we may reduce the setting to ${\mathbb{R}}^2$ since there, the lemma is recognized as a standard theorem of inversive geometry ([6] Theorem 5.12) - a direct generalization of Ptolemy's theorem.

Since any four points may be embedded in ${\mathbb{R}}^3$, we may assume our setting is at most this.

Within ${\mathbb{R}}^3$, choose coordinates so that $A,C,D$ lie on the $XY$ plane, and $B$ is located above (or on) it. I.e. $A_z=C_z=D_z=0$ and $B_z \geq 0$.

Now translate so that $D$ is the origin.

Further choose rotation and orientation so that line $AC$ is parallel to and above the X-axis with $A$ left of $C$. I.e. $A_y=C_y$ and $A_x \leq C_x$.

Now consider spheres $S(A,ab)$ and $S(C,bc)$. The intersection of these spheres is a circle through $B$, lying in a plane perpendicular to the XY plane (and to line $AC$), with its center on $AC$.

Any choice for $B$ on this circle leaves everything in our inequality unchanged except $bd$.

Notice that the point on this circle farthest from the origin lies in the XY plane (it is the intersection point of the circle with the XY plane having the larger $y$-coordinate).

Denoting this point $B'$ and relocating $B$ there can only increase $bd$, and hence the inequality's right side, while leaving the left unchanged.

Thus if the inequality is true for the points $A,B',C,D$ in ${\mathbb{R}}^2$, it is true for $A,B,C,D$ in ${\mathbb{R}}^3$. $\Box$
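Lemma 1 is also easy to sanity-check numerically for random point sets in ${\mathbb{R}}^3$; a small sketch:

    import math
    import random

    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    random.seed(1)
    for _ in range(10000):
        A, B, C, D = (tuple(random.uniform(-1.0, 1.0) for _ in range(3)) for _ in range(4))
        # ab*cd + bc*ad >= bd*ac for any four points (Ptolemy-type inequality).
        assert dist(A, B) * dist(C, D) + dist(B, C) * dist(A, D) >= dist(B, D) * dist(A, C) - 1e-12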

We are now ready to establish that $d_{n_{\alpha\beta}}$ is a metric:

Corollary 4   $({\mathbb{R}}^n, d_{n_{\alpha\beta}})$ is a metric space.

Proof: The first and second metric conditions are straightforward. We therefore turn to the triangle inequality.

If $\beta = 0$, then by definition $\alpha\ne0$ and $d_{n_{\alpha\beta}}$ is nothing more than scaled Euclidian distance. If $\alpha=0$, then by definition $\beta \ne 0$ and $d_{n_{\alpha\beta}}$ is just scaled normalized Euclidian distance.

So the only interesting case is that in which $\alpha,\beta
\ne 0$. Here, we may assume without loss of generality that $\beta=1$ since other cases may be reduced to this one by multiplying each term in the triangle inequality by $\beta$ and then reducing.

Now defining for convenience: $a=\Vert A\Vert, b=\Vert B\Vert, c=\Vert C\Vert$, it remains for us to show that:


\begin{displaymath}
\frac{d(A,B)}{\alpha + (a+b)} +
\frac{d(B,C)}{\alpha + (b+c)} \ge
\frac{d(A,C)}{\alpha + (a+c)}
\end{displaymath}

Now forming a common denominator, discarding it, and then rearranging, we have:

\begin{eqnarray*}
\lefteqn{
\alpha^2 d(A,B) + \alpha^2 d(B,C) \; + } \\
& & (b+c)(a+c)\,d(A,B) + (a+b)(a+c)\,d(B,C) \; + \\
& & \alpha[(b+c)+(a+c)]\,d(A,B) + \alpha[(a+b)+(a+c)]\,d(B,C) \\
& \ge & \alpha^2 d(A,C) \; + \\
& & (a+b)(b+c)\,d(A,C) \; + \\
& & \alpha[(a+b)+(b+c)]\,d(A,C)
\end{eqnarray*}



We now separately consider matched lines from either side of this expression. It turns out that the inequality holds for each such pairing, thus establishing the overall inequality. The inequality for the first pair of lines is just the ordinary Euclidian triangle inequality scaled by $\alpha^2$. Then, letting $D$ denote the origin, the inequality holds for the second pair due to corollary 3, and for the third 7 due to lemma 1. $\Box$

Alternative Forms and Interpretations

Association, and more about Finite Sets

To illustrate the wide variety of metrics that can be constructed with our earlier results, we start by returning to metrics on ${\cal F}$.

We first observe that finite sets may be regarded as binary vectors in Euclidian space. With this observation theorem 2 gives us that the following is a metric:


\begin{displaymath}
d_{\triangle_2}(A,B) = \left\{
\begin{array}{ll}
\frac {\sqrt{\vert A \triangle B\vert}} {\sqrt{\vert A\vert} + \sqrt{\vert B\vert}} & \mbox{if $A \cup B \neq \emptyset$}\\
0 & \mbox{otherwise}
\end{array} \right.
\end{displaymath} (13)

It is interesting to note that without the square root operations, this definition corresponds to normalization of $d_\triangle$ by $\vert A\vert+\vert B\vert$, which was earlier shown not to be a metric.

Again thinking of sets as binary vectors or variables, metric $d_{\triangle_n}$ is recognized as the complement of a well known measure of association referred to as the Jaccard coefficient [7], or S-coefficient; in contingency-table notation it is written:


\begin{displaymath}
\frac{b+c}{a+b+c}
\end{displaymath}

where $a,b,c,d$ refer to the standard entries in a 2-by-2 contingency (or association) table for binary variables. This very basic comparison function may be expressed in many forms. Expressing essentially the same thing as a similarity measure corresponds to the Tanimoto coefficient [8], operating on binary vectors:


\begin{displaymath}
\frac{A^t B}{A^t A + B^t B - A^t B}
\end{displaymath}

In the language of computer programming, $d_{\triangle_n}$ itself is just the count of 1's in the exclusive or of bit vectors $A$ and $B$, divided by the count of 1's in their logical or.
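For example, in Python, with sets encoded as integer bit vectors (a small sketch, names ours):

    def jaccard_distance(a: int, b: int) -> float:
        """d_triangle_n for sets encoded as bit vectors: popcount(a XOR b) / popcount(a OR b)."""
        union = a | b
        if union == 0:
            return 0.0
        return bin(a ^ b).count("1") / bin(union).count("1")

    A = 0b10110   # the set {1, 2, 4}
    B = 0b00111   # the set {0, 1, 2}
    print(jaccard_distance(A, B))   # 2 disagreements over a union of 4 elements -> 0.5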

In [9], the measure is referred to as association. While the authors do not state that these various forms fail to satisfy the triangle inequality, they describe them along with other non-metric measures.

Other sources such as [10] mention explicitly the set function $d_{\triangle_n}$, but again fail to note that it satisfies the triangle inequality.

Other Statistical Measures

The Bray-Curtis measure[11], is written:


\begin{displaymath}
\left(
\sum_j \vert x_{1_j} - x_{2_j} \vert
\right) /
\left(
\sum_j x_{1_j} +
\sum_j x_{2_j}
\right)
\end{displaymath}

and is recognized as our $d_n$ but under the $L_1$ norm. Now we have seen that this does not form a metric, but since $d_n$ is a metric we may write:


\begin{displaymath}
\left(
\sum_j \vert x_{1_j} - x_{2_j} \vert^2
\right)^{1/2} \Bigg/
\left(
\left[ \sum_j x_{1_j}^2 \right]^{1/2} +
\left[ \sum_j x_{2_j}^2 \right]^{1/2}
\right)
\end{displaymath}

with the knowledge that this does form a metric.

Also in [11], the reader will recognize the Canberra Metric as $d_n$ in ${\mathbb{R}}^1$, making $d_n$ a considerable generalization of this apparently well known metric.

Positive $n$-tuples

In familiar settings such as ${\mathbb{R}}^1$ and ${\mathbb{R}}^N$, with measure $\mu$ defined as length/ area/ volume, the metric $d_\mu$ corresponds to the amount of non-common length/ area/ volume, normalized by the total length/ area/ volume.

Here are some examples of metrics derived from corollary 1 that are defined for $n$-tuples of non-negative values; a small computational sketch of all three follows the list. For brevity's sake, we will not show the $A=B$ case for which any metric must evaluate to zero.

  1. Thinking of our $n$-tuple as a histogram in the plane, we may write:


    \begin{displaymath}
1 - \frac {
\sum_i \mbox{min}(A_i, B_i)
} {
\sum_i \mbox{max}(A_i, B_i)
} \;=\;
\frac {
\sum_i \vert A_i - B_i\vert
} {
\sum_i \mbox{max}(A_i, B_i)
}
\end{displaymath}

    This may be viewed as a repaired Bray-Curtis measure in that $\mbox{max}(A_i,B_i) = \vert A_i\vert + \vert B_i\vert -
\mbox{min}(A_i,B_i)$. I.e. the denominator has an extra term which suffices to form a metric.

  2. We may take the direct product of metric spaces, and combine metrics by any positive linear combination with the result still a metric. Averaging the one-dimensional metric of corollary 1 over components, we have in particular that:


    \begin{displaymath}
1 - \frac{1}{n} \sum_i \frac {
\mbox{min}(A_i, B_i)
} {
\mbox{max}(A_i, B_i)
} \;=\;
\frac{1}{n} \sum_i \frac {
\vert A_i - B_i\vert
} {
\mbox{max}(A_i, B_i)
}
\end{displaymath}

    which resembles somewhat our first example but behaves quite differently.

  3. Imagining now that our vectors represent the dimensions of an N-dimensional solid object rooted at the origin, and defining measure as volume, we have:


    \begin{displaymath}
1 - \frac {
\prod_i \mbox{min}(A_i, B_i)
} {
\prod_i \mbox{max}(A_i, B_i)
}
\end{displaymath}

    An example in which a single component greatly influences the resulting value.
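A small Python sketch of the three forms above (names ours); strictly positive components are assumed for the second and third forms, and the second is written as an average over components so that it remains in $[0,1]$:

    from math import prod

    def hist_metric(A, B):
        """Example 1: 1 - sum(min)/sum(max), equivalently sum|A_i - B_i| / sum(max)."""
        return 1.0 - sum(map(min, A, B)) / sum(map(max, A, B))

    def avg_ratio_metric(A, B):
        """Example 2: the per-component metric 1 - min/max, averaged over components."""
        return sum(1.0 - min(a, b) / max(a, b) for a, b in zip(A, B)) / len(A)

    def volume_metric(A, B):
        """Example 3: 1 - prod(min)/prod(max)."""
        return 1.0 - prod(map(min, A, B)) / prod(map(max, A, B))

    A, B = (1.0, 2.0, 3.0), (2.0, 2.0, 1.0)
    print(hist_metric(A, B), avg_ratio_metric(A, B), volume_metric(A, B))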

Function Spaces

Now corollary 1 may also be applied to function spaces. Let $C[0,1]$ denote the space of non-negative continuous functions on $[0,1]$. Given $f,g \in
C[0,1]$, the metric defined by:


\begin{displaymath}
1 - \frac {
\int_{0}^{1} \mbox{min}(f(x), g(x))\, dx
} {
\int_{0}^{1} \mbox{max}(f(x), g(x))\, dx
} \;=\;
\frac {
\int_{0}^{1} \vert f(x) - g(x)\vert\, dx
} {
\int_{0}^{1} \mbox{max}(f(x), g(x))\, dx
}
\end{displaymath}

may be thought of as an extension of our first positive $n$-tuple metric above, as $N \rightarrow \infty$, or in a Lebesgue/Measure-Space context.

Extending our second $n$-tuple metric to Riemann integrable functions allows us to write:


\begin{displaymath}
1 - \int_{0}^1 \frac {
\mbox{min}(f(x), g(x))
} {
\mbox{max}(f(x), g(x))
} dx \;=\;
\int_{0}^1 \frac {
\vert f(x) - g(x)\vert
} {
\mbox{max}(f(x), g(x))
} dx
\end{displaymath}

Note that in some cases, and with the proper assumptions, the continuity and range of integration assumptions made above may be relaxed.

Now for the sake of brevity, we have focused mainly on examples using $d_{\triangle_n}$ / $d_\mu$. Similar constructions apply to $d_n$ for both $n$-tuples and function spaces.

The Complex Plane

It is worthwhile noting (however obvious) that $d_n$ may also be viewed as an alternate metric for the complex plane with $L_2$ norm corresponding to complex absolute value.

Numerical Analysis

We close this section by commenting that $d_n$, used as a measure of relative error in numerical analysis, may prove interesting. Given a sequence $\bar{X}_1, \bar{X}_2, ...$ of approximate solutions, we might define $R(i,j) = d_n(\bar{X}_i, \bar{X}_j)$ and the triangle inequality would then give us that: $R(1,n) \le \sum_{i=1}^{n-1} R(i,i+1)$.
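A small sketch of this use (names ours): track $d_n$ between successive iterates and check the chained bound.

    import math

    def d_n(x, y):
        norm = lambda v: math.sqrt(sum(t * t for t in v))
        denom = norm(x) + norm(y)
        return 0.0 if denom == 0.0 else norm([a - b for a, b in zip(x, y)]) / denom

    # Successive approximations X_1, X_2, ... converging to (2, 2).
    iterates = [(2.0 + 2.0 ** -k, 2.0 - 2.0 ** -k) for k in range(1, 8)]
    steps = [d_n(iterates[i], iterates[i + 1]) for i in range(len(iterates) - 1)]
    # R(1, n) <= sum of R(i, i+1) by the triangle inequality.
    print(d_n(iterates[0], iterates[-1]) <= sum(steps) + 1e-15)   # True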

Concluding Remarks

We have developed normalized forms for two important metrics and shown each to be an endpoint of a continuum of metric functions ranging to the original unnormalized forms.

In a combinatorial setting, $d_{\triangle_n}$ may be viewed as a prototype for constructing a normalized bounded metric for many applications - for every injection of a pattern space into ${\cal F}$ induces a metric on the original pattern space. 8

Furthermore, evaluating this metric is possible given only $\vert A \cap B\vert$ and the individual orders $\vert A\vert$ and $\vert B\vert$, since $\vert A \cup B\vert = \vert A\vert + \vert B\vert - \vert A \cap B\vert$.

In practice, these values may sometimes be directly computed without explicitly forming injective images in set space. In [12] a VLSI chip is described which computes these three orders for a particular mapping of finite symbol strings into ${\cal F}$, and combines them to define a measure of string similarity.

In more Euclidian settings, the family $d_{n_{\alpha\beta}}$ may be used where normalized behavior is required and the metric triangle inequality is of value, e.g. in finding nearest neighbors, or when bounding sequential sums of distances.

Beyond the existence and basic properties established in this paper, one may examine a number of topics including the notion of colinearity in these normalized spaces, the nature of the geodesics corresponding to $d_n$, and questions of statistical behavior.

Acknowledgements

I thank David Robbins and Marshall Buck for their proof that replaces the tortuous argument I first discovered and gave in an earlier version of this paper. I also thank C.W. Gear, N. Littlestone, I. Rivin, and W.D. Smith for helpful discussions and references, E. Baum for corrections and general assistance, and S.R. Buss for his interest in and help with my work over many years.

Bibliography

1
D. Robbins and M. Buck. Private Communication, May 1993.

2
J. L. Kelley, General Topology.
New York: D. Van Nostrand, 1955.

3
H. L. Royden, Real Analysis.
New York: Macmillan Publishing, second ed., 1968.

4
R. G. Bartle, The Elements of Integration.
John Wiley & Sons, Inc., 1966.

5
D. Haussler, ``Decision theoretic generalizations of the pac model for neural net and other learning applications,'' Tech. Rep. UCSC-CRL-91-02, University of California, Santa Cruz, December 1990.

6
H. Coxeter and S. Greitzer, Geometry Revisited.
Washington, D.C.: Mathematical Association of America, 1967.

7
L. Kaufman and P. J. Rousseeuw, Finding groups in data: an introduction to cluster analysis.
John Wiley & Sons, Inc., 1990.

8
R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis.
John Wiley & Sons, Inc., 1973.

9
A. Kandel, Fuzzy techniques in pattern recognition.
John Wiley & Sons, Inc., 1982.

10
R. J. Schalkoff, Pattern recognition: statistical, structural, and neural approaches.
John Wiley & Sons, Inc., 1992.

11
A. Ralston, Statistical Methods for Digital Computers.
John Wiley & Sons, Inc., 1977.

12
P. N. Yianilos, ``A dedicated comparator matches symbol strings fast and intelligently,'' Electronics Magazine, December 1983.



Footnotes

... Yianilos1
This is a revision of an earlier unpublished Technical Memorandum of NEC Research Institute, 4 Independence Way, Princeton, NJ 08540
... non-increasing.2
The key is that $f(a+b) \le f(a)+f(b)$.
... defined3
Both of these metrics are defined to have zero value if $A=B$.
.... 4
This example was noticed by S.R. Buss, who also pointed out that this definition satisfies a weakened triangle inequality.
... eliminated5
The same applies for $d_{\triangle_n}$.
... substituted.6
This may also be seen as a result of our earlier comments regarding other methods for normalizing $d_\triangle$.
... third7
Factor $\alpha$ may be disregarded as it applies to all terms, and re-grouping the paired sums, yields a common term $(a+b+c)$ which may be similarly disregarded.
... space.8
For measure spaces we must additionally require that if $A \ne B$, then their images differ by more than a set of measure zero.

