Leonid Gurvits and Peter N. Yianilos
Keywords: Semidefinite programming, Maximum cut (MAX CUT ) problem, maximum entropy, maximum determinant completion, Bregman distance, Kullback-Leibler distance, normal (Gaussian) density, Sinkhorn iteration.
Semidefinite programming has emerged as an important and practical optimization framework that is increasingly relevant to combinatorial problems. Given symmetric matrices $C, A_1, \ldots, A_m$ and a vector $b$, the objective is to minimize $\langle C, X \rangle$ subject to $X \succeq 0$ and $\langle A_i, X \rangle = b_i$, $i = 1, \ldots, m$.
There has been a great deal of recent work on the solution of semidefinite programming problems. The ellipsoid method together with a separation oracle due to Lovász may be used; and the work of Nesterov and Nemirovskii and of Alizadeh adapts the interior-point method to the task.
The connection between semidefinite programming and combinatorial optimization was first established by Lovász and later by Lovász and Schrijver.
Goemans and Williamson discovered a semidefinite programming relaxation of the NP-hard MAX CUT problem, which generates cuts with expected weight no less than approximately 0.878 of optimal. Other optimization problems have been considered using essentially the same approach.
For MAX CUT their method begins by constructing the symmetric weighted adjacency matrix $A$ of the graph $G$ under consideration. We denote by $n$ the number of vertices in $G$, and by $m$ the number of edges. The semidefinite programming problem is then to find the matrix $X$ that minimizes $\langle A, X \rangle$ subject to $X \succeq 0$ and $X_{ii} = 1$. This matrix is then factored as $X = V^\top V$ using, for example, Cholesky decomposition. Given a random vector $r$ they interpret the sign pattern of $V^\top r$ as a cut, and show that for uniform random $r$ the expected weight of this cut is at least $0.878$ times the weight of a maximum cut. It is easy to see that instead of uniform random $r$ one can use a Gaussian vector with unit covariance matrix. Also, since only the distribution of the sign pattern matters, it is enough to recover $X$ to within a diagonal factor, i.e. $DXD$, where $D$ is a diagonal positive-definite matrix.
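To fix ideas, here is a minimal sketch of the rounding step just described, assuming a strictly positive definite solution so that Cholesky factorization applies; the function name, trial count, and use of a Gaussian random vector (equivalent to uniform, as noted above) are our illustrative choices rather than the authors' code.

\begin{verbatim}
import numpy as np

def gw_round(X, A, trials=100, seed=0):
    # Sketch of Goemans-Williamson style rounding.  X is the relaxation
    # solution (assumed strictly positive definite), A the symmetric
    # weighted adjacency matrix with zero diagonal.
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(X)              # X = L L^T; rows of L are the vectors v_i
    best_weight, best_cut = -np.inf, None
    for _ in range(trials):
        r = rng.standard_normal(X.shape[0])    # Gaussian r works as well as uniform
        s = np.sign(L @ r)                     # +/-1 side assignment per vertex
        weight = 0.25 * (A.sum() - s @ A @ s)  # total weight crossing the cut
        if weight > best_weight:
            best_weight, best_cut = weight, s
    return best_cut, best_weight
\end{verbatim}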
This approach and the general connection of semidefinite programming to combinatorial optimization are surveyed in . An alternative solution approach, based on the power method, is given by Klein and Lu . Benson, Ye, and Zhang describe a dual scaling algorithm and report on experiments for semidefinite programs arising from instances of MAX CUT. The MAX CUT problem itself is surveyed in .
This paper introduces the deflation-inflation algorithm, and an accompanying outlook, for certain convex optimization problems including those arising from the MAX CUT relaxation above. Applied to MAX CUT the algorithm is quite simple. It begins by choosing diagonal values for the matrix so that it becomes positive definite; we solve an approximation to the semidefinite program, and the diagonal values are chosen to select the desired precision.
In each cycle of the algorithm the diagonal entries of the matrix are modified one by one, i.e. each iteration changes only one diagonal entry. The update rule is:
The algorithm terminates when each constrained diagonal entry of the inverse agrees with its target value to within some error tolerance. Notice that the algorithm is essentially searching over the space of diagonal values for the matrix.
We show that the algorithm above converges to a unique solution, and that this approximates well the original problem. Our analysis further establishes local linear convergence under rather general circumstances, and global convergence after a bounded number of diagonal updates given a form of the algorithm that chooses diagonal elements greedily. The local analysis and portions of the global analysis apply to the well-studied Sinkhorn balancing problem  and seem relevant to the recent matching work of Linial, Samorodnitsky and Wigderson . Our global analysis is the first we are aware of for projection methods of this general class, which includes the celebrated expectation maximization (EM) algorithm .
The critical operation in each update is the computation of a single diagonal entry of the inverse. Notice that only one element of the inverse is examined, so the operation amounts to solving one system of linear equations. Thus when the matrix has special structure, or is sparse, specialized statements about the algorithm's complexity can be made. When these systems are large and can be rapidly solved, our method may be the best available.
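To make the update concrete, here is a minimal sketch of one reading of the cycle, assuming (consistent with the constraint $X_{ii}=1$ and the Gaussian deflation-inflation step developed later in the paper) that each update adjusts the diagonal entry so that the corresponding diagonal entry of the inverse becomes one; the names and tolerance are illustrative, and a dense solver stands in for whatever structured solver is appropriate.

\begin{verbatim}
import numpy as np

def deflation_inflation_diagonal(A, tol=1e-8, max_cycles=1000):
    # Sketch: repeatedly adjust the diagonal of the positive definite
    # matrix A until every diagonal entry of A^{-1} equals 1 (the MAX CUT
    # relaxation's constraint on X = A^{-1}).  Assumed single-entry rule:
    # a_ii += 1 - 1/(A^{-1})_ii, which by Sherman-Morrison forces the new
    # (A^{-1})_ii to be exactly 1.
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for _ in range(max_cycles):
        worst = 0.0
        for i in range(n):
            e = np.zeros(n)
            e[i] = 1.0
            x = np.linalg.solve(A, e)   # the critical operation: one linear system
            v = x[i]                    # (A^{-1})_{ii}
            worst = max(worst, abs(v - 1.0))
            A[i, i] += 1.0 - 1.0 / v    # impose the i-th diagonal constraint
        if worst < tol:
            break
    return A
\end{verbatim}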
In the sparse case it is interesting to note that our basic algorithm requires only linear space.
We have implemented our algorithm for MAX CUT using Gauss-Seidel iteration with successive over-relaxation to solve the linear systems. To avoid inverting the resulting matrix and performing Cholesky decomposition, we instead use a power series computation. For sparse constant-degree random graphs our analysis predicts the overall scaling of this method, and our experiments confirm the prediction. The full paper describes this implementation and our experimental results.
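For completeness, a generic Gauss-Seidel/SOR solver of the kind referred to above might look as follows; the relaxation factor, tolerance, and dense indexing are illustrative stand-ins for the sparse implementation described in the full paper.

\begin{verbatim}
import numpy as np

def sor_solve(A, b, omega=1.5, tol=1e-10, max_iter=10000):
    # Gauss-Seidel iteration with successive over-relaxation for A x = b.
    n = len(b)
    x = np.zeros(n)
    for _ in range(max_iter):
        delta = 0.0
        for i in range(n):
            sigma = A[i, :] @ x - A[i, i] * x[i]   # sum over j != i of a_ij x_j
            new = (1 - omega) * x[i] + omega * (b[i] - sigma) / A[i, i]
            delta = max(delta, abs(new - x[i]))
            x[i] = new
        if delta < tol:
            break
    return x
\end{verbatim}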
Recall that the starting matrix is essentially a combinatorial object representing the weighted adjacency of graph nodes. After the algorithm runs, having modified only its diagonal entries, the inverse of the result is the matrix we seek. An interesting aspect of this work is that a probabilistic viewpoint is of utility when addressing this essentially combinatorial problem. We view the sought matrix as the covariance matrix associated with a normal density, and the modified adjacency matrix as its inverse covariance. It is easily verified that the random cuts are then just sign patterns of random vectors drawn from this density.
Each step in our algorithm corresponds to a projection operation with respect to Kullback-Leibler distance [16,6] under Bregman's general framework for convex optimization . This framework is closely related to the later work of I. Csiszár and G. Tusnády , and generalizes the much earlier work [1,19] and also [10,4].
Our main computational focus in this paper is the semidefinite programming problem corresponding to the MAX CUT relaxation, i.e. with diagonal constraints. Another class of problems, which admits essentially the same complexity and convergence analysis, corresponds to block-diagonal positive-definite constraints. The case of two complementary block-diagonal constraints can be solved using the singular value decomposition, while our method in this case is a Riccati-like iteration. We believe that beyond two block-diagonal constraints our method is the best currently available practical method for large problems.
Another application of our method is the efficient solution of matrix completion problems . Here the objective is to find the maximum-determinant positive definite matrix with specified values on and near its diagonal. This corresponds to finding the maximum entropy normal density having prescribed marginal distributions.
Finally, we want to clarify one important historical point: our choice of barrier was motivated by Bregman's seminal paper . In a discrete setting this approach also has many merits, but lacks the crucial self-concordance property . In the Gaussian case, however, this barrier (basically Bregman's) is in a sense optimal . The celebrated self-concordance property has been studied in the context of Newton-like methods; we use it to analyze the convergence of our projection-like method.
Given a joint probability distribution $p$ on a product space $X \times Y$, and another distribution $q$ defined on $X$ alone, we define a new distribution which inherits as much as possible from $p$ while forcing its $X$-marginal to match $q$. The focus of our paper is exposing certain properties and algorithms relating to this natural operation, in which a joint distribution is deflated via conditionalization, and then reinflated by multiplication to exhibit some required marginal.
We denote by $p_X$ the $X$-marginal of $p$, i.e. $p_X(x) = \sum_y p(x,y)$. Similarly we denote by $p(y|x)$ the conditional probability of $y$ given $x$. Next, given a distribution $q$ on $X$, we define the operator that transforms $p$ into the new distribution $q(x)\,p(y|x)$, whose $X$-marginal is exactly $q$.
Among all distributions having this marginal, the result is special inasmuch as no other such distribution is closer to $p$ under the Kullback-Leibler (KL) distance [16,6]. That is, the operator is a projector with respect to KL-distance onto the set of distributions with $X$-marginal $q$. This is formalized by:
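In the finite discrete case the operator can be written in a couple of lines; the following sketch (the array names and the small example are illustrative) shows that the result has exactly the prescribed marginal.

\begin{verbatim}
import numpy as np

def deflate_inflate(p, q):
    # p: joint distribution over X x Y, rows indexed by X; q: target X-marginal.
    # Returns q(x) * p(y|x), whose X-marginal is exactly q.
    px = p.sum(axis=1, keepdims=True)                              # current X-marginal
    cond = np.divide(p, px, out=np.zeros_like(p), where=px > 0)    # p(y|x)
    return q.reshape(-1, 1) * cond

p = np.array([[0.2, 0.1],
              [0.3, 0.4]])
q = np.array([0.5, 0.5])
r = deflate_inflate(p, q)
assert np.allclose(r.sum(axis=1), q)    # the X-marginal now matches q
\end{verbatim}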
proof: We begin by establishing the following important Pythagorean KL-distance identity, which we will refer to again later in the paper:
for joint distributions and . From the definition of KL-distance and the observation that , the left side of this equality becomes:
Since by assumption we may then write:
which is clearly minimized by the choice indicated, since KL-distance is zero when its two arguments are the same distribution.
Now is defined in Eq-1 for product spaces . But the definition easily extends to more general finite product spaces by letting stand for some subset of the dimensions , and stand for the others. The function is then defined on the product space corresponding to the members of .
More formally, let be a subset of the dimensions and be a function of . Then the operator transforms into a new function whose -marginal exactly matches .
We view the pair as a constraint on the behavior of a desired distribution. Then, given a starting distribution and a collection of such constraints, a natural approach to transforming it into a distribution that satisfies all of them is to iteratively apply the operator, cycling through the collection of constraints. One cycle corresponds to the composition:
and many such cycles may be needed.
We remark that in a particular discrete probability setting this iteration corresponds to Sinkhorn balancing [23,24]. In that setting the distribution is represented by a matrix with nonnegative entries, and the iteration corresponds to an alternating row-wise and column-wise renormalization procedure. In Sinkhorn's original paper the target marginals are uniform, and to form the initial distribution one simply scales down the entire matrix so that the sum of its entries is unity. There is a considerable literature on this subject, but it is not the focus of our paper.
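A standard rendering of this iteration in the discrete matrix setting follows; the target marginals r and c and the stopping rule are illustrative, uniform targets recover the classical doubly stochastic case, and every row and column is assumed to contain a nonzero entry.

\begin{verbatim}
import numpy as np

def sinkhorn_balance(M, r, c, max_cycles=1000, tol=1e-10):
    # Alternately rescale rows and columns of the nonnegative matrix M so
    # that its row sums approach r and its column sums approach c.
    M = np.array(M, dtype=float)
    for _ in range(max_cycles):
        M *= (r / M.sum(axis=1))[:, None]    # impose the row marginals
        M *= (c / M.sum(axis=0))[None, :]    # impose the column marginals
        if np.abs(M.sum(axis=1) - r).max() < tol:
            break
    return M
\end{verbatim}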
In the Sinkhorn case, the iteration of Eq-3 gives rise to a constructive algorithm because the simple discrete distributions involved have the property that:
Normal densities represent another important class for which these properties hold, and here too the approach of Eq-3 leads to a simple and concrete algorithm. Consider the zero-mean normal distribution on $\mathbb{R}^n$:
Now for any consider normal density on corresponding to subspace . The marginal of corresponding to subspace is . So . Ignoring the leading constant and the term in the exponent the normal density corresponding to may be written:
from which it is clear that where
and for brevity in what follows we abuse our notation and write:
It is important to notice that this operation is best carried out using the inverses of the matrices involved. That is, given the inverse covariance of the original density, one arrives at the inverse covariance of the result by simply adding a correction term to its upper-left block.
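The following sketch spells this out for zero-mean normal densities in terms of inverse covariances, under the reading that the correction added to the designated block is the difference between the target's inverse and the inverse of the current marginal covariance on that block; the function and variable names are ours.

\begin{verbatim}
import numpy as np

def impose_gaussian_marginal(A, S, Q):
    # A: current inverse covariance (positive definite); S: list of coordinate
    # indices defining the subspace; Q: target covariance on S.
    # The current marginal covariance on S is (A^{-1})_{S,S}; deflation-inflation
    # replaces its contribution by that of Q via a correction to the S-block of A.
    A = np.array(A, dtype=float)
    Sigma_SS = np.linalg.inv(A)[np.ix_(S, S)]
    A[np.ix_(S, S)] += np.linalg.inv(Q) - np.linalg.inv(Sigma_SS)
    return A
\end{verbatim}

Taking the subspace to be a single index with target variance one recovers the single-entry diagonal update sketched earlier for MAX CUT.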
The next proposition connects our probabilistic development involving normal densities with the simpler notion of positive definite matrices and their determinants. It establishes that the Bregman distance [5, equation 1.4] arising from the determinant functional operating on positive definite matrices is essentially the same as the Kullback-Leibler distance between the zero-mean normal densities arising from these matrices.
then , and is a strictly convex function.
proof: The identity follows by combining well-known formulas for the quantities involved; recall that the (Frobenius) inner product of two symmetric matrices is just the trace of their product. Now from our knowledge of normal densities and KL-distance we have that the distance is nonnegative, and positive exactly when the two densities differ, from which it follows that the Bregman distance is positive except when its arguments are equal. Since the functional is continuous and differentiable, its strict convexity is established, i.e., the function's graph lies strictly to one side of each tangent plane. We remark that convexity may also be proven by expressing the distance in a basis where one of the matrices is diagonal, noticing that all eigenvalues must be real and positive, and using the elementary inequality $\log x \le x - 1$.
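As a sanity check on the proposition, one can verify numerically that the Bregman distance generated by a log-determinant functional agrees with the KL distance between the corresponding zero-mean normals up to a factor of two; the factor depends on the normalization convention chosen for the functional, which is an assumption of this sketch.

\begin{verbatim}
import numpy as np

def kl_gauss(S1, S2):
    # KL distance between zero-mean normals N(0, S1) and N(0, S2).
    n = S1.shape[0]
    S2inv = np.linalg.inv(S2)
    return 0.5 * (np.trace(S2inv @ S1) - n
                  + np.log(np.linalg.det(S2) / np.linalg.det(S1)))

def bregman_logdet(S1, S2):
    # Bregman distance generated by F(X) = -log det X on positive definite matrices.
    n = S1.shape[0]
    M = np.linalg.inv(S2) @ S1
    return np.trace(M) - n - np.log(np.linalg.det(M))

rng = np.random.default_rng(0)
B1 = rng.standard_normal((4, 4)); S1 = B1 @ B1.T + np.eye(4)
B2 = rng.standard_normal((4, 4)); S2 = B2 @ B2.T + np.eye(4)
assert np.isclose(bregman_logdet(S1, S2), 2.0 * kl_gauss(S1, S2))
\end{verbatim}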
In this light, our iterative approach is seen to be a Bregman projection method. That is, proposition 1 shows that the operator minimizes KL distance between normal densities, but this is the same as projection under Bregman distance operating on positive definite matrices using the determinant functional.
So applying the operator to a matrix gives the closest matrix to it that agrees with the prescribed marginal on the given subspace. Given this observation, the natural algorithm for discovering a matrix that satisfies a system of such constraints is to cycle over them, using the operator to forcibly impose each constraint in turn.
Our first theorem states that this natural algorithm works provided that a solution exists, and that it discovers the solution closest to the initial matrix chosen. Its proof requires the following observation, which establishes that all members of a bounded Bregman neighborhood have spectra bounded above and bounded strictly away from zero below.
proof: We begin from the definition of Bregman distance. We are free to choose a basis in which the reference matrix is diagonal, since the distance is easily seen to be basis-invariant. This follows from direct calculation, or, more interestingly, by reasoning that Bregman distance is essentially the same as KL distance, which is an entropic property of two normal densities that is fairly clearly coordinate-system independent. The distance is then bounded below by a term involving the largest eigenvalue, less a constant depending on the reference matrix, so a bounded distance bounds the spectrum above. Letting the smallest diagonal element play the same role bounds the spectrum below, whence elements of any bounded Bregman neighborhood must have spectra bounded above, and bounded strictly away from zero below.
proof: We write the cyclic iteration as:
Then for any positive definite matrix satisfying the constraints, we have from equation 2 that:
It is then clear that , for all , and that:
Now the set of iterates has bounded Bregman distance to the reference matrix, so by proposition 3 this set is bounded. Therefore a concentration point must exist. Since the set is bounded strictly away from zero, so is its closure, and therefore all concentration points must be positive definite.
Next observe that any concentration point must be stationary under the cyclic iteration, since this is a continuous mapping. It then follows that there is a unique concentration point, and its stationarity implies that it satisfies all constraints.
Now it is easy to see that:
Letting this implies that
. But then
We remark that a similar theorem and proof exist for alternatives to cyclic iteration such as the greedy approach in which a subspace having maximal distance to the target distribution is selected at each iteration.
In the Bregman framework, the constraint set corresponds to an intersection of convex subspaces, and the iteration converges to the point in this intersection closest to the starting point. Elements of the intersection (matrices satisfying all constraints) are stationary under the iteration.
A lot more can be said about the iteration of theorem 1. Later we discuss its rate of convergence, describe cases for which it converges in a finite number of iterations, prove that similar results hold for alternatives to the cyclic constraint ordering, and give a setting under which its computation is particularly fast.
When the intersection consists of more than one point Bregman observes that his iterative projection framework can do more than discover an arbitrary one -- it can be guided, through appropriate selection of an initial value, to converge on the point that solves some minimization problem over the intersection region. In our case it can maximize the determinant, and we will show later that our algorithm may be used to rapidly compute the maximum-determinant positive definite completion of a band matrix.
But our next step is to directly connect theorem 1 to certain problems of semidefinite programming. See [12,11] for a review and connections to combinatorial optimization. In what follows we denote the Frobenius matrix inner product by $\langle \cdot , \cdot \rangle$. The semidefinite programming problem we address is:
Given a symmetric matrix $A$, find a positive definite matrix $X$ that minimizes $\langle A, X \rangle$, subject to a collection of constraints of the kind described above, each prescribing the marginal of $X$ on a given subspace.
and we will refer to it as CLS-SDP, standing for constrained linear subspace semidefinite programming. We observe that the semidefinite relaxation of MAX CUT given in  is an instance of CLS-SDP. The following proposition connects the distance minimization goal of theorem 1 to this problem.
proof: Given , we have from Eq 6:
As the parameter grows, the role of the distance term diminishes and the focus is increasingly on the objective. We remark that when the constraints correspond to central minors of a single band matrix, starting points of a more general form, involving an arbitrary nonnegative diagonal matrix, can be shown to function in the same way.
Now, for a given parameter value, denote the unique solution to the minimization above, and let the minimum objective value achieved for an instance of CLS-SDP be given as well; it is easy to see how these two quantities are related. That the former may be viewed as an approximate solution to the original problem is established by proposition 6, but we first establish a technical result needed for its proof.
proof: We will bound the largest eigenvalue of the solution. Choose some matrix satisfying the constraints; this yields an upper bound independent of the parameter. For any unit-length vector it is easy to establish a bound in terms of a smallest eigenvalue, and it follows that this provides a lower bound on the diagonal elements of the solution that holds for any basis in which it is expressed. So choosing a basis that diagonalizes the solution, and combining these observations, we establish a lower bound. Eventually, as the largest eigenvalue is postulated to increase, the lower bound exceeds the upper bound, establishing an upper bound on the largest eigenvalue that is independent of the parameter.
proof: Select some matrix satisfying the constraints, with the parameter chosen appropriately. Next let a solution to the CLS-SDP instance be given; that is, a matrix that satisfies the constraints and achieves the minimal objective value. Define a suitable combination of these two matrices. Notice that it satisfies the constraints, that it is positive definite under the stated proviso, and that its objective value is controlled. Now:
The claim now follows by definition together with proposition 5.
The ``band matrix'' setting provides an elegant application of proposition 4 and theorem 1; that is, matrices with nonzero entries confined to a bounded distance from the diagonal. Proposition 8 gives a key result for this setting, and the following general observation is key to its proof. It tells us that when a marginal constraint that straddles the two components of a product distribution is imposed by applying the operator, the result exhibits marginals that agree with the component distributions, provided that the constraint agrees with a given component on the region of overlap.
proof: We write the density in the indicated product form. To evaluate one of its marginals we first integrate over one group of variables and then over the other; the result is as desired. The same argument applies to the other marginal.
Our first band-matrix result applies the development above to solve the ``band matrix completion problem.'' It is fascinating that we can reason about this rather pure matrix problem, and so easily solve it, using a probabilistic outlook and the language of normal densities.
proof: We first argue that the process terminates after a fixed number of steps. The sequence of computation is depicted in figure 1. Notice that since the starting matrix is diagonal, and each application of the operator only affects a central minor, the final result must also have band structure. By definition the first application of the operator enforces the constraint corresponding to the first central minor of the target matrix.
After some number of applications we consider the space to be a product of two subspaces: the first corresponding to the diagonal elements already processed, and the second to the remaining elements. Since the latter portion of the matrix is diagonal, it corresponds to the product of unit-variance independent normal densities, and the product-form requirement of proposition 7 is satisfied.
The constraint corresponding to the next application of the operator is consistent with the constraints enforced earlier, because the overlap of its central minor with the first subspace is a submatrix of the central minors already enforced. Thus the hypothesis of the proposition is satisfied. It then follows that this application will not disturb the marginals already enforced, so after all applications every constraint is satisfied, and we denote the result accordingly.
Now by theorem 1 the result minimizes the Bregman distance to the starting matrix, subject to the constraints given by the central minors of the target. But the diagonal of any matrix satisfying them must match the diagonal of the target, so the trace term in the distance is constant in the minimization, whence minimizing the distance amounts to maximizing the determinant.
A naive implementation of the computation of proposition 8 is expensive. But when the matrix is a band matrix, recall that the associated linear systems can be solved in time linear in the dimension; if it is nonsingular, its inverse can then be constructed column by column, and any constant-sized sub-block is obtained quickly. So regarding the bandwidth as fixed, the computation of proposition 8 is reduced accordingly. That is, the inverse of the maximum-determinant completion may be found efficiently; since it is band-structured, it may then be inverted to yield the completion itself, so that the entire problem has this complexity.
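To illustrate the column-by-column construction, the following sketch extracts one column of the inverse of a symmetric tridiagonal (bandwidth-one) matrix with a banded solve; the storage layout follows scipy's solve_banded convention, and generalizing to wider bands is routine.

\begin{verbatim}
import numpy as np
from scipy.linalg import solve_banded

def tridiagonal_inverse_column(diag, off, j):
    # Column j of the inverse of the symmetric tridiagonal matrix with main
    # diagonal `diag` (length n) and sub/super diagonal `off` (length n-1),
    # obtained by solving A x = e_j with a banded solver in O(n) time.
    n = len(diag)
    ab = np.zeros((3, n))    # banded storage for (l, u) = (1, 1)
    ab[0, 1:] = off          # super-diagonal
    ab[1, :] = diag          # main diagonal
    ab[2, :-1] = off         # sub-diagonal
    e = np.zeros(n)
    e[j] = 1.0
    return solve_banded((1, 1), ab, e)
\end{verbatim}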
The concrete example above builds intuition, but by instead framing the problem in more abstract terms we are quickly led to a highly general and efficient solution.
Given marginals on , where we assume without loss of generality that and , we can group variables as where , and so on. One may think of this as a block tridiagonal representation of the system of marginals.
Starting with any product distribution we perform the following Bregman-Sinkhorn iteration:
Then the resulting maximum entropy distribution is by induction:
In the Gaussian case we have where:
Then the maximal-determinant completion has a block tridiagonal inverse, with diagonal and off-diagonal blocks as indicated.
The result may be computed in time, or only parallel time with processors.
The local convergence properties of our positive definite matrix iteration, and those of other instances of deflation-inflation, may be analyzed using the following
where . Finally, assume that for some initial point , the sequence of iterates defined by converges to . Then the rate of convergence is linear.
proof: Convergence implies that for sufficiently large , where and and is analytic at . Thus , where . Observe that the condition implies
Corollaries to these lemmas establish linear convergence for our normal-density inspired positive definite case, and provide a concise proof for a rather general form of Sinkhorn balancing. For general Sinkhorn balancing we prove that the rate of convergence is linear iff the initial zero pattern is equal to the zero pattern of the limit. The statement and proofs of these results will be presented in the full paper.
Beyond local rate of convergence we are interested in a bound on the total number of operations required to produce an approximate solution. Our analysis applies to a form of the algorithm in which the next diagonal position to update is chosen greedily, rather than in a simple cyclic fashion. That is, we update the diagonal element that generates the largest immediate reduction in KL distance. A similar analysis is possible for the somewhat more practical ordering based on KL distance for each marginal, i.e. for each diagonal element.
The analysis requires a number of technical lemmas and is rather involved. We sketch its development here leaving the details for the full paper. The main theorem we establish is:
As the algorithm runs, the diagonal of the inverse approaches its target. It is first shown that the error in the diagonal elements must be made sufficiently small to achieve the desired approximation. This results from the observation that it is enough to continue until the KL distance to the optimal solution is itself sufficiently small.
Next, the initial KL distance is easily bounded. We fix a tolerance: the maximum diagonal error we will accept; that is, the algorithm terminates once the maximum diagonal error is less than this tolerance. Then from proposition 1 it follows that every diagonal update, up until we stop, must generate at least a corresponding improvement in KL distance, which bounds the number of steps required; choosing the tolerance appropriately we arrive at an easy naive bound.
We remark that this naive analysis may be adapted to the Sinkhorn case.
Fortunately, much stronger bounds may be developed for our normal density setting. The first step consists of showing that after a bounded number of steps the iterate comes within a unit KL ball of the solution.
Once within this ball it may be shown that relatively few additional steps are required to come within any fixed power of the tolerance of the optimal solution. This improves our bound, but additional improvement is possible.
The final idea consists of appropriate scheduling. This improves the first phase of the algorithm, where the objective is to come within unit distance. We employ a series of shrinking parameter values, each a constant factor smaller than the previous one. For each value we perform enough diagonal operations to bring us sufficiently close to the corresponding intermediate solution, so that the ultimate effect is that after the stated number of steps we come within unit distance as required. The second phase of the algorithm then runs, completing the solution.
We remark that while we do not use the interior point method in the typical fashion as a black box, several aspects of our analysis borrow from ideas at the heart of its analysis. Also, while our development follows an entirely different probabilistic line, it is interesting to note that we arrive at the same barrier function that appears there.
To give a more detailed view of the technical analysis we state main lemmas here without proof.
As remarked above, essentially the same naive analysis applies to ``greedy'' Sinkhorn balancing. Suppose we are given a row-normalized square incidence matrix of a bipartite graph.
On each step of our algorithm, in its greedy form, we choose the row or column whose sum of elements deviates most from one, and divide all entries in it by that sum.
It follows directly from  that this algorithm converges if and only if the graph has a perfect matching.
Using Hall's theorem it is easy to prove (and well known) that if at some iteration every row and column sum is sufficiently close to one, then a perfect matching exists.
Now suppose that we are given any doubly stochastic matrix with:
then from proposition 1, adapted to this distance, it follows that:
Thus the number of iterations required to reach the condition is at most , since for all , where is some small universal constant.
Next, if one takes the permutation matrix of a perfect matching for the graph, then the distance to it is bounded in terms of the vertex degrees.
We have therefore produced a balancing algorithm that checks the existence of a perfect matching requiring time or parallel steps given processors. Here denotes the number of edges in .
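A compact sketch of this perfect-matching test appears below; the stopping threshold eps and the step budget are illustrative placeholders for the bounds quoted above, and a row or column of zeros is rejected immediately.

\begin{verbatim}
import numpy as np

def has_perfect_matching(M, eps=1e-3, max_steps=100000):
    # Greedy balancing test on a nonnegative square matrix M (e.g. the
    # row-normalized incidence matrix of a bipartite graph): repeatedly
    # renormalize the row or column whose sum deviates most from one, and
    # report a perfect matching once every sum is within eps of one.
    M = np.array(M, dtype=float)
    for _ in range(max_steps):
        rows, cols = M.sum(axis=1), M.sum(axis=0)
        i = int(np.argmax(np.abs(rows - 1.0)))
        j = int(np.argmax(np.abs(cols - 1.0)))
        if max(abs(rows[i] - 1.0), abs(cols[j] - 1.0)) < eps:
            return True
        if abs(rows[i] - 1.0) >= abs(cols[j] - 1.0):
            if rows[i] == 0.0:
                return False    # an all-zero row: no matching
            M[i, :] /= rows[i]
        else:
            if cols[j] == 0.0:
                return False    # an all-zero column: no matching
            M[:, j] /= cols[j]
    return False                # failed to balance within the step budget
\end{verbatim}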
Our deflation-inflation approach cannot address semidefinite programming problems in general form, but we remark that our CLS-SDP variant includes problems that cannot be formulated as conventional instances of semidefinite programming without the introduction of a large number of constraints; namely, instances that involve constraints expressed in terms of transformed linear subspaces. We also emphasize that our method is particularly attractive for specially structured matrices, and in particular for band matrices.
In our probabilistic framework we use normal densities, which may be thought of as capturing second-order relationships between vertices, namely edges. We speculate that different probability functions might lead to interesting results: other members of the exponential family might be considered, or higher-order generalizations of normal densities.
It is clear that any global analysis of projection-based convex optimization methods (such as EM) requires bounds relating the global KL distance to the marginal distances. We remark that during our analysis, machinery from  was borrowed to establish some bounds of this sort. These entropic inequalities may be of independent interest.
Finally, we remark that there is tension between the work invested in generating random cuts and the work aimed at solving the semidefinite program for ever smaller values of the approximation parameter. Another interesting area for future work is to focus on the variance of the cuts generated, in order to better understand this tradeoff.