Leonid Gurvits and Peter N. Yianilos
Keywords: Semidefinite programming, Maximum cut (MAX CUT ) problem, maximum entropy, maximum determinant completion, Bregman distance, Kullback-Leibler distance, normal (Gaussian) density, Sinkhorn iteration.
Semidefinite programming has emerged as an important and practical optimization framework that is increasingly relevant to combinatorial problems. Given symmetric matrices $C, A_1, \ldots, A_m$ and a vector $b$, the objective is to minimize $\langle C, X \rangle$ subject to $X \succeq 0$ and $\langle A_i, X \rangle = b_i$, $i = 1, \ldots, m$.
There has been a great deal of recent work on the solution of semidefinite programming problems. The ellipsoid method together with a separation oracle due to Lovász may be used; and the work of Nesterov and Nemirovskii and of Alizadeh adapts the interior-point method to the task.
The connection between semidefinite programming and combinatorial optimization was first established by Lovász and later by Lovász and Schrijver.
Goemans and Williamson discovered a semidefinite programming relaxation of the NP-hard MAX CUT problem, which generates cuts with expected weight no less than approximately 0.878 of optimal. Other optimization problems have been considered using essentially the same approach.
For MAX CUT their method begins by constructing the symmetric weighted adjacency matrix $A$ of the graph $G$ under consideration. We denote by $n$ the number of vertices in $G$, and by $m$ the number of edges. The semidefinite programming problem is then to find the matrix $X$ that minimizes $\langle A, X \rangle$ subject to $X \succeq 0$ and $X_{ii} = 1$. This matrix is then factored as $X = V^\top V$ using, for example, Cholesky decomposition. Given a random vector $r$ they interpret the sign pattern of $V^\top r$ as a cut, and show that for uniform random $r$ the expected weight of this cut is at least $0.878$ times the weight of a maximum cut. It is easy to see that instead of uniform random $r$ one can use a Gaussian vector with unit covariance matrix. Also, since only the distribution of the sign pattern matters, it is enough to recover $X$ to within a diagonal factor, i.e. $DXD$, where $D$ is a diagonal positive-definite matrix.
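To fix ideas, here is a minimal sketch of the rounding step just described, assuming a strictly positive definite solution so that Cholesky factorization applies; the function name, trial count, and use of a Gaussian random vector (equivalent to uniform, as noted above) are our illustrative choices rather than the authors' code.

\begin{verbatim}
import numpy as np

def gw_round(X, A, trials=100, seed=0):
    # Sketch of Goemans-Williamson style rounding.  X is the relaxation
    # solution (assumed strictly positive definite), A the symmetric
    # weighted adjacency matrix with zero diagonal.
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(X)              # X = L L^T; rows of L are the vectors v_i
    best_weight, best_cut = -np.inf, None
    for _ in range(trials):
        r = rng.standard_normal(X.shape[0])    # Gaussian r works as well as uniform
        s = np.sign(L @ r)                     # +/-1 side assignment per vertex
        weight = 0.25 * (A.sum() - s @ A @ s)  # total weight crossing the cut
        if weight > best_weight:
            best_weight, best_cut = weight, s
    return best_cut, best_weight
\end{verbatim}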
This approach and the general connection of semidefinite programming to combinatorial optimization are surveyed in . An alternative solution approach, based on the power method, is given by Klein and Lu . Benson, Ye, and Zhang describe a dual scaling algorithm and report on experiments for semidefinite programs arising from instances of MAX CUT. The MAX CUT problem itself is surveyed in .
This paper introduces the deflation-inflation algorithm, and an accompanying outlook, for certain convex optimization problems including those arising from the MAX CUT relaxation above. Applied to MAX CUT the algorithm is quite simple. It begins by choosing diagonal values for the matrix so that it becomes positive definite; we solve an approximation to the semidefinite program, and the diagonal values are chosen to select the desired precision.
In each cycle of the algorithm the diagonal entries of the matrix are modified one by one, i.e. each iteration changes only one diagonal entry. The update rule is:
The algorithm terminates when each constrained diagonal entry of the inverse agrees with its target value to within some error tolerance. Notice that the algorithm is essentially searching over the space of diagonal values for the matrix.
We show that the algorithm above converges to a unique solution, and that this approximates well the original problem. Our analysis further establishes local linear convergence under rather general circumstances, and global convergence after a bounded number of diagonal updates given a form of the algorithm that chooses diagonal elements greedily. The local analysis and portions of the global analysis apply to the well-studied Sinkhorn balancing problem  and seem relevant to the recent matching work of Linial, Samorodnitsky and Wigderson . Our global analysis is the first we are aware of for projection methods of this general class, which includes the celebrated expectation maximization (EM) algorithm .
The critical operation in each update is the computation of a single diagonal entry of the inverse. Notice that only one element of the inverse is examined, so the operation amounts to solving one system of linear equations. Thus when the matrix has special structure, or is sparse, specialized statements about the algorithm's complexity can be made. When these systems are large and can be rapidly solved, our method may be the best available.
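To make the update concrete, here is a minimal sketch of one reading of the cycle, assuming (consistent with the constraint $X_{ii}=1$ and the Gaussian deflation-inflation step developed later in the paper) that each update adjusts the diagonal entry so that the corresponding diagonal entry of the inverse becomes one; the names and tolerance are illustrative, and a dense solver stands in for whatever structured solver is appropriate.

\begin{verbatim}
import numpy as np

def deflation_inflation_diagonal(A, tol=1e-8, max_cycles=1000):
    # Sketch: repeatedly adjust the diagonal of the positive definite
    # matrix A until every diagonal entry of A^{-1} equals 1 (the MAX CUT
    # relaxation's constraint on X = A^{-1}).  Assumed single-entry rule:
    # a_ii += 1 - 1/(A^{-1})_ii, which by Sherman-Morrison forces the new
    # (A^{-1})_ii to be exactly 1.
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for _ in range(max_cycles):
        worst = 0.0
        for i in range(n):
            e = np.zeros(n)
            e[i] = 1.0
            x = np.linalg.solve(A, e)   # the critical operation: one linear system
            v = x[i]                    # (A^{-1})_{ii}
            worst = max(worst, abs(v - 1.0))
            A[i, i] += 1.0 - 1.0 / v    # impose the i-th diagonal constraint
        if worst < tol:
            break
    return A
\end{verbatim}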
In the sparse case it is interesting to note that our basic algorithm requires only linear space.
We have implemented our algorithm for MAX CUT using Gauss-Seidel iteration with successive over-relaxation to solve the linear systems. To avoid inverting the resulting matrix and performing Cholesky decomposition, we instead use a power series computation. For sparse constant-degree random graphs our analysis predicts the overall scaling of this method, and our experiments confirm the prediction. The full paper describes this implementation and our experimental results.
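For completeness, a generic Gauss-Seidel/SOR solver of the kind referred to above might look as follows; the relaxation factor, tolerance, and dense indexing are illustrative stand-ins for the sparse implementation described in the full paper.

\begin{verbatim}
import numpy as np

def sor_solve(A, b, omega=1.5, tol=1e-10, max_iter=10000):
    # Gauss-Seidel iteration with successive over-relaxation for A x = b.
    n = len(b)
    x = np.zeros(n)
    for _ in range(max_iter):
        delta = 0.0
        for i in range(n):
            sigma = A[i, :] @ x - A[i, i] * x[i]   # sum over j != i of a_ij x_j
            new = (1 - omega) * x[i] + omega * (b[i] - sigma) / A[i, i]
            delta = max(delta, abs(new - x[i]))
            x[i] = new
        if delta < tol:
            break
    return x
\end{verbatim}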
Recall that the starting matrix is essentially a combinatorial object representing the weighted adjacency of graph nodes. After the algorithm runs, having modified only its diagonal entries, the inverse of the result is the matrix we seek. An interesting aspect of this work is that a probabilistic viewpoint is of utility when addressing this essentially combinatorial problem. We view the sought matrix as the covariance matrix associated with a normal density, and the modified adjacency matrix as its inverse covariance. It is easily verified that the random cuts are then just sign patterns of random vectors drawn from this density.
Each step in our algorithm corresponds to a projection operation with respect to Kullback-Leibler distance [16,6] under Bregman's general framework for convex optimization . This framework is closely related to the later work of I. Csiszár and G. Tusnády , and generalizes the much earlier work [1,19] and also [10,4].
Our main computational focus in this paper is the semidefinite programming problem corresponding to the MAX CUT relaxation, i.e. with diagonal constraints. Another class of problems, which admits essentially the same complexity and convergence analysis, corresponds to block-diagonal positive-definite constraints. The case of two complementary block-diagonal constraints can be solved using the singular value decomposition, while our method in this case is a Riccati-like iteration. We believe that beyond two block-diagonal constraints our method is the best currently available practical method for large problems.
Another application of our method is the efficient solution of matrix completion problems . Here the objective is to find the maximum-determinant positive definite matrix with specified values on and near its diagonal. This corresponds to finding the maximum entropy normal density having prescribed marginal distributions.
Finally, we want to clarify one important historical point: our choice of barrier was motivated by Bregman's seminal paper . In a discrete setting this approach also has many merits, but lacks the crucial self-concordance property . In the Gaussian case, however, this barrier (basically Bregman's) is in a sense optimal . The celebrated self-concordance property has been studied in the context of Newton-like methods; we use it to analyze the convergence of our projection-like method.
Given a joint probability distribution $p$ on a product space $X \times Y$, and another distribution $q$ defined on $X$ alone, we define a new distribution which inherits as much as possible from $p$ while forcing its $X$-marginal to match $q$. The focus of our paper is exposing certain properties and algorithms relating to this natural operation, in which a joint distribution is deflated via conditionalization, and then reinflated by multiplication to exhibit some required marginal.
We denote by $p_X$ the $X$-marginal of $p$, i.e. $p_X(x) = \sum_y p(x,y)$. Similarly we denote by $p(y|x)$ the conditional probability of $y$ given $x$. Next, given a distribution $q$ on $X$, we define the operator that transforms $p$ into the new distribution $q(x)\,p(y|x)$, whose $X$-marginal is exactly $q$.
Among all distributions having this marginal, the result is special inasmuch as no other such distribution is closer to $p$ under the Kullback-Leibler (KL) distance [16,6]. That is, the operator is a projector with respect to KL-distance onto the set of distributions with $X$-marginal $q$. This is formalized by:
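In the finite discrete case the operator can be written in a couple of lines; the following sketch (the array names and the small example are illustrative) shows that the result has exactly the prescribed marginal.

\begin{verbatim}
import numpy as np

def deflate_inflate(p, q):
    # p: joint distribution over X x Y, rows indexed by X; q: target X-marginal.
    # Returns q(x) * p(y|x), whose X-marginal is exactly q.
    px = p.sum(axis=1, keepdims=True)                              # current X-marginal
    cond = np.divide(p, px, out=np.zeros_like(p), where=px > 0)    # p(y|x)
    return q.reshape(-1, 1) * cond

p = np.array([[0.2, 0.1],
              [0.3, 0.4]])
q = np.array([0.5, 0.5])
r = deflate_inflate(p, q)
assert np.allclose(r.sum(axis=1), q)    # the X-marginal now matches q
\end{verbatim}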
proof: We begin by establishing the following important Pythagorean KL-distance identity, which we will refer to again later in the paper:
for joint distributions and . From the definition of KL-distance and the observation that , the left side of this equality becomes:
Since by assumption we may then write:
which is clearly minimized by the choice indicated, since KL-distance is zero when its two arguments are the same distribution.
Now is defined in Eq-1 for product spaces . But the definition easily extends to more general finite product spaces by letting stand for some subset of the dimensions , and stand for the others. The function is then defined on the product space corresponding to the members of .
More formally, let be a subset of the dimensions and be a function of . Then the operator transforms into a new function whose -marginal exactly matches .
We view the pair as a constraint on the behavior of a desired distribution. Then, given a starting distribution and a collection of such constraints, a natural approach to transforming it into a distribution that satisfies all of them is to iteratively apply the operator, cycling through the collection of constraints. One cycle corresponds to the composition:
and many such cycles may be needed.
We remark that in a particular discrete probability setting this iteration corresponds to Sinkhorn balancing [23,24]. In that setting the distribution is represented by a matrix with nonnegative entries, and the iteration corresponds to an alternating row-wise and column-wise renormalization procedure. In Sinkhorn's original paper the target marginals are uniform, and to form the initial distribution one simply scales down the entire matrix so that the sum of its entries is unity. There is a considerable literature on this subject, but it is not the focus of our paper.
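A standard rendering of this iteration in the discrete matrix setting follows; the target marginals r and c and the stopping rule are illustrative, uniform targets recover the classical doubly stochastic case, and every row and column is assumed to contain a nonzero entry.

\begin{verbatim}
import numpy as np

def sinkhorn_balance(M, r, c, max_cycles=1000, tol=1e-10):
    # Alternately rescale rows and columns of the nonnegative matrix M so
    # that its row sums approach r and its column sums approach c.
    M = np.array(M, dtype=float)
    for _ in range(max_cycles):
        M *= (r / M.sum(axis=1))[:, None]    # impose the row marginals
        M *= (c / M.sum(axis=0))[None, :]    # impose the column marginals
        if np.abs(M.sum(axis=1) - r).max() < tol:
            break
    return M
\end{verbatim}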
In the Sinkhorn case, the iteration of Eq-3 gives rise to a constructive algorithm because the simple discrete distributions involved have the property that:
Normal densities represent another important class for which these properties hold, and here too the approach of Eq-3 leads to a simple and concrete algorithm. Consider the zero-mean normal distribution on $\mathbb{R}^n$:
Now for any consider normal density on corresponding to subspace . The marginal of corresponding to subspace is . So . Ignoring the leading constant and the term in the exponent the normal density corresponding to may be written:
from which it is clear that where
and for brevity in what follows we abuse our notation and write:
It is important to notice that this operation is best carried out using the inverses of the matrices involved. That is, given the inverse covariance of the original density, one arrives at the inverse covariance of the result by simply adding a correction term to its upper-left block.
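The following sketch spells this out for zero-mean normal densities in terms of inverse covariances, under the reading that the correction added to the designated block is the difference between the target's inverse and the inverse of the current marginal covariance on that block; the function and variable names are ours.

\begin{verbatim}
import numpy as np

def impose_gaussian_marginal(A, S, Q):
    # A: current inverse covariance (positive definite); S: list of coordinate
    # indices defining the subspace; Q: target covariance on S.
    # The current marginal covariance on S is (A^{-1})_{S,S}; deflation-inflation
    # replaces its contribution by that of Q via a correction to the S-block of A.
    A = np.array(A, dtype=float)
    Sigma_SS = np.linalg.inv(A)[np.ix_(S, S)]
    A[np.ix_(S, S)] += np.linalg.inv(Q) - np.linalg.inv(Sigma_SS)
    return A
\end{verbatim}

Taking the subspace to be a single index with target variance one recovers the single-entry diagonal update sketched earlier for MAX CUT.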
The next proposition connects our probabilistic development involving normal densities with the simpler notion of positive definite matrices and their determinants. It establishes that the Bregman distance [5, equation 1.4] arising from the determinant functional operating on positive definite matrices is essentially the same as the Kullback-Leibler distance between the zero-mean normal densities arising from these matrices.
then , and is a strictly convex function.
proof: The identity follows by combining well-known formulas for the quantities involved; recall that the (Frobenius) inner product of two symmetric matrices is just the trace of their product. Now from our knowledge of normal densities and KL-distance we have that the distance is nonnegative, and positive exactly when the two densities differ, from which it follows that the Bregman distance is positive except when its arguments are equal. Since the functional is continuous and differentiable, its strict convexity is established, i.e., the function's graph lies strictly to one side of each tangent plane. We remark that convexity may also be proven by expressing the distance in a basis where one of the matrices is diagonal, noticing that all eigenvalues must be real and positive, and using the elementary inequality $\log x \le x - 1$.
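As a sanity check on the proposition, one can verify numerically that the Bregman distance generated by a log-determinant functional agrees with the KL distance between the corresponding zero-mean normals up to a factor of two; the factor depends on the normalization convention chosen for the functional, which is an assumption of this sketch.

\begin{verbatim}
import numpy as np

def kl_gauss(S1, S2):
    # KL distance between zero-mean normals N(0, S1) and N(0, S2).
    n = S1.shape[0]
    S2inv = np.linalg.inv(S2)
    return 0.5 * (np.trace(S2inv @ S1) - n
                  + np.log(np.linalg.det(S2) / np.linalg.det(S1)))

def bregman_logdet(S1, S2):
    # Bregman distance generated by F(X) = -log det X on positive definite matrices.
    n = S1.shape[0]
    M = np.linalg.inv(S2) @ S1
    return np.trace(M) - n - np.log(np.linalg.det(M))

rng = np.random.default_rng(0)
B1 = rng.standard_normal((4, 4)); S1 = B1 @ B1.T + np.eye(4)
B2 = rng.standard_normal((4, 4)); S2 = B2 @ B2.T + np.eye(4)
assert np.isclose(bregman_logdet(S1, S2), 2.0 * kl_gauss(S1, S2))
\end{verbatim}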
In this light, our iterative approach is seen to be a Bregman projection method. That is, proposition 1 shows that the operator minimizes KL distance between normal densities, but this is the same as projection under Bregman distance operating on positive definite matrices using the determinant functional.
So applying the operator to a matrix gives the closest matrix to it that agrees with the prescribed marginal on the given subspace. Given this observation, the natural algorithm for discovering a matrix that satisfies a system of such constraints is to cycle over them, using the operator to forcibly impose each constraint in turn.
Our first theorem states that this natural algorithm works provided that a solution exists, and that it discovers the solution closest to the initial matrix chosen. Its proof requires the following observation, which establishes that all members of a bounded Bregman neighborhood have spectra bounded above and bounded strictly away from zero below.
proof: We begin from the definition of Bregman distance. We are free to choose a basis in which the reference matrix is diagonal, since the distance is easily seen to be basis-invariant. This follows from direct calculation, or, more interestingly, by reasoning that Bregman distance is essentially the same as KL distance, which is an entropic property of two normal densities that is fairly clearly coordinate-system independent. The distance is then bounded below by a term involving the largest eigenvalue, less a constant depending on the reference matrix, so a bounded distance bounds the spectrum above. Letting the smallest diagonal element play the same role bounds the spectrum below, whence elements of any bounded Bregman neighborhood must have spectra bounded above, and bounded strictly away from zero below.
proof: We write the cyclic iteration as:
Then for any positive definite matrix satisfying the constraints, we have from equation 2 that:
It is then clear that , for all , and that:
Now the set of iterates has bounded Bregman distance to the reference matrix, so by proposition 3 this set is bounded. Therefore a concentration point must exist. Since the set is bounded strictly away from zero, so is its closure, and therefore all concentration points must be positive definite.
Next observe that any concentration point must be stationary under the cyclic iteration, since this is a continuous mapping. It then follows that there is a unique concentration point, and its stationarity implies that it satisfies all constraints.
Now it is easy to see that:
Letting this implies that
. But then
We remark that a similar theorem and proof exist for alternatives to cyclic iteration such as the greedy approach in which a subspace having maximal distance to the target distribution is selected at each iteration.
In the Bregman framework, the constraint set corresponds to an intersection of convex subspaces, and the iteration converges to the point in this intersection closest to the starting point. Elements of the intersection (matrices satisfying all constraints) are stationary under the iteration.
A lot more can be said about the iteration of theorem 1. Later we discuss its rate of convergence, describe cases for which it converges in a finite number of iterations, prove that similar results hold for alternatives to the cyclic constraint ordering, and give a setting under which its computation is particularly fast.
When the intersection consists of more than one point Bregman observes that his iterative projection framework can do more than discover an arbitrary one -- it can be guided, through appropriate selection of an initial value, to converge on the point that solves some minimization problem over the intersection region. In our case it can maximize the determinant, and we will show later that our algorithm may be used to rapidly compute the maximum-determinant positive definite completion of a band matrix.
But our next step is to directly connect theorem 1 to certain problems of semidefinite programming. See [12,11] for a review and connections to combinatorial optimization. In what follows we denote the Frobenius matrix inner product by $\langle \cdot , \cdot \rangle$. The semidefinite programming problem we address is:
Given a symmetric matrix $A$, find a positive definite matrix $X$ that minimizes $\langle A, X \rangle$, subject to a collection of constraints of the kind described above, each prescribing the marginal of $X$ on a given subspace.
and we will refer to it as CLS-SDP, standing for constrained linear subspace semidefinite programming. We observe that the semidefinite relaxation of MAX CUT given in  is an instance of CLS-SDP. The following proposition connects the distance minimization goal of theorem 1 to this problem.
proof: Given , we have from Eq 6:
As the parameter grows, the role of the distance term diminishes and the focus is increasingly on the objective. We remark that when the constraints correspond to central minors of a single band matrix, starting points of a more general form, involving an arbitrary nonnegative diagonal matrix, can be shown to function in the same way.
Now, for a given parameter value, denote the unique solution to the minimization above, and let the minimum objective value achieved for an instance of CLS-SDP be given as well; it is easy to see how these two quantities are related. That the former may be viewed as an approximate solution to the original problem is established by proposition 6, but we first establish a technical result needed for its proof.
proof: We will bound the largest eigenvalue of the solution. Choose some matrix satisfying the constraints; this yields an upper bound independent of the parameter. For any unit-length vector it is easy to establish a bound in terms of a smallest eigenvalue, and it follows that this provides a lower bound on the diagonal elements of the solution that holds for any basis in which it is expressed. So choosing a basis that diagonalizes the solution, and combining these observations, we establish a lower bound. Eventually, as the largest eigenvalue is postulated to increase, the lower bound exceeds the upper bound, establishing an upper bound on the largest eigenvalue that is independent of the parameter.
proof: Select some matrix satisfying the constraints, with the parameter chosen appropriately. Next let a solution to the CLS-SDP instance be given; that is, a matrix that satisfies the constraints and achieves the minimal objective value. Define a suitable combination of these two matrices. Notice that it satisfies the constraints, that it is positive definite under the stated proviso, and that its objective value is controlled. Now:
The claim now follows by definition together with proposition 5.
The ``band matrix'' setting provides an elegant application of proposition 4 and theorem 1; that is, matrices with nonzero entries confined to a bounded distance from the diagonal. Proposition 8 gives a key result for this setting, and the following general observation is key to its proof. It tells us that when a marginal constraint that straddles the two components of a product distribution is imposed by applying the operator, the result exhibits marginals that agree with the component distributions, provided that the constraint agrees with a given component on the region of overlap.
proof: We write the density in the indicated product form. To evaluate one of its marginals we first integrate over one group of variables and then over the other; the result is as desired. The same argument applies to the other marginal.
Our first band-matrix result applies the development above to solve the ``band matrix completion problem.'' It is fascinating that we can reason about this rather pure matrix problem, and so easily solve it, using a probabilistic outlook and the language of normal densities.
proof: We first argue that the process terminates after a fixed number of steps. The sequence of computation is depicted in figure 1. Notice that since the starting matrix is diagonal, and each application of the operator only affects a central minor, the final result must also have band structure. By definition the first application of the operator enforces the constraint corresponding to the first central minor of the target matrix.
After some number of applications we consider the space to be a product of two subspaces: the first corresponding to the diagonal elements already processed, and the second to the remaining elements. Since the latter portion of the matrix is diagonal, it corresponds to the product of unit-variance independent normal densities, and the product-form requirement of proposition 7 is satisfied.
The constraint corresponding to the next application of the operator is consistent with the constraints enforced earlier, because the overlap of its central minor with the first subspace is a submatrix of the central minors already enforced. Thus the hypothesis of the proposition is satisfied. It then follows that this application will not disturb the marginals already enforced, so after all applications every constraint is satisfied, and we denote the result accordingly.
Now by theorem 1 the result minimizes the Bregman distance to the starting matrix, subject to the constraints given by the central minors of the target. But the diagonal of any matrix satisfying them must match the diagonal of the target, so the trace term in the distance is constant in the minimization, whence minimizing the distance amounts to maximizing the determinant.
A naive implementation of the computation of proposition 8 is expensive. But when the matrix is a band matrix, recall that the associated linear systems can be solved in time linear in the dimension; if it is nonsingular, its inverse can then be constructed column by column, and any constant-sized sub-block is obtained quickly. So regarding the bandwidth as fixed, the computation of proposition 8 is reduced accordingly. That is, the inverse of the maximum-determinant completion may be found efficiently; since it is band-structured, it may then be inverted to yield the completion itself, so that the entire problem has this complexity.
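To illustrate the column-by-column construction, the following sketch extracts one column of the inverse of a symmetric tridiagonal (bandwidth-one) matrix with a banded solve; the storage layout follows scipy's solve_banded convention, and generalizing to wider bands is routine.

\begin{verbatim}
import numpy as np
from scipy.linalg import solve_banded

def tridiagonal_inverse_column(diag, off, j):
    # Column j of the inverse of the symmetric tridiagonal matrix with main
    # diagonal `diag` (length n) and sub/super diagonal `off` (length n-1),
    # obtained by solving A x = e_j with a banded solver in O(n) time.
    n = len(diag)
    ab = np.zeros((3, n))    # banded storage for (l, u) = (1, 1)
    ab[0, 1:] = off          # super-diagonal
    ab[1, :] = diag          # main diagonal
    ab[2, :-1] = off         # sub-diagonal
    e = np.zeros(n)
    e[j] = 1.0
    return solve_banded((1, 1), ab, e)
\end{verbatim}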
The concrete example above builds intuition, but by instead framing the problem in more abstract terms we are quickly led to a highly general and efficient solution.
Given marginals on , where we assume without loss of generality that and , we can group variables as where , and so on. One may think of this as a block tridiagonal representation of the system of marginals.
Starting with any product distribution we perform the following Bregman-Sinkhorn iteration:
Then the resulting maximum entropy distribution is by induction:
In the Gaussian case we have where:
Then the maximal-determinant completion has a block tridiagonal inverse, with diagonal and off-diagonal blocks as indicated.
The result may be computed in time, or only parallel time with processors.
The local convergence properties of our positive definite matrix iteration, and those of other instances of deflation-inflation, may be analyzed using the following
where . Finally, assume that for some initial point , the sequence of iterates defined by converges to . Then the rate of convergence is linear.
proof: Convergence implies that for sufficiently large , where and and is analytic at . Thus , where . Observe that the condition implies
Corollaries to these lemmas establish linear convergence for our normal-density inspired positive definite case, and provide a concise proof for a rather general form of Sinkhorn balancing. For general Sinkhorn balancing we prove that the rate of convergence is linear iff the initial zero pattern is equal to the zero pattern of the limit. The statement and proofs of these results will be presented in the full paper.
Beyond local rate of convergence we are interested in a bound on the total number of operations required to produce an approximate solution. Our analysis applies to a form of the algorithm in which the next diagonal position to update is chosen greedily, rather than in a simple cyclic fashion. That is, we update the diagonal element that generates the largest immediate reduction in KL distance. A similar analysis is possible for the somewhat more practical ordering based on KL distance for each marginal, i.e. for each diagonal element.
The analysis requires a number of technical lemmas and is rather involved. We sketch its development here leaving the details for the full paper. The main theorem we establish is:
As the algorithm runs, the diagonal of the inverse approaches its target. It is first shown that the error in the diagonal elements must be made sufficiently small to achieve the desired approximation. This results from the observation that it is enough to continue until the KL distance to the optimal solution is itself sufficiently small.
Next, the initial KL distance is easily bounded. We fix a tolerance: the maximum diagonal error we will accept; that is, the algorithm terminates once the maximum diagonal error is less than this tolerance. Then from proposition 1 it follows that every diagonal update, up until we stop, must generate at least a corresponding improvement in KL distance, which bounds the number of steps required; choosing the tolerance appropriately we arrive at an easy naive bound.
We remark that this naive analysis may be adapted to the Sinkhorn case.
Fortunately, much stronger bounds may be developed for our normal density setting. The first step consists of showing that after a bounded number of steps the iterate comes within a unit KL ball of the solution.
Once within this ball it may be shown that relatively few additional steps are required to come within any fixed power of the tolerance of the optimal solution. This improves our bound, but additional improvement is possible.
The final idea consists of appropriate scheduling. This improves the first phase of the algorithm, where the objective is to come within unit distance. We employ a series of shrinking parameter values, each a constant factor smaller than the previous one. For each value we perform enough diagonal operations to bring us sufficiently close to the corresponding intermediate solution, so that the ultimate effect is that after the stated number of steps we come within unit distance as required. The second phase of the algorithm then runs, completing the solution.
We remark that while we do not use the interior point method in the typical fashion as a black box, several aspects of our analysis borrow from ideas at the heart of its analysis. Also, while our development follows an entirely different probabilistic line, it is interesting to note that we arrive at the same barrier function that appears there.
To give a more detailed view of the technical analysis we state main lemmas here without proof.
As remarked above, essentially the same naive analysis applies to ``greedy'' Sinkhorn balancing. Suppose we are given a row-normalized square incidence matrix of a bipartite graph.
On each step of our algorithm, in its greedy form, we choose the row or column whose sum of elements deviates most from one, and divide all entries in it by that sum.
It follows directly from  that this algorithm converges if and only if the graph has a perfect matching.
Using Hall's theorem it is easy to prove (and well known) that if at some iteration every row and column sum is sufficiently close to one, then a perfect matching exists.
Now suppose that we are given any doubly stochastic matrix with:
then from proposition 1, adapted to this distance, it follows that:
Thus the number of iterations required to reach the condition is at most , since for all , where is some small universal constant.
Next, if one takes the permutation matrix of a perfect matching for the graph, then the distance to it is bounded in terms of the vertex degrees.
We have therefore produced a balancing algorithm that checks the existence of a perfect matching requiring time or parallel steps given processors. Here denotes the number of edges in .
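A compact sketch of this perfect-matching test appears below; the stopping threshold eps and the step budget are illustrative placeholders for the bounds quoted above, and a row or column of zeros is rejected immediately.

\begin{verbatim}
import numpy as np

def has_perfect_matching(M, eps=1e-3, max_steps=100000):
    # Greedy balancing test on a nonnegative square matrix M (e.g. the
    # row-normalized incidence matrix of a bipartite graph): repeatedly
    # renormalize the row or column whose sum deviates most from one, and
    # report a perfect matching once every sum is within eps of one.
    M = np.array(M, dtype=float)
    for _ in range(max_steps):
        rows, cols = M.sum(axis=1), M.sum(axis=0)
        i = int(np.argmax(np.abs(rows - 1.0)))
        j = int(np.argmax(np.abs(cols - 1.0)))
        if max(abs(rows[i] - 1.0), abs(cols[j] - 1.0)) < eps:
            return True
        if abs(rows[i] - 1.0) >= abs(cols[j] - 1.0):
            if rows[i] == 0.0:
                return False    # an all-zero row: no matching
            M[i, :] /= rows[i]
        else:
            if cols[j] == 0.0:
                return False    # an all-zero column: no matching
            M[:, j] /= cols[j]
    return False                # failed to balance within the step budget
\end{verbatim}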
Our deflation-inflation approach cannot address semidefinite programming problems in general form, but we remark that our CLS-SDP variant includes problems that cannot be formulated as conventional instances of semidefinite programming without the introduction of a large number of constraints; namely, instances that involve constraints expressed in terms of transformed linear subspaces. We also emphasize that our method is particularly attractive for specially structured matrices, and in particular for band matrices.
In our probabilistic framework we use normal densities, which may be thought of as capturing second-order relationships between vertices, namely edges. We speculate that different probability functions might lead to interesting results: other members of the exponential family might be considered, or higher-order generalizations of normal densities.
It is clear that any global analysis of projection-based convex optimization methods (such as EM) requires bounds relating the global KL distance to the marginal distances. We remark that during our analysis, machinery from  was borrowed to establish some bounds of this sort. These entropic inequalities may be of independent interest.
Finally, we remark that there is tension between the work invested in generating random cuts and the work aimed at solving the semidefinite program for ever smaller values of the approximation parameter. Another interesting area for future work is to focus on the variance of the cuts generated, in order to better understand this tradeoff.