On the Convergence of a Regularized Jacobi Algorithm for Convex Optimization

In this paper, we consider the regularized version of the Jacobi algorithm, a block coordinate descent method for convex optimization with an objective function consisting of the sum of a differentiable function and a block-separable function. Under certain regularity assumptions on the objective function, this algorithm has been shown to satisfy the so-called sufficient decrease condition, and consequently, to converge in objective function value. In this paper, we revisit the convergence analysis of the regularized Jacobi algorithm and show that it also converges in iterates under very mild conditions on the objective function. Moreover, we establish conditions under which the algorithm achieves a linear convergence rate.


I. INTRODUCTION
In this paper we consider large-scale optimization problems in which a collection of individual actors (or agents) cooperate to minimize some common objective function while incorporating local constraints or additional local utility functions. We consider a decentralized optimization method based on block coordinate descent, an iterative coordinating procedure which has attracted significant attention for solving large-scale optimization problems [1]- [3].
Solving large-scale optimization problems via an iterative procedure that coordinates among blocks of variables enables the solution of very large problem instances by parallelizing computation across agents. This makes it possible to overcome computational challenges that would otherwise be prohibitive, without requiring agents to reveal their local utility functions and constraints to other agents. Due to its pricing mechanism implications, decentralized optimization is also a natural choice for many applications, including demand side management in smart grids, charging coordination for plug-in electric vehicles, and coordination of multiple agents in robotic systems [4]-[6].
Based on the algorithms outlined in [2], two classes of iterative methods have been employed recently for solving such optimization problems in a decentralized way. The first comprises block coordinate gradient descent (BCGD) methods and requires each agent to perform, at every iteration, a local (proximal) gradient descent step [1], [6]. Under certain regularity assumptions (differentiability of the objective function and Lipschitz continuity of its gradient), and for an appropriately chosen gradient step size, this method converges to a minimizer of the centralized problem. This class of algorithms includes both sequential [7] and parallel [8], [9] implementations.
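For intuition, a single BCGD-style proximal gradient step can be sketched as follows. This is an illustrative example and not an algorithm from this paper: we assume the nonsmooth local term is an ℓ1 norm, whose proximal operator is the closed-form soft-thresholding map.

```python
import numpy as np

def soft_threshold(v, t):
    # closed-form proximal operator of t * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_grad_step(x, grad_f, step, lam):
    # one proximal gradient step: forward gradient step on the smooth
    # part f, then the proximal operator of the nonsmooth part lam*||.||_1
    return soft_threshold(x - step * grad_f(x), step * lam)

# usage with f(x) = 0.5*||x - y||^2, so grad_f(x) = x - y
y = np.array([2.0, -0.2, 1.0])
x_new = prox_grad_step(np.zeros(3), lambda x: x - y, 1.0, 0.5)
```

With unit step size and this particular quadratic f, a single step lands exactly on the soft-thresholded data vector, which is the minimizer of the composite objective.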
The second comprises block coordinate minimization (BCM) methods; it does not assume differentiability of the objective and is based on minimizing the common objective function in each block while fixing the variables associated with the other agents to their previously computed values. BCM methods have a larger per-iteration cost than BCGD methods when there are no local utility functions (constraints) in the problem, or when the proximal operators (projections) of these functions have closed-form solutions; in the general case, however, both approaches require the solution of ancillary optimization problems. On the other hand, iterations of BCM methods are numerically more stable than gradient iterations, as observed in [10].
If the block-wise minimizations are performed in a cyclic fashion across agents, then the algorithm is known as the Gauss-Seidel algorithm [3], [7], [11]. An alternative implementation, known as the Jacobi algorithm, performs the block-wise minimizations in parallel. However, convergence of the Jacobi algorithm is not guaranteed in general, even when the objective function is smooth and convex, unless certain contractiveness properties are satisfied. The authors in [12] have proposed a regularized Jacobi algorithm wherein, at each iteration, each agent minimizes the weighted sum of the common objective function and a quadratic regularization term penalizing the distance to the previous iterate of the algorithm. A similar regularization has been used in Gauss-Seidel methods [7], [11], which are however not parallelizable. Under certain regularity assumptions, and for an appropriately selected regularization weight, the algorithm converges in objective value to the optimal value of the centralized problem [12]. Recently, the authors in [13] have quantified the regularization weight required to ensure convergence in objective value as a function of the number of agents and other problem parameters. However, convergence of the algorithm in its iterates to an optimizer of the centralized problem counterpart was not established, apart from the particular case where the objective function is quadratic.
In this paper we revisit the algorithm proposed in [12] and enhance its convergence properties under milder conditions. By adopting an analysis based on a power growth property, which is in turn sufficient for the satisfaction of the so-called Kurdyka-Łojasiewicz condition [11], [14], we show that the algorithm's iterates converge under much milder assumptions on the objective function than those used in [2] and [13]. A similar approach was used in [3], [11] to establish convergence of iterates generated by Gauss-Seidel type methods. We also show that the algorithm achieves a linear convergence rate without imposing restrictive strong convexity assumptions on the objective function, in contrast to typical methods in the literature. Our analysis is based on the quadratic growth condition, which is closely related to the so-called error bound property [15], [16] that is used in [8] to establish linear convergence of parallel BCGD methods in objective value.
The remainder of the paper is organized as follows. In Section II we introduce the class of problems under study, outline the regularized Jacobi algorithm for solving such problems in a decentralized fashion, and state the main convergence result of the paper. Section III provides the proof of the main result. Section IV provides a convergence rate analysis, while Section V concludes the paper.

Notation
Let N denote the set of nonnegative integers, R the set of real numbers, R+ the set of nonnegative real numbers, R̄ := R ∪ {∞} the extended real line, and R^n the n-dimensional real space equipped with inner product ⟨x, y⟩ and induced norm ‖x‖. Consider a vector x = (x_1, . . . , x_m) ∈ R^n partitioned into m blocks with x_i ∈ R^{n_i}; we write x_{-i} ∈ R^{n−n_i} for the vector collecting all blocks of x except the i-th. The subdifferential of f at x is denoted by ∂f(x). If f is continuously differentiable, then ∇f(x) denotes the gradient of f evaluated at x. We denote by [a ≤ f ≤ b] := {x ∈ R^n | a ≤ f(x) ≤ b} the set of points whose value under f lies between a and b; similar notation is used for strict inequalities and for one-sided bounds. The set of minimizers of f is denoted by argmin f := {x ∈ dom f | f(x) = min f}, where min f is the minimum value of f. We say that a differentiable function f is strongly convex with convexity parameter σ > 0 if ⟨∇f(x) − ∇f(y), x − y⟩ ≥ σ ‖x − y‖² holds for all x and y. The distance of a point x to a closed convex set C is denoted by dist(x, C) := inf_{c∈C} ‖x − c‖, and the projection of x onto C is denoted by proj_C(x) := argmin_{c∈C} ‖x − c‖.

II. PROBLEM DESCRIPTION AND MAIN RESULT

A. Regularized Jacobi algorithm
We consider the following optimization problem:

P : minimize f(x_1, . . . , x_m) + Σ_{i=1}^m g_i(x_i) over x = (x_1, . . . , x_m),

where x_i ∈ R^{n_i} denotes the decision vector of agent i. We define g(x) := Σ_{i=1}^m g_i(x_i), with dom g := dom g_1 × · · · × dom g_m, and the combined objective function in P as

h(x) := f(x) + g(x). (1)

Problems of the form P can be viewed as multi-agent optimization programs wherein each agent has its own local decision vector x_i and agents cooperate to determine a minimizer of h, which couples the local decision vectors of all agents through the common objective function f. Since the number of agents can be large, solving the problem in a centralized fashion may be computationally intensive. Moreover, even if this were possible from a computational point of view, agents may not be willing to share their local objectives g_i, i = 1, . . . , m, with other agents, since these may encode information about their local utility functions or constraint sets.
For each i = 1, . . . , m, we let f_i( · ; x_{-i}) : R^{n_i} → R be a function of the decision vector of the i-th block of variables, with the remaining variables x_{-i} ∈ R^{n−n_i} treated as a fixed set of parameters, i.e., f_i(x_i; x_{-i}) := f(x_1, . . . , x_m). We wish to solve P in a decentralized fashion using Algorithm 1. At the (k+1)-th iteration of Algorithm 1, agent i solves a local optimization problem accounting for its local function g_i and the function f_i with the parameter vector set to the decisions x_{-i}^k of the other agents from the previous iteration. Moreover, the local cost function includes an additional term that penalizes the squared distance between the optimization variable and its value x_i^k at the previous iteration. The relative importance of the original cost function and the penalty term is regulated by the weight c > 0, which should be selected large enough to guarantee convergence [12], [13]. We show in the Appendix that the fixed points of Algorithm 1 coincide with the optimal solutions of problem P.
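To make the parallel block update concrete, the following sketch implements a regularized Jacobi iteration of this form for an illustrative instance with f(x) = ½‖Ax − y‖², g_i(x_i) = λ|x_i| and scalar blocks; the instance and the closed-form block solve are our own assumptions for illustration, not taken from the paper.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def regularized_jacobi(A, y, lam, c, iters=500):
    """Regularized Jacobi sketch for min 0.5||Ax - y||^2 + lam*||x||_1
    with scalar blocks; each regularized block problem has a closed form."""
    m = A.shape[1]
    x = np.zeros(m)
    col_sq = np.sum(A * A, axis=0)            # a_i = ||A_i||^2
    for _ in range(iters):
        r = y - A @ x                         # residual at the current iterate
        x_new = np.empty(m)
        for i in range(m):
            # b_i = A_i^T (y - A_{-i} x_{-i}) = A_i^T r + a_i x_i
            b = A[:, i] @ r + col_sq[i] * x[i]
            # block update: argmin_z 0.5*a_i z^2 - b z + lam|z| + c (z - x_i)^2
            x_new[i] = soft_threshold(b + 2 * c * x[i], lam) / (col_sq[i] + 2 * c)
        x = x_new                             # parallel (Jacobi) update
    return x
```

For A equal to the identity, the instance is separable and the iteration can be checked against the known minimizer, the soft-thresholded data vector.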
A problem structure equivalent to P was considered in [13], with the difference that a collection of convex constraints x_i ∈ X_i for each i = 1, . . . , m was introduced instead of the functions g_i. We can rewrite this problem in the form of P by selecting each g_i to be the indicator function of the corresponding convex set. Conversely, problem P can be written in epigraph form, and thus reformulated in the framework of [13]. The reason that we use the problem structure of P is twofold. First, some widely used problems, such as ℓ1-regularized least squares, are typically posed in the form of P. Second, the absence of constraints eases the convergence analysis of Section III, since many results in the relevant literature use the same problem structure.

B. Statement of the main result
Before stating the main result we provide some necessary definitions and assumptions. Let h̄ denote the minimum value of P. We then have the following definition.

Definition 1 (Power-type growth): The function h satisfies the power-type growth condition if there exist constants γ > 0, r > 0 and p ≥ 1 such that

h(x) ≥ h̄ + γ dist(x, argmin h)^p for all x ∈ [h̄ ≤ h < h̄ + r]. (2)
It should be noted that (2) is a very mild condition, since it requires only that the function h is not excessively 'flat' in a neighborhood of the set argmin h. For instance, all polynomial, real-analytic and semi-algebraic functions satisfy this condition [14], [17]. We impose the following standing assumptions on problem P:

Assumption 1:
a) The function f is convex and differentiable.
b) The gradient ∇f is Lipschitz continuous on dom g with Lipschitz constant L, i.e., ‖∇f(x) − ∇f(y)‖ ≤ L ‖x − y‖ for all x, y ∈ dom g.
c) The functions g_i are all convex, lower semicontinuous and proper.
d) The function h is coercive.
e) The function h exhibits the power-type growth condition of Definition 1.

Notice that we do not require differentiability of the functions g_i. Coerciveness of h implies the existence of some ζ ∈ R for which the sublevel set [h ≤ ζ] is nonempty and bounded, which is sufficient to prove existence of a minimizer of h [18, Prop. 11.12 & Thm. 11.9].
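As a toy illustration of the power-type growth condition of Definition 1 (our own example, not from the paper), the scalar function h(x) = x⁴, with argmin h = {0} and minimum value 0, satisfies (2) with p = 4 and γ = 1 on any neighborhood of the origin, but fails quadratic growth (p = 2) near its minimizer:

```python
import numpy as np

xs = np.linspace(-1.0, 1.0, 1001)
h = xs**4                      # h(x) = x^4, min value 0, argmin = {0}
dist = np.abs(xs)              # dist(x, argmin h)

# power-type growth holds with p = 4, gamma = 1 ...
growth_p4 = np.all(h >= 1.0 * dist**4 - 1e-12)
# ... but quadratic growth (p = 2) fails near the origin,
# e.g. at x = 0.1: h = 1e-4 < 0.1 * dist^2 = 1e-3
growth_p2 = np.all(h >= 0.1 * dist**2 - 1e-12)
```

The function is thus covered by the analysis of this paper via a power growth exponent larger than two, even though it is not strongly convex.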
We are now in a position to state the main result of the paper.
Theorem 1: Suppose that Assumption 1 holds and that the regularization weight c is chosen according to condition (3). Then the iterates {x_k}_{k∈N} generated by Algorithm 1 converge to a minimizer of problem P, i.e., lim_{k→∞} x_k = x*, where x* is a minimizer of P.
The proof of Theorem 1 involves several intermediate statements and is provided in the next section.

III. PROOF OF THE MAIN RESULT
Many results on convergence of optimization algorithms establish only convergence in function value [2], [13], [19], without guaranteeing convergence of the iterates {x k } k∈N as well. Convergence of iterates is straightforward to show when h is strongly convex, or when {x k } k∈N is Fejér monotone with respect to argmin h, which is true whenever the operator underlying the iteration update is nonexpansive [18]. The latter condition was used in [13] to establish convergence of the sequence {x k } k∈N in the special case that f is a convex quadratic function.
In the single-agent case, i.e., when m = 1, Algorithm 1 reduces to the proximal minimization algorithm, whose associated fixed-point operator is nonexpansive for any convex, proper and closed function h. However, in the multi-agent setting the resulting fixed-point operator is not necessarily nonexpansive, which implies that the Fejér monotonicity based analysis cannot be employed to establish convergence of the sequence {x_k}_{k∈N}. To achieve this and prove Theorem 1 we exploit the following result, which follows directly from Theorem 14 in [14].
Theorem 2 ([14, Thm. 14]): Consider Assumption 1, with argmin h ≠ ∅ and h̄ := min h. Assume that the initial iterate x_0 of Algorithm 1 satisfies h(x_0) < h̄ + r, where r is as in Definition 1. Finally, assume that the iterates {x_k}_{k∈N} generated by Algorithm 1 possess the following properties:
1) Sufficient decrease condition: there exists a > 0 such that, for all k ∈ N,

h(x_k) − h(x_{k+1}) ≥ a ‖x_{k+1} − x_k‖². (4)

2) Relative error condition: there exists b > 0 such that, for all k ∈ N, some w_{k+1} ∈ ∂h(x_{k+1}) satisfies

‖w_{k+1}‖ ≤ b ‖x_{k+1} − x_k‖. (5)

Then the sequence {x_k}_{k∈N} converges to some x* ∈ argmin h, i.e., lim_{k→∞} x_k = x*, and for all k ≥ 1 the error bound (6) on ‖x_k − x*‖ holds. It should be noted that Theorem 2 constitutes a relaxed version of Theorem 14 in [14]. This is because we impose the power-type growth property as an assumption, which is in turn a sufficient condition for the satisfaction of the so-called Kurdyka-Łojasiewicz (KL) property¹ [11], [17]. Specifically, we could replace the last part of Assumption 1 with the KL property and the conclusion of Theorem 2 would remain valid.
Notice that, under the assumptions of Theorem 2, {x_k}_{k∈N} converges to some x* ∈ argmin h even if h(x_0) ≥ h̄ + r. Since {h(x_k)}_{k∈N} converges to h̄ (as a consequence of the sufficient decrease condition (4)), there exists some k_0 ∈ N such that h(x_{k_0}) < h̄ + r, and hence Theorem 2 remains valid if x_k is replaced by x_{k+k_0}.
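The sufficient decrease condition (4) can be observed numerically in the single-agent case (m = 1), where Algorithm 1 reduces to proximal minimization; there it holds with a = c, since the update minimizes h(z) + c(z − x_k)². The test function and the grid-search block solver below are our own illustrative assumptions:

```python
import numpy as np

def h(x):
    # convex, nonsmooth test function with minimizer x = 1.5
    return (x - 2.0)**2 + np.abs(x)

def prox_step(x, c):
    # x_{k+1} = argmin_z h(z) + c (z - x)^2, found by a fine grid search
    zs = np.linspace(-5.0, 5.0, 200001)
    return zs[np.argmin(h(zs) + c * (zs - x)**2)]

c, x = 1.0, -4.0
for _ in range(20):
    x_next = prox_step(x, c)
    # sufficient decrease with a = c: h(x_{k+1}) + c|x_{k+1}-x_k|^2 <= h(x_k)
    assert h(x_next) + c * (x_next - x)**2 <= h(x) + 1e-6
    x = x_next
```

The inequality holds by construction of the proximal update: the minimizing point cannot do worse than the candidate z = x_k, whose penalty term is zero.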
To prove Theorem 1 it suffices to show that, given Assumption 1, the iterates generated by Algorithm 1 satisfy the sufficient decrease condition and the relative error condition. To show this we first provide an auxiliary lemma.
Lemma 1: Under Assumption 1, a Lipschitz-type bound on the block-wise gradients of f holds for all x, y, z ∈ dom g.

¹This can be seen by choosing the desingularizing function ϕ that appears in the definition of the KL property [11], [17] as ϕ(s) = p (s/γ)^{1/p}.

We can then show that the sufficient decrease condition is satisfied.
Proposition 1 (Sufficient decrease condition): Under Assumption 1, if c is chosen according to (3), then Algorithm 1 converges in value to the minimum of problem P, i.e., h(x_k) → min h, and for all k the sufficient decrease condition (4) is satisfied for some a > 0 depending on c, L and m.

Proof: The result follows from [13, Theorem 2], with the Lipschitz constant established in Lemma 1.
Note that the proofs of Lemma 1 and Proposition 1 do not require the last part of Assumption 1 related to the power-type growth condition of h.
If c is chosen according to Theorem 1, then (4) implies that ‖x_{k+1} − x_k‖ → 0. To see this, suppose that x_0 ∈ dom h, so that h(x_0) is finite. Summing the inequality (4) over the first k iterations gives

a Σ_{j=0}^{k−1} ‖x_{j+1} − x_j‖² ≤ h(x_0) − h(x_k) ≤ h(x_0) − h̄ < ∞,

which means that ‖x_{k+1} − x_k‖ converges to zero. Note, however, that this does not necessarily imply convergence of the sequence {x_k}_{k∈N}.

Proposition 2 (Relative error condition): Consider Algorithm 1. Under Assumption 1, there exists w_{k+1} ∈ ∂h(x_{k+1}) such that the relative error condition (5) is satisfied for some b > 0.

Proof: The iterate x_{k+1} of Algorithm 1 is characterized via the subdifferential of the associated objective function, i.e., for each i = 1, . . . , m,

0 ∈ ∇f_i(x_i^{k+1}; x_{-i}^k) + ∂g_i(x_i^{k+1}) + 2c (x_i^{k+1} − x_i^k),

which ensures the existence of some v_{k+1} ∈ ∂g(x_{k+1}) such that

v_i^{k+1} = −∇f_i(x_i^{k+1}; x_{-i}^k) − 2c (x_i^{k+1} − x_i^k), i = 1, . . . , m.

Let us now define w_{k+1} := ∇f(x_{k+1}) + v_{k+1} ∈ ∂h(x_{k+1}). From the above equality, and using the identity ∇f_i(x_i; x_{-i}) = ∇_{x_i} f(x), we can bound the norm of w_{k+1} by the triangle inequality as

‖w_{k+1}‖ ≤ ‖∇f(x_{k+1}) − (∇f_i(x_i^{k+1}; x_{-i}^k))_{i=1}^m‖ + 2c ‖x_{k+1} − x_k‖.

Controlling the first term via the Lipschitz bound of Lemma 1, we obtain (5).

Propositions 1 and 2 show that the conditions of Theorem 2 are satisfied. As a direct consequence, the iterates generated by Algorithm 1 converge to some minimizer of P, thus concluding the proof of Theorem 1.

IV. CONVERGENCE RATE ANALYSIS
It is shown in [13] that if f is a strongly convex quadratic function and the g_i are indicator functions of convex compact sets, then Algorithm 1 converges linearly. We show in this section that Algorithm 1 converges linearly under much milder assumptions. In particular, if h has the quadratic growth property, i.e., if p in (2) is equal to 2, then Algorithm 1 admits a linear convergence rate. This property is employed in [20] to establish linear convergence of some first-order methods in a single-agent setting, and is, according to [15], [16], closely related to the error bound property, which was used in [21], [22] to establish linear convergence of feasible descent methods. Note that feasible descent methods are not applicable to problem P since we allow for nondifferentiable objective functions.
Theorem 3: Consider Assumption 1, and further assume that the power-type growth property is satisfied with p = 2. Let the initial iterate of Algorithm 1 be selected such that h(x_0) < h̄ + r, where r appears in Definition 1. Then the iterates {x_k}_{k∈N} converge to some x* ∈ argmin h at a linear rate: for all k ≥ 1, ‖x_k − x*‖ ≤ M_1 ρ^k, where ρ ∈ (0, 1) satisfies the per-iteration contraction (9) and M_1 > 0 is the constant given by (10).

Proof: The quadratic growth property and convexity of h, together with the relative error condition (5), imply the chain of inequalities (11), where w_{k+1} ∈ ∂h(x_{k+1}). Note that since h is lower semicontinuous, the set argmin h is closed and thus the projection onto argmin h is well defined. From the right-hand sides of the first and last inequality in (11) we obtain a first bound. Dividing the left-hand side of the first inequality and the right-hand side of the last inequality in (11) by γ dist(x_{k+1}, argmin h) > 0, we obtain a second bound. Substituting the latter into the former, and using the sufficient decrease condition (4) in the second step, we obtain a contraction inequality. Rearranging terms proves (9). Substituting this inequality into (6) yields (10), which concludes the proof.

A direct consequence of Theorem 3 is that Algorithm 1, with c selected as in Theorem 1, converges linearly when h satisfies the quadratic growth condition, i.e., condition (2) with p = 2. This is the case when f is strongly convex with convexity parameter σ_f, implying that argmin h is a singleton and h has the quadratic growth property with γ = σ_f/2 for any x ∈ dom h. It is shown in [22], [23] that if f(x) = v(Ex) + ⟨b, x⟩ has a Lipschitz continuous gradient, with v strongly convex and g an indicator function of a convex polyhedral set, then the problem exhibits the quadratic growth property. Note that if E does not have full column rank, then f is not strongly convex. In [14], [23] it is shown that a similar bound can be established for the ℓ1-regularized least-squares problem.
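The geometric decay asserted by Theorem 3 can be observed on a strongly convex instance; the quadratic f, the specific matrix A, the data y, and the choice c = 2 below are illustrative assumptions of ours (with g ≡ 0 and scalar blocks):

```python
import numpy as np

# Strongly convex illustrative instance: f(x) = 0.5||Ax - y||^2, g = 0.
A = np.eye(4) + 0.1 * np.ones((4, 4))        # well-conditioned, A^T A > 0
y = np.array([1.0, -2.0, 3.0, 0.5])
c = 2.0
x_star = np.linalg.solve(A.T @ A, A.T @ y)   # unique minimizer
col_sq = np.sum(A * A, axis=0)               # a_i = ||A_i||^2

x = np.zeros(4)
errs = []
for _ in range(60):
    r = y - A @ x
    # parallel (Jacobi) block update with quadratic regularization:
    # x_i <- argmin_z 0.5 a_i z^2 - b_i z + c (z - x_i)^2
    x = (A.T @ r + col_sq * x + 2 * c * x) / (col_sq + 2 * c)
    errs.append(np.linalg.norm(x - x_star))

# consecutive error ratios stay bounded away from 1: linear convergence
ratios = [errs[k + 1] / errs[k] for k in range(40)]
```

For this instance the iteration map is affine with a symmetric contraction matrix, so the per-step error ratio is bounded by its largest eigenvalue, strictly below one.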
Here, we adopt an approach from [14] and show that a similar result can be provided for more general problems in which g can be any polyhedral function. The core idea is to rewrite the problem in epigraph form for which such a property is shown to hold.
We impose the following assumption.

Assumption 2:
a) The function f is defined as f(x) = v(Ex) + ⟨b, x⟩, with v(·) being a strongly convex function with convexity parameter σ_v.
b) The component functions g_i are all globally non-negative convex polyhedral functions whose composite epigraph can be represented as

{(x, t) ∈ R^n × R | g(x) ≤ t} = {(x, t) | Cx + ct ≤ d},

where C ∈ R^{p×n}, c ∈ R^p and d ∈ R^p. Note that the inequality Cx + ct ≤ d should be understood component-wise.
The conditions of Assumption 2 are satisfied when f is quadratic and each g_i, i = 1, . . . , m, is an indicator function of a convex polyhedral set or any polyhedral norm. Note that the dual of a quadratic program satisfies this assumption. The Lipschitz constant of ∇f, which is required for computing the appropriate parameter c for Algorithm 1, can be upper bounded by ‖E‖² L_v, where ‖E‖ is the spectral norm of E and L_v is the Lipschitz constant of ∇v. We will now define the Hoffman constant, which is used in the subsequent analysis.

Lemma 2 (Hoffman constant, see e.g., [23]): Let X and Y be two polyhedra defined as X := {x ∈ R^n | Ax ≤ a} and Y := {x ∈ R^n | Ex = e}, where A ∈ R^{m×n}, a ∈ R^m, E ∈ R^{p×n}, e ∈ R^p, and assume that X ∩ Y ≠ ∅. Then there exists a constant θ = θ(A, E) such that any x ∈ X satisfies dist(x, X ∩ Y) ≤ θ ‖Ex − e‖. We refer to θ as the Hoffman constant associated with the matrices A and E.

Let x_0 be an initial iterate of the algorithm and let r = h(x_0). Since h is coercive, [h ≤ r] is a compact set and we can thus define the quantities D_r := max_{x,y ∈ [h≤r]} ‖x − y‖, D_E^r := max_{x,y ∈ [h≤r]} ‖Ex − Ey‖, and V_r := max_{x ∈ [h≤r]} ‖∇v(Ex)‖.
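A Hoffman-type bound of the form in Lemma 2 can be illustrated on a small example in R²; the particular polyhedra and the candidate constant θ = 1 below are our own assumptions, checked empirically rather than derived:

```python
import numpy as np

# Polyhedra in R^2: X = {x : x >= 0}, Y = {x : x1 + x2 = 1};
# their intersection is the segment between (1, 0) and (0, 1).
def dist_to_intersection(x):
    # nearest point on the segment {(t, 1 - t) : t in [0, 1]}
    t = np.clip((x[0] - x[1] + 1.0) / 2.0, 0.0, 1.0)
    return np.linalg.norm(x - np.array([t, 1.0 - t]))

# Hoffman-type bound: dist(x, X ∩ Y) <= theta * |x1 + x2 - 1| for x in X;
# theta = 1 suffices for this pair (candidate constant, verified below).
theta = 1.0
rng = np.random.default_rng(1)
for _ in range(1000):
    x = rng.uniform(0.0, 3.0, size=2)     # random point of X
    assert dist_to_intersection(x) <= theta * abs(x[0] + x[1] - 1.0) + 1e-9
```

The point x = (3, 0) shows the constant is tight here: its distance to the segment is 2, equal to the equality-constraint residual.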
Since Algorithm 1 generates a non-increasing sequence {h(x_k)}_{k∈N}, for all k we have x_k ∈ [h ≤ r] and, by convexity of f and the structure imposed in Assumption 2,

g(x_k) ≤ g(x_0) + V_r D_E^r + ‖b‖ D_r.

We conclude that argmin h ⊆ [h ≤ r] ⊂ [g ≤ R] for any fixed R > g(x_0) + V_r D_E^r + ‖b‖ D_r. For such a bound R, we consider the epigraph reformulation (13) of the problem, where x̃ = (x, t). It can be easily seen that x̃* = (x*, t*) minimizes (13) if and only if x* ∈ argmin h and t* = g(x*). Using [23, Lemma 2.5], we obtain (14), where D_E^R denotes the quantity defined as D_E^r with [h ≤ r] replaced by [g ≤ R].
Inequality (14) implies a bound on the distance of (x, t) to the minimizers of (13) for all x ∈ [g ≤ R] and all t ∈ [0, R]. Setting t = g(x), we then obtain the following result.

Lemma 3: Let r = h(x_0) and fix any R > g(x_0) + V_r D_E^r + ‖b‖ D_r. Under Assumptions 1 and 2, the function h satisfies the quadratic growth condition of Definition 1, i.e., (2) with p = 2, for all x ∈ [h ≤ r].

V. CONCLUSION
In this paper we revisited the regularized Jacobi algorithm proposed in [12], and enhanced its convergence properties. It was shown that the iterates generated by the algorithm converge to a minimizer of the centralized problem counterpart, provided that the objective function satisfies a power growth property. We also established linear convergence of the algorithm when the objective function satisfies this growth property with exponent p = 2, i.e., quadratic growth.

APPENDIX
In this section we show that the set of fixed points of Algorithm 1 coincides with the set of minimizers of problem P. The result follows from [13, §3]; however, the proof is modified to account for the presence of the nondifferentiable terms g_i, i = 1, . . . , m. We first recall the first-order optimality condition for a nondifferentiable convex function.

Proposition 3: Let φ be convex and differentiable and g be convex. Then x minimizes φ + g if and only if ⟨∇φ(x), z − x⟩ + g(z) − g(x) ≥ 0 for all z.

Similarly to [13], we define an operator T such that

T(x) := argmin_{z ∈ R^n} Σ_{i=1}^m ( f_i(z_i; x_{-i}) + g_i(z_i) + c ‖z_i − x_i‖² ), (15)

and operators T_i( · ; y_{-i}) such that

T_i(y_i; y_{-i}) := argmin_{z_i ∈ R^{n_i}} f_i(z_i; y_{-i}) + g_i(z_i) + c ‖z_i − y_i‖²,

where y_{-i} ∈ R^{n−n_i} is treated as a fixed parameter. Observe that we can characterize the operator T(x) via the operators T_i(x_i; x_{-i}) as follows: T(x) = (T_1(x_1; x_{-1}), . . . , T_m(x_m; x_{-m})).
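The fixed-point characterization can be checked numerically on a separable toy instance (the instance and the closed-form block operator below are our own illustration): with f(x) = ½‖x − y‖² and g_i = λ|·|, the minimizer of h is the soft-thresholded vector, and it is indeed a fixed point of T.

```python
import numpy as np

def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# Toy separable instance: h(x) = 0.5||x - y||^2 + lam * ||x||_1,
# whose minimizer is x* = soft(y, lam). Instance data are illustrative.
y = np.array([2.0, -0.3, 1.0])
lam, c = 0.5, 1.5
x_star = soft(y, lam)

def T(x):
    # T_i(x_i; x_{-i}) = argmin_z 0.5 (z - y_i)^2 + lam|z| + c (z - x_i)^2,
    # which here has the closed form soft(y_i + 2c x_i, lam) / (1 + 2c)
    return soft(y + 2.0 * c * x, lam) / (1.0 + 2.0 * c)
```

Points other than the minimizer are generally not fixed: applying T to the origin moves it toward x*.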
We define the sets of fixed points of these operators as Fix T := {x ∈ R^n | x = T(x)} and Fix T_i(y_{-i}) := {y_i ∈ R^{n_i} | y_i = T_i(y_i; y_{-i})}. Note that, in the spirit of [24, §5], we treat T as a single-valued function T : R^n → R^n, since the quadratic term on the right-hand side of (15) ensures that the minimizer is always unique; an identical comment applies to the operators T_i( · ; y_{-i}). We now show that the sets argmin h and Fix T coincide.

Proof: The proof is based on the proofs of Propositions 1-3 in [13]. We first show that argmin h ⊆ Fix T. Fix any x ∈ argmin h. If x minimizes h, then it is also a block-wise minimizer of h at x, i.e., for all i = 1, . . . , m, x_i minimizes f_i( · ; x_{-i}) + g_i. Since x_i minimizes both f_i( · ; x_{-i}) + g_i and c ‖( · ) − x_i‖², it is also the unique minimizer of their sum, i.e.
x_i = argmin_{z_i ∈ R^{n_i}} f_i(z_i; x_{-i}) + g_i(z_i) + c ‖z_i − x_i‖²,

implying that x_i ∈ Fix T_i(x_{-i}), and thus x = (x_1, . . . , x_m) is a fixed point of T(x) = (T_1(x_1; x_{-1}), . . . , T_m(x_m; x_{-m})). We now show that Fix T ⊆ argmin h. Let x ∈ Fix T, so that for all i = 1, . . . , m, x_i ∈ Fix T_i(x_{-i}), i.e., x_i minimizes f_i( · ; x_{-i}) + g_i + c ‖( · ) − x_i‖².
According to Proposition 3, the above condition means that for all z_i ∈ R^{n_i} we have

⟨∇f_i(x_i; x_{-i}), z_i − x_i⟩ + g_i(z_i) − g_i(x_i) ≥ 0, (16)

where the contribution of the quadratic penalty vanishes because its gradient at z_i = x_i is zero; by Proposition 3 again, this implies that x_i is a minimizer of f_i( · ; x_{-i}) + g_i. According to [25, Lemma 3.1], differentiability of f and component-wise separability of g imply that any x = (x_1, . . . , x_m) for which (16) holds for all i = 1, . . . , m is also a minimizer of f + g, i.e., x ∈ argmin h, thus concluding the proof.