Second-Order Optimality and Beyond: Characterization and Evaluation Complexity in Convexly Constrained Nonlinear Optimization

High-order optimality conditions for convexly constrained nonlinear optimization problems are analysed. A corresponding (expensive) measure of criticality for arbitrary order is proposed and extended to define high-order $\epsilon$-approximate critical points. This new measure is then used within a conceptual trust-region algorithm to show that if derivatives of the objective function up to order $q \ge 1$ can be evaluated and are Lipschitz continuous, then this algorithm applied to the convexly constrained problem needs at most $O(\epsilon^{-(q+1)})$ evaluations of $f$ and its derivatives to compute an $\epsilon$-approximate $q$th-order critical point. This provides the first evaluation complexity result for critical points of arbitrary order in nonlinear optimization. An example is discussed, showing that the obtained evaluation complexity bounds are essentially sharp.


Introduction
Given sequences $\{a_k\}$ and $\{b_k\}$ of scalars converging to zero, we say that $a_k = o(b_k)$ if and only if $\lim_{k\to\infty} a_k/b_k = 0$ and, more generally, $a(\alpha) = o(\alpha)$ if and only if $\lim_{\alpha\to 0} a(\alpha)/\alpha = 0$. The normal cone to a general convex set $C$ at $x \in C$ is defined by $N_C(x) \stackrel{\rm def}{=} \{ y \mid \langle y, z - x \rangle \le 0 \mbox{ for all } z \in C \}$ and its polar, the tangent cone to $C$ at $x$, by $T_C(x) \stackrel{\rm def}{=} N_C(x)^\circ$. Note that $C \subseteq x + T_C(x)$ for all $x \in C$. We also define $P_C[\cdot]$ to be the orthogonal projection onto $C$. (See [25, Section 3.5] for a brief introduction to the relevant properties of convex sets and cones, or [39, Chapter 3] or [50, Part I] for an in-depth treatment.)

Tensor Norms and Generalized Cauchy-Schwarz Inequality
We will make substantial use of tensors and their norms in what follows and thus start by establishing some concepts and notation. If the notation $T[v_1, \ldots, v_j]$ stands for the tensor of order $q - j$ resulting from the application of the $q$th-order tensor $T$ to the vectors $v_1, \ldots, v_j$, the (recursively induced 1 ) Euclidean norm $\|\cdot\|_q$ on the space of $q$th-order tensors is given by $\|T\|_q \stackrel{\rm def}{=} \max_{\|v_1\| = \cdots = \|v_q\| = 1} |T[v_1, \ldots, v_q]|$. That it is the norm recursively induced by the standard Euclidean norm results from a simple generalization of the standard Cauchy–Schwarz inequality for order-1 tensors (vectors) and of the bound $\|Mv\| \le \|M\|\,\|v\|$, which is valid for induced norms of matrices (order-2 tensors). Observe also that perturbation theory (see [40, Th. 7]) implies that $\|T\|_q$ is continuous as a function of $T$. If $T$ is a symmetric tensor of order $q$, define the $q$-kernel of the associated multilinear $q$-form as $\ker^q[T] \stackrel{\rm def}{=} \{ v \mid T[v]^q = 0 \}$ (see [12,13]). Note that, in general, $\ker^q[T]$ is a union of cones. Interestingly, $1$-kernels are not only unions of cones but also subspaces. However, this is not true for general $q$-kernels, since both $(0,1)^T$ and $(1,0)^T$ belong to the 2-kernel of the symmetric 2-form $x_1 x_2$ on $\mathbb{R}^2$, but their sum does not. We also note that, for symmetric tensors of odd order, $T[v]^q = -T[-v]^q$ and thus that
$- \min_{\|d\| \le 1} T[d]^q = \max_{\|d\| \le 1} T[d]^q,$ (2.3)
where we used the symmetry of the unit ball with respect to the origin to deduce the equality.
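As a small numerical illustration of these definitions (a sketch using NumPy; the matrix, tensor and sample sizes are arbitrary choices, not taken from the text), the following checks the order-2 instance $\|Mv\| \le \|M\|\,\|v\|$ of the generalized Cauchy–Schwarz inequality, and the odd-order sign symmetry underlying (2.3):

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(0)

# --- order-2 case: the recursively induced norm is the usual spectral norm ---
M = rng.standard_normal((4, 4))
v = rng.standard_normal(4)
spec = np.linalg.norm(M, 2)              # ||M|| = largest singular value
assert np.linalg.norm(M @ v) <= spec * np.linalg.norm(v) + 1e-12

# --- order-3 case: symmetrize a random tensor over all index permutations ---
A = rng.standard_normal((3, 3, 3))
T = sum(np.transpose(A, p) for p in permutations(range(3))) / 6.0

def form(T, d):
    """Evaluate the cubic form T[d]^3 = T[d, d, d]."""
    return np.einsum('ijk,i,j,k->', T, d, d, d)

# Odd-order symmetry: T[d]^3 = -T[-d]^3, hence on the sign-symmetric
# sample set {±d_1, ..., ±d_m} the maximum equals minus the minimum,
# mirroring (2.3) on the unit ball.
ds = [d / np.linalg.norm(d) for d in rng.standard_normal((50, 3))]
vals = [form(T, s * d) for d in ds for s in (+1.0, -1.0)]
assert abs(max(vals) + min(vals)) < 1e-12
```

The sampling only explores finitely many directions, so it illustrates rather than proves the inequalities; the sign symmetry, however, holds exactly on every sign-symmetric sample set.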

High-Order Error Bounds from Taylor Series
The tensors considered in what follows are symmetric and arise as high-order derivatives of the objective function $f$. For the $p$th derivative of a function $f : \mathbb{R}^n \to \mathbb{R}$ to be Lipschitz continuous on the set $S \subseteq \mathbb{R}^n$, we require that there exists a constant $L_{f,p} \ge 0$ such that, for all $x, y \in S$,
$\| \nabla^p_x f(x) - \nabla^p_x f(y) \|_p \le L_{f,p} \, \|x - y\|,$ (2.4)
where $\nabla^p_x f(x)$ is the $p$th-order symmetric derivative tensor of $f$ at $x$.
Let $T_{f,p}(x,s)$ denote 2 the $p$th-order Taylor-series approximation to $f(x+s)$ at some $x \in \mathbb{R}^n$, given by
$T_{f,p}(x,s) \stackrel{\rm def}{=} f(x) + \sum_{i=1}^{p} \frac{1}{i!} \nabla^i_x f(x)[s]^i,$ (2.5)
and consider the Taylor identity involving a given univariate $C^k$ function $\varphi(\alpha)$ and its $k$th-order Taylor approximation $t_k(\alpha) = \sum_{i=0}^{k} \varphi^{(i)}(0)\alpha^i/i!$, expressed in terms of the derivatives $\varphi^{(i)}$, $i = 1, \ldots, k$. Let $x, s \in \mathbb{R}^n$. Then, picking $\varphi(\alpha) = f(x + \alpha s)$ and $k = p$, it follows immediately from the fact that $t_p(1) = T_{f,p}(x,s)$ that the identity (2.2), (2.4), (2.5) and (2.6) imply, for all $x, s \in \mathbb{R}^n$,
$|f(x+s) - T_{f,p}(x,s)| \le \frac{L_{f,p}}{(p+1)!} \, \|s\|^{p+1}.$ (2.8)
Similarly,
$\| \nabla^1_x f(x+s) - \nabla^1_s T_{f,p}(x,s) \| \le \frac{L_{f,p}}{p!} \, \|s\|^{p}.$ (2.9)
Inequalities (2.8) and (2.9) will be useful in our developments below, but we immediately note that they in fact depend only on the weaker requirement (2.10) that the Lipschitz condition holds for all $x$ and $s$ of interest, rather than relying on (2.4) globally.
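For intuition, the error bound (2.8) can be checked numerically on a simple univariate function. The sketch below (an illustration, not part of the original text) uses $f = \cos$, all of whose derivatives are bounded by one, so that its second derivative is Lipschitz with constant $L_{f,2} = 1$ and (2.8) reads $|f(x+s) - T_{f,2}(x,s)| \le |s|^3/3!$:

```python
import numpy as np

# f = cos: second derivative is Lipschitz with L = 1, so (2.8) with p = 2 reads
#   |cos(x+s) - T_{f,2}(x,s)| <= |s|^3 / 3!.
x = 0.3
f0, g0, h0 = np.cos(x), -np.sin(x), -np.cos(x)

def taylor2(s):
    """Second-order Taylor model T_{f,2}(x, s) of cos at x."""
    return f0 + g0 * s + 0.5 * h0 * s**2

ss = np.linspace(-1.0, 1.0, 201)
err = np.abs(np.cos(x + ss) - taylor2(ss))
bound = np.abs(ss)**3 / 6.0
assert np.all(err <= bound + 1e-12)
```

The grid check confirms the Lagrange-remainder form of (2.8) on $[-1, 1]$; the constant $1/3!$ cannot be improved in general.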

Unconstrained and Convexly Constrained Problems
The problem we wish to solve is formally described as
$\min_{x \in F} f(x),$ (3.1)
where we assume that $f: \mathbb{R}^n \to \mathbb{R}$ is $q$ times continuously differentiable and bounded from below, and that $f$ has Lipschitz continuous derivatives of orders 1 to $q$. We also assume that the feasible set $F$ is closed, convex and non-empty. Note that this formulation covers unconstrained optimization ($F = \mathbb{R}^n$), as well as standard inequality (and linear equality) constrained optimization in its different forms: the set $F$ may be defined by simple bounds, and/or by polyhedral or more general convex constraints. We are tacitly assuming here that the cost of evaluating values and derivatives of the constraint functions possibly involved in the definition of $F$ is negligible.

High-Order Optimality Conditions
Given that our ambition is to work with high-order models, it seems natural to aim at finding high-order local minimizers. As is standard, we say that $x_*$ is a local minimizer of $f$ if and only if there exists a (sufficiently small) neighbourhood $B_*$ of $x_*$ such that
$f(x) \ge f(x_*)$ for all $x \in B_* \cap F$. (3.2)
However, we must immediately remember important intrinsic limitations. These are exemplified by the smooth two-dimensional problem (3.3), which is a simplified version of a problem stated by Hancock nearly a century ago [38, p. 36], itself a variation of a famous problem stated even earlier by Peano [49]. The contour lines of its objective function are shown in Fig. 1.
The first conclusion which can be drawn by examining this example is that, in general, assessing that a given point $x$ (the origin in this case) is a local minimizer requires more than verifying that every direction from this point is an ascent direction. Indeed, this latter property holds in the example, but the origin is not a local minimizer (it is a saddle point). This is caused by the fact that objective-function decrease may occur along specific arcs starting from the point under consideration, and these arcs need not be lines (such as $x(\alpha) = \alpha e_2 + \frac12 e^{-1/(2\alpha^2)} e_1$ for $\alpha \ge 0$ in the example). The second conclusion is that the characterization of a local minimizer cannot always be translated into a set of conditions only involving the Taylor expansion of $f$ at $x_*$. In our example, the difficulty arises because the coefficients of the Taylor expansion of $e^{-1/x_1^2}$ all vanish as $x_1$ approaches the origin and, therefore, the (non-)minimizing nature of this point cannot be determined from the values of these coefficients. Thus, the gap between necessary and sufficient optimality conditions cannot be closed if one restricts one's attention to using derivatives of the objective function at a putative solution of problem (3.1).
Note that worse situations may also occur, for instance if we consider the variation (3.4) on Hancock's simplified example (3.3), for which no continuous descent arc exists in a neighbourhood of the origin despite the origin not being a local minimizer.

Necessary Conditions for Convexly Constrained Problems
The above examples show that fully characterizing a local minimizer in terms of general continuous descent arcs is in general impossible. However, the fact that no such arc exists remains a necessary condition for such points, even if Hancock's example shows that these arcs may not be amenable to a characterization using arc derivatives. In what follows, we therefore propose derivative-based necessary optimality conditions by focussing on a specific (yet reasonably general) class of descent arcs $x(\alpha)$ of the form
$x(\alpha) = x + \sum_{i=1}^{q} \alpha^i s_i + o(\alpha^q),$ (3.5)
where $\alpha > 0$. Such an arc-based approach was used by several authors for first- and second-order conditions (see [4,10,24,33] for example). Note that, if $s_{i_0}$ is the first nonzero $s_i$ in the sum in the right-hand side of (3.5) (if any), we may redefine $\alpha$ to be $\alpha \|s_{i_0}\|^{1/i_0}$ without modifying the arc, so that we may assume, without loss of generality, that $\|s_{i_0}\| = 1$ whenever $(s_1, \ldots, s_q) \ne (0, \ldots, 0)$.
Define the $q$th-order descriptor set $D^q_F(x)$ of $F$ at $x$ as the set of tuples $(s_1, \ldots, s_q)$ for which an arc of the form (3.5) remains feasible for all sufficiently small $\alpha$ (3.6). Note that $D^q_F(x)$ is closed and always contains $(0, \ldots, 0)$, and that it recovers the inner second-order tangent set to $F$ at $x$, as defined in [10]. 3 We say that a feasible arc $x(\alpha)$ is tangent to $D^q_F(x)$ if (3.5) holds for some $(s_1, \ldots, s_q) \in D^q_F(x)$. Note that definition (3.6) implies that $s_u \in T_F(x)$, where $s_u$ is the first nonzero $s_i$. We now consider some conditions that preclude the existence of feasible descent arcs of the form (3.5). These conditions involve the index sets $P(j,k)$ defined, for $k \le j$, by
$P(j,k) \stackrel{\rm def}{=} \bigl\{ (\ell_1, \ldots, \ell_k) \mid \ell_i \ge 1 \mbox{ for } i = 1, \ldots, k \mbox{ and } \textstyle\sum_{i=1}^{k} \ell_i = j \bigr\}.$ (3.8)
For $k \le j \le 4$, these are given in Table 1. 3 It would be possible to generalize the approach of [10] and define the inner $j$th-order tangent set ($j > 1$) in terms of $(s_1, \ldots, s_{j-1})$, but we prefer the equivalent (3.6) for notational convenience.
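Assuming, consistently with the instances quoted later in the text (e.g. $P(2,1) = \{(2)\}$ and the permutations of $(1,2,2,2)$ inside $P(7,4)$), that $P(j,k)$ collects the ordered $k$-tuples of positive integers summing to $j$, these index sets are easy to enumerate; the following sketch reproduces the small cases of Table 1:

```python
from itertools import product

def P(j, k):
    """Index sets of (3.8): ordered k-tuples of positive integers summing to j."""
    return [t for t in product(range(1, j - k + 2), repeat=k) if sum(t) == j]

# Small cases, matching the instances quoted in the text:
assert P(2, 1) == [(2,)]
assert P(2, 2) == [(1, 1)]
assert P(4, 2) == [(1, 3), (2, 2), (3, 1)]

# Permutations of (1, 2, 2, 2) indeed lie in P(7, 4):
assert (1, 2, 2, 2) in P(7, 4) and (2, 2, 2, 1) in P(7, 4)
# P(7, 6) consists exactly of the 6 permutations of (1, 1, 1, 1, 1, 2):
assert len(P(7, 6)) == 6
```

The enumeration is exponential in $k$, which is harmless for the small orders ($j \le 7$) discussed here.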
Proof Consider an arbitrary feasible arc of the form (3.5). Substituting this relation in the inequality $f(x(\alpha)) \ge f(x_*)$ (given by (3.2)) and collecting terms of equal degree in $\alpha$, we obtain that, for sufficiently small $\alpha$,
$f(x(\alpha)) = f(x_*) + \sum_{j=1}^{q} c_j \alpha^j + o(\alpha^q),$ (3.11)
where
$c_j = \sum_{k=1}^{j} \frac{1}{k!} \sum_{(\ell_1, \ldots, \ell_k) \in P(j,k)} \nabla^k_x f(x_*)[s_{\ell_1}, \ldots, s_{\ell_k}],$ (3.12)
with $P(j,k)$ defined in (3.8). For this to be true, we need each coefficient of $\alpha^j$ to be non-negative on the zero set of the coefficients of $\alpha^1, \ldots, \alpha^{j-1}$, subject to the requirement that the arc (3.5) must be feasible for $\alpha$ sufficiently small, that is $(s_1, \ldots, s_j) \in D^j_F(x_*)$. First consider the case where $j = 1$ (in which case (3.10) is void). The fact that the coefficient of $\alpha$ in (3.11) must be non-negative implies that $\nabla^1_x f(x_*)[s_1] \ge 0$ for all $s_1 \in T_F(x_*)$, which proves (3.9) for $j = 1$. Assume now that $s_1 \in T_F(x_*)$ and that (3.10) holds for $i = 1$. This latter condition requires $s_1$ to be in the zero set of the coefficient of $\alpha$ in (3.11), that is
$\nabla^1_x f(x_*)[s_1] = 0.$
Then the coefficient of $\alpha^2$ in (3.11) must be non-negative, which yields, using $P(2,1) = \{(2)\}$ and $P(2,2) = \{(1,1)\}$ (see Table 1), that
$\nabla^1_x f(x_*)[s_2] + \tfrac12 \nabla^2_x f(x_*)[s_1]^2 \ge 0,$
which is (3.9) for $j = 2$.
We may then proceed in the same manner for all coefficients up from order j = 3 to q, each time considering them in the zero set of the previous coefficients (that is (3.10)), and verify that (3.11) directly implies (3.9).
Following a long tradition, we say that $x_*$ is a $q$th-order critical point for problem (3.1) if the conclusions of this theorem hold for $j \in \{1, \ldots, q\}$. Of course, a $q$th-order critical point need not be a local minimizer, but every local minimizer is a $q$th-order critical point. This theorem states conditions for $q$th-order criticality for smooth problems which are only necessary because not every feasible arc needs to be tangent to $D^q_F(x_*)$, depending on the geometry of the feasible set in the neighbourhood of $x_*$.
Note that, as the order $j$ grows, (3.9) may be interpreted as imposing a condition on $s_j$ (via the term $\nabla^1_x f(x_*)[s_j]$) for tuples $(s_1, \ldots, s_{j-1})$ satisfying (3.10). In more general situations, the fact that conditions (3.9) and (3.10) not only depend on the behaviour of the objective function in some well-chosen subspace, but involve the geometry of all possible feasible arcs, makes the second-order condition (3.13) difficult to use, particularly in the case where $F \subset \mathbb{R}^n$. In what follows we discuss, as far as we currently can, two resulting questions of interest.
We start by deriving useful consequences of Theorem 3.1.

Corollary 3.2 Suppose that the assumptions of Theorem 3.1 hold. Then
$-\nabla^1_x f(x_*) \in N_F(x_*)$ (3.14)
and
$s_i \in T_F(x_*)$ for $i = 1, 2$. (3.15)
Proof The first-order condition (3.9) states that $\nabla^1_x f(x_*)[s_1] \ge 0$ for all $s_1 \in T_F(x_*)$, and hence (3.14) holds. Also note that (3.9) and (3.10) impose that $\nabla^1_x f(x_*)[s_1] = 0$, which, because of (3.14) and the polarity of $N_F(x_*)$ and $T_F(x_*)$, yields that $s_1$ belongs to $\partial T_F(x_*)$. Assume now that $s_2 \notin T_F(x_*)$. Then, for all $\alpha$ sufficiently small, $\alpha s_1 + \alpha^2 s_2$ does not belong to $T_F(x_*)$ and thus $x(\alpha) = x_* + \alpha s_1 + \alpha^2 s_2 + o(\alpha^2)$ cannot belong to $F$, which is a contradiction. Hence, $s_2 \in T_F(x_*)$ and (3.15) follows for $i = 2$, while it follows from $s_1 \in T_F(x_*)$ and (3.14) for $i = 1$.
Consider now the second-order conditions (3.13). If $F = \mathbb{R}^n$ (or if the convex constraints are inactive at $x_*$), then $\nabla^1_x f(x_*) = 0$ because of (3.14), and (3.13) is nothing but the familiar condition that the Hessian of the objective function must be positive semi-definite. If $x_*$ happens to lie on the boundary of $F$ and $\nabla^1_x f(x_*) \ne 0$, (3.13) indicates that the effect of the curvature of the boundary of $F$ may be represented by the term $\nabla^1_x f(x_*)[s_2]$, which is non-negative because of (3.14) and (3.15). Consider, for example, a problem $\min_{x \in F \subset \mathbb{R}^2} f(x)$ whose global solution is at the origin. In this case, it is easy to check that $\nabla^1_x f(0) \ne 0$, and that second-order feasible arcs of the form (3.5) with $x(0) = 0$ may be chosen with $s_1 = \pm e_2$ and $s_2 = \beta e_1$ for a suitable $\beta$. Interestingly, there are cases where the geometry of the set of locally feasible arcs is simple and manageable. In particular, suppose that the boundary of $F$ is locally polyhedral. Then either the subspace of directions satisfying (3.10) reduces to the origin, in which case conditions (3.9) and (3.10) are void, or there exists $d \ne 0$ in that subspace. It is then possible to define a locally feasible arc with $s_1 = d$ and $s_2 = \cdots = s_q = 0$. As a consequence, the smallest possible value of $\nabla^1_x f(x_*)[s_j]$ for feasible arcs starting from $x_*$ is identically zero, and this term therefore vanishes from (3.9) to (3.10). Moreover, because of the definition of $P(j,k)$ (see Table 1), all terms but that in $\nabla^j_x f(x_*)[s_1]^j$ also vanish from these conditions, which then simplify to
$\nabla^j_x f(x_*)[s_1]^j \ge 0$
for $j = 2, \ldots, q$, a condition only involving subspaces and (for $i \ge 2$) cones. Analysis for first and second orders in the polyhedral case can be found in [2,30,52] for instance. Further discussion of second-order (both necessary and sufficient) conditions for the more general problem can be found in [10] and the references therein.

Necessary Conditions for Unconstrained Problems
Consider now the case where $x_*$ belongs to the interior of $F$, which is obviously the case if the problem is unconstrained. Then we have that $D^q_F(x_*) = \mathbb{R}^{n \times q}$, and one is then free to choose the vectors $\{s_i\}_{i=1}^{q}$ (and their sign) arbitrarily. Note first that, since $N_F(x_*) = \{0\}$, (3.14) implies that, unsurprisingly,
$\nabla^1_x f(x_*) = 0.$
For the second-order condition, we obtain from (3.9), again unsurprisingly, that $\nabla^2_x f(x_*)$ must be positive semi-definite, because both $s_1$ and $-s_1$ may be considered. Thus, the term for $k = 1$ vanishes from (3.9), as well as all terms involving $\nabla^2_x f(x_*)$ applied to vectors of its kernel. This implies in particular that the third-order condition may now be written as
$\nabla^3_x f(x_*)[s_1]^3 = 0$ for all $s_1 \in \ker[\nabla^2_x f(x_*)]$,
where the equality is obtained by considering both $s_1$ and $-s_1$. Unfortunately, complications arise with fourth-order conditions, even when the objective function is a polynomial. Consider the variant (3.19) of Peano's problem [49], where $\kappa_1$ and $\kappa_2$ are parameters. The necessary condition (3.9) then states that, if the origin is a minimizer, then, using the arc defined by $s_1 = e_1$ and $s_2 = \frac12 \kappa_1 e_2$ and the fact that $P(4,3)$ contains three elements, a mixed fourth-order inequality must hold. This shows that this condition, although necessary, is arbitrarily far away from the weaker necessary condition $\nabla^4_x f(x_*)[s_1]^4 \ge 0$ when $\kappa_1$ grows. As was already the case for problem (3.3), the example with $\kappa_1 = 1$ and $\kappa_2 = 2$, say, shows that a function may admit a saddle point ($x_* = 0$) which is a maximum along an arc ($x_2 = \frac32 x_1^2$ in this case) while at the same time being minimal along every line passing through $x_*$. Figure 2 shows the contour lines of the objective function of (3.19) for increasing values of $\kappa_2$, keeping $\kappa_1 = 3$.
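The geometry just described can be verified numerically. The sketch below uses the classical Peano function $(x_2 - x_1^2)(x_2 - 2x_1^2)$, which we assume corresponds to (3.19) with $\kappa_1 = 1$ and $\kappa_2 = 2$ (the product form is our assumption, not quoted from the text): the origin is minimal along every line through it, yet a strict maximum along the arc $x_2 = \frac32 x_1^2$.

```python
import numpy as np

# Classical Peano function; we assume it matches (3.19) with
# kappa_1 = 1 and kappa_2 = 2.
def f(x1, x2):
    return (x2 - x1**2) * (x2 - 2.0 * x1**2)

# Along every sampled line x = t * (s1, s2) through the origin,
# f is nonnegative near t = 0 ...
for s in [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, -0.5), (2.0, 0.3)]:
    t = np.linspace(-1e-2, 1e-2, 401)
    assert np.all(f(t * s[0], t * s[1]) >= 0.0)

# ... yet along the arc x2 = (3/2) x1^2 the origin is a strict maximum:
x1 = np.linspace(-1e-2, 1e-2, 401)
arc = f(x1, 1.5 * x1**2)          # equals -x1**4 / 4 analytically
assert np.all(arc[x1 != 0] < 0.0)
```

Only a few representative lines are sampled; the line check can be made rigorous analytically, since $f(ts_1, ts_2) = t^2 s_2^2 + O(t^3)$ whenever $s_2 \ne 0$ and $f(t, 0) = 2t^4$.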
One may attribute the problem that not every term in (3.9) vanishes to the fact that switching the signs of $s_1$ or $s_2$ does not imply that any of the terms in (3.20) is zero (as we have verified), because of the mixed terms involving $\nabla^2_x f(x_*)$. Is this a feature of even orders only? Unfortunately, this is not the case, as can already be seen for $q = 7$. Indeed, it is not difficult to verify that the terms whose multi-index $(\ell_1, \ldots, \ell_k)$ is a permutation of $(1,2,2,2)$ belong to $P(7,4)$ and those whose multi-index is a permutation of $(1,1,1,1,1,2)$ belong to $P(7,6)$. Moreover, the contributions of these terms to the sum (3.9) cannot be distinguished by varying $s_1$ or $s_2$, for instance by switching their signs, as this technique yields only one equality in two unknowns. In general, we may therefore conclude that (3.9) must involve a mixture of terms with derivative tensors of various degrees.

Sufficient Conditions for Isolated Local Minimizers
Despite the limitations we have seen when considering the simplified Hancock example, we may still derive a sufficient condition for $x_*$ to be an isolated minimizer, inspired by the standard second-order case (see Theorem 2.4 in Nocedal and Wright [48] for instance). This condition involves a constraint qualification, in that the feasible set in a neighbourhood of $x_*$ must be completely described by the arcs of the form (3.5) for small $\alpha$.
Proof Consider any $\delta_2 \in (0, \delta]$ and, using the fact that $F \ne \{x_*\}$, an arbitrary $y \in F \cap \partial B(x_*, \delta_2) \subseteq A^q_F(x_*, \delta)$, where we used (3.21) to obtain the last inclusion. Thus, there exists at least one arc $x(\alpha)$ of the form (3.5) which is tangent to $D^q_F(x_*)$ (with associated nonzero $(s_1, \ldots, s_q)$) and a smallest $\alpha_y \ge 0$ such that $x(\alpha_y) = y$. For any such arc, let $m$ be the smallest integer such that $c_m \ne 0$, where $c_j$ is defined by (3.12). The relations (3.9), (3.10) and (3.22) then imply that
$c_m > 0,$ (3.23)
and (3.22) also ensures that $m \in \{1, \ldots, q\}$. Now choose such an arc $x(\alpha)$ with maximal $m$. From Taylor's theorem and using (3.11) to obtain the form of the derivatives along the arc $x(\alpha)$, we obtain an expansion of $f(y) - f(x_*)$ with leading term $c_m \alpha_y^m$, for some $\tau \in [0,1]$, where we used our assumption that $c_j = 0$ for $j = 1, \ldots, m-1$ to deduce the second equality. Observe that $\|x(\tau \alpha_y) - x_*\| \le \delta_2$ because $\alpha_y$ is the smallest $\alpha$ such that $\|x(\alpha) - x_*\| = \delta_2$. Hence, we may choose $\delta_2$ small enough to ensure, by continuity together with (3.12), (3.23) and (3.24), that $f(y) > f(x_*)$. This proves the theorem since $y$ was chosen arbitrarily in a sufficiently small feasible neighbourhood of $x_*$.
Note that the condition $F \ne \{x_*\}$ may be viewed as a form of Slater condition, and also that $x_*$ is obviously an isolated local minimizer if it fails.
If we now return to our examples, we see that Theorem 3.3 does not certify the origin as a local minimizer of, for example, (3.19) with $\kappa_1 = 1$ and $\kappa_2 = 2$, since the arc $x_2 = \frac32 x_1^2$ must be considered in (3.21). The origin is not a local minimizer for either problem (3.3) or (3.4), and accordingly (3.22) fails for every $q$ because the Taylor series of $f$ is identically zero along the first coordinate axis (which defines two admissible arcs $x(\alpha) = \pm \alpha e_1$).
Of course, the assumptions of Theorem 3.3 may be difficult to check in general, but they may be tractable in some cases. Assume for instance that $F$ is polyhedral. Then, for sufficiently small $\delta$, $A^q_F(x_*, \delta) \subset T_F(x_*)$ and we may use half-lines originating at $x_*$ to define feasible arcs. This is the inspiration for the following less general but easier-to-verify alternative to Theorem 3.3.

Theorem 3.4 Suppose that $f$ is $p$ times continuously differentiable in an open neighbourhood of $x_* \in F$. If there exists a $q \in \{1, \ldots, p\}$ such that condition (3.25) holds for all nonzero $s \in T_F(x_*)$, then $x_*$ is an isolated local minimizer for problem (3.1).
Proof If $T_F(x_*)$ is reduced to the origin, then the inclusion $F \subseteq x_* + T_F(x_*)$ implies that $F = \{x_*\}$ and $x_*$ is therefore an isolated minimizer. Let us therefore assume that there exists a nonzero $s \in T_F(x_*)$. The second part of condition (3.25) and the continuity of the $(q+1)$th derivative then imply that (3.26) holds for all $z$ in a sufficiently small feasible neighbourhood of $x_*$. Now, using Taylor's expansion, we obtain an expression for $f(x_* + \tau s)$, valid for all $s \in T_F(x_*)$ and all $\tau \in (0, 1)$. If $\tau$ is sufficiently small, then this equality, the first part of (3.25) and (3.26) ensure that $f(x_* + \tau s) > f(x_*)$. Since this strict inequality holds for an arbitrary nonzero $s \in T_F(x_*) \supseteq F - x_*$ and all $\tau$ sufficiently small, $x_*$ must be an isolated feasible minimizer.
Observe that, in Peano's example (see (3.19) with $\kappa_1 = 3$ and $\kappa_2 = 2$), the curvature of the objective function is positive along every line passing through the origin, but the order of this curvature varies with $s$ (second order along $s = e_2$ and fourth order along $s = e_1$), which precludes applying Theorem 3.4. Also note that, when $q = 2$, weaker sufficient conditions (exploiting the structure of $D^2_F(x_*)$ to a larger extent) are known for several classes of problems, including semi-definite optimization (see [10] for details).

An Approach Using Taylor Models
As already noted, the conditions expressed in Theorem 3.1 may, in general, be very complicated to verify in an algorithm, due to their dependence on the geometry of the set of feasible arcs. To avoid this difficulty, we now explore a different approach. Let the symbol "globmin" represent global minimization and define, for some $\delta \in (0, 1]$ and some $j \in \{1, \ldots, p\}$,
$\phi^{\delta}_{f,j}(x) \stackrel{\rm def}{=} f(x) - \mathop{\rm globmin}_{x + d \in F, \, \|d\| \le \delta} T_{f,j}(x, d),$ (3.27)
the largest decrease of the $j$th-order Taylor model $T_{f,j}(x, d)$ achievable by a feasible point at distance at most $\delta$ from $x$. Note that $\phi^{\delta}_{f,j}(x)$ is a continuous function of $x$ and $\delta$ for given $F$ and $f$ (see [40, Th. 7]). The introduction of this quantity is in part motivated by the following theorem.

Theorem 3.5 Suppose that $f$ is $q$ times continuously differentiable in an open neighbourhood of $x \in F$. Then, for each $j \in \{1, \ldots, q\}$, the $j$th-order necessary condition (3.9) holds at $x$ whenever the corresponding limit in (3.33) vanishes.

Proof We start by rewriting the power series (3.11) to degree $j$, for any given arc of the form (3.5), as
$T_{f,j}(x, s(\alpha)) = f(x) + \sum_{i=1}^{j} c_i \alpha^i + o(\alpha^j),$ (3.28)
where $s(\alpha) \stackrel{\rm def}{=} x(\alpha) - x$, $c_i$ is defined by (3.12), and where the last equality holds because $f$ and $T_{f,j}$ share their first $j$ derivatives at $x$. This reformulation allows us to write the derivative identities (3.29) for $i \in \{1, \ldots, j\}$. Assume now that there exists an $(s_1, \ldots, s_j) \in Z^{f,j}_F(x)$ (the subset of $D^j_F(x)$ satisfying (3.10)) such that (3.9) does not hold. In the notation just introduced, this means that, for this particular $(s_1, \ldots, s_j)$, $c_j < 0$ (3.30). Then, from (3.29), the first $(j-1)$ coefficients of the polynomial $T_{f,j}(x, s(\alpha)) - f(x)$ vanish. Thus, using (3.28),
$T_{f,j}(x, s(\alpha)) - f(x) = c_j \alpha^j + o(\alpha^j).$ (3.31)
Now let $i_0$ be the index of the first nonzero $s_i$. Note that $i_0 \in \{1, \ldots, j\}$, since otherwise the structure of the sets $P(i,k)$ implies that $c_j = 0$. Observe also that we may redefine the parameter $\alpha$ as $\alpha \|s_{i_0}\|^{1/i_0}$, so that we may assume, without loss of generality, that $\|s_{i_0}\| = 1$. As a consequence, we obtain that, for sufficiently small $\alpha$, (3.32) holds. Hence, successively using the facts that $c_j < 0$, that (3.29) and (3.31) hold for all arcs $x(\alpha)$ tangent to $D^q_F(x)$, and that (3.32) and (3.27) hold, we may deduce that the limit in (3.33) cannot vanish, a contradiction. The conclusion of the theorem immediately follows.

This theorem has a useful consequence.

Corollary 3.6 Suppose, in addition, that the limit in (3.33) vanishes for each $j \in \{1, \ldots, q\}$. Then $x$ is a $q$th-order critical point.

Proof We successively apply Theorem 3.5 $q$ times and deduce that $x$ is a $j$th-order critical point for $j = 1, \ldots, q$.
This last result says that we may avoid the difficulty of dealing with the possibly complicated geometry of $D^q_F(x)$ if we are ready to perform the global optimization occurring in (3.27) exactly and to find a way to compute or overestimate the limit in (3.33). Although this is a positive conclusion, these two remaining challenges are daunting. However, it is worthwhile noting that the standard way of computing first-, second- and third-order criticality measures for unconstrained problems follows exactly this approach. In the first-order case, it is easy to verify that
$\phi^{\delta}_{f,1}(x) = \max_{\|d\| \le \delta} -\nabla^1_x f(x)[d] = \delta \, \|\nabla^1_x f(x)\|,$
where the first equality is justified by the convexity of $\nabla^1_x f(x)[d]$ as a function of $d$. Because the ratio $\phi^{\delta}_{f,1}(x)/\delta$ is independent of $\delta$, the computation of the limit (3.33) for $\delta$ tending to zero is trivial when $j = 1$, and the limiting value is $\|\nabla^1_x f(x)\|$. For the second-order case, assuming $\nabla^1_x f(x) = 0$,
$\phi^{\delta}_{f,2}(x) = \max_{\|d\| \le \delta} -\tfrac12 \nabla^2_x f(x)[d]^2 = \tfrac12 \delta^2 \max\bigl[0, -\lambda_{\min}(\nabla^2_x f(x))\bigr],$
the first global optimization problem being easily solvable by a trust-region-type calculation [25, Section 7.3] or directly by an equivalent eigenvalue analysis. As for the first-order case, the ratio $\phi^{\delta}_{f,2}(x)/\delta^2$ is independent of $\delta$ and obtaining the limit for $\delta$ tending to zero is trivial.
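The second-order computation can be carried out directly; the following sketch (with illustrative data, assuming a zero gradient as above) obtains $\phi^{\delta}_{f,2}(x)$ from an eigenvalue decomposition and validates it against a brute-force search over boundary directions:

```python
import numpy as np

# Second-order criticality measure at a point with zero gradient:
#   phi_{f,2}^delta(x) = - min_{||d|| <= delta} (1/2) d^T H d
#                      = (1/2) delta^2 * max(0, -lambda_min(H)),
# computable directly from an eigenvalue decomposition.
# H and delta below are illustrative data, not taken from the text.
H = np.array([[1.0, 2.0],
              [2.0, -3.0]])
delta = 0.5

lam_min = np.linalg.eigvalsh(H)[0]            # smallest eigenvalue of H
phi = 0.5 * delta**2 * max(0.0, -lam_min)

# Brute-force check on a fine grid of boundary points (the minimum of an
# indefinite quadratic over a ball is attained on the boundary):
theta = np.linspace(0.0, 2.0 * np.pi, 100001)
D = delta * np.column_stack([np.cos(theta), np.sin(theta)])
quad = 0.5 * np.einsum('ij,jk,ik->i', D, H, D)
assert abs(phi - (-quad.min())) < 1e-6
```

The eigenvalue route scales to large $n$ via `eigvalsh` or a Lanczos-type solver, whereas the grid check is only practical in two dimensions.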
In the third-order case (assuming the first- and second-order measures vanish), the relevant model must be minimized over the subspace $M(x) \stackrel{\rm def}{=} \ker[\nabla^2_x f(x)]$, and $P_{M(x)}$ is the orthogonal projection onto that subspace, where the first equality results from (2.1). In this case, the global optimization in the subspace $M(x)$ is potentially harder to solve exactly (a randomization argument is used in [1] to derive an upper bound on its value), although it still involves a subspace. 4 While we are unaware of a technique for making the global minimization in (3.27) easy in the even more complicated general case, we may think of approximating the limit in (3.33) by choosing a (user-supplied) value of $\delta > 0$ small enough 5 and considering the size of the quantity $\phi^{\delta}_{f,j}(x)$. Unfortunately, it is easy to see that, if $\delta$ is fixed at some positive value, a zero value of $\phi^{\delta}_{f,j}(x)$ is not a necessary condition for $x$ being a local minimizer. Indeed, consider the univariate problem of minimizing $f(x) = x^2(1 - \alpha x)$ for $\alpha > 0$. One verifies that, for any $\delta > 0$, the choice $\alpha = 2/\delta$ yields that
$\phi^{\delta}_{f,3}(0) = \delta^2 > 0,$ (3.37)
despite 0 being a local (but not global) minimizer. As a matter of fact, $\phi^{\delta}_{f,j}(x)$ gives more information than the mere potential proximity of a $j$th-order critical point: it is able to see beyond an infinitesimal neighbourhood of $x$ and provides information on possible further descent beyond such a neighbourhood. Rather than a true criticality measure, it can be considered, for fixed $\delta$, as an indicator of further progress, but its use for terminating at a local minimizer is clearly imperfect. Despite this drawback, the above arguments suggest that it is reasonable to consider a (conceptual) minimization algorithm whose objective is to find a point $x$ such that the approximate criticality condition (3.38) holds for some $\delta \in (0, 1]$ sufficiently small and some $q \in \{1, \ldots, p\}$. This condition implies an approximate minimizing property, which we make more precise in the following result.
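The univariate example is easy to reproduce numerically; the sketch below evaluates the cubic Taylor model of $f(x) = x^2(1 - \alpha x)$ at the origin (where the model coincides with $f$ itself, since $f$ is a cubic) and recovers a positive indicator value $\delta^2$ despite the origin being a local minimizer:

```python
import numpy as np

# f(x) = x^2 (1 - a x) with a = 2/delta: the origin is a local minimizer,
# yet the third-order indicator phi_{f,3}^delta(0) is positive because the
# cubic Taylor model (here, f itself) sees the descent beyond the basin.
delta = 0.1
a = 2.0 / delta

s = np.linspace(-delta, delta, 200001)       # feasible ball [-delta, delta]
model = s**2 * (1.0 - a * s)                  # T_{f,3}(0, s), with f(0) = 0
phi = -model.min()                            # phi = f(0) - min of the model

# The model minimum is attained at s = delta, with value -delta^2:
assert abs(phi - delta**2) < 1e-6
```

This matches the discussion above: the positive value of the indicator at a local minimizer signals reachable further descent at distance $\delta$, not non-criticality.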

(3.39)
Proof Consider $x + d \in F$. Using the triangle inequality, we have that
$f(x + d) \ge T_{f,q}(x, d) - |f(x + d) - T_{f,q}(x, d)|.$ (3.40)
Now, condition (3.38) for $j = q$ implies that, if $\|d\| \le \delta$, the first term on the right-hand side cannot fall much below $f(x)$, and the desired result follows.
The size of the neighbourhood of $x$ where $f$ is "locally smallest" (in that the first part of (3.39) holds) therefore increases with the criticality order $q$, a feature potentially useful in various contexts such as global optimization. Before turning to more algorithmic aspects, we briefly compare the results of Theorem 3.7 with what can be deduced about the local behaviour of the Taylor series $T_{f,q}(x_*, s)$ if, instead of requiring the exact necessary condition (3.9) to hold exactly, this condition is relaxed to (3.42), while insisting that (3.10) should hold exactly. If $j = q = 1$, it is easy to verify that (3.42) for $s_1 \in T_F(x_*)$ is equivalent to a bound from which we deduce, using the Cauchy–Schwarz inequality, that (3.38) holds for $j = 1$ and all $d \in T_F(x_*)$ with $\|d\| \le \delta$. Thus, by Theorem 3.7, we obtain that (3.39) holds for $j = 1$.

A Trust-Region Minimization Algorithm
Aware of the optimality conditions and their limitations, we may now consider an algorithm to achieve (3.38). This objective naturally suggests a trust-region 6 formulation with adaptive model degree, in which the user specifies a desired criticality order $q$, assuming that derivatives of order $1, \ldots, q$ are available when needed. We make this idea explicit in Algorithm 4.1.
Step 1 : Step computation. For $j = 1, \ldots, q$, compute $\phi^{\delta_k}_{f,j}(x_k)$; if the criticality test fails for some $j$, go to Step 3 with $s_k = d$, where $d$ is the argument of the global minimum in the computation of $\phi^{\delta_k}_{f,j}(x_k)$.
Step 2 : Termination. Terminate with $x_\epsilon = x_k$ and $\delta_\epsilon = \delta_k$.
Step 3 : Accept the new iterate. Compute $f(x_k + s_k)$ and the ratio $\rho_k$ of achieved to predicted reduction (4.1).
Step 4 : Update the trust-region radius. Set $\delta_{k+1}$ according to (4.2), increment $k$ by one and go to Step 1.
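To make the structure of the algorithm concrete, here is a minimal runnable sketch for the simplest case $q = 1$ and $F = \mathbb{R}^n$, where the measure reduces to $\phi^{\delta}_{f,1}(x) = \delta \|\nabla^1_x f(x)\|$ and the model minimizer is the Cauchy point. All constants ($\eta_1$, $\eta_2$, $\gamma_1$, $\gamma_2$, $\epsilon$) and the test problem are illustrative assumptions rather than values from the text:

```python
import numpy as np

def tr_first_order(f, grad, x0, eps=1e-4, delta0=1.0,
                   eta1=0.05, eta2=0.9, gamma1=0.5, gamma2=2.0, max_it=50000):
    """Conceptual trust-region method for q = 1 on an unconstrained problem."""
    x, delta = np.asarray(x0, float), delta0
    for _ in range(max_it):
        g = grad(x)
        phi = delta * np.linalg.norm(g)
        if phi <= eps * delta:                  # termination test (order 1)
            return x, delta
        s = -delta * g / np.linalg.norm(g)      # global min of linear model
        rho = (f(x) - f(x + s)) / phi           # achieved vs predicted decrease
        if rho >= eta1:                         # (very) successful: accept step
            x = x + s
        delta = gamma2 * delta if rho >= eta2 else \
                (delta if rho >= eta1 else gamma1 * delta)
    return x, delta

# Illustrative strictly convex quadratic test problem:
f = lambda x: (x[0] - 1.0)**2 + 2.0 * (x[1] + 0.5)**2
grad = lambda x: np.array([2.0 * (x[0] - 1.0), 4.0 * (x[1] + 0.5)])
xs, _ = tr_first_order(f, grad, [3.0, 3.0])
assert np.linalg.norm(grad(xs)) <= 1e-4
```

For $q > 1$ the only structural change is that Step 1 must globally minimize the degree-$q$ Taylor model inside the ball, which is the expensive part the text acknowledges.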
We first state a useful property of Algorithm 4.1, which ensures that a fixed fraction of the iterations $1, 2, \ldots, k$ must be either successful or very successful. Indeed, if we define $S_k$ to be the set of indices of successful or very successful iterations up to iteration $k$, the following bound holds. Lemma 4.1 Assume that $\delta_k \ge \delta_{\min}$ for some $\delta_{\min} > 0$ independent of $k$. Then Algorithm 4.1 ensures that, whenever $S_k \ne \emptyset$, (4.3) holds. Proof The trust-region update (4.2) ensures that
$\delta_{\min} \le \delta_k \le \delta_1 \, \gamma_2^{|S_k|} \gamma_1^{|U_k|},$
where $U_k = \{1, \ldots, k\} \setminus S_k$. This inequality then yields (4.3) by taking logarithms and using that $|S_k| \ge 1$ and $k = |S_k| + |U_k|$.

Evaluation Complexity for Algorithm 4.1
We start our worst-case analysis by formalizing our assumptions. Let $L_f$ denote the level set $\{x \in F \mid f(x) \le f(x_1)\}$.

AS.1
The feasible set F is closed, convex and non-empty.

AS.2
The objective function f is q times continuously differentiable on an open set containing L f .

AS.3
For j ∈ {1, . . . , q}, the jth derivative of f is Lipschitz continuous on L f (in the sense of (2.4)) with Lipschitz constant L f, j ≥ 1.
For simplicity of notation, define $L \stackrel{\rm def}{=} \max_{j \in \{1, \ldots, q\}} L_{f,j}$. Algorithm 4.1 is required to start from a feasible $x_1 \in F$, which, together with the fact that the subproblem solution in Step 1 involves minimization over $F$, leads to AS.1. Note that AS.3 requires AS.2 and automatically holds if $f$ is $q + 1$ times continuously differentiable and $F$ is bounded.
We now establish a lower bound on the trust-region radius, stated as (4.4) in terms of a constant $\kappa \stackrel{\rm def}{=} \min\{1, \ldots\}$. Proof Assume that, for some $\ell \in \{1, \ldots, k\}$, (4.5) holds. From (4.1), we obtain that, for some $j \in \{1, \ldots, q\}$, the ratio $\rho_\ell$ is close to one, where we used (2.8) (implied by AS.3) and the fact that $\phi^{\delta_\ell}_{f,j}(x_\ell) > \epsilon_j \delta_\ell^{\,j}$ to deduce the first inequality, the bound $\|s_\ell\| \le \delta_\ell$ to deduce the second, and (4.6) with $L_{f,j} \ge 1$ to deduce the third. Thus, $\rho_\ell \ge \eta_2$ and $\delta_{\ell+1} \ge \delta_\ell$. The mechanism of the algorithm and the inequality $\delta_1 \ge \ldots$ then ensure that, for all $\ell \le k$, $\delta_\ell \ge \min\{\delta_1, \ldots\}$. We next derive a simple lower bound on the objective-function decrease at successful iterations.

Lemma 4.3
Suppose that AS.1-AS.3 hold, and that termination does not occur before iteration $k + 1$. Then, if $k$ is the index of a successful iteration, (4.8) holds. Proof We have, using (4.1), the fact that $\phi^{\delta_k}_{f,j}(x_k) > \epsilon_j \delta_k^{\,j}$ for some $j \in \{1, \ldots, q\}$, and (4.4) successively, that the decrease $f(x_k) - f(x_{k+1})$ is bounded below as in (4.8). Our worst-case evaluation complexity results can now be proved by summing the decreases guaranteed by this last lemma.
Theorem 4.4 Suppose that AS.1-AS.3 hold. Then Algorithm 4.1 needs at most (4.9) successful iterations (each possibly involving one evaluation of $f$ and its $q$ first derivatives) and at most (4.10) iterations in total to terminate with an iterate $x_\epsilon$ such that (3.38) holds, where the involved constants are defined in (4.11) and $\kappa_u$ is given by (4.3). Moreover, if $\ell_\epsilon$ is the value of $k$ at termination, (4.12) holds for all $d$ satisfying (4.13). Proof Let $k$ be the index of an arbitrary iteration before termination. Using the definition of $f_{\rm low}$, the nature of successful iterations, (4.11) and Lemma 4.3, we deduce (4.14), which proves (4.9). We next call upon Lemma 4.1 to compute the upper bound (4.10) on the total number of iterations before termination (obviously, there must be at least one successful iteration unless termination occurs for $k = 1$) and add one for the evaluation at termination. Finally, (4.12) and (4.13) result from AS.3, Theorem 3.7 and the fact that $\phi^{\delta_\epsilon}_{f,q}(x_\epsilon) \le \epsilon_q \delta_\epsilon^{\,q}$ at termination. Observe that, because of (4.2) and (4.4), $\delta_\epsilon \in [\kappa_\delta, \delta_{\max}]$. Theorem 4.4 generalizes the known bounds for the cases where $F = \mathbb{R}^n$ and $q = 1$ [46], $q = 2$ [16,47] and $q = 3$ [1]. The results for $q = 2$ with $F \subset \mathbb{R}^n$ and for $q > 3$ appear to be new. The latter provide the first evaluation complexity bounds for general criticality order $q$. Note that, if $q = 1$, bounds of the type $O(\epsilon^{-(p+1)/p})$ exist if one is ready to minimize models of degree $p > q$ (see [9]). Whether similar improvements can be obtained for $q > 1$ remains an open question at this stage.
We also observe that the above theory remains valid if the termination rule is weakened along the lines of (2.9). Moreover, in the derivation of the complexity bounds (4.9) and (4.10), the Lipschitz continuity implied by AS.3 is only used for deriving the first inequality of (4.7), in that Lipschitz continuity of $\nabla^q_x f$ implies (2.8) along the segment $[x_k, x_k + s_k]$. Since it was discussed in Sect. 2.3 that (2.10) implies the same (2.8) along this segment, the weaker assumption

AS.3b
For j ∈ {1, . . . , q}, the jth derivative of f is Lipschitz continuous on "the tree of iterates" in the sense that (2.10) is assumed to hold with constant L_{f,p} ≥ 1 for all x = x_k, s = s_k, p = j and all k ≥ 0
is all that is required for deriving (4.7). AS.3b can therefore replace AS.3 in Theorem 4.4 for the limited purpose of ensuring (4.9)–(4.11).

Sharpness
It is interesting that an example was presented in [18] showing that the bound of O(ε^{−3}) evaluations for q = 2 is essentially sharp for both the trust-region and regularization algorithms. This is significant, because requiring φ^{f,2}_δ(x) ≤ ε_2 is slightly stronger, for small ε, than the standard condition that ‖∇_x f(x)‖ ≤ ε and λ_min[∇²_x f(x)] ≥ −ε (see [16,47], for instance). Indeed, for one-dimensional problems and assuming ∇²_x f(x) ≤ 0, the former condition amounts to requiring that |∇_x f(x)| δ + ½ |∇²_x f(x)| δ² ≤ ε_2, where the absolute value on the gradient reflects the fact that the maximizing step is s = ±δ depending on the sign of g.
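The one-dimensional computation just mentioned can be made explicit. Writing g = ∇_x f(x) and h = ∇²_x f(x) ≤ 0, the criticality measure reduces to a maximization over the interval [−δ, δ]:

```latex
\phi^{f,2}_{\delta}(x)
  \;=\; \max_{|s|\le\delta}\,-\Bigl(g\,s+\tfrac12\,h\,s^2\Bigr)
  \;=\; |g|\,\delta+\tfrac12\,|h|\,\delta^2 ,
```

both terms being maximized simultaneously at s = −sign(g) δ, since −½hs² = ½|h|s² is largest at |s| = δ irrespective of the sign of s.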
In the remainder of this section, we show that the example proposed in [18] can be extended to arbitrary order q, and thus that the complexity bounds (4.9)–(4.10) are essentially sharp for our trust-region algorithm. The idea of our generalized example is to apply Algorithm 4.1 to a unidimensional objective function f for some fixed q ≥ 1 and F = R_+ (hence guaranteeing AS.1), generating a sequence of iterates {x_k}_{k≥0} starting from the origin, i.e., x_0 = x_1 = 0. We first choose the sequence of derivative values up to order q at the iterates, for all k ≥ 1, in terms of a (small) positive constant δ ∈ (0, 1). This means that, at iterate x_k, the qth-order Taylor model is fully determined, the value of f(x_k) remaining unspecified for now. The step s_k is then obtained by minimizing this model in a trust region of radius δ_k, and the resulting model decrease is given by (5.5). For our example, we then define the objective function decrease at iteration k by (5.6), thereby ensuring that ρ_k ∈ [η_1, η_2) and x_{k+1} = x_k + s_k for each k. Summing up function decreases, we may then specify the objective function's values at the iterates in terms of the Riemann zeta function ζ(t) := Σ_{k≥1} k^{−t}. This function is finite for all t > 1 (and thus also for t = 1 + (q+1)δ), thereby ensuring that f(x_k) ≥ 0 for all k ≥ 0. We also verify that the radius update is in accordance with (4.2), provided γ_2 ≤ (2/3)^{1/(q+1)+δ}. Observe also that (5.3) and (5.5) ensure, for each k ≥ 1, the bounds (5.8) and (5.9).

We now use Hermite interpolation to construct the objective function f on the successive intervals [x_k, x_{k+1}], where, on each such interval, f is given by the polynomial p_k whose coefficients are defined by the interpolation conditions (5.12). These conditions ensure that f(x) is q times continuously differentiable on R_+ and thus that AS.2 holds. They also impose the values of the first q + 1 coefficients of p_k, the corresponding right-hand side being given by (5.15). Observe now that the coefficient matrix of this linear system may be written in terms of a matrix M_q which is invertible and independent of k (see Appendix), whence (5.17) follows. Observe also that, because of (5.4), (5.6), (5.5) and (5.3), the right-hand side r_k remains bounded: these bounds and (5.15) imply that [r_k]_i, the ith component of r_k, is bounded independently of k. Hence, using (5.17) and the non-singularity of M_q, we obtain that there exists a constant κ_q ≥ 1 independent of k bounding the coefficients of p_k, as stated in (5.18). Moreover, using successively (5.11), the triangle inequality, (5.13), (5.3), (5.4), (5.18) and κ_q ≥ 1, we obtain that, for j ∈ {1, . . . , q}, the jth derivative is uniformly bounded, and thus all derivatives of order one up to q remain bounded on [0, s_k]. Because of (5.10), we therefore obtain that AS.3 holds. Moreover, (5.13), (5.18), the inequalities |∇^q_x f(x_k)| ≤ q! and f(x_k) ≥ 0, (5.10) and (5.4) also ensure that f(x) is bounded below.
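The Hermite interpolation step can be illustrated numerically. The sketch below is our own illustration (the function `hermite_coeffs` and the endpoint data are hypothetical, not the quantities of this section): it builds the unique polynomial of degree 2q + 1 matching the value and first q derivatives prescribed at both endpoints of an interval, by solving a linear system analogous to (5.12).

```python
import numpy as np
from math import factorial

def hermite_coeffs(q, s, left, right):
    """Coefficients c_0..c_{2q+1} of p(t) = sum_i c_i t^i on [0, s] with
    p^{(j)}(0) = left[j] and p^{(j)}(s) = right[j] for j = 0..q."""
    n = 2 * q + 2                      # 2(q+1) conditions, degree 2q+1
    A = np.zeros((n, n))
    b = np.zeros(n)
    for j in range(q + 1):
        # Condition at the left endpoint: p^{(j)}(0) = j! * c_j.
        A[j, j] = factorial(j)
        b[j] = left[j]
        # Condition at the right endpoint:
        # p^{(j)}(s) = sum_{i >= j} i!/(i-j)! * c_i * s^{i-j}.
        for i in range(j, n):
            A[q + 1 + j, i] = factorial(i) // factorial(i - j) * s ** (i - j)
        b[q + 1 + j] = right[j]
    return np.linalg.solve(A, b)

def deriv(c, j, t):
    """jth derivative at t of the polynomial with coefficients c."""
    return sum(factorial(i) / factorial(i - j) * c[i] * t ** (i - j)
               for i in range(j, len(c)))

# Example: q = 2, matching value and first two derivatives at both ends.
q, s = 2, 0.5
left = [1.0, -1.0, 0.0]    # p(0), p'(0), p''(0)
right = [0.8, 0.0, 0.5]    # p(s), p'(s), p''(s)
c = hermite_coeffs(q, s, left, right)

# Verify the interpolation conditions.
assert all(abs(deriv(c, j, 0.0) - left[j]) < 1e-9 for j in range(q + 1))
assert all(abs(deriv(c, j, s) - right[j]) < 1e-9 for j in range(q + 1))
```

Note that, as in the text, the left-endpoint conditions fix the first q + 1 coefficients directly (c_j = left[j]/j!), while the remaining ones are obtained from the right-endpoint conditions through a matrix that, after rescaling by s, is independent of the interval length.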
We have therefore shown that the bounds of Theorem 4.4 are essentially sharp, in that, for every δ > 0, Algorithm 4.1 applied to the problem of minimizing the lower-bounded objective function f just constructed (which satisfies AS.1–AS.3) takes, because of (5.8) and (5.9), of the order of ε^{−(q+1)/(1+(q+1)δ)} iterations and evaluations of f and its q first derivatives to find an iterate x_k such that condition (4.15) holds. Moreover, it is clear that, in the example presented, the global rate of convergence is driven by the term of degree q in the Taylor series.
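Two quantitative ingredients of the example can be checked numerically: the summability of the per-iteration decreases (through the zeta-function exponent 1 + (q+1)δ) and the growth of the iteration count, here assumed to be of order ε^{−(q+1)/(1+(q+1)δ)}. The sketch below uses illustrative values q = 2 and δ = 0.01; the constants and helper names are our own:

```python
import math

def zeta_partial(t, n):
    """Partial sum of the Riemann zeta series sum_{k=1}^n k^(-t)."""
    return sum(k ** -t for k in range(1, n + 1))

q, delta = 2, 0.01                      # illustrative choices
t = 1 + (q + 1) * delta                 # exponent of the summable series

# The series converges for t > 1 (integral-test bound), so the accumulated
# objective decreases are finite and the constructed f stays bounded below.
assert zeta_partial(t, 50000) < 1 + 1 / (t - 1)

# Assumed iteration count of the example: eps^(-(q+1)/(1+(q+1)*delta)),
# which approaches the eps^(-(q+1)) bound of Theorem 4.4 as delta -> 0.
def iters(eps):
    return math.ceil(eps ** (-(q + 1) / (1 + (q + 1) * delta)))

# Effective exponent at eps = 1e-3 is already close to q + 1 = 3.
print(math.log10(iters(1e-3)) / 3)      # ~2.91 for delta = 0.01
```

This makes concrete the sense in which the bound is "essentially" sharp: the exponent (q+1)/(1+(q+1)δ) can be made arbitrarily close to q + 1 by taking δ small.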

Discussion
We have analysed the necessary and sufficient optimality conditions of arbitrary order for convexly constrained nonlinear optimization problems, using approximations of the feasible region which generalize the idea of second-order tangent sets (see [10]) to orders beyond two. Using the resulting necessary conditions, we then proposed a measure of criticality of arbitrary order for convexly constrained nonlinear optimization problems. As this measure can be extended to define ε-approximate critical points of high order, we have then used it in a conceptual trust-region algorithm to show that if the derivatives of the objective function up to order q ≥ 1 can be evaluated and are Lipschitz continuous, then this algorithm applied to the convexly constrained problem (3.1) needs at most O(ε^{−(q+1)}) evaluations of f and its derivatives to compute an ε-approximate qth-order critical point. Moreover, we have shown by an example that this bound is essentially sharp.
In the purely unconstrained case, this result recovers known results for q = 1 (first-order criticality for Lipschitz gradients) [46], q = 2 (second-order criticality, using (3.34), with Lipschitz Hessians) [18,47] and q = 3 (third-order criticality, using (3.35), with Lipschitz continuous third derivatives) [1], but extends them to arbitrary order. The results for the convexly constrained case appear to be new and provide, in particular, the first complexity bounds for second- and third-order criticality for such inequality constrained problems.
Because the condition (4.15) measures different orders of criticality, we could choose to use a different ε_j for every order j (as in [18]), complicating the expression of the bound accordingly. However, as shown by our example, the worst-case behaviour of Algorithm 4.1 is dominated by that of ∇^q_x f, which makes the distinction between the various ε_j less crucial.
Because of the global optimization occurring in the definition of the criticality measure φ^{f,j}_δ(x), the algorithm discussed in the present paper remains, in general, of a theoretical nature. However, there may be cases where this computation is tractable for small enough q, for instance if the derivative tensors of the objective function are strongly structured. Such approaches may then be of use for small dimensional or structured highly nonlinear problems, such as those occurring in machine learning using deep learning techniques (see [1]).
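To see why this global optimization quickly dominates the cost, consider a small brute-force sketch (our own illustration, assuming the Euclidean ball, F = R² and j = 2): it estimates φ^{f,2}_δ(x) = f(x) − min_{‖d‖≤δ} T₂(x,d) by sampling the second-order Taylor model on a grid.

```python
import numpy as np

def phi2(grad, hess, delta, n_grid=201):
    """Brute-force estimate of phi^{f,2}_delta(x): the largest decrease of
    the second-order Taylor model over the Euclidean ball of radius delta."""
    ts = np.linspace(-delta, delta, n_grid)
    d1, d2 = np.meshgrid(ts, ts)
    D = np.stack([d1.ravel(), d2.ravel()], axis=1)
    D = D[np.einsum('ij,ij->i', D, D) <= delta ** 2]   # keep ||d|| <= delta
    # Model difference T2(x, d) - f(x) = g'd + 0.5 d'H d at each grid point.
    model = D @ grad + 0.5 * np.einsum('ij,jk,ik->i', D, hess, D)
    return -model.min()

# A point that is first-order critical along one axis but has a direction
# of negative curvature: the measure detects the lack of second-order
# criticality (hypothetical data for illustration).
g = np.array([1.0, 0.0])
H = np.array([[2.0, 0.0], [0.0, -1.0]])
val = phi2(g, H, delta=1.0)
print(val)   # positive: x is not an approximate second-order critical point
```

Even this two-dimensional sampling requires tens of thousands of model evaluations per point, and the cost grows exponentially with the dimension and with the order j, which is why the measure is mainly of theoretical interest in general.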
The present framework for handling convex constraints is not free of limitations, resulting from our choice to transfer the difficulties associated with the original problem to the subproblem solution, thereby sparing precious evaluations of f and its derivatives. In particular, the cost of evaluating any constraint function/derivative possibly defining the convex feasible set F is neglected by the present approach, which must therefore be seen as a suitable framework to handle "cheap inequality constraints" such as simple bounds.
Questions of course arise from the results presented. The first is whether it is possible to extend the existing work (e.g., [10]) on bridging the gap between necessary and sufficient optimality conditions for orders one and two to higher orders, possibly by finding sufficient conditions to ensure (3.21) and by isolating problem classes where this constraint qualification condition automatically holds. From the complexity point of view, it is known that the complexity of obtaining ε-approximate first-order criticality for unconstrained and convexly constrained problems can be reduced to O(ε^{−(p+1)/p}) if one is ready to define the step by using a regularization model of order p ≥ 1. In the unconstrained case, this was shown for p = 2 in [16,47] and for general p ≥ 1 in [9], while the convexly constrained case was analysed (for p = 2) in [17]. The question of whether this methodology and the associated improvements in evaluation complexity bounds can be extended to criticality orders above one also remains open at this stage.