Trading Performance for Stability in Markov Decision Processes

We study the complexity of central controller synthesis problems for finite-state Markov decision processes, where the objective is to optimize both the expected mean-payoff performance of the system and its stability. We argue that the basic theoretical notion of expressing the stability in terms of the variance of the mean-payoff (called global variance in our paper) is not always sufficient, since it ignores possible instabilities on respective runs. For this reason we propose alternative definitions of stability, which we call local and hybrid variance, and which express how rewards on each run deviate from the run's own mean-payoff and from the expected mean-payoff, respectively. We show that a strategy ensuring both the expected mean-payoff and the variance below given bounds requires randomization and memory, under all the above semantics of variance. We then look at the problem of determining whether there is such a strategy. For the global variance, we show that the problem is in PSPACE, and that the answer can be approximated in pseudo-polynomial time. For the hybrid variance, the analogous decision problem is in NP, and a polynomial-time approximating algorithm also exists. For local variance, we show that the decision problem is in NP. Since the overall performance can be traded for stability (and vice versa), we also present algorithms for approximating the associated Pareto curve in all three cases. Finally, we study a special case of the decision problems, where we require a given expected mean-payoff together with zero variance. Here we show that the problems can all be solved in polynomial time.


I. Introduction
Markov decision processes (MDPs) are a standard model for stochastic dynamic optimization. Roughly speaking, an MDP consists of a finite set of states, where in each state, one of the finitely many actions can be chosen by a controller. For every action, there is a fixed probability distribution over the states. The execution begins in some initial state where the controller selects an outgoing action, and the system evolves into another state according to the distribution associated with the chosen action. Then, another action is chosen by the controller, and so on. A strategy is a recipe for choosing actions. In general, a strategy may depend on the execution history (i.e., actions may be chosen differently when revisiting the same state) and the choice of actions can be randomized (i.e., the strategy specifies a probability distribution over the available actions). Fixing a strategy for the controller makes the behaviour of a given MDP fully probabilistic and determines the usual probability space over its runs, i.e., infinite sequences of states and actions.
A fundamental concept of performance and dependability analysis based on MDP models is mean-payoff. Let us assume that every action is assigned some rational reward, which corresponds to some costs (or gains) caused by the action. The mean-payoff of a given run is then defined as the long-run average reward per executed action, i.e., the limit of partial averages computed for longer and longer prefixes of a given run. For every strategy σ, the overall performance (or throughput) of the system controlled by σ then corresponds to the expected value of the mean-payoff, i.e., the expected mean-payoff. It is well known (see, e.g., [18]) that optimal strategies for minimizing/maximizing the expected mean-payoff are positional (i.e., deterministic and independent of execution history), and can be computed in polynomial time. However, the quality of services provided by a given system often depends not only on its overall performance, but also on its stability. For example, an optimal controller for a live video streaming system may achieve an expected throughput of approximately 2 Mbits/sec. That is, if a user connects to the server many times, he gets a 2 Mbits/sec connection on average. If an acceptable video quality requires at least 1.8 Mbits/sec, the user is also interested in the likelihood that he gets at least 1.8 Mbits/sec. That is, he requires a certain level of overall stability in service quality, which can be measured by the variance of the mean-payoff, called global variance in this paper. The basic computational question is "given rationals u and v, is there a strategy that achieves the expected mean-payoff u (or better) and variance v (or better)?". Since the expected mean-payoff can be "traded" for a smaller global variance, we are also interested in approximating the associated Pareto curve consisting of all points (u, v) such that (1) there is a strategy achieving the expected mean-payoff u and global variance v; and (2) no strategy can improve u or v without worsening the other parameter.
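As a small illustration of the measure (with hypothetical numbers that are not part of the streaming example above), suppose that under some strategy one half of the runs have mean-payoff 1.6 Mbits/sec and the other half have mean-payoff 2.4 Mbits/sec. Then

  E[mp] = 0.5·1.6 + 0.5·2.4 = 2,
  V[mp] = 0.5·(1.6 − 2)² + 0.5·(2.4 − 2)² = 0.16,

so the expected throughput is 2 Mbits/sec, but the global variance 0.16 quantifies how far the mean-payoffs of individual runs tend to lie from this expectation.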
The global variance says how much the actual mean-payoff of a run tends to deviate from the expected mean-payoff. However, it does not say anything about the stability of individual runs. To see this, consider again the video streaming system example, where we now assume that although the connection is guaranteed to be fast on average, the amount of data delivered per second may change substantially along the executed run, for example due to a faulty network infrastructure. For simplicity, let us suppose that performing one action in the underlying MDP model takes one second, and the reward assigned to a given action corresponds to the amount of transferred data. The above scenario can be modeled by saying that 6 Mbits are downloaded every third action, and 0 Mbits are downloaded in the other time frames. Then the user gets a 2 Mbits/sec connection almost surely, but since the individual runs are apparently "unstable", he may still see a lot of stuttering in the video stream. As an appropriate measure for the stability of individual runs, we propose local variance, which is defined as the long-run average of (r_i(ω) − mp(ω))², where r_i(ω) is the reward of the i-th action executed in a run ω and mp(ω) is the mean-payoff of ω. Hence, local variance says how much the rewards of the actions executed along a given run deviate from the mean-payoff of the run on average. For example, if the mean-payoff of a run is 2 Mbits/sec and all of the executed actions deliver 2 Mbits, then the run is "absolutely smooth" and its local variance is zero. The level of "local stability" of the whole system (under a given strategy) then corresponds to the expected local variance. The basic algorithmic problem for local variance is similar to the one for global variance, i.e., "given rationals u and v, is there a strategy that achieves the expected mean-payoff u (or better) and the expected local variance v (or better)?". We are also interested in the underlying Pareto curve.
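For the run sketched above, in which 6 Mbits are downloaded every third second and 0 Mbits otherwise, the mean-payoff is mp(ω) = 2, and the local variance evaluates to

  lv(ω) = (1/3)·(6 − 2)² + (2/3)·(0 − 2)² = (16 + 8)/3 = 8,

so although the run has the "correct" throughput, its local variance is large, reflecting the stuttering.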
Observe that the global variance and the expected local variance capture different and to a large extent independent forms of systems' (in)stability. Even if the global variance is small, the expected local variance may be large, and vice versa. In certain situations, we might wish to minimize both of them at the same time. Therefore, we propose another notion of hybrid variance as a measure for the "combined" stability of a given system. Technically, the hybrid variance of a given run ω is defined as the long-run average of (r_i(ω) − E[mp])², where E[mp] is the expected mean-payoff. That is, hybrid variance says how much the rewards of individual actions executed along a given run deviate from the expected mean-payoff on average. The combined stability of the system then corresponds to the expected hybrid variance. One of the most crucial properties that motivate the definition of hybrid variance is that the expected hybrid variance is small iff both the global variance and the expected local variance are small (in particular, for a prominent class of strategies the expected hybrid variance is the sum of the expected local and global variances). The studied algorithmic problems for hybrid variance are analogous to the ones for global and local variance.
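To see where the decomposition mentioned above comes from, consider a run ω for which the defining limits exist and the long-run average of the rewards r_i(ω) equals mp(ω) (this is the case, e.g., for almost all runs under finite-memory strategies). Expanding the square gives

  (r_i(ω) − E[mp])² = (r_i(ω) − mp(ω))² + 2·(r_i(ω) − mp(ω))·(mp(ω) − E[mp]) + (mp(ω) − E[mp])².

Taking long-run averages, the middle term vanishes (its average equals 2·(mp(ω) − E[mp])·(mp(ω) − mp(ω)) = 0), and hence hv(ω) = lv(ω) + (mp(ω) − E[mp])². Taking expectations over runs then yields E[hv] = E[lv] + V[mp] for such strategies.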
The Results. Our results are as follows:
1) (Global variance). The global variance problem was considered before, but only under the restriction to memoryless strategies [21]. We first show that in general randomized memoryless strategies are not sufficient for Pareto optimal points for global variance (Example 1). We then establish that 2-memory strategies are sufficient. We show that the basic algorithmic problem for global variance is in PSPACE, and the approximate version can be solved in pseudo-polynomial time.
2) (Local variance). The local variance problem comes with new conceptual challenges. For example, for unichain MDPs, deterministic memoryless strategies are sufficient for global variance, whereas we show (Example 2) that even for unichain MDPs both randomization and memory are required for local variance. We establish that 3-memory strategies are sufficient for Pareto optimality for local variance. We show that the basic algorithmic problem (and hence also the approximate version) is in NP.
3) (Hybrid variance). After defining hybrid variance, we establish that 2-memory strategies are sufficient for Pareto optimality, and that in general randomized memoryless strategies are not. We show that the basic algorithmic problem for hybrid variance is in NP, and the approximate version can be solved in polynomial time.
4) (Zero variance). Finally, we consider the problem where the variance is optimized to zero (as opposed to a given non-negative number in the general case). In this case, we present polynomial-time algorithms to compute the optimal mean-payoff that can be ensured with zero variance (if zero variance can be ensured) for all three cases. The polynomial-time algorithms for zero variance for mean-payoff objectives are in sharp contrast to the NP-hardness for cumulative reward MDPs [16].
To prove the above results, one has to overcome various obstacles. For example, although at multiple places we build on the techniques of [13] and [4], which allow us to deal with maximal end components of an MDP separately, we often need to extend these techniques, since unlike the above works, which study multiple "independent" objectives, in the case of global and hybrid variance any change of value of the expected mean-payoff implies a change of value of the variance. Also, since we do not impose any restrictions on the structure of the strategies, we cannot even assume that the limits defining the mean-payoff and the respective variances exist; this becomes most apparent in the case of local and hybrid variance, where we need to rely on delicate techniques of selecting runs from which the limits can be extracted. Another complication is that while most of the work on multi-objective verification deals with objective functions which are linear, our objective functions are inherently quadratic due to the definition of variance.
The summary of our results is presented in Table I. A simple consequence of our results is that the Pareto curves can be approximated in pseudo-polynomial time in the case of global and hybrid variance, and in exponential time for local variance.

Related work. The trade-off between the expected mean-payoff and the global variance for unichain MDPs was considered in [10], where a solution using quadratic programming was designed; under memoryless (stationary) strategies the problem was considered in [21]. All the above works on the mean-payoff variance trade-off consider the global variance, and are restricted to memoryless strategies. The problem for general strategies and global variance was not solved before. Although restrictions to unichains or memoryless strategies are feasible in some areas, many systems modelled as MDPs might require a more general approach. For example, a decision of a strategy to shut the system down might make it impossible to return to the running state again, yielding a non-unichain MDP. Similarly, it is natural to synthesise strategies that change their decisions over time.
As regards other types of objectives, no work considers the local and hybrid variance problems. The variance problem for discounted reward MDPs was studied in [20]. The trade-off of expected value and variance of cumulative reward in MDPs was studied in [16], showing the zero variance problem to be NP-hard. This contrasts with our results, since in our setting we present polynomial-time algorithms for zero variance.

II. Preliminaries
We use N, Z, Q, and R to denote the sets of positive integers, integers, rational numbers, and real numbers, respectively. We assume familiarity with basic notions of probability theory, e.g., probability space, random variable, or expected value. As usual, a probability distribution over a finite or countable set X is a function f : X → [0, 1] such that ∑_{x∈X} f(x) = 1. The set of all distributions over X is denoted by dist(X).
For our purposes, a Markov chain is a triple M = (L, →, µ) where L is a finite or countably infinite set of locations, → ⊆ L × (0, 1] × L is a transition relation such that for each fixed ℓ ∈ L we have ∑_{ℓ →x ℓ′} x = 1, and µ is the initial probability distribution on L. We write ℓ →x ℓ′ whenever (ℓ, x, ℓ′) ∈ →. A run in M is an infinite sequence ω = ℓ_1 ℓ_2 . . . of locations such that ℓ_i →x ℓ_{i+1} for some x > 0, for every i ∈ N. A finite path in M is a finite prefix of a run. Each finite path w in M determines the set Cone(w) consisting of all runs that start with w. To M we associate the probability space (Runs_M, F, P), where Runs_M is the set of all runs in M, F is the σ-field generated by all Cone(w) for finite paths w, and P is the unique probability measure such that P(Cone(ℓ_1, . . . , ℓ_k)) = µ(ℓ_1) · ∏_{i=1}^{k−1} x_i, where ℓ_i →x_i ℓ_{i+1} for all 1 ≤ i < k (the empty product is equal to 1).
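For instance, applying this definition to a finite path of length three, we get

  P(Cone(ℓ_1 ℓ_2 ℓ_3)) = µ(ℓ_1) · x_1 · x_2,   where ℓ_1 →x_1 ℓ_2 and ℓ_2 →x_2 ℓ_3.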

Markov decision processes. A Markov decision process
(MDP) is a tuple G = (S , A, Act, δ) where S is a finite set of states, A is a finite set of actions, Act : S → 2 A \ {∅} is an action enabledness function that assigns to each state s the set Act(s) of actions enabled at s, and δ : S × A → dist(S ) is a probabilistic transition function that given a state s and an action a ∈ Act(s) enabled at s gives a probability distribution over the successor states. For simplicity, we assume that every action is enabled in exactly one state, and we denote this state Src(a). Thus, henceforth we will assume that δ : A → dist(S ).
A run in G is an infinite alternating sequence of states and actions ω = s_1 a_1 s_2 a_2 . . . such that for all i ≥ 1, Src(a_i) = s_i and δ(a_i)(s_{i+1}) > 0. We denote by Runs_G the set of all runs in G. A finite path of length k in G is a finite prefix w = s_1 a_1 . . . a_{k−1} s_k of a run, and we use last(w) = s_k for the last state of w. A pair (T, B), where T ⊆ S and B ⊆ ⋃_{t∈T} Act(t), is an end component of G if (1) for all a ∈ B, whenever δ(a)(s′) > 0 then s′ ∈ T; and (2) for all s, t ∈ T there is a finite path w = s_1 a_1 . . . a_{k−1} s_k such that s_1 = s, s_k = t, and all states and actions that appear in w belong to T and B, respectively. An end component (T, B) is a maximal end component (MEC) if it is maximal wrt. pointwise subset ordering. The set of all MECs of G is denoted by MEC(G). Given an end component C = (T, B), we sometimes abuse notation by considering C as the disjoint union of T and B (for example, we write S ∩ C to denote the set T). For a given C ∈ MEC(G), we use R_C to denote the set of all runs ω = s_1 a_1 s_2 a_2 . . . that eventually stay in C, i.e., there is k ∈ N such that for all k′ ≥ k we have that s_{k′}, a_{k′} ∈ C.
Strategies and plays. Intuitively, a strategy in an MDP G is a "recipe" to choose actions. Usually, a strategy is formally defined as a function σ : (S A) * S → dist(A) that given a finite path w, representing the execution history, gives a probability distribution over the actions enabled in last(w). In this paper we adopt a definition which is equivalent to the standard one, but more convenient for our purpose. Let M be a finite or countably infinite set of memory elements. A strategy is a triple σ = (σ u , σ n , α), where σ u : A × S × M → dist(M) and σ n : S × M → dist(A) are memory update and next move functions, respectively, and α is an initial distribution on memory elements. We require that for all (s, m) ∈ S × M, the distribution σ n (s, m) assigns a positive value only to actions enabled at s. The set of all strategies is denoted by Σ (the underlying MDP G will be always clear from the context).
A play of G determined by an initial state s ∈ S and a strategy σ is a Markov chain G^σ_s (or G^σ if s is clear from the context) where the set of locations is S × M × A, the initial distribution µ is positive only on (some) elements of {s} × M × A, where µ(s, m, a) = α(m) · σ_n(s, m)(a), and the transition probability from (t, m, a) to (t′, m′, a′) is δ(a)(t′) · σ_u(a, t′, m)(m′) · σ_n(t′, m′)(a′). Intuitively, G^σ_s starts in a location chosen randomly according to α and σ_n. In a current location (t, m, a), the next action to be performed is a, hence the probability of entering t′ is δ(a)(t′). The probability of updating the memory to m′ is σ_u(a, t′, m)(m′), and the probability of selecting a′ as the next action is σ_n(t′, m′)(a′). Since these choices are independent (in the probability theory sense), we obtain the product above.
Note that every run in G σ s determines a unique run in G. Hence, every notion originally defined for the runs in G can also be used for the runs in G σ s , and we use this fact implicitly at many places in this paper. For example, we use the symbol R C to denote the set of all runs in G σ s that eventually stay in C, certain functions originally defined over Runs G are interpreted as random variables over the runs in G σ s , etc. Strategy types. In general, a strategy may use infinite memory, and both σ u and σ n may randomize. A strategy is pure (or deterministic) if α is Dirac and both the memory update and the next move functions give a Dirac distribution for every argument, and stochastic-update if α, σ u , and σ n are unrestricted. Note that every pure strategy is stochastic-update. A randomized strategy is a strategy which is not necessarily pure. We also classify the strategies according to the size of memory they use. Important subclasses are memoryless strategies, in which M is a singleton, n-memory strategies, in which M has exactly n elements, and finite-memory strategies, in which M is finite.
For a finite-memory strategy σ, a bottom strongly connected component (BSCC) of G σ s is a subset of locations W ⊆ S × M × A such that for all ℓ 1 ∈ W and ℓ 2 ∈ S × M × A we have that (i) if ℓ 2 is reachable from ℓ 1 , then ℓ 2 ∈ W, and (ii) for all ℓ 1 , ℓ 2 ∈ W we have that ℓ 2 is reachable from ℓ 1 . Every BSCC W determines a unique end component ({s | (s, m, a) ∈ W}, {a | (s, m, a) ∈ W}), and we sometimes do not distinguish between W and its associated end component.
An MDP is strongly connected if all its states form a single (maximal) end component. A strongly connected MDP is a unichain if for all end components (T, B) we have T = S. Throughout this paper we will use the following standard result about MECs.

Lemma 1. For every strategy σ and initial state s, almost every run of G^σ_s eventually stays in some MEC of G, i.e., P^σ_s[⋃_{C∈MEC(G)} R_C] = 1.

Global, local, and hybrid variance. Let G = (S, A, Act, δ) be an MDP, and r : A → Q a reward function. We define the mean-payoff of a run ω ∈ Runs_G by

  mp(ω) = lim sup_{T→∞} (1/T) · ∑_{i=1}^{T} r(a_i),

where a_i is the i-th action executed in ω. The expected value and variance of mp in G^σ_s are denoted by E^σ_s[mp] and V^σ_s[mp], respectively (recall that V^σ_s[mp] = E^σ_s[(mp − E^σ_s[mp])²]). Intuitively, E^σ_s[mp] corresponds to the "overall performance" of G^σ_s, and V^σ_s[mp] is a measure of "global stability" of G^σ_s indicating how much the mean-payoffs of runs in G^σ_s tend to deviate from E^σ_s[mp] (see Section I). In the rest of this paper, we refer to V^σ_s[mp] as global variance. The stability of a given run ω ∈ Runs_G (see Section I) is measured by its local variance defined as follows:

  lv(ω) = lim sup_{T→∞} (1/T) · ∑_{i=1}^{T} (r(a_i) − mp(ω))².

Note that lv(ω) is not really a "variance" in the usual sense of probability theory¹. We call the function lv(ω) "local variance" because we find this name suggestive; lv(ω) is the long-run average square of the distance from mp(ω). The expected value of lv in G^σ_s is denoted by E^σ_s[lv]. Finally, given a run ω in G^σ_s, we define the hybrid variance of ω in G^σ_s as follows:

  hv(ω) = lim sup_{T→∞} (1/T) · ∑_{i=1}^{T} (r(a_i) − E^σ_s[mp])².

Note that the definition of hv(ω) depends on the expected mean-payoff, and hence it makes sense only after fixing a strategy σ and an initial state s. Sometimes we also write hv_{σ,s}(ω) instead of hv(ω) to prevent confusions about the underlying σ and s.

The studied problems. In this paper, we study the following basic problems connected to the three stability measures introduced above (below, V^σ_s stands for either V^σ_s[mp], E^σ_s[lv], or E^σ_s[hv]):
• Pareto optimal strategies and their memory. Do Pareto optimal strategies exist for all points on the Pareto curve? Do Pareto optimal strategies require memory and randomization in general? Do strategies achieving non-Pareto points require memory and randomization in general?
• Deciding strategy existence. For a given MDP G, an initial state s, a rational reward function r, and a point (u, v) ∈ Q², we ask whether there exists a strategy σ such that (E^σ_s[mp], V^σ_s) ≤ (u, v).
• Approximation of strategy existence. For a given MDP G, an initial state s, a rational reward function r, a number ε > 0, and a point (u, v) ∈ Q², we want an algorithm which (a) outputs "yes" if there is a strategy σ such that (E^σ_s[mp], V^σ_s) ≤ (u − ε, v − ε), and (b) outputs "no" if there is no strategy σ such that (E^σ_s[mp], V^σ_s) ≤ (u, v).
• Strategy synthesis. If there is a strategy σ such that (E^σ_s[mp], V^σ_s) ≤ (u, v), we wish to compute such a strategy. Note that it is not a priori clear that σ is finitely representable, and hence we also need to answer the question what type of strategies is needed to achieve Pareto optimal points.
• Optimal performance with zero variance. Here we are interested in deciding whether there exists a Pareto point of the form (u, 0) and computing the value of u, i.e., the optimal expected mean-payoff achievable with "absolute stability" (note that the variance is always non-negative and its value 0 corresponds to stable behaviours).
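The following illustrative Python sketch (not part of the paper's formal development) estimates the three quantities from finite prefixes of sampled runs, replacing the lim sup in the definitions by the partial average over each prefix and E[mp] by the average over the sampled runs.

    # Illustrative sketch: estimate mp, lv and hv from finitely many sampled run
    # prefixes. The lim sup in the definitions is replaced by a partial average
    # over a long finite prefix, so the values are only approximations.

    def partial_mean_payoff(rewards):
        """Average reward of a finite run prefix (approximates mp(omega))."""
        return sum(rewards) / len(rewards)

    def partial_local_variance(rewards):
        """Average of (r_i - mp(omega))^2 over the prefix (approximates lv(omega))."""
        mp = partial_mean_payoff(rewards)
        return sum((r - mp) ** 2 for r in rewards) / len(rewards)

    def partial_hybrid_variance(rewards, expected_mp):
        """Average of (r_i - E[mp])^2 over the prefix (approximates hv(omega))."""
        return sum((r - expected_mp) ** 2 for r in rewards) / len(rewards)

    def estimate_measures(sampled_runs):
        """sampled_runs: list of reward sequences, one per sampled run prefix."""
        mps = [partial_mean_payoff(run) for run in sampled_runs]
        expected_mp = sum(mps) / len(mps)                                    # approximates E[mp]
        global_var = sum((m - expected_mp) ** 2 for m in mps) / len(mps)     # approximates V[mp]
        local_var = sum(partial_local_variance(run) for run in sampled_runs) / len(sampled_runs)
        hybrid_var = sum(partial_hybrid_variance(run, expected_mp) for run in sampled_runs) / len(sampled_runs)
        return expected_mp, global_var, local_var, hybrid_var

    # Example: the "unstable" run from Section I (6 Mbits every third second)
    # together with an "absolutely smooth" run of constant reward 2.
    print(estimate_measures([[6, 0, 0] * 100, [2] * 300]))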

Remark 1.
If the approximation of strategy existence problem is decidable, we design the following algorithm to approximate the Pareto curve up to an arbitrarily small given ε > 0. We compute a finite set of points P ⊆ Q² such that (1) for every point (u, v) of the Pareto curve there is a point (u′, v′) ∈ P with |u − u′| ≤ ε and |v − v′| ≤ ε, and (2) for every point (u′, v′) ∈ P there is a point (u, v) of the Pareto curve with |u − u′| ≤ ε and |v − v′| ≤ ε. Let R denote the maximal absolute value of a reward. Note that |E^σ_s[mp]| ≤ R and V^σ_s ≤ R² for an arbitrary strategy σ. Hence, the set P is computable by a naive algorithm which decides the approximation of strategy existence for O(R³/ε²) points in the corresponding ε-grid and puts O(R²/ε) points into P. The question whether the three Pareto curves can be approximated more efficiently by sophisticated methods based on deeper analysis of their properties is left for future work.
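A minimal sketch of this naive grid search, assuming a hypothetical oracle approx_exists(u, v, eps) implementing the approximation of strategy existence for the chosen variance notion (the oracle and its name are illustrative, not part of the paper):

    # Naive Pareto-curve approximation by grid search, assuming a hypothetical
    # oracle approx_exists(u, v, eps) for the approximation of strategy existence.

    def approximate_pareto(approx_exists, R, eps):
        """R: maximal absolute reward; eps: precision of the approximation."""
        candidates = []
        u = -R
        while u <= R:                      # expected mean-payoff axis
            v = 0.0
            while v <= R * R:              # variance axis
                if approx_exists(u, v, eps):
                    candidates.append((u, v))
                    break                  # smallest achievable v for this u
                v += eps
            u += eps
        # keep only points that are not dominated by another candidate point
        pareto = [p for p in candidates
                  if not any(q != p and q[0] <= p[0] and q[1] <= p[1] for q in candidates)]
        return pareto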

III. Global variance
In the rest of this paper, unless specified otherwise, we suppose we work with a fixed MDP G = (S, A, Act, δ) and a reward function r : A → Q. We start by proving that both memory and randomization are needed even for achieving non-Pareto points; this implies that memory and randomization are needed even to approximate the value of Pareto points. Then we show that 2-memory stochastic update strategies are sufficient, which gives a tight bound.

Example 1. Consider the MDP of Fig. 1. Observe that the point (4, 2) is achievable by a strategy σ which selects c with probability 4/5 and d with probability 1/5 upon the first visit to s_3; in every other visit to s_3, the strategy σ selects c with probability 1. Hence, σ is a 2-memory randomized strategy which stays in MEC C = ({s_3}, {c}) with probability 1/2.
Interestingly, if the MDP is strongly connected, memoryless deterministic strategies always suffice, because in this case a memoryless strategy that minimizes the expected mean-payoff immediately gets zero variance. This is in contrast with local and hybrid variance, where we will show that memory and randomization are required in general already for unichain MDPs. For the general case of global variance, the sufficiency of 2-memory strategies is captured by the following theorem.
Theorem 1. Let s ∈ S and (u, v) ∈ Q². If there is a strategy σ satisfying (E^σ_s[mp], V^σ_s[mp]) ≤ (u, v), then there is a 2-memory strategy with the same properties. Moreover, Pareto optimal strategies always exist, the problem whether there is a strategy achieving a point (u, v) is in PSPACE, and approximation of the answer can be done in pseudo-polynomial time.
Note that every C ∈ MEC(G) can be seen as a strongly connected MDP. By using standard linear programming methods (see, e.g., [18]; one such formulation is sketched below, after Proposition 1), for every C ∈ MEC(G) we can compute the minimal and the maximal expected mean-payoff achievable in C, denoted by α_C and β_C, in polynomial time (since C is strongly connected, the choice of initial state is irrelevant). Thus, we can also compute the system L of Fig. 2 in polynomial time. We show the following:

Proposition 1. Let s ∈ S and u, v ∈ R.
1) If there is a strategy ζ satisfying (E^ζ_s[mp], V^ζ_s[mp]) ≤ (u, v), then the system L of Fig. 2 has a solution.
2) If the system L of Fig. 2 has a solution, then there exist a 2-memory stochastic-update strategy σ and z ∈ R such that (E^σ_s[mp], V^σ_s[mp]) ≤ (u, v) and for every C ∈ MEC(G) we have the following: if α_C > z, then almost all runs of R_C have the mean-payoff α_C; if β_C < z, then almost all runs of R_C have the mean-payoff β_C; otherwise almost all runs of R_C have the mean-payoff z.

Observe that the existence of Pareto optimal strategies follows from the above proposition, since the points (u, v) that some strategy can achieve are given by a continuous function from the values x_C and ∑_{t∈S∩C} y_t, for C ∈ MEC(G), to R². Because the domain is bounded (all x_C and ∑_{t∈S∩C} y_t have minimal and maximal values they can achieve) and closed (the points of the domain are expressible as a projection of feasible solutions of a linear program), it is also compact, and a continuous image of a compact set is compact [19], and hence closed.
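The following is one standard linear-programming formulation (a sketch, not necessarily the exact system used in [18]) for computing the minimal expected mean-payoff α_C of a strongly connected MDP C with states T and actions B; the maximal value β_C is obtained by maximizing instead:

  minimize   ∑_{a∈B} x_a · r(a)
  subject to ∑_{a∈B} x_a = 1,
             ∑_{a∈Act(t)} x_a = ∑_{a∈B} x_a · δ(a)(t)   for all t ∈ T,
             x_a ≥ 0                                    for all a ∈ B.

Here x_a can be read as the long-run frequency of action a; an optimal solution induces a memoryless randomized strategy whose expected mean-payoff is α_C.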
Let us briefly sketch the proof of Proposition 1, which combines new techniques with results of [4], [13]. We start with Item 1. Let ζ be a strategy satisfying (E^ζ_s[mp], V^ζ_s[mp]) ≤ (u, v). First, note that almost every run of G^ζ_s eventually stays in some MEC of G by Lemma 1. The way how ζ determines the values of all y_κ, where κ ∈ S ∪ A, is exactly the same as in [4] and is based on the ideas of [13]. The details are given in Appendix A1. The important property preserved is that for every C ∈ MEC(G) and every state t ∈ S ∩ C, the value of y_t corresponds to the probability that a run stays in C and enters C via the state t. Hence, ∑_{t∈S∩C} y_t is the probability that a run of G^ζ_s eventually stays in C. The way how ζ determines the value of y_a, where a ∈ A, is explained in Appendix A1. The value of x_C is the conditional expected mean-payoff under the condition that a run stays in C, i.e., (4) and (5) are satisfied. Further, E^ζ_s[mp] = ∑_{C∈MEC(G)} x_C · ∑_{t∈S∩C} y_t, and hence (6) holds. Note that V^ζ_s[mp] is not necessarily equal to the right-hand side of (7), and hence it is not immediately clear why (7) should hold. Here we need the following lemma (a proof is given in Appendix A2):

Lemma 2. Let C ∈ MEC(G) and let z_C ∈ R satisfy α_C ≤ z_C ≤ β_C. Then there exists a memoryless randomized strategy σ^{z_C} such that for every state t ∈ C ∩ S we have that P^{σ^{z_C}}_t[mp = z_C] = 1.
Using Lemma 2, we can define another strategy ζ′ from ζ such that for every C ∈ MEC(G) we have P^{ζ′}_s[R_C] = P^ζ_s[R_C], E^{ζ′}_s[mp | R_C] = x_C, and V^{ζ′}_s[mp | R_C] = 0, and therefore (1)-(6) also hold if we use ζ′ instead of ζ to determine the values of all variables. Further, the right-hand side of (7) is equal to V^{ζ′}_s[mp], and hence (7) holds. This completes the proof of Item 1.
Item 2 is proved as follows. Let y_κ, where κ ∈ S ∪ A, and x_C, where C ∈ MEC(G), be a solution of L. For every C ∈ MEC(G), we put y_C = ∑_{t∈S∩C} y_t. By using the results of Sections 3 and 5 of [13] and the modifications presented in [4], we first construct a finite-memory stochastic update strategy ̺ such that the probability of R_C in G^̺_s is equal to y_C. Then, we construct a strategy σ̄ which plays according to ̺ until a bottom strongly connected component B of G^̺_s is reached. Observe that the set of all states and actions which appear in B is a subset of some C ∈ MEC(G). From that point on, the strategy σ̄ "switches" to the memoryless randomized strategy σ^{x_C} of Lemma 2. Hence, the expected mean-payoff and the global variance of σ̄ are equal to the right-hand sides of (6) and (7), respectively, and thus (E^{σ̄}_s[mp], V^{σ̄}_s[mp]) ≤ (u, v). Note that σ̄ may use more than 2 memory elements. A 2-memory strategy is obtained by modifying the initial part of σ̄ (i.e., the part before the switch) into a memoryless strategy in the same way as in [4]. Then, σ̄ only needs to remember whether a switch has already been performed or not, and hence 2 memory elements are sufficient. Finally, we transform σ̄ into another 2-memory stochastic update strategy σ which satisfies the extra conditions of Item 2 for a suitable z. This is achieved by modifying the behaviour of σ̄ in some MECs so that the probability of staying in every MEC is preserved, the expected mean-payoff is also preserved, and the global variance can only decrease. This part is somewhat tricky and the details are given in Appendix A.
We can solve the strategy existence problem by encoding the existence of a solution to L as a closed formula Φ of the existential fragment of (R, +, * , ≤). Since Φ is computable in polynomial time and the existential fragment of (R, +, * , ≤) is decidable in polynomial space [5], we obtain Theorem 1.
The pseudo-polynomial-time approximation algorithm is obtained as follows. First note that if we had the number z above, we could simplify the system L of Fig. 2 by substituting all x_C variables with constants. Then, (4) and (5) can be eliminated, (6) becomes a linear constraint, and (7) the only quadratic constraint. Thus, the system L can be transformed into a quadratic program L_z in which the quadratic constraint is negative semi-definite with rank 1 (see Appendix A5), and hence approximated in polynomial time [23]. Since we do not know the precise number z, we try different candidates ẑ, namely we approximate the value (to the precision ε/2) of L_ẑ for all numbers ẑ between min_{a∈A} r(a) and max_{a∈A} r(a) that are a multiple of τ = ε/(8 · max{N, 1}), where N is the maximal absolute value of an assigned reward. If any L_ẑ has a solution lower than u − ε/2, we output "yes", otherwise we output "no". The correctness of the algorithm is proved in Appendix A6.
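A schematic view of this search over candidate values ẑ (the function solve_Lz is a hypothetical approximate solver for the rank-1 negative semi-definite quadratic program L_ẑ, as in [23]; the acceptance test simply mirrors the description above):

    # Schematic enumeration of candidates z_hat for the pseudo-polynomial
    # approximation algorithm; solve_Lz is a hypothetical approximate solver
    # for the quadratic program L_{z_hat}.

    def approximate_global_variance(rewards, u, eps, solve_Lz):
        r_min, r_max = min(rewards), max(rewards)
        N = max(abs(r) for r in rewards)
        tau = eps / (8 * max(N, 1))            # step of the candidate grid
        z_hat = r_min
        while z_hat <= r_max:
            value = solve_Lz(z_hat, eps / 2)   # approximate optimum of L_{z_hat}
            if value is not None and value < u - eps / 2:
                return "yes"
            z_hat += tau
        return "no"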
Note that if we knew the constant z, we would even get that the approximation problem can be solved in polynomial time (assuming that the number of digits in z is polynomial in the size of the problem instance). Unfortunately, our proof of Item 2 does not give a procedure for computing z, and we cannot even conclude that z is rational. We conjecture that the constant z can actually be chosen as a rational number with a small number of digits (which would immediately lower the complexity of strategy existence to NP using the results of [22] for solving negative semi-definite quadratic programs). Also note that Remark 1 and Theorem 1 immediately yield the following result.

Corollary 1. The approximate Pareto curve for global variance can be computed in pseudo-polynomial time.

IV. Local variance
In this section we analyse the problem for local variance. As before, we start by showing lower bounds on the memory needed by strategies, and then provide an upper bound together with an algorithm computing a Pareto optimal strategy. As in the case of global variance, Pareto optimal strategies require both randomization and memory; however, in contrast to global variance, where deterministic memoryless strategies are sufficient for unichain MDPs, we show in the following example that for local variance both memory and randomization are required even for unichain MDPs.

Example 2. Consider the MDP from Figure 3, and consider a strategy σ that in the first step in s_1 makes a random choice uniformly between a and b, and then, whenever the state s_1 is revisited, chooses the action that was chosen in the first step. The expected mean-payoff under this strategy is 0.5·2 + 0.5·1 = 1.5, and the expected local variance is 0.5·(0.5·(0 − 1)² + 0.5·(2 − 1)²) + 0.5·(2 − 2)² = 0.5. We show that the point (1.5, 0.5) cannot be achieved by any memoryless randomized strategy σ′ (the computation is given in Appendix B1). Insufficiency of deterministic history-dependent strategies is proved using the same equations and the fact that there is only one run under such a strategy.

Thus we have shown that memory and randomization are needed to achieve the non-Pareto point (1.55, 0.6). The need of memory and randomization to achieve Pareto points will follow later from the fact that there always exist Pareto optimal strategies.
In the remainder of this section we prove the following.
Theorem 2. Let s ∈ S and (u, v) ∈ Q². If there is a strategy σ satisfying (E^σ_s[mp], E^σ_s[lv]) ≤ (u, v), then there is a 3-memory strategy with the same properties. The problem whether such a strategy exists belongs to NP. Moreover, Pareto optimal strategies always exist.
We start by proving that 3-memory stochastic update strategies achieve all achievable points wrt. local variance.

Proposition 2.
For every strategy ζ there is a 3-memory stochastic-update strategy σ satisfying (E^σ_{s_0}[mp], E^σ_{s_0}[lv]) ≤ (E^ζ_{s_0}[mp], E^ζ_{s_0}[lv]). Moreover, the three memory elements of σ, say m_1, m_2, m′_2, satisfy the following:
• The memory element m_1 is initial; in m_1, the strategy σ may randomize and may stochastically update its memory either to m_2, or to m′_2.
• In m_2 and m′_2, the strategy σ behaves deterministically and never changes its memory.
In what follows we sometimes treat each MEC C as a standalone MDP obtained by restricting G to C. Then, for example, C κ denotes the Markov chain obtained by applying the strategy κ to the component C.
The next proposition formalizes the main idea of our proof.

Proposition 3. Let C be a MEC. There are two frequency functions f_C : C → R and f′_C : C → R on C, and a number p_C ∈ [0, 1], such that

  p_C · mp(f_C) + (1 − p_C) · mp(f′_C) ≤ E^ζ_{s_0}[mp | R_C]  and  p_C · lv(f_C) + (1 − p_C) · lv(f′_C) ≤ E^ζ_{s_0}[lv | R_C].

The proposition is proved in Appendix B2, where we first show that it follows from a relaxed version of the proposition which gives us, for any ε > 0, frequency functions f_ε and f′_ε and a number p_ε such that the above inequalities hold up to an additive error of ε. Then we show that the weaker version holds by showing that there are runs ω from which we can extract the frequency functions f_ε and f′_ε. The selection of runs is rather involved, since it is not clear a priori which runs to pick or even how to extract the frequencies from them (note that the naive approach of considering the average ratio of taking a given action a does not work, since the averages might not be defined).
Proposition 3 implies that any expected mean-payoff and local variance achievable on a MEC C can be achieved by a composition of two memoryless randomized strategies giving precisely the frequencies of actions specified by f_C and f′_C (note that lv(f_C) and lv(f′_C) may not be equal to the expected local variance of such strategies, but we show that the "real" expected local variance cannot be larger). By further selecting BSCCs of these strategies and using some de-randomization tricks we obtain, for every MEC C, two memoryless deterministic strategies π_C and π′_C and a constant h_C such that combining π_C (with weight h_C) and π′_C (with weight 1 − h_C) achieves the desired value from every s ∈ C ∩ S. We define two memoryless deterministic strategies π and π′ that in every C behave as π_C and π′_C, respectively. Details of the steps above are postponed to Appendix B3.
Using similar arguments as in [4] (that in turn depend on results of [13]) one may show that there is a 2-memory stochastic update strategy σ′, with two memory locations m_1, m_2, satisfying the following properties: In m_1, the strategy σ′ may randomize and may stochastically update its memory to m_2. In m_2, the strategy σ′ never changes its memory. Most importantly, the probability that σ′ updates its memory from m_1 to m_2 in a given MEC C is equal to P^ζ_{s_0}[R_C]. We modify the strategy σ′ to the desired 3-memory σ by splitting the memory element m_2 into two elements m_2, m′_2. Whenever σ′ updates to m_2, the strategy σ further chooses randomly whether to update either to m_2 (with prob. h_C), or to m′_2 (with prob. 1 − h_C). Once in m_2 or m′_2, the strategy σ never changes its memory and plays according to π or π′, respectively. For every MEC C, the resulting strategy σ then achieves the required bounds on the conditional expected mean-payoff and local variance, as shown in Appendix B4. Proposition 2 combined with results of [4] allows us to finish the proof of Theorem 2.
Proof (of Theorem 2): Intuitively, the non-deterministic polynomial-time algorithm works as follows: First, guess two memoryless deterministic strategies π and π′. Then verify whether there is a 3-memory stochastic update strategy σ with memory elements m_1, m_2, m′_2 which in m_2 behaves as π, in m′_2 behaves as π′, and achieves the desired point. Note that it suffices to compute the probability distributions chosen by σ in the memory element m_1 and the probabilities of updating to m_2 and m′_2. This can be done by a reduction to the controller synthesis problem for two-dimensional mean-payoff objectives studied in [4].
More concretely, we construct a new MDP G[π, π′] together with a two-dimensional reward function. (Intuitively, the actions [π] and [π′] simulate the update of the memory to m_2 and to m′_2, respectively, in σ. As σ is supposed to behave in a fixed way in m_2 and m′_2, we do not need to simulate its behaviour in these states in G[π, π′]. Hence, G[π, π′] just loops under the action default in the states (s, m_2) and (s, m′_2). The action default is also used in the initial state to denote that the initial memory element is m_1.) The probabilistic transition function δ′ and the reward function are defined accordingly, with r(s_in) = r((s, m_1)) := (max_{a∈A} r(a) + 1, (max_{a∈A} r(a) − min_{a∈A} r(a))² + 1). (Here the rewards are chosen in such a way that no (Pareto) optimal scheduler can stay in the states of the form (s, m_1) with positive probability.) Note that r can be computed in polynomial time using standard algorithms for computing mean-payoff in Markov chains [17]. We then show that the desired strategy σ exists iff there is a suitable memoryless strategy ρ in G[π, π′], and that such ρ can be computed in polynomial time using results of [4]. Finally, it is straightforward to move the second component of the states of G[π, π′] to the memory of a stochastic update strategy, which gives a 3-memory stochastic update strategy σ for G with the desired properties. Thus a non-deterministic polynomial-time algorithm works as follows: (1) guess π, π′; (2) construct G[π, π′] and r; (3) compute ρ (if it exists). As noted above, ρ can be transformed to the 3-memory stochastic update strategy σ in polynomial time.
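Schematically, the guess-and-check structure of this algorithm can be summarised as follows (all helper routines are hypothetical placeholders standing for the constructions described above; in particular, the non-deterministic guessing is modelled by the supplied helpers):

    # High-level outline of the NP algorithm for local variance; the helpers
    # (guessing of memoryless deterministic strategies, the product construction
    # G[pi, pi'], two-dimensional mean-payoff synthesis from [4]) are hypothetical.

    def local_variance_decision(mdp, s, u, v,
                                guess_memoryless_deterministic,
                                build_product_mdp, solve_two_dim_mean_payoff):
        pi = guess_memoryless_deterministic(mdp)        # candidate behaviour in memory m_2
        pi_prime = guess_memoryless_deterministic(mdp)  # candidate behaviour in memory m_2'
        product = build_product_mdp(mdp, pi, pi_prime)  # G[pi, pi'] with its reward function
        # A memoryless strategy rho in the product captures sigma's behaviour in m_1
        # together with the probabilities of updating to m_2 and m_2'.
        rho = solve_two_dim_mean_payoff(product, s, (u, v))
        return rho is not None                          # "yes" iff some rho achieves (u, v)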
Finally, we can show that Pareto optimal strategies exist by a reasoning similar to the one used for global variance.
Theorem 2 and Remark 1 give the following corollary.

Corollary 2.
The approximate Pareto curve for local variance can be computed in exponential time.

V. Hybrid variance
We start by showing that memory or randomization is needed for Pareto optimal strategies in unichain MDPs for hybrid variance, and then show that both memory and randomization are required for hybrid variance in general MDPs. Consider again the MDP from Figure 3 and the two memoryless deterministic strategies which always play a and always play b, respectively. A memoryless randomized strategy σ which randomizes uniformly between a and b yields the expectation 1.5 and a variance which makes it incomparable to either of the memoryless deterministic strategies. Similarly, the deterministic strategy which alternates between a and b on subsequent visits of s_1 yields the same values as σ above. This gives us that memory or randomization is needed even to achieve the non-Pareto point (1.6, 0.8).
Before proceeding with general MDPs, we give the following proposition, which states an interesting and important relation between the three notions of variance 3 . The proposition is proved in Appendix C1.
Now we can show that both memory and randomization are needed, by extending Example 1.
Theorem 3. Let s ∈ S and (u, v) ∈ Q². If there is a strategy σ satisfying (E^σ_s[mp], E^σ_s[hv]) ≤ (u, v), then there is a 2-memory strategy with the same properties. The problem whether such a strategy exists belongs to NP, and approximation of the answer can be done in polynomial time. Moreover, Pareto optimal strategies always exist.
We start by proving that 2-memory stochastic update strategies are sufficient for Pareto optimality wrt. hybrid variance.

Proposition 5. Let s ∈ S and u, v ∈ R.
1) If there is a strategy ζ satisfying (E^ζ_s[mp], E^ζ_s[hv]) ≤ (u, v), then the system L_H (Fig. 4) has a non-negative solution.
2) If there is a non-negative solution for the system L_H (Fig. 4), then there is a 2-memory stochastic-update strategy σ satisfying (E^σ_s[mp], E^σ_s[hv]) ≤ (u, v).

Notice that we get the existence of Pareto optimal strategies as a side product of the above proposition, similarly to the case of global variance. We briefly sketch the main ingredients of the proof of Proposition 5. We first establish the sufficiency of finite-memory strategies by showing that for an arbitrary strategy ζ there is a 3-memory stochastic update strategy σ such that (E^σ_s[mp], E^σ_s[hv]) ≤ (E^ζ_s[mp], E^ζ_s[hv]). The key idea of the construction of a 3-memory stochastic update strategy σ from an arbitrary strategy ζ is similar to the proof of Proposition 2. The details are in Appendix C2. We then focus on finite-memory strategies. For a finite-memory strategy ζ, the frequencies of actions are well defined; for an action a ∈ A, let f(a) ≔ lim_{ℓ→∞} (1/ℓ) · ∑_{i=1}^{ℓ} P^ζ_s[a_i = a] denote the frequency of action a (where a_i is the i-th action of a run). We show that setting x_a ≔ f(a) for all a ∈ A satisfies Eqns. (12), (13), and (14) of L_H. To obtain y_a and y_s, we define them in the same way as done in [4, Proposition 2] using the results of [13]. The details are postponed to Appendix C3. This completes the proof of the first item. The proof of the second item is as follows: the construction of a 2-memory stochastic update strategy σ from the constraints of the system L_H (other than the constraint of Eqn. (14)) was presented in [4, Proposition 1]. The key argument to show that the strategy σ also satisfies Eqn. (14) is obtained by establishing that for the strategy σ we have E^σ_s[hv] = E^σ_s[mp_{r²}] − (E^σ_s[mp])², where mp_{r²} is the value of mp w.r.t. the reward function r² defined by r²(a) = r(a)²; the equality is shown in Appendix C4. It follows immediately that Eqn. (14) is satisfied. This completes the proof of Proposition 5. Finally, we show that for the quadratic program defined by the system L_H, the quadratic constraint satisfies the conditions of negative semi-definite programming with a matrix of rank 1 (see Appendix C5). Since negative semi-definite programs can be decided in NP [22] and with the additional restriction of rank 1 can be approximated in polynomial time [23], we get the complexity bounds of Theorem 3. Finally, Theorem 3 and Remark 1 give the following result.

Corollary 3. The approximate Pareto curve for hybrid variance can be computed in pseudo-polynomial time.

VI. Zero variance with optimal performance

Now we present polynomial-time algorithms to compute the optimal expectation that can be ensured along with zero variance. The results are captured in the following theorem.

Theorem 4. The minimal expectation that can be ensured
1) with zero hybrid variance can be computed in O((|S| · |A|)²) time using discrete graph theoretic algorithms; 2) with zero local variance can be computed in PTIME; 3) with zero global variance can be computed in PTIME.
Hybrid variance. The algorithm for zero hybrid variance is as follows: (1) Order the rewards in an increasing sequence β_1 < β_2 < . . . < β_n; (2) find the least i such that, with A_i denoting the set of actions with reward β_i, it can be ensured with probability 1 (almost surely) that eventually only actions in A_i are visited, and output β_i; and (3) if no such i exists, output "NO" (i.e., zero hybrid variance cannot be ensured). Since almost-sure winning for MDPs with eventually-always properties (i.e., eventually only actions in A_i are visited) can be decided in quadratic time with discrete graph theoretic algorithms [7], [6], we obtain the first item of Theorem 4. The correctness is proved in Appendix D1.

Local variance. For zero local variance, we make use of the previous algorithm. The intuition is that to minimize the expectation with zero local variance, a strategy σ needs to reach states s in which zero hybrid variance can be ensured by strategies σ_s, and then mimic them. Moreover, σ minimizes the expected value of mp among all possible behaviours satisfying the above. The algorithm is as follows: (1) Use the algorithm for zero hybrid variance to compute a function β that assigns to every state s the minimal expectation value β(s) that can be ensured along with zero hybrid variance when starting in s; if zero hybrid variance cannot be ensured, then β(s) is assigned +∞. Let M = 1 + max_{s∈S} β(s). (2) Construct an MDP Ḡ as follows: for each state s such that β(s) < ∞ we add a state s̄ with a self-loop on it, and we add a new action a_s that leads from s to s̄. (3) Assign the reward β(s) − M to a_s, and 0 to all other actions. Let T = {a_s | β(s) < ∞} be the target set of actions. (4) Compute a strategy that minimizes the cumulative reward and ensures almost-sure (probability 1) reachability to T in Ḡ. Let γ(s) denote the minimal expected cumulative reward; the answer for the initial state s is then γ(s) + M. In Appendix D2 we show that this value is the minimal expectation that can be ensured with zero local variance, and that every step of the above computation can be achieved in polynomial time. This gives us the second item of Theorem 4.

Global variance. The basic intuition for zero global variance is that we need to find the minimal number y such that there is an almost-sure winning strategy to reach the MECs where the expectation exactly y can be ensured with zero variance.
The algorithm works as follows: (1) Compute the MEC decomposition of the MDP and let the MECs be C_1, C_2, . . . , C_n.
(2) For every MEC C_i compute the minimal expectation α_{C_i} = inf_σ min_{s∈C_i} E^σ_s[mp] and the maximal expectation β_{C_i} = sup_σ max_{s∈C_i} E^σ_s[mp] that can be ensured in the MDP induced by the MEC C_i. (3) Sort the values α_{C_i} in non-decreasing order as ℓ_1 ≤ ℓ_2 ≤ . . . ≤ ℓ_n. (4) Find the least i such that (a) 𝒞_i = {C_j | α_{C_j} ≤ ℓ_i ≤ β_{C_j}} is the set of MECs whose interval contains ℓ_i; and (b) almost-sure (probability 1) reachability to the set ⋃_{C_j∈𝒞_i} C_j (the union of the MECs in 𝒞_i) can be ensured; and output ℓ_i. (5) If no such i exists, then the answer to zero global variance is "NO" (i.e., zero global variance cannot be ensured). All the above steps can be computed in polynomial time. The correctness is proved in Appendix D3, and we obtain the last item of Theorem 4.
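A compact sketch of this procedure in Python, with the MEC decomposition, the computation of α_C and β_C (e.g., via mean-payoff linear programs), and the almost-sure reachability check left as hypothetical helper functions:

    # Sketch of the zero-global-variance algorithm; mec_decomposition,
    # min_expected_mp, max_expected_mp and almost_sure_reach are hypothetical
    # helpers (MEC decomposition, mean-payoff LPs, almost-sure reachability).

    def zero_global_variance(mdp, s,
                             mec_decomposition, min_expected_mp,
                             max_expected_mp, almost_sure_reach):
        mecs = mec_decomposition(mdp)                    # step (1): list of MECs (sets of states)
        alpha = [min_expected_mp(mdp, C) for C in mecs]  # step (2): minimal expectations
        beta = [max_expected_mp(mdp, C) for C in mecs]   #           maximal expectations
        for level in sorted(alpha):                      # steps (3)-(4)
            candidates = [C for C, a, b in zip(mecs, alpha, beta) if a <= level <= b]
            if not candidates:
                continue
            target = set().union(*candidates)            # union of the candidate MECs
            if almost_sure_reach(mdp, s, target):
                return level     # minimal expectation achievable with zero global variance
        return None              # step (5): zero global variance cannot be ensured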

VII. Conclusion
We studied three notions of variance for MDPs with mean-payoff objectives: global (the standard one), local and hybrid variance. We established the strategy complexity (i.e., the memory and randomization required) of Pareto optimal strategies.

Appendix

A. Proofs for Global Variance

1) Obtaining values y_κ for κ ∈ S ∪ A in Item 1 of Proposition 1: Let G be an MDP, and let G′ be obtained from G by adding a state d_s for every state s ∈ S, and an action a_s that leads to d_s from s.

Lemma 3. Let σ be a strategy for G. Then there is a strategy σ̄ in G′ such that for every C ∈ MEC(G) we have P^σ_{s_in}[R_C] = P^{σ̄}_{s_in}[⋃_{s∈C∩S} Reach(d_s)].
Proof: We give a proof by contradiction. Let C_1, . . . , C_n be all MECs of G, and let X ⊆ R^n be the set of all points (x_1, . . . , x_n) for which there is a strategy σ′ in G′ such that P^{σ′}_{s_in}[⋃_{s∈C_i∩S} Reach(d_s)] = x_i for all 1 ≤ i ≤ n. Let (y_1, . . . , y_n) be the numbers such that P^σ_{s_in}[R_{C_i}] = y_i for all 1 ≤ i ≤ n. For contradiction, suppose (y_1, . . . , y_n) ∉ X. By [13, Theorem 3.2] the set X can be described as a set of solutions of a linear program, and hence it is convex. By the separating hyperplane theorem (see, e.g., [3]) there are non-negative weights w_1, . . . , w_n such that ∑_{i=1}^{n} y_i · w_i > ∑_{i=1}^{n} x_i · w_i for every (x_1, . . . , x_n) ∈ X. We define a reward function r by r(a) = w_i for an action a from C_i, where 1 ≤ i ≤ n, and r(a) = 0 for actions not in any MEC. Observe that the mean-payoff of any run that eventually stays in a MEC C_i is w_i, and so the expected mean-payoff w.r.t. r under σ is ∑_{i=1}^{n} y_i · w_i. Because memoryless deterministic strategies suffice for maximizing the expected mean-payoff, there is also a memoryless deterministic strategy σ̂ for G that yields the expected mean-payoff w.r.t. r equal to z ≥ ∑_{i=1}^{n} y_i · w_i. We now define a strategy σ̄ for G′ to mimic σ̂ until a BSCC is reached, and when a BSCC is reached, say along a path w, the strategy σ̄ takes the action a_{last(w)}. Let x_i = P^{σ̄}_{s_in}[⋃_{s∈C_i} Reach(d_s)]. Due to the construction of σ̄ we have x_i = P^{σ̂}_{s_in}[R_{C_i}]: this follows because once a BSCC is reached on a path w, every run ω extending w has an infinite suffix containing only the states of the MEC containing the state last(w). Hence ∑_{i=1}^{n} x_i · w_i = z. However, by the choice of the weights w_i we get that (x_1, . . . , x_n) ∉ X, and hence a contradiction, because σ̄ witnesses that (x_1, . . . , x_n) ∈ X.
Let ζ be the strategy from Item 1 of Proposition 1. By the above lemma there is a strategy ζ′ for G′ such that P^{ζ′}_{s_in}[⋃_{s∈C∩S} Reach(d_s)] = P^ζ_s[R_C] for every C ∈ MEC(G).

2) Proof of Lemma 2: Let σ_1 and σ_2 be memoryless deterministic strategies that minimize and maximize the expectation, respectively, and only yield one BSCC for any initial state. Let σ′ be an arbitrary memoryless randomized strategy that visits every action in C with nonzero frequency (such a strategy clearly exists). We define the strategy σ^{z_C} as follows. If z_C = ∑_{a∈C∩A} f_{σ′}(a) · r(a), then σ^{z_C} = σ′. If z_C > ∑_{a∈C∩A} f_{σ′}(a) · r(a), then, because also z_C ≤ ∑_{a∈C∩A} f_{σ_2}(a) · r(a), there must be a number p ∈ (0, 1] such that z_C = p · ∑_{a∈C∩A} f_{σ′}(a) · r(a) + (1 − p) · ∑_{a∈C∩A} f_{σ_2}(a) · r(a). We define numbers z_a = p · f_{σ′}(a) + (1 − p) · f_{σ_2}(a) for all a ∈ C ∩ A. Observe that we have, for any s ∈ C ∩ S,

  ∑_{a∈C∩A} z_a · δ(a)(s) = p · ∑_{a∈C∩A} f_{σ′}(a) · δ(a)(s) + (1 − p) · ∑_{a∈C∩A} f_{σ_2}(a) · δ(a)(s) = p · ∑_{a∈Act(s)} f_{σ′}(a) + (1 − p) · ∑_{a∈Act(s)} f_{σ_2}(a) = ∑_{a∈Act(s)} z_a.

Hence, there is a memoryless randomized strategy σ^{z_C} which visits a with frequency z_a, hence giving the expectation ∑_{a∈C∩A} z_a · r(a) = z_C. If z_C < ∑_{a∈C∩A} f_{σ′}(a) · r(a), we proceed similarly, this time combining σ′ with σ_1 instead of σ_2.
3) Showing that V^ζ_s[mp] ≥ V^{ζ′}_s[mp]: Since by the law of total variance V(Z) = E(V(Z|Y)) + V(E(Z|Y)) for all random variables Y, Z, we have for σ ∈ {ζ, ζ′}:

  V^σ_s[mp] = ∑_{C∈MEC(G)} P^σ_s[R_C] · V^σ_s[mp | R_C] + V(X),

where X is the random variable which to every run that eventually stays in a MEC C assigns E^σ_s[mp | R_C]. Note that these random variables are equal for both ζ and ζ′, and so also the second summands in the equation above are equal for ζ and ζ′. In the first summand, all the values V^ζ_s[mp | R_C] are non-negative, while V^{ζ′}_s[mp | R_C] are zero. Hence the variance can only decrease when we go from ζ to ζ′.
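As a small numerical illustration (with hypothetical numbers), suppose a strategy stays in two MECs with probability 1/2 each, with conditional expected mean-payoffs 1 and 3 and conditional variances v_1 and v_2. Then the decomposition above gives

  V[mp] = (v_1 + v_2)/2 + ((1 − 2)² + (3 − 2)²)/2 = (v_1 + v_2)/2 + 1,

so replacing ζ by ζ′, which makes v_1 = v_2 = 0 while keeping the reachability probabilities and the conditional expectations, can only decrease the variance, exactly as argued above.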

4) From σ̄ to σ:
In the construction of σ we employ the following technical lemma.

Lemma 4. Let A be a finite set, X, Y :
A → R be random variables, a 1 , a 2 ∈ A and d > 0 a number satisfying the following: Proof: Let us fix the following notation: For expectation, we have For variance, we need to show that which boils down to showing that We have and so we need to show that the term on the last line is not positive. It is equal to and hence we need to show that d + 2(e 1 − e 2 ) + p 1 p 2 · d is not positive, which is the case, because by the assumption we have (e 2 − e 1 ) = Y(a 2 ) + p 1 Letσ be the strategy from page 6, i.e. for every MEC C there is a number x C such that mp(ω) = x C for almost every run from R C . Let us fix arbitrary z, and let C(z, σ) be the set of all the MECs which satisfy: We create a sequence of strategies σ 0 , σ 1 . . . and numbers z 0 , z 1 , . . . by starting with σ 0 =σ, z 0 = z and creating σ k+1 and z k+1 from σ k and z k as follows, finishing the sequence with a desired strategy σ. First, until possible, we repeat the following step.
If there are MECs C i and C j in C(z k , σ k ) such that x C i < z and We construct a 2-memory strategy σ k+1 that preserves the probabilities of σ k to reach each of the MECs, satisfies E σ k+1 s mp | R C = E σ k s mp | R C and V σ k+1 s mp | R C = 0 for every MEC C different from C i and C j , and also satisfies E σ k+1 We also define z k+1 = z k . By Lemma 4 the resulting strategy σ k+1 satisfies E σ k+1 s mp = E σ k s mp and V σ k+1 s mp ≤ V σ k s mp . Also, C(z k+1 , σ k+1 ) C(z k , σ k ), because one of the MECs C i and C j does not satisfy the defining condition of C and no new MEC satisfies it.
Once it is not possible to perform the above, we either got C(z k+1 , σ k+1 ) = ∅ (in which case we put σ = σ k+1 and we are done) or exactly one of the following takes place: there is a MEC C in C(z k+1 , σ k+1 ) such that x C > z or there is a MEC C in C(z k+1 , σ k+1 ) such that x C < z. Depending on which of these two happen, we continue building the sequence of strategies and numbers using one of the following items, until possible.
. By Lemma 4 the resulting strategy satisfies E σ k+1 s mp = E σ k s mp and V σ ′ s mp ≤ V σ k s mp . One of the following also takes place: We set z k+1 = z k and continue, if possible. • If there is a MEC C such that x C < z we proceed similarly as in the above item. Note that the above procedure eventually terminates, because in every step either C(z i+1 , σ i+1 ) ⊆ C(z i , σ i ), and for m = |MEC(G)| we have C(z i+m , σ i+m ) C(z i+1 , σ i+1 ), because if C(z i+1 , σ i+1 ) = C(z i , σ i ), then D(z i+1 , σ i+1 ) D(z i , σ i ) and |D(·, ·)| ≤ m.

5) Solving L_ẑ in polynomial time:
Lemma 5. Let n ∈ N and m_i ∈ N for every 1 ≤ i ≤ n. For all 1 ≤ i ≤ n and 1 ≤ j ≤ m_i, we use ⟨i, j⟩ to denote the index j + ∑_{ℓ=1}^{i−1} m_ℓ. Consider a function f :

Then f(v⃗) can be written as f(v⃗) = v⃗ᵀ Q v⃗ + d⃗ᵀ v⃗, where Q is a negative semi-definite matrix of rank 1 and d⃗ ∈ R^k. Consequently, f(v⃗) is concave and Q has exactly one nonzero eigenvalue.
Proof: Observe that every vector u⃗ ∈ R^k can be written as u⃗ᵀ = (u_{1,1}, . . . , u_{1,m_1}, · · · , u_{n,1}, . . . , u_{n,m_n}). Let u⃗ ∈ R^k be a (fixed) vector such that u_{i,j} = −c_i. Then the ⟨i′, j′⟩-th column of Q is equal to c_{i′} · u⃗, which means that the rank of Q is 1. The matrix Q is negative semi-definite because v⃗ᵀ Q v⃗ ≤ 0 for every v⃗ ∈ R^k.

6) Correctness of the approximation algorithm: Assume there is a strategy σ such that (E^σ_s[mp], V^σ_s[mp]) ≤ (u − ε, v − ε), let z be the number from Item 2, and let us fix a valuation ȳ_κ of the variables y_κ, where κ ∈ S ∪ A, from the equations of the system L (see Figure 2). Let ẑ be a number between the minimal and the maximal assigned reward that is a multiple of τ and satisfies |z − ẑ| < τ. Such a number must exist. We show that the system L_ẑ has a solution. The valuation ȳ_κ can be applied to the system L_ẑ, and the resulting constraints are satisfied. Hence we have shown that there is a solution for L_ẑ, and so the algorithm returns "yes".
On the other hand, if there is no strategy such that (E^σ_s[mp], V^σ_s[mp]) ≤ (u, v), then the algorithm clearly returns "no".

B. Proofs for Local Variance

1) Computation for Example 2:

Throughout this section we use the following three simple lemmas. The first one allows us to reduce convex combinations of two-dimensional vectors (typically vectors consisting of the mean-payoff and variance) to combinations of just two vectors.

Lemma 6. Let (a_1, b_1), (a_2, b_2), . . . , (a_m, b_m) be a sequence of points in R², let c_1, c_2, . . . , c_m ∈ (0, 1] satisfy ∑_{i=1}^{m} c_i = 1, and let (x, y) = ∑_{i=1}^{m} c_i · (a_i, b_i). Then there are two vectors (a_k, b_k) and (a_ℓ, b_ℓ) and a number p ∈ [0, 1] such that p · (a_k, b_k) + (1 − p) · (a_ℓ, b_ℓ) ≤ (x, y).

Proof: Let H = {(a_1, b_1), . . . , (a_m, b_m)}. If all the points of H lie on the same line, then clearly there must be some (a_k, b_k) ≤ (x, y). Assume that this is not true. Then the convex hull C(H) of H is a convex polygon whose vertices are some of the points of H. Consider the point (x′, y) where x′ = min{x″ | (x″, y) ∈ C(H)}; clearly x′ ≤ x. The point (x′, y) lies on the boundary of C(H) and thus, as C(H) is a convex polygon, (x′, y) lies on the line segment between two vertices, say (a_k, b_k) and (a_ℓ, b_ℓ), of C(H). Thus there is p ∈ [0, 1] such that p · (a_k, b_k) + (1 − p) · (a_ℓ, b_ℓ) = (x′, y) ≤ (x, y). This finishes the proof.

The following lemma shows how to minimize the mean square deviation (of which our notion of variance is a special case).

Lemma 7.
Let a_1, . . . , a_m ∈ R be such that ∑_{i=1}^{m} a_i = 1, let r_1, . . . , r_m ∈ R, and let us consider the following function of one real variable:

  V(x) = ∑_{i=1}^{m} a_i · (r_i − x)².

Then the function V has a unique minimum in x = ∑_{i=1}^{m} a_i r_i.

Proof: By taking the first derivative of V we obtain δV/δx (x) = ∑_{i=1}^{m} 2 · a_i · (x − r_i) = 2 · (x − ∑_{i=1}^{m} a_i r_i). Thus δV/δx (x) = 0 iff x = ∑_{i=1}^{m} a_i r_i. Moreover, by taking the second derivative we obtain δ²V/δx² = 2 > 0, and thus ∑_{i=1}^{m} a_i r_i is a minimum.
The following lemma shows that frequencies of actions determine (in some cases) the mean-payoff as well as the variance.

Lemma 8. Let µ be a memoryless strategy and let D be a BSCC of G µ . Consider frequencies of individual actions a ∈ D ∩
Finally, it is easy to see that the local and hybrid variance coincide in BSCCs since almost all runs have the same frequencies of actions. This gives us the result for the local variance.
2) Proof of Proposition 3.: We obtain the proof from the following slightly weaker version.

Proposition 6.
Let us fix a MEC C and let ε > 0. There are two frequency functions f ε : , and a number p ε ∈ [0, 1] such that: Before we prove Proposition 6, let us show that it indeed implies Proposition 3. There is a sequence ε 1 , ε 2 , . . ., two functions f C and f ′ C , and p C ∈ [0, 1] such that as n → ∞ • p ε n converges to p C It is easy to show that f C as well as f ′ C are frequency functions. Moreover, as and lim we obtain This finishes a proof of Proposition 3. It remains to prove Proposition 6.
Proof of Proposition 6.: Given ℓ, k ∈ Z we denote by A ℓ,k the set of all runs ω ∈ R C such that By Lemma 6, there are ℓ, k, ℓ ′ , k ′ ∈ Z and p ∈ [0, 1] such that P ζ s 0 (A ℓ,k |R C ) > 0 and P ζ s 0 (A ℓ ′ ,k ′ |R C ) > 0 and Let us concentrate on (ℓ · ε, k · ε) and construct a frequency function f on C such that Intuitively, we obtain f as a vector of frequencies of individual actions on an appropriately chosen run of R C . Such frequencies determine the average and variance close to ℓ · ε and k · ε, respectively. We have to deal with some technical issues, mainly with the fact that the frequencies might not be well defined for almost all runs (i.e. the corresponding limits might not exist). This is solved by a careful choice of subsequences as follows.
and for every action a ∈ A there is a number f ω (a) such that Moreover, for almost all runs ω of R C we have that f ω is a frequency function on C and that f ω determines (mp(ω), lv(ω)), i.e., mp(ω) = mp( f ω ) and lv(ω) ≥ lv( f ω ).

Proof:
We start by taking a sequence T′_1[ω], T′_2[ω], . . . of indices along which the relevant partial averages converge; the existence of such a sequence follows from the fact that every sequence of real numbers has a subsequence which converges to the lim sup of the original sequence. Then we extract a subsequence T′′_1[ω], T′′_2[ω], . . . of it using the same argument. Now, assuming an order a_1, . . . , a_m on the actions, we define inductively, starting from T^0 = T′′, for every 0 ≤ k < m a subsequence T^{k+1}_1[ω], T^{k+1}_2[ω], . . . of T^{k}_1[ω], T^{k}_2[ω], . . . such that the following limit exists (and is equal to a number f_ω(a_{k+1})): lim_{i→∞} (1/T^{k+1}_i[ω]) · |{ j ≤ T^{k+1}_i[ω] : A_j(ω) = a_{k+1} }|. We take T^{m}_1[ω], T^{m}_2[ω], . . . to be the desired sequence T_1[ω], T_2[ω], . . .. Now we have to prove that f_ω is a frequency function on C for almost all runs of R_C. Clearly, 0 ≤ f_ω(a) ≤ 1 for all a ∈ C ∩ A. Also, ∑_{a∈C∩A} f_ω(a) = 1. To prove the third condition from the definition of frequency functions, we invoke the strong law of large numbers (SLLN) [2]. Given a run ω, an action a, a state s and i ≥ 1, define N^{a,s}_i(ω) = 1 if a is executed at least i times in ω and s is visited just after the i-th execution of a, and N^{a,s}_i(ω) = 0 otherwise.
By the SLLN and by the fact that in every step the distribution on the next state depends just on the chosen action, for almost all runs ω the following limit is defined and the equality holds whenever f_ω(a) > 0: lim_{i→∞} (1/i) · ∑_{j=1}^{i} N^{a,s}_j(ω) = δ(a)(s). We obtain

Here S_j(ω) denotes the j-th state of ω, and I_s(t) = 1 if s = t and I_s(t) = 0 otherwise.
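The SLLN step can be illustrated by a tiny simulation: successor states drawn from a hypothetical distribution δ(a) have empirical frequencies converging to δ(a)(s). The distribution below is an arbitrary illustrative assumption.

import random

# Hypothetical successor distribution delta(a) of a single action a.
delta_a = {"s1": 0.2, "s2": 0.5, "s3": 0.3}
states, probs = zip(*delta_a.items())

random.seed(0)
n = 100_000
# Count how often "s2" is visited just after an execution of a.
hits = sum(1 for _ in range(n) if random.choices(states, probs)[0] == "s2")
print(hits / n)  # close to delta_a["s2"] = 0.5, as the SLLN predicts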
Now pick an arbitrary run ω of A_{ℓ,k} such that f_ω is a frequency function. Then mp(f_ω) = mp(ω) and lv(f_ω) ≤ lv(ω), where, by the definition of A_{ℓ,k}, the pair (mp(ω), lv(ω)) lies in the ε-box [ℓ · ε, (ℓ + 1) · ε) × [k · ε, (k + 1) · ε). This, together with equation (15) from page 17, proves Proposition 6 and finishes the proof.

3) Details for proof of Proposition 2:
We have

Here E^ζ_{s_0}[mp | R_C] and E^ζ_{s_0}[lv | R_C] are the conditional expectations of mp and lv, respectively, on runs of R_C. Thus

We define memoryless strategies κ and κ′ in C as follows: Given s ∈ C ∩ S such that ∑_{b∈A(s)} f_C(b) > 0 and a ∈ A(s), we put κ(s)(a) = f_C(a) / ∑_{b∈A(s)} f_C(b), and analogously κ′(s)(a) = f′_C(a) / ∑_{b∈A(s)} f′_C(b) whenever ∑_{b∈A(s)} f′_C(b) > 0. In the remaining states s the strategy κ (or κ′) behaves as a memoryless deterministic strategy reaching, with probability one, the set of states in which κ (or κ′, resp.) is defined by the above equations.

Here E_D(mp) and E_D(lv) denote the expected mean-payoff and the expected local variance, resp., of almost all runs of either C_κ or C_{κ′} initiated in any state of D (note that almost all such runs have the same mean-payoff and local variance due to the ergodic theorem). Note that the second equality follows from the fact that f_C(a) > 0 (or f′_C(a) > 0) iff a ∈ D ∩ A for a BSCC D of C_κ (or of C_{κ′}). The third inequality follows from Lemma 7. The last equality follows from Lemma 8 and the fact that f_C(a)/f_C(D) is the frequency of firing a on almost all runs initiated in D.
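The normalization used to define κ and κ′ above can be sketched in a few lines of Python; the frequency function and the assignment of actions to states below are hypothetical illustrative data.

# Hypothetical frequency function f_C on the actions of a MEC,
# and the state in which each action is available.
f_C = {"a1": 0.25, "a2": 0.25, "a3": 0.5, "a4": 0.0}
state_of = {"a1": "s", "a2": "s", "a3": "t", "a4": "t"}

def strategy_from_frequencies(f, state_of):
    # kappa(s)(a) = f(a) / (sum of f over actions available in s),
    # defined only in states where that sum is positive.
    totals = {}
    for a, s in state_of.items():
        totals[s] = totals.get(s, 0.0) + f[a]
    return {
        s: {a: f[a] / totals[s] for a in state_of if state_of[a] == s}
        for s in totals if totals[s] > 0
    }

print(strategy_from_frequencies(f_C, state_of))
# in state "s" both actions get probability 0.5; in "t", a3 gets 1.0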
By Lemma 6, there are two components D, D′ ∈ BSCC(C_κ) ∪ BSCC(C_{κ′}) and 0 ≤ d_C ≤ 1 such that

In what follows we use the following definition: Let ν be a memoryless randomized strategy on a MEC C and let K be a BSCC of C_ν. We say that a strategy µ_K is induced by K if
1) µ_K(s)(a) = ν(s)(a) for all s ∈ K ∩ S and a ∈ K ∩ A;
2) in all s ∈ S ∖ (K ∩ S) the strategy µ_K corresponds to a memoryless deterministic strategy which reaches a state of K with probability one.
(Note that the above definition is independent of the strategy ν as long as it generates the same BSCC K.) The strategies µ_D and µ_{D′} induced by D and D′, resp., generate single-BSCC Markov chains C_{µ_D} and C_{µ_{D′}} satisfying, for every state s ∈ C ∩ S, the following:

Here the last equality follows from the fact that almost all runs in C_{µ_D} (and also in C_{µ_{D′}}) have the same mean-payoff, and thus for almost all runs the local variance is equal to the hybrid one. This shows that in C, a convex combination of two memoryless (possibly randomized) strategies is sufficient to optimize the mean-payoff and the local variance. Now we show that these strategies may even be chosen deterministic.

Claim 2.
Let s ∈ S. There are memoryless deterministic strategies χ_1, χ_2, χ′_1, χ′_2 in C, each generating a single BSCC, and numbers 0 ≤ ν, ν′ ≤ 1 such that

and

Proof: It suffices to concentrate on µ_D. By [12], E^{µ_D}_{s_0}[mp_{I_a}] is equal to a convex combination of the values E^{ι_i}_{s_0}[mp_{I_a}] for some memoryless deterministic strategies ι_1, . . . , ι_m, i.e. there are γ_1, . . . , γ_m > 0 such that

For all 1 ≤ i ≤ m and D ∈ BSCC(C_{ι_i}), denote by ι_{i,D} a memoryless deterministic strategy such that ι_{i,D}(s) = ι_i(s) on all s ∈ D ∩ S, and on the other states ι_{i,D} is defined so that D ∩ S is reached with probability 1, independently of the starting state. For all a ∈ D ∩ A we have

Here the inequality follows from Lemma, and so by Lemma 6 there are π_C, π′_C ∈ {χ_1, χ_2, χ′_1, χ′_2} and a number h_C such that

Define memoryless deterministic strategies π and π′ in G so that for every s ∈ S and a ∈ A we have π(s)(a) := π_C(s)(a) and π′(s)(a) := π′_C(s)(a) for s ∈ C ∩ S.

4) Proof of Equation (8):
We have

The strategy σ in the memory element m_1 mimics the behavior of σ′ on states of the form (s, m_1). Once σ′ chooses the action [π] (or [π′]), the strategy σ changes its memory element to m_2 (or to m′_2) and starts playing according to π (or to π′, resp.). Formally, we define
• α(m_1) = σ′_n(s_in, n_1)(default), α(m_2) = σ′_n(s_in, n_1)([π]) and α(m′_2) = σ′_n(s_in, n_1)([π′]),
• σ_n(s, m_1)(a) = σ′_n((s, m_1), n_1)(a) / ∑_{b∈A} σ′_n((s, m_1), n_1)(b) for all a ∈ A,
• σ_u(a, s, m_1)(m_1) = ∑_{b∈A} σ′_n((s, m_1), n_1)(b).

2) Obtaining a 3-memory strategy σ: Let us fix an MDP G = (S, A, Act, δ). We prove the following proposition.

Proposition 7.
x_a for all s ∈ S

Intuitively, the proof will resemble the proof of Proposition 2: given an arbitrary strategy ζ with E^ζ_{s_0}[mp] = u, we mimic the proof for the local variance, replacing the quantity (r(A_j(ω)) − mp(ω))² by (r(A_j(ω)) − u)² appropriately. Formally, Proposition 7 is a consequence of the following lemma.

Lemma 9.
1) If there is a strategy ζ satisfying (E^ζ_{s_0}[mp], E^ζ_{s_0}[hv]) = (u, v), then the system L^ζ_H (Figure 5) has a non-negative solution.
2) If there is a non-negative solution for the system L^ζ_H (Figure 5), then there is a 3-memory stochastic-update strategy σ satisfying (E^σ_{s_0}[mp], E^σ_{s_0}[hv]) = (u, v).

We start with the proof of the first item of Lemma 9. We have

and thus

We first argue that Proposition 8 gives us a solution of L^ζ_H. Indeed, given a ∈ A (or s ∈ S), denote by C(a) (or C(s)) the MEC containing a (or s). For every a ∈ A put x_a = P(R_{C(a)}) · p_{C(a)} · f_{C(a)}(a) and x′_a = P(R_{C(a)}) · (1 − p_{C(a)}) · f′_{C(a)}(a). For every action a ∈ A which does not belong to any MEC put x_a = x′_a = 0. (1) We have the following equality for u, i.e.,

and (2) the following equality for v:

The appropriate values for y_a, y_s can be found in the same way as in the proof of [4, Proposition 2]. It remains to prove Proposition 8. As for the proof for the local variance, we obtain the proposition from the following slightly weaker version.

Proposition 9. Let us fix a MEC C and let ε > 0. There are two frequency functions f_ε : C → [0, 1] and f′_ε : C → [0, 1], and a number p_ε ∈ [0, 1] such that:

As before, Proposition 9 implies Proposition 8 as follows: There is a sequence ε_1, ε_2, . . . converging to zero, two functions f_C and f′_C, and p_C ∈ [0, 1] such that, as n → ∞,
• f_{ε_n} converges pointwise to f_C,
• f′_{ε_n} converges pointwise to f′_C,
• p_{ε_n} converges to p_C.
It is easy to show that f_C as well as f′_C are frequency functions. Moreover, taking the limit n → ∞ in the inequalities guaranteed by Proposition 9, we obtain

This together with equation (28) from page 26 gives the desired result:

This finishes the proof of the first item of Lemma 9.
We continue with the proof of the second item of Lemma 9. Assume that the system L^ζ_H has a non-negative solution ȳ_a, x̄_a, x̄′_a for every a ∈ A. We define two memoryless strategies κ and κ′ as follows: Given s ∈ S and a ∈ Act(s), we define κ(s)(a) = x̄_a / ∑_{b∈Act(s)} x̄_b and κ′(s)(a) = x̄′_a / ∑_{b∈Act(s)} x̄′_b, respectively (whenever the denominators are positive).
Using similar arguments as in [4] it can be shown that there is a 3-memory stochastic-update strategy ξ with memory elements m_1, m_2, m′_2 satisfying the following: A run of G_ξ starts in s_0 with a fixed initial distribution on memory elements. In m_1 the strategy plays according to a fixed memoryless strategy until the memory changes either to m_2 or to m′_2. In m_2 (or in m′_2), the strategy ξ plays according to κ (or according to κ′, resp.) and never changes its memory element. The key ingredient is that for every BSCC D of G_κ we have that

Here P^ξ_{s_0}(switch to κ in D) (or P^ξ_{s_0}(switch to κ′ in D′)) is the probability that ξ switches its memory element to m_2 (or to m′_2) in one of the states of D (or D′, resp.).
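The shape of such a 3-memory stochastic-update strategy can be sketched in Python as follows; every concrete number and name below (the toy memoryless strategies for each memory element, the switching probabilities, and the initial distribution) is hypothetical illustrative data rather than the construction from the proof.

import random

# Hypothetical memoryless strategies for each memory element: state -> {action: prob}.
play_m1 = {"s": {"a": 1.0}, "t": {"b": 1.0}}
play_kappa = {"s": {"a": 0.5, "c": 0.5}, "t": {"b": 1.0}}
play_kappa_prime = {"s": {"c": 1.0}, "t": {"b": 1.0}}

# Hypothetical probability of switching memory in each state while in m1.
switch = {"s": {"m2": 0.1, "m2'": 0.05}, "t": {"m2": 0.0, "m2'": 0.2}}

def initial_memory():
    # Fixed initial distribution alpha over memory elements.
    return random.choices(["m1", "m2", "m2'"], weights=[0.8, 0.1, 0.1])[0]

def next_action(state, memory):
    table = {"m1": play_m1, "m2": play_kappa, "m2'": play_kappa_prime}[memory]
    acts, probs = zip(*table[state].items())
    return random.choices(acts, probs)[0]

def update_memory(state, memory):
    # Once in m2 or m2' the memory never changes; in m1 it may switch stochastically.
    if memory != "m1":
        return memory
    p2, p2p = switch[state]["m2"], switch[state]["m2'"]
    return random.choices(["m1", "m2", "m2'"], weights=[1 - p2 - p2p, p2, p2p])[0]

state, memory = "s", initial_memory()
print(memory, next_action(state, memory), update_memory(state, memory))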
Given a BSCC D of G_ξ, almost all runs ω of G_ξ initiated in s_0 that stay in D with the memory element m_2 have the frequency of every a ∈ D ∩ A equal to x̄_a / x̄_D, where x̄_D = ∑_{a∈D∩A} x̄_a. Thus mp(ω) = ∑_{a∈D∩A} (x̄_a / x̄_D) · r(a). Similarly, if the BSCC is D′ and the memory element is m′_2, then mp(ω) = ∑_{a∈D′∩A} (x̄′_a / x̄′_{D′}) · r(a). Thus we have the following desired equality: (1) the equality for u:

The desired result follows.
3) First item of Proposition 5, supposing finite-memory strategies exist: Let ζ be a strategy such that the following two conditions hold: (1) E^ζ_{s_0}[mp] = u; (2) E^ζ_{s_0}[hv] = v. By Proposition 7, without loss of generality the strategy ζ is a finite-memory strategy. Since ζ is a finite-memory strategy, the frequencies are well defined, and for an action a ∈ A we let f(a) = lim_{ℓ→∞} (1/ℓ) · ∑_{t=1}^{ℓ} P^ζ_{s_0}[A_t = a] denote the frequency of action a. We will first show that setting x_a ≔ f(a) for all a ∈ A satisfies Eqns. (12) and Eqns. (13). We establish this below:

Here the first and the seventh equality follow from the definition of f. The second and the sixth equality follow from the linearity of the limit. The third equality follows by the definition of δ. The fourth equality is obtained from the following:

Satisfying Eqns. (13): We will show that ∑_{a∈A} f(a) · r(a) = u.
Here, the first equality is the definition of f(a); the second equality follows from the linearity of the limit; the third equality follows by linearity of expectation; and the fourth equality involves exchanging the limit and the expectation and follows from the Lebesgue dominated convergence theorem (see, e.g. [19, Chapter 4, Section 4]).

The first equality is by definition; the second equality, about the existence of the limit, follows from the fact that ζ is a finite-memory strategy; and the final equality, exchanging the limit and the expectation, follows from the Lebesgue dominated convergence theorem (see, e.g. [19, Chapter 4, Section 4]), since (r(A_t) − u)² ≤ (2 · W)², where W = max_{a∈A} |r(a)|. We have

The first equality is by rewriting the term within the expectation and by linearity of expectation; the second equality is by linearity of the limit; the third equality follows from the equality shown for Eqns. (13) (applied to the reward function r² instead of r, it yields lim_{ℓ→∞} (1/ℓ) · ∑_{t=1}^{ℓ} E^ζ_{s_0}[r(A_t)²] = ∑_{a∈A} f(a) · r(a)²); and the final equality follows from the equality proved for Eqns. (13). Thus we have the following equality:

Now we have to set the values for y_χ, χ ∈ A ∪ S, and prove that they satisfy the rest of L_H when the values f(a) are assigned to x_a. By Lemma 1, almost every run of G_ζ eventually stays in some MEC of G. For every MEC C of G, let y_C be the probability of all runs in G_ζ that eventually stay in C. Note that

Here the last equality follows from the fact that lim_{ℓ→∞} P^ζ_{s_0}[A_ℓ ∈ C] is equal to the probability of all runs in G_ζ that eventually stay in C (recall that almost every run eventually stays in a MEC of G) and the fact that the Cesàro sum of a convergent sequence is equal to the limit of the sequence.
By the previous paragraph there is ζ such that P^ζ_{s_0}[R_C] = ∑_{a∈A∩C} f(a), so we can define y_a and y_s in the same way as done in [4, Proposition 2] (this solution is based on the results of [13]; the proof is exactly the same as the proof of [4, Proposition 2], we only skip the part in which the assignment to the x_a's is defined). This completes the proof of the desired result.

4) Proof that Eqns. (14) is satisfied by σ:
We argue that the strategy σ from [4, Proposition 1] satisfies Eqns. (14). We show that for the strategy σ we have:

It follows immediately that Eqns. (14) is satisfied. Since σ is a finite-memory strategy, all the limits superior can be replaced with limits. Then we use the equality from Appendix C1 where we showed that

which is equal to E^σ_s[mp_{r²}] − (E^σ_s[mp])².

5) Properties of the quadratic constraints of L_H: We now establish that the quadratic constraint of L_H (i.e., Eqns. (14)) is a negative semi-definite constraint of rank 1. Let us denote by x the vector of variables x_a, and by r the vector of rewards r(a), for a ∈ A. Then the quadratic constraint of Eqns. (14) is specified in matrix notation as ∑_{a∈A} x_a · r²(a) − x^T · Q · x, where x^T is the transpose of x, and the matrix Q is given by Q_{ij} = r(i) · r(j). Indeed, we have x^T · Q · x = z^T · x where z_i = ∑_{k∈A} x_k · r(i) · r(k), and so

x^T · Q · x = ∑_{i∈A} x_i · ∑_{k∈A} x_k · r(i) · r(k)
            = ∑_{i∈A} (x_i · r(i))² + ∑_{i∈A} x_i · ∑_{k∈A, k≠i} x_k · r(i) · r(k)
            = ∑_{i∈A} (x_i · r(i))² + ∑_{i∈A} ∑_{k<i} 2 · x_i · r(i) · x_k · r(k)
            = (∑_{i∈A} x_i · r(i))²,

where in the last-but-one equality we use an arbitrary order on A, and the last equality follows by the multinomial theorem.
The desired properties of Q are established as follows:
• Negative semi-definite. We argue that Q is a positive semi-definite matrix. To prove that Q is positive semi-definite it suffices to show that y^T · Q · y ≥ 0 for all real vectors y. For any real vector y we have y^T · Q · y = (∑_{a∈A} y_a · r(a))² ≥ 0 (as the square of a real number is always non-negative). It follows that Eqns. (14) is a negative semi-definite constraint.
• Rank of Q is 1. We now argue that the rank of Q is 1. We observe that the matrix Q with Q_{ij} = r(i) · r(j) is the outer product of r and r^T, where r and r^T denote the vector of rewards and its transpose, respectively, i.e., Q = r · r^T. Since Q is obtained from a single (non-zero) vector and its transpose, it follows that Q has rank 1.
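These properties are easy to check numerically. The following sketch uses an arbitrary illustrative reward vector and assignment to the variables x_a; it builds Q = r · r^T, confirms that x^T · Q · x = (∑_a x_a · r(a))², and checks positive semi-definiteness and rank 1 with NumPy.

import numpy as np

r = np.array([2.0, -1.0, 3.0])      # illustrative rewards r(a)
x = np.array([0.2, 0.5, 0.3])       # illustrative values of the variables x_a

Q = np.outer(r, r)                   # Q_ij = r(i) * r(j)

quadratic = float(x @ Q @ x)
print(np.isclose(quadratic, (x @ r) ** 2))          # x^T Q x = (sum_a x_a r(a))^2
print(float(np.sum(x * r**2) - quadratic))          # value of the quadratic constraint
print(np.linalg.matrix_rank(Q))                     # 1, since Q is an outer product
print(np.all(np.linalg.eigvalsh(Q) >= -1e-12))      # Q is positive semi-definite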

D. Details for Section VI
Some of our algorithms will be based on the notion of almost-sure winning for reachability and coBüchi objectives.
Almost-sure winning, reachability and coBüchi objectives. An objective Φ defines a set of runs. For a set B ⊆ A of actions, we (i) recall the reachability objective Reach(B) that specifies the set of runs ω = s_1 a_1 s_2 a_2 . . . such that for some i ≥ 1 we have a_i ∈ B (i.e., some action from B is visited at least once); and (ii) define the coBüchi objective coBuchi(B) that specifies the set of runs ω = s_1 a_1 s_2 a_2 . . . such that for some i ≥ 1 and all j ≥ i we have a_j ∈ B (i.e., actions not in B are visited only finitely often). Given an objective Φ, a state s is an almost-sure winning state for the objective if there exists a strategy σ (called an almost-sure winning strategy) to ensure the objective with probability 1, i.e., P^σ_s[Φ] = 1. We recall some basic results related to almost-sure winning for reachability and coBüchi objectives.

Theorem 5 ([7], [8]). For reachability and coBüchi objectives, whether a state is almost-sure winning can be decided in polynomial time (in time O((|S| · |A|)²)) using discrete graph-theoretic algorithms. Moreover, both for reachability and coBüchi objectives, if there is an almost-sure winning strategy, then there is a memoryless pure almost-sure winning strategy.

Basic facts. We will also use the following basic facts about finite Markov chains. Given a Markov chain and a state s: (i) (Fact 1) The local variance is zero iff for every bottom scc reachable from s there exists a reward value r* such that all rewards of that bottom scc are equal to r*. (ii) (Fact 2) The hybrid variance is zero iff there exists a reward value r* such that for every bottom scc reachable from s all rewards of the bottom scc are equal to r*. (iii) (Fact 3) The global variance is zero iff there exists a number y such that for every bottom scc reachable from s the expected mean-payoff value of the bottom scc is y.
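Fact 2 suggests a direct check on an explicit Markov chain: compute the bottom strongly connected components reachable from the starting state and verify that all of them use one common reward. Below is a minimal Python sketch; the toy chain, the per-state rewards, and the brute-force SCC computation are illustrative assumptions, not the algorithm from the paper.

# Illustrative Markov chain: succ[s] = set of successor states, and
# reward[s] = reward of the (single) action taken in state s.
succ = {"s0": {"s1", "s2"}, "s1": {"s1"}, "s2": {"s3"}, "s3": {"s2"}}
reward = {"s0": 1.0, "s1": 2.0, "s2": 2.0, "s3": 2.0}

def reachable(start):
    seen, stack = {start}, [start]
    while stack:
        for t in succ[stack.pop()]:
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def bottom_sccs(states):
    reach = {s: reachable(s) for s in states}
    sccs = []
    for s in states:
        comp = frozenset(t for t in states if s in reach[t] and t in reach[s])
        if comp not in sccs:
            sccs.append(comp)
    # A bottom SCC has no edge leaving it.
    return [c for c in sccs if all(succ[s] <= c for s in c)]

def hybrid_variance_is_zero(start):
    # Fact 2: zero hybrid variance iff one common reward works for
    # every bottom SCC reachable from the start state.
    reach = reachable(start)
    bsccs = [c for c in bottom_sccs(set(succ)) if c <= reach]
    rewards_seen = {reward[s] for c in bsccs for s in c}
    return len(rewards_seen) <= 1

print(hybrid_variance_is_zero("s0"))  # True: both bottom SCCs use reward 2.0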
1) Zero Hybrid Variance: We establish the correctness of our algorithm with the following lemma.
Lemma 10.
Given an MDP G = (S, A, Act, δ), a starting state s, and a reward function r, the following assertions hold:
1) If β is the output of the algorithm, then there is a strategy to ensure that the expectation is at most β and the hybrid variance is zero.
2) If there is a strategy to ensure that the expectation is at most β* and the hybrid variance is zero, then the output β of the algorithm satisfies β ≤ β*.
Proof: The proofs of the items are as follows:
1) If the output of the algorithm is β, then consider A′ to be the set of actions with reward β. By step (2) of the algorithm there exists an almost-sure winning strategy for the objective coBuchi(A′), and by Theorem 5 there exists a memoryless pure almost-sure winning strategy σ for the coBüchi objective. Since σ is an almost-sure winning strategy for the coBüchi objective, it follows that in the Markov chain G^σ_s every bottom scc reachable from s consists of actions with reward β only. Thus the expectation under the strategy σ is β, and by Fact 2 for Markov chains the hybrid variance is zero.
2) Consider a strategy that ensures that the expectation is at most β* and the hybrid variance is zero. By Proposition 7 there is a finite-memory strategy σ ensuring expectation at most β* with hybrid variance zero. If some bottom scc of G^σ_s contained actions with two different rewards, then by Fact 2 the hybrid variance would be greater than zero. Hence all actions appearing in the bottom sccs of G^σ_s carry one common reward β′; consequently the expectation under σ equals β′, so β′ ≤ β*, and σ witnesses that the coBüchi objective for the set of actions with reward β′ is almost-sure winning. Therefore the output β of the algorithm satisfies β ≤ β′ ≤ β*.
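Under the assumption that the almost-sure coBüchi check of Theorem 5 is available as a helper, the overall flow of the algorithm discussed in Lemma 10 can be sketched as follows (mdp.actions, the reward accessor, and almost_sure_cobuchi are hypothetical names; this is an illustration rather than the paper's pseudocode).

def zero_hybrid_variance_minimal_expectation(mdp, s, reward, almost_sure_cobuchi):
    """Return the least reward value beta such that coBuchi(A_beta) -- eventually
    only actions with reward beta -- is almost-sure winning from s, or None if no
    such value exists.  almost_sure_cobuchi(mdp, s, B) -> bool is an assumed
    helper implementing the check of Theorem 5."""
    for beta in sorted(set(reward(a) for a in mdp.actions)):    # candidate rewards
        restricted = {a for a in mdp.actions if reward(a) == beta}
        if almost_sure_cobuchi(mdp, s, restricted):             # step (2)
            # By Fact 2, staying forever among reward-beta actions forces
            # hybrid variance zero and expectation beta.
            return beta
    return None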
2) Zero Local Variance: … as soon as a bottom scc C is reached at a state s′, the strategy in G chooses the action a_{s′} to proceed to the state s′. The strategy ensures that the cumulative reward in G is at most α(s) − M, i.e., α(s) − M ≥ β(s). It follows that α(s) ≥ β(s).

Proof: Consider a witness memoryless pure strategy σ* in G that achieves the optimal cumulative reward value. We construct a witness strategy σ for zero local variance in G as follows: play as σ* until the set T is reached (note that σ* ensures almost-sure reachability of T), and once a state s ∈ T is reached, switch to the memoryless pure strategy from s that ensures expectation at most β(s) with zero hybrid variance. The strategy σ ensures that every bottom scc of the resulting Markov chain consists of only one reward value; hence the local variance is zero. The expectation under strategy σ is at most β*(s). Hence the desired result follows.
3) Zero Global Variance: The following lemma shows that in a MEC, any expectation in the interval is realizable with zero global variance.

Lemma 13.
Given an MDP G = (S, A, Act, δ), a starting state s, and a reward function r, the following assertions hold:
1) If ℓ is the output of the algorithm, then there is a strategy to ensure that the expectation is at most ℓ and the global variance is zero.
2) If there is a strategy to ensure that the expectation is at most ℓ* and the global variance is zero, then the output ℓ of the algorithm satisfies ℓ ≤ ℓ*.
Proof: The proofs of the items are as follows: 1) If the output of the algorithm is ℓ, then consider C to be the set of MECs whose interval contains ℓ. Let A′ = ⋃_{C_j∈C} C_j.
By step (4)(b) of the algorithm we have that there exists an almost-sure winning strategy for the objective Reach(A′), and by Theorem 5 there exists a memoryless pure almost-sure winning strategy σ_R for the reachability objective. We consider a strategy σ as follows: (i) play σ_R until an end component in C is reached; (ii) once A′ is reached, consider the MEC C_j that has been reached and switch to the memoryless randomized strategy σ_ℓ of Lemma 2, which ensures that every bottom scc obtained in C_j by fixing σ_ℓ has expected mean-payoff exactly ℓ (i.e., it ensures expectation ℓ with zero global variance). Since σ_R is an almost-sure winning strategy for the reachability objective to the MECs in C, and once the MECs are reached the strategy σ_ℓ ensures that every bottom scc of the resulting Markov chain has expectation exactly ℓ, it follows that the expectation is ℓ and the global variance is zero.
2) Consider a strategy that ensures that the expectation is at most ℓ* and the global variance is zero. By the results of Theorem 1 there is a finite-memory strategy σ to ensure expectation ℓ* with global variance zero. Given the strategy σ, consider the Markov chain G^σ_s. Let D = {D | D is a bottom scc reachable from s in G^σ_s}. Since the global variance is zero and the expectation is ℓ*, every bottom scc D ∈ D must have expectation exactly ℓ*. Let C = {C | C is a MEC and there exists D ∈ D such that the end component associated with D is contained in C}.
For every C ∈ C we have ℓ* ∈ [α_C, β_C], where [α_C, β_C] is the interval of C. Moreover, the strategy σ is also a witness almost-sure winning strategy for the reachability objective Reach(A′), where A′ = ⋃_{C∈C} C. Let ℓ′ = min{α_C | C ∈ C} be the minimal expectation over the MECs of C. Since for every C ∈ C we have ℓ* ∈ [α_C, β_C], it follows that ℓ′ ≤ ℓ*. Observe that if the algorithm checks the value ℓ′ in step (4) (say ℓ′ = ℓ_i), then the condition in step (4)(b) is true, as A′ ⊆ ⋃_{C_j∈C_i} C_j and σ is a witness almost-sure winning strategy to reach ⋃_{C_j∈C_i} C_j. Thus the algorithm must return a value ℓ ≤ ℓ′ ≤ ℓ*. The desired result follows.
The above lemma ensures correctness, and the complexity analysis is as follows: (i) the MEC decomposition of an MDP can be computed in polynomial time [6], [7] (hence step 1 is polynomial); (ii) the minimal and maximal expectations can be computed in polynomial time by linear programming for MDPs with mean-payoff objectives [18] (thus step 2 is polynomial); and (iii) sorting (step 3) and deciding the existence of almost-sure winning strategies for reachability objectives (step 4) can be achieved in polynomial time [7], [8]. It follows that the algorithm runs in polynomial time.
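Putting the steps together, a compact sketch of the zero-global-variance procedure looks as follows; the MEC decomposition, the per-MEC expectation bounds, the almost-sure reachability check, and the C.actions attribute are hypothetical helpers standing in for steps 1, 2 and 4(b), and the sketch illustrates the flow described above rather than the paper's Algorithm 2.

def zero_global_variance_minimal_expectation(mdp, s,
                                              mec_decomposition,
                                              min_max_expectation,
                                              almost_sure_reach):
    """Return the least candidate value l realizable with zero global variance,
    or None if no candidate works.  Assumed helpers:
      mec_decomposition(mdp)        -> list of MECs            (step 1)
      min_max_expectation(mdp, C)   -> (alpha_C, beta_C)       (step 2)
      almost_sure_reach(mdp, s, B)  -> bool                    (step 4(b))
    """
    mecs = mec_decomposition(mdp)                               # step 1
    intervals = {C: min_max_expectation(mdp, C) for C in mecs}  # step 2
    # Candidate values are the minimal expectations, in increasing order (step 3).
    for l in sorted(alpha for (alpha, _) in intervals.values()):
        # MECs whose interval [alpha_C, beta_C] contains the candidate l.
        target = [C for C, (alpha, beta) in intervals.items() if alpha <= l <= beta]
        actions = set().union(*(C.actions for C in target)) if target else set()
        if actions and almost_sure_reach(mdp, s, actions):      # step 4(b)
            return l
    return None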
For the reader's convenience, the formal description of the algorithm is given as Algorithm 2.