Wasserstein stability estimates for covariance-preconditioned Fokker-Planck equations

We study the convergence to equilibrium of the mean field PDE associated with derivative-free methodologies for solving inverse problems. We prove stability estimates in the Euclidean Wasserstein distance for the mean field PDE using optimal transport arguments. As a consequence, we recover the known convergence-to-equilibrium estimates in the case of a linear forward model.


Introduction
In this paper, we are concerned with the nonlocal Fokker-Planck equation

∂_t f = ∇ · ( C(f_t) (∇Φ_R f + σ ∇f) ),   (1.1)

where σ ≥ 0, f_t = f(•, t), and C is the covariance operator defined by

C(f) = ∫_{R^d} (u − M(f)) ⊗ (u − M(f)) f(u) du,   M(f) := ∫_{R^d} u f(u) du,

and Φ_R(•; y) is a functional of the form

Φ_R(u; y) = ½ |y − G(u)|²_Γ + ½ |u|²_{Γ_0} =: Φ(u; y) + ½ |u|²_{Γ_0}.   (1.2)

Here G : R^d → R^K is the so-called forward model, in view of the link with Bayesian inverse problems, y ∈ R^K is a given vector of observations, and Γ ∈ R^{K×K}, Γ_0 ∈ R^{d×d} are symmetric, positive definite matrices. We employ the notation |•|_Γ := |Γ^{-1/2} •|, where |•| is the usual Euclidean norm.
Throughout this paper, we restrict our attention to the case where G is a linear mapping and we write G(u) = Gu, with G ∈ R^{K×d}. We will assume that the matrix Γ_0^{-1} + G^T Γ^{-1} G =: B^{-1} is nonsingular, so that the regularised least squares misfit Φ_R, given by (1.2), admits the unique minimiser u_0 = B G^T Γ^{-1} y. Our main result is that, if f^1 and f^2 are the solutions of (1.1) associated with the initial conditions f_0^1 and f_0^2, respectively, then a stability estimate of the following form holds:

W_2(f_t^1, f_t^2) ≤ C(f_0^1, f_0^2; G, Γ, Γ_0) γ(t; σ),   (1.3)

where C(f_0^1, f_0^2; G, Γ, Γ_0) depends only on the first two moments of f_0^1 and f_0^2, and the function γ(t; σ) converges to zero as t → ∞, exponentially when σ > 0 and algebraically when σ = 0.
Here and in the rest of the paper, we employ the notation f_t^i = f^i(•, t), i = 1, 2. If σ > 0, then by taking one solution in (1.3) to be the Gaussian equilibrium f_∞ = Z^{-1} e^{−Φ_R/σ}, with Z the normalisation constant, one recovers the equilibration estimate obtained in [15], but with a sharper rate of convergence. An important feature of our result is that the convergence rate, encapsulated in the function γ(•; σ), depends only on σ and not on the parameters, notably the Hessian, of the quadratic least-squares functional Φ_R. Roughly speaking, this universality of the convergence rate comes from the fact that the covariance preconditioner tends to accelerate equilibration in directions along which the posterior distribution has a large variance (or, equivalently, in directions associated with a small eigenvalue of Hess(Φ_R)), and these are precisely the directions which, in the absence of the covariance preconditioner, limit the convergence rate. For a more precise statement on the optimality of the decay rate, we refer to proposition 3.8 and remark 3.7.
As a byproduct of our analysis, we deduce the algebraic convergence of the solution towards a Dirac delta at u 0 when σ = 0, i.e. to the solution of the Bayesian inverse problem, generalising to the mean field PDE the estimates obtained for a related particle system in [21] and answering fully the equilibration open problem discussed in [18] for the linear forward model.
We now turn our attention to the connection of the PDE (1.1) to mean field descriptions of the ensemble Kalman methods for the Bayesian inverse problem. The Fokker-Planck equation (1.1) can be linked to the inverse problem of finding u ∈ R^d from an observation

y = G(u) + η.   (1.4)

Here η is a random variable assumed to have Lebesgue density ρ. In the Bayesian approach to inverse problems [9, 12], a probability measure called the prior is placed on u. If we assume that this measure also has a density ρ_0 and that u is independent of η, then (u, y) is a random variable with density ρ(y − G(u)) ρ_0(u). The posterior density of u|y (i.e. of u given an observation y) is then given by the normalised probability density

ρ^y(u) = ρ(y − G(u)) ρ_0(u) / ∫_{R^d} ρ(y − G(v)) ρ_0(v) dv.   (1.5)

In the particular case where ρ and ρ_0 are the densities of Gaussians N(0, Γ) and N(0, Γ_0), respectively, ρ^y(u) ∝ e^{−Φ_R(u; y)}, where Φ_R is given by (1.2). We make this assumption below.
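The Gaussian-linear setting above lends itself to a quick numerical sanity check. The following minimal sketch (problem sizes and covariances are hypothetical, chosen only for illustration) computes the minimiser u_0 = B G^T Γ^{-1} y from the text and confirms that the gradient of Φ_R vanishes there:

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 3, 5

# Hypothetical problem data: linear forward model and SPD covariances.
G = rng.standard_normal((K, d))
Gamma = 0.5 * np.eye(K)      # observation-noise covariance
Gamma0 = 2.0 * np.eye(d)     # prior covariance
y = rng.standard_normal(K)

# B^{-1} = Gamma0^{-1} + G^T Gamma^{-1} G, and the minimiser u0 = B G^T Gamma^{-1} y.
Gamma_inv, Gamma0_inv = np.linalg.inv(Gamma), np.linalg.inv(Gamma0)
B = np.linalg.inv(Gamma0_inv + G.T @ Gamma_inv @ G)
u0 = B @ G.T @ Gamma_inv @ y

# u0 is a critical point of Phi_R: grad Phi_R(u) = G^T Gamma^{-1} (G u - y) + Gamma0^{-1} u.
grad = G.T @ Gamma_inv @ (G @ u0 - y) + Gamma0_inv @ u0
assert np.allclose(grad, 0.0, atol=1e-10)
```

Since Φ_R is strictly convex (B^{-1} is positive definite), the vanishing gradient confirms that u_0 is its unique minimiser.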
In [19], the authors proposed to solve the inverse problem (1.4) by applying a state-estimation method, or filter, to the following artificial dynamics on R^d × R^K and associated observational model, where we denote by u the first d components of z. If the observed data in the dynamics is fixed at the observation y of the Bayesian inverse problem for all steps, then the u-marginal of the posterior distribution at iteration n has density ρ_n(u) ∝ exp(−nhΦ(u; y)) ρ_0(u), which can be obtained by repeatedly applying the reasoning that led to (1.5). This iteration leads to a concentration of the mass of ρ_n at minimisers of the (non-regularised) least squares functional Φ in the limit as n → ∞, and we remark that the posterior ρ_n coincides with the posterior ρ^y of the inverse problem when nh = 1, a fact that can be exploited to produce approximate samples of the posterior ρ^y [11]. If the prior ρ_0 is Gaussian and the forward model G is linear, then the posteriors {ρ_n}_{n∈N} can in principle be captured exactly by a Kalman filter. However, when the dimension of the state space is large, which is often the case in scientific and engineering applications, the Kalman filter is computationally expensive and a particle-based method such as the ensemble Kalman filter (EnKF) becomes preferable. This approach is also more general than the Kalman filter, because it does not require that the forward model be linear. The ensemble members U = {u^{(j)}}_{j=1}^J of the EnKF are evolved according to equation (4) in [21], labelled (1.8) here, for j = 1, . . . , J. Traditionally, the distribution of the noise employed to perturb the simulated observations {G(u^{(j)}_n)} in the EnKF coincides with that of the noise in the observational model, which suggests taking Σ = Γ. It was shown in [21], however, that taking Σ = 0 also produces an efficient method. The second term on the right-hand side of (1.8) is included so as to take the prior information into account.
The idea of including the covariance matrix C_uu(U) in that term, as well as in the noise, is motivated by the fact that, in the case of a linear forward model, (1.8) can equivalently be written as the system (1.9), which is expected to produce approximate samples of the posterior ρ^y for large J. Indeed, the formal mean field limit of this interacting particle system is given by the law of the process defined by the McKean-type SDE

dū_t = −C(f_t) ∇Φ_R(ū_t) dt + √(2 C(f_t)) dW_t,   f_t = Law(ū_t),   (1.10)

which clearly admits Z^{-1} e^{−Φ_R} as an invariant measure, where Z is the normalisation constant. The associated Fokker-Planck equation for f, which was derived formally in [15] and rigorously in [13], is precisely (1.1) with σ = 1. Two remarks are in order. First, we note that a concentration of the particles at any point of R^d is a stationary solution of the dynamics (1.8) and, likewise, any Dirac delta is a stationary solution of (1.10). Second, as recently noted in [20], the J-particle distribution with density proportional to ∏_{j=1}^J e^{−Φ_R(u^{(j)}; y)} is not invariant under the dynamics (1.9).

The strategy of the proof of the stability estimate (1.3) is the following: we first observe that the moments up to second order of equation (1.1) are governed by a closed system of ODEs. This is a common feature of some of the simplest homogeneous kinetic equations, such as the Fokker-Planck operator preserving the first two moments of the distribution function [22], the Maxwellian molecules case for the Landau-Fokker-Planck equation [23], and the Boltzmann equation for Maxwellian molecules; see [8, 10] and the references therein. We then focus on finding stability estimates for solutions that have the same covariance matrix, which is simpler because the nonlinearity of the problem does not show up and we are reduced to a kind of linear Fokker-Planck equation. Finally, we obtain the stability estimate for any two solutions, regardless of the values of their first two moments, by using optimal transport techniques.
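The noise-free (σ = 0) preconditioned particle dynamics discussed above can be sketched numerically. The forward-Euler discretisation, ensemble size, and problem data below are illustrative choices, not taken from the text; the sketch checks that the ensemble collapses onto the minimiser u_0 of Φ_R, as the stationary-solution discussion predicts:

```python
import numpy as np

rng = np.random.default_rng(1)
d, K, J = 2, 3, 200

# Hypothetical linear-Gaussian setup (illustrative only).
G = rng.standard_normal((K, d))
Gamma_inv = np.eye(K)
Gamma0_inv = 0.5 * np.eye(d)
y = rng.standard_normal(K)

B = np.linalg.inv(Gamma0_inv + G.T @ Gamma_inv @ G)
u_star = B @ G.T @ Gamma_inv @ y   # unique minimiser of Phi_R

# Forward-Euler discretisation of the noise-free particle system:
# du^j/dt = -C(U) grad Phi_R(u^j), with C(U) the empirical covariance of the ensemble.
U = 2.0 * rng.standard_normal((J, d))
dt, n_steps = 0.02, 10000
for _ in range(n_steps):
    m = U.mean(axis=0)
    Cov = (U - m).T @ (U - m) / J
    grads = (U @ G.T - y) @ Gamma_inv @ G + U @ Gamma0_inv   # row j: grad Phi_R(u^j)
    U = U - dt * grads @ Cov                                 # Cov is symmetric

# The ensemble collapses (algebraically) onto the minimiser u_star.
assert np.linalg.norm(U.mean(axis=0) - u_star) < 0.1
assert np.trace(Cov) < 0.1
```

Because the forward model is affine, the empirical mean and covariance of this system satisfy exactly the closed moment ODEs discussed in section 2, up to the time-discretisation error.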
The strategy of our proofs follows that employed in similar results for the Boltzmann equation in the Maxwellian case as in [4][5][6]10].
The paper is organised as follows. In section 2, we summarise known results and present some equilibration estimates for the first and second moments of the solution to (1.1). In section 3, we give a simple proof of the stability estimate (1.3) in the Euclidean Wasserstein distance, based on analytical techniques in optimal transport.

Preliminaries
We remind the reader that the forward model G(u) = Gu is assumed to be linear throughout the paper, and we recall the following result, proved in [15]. Proposition 2.1 (Closed system of ordinary differential equations for the first and second moments). Assume f_t is a solution of (1.1), and let C(t) := C(f_t) and δ(t) := M(f_t) − u_0, where M(f_t) denotes the first moment of f_t. The evolution of C(t) and δ(t) is governed by the system

δ̇(t) = −C(t) B^{-1} δ(t),   (2.1a)
Ċ(t) = −2 C(t) B^{-1} C(t) + 2σ C(t).   (2.1b)

Proof. We show this only in the case σ = 0, for simplicity. Multiplying (1.1) by u ⊗ u, integrating over R^d, and using the notation m := M(f_t), we obtain an equation for the covariance matrix. Omitting the dependence of C and m on t for convenience: since the term in the first round brackets in the integral is mean-zero with respect to f(u, t), we can remove and add constants in the other factor, which, in matrix form, gives (2.1b).
If we assume that C_0 := C(f_0) is positive definite, then the solution of (2.1b) reads

C(t)^{-1} = e^{−2σt} C_0^{-1} + σ^{-1} (1 − e^{−2σt}) B^{-1}   (2.2)

when σ > 0, and C(t)^{-1} = C_0^{-1} + 2t B^{-1} when σ = 0. We notice that the solution in the case σ = 0 is the pointwise limit as σ → 0 of that when σ > 0. For a given solution C(t) of (2.1b), we will denote by U(s, t; C) the fundamental matrix associated with (2.1a); this matrix solves

∂_t U(s, t; C) = −C(t) B^{-1} U(s, t; C),   U(s, s; C) = I.
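Assuming (2.1b) has the form Ċ = −2 C B^{-1} C + 2σC, as in the ensemble Kalman sampler literature the text cites, the inverse covariance D = C^{-1} satisfies the linear ODE Ḋ = 2B^{-1} − 2σD, whose solution gives the closed form checked below. A minimal sketch (the matrices are arbitrary stand-ins) integrating the matrix ODE with a classical RK4 scheme and comparing with the closed form:

```python
import numpy as np

rng = np.random.default_rng(2)
d, sigma = 3, 0.7

A = rng.standard_normal((d, d))
B_inv = A @ A.T + d * np.eye(d)   # stand-in for B^{-1} = Gamma0^{-1} + G^T Gamma^{-1} G
C0 = 0.5 * np.eye(d)

def rhs(C):
    # Right-hand side of the covariance ODE: -2 C B^{-1} C + 2 sigma C.
    return -2.0 * C @ B_inv @ C + 2.0 * sigma * C

# Classical RK4 integration of the matrix ODE up to time T.
T, n_steps = 2.0, 2000
dt = T / n_steps
C = C0.copy()
for _ in range(n_steps):
    k1 = rhs(C)
    k2 = rhs(C + 0.5 * dt * k1)
    k3 = rhs(C + 0.5 * dt * k2)
    k4 = rhs(C + dt * k3)
    C = C + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

# Closed-form inverse covariance for sigma > 0:
# C(t)^{-1} = e^{-2 sigma t} C0^{-1} + sigma^{-1} (1 - e^{-2 sigma t}) B^{-1}.
C_inv_exact = np.exp(-2 * sigma * T) * np.linalg.inv(C0) \
    + (1.0 - np.exp(-2 * sigma * T)) / sigma * B_inv
assert np.allclose(np.linalg.inv(C), C_inv_exact, rtol=1e-6)
```

Taking σ → 0 in the closed form recovers C(t)^{-1} = C_0^{-1} + 2t B^{-1}, consistent with the pointwise-limit remark above.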

Lemma 2.2 (Bound for the fundamental matrix). Let C(t) be a solution of (2.1b).
which implies (2.6). Let us denote the polar decomposition of C(t)^{-1/2} U(s, t) by Q(s, t) S(s, t), for some orthogonal matrix Q(s, t) and some symmetric matrix S(s, t). Substituting this decomposition in (2.6), we obtain an expression for S(s, t). Rewriting C(t) in a way that exhibits a convex combination of B^{-1} and C(0)^{-1}, we deduce (2.4).
In the sequel, α(t) denotes the same function as in lemma 2.2, and we employ the notations |•|_F := (∑_{ij} •²_{ij})^{1/2} and |•|_2 for the Frobenius matrix norm and the operator norm induced by the Euclidean vector norm on R^d, respectively.

Lemma 2.3 (Convergence of the first and second moments).
We consider two solutions C_1(t), C_2(t) of (2.1b) and the corresponding solutions δ_1(t), δ_2(t) of (2.1a), and we assume that the bounds (2.8a) and (2.8b) hold. Then the estimates (2.9a) and (2.9b) hold. Proof. By a sub-multiplicative property of the Frobenius norm, and using the sub-multiplicative property of the norm again, we arrive at (2.9a).
For the first moments, by the variation-of-constants formula, and with the shorthand notation U_i(s, t) := U(s, t; C_i), we deduce a representation for the difference of the means. Employing (2.4) and (2.9a), and using the fact that δ_2(u) = U_2(s, u) δ_2(s), we can bound each term. (This calculation fails for σ = 0 and σ = 1, but it is easy to check that the conclusion holds for any σ ≥ 0.) This leads to (2.9b) after taking s = 0 (the case s > 0 will be useful in remark 2.1 below) and rearranging.
where the constants m and M are defined as before.
In the rest of this paper, we denote by g(•; μ, Σ) the density of the Gaussian N (μ, Σ).

Lemma 2.4 (Propagation of Gaussians for the linear equation). Let C(t) be the solution of (2.1b) with C(0) = C_0, for a given positive definite matrix C_0. Then the solution of the linear Fokker-Planck equation (2.14a) is given by the Gaussian density f(u, t) = g(u; μ(t), Σ(t)), where μ(t) and Σ(t) solve the moment system below. Proof. Proceeding as in proposition 2.1, we deduce that the first and second moments of any solution to (2.14a), which we denote μ and Σ, satisfy

μ̇(t) = −C(t) B^{-1} (μ(t) − u_0),   Σ̇(t) = −C(t) B^{-1} Σ(t) − Σ(t) B^{-1} C(t) + 2σ C(t).

We then verify, proceeding similarly to [2, 14], that the Gaussian ansatz is indeed a solution. Omitting the dependence of C, μ and Σ on t for notational convenience, we calculate the left-hand and right-hand sides of (2.14a) and check that they coincide. For general initial conditions μ_0 and Σ_0, the solution to this system of equations can be expressed in terms of the fundamental matrix U(0, t).

Stability in Wasserstein distance
The aim of this section is to derive a stability property for both the linear Fokker-Planck equation (2.14a) (where C(t) is a given parameter) and the nonlocal mean field equation (1.1) (where C(f_t) depends on the solution), which we undertake in sections 3.1 and 3.2, respectively.

Stability for the linear Fokker-Planck equation (2.14a)
Throughout this subsection we consider that C(t) is a given solution of (2.1b) and U(0, t) = U(0, t; C). For a probability measure f over R^d and a mapping T : R^d → R^d, we will denote the pushforward measure by T#f. We remind the reader that, if f admits a density f̃ with respect to the Lebesgue measure and A ∈ R^{d×d} is a nonsingular matrix, then A#f (identifying A with the associated linear mapping) has density |det(A)|^{-1} f̃(A^{-1} •). We show the following result.

Proposition 3.1 (Convergence of solutions when the covariance is given).
Let f^1 and f^2 be two solutions of (2.14a) associated with initial conditions f_0^1 and f_0^2, respectively. Then

W_2(f_t^1, f_t^2) ≤ |U(0, t)|_2 W_2(f_0^1, f_0^2).   (3.1)

Under the same assumptions as in lemma 2.3, it therefore holds, in view of (2.4), that the right-hand side decays at the rate α(t), up to a constant depending only on m and M. To prove proposition 3.1 we will need the following lemma.

Lemma 3.2 (Influence of linear transformations on the Wasserstein distance).
Let A ∈ R^{d×d} be nonsingular and let us consider two probability measures with finite second moment, f, g ∈ P_2(R^d). Then also A#f, A#g ∈ P_2(R^d) and

W_2(A#f, A#g) ≤ |A|_2 W_2(f, g).

Proof. Let γ_o be an optimal transference plan (by [10, proposition 2.1], the infimum in the definition of the Wasserstein distance is achieved), and consider the map r : (x, y) → (Ax, Ay). The pushforward plan r#γ_o has the correct marginals: looking for example at the x-marginal, we calculate that, for all ϕ ∈ C_b(R^d), the integral of ϕ against this marginal coincides with its integral against A#f. Furthermore, by a change of variable, the cost of r#γ_o is bounded by |A|_2² times the cost of γ_o. We notice that orthogonal transformations do not influence the Wasserstein distance.
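Lemma 3.2 can be illustrated numerically on Gaussian measures, for which the pushforward A#N(μ, Σ) = N(Aμ, AΣA^T) is explicit and the distance is given by the closed-form expression recalled in lemma 3.4. A minimal sketch with arbitrary stand-in data:

```python
import numpy as np

def psd_sqrt(S):
    # Symmetric square root of a symmetric positive semi-definite matrix.
    w, V = np.linalg.eigh(S)
    return V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

def w2_gauss_sq(m1, S1, m2, S2):
    # Squared 2-Wasserstein distance between N(m1,S1) and N(m2,S2) (closed form).
    r1 = psd_sqrt(S1)
    cross = psd_sqrt(r1 @ S2 @ r1)
    return np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2.0 * cross)

rng = np.random.default_rng(3)
d = 3
m1, m2 = rng.standard_normal(d), rng.standard_normal(d)
X1, X2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
S1, S2 = X1 @ X1.T + np.eye(d), X2 @ X2.T + np.eye(d)
A = rng.standard_normal((d, d)) + 2.0 * np.eye(d)   # a nonsingular linear map

# A#N(m,S) = N(Am, A S A^T); the lemma asserts W2(A#f, A#g) <= |A|_2 W2(f, g).
lhs = np.sqrt(w2_gauss_sq(A @ m1, A @ S1 @ A.T, A @ m2, A @ S2 @ A.T))
rhs = np.linalg.norm(A, 2) * np.sqrt(w2_gauss_sq(m1, S1, m2, S2))
assert lhs <= rhs + 1e-10
```

When A is orthogonal, |A|_2 = 1 and the two sides coincide, in line with the closing remark of the proof.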
Proof of proposition 3.1. Let us denote by ζ(u, t; v) the fundamental solution provided by lemma 2.4. By linearity, the solution of (2.14a) associated with initial condition f_0 can be expressed as a superposition of these fundamental solutions. By the change of variables v → U(0, t)(v − u_0) =: w(v), we can rewrite this integral conveniently. By the convexity property of the Wasserstein distance [10, proposition 2.1], its invariance under translation, and lemma 3.2, we obtain the desired inequality (3.1).
Remark 3.1. Proposition 3.1 can also be proved via a purely probabilistic argument, employing the approach presented e.g. in [7, 24]. Indeed, a solution of (2.14a) with initial condition f_0 can be viewed, by Itô's formula, as the law of the process (X_t)_{t ≥ 0} that solves the SDE

dX_t = −C(t) B^{-1} (X_t − u_0) dt + √(2σ C(t)) dW_t,

where W is a standard Wiener process on R^d. Considering two solutions X_t and Y_t associated with the initial conditions X_0 ∼ f_0^1 and Y_0 ∼ f_0^2 (and driven by the same Wiener process), we calculate that their difference satisfies a linear ODE. Recalling that the Wasserstein distance can equivalently be defined as an infimum over couplings, where the infimum is over all X and Y with laws ρ_1 and ρ_2, respectively, and taking the expectation of both sides of (3.3), we obtain a bound on the expected squared distance. Taking the infimum over all X_0 and Y_0 with laws f_0^1 and f_0^2, respectively, we obtain precisely (3.1). We remark that the first moments of f^1 and f^2 need not coincide for proposition 3.1 to hold.

Stability for the nonlocal mean field equation
To pave the way for the derivation of our main result, we begin by showing a stability property on the set of Gaussian solutions. To this end, we will employ the following bound for the distance between the square roots of the covariance matrices associated with two solutions.
where C R is a constant that depends only on the dimension of the problem.
Proof. We restrict ourselves in the proof to the case σ > 0 for simplicity. Employing the same reasoning as in the first part of the proof of lemma 2.3, we write the difference of the square roots in a convenient form; the middle term can then be estimated using the technical bound presented in lemma A.1 in appendix A, which leads to (3.4) after employing the convex decomposition (2.7) to bound the inverse covariance matrices.

The Wasserstein distance between two Gaussian measures admits an explicit expression, which we recall in the following lemma.

Lemma 3.4 (Wasserstein distance between Gaussians).
Consider two Gaussian probability measures N(μ_1, Σ_1) and N(μ_2, Σ_2) on R^d. The Wasserstein distance between them is given by

W_2(N(μ_1, Σ_1), N(μ_2, Σ_2))² = |μ_1 − μ_2|² + tr(Σ_1 + Σ_2 − 2(Σ_1^{1/2} Σ_2 Σ_1^{1/2})^{1/2}).   (3.5)

Proof. Equation (3.5) is proved in [17], but we include a sketch of the proof in the simpler case where Σ_1, Σ_2 ≻ 0 (the proof of the general case requires an additional step), for the reader's convenience and because we will employ the intermediate inequality (3.6) below in remark 3.7. We will see that, by taking an appropriate singular value decomposition, the proof presented in the aforementioned paper can be slightly simplified. The key idea is to notice that the covariance matrix of the optimal transference plan (a probability measure on R^d × R^d) must have a block form with off-diagonal block X, and that the squared Wasserstein distance is given by |μ_1 − μ_2|² + tr(Σ_1 + Σ_2 − 2X). Using Schur's complement, and denoting the squared Wasserstein distance on the left-hand side of (3.5) by W² for short, we deduce a maximisation problem over X. (The infimum is attained because the admissible set is compact.) By polar decomposition of Σ_1^{-1/2} X, it is possible to write X = Σ_1^{1/2} Q S^{1/2}, for an orthogonal matrix Q and a symmetric positive-semidefinite matrix S^{1/2}. Since Q does not appear in the constraint, and since tr(X) = tr(Q S^{1/2} Σ_1^{1/2}), we are reduced to a maximisation over all symmetric positive-semidefinite matrices S. Here we employed that tr(D) = tr((V_2^T D² V_2)^{1/2}) = tr((Σ_1^{1/2} S Σ_1^{1/2})^{1/2}). Since the matrix square root is monotone over the cone of positive semidefinite matrices, and since clearly Σ_1^{1/2} S Σ_1^{1/2} ≼ Σ_1^{1/2} Σ_2 Σ_1^{1/2} on the set of admissible S (that is, congruence preserves the order ≼), we conclude that the optimum is reached when S = Σ_2, which leads to (3.5). Considering an explicit transportation map T, we notice that the lower bound is in fact attained for Gaussian densities. Indeed, it is simple to check that T#N(μ_1, Σ_1) = N(μ_2, Σ_2) and, by a change of variable, that the cost of T equals the right-hand side of (3.5), where we employed the notation g_{μ,Σ} = g(•; μ, Σ) for short.
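Both the closed-form expression and the transport map from the proof sketch can be checked numerically. The sketch below assumes the standard form of the optimal map between Gaussians, T(x) = μ_2 + M(x − μ_1) with M = Σ_1^{-1/2}(Σ_1^{1/2} Σ_2 Σ_1^{1/2})^{1/2} Σ_1^{-1/2} (the means and covariances are arbitrary stand-ins):

```python
import numpy as np

def psd_sqrt(S):
    # Symmetric square root of a symmetric positive semi-definite matrix.
    w, V = np.linalg.eigh(S)
    return V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

rng = np.random.default_rng(4)
d = 2
mu1, mu2 = np.zeros(d), np.array([1.0, -1.0])
X1, X2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
S1, S2 = X1 @ X1.T + np.eye(d), X2 @ X2.T + np.eye(d)

# Optimal map T(x) = mu2 + M (x - mu1), M = S1^{-1/2} (S1^{1/2} S2 S1^{1/2})^{1/2} S1^{-1/2}.
r1 = psd_sqrt(S1)
r1_inv = np.linalg.inv(r1)
M = r1_inv @ psd_sqrt(r1 @ S2 @ r1) @ r1_inv

# T pushes N(mu1, S1) forward to N(mu2, S2): M S1 M^T must equal S2.
assert np.allclose(M @ S1 @ M.T, S2)

# Expected transport cost of T for Gaussian input:
# E|x - T(x)|^2 = |mu1 - mu2|^2 + tr((I - M) S1 (I - M)^T),
# which must match the closed-form squared distance (3.5).
I = np.eye(d)
cost = np.sum((mu1 - mu2) ** 2) + np.trace((I - M) @ S1 @ (I - M).T)
w2_sq = np.sum((mu1 - mu2) ** 2) + np.trace(S1 + S2 - 2.0 * psd_sqrt(r1 @ S2 @ r1))
assert np.isclose(cost, w2_sq)
```

The matching of the two quantities reflects the fact, used in the proof, that the lower bound in the optimisation over transference plans is attained by a deterministic (linear) map in the Gaussian case.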

Lemma 3.5 (Bounds on the Wasserstein distance between Gaussians).
Consider two Gaussians N(μ_1, Σ_1) and N(μ_2, Σ_2). Denoting the Wasserstein distance between them by W for convenience, the two-sided bound (3.7) holds. Proof. The first inequality in (3.7) can be rewritten as an inequality between traces, where s_j(•) is the jth singular value and |•|_{S_1} denotes the Schatten matrix norm with p = 1, defined as the sum of the (all positive) singular values of its argument. This inequality follows from the general arithmetic mean/geometric mean inequality, valid for any unitarily invariant matrix norm and any positive matrices in place of Σ_1^{1/2} and Σ_2^{1/2}, which is the subject of [3]. To obtain the second inequality in (3.7), we employ the standard Araki-Lieb-Thirring inequality with r = 1/2 and q = 1, which concludes the proof.
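The Araki-Lieb-Thirring step can be illustrated numerically. The sketch below assumes the second inequality in (3.7) takes the form W² ≤ |μ_1 − μ_2|² + |Σ_1^{1/2} − Σ_2^{1/2}|_F² (our reading of the elided display; it follows from ALT with r = 1/2 and q = 1, since tr((Σ_1^{1/2} Σ_2 Σ_1^{1/2})^{1/2}) ≥ tr(Σ_1^{1/2} Σ_2^{1/2})). The data are arbitrary stand-ins:

```python
import numpy as np

def psd_sqrt(S):
    # Symmetric square root of a symmetric positive semi-definite matrix.
    w, V = np.linalg.eigh(S)
    return V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

rng = np.random.default_rng(5)
d = 4
mu1, mu2 = rng.standard_normal(d), rng.standard_normal(d)
X1, X2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
S1, S2 = X1 @ X1.T + 0.1 * np.eye(d), X2 @ X2.T + 0.1 * np.eye(d)

r1, r2 = psd_sqrt(S1), psd_sqrt(S2)
# Closed-form squared distance (3.5).
w2_sq = np.sum((mu1 - mu2) ** 2) + np.trace(S1 + S2 - 2.0 * psd_sqrt(r1 @ S2 @ r1))

# Assumed ALT-based upper bound: W^2 <= |mu1 - mu2|^2 + |S1^{1/2} - S2^{1/2}|_F^2.
bound = np.sum((mu1 - mu2) ** 2) + np.linalg.norm(r1 - r2, "fro") ** 2
assert w2_sq <= bound + 1e-10
```

The gap between the two sides comes entirely from replacing the sum of singular values of Σ_2^{1/2} Σ_1^{1/2} by the trace of the product of the square roots.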
where C is a constant that depends only on the dimension d and α(t) is given by (2.5).
Proof. Combining the moment bounds (3.4) and (2.9b) with (3.7), and denoting the Wasserstein distance on the left-hand side of (3.8) by W for short, we obtain a first estimate. Employing lemma A.2, which generalises the inequality to symmetric positive semi-definite matrices, and using (3.7) again, we finally obtain a bound which leads to our claim.
To prove a more general stability result, we will combine the ideas of propositions 3.1 and 3.6. Additionally, we will need the following lemma.

Lemma 3.7 (Wasserstein distance between linearly transformed densities). Let
A, B ∈ R^{d×d} be nonsingular, possibly nonsymmetric matrices, and let f be a probability measure with finite second moment, f ∈ P_2(R^d). Then W_2(A#f, B#f) is controlled in terms of |A − B|_2 and of M(f) and C(f), the first and second moments of f. Proof. Let us consider the transference plan γ = (A × B)#f, which clearly has the required marginals. (Here A × B is the operator x → (Ax, Bx).) We calculate, by a change of variable, the cost of this plan, which directly leads to the conclusion.
Proof. Let us denote the fundamental matrices associated with the two solutions by U_i(s, t), i = 1, 2. Our starting point will be (3.2), rewritten in such a way that the Gaussian densities are centred at zero, giving (3.11). Introducing new functions f̃^i(u, t) := f^i(u + u_0, t) and f̃_0^i(u) := f_0^i(u + u_0) for convenience, we obtain a simpler expression, so we directly obtain (3.12) without the first term on the right-hand side.
Remark 3.5. Proposition 3.8 can be proved with a probabilistic approach too, although with slightly different constants on the right-hand side. Since the probabilistic proof is very similar in spirit to the one given above, we will not present it here.
Remark 3.6. Strictly speaking, proposition 3.8 is not a generalisation of proposition 3.6, because the constant on the right-hand side of (3.10) contains the term m⁴M⁴, which was not present in (3.8).
As mentioned in the introduction, when σ > 0 proposition 3.8 implies an equilibration estimate: taking f_0^2 to be the Gaussian equilibrium f_∞ in (3.10), we obtain (3.13), for a constant depending only on σ and on the first and second moments of f_0^1 and f_∞. In the case σ = 0, however, proposition 3.8 does not yield a rate of convergence to the equilibrium Dirac at u_0, because the constant prefactor in (3.10) diverges to +∞ as |C_1(0)|_2 → 0 or |C_2(0)|_2 → 0. In that case, an estimate can be obtained more directly from the equation for the moments; combining this with (2.2) and (2.4), we obtain the equilibration estimate (3.14).

Remark 3.7. Notice that, in contrast to [15, proposition 2], the rate of decay in (3.13) does not depend on B, i.e., on the Hessian of Φ_R. More importantly, the rate of decay we obtain is sharp.
In order to check this, note that by (2.6) the mean δ(t) satisfies the same evolution equation for any σ ≥ 0, so its decay is governed by the eigenvalues of C(t)B^{-1}, where λ_min(•) and λ_max(•) denote the minimum and maximum eigenvalues, respectively. By the assumptions (2.8a) and (2.8b) and by (2.7), it holds that λ_max(C(0)) ≤ M and λ_min(C(t)) ≥ 1/m, and thus |δ(t)| cannot decay faster than the rates in (3.13) and (3.14). On the other hand, since the first inequality of (3.7) in lemma 3.5 holds for general probability measures, it holds for any solution f_t of (1.1), with f_∞ being the Gaussian equilibrium; consequently, the Wasserstein distance is bounded below in terms of |δ(t)|. This shows that (3.13) and (3.14) are optimal, in the sense that it is not possible to obtain a better rate of convergence with respect to time on the right-hand side of these estimates.
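The algebraic decay discussed in this optimality remark is transparent in the scalar case. Taking the one-dimensional σ = 0 moment system suggested by (2.1a) and (2.1b), namely ċ = −2c²/b and δ̇ = −(c/b)δ (an illustrative reduction, with b, c_0, δ_0 arbitrary), integration gives c(t) = c_0 b/(b + 2c_0 t) and δ(t) = δ_0 (1 + 2c_0 t/b)^{-1/2}, i.e. exactly the t^{-1/2} rate:

```python
import numpy as np

# Scalar (d = 1) moment system with sigma = 0:
#   c'(t) = -2 c(t)^2 / b,   delta'(t) = -(c(t) / b) delta(t).
b, c0, delta0 = 2.0, 1.5, 1.0

def rhs(z):
    c, delta = z
    return np.array([-2.0 * c * c / b, -(c / b) * delta])

# Classical RK4 integration up to time T.
T, n_steps = 50.0, 50000
dt = T / n_steps
z = np.array([c0, delta0])
for _ in range(n_steps):
    k1 = rhs(z)
    k2 = rhs(z + 0.5 * dt * k1)
    k3 = rhs(z + 0.5 * dt * k2)
    k4 = rhs(z + dt * k3)
    z = z + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
c_T, delta_T = z

# Closed forms: c(t) = c0 b / (b + 2 c0 t),  delta(t) = delta0 (1 + 2 c0 t / b)^{-1/2}.
assert np.isclose(c_T, c0 * b / (b + 2 * c0 * T), rtol=1e-6)
assert np.isclose(delta_T, delta0 * (1 + 2 * c0 * T / b) ** -0.5, rtol=1e-6)
```

The covariance decays like t^{-1} while the mean decays only like t^{-1/2}, which is why no better rate can appear on the right-hand side of (3.14).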
Remark 3.8. Finally, we remark that, in the case σ = 0, the convergence to the equilibrium Dirac at u_0 could also have been derived using an approach similar to that in [15], but this does not provide an optimal bound. Defining E(f_t) := ∫_{R^d} |u − u_0|² f_t(u) du and using (2.2), this leads to a differential inequality for E(f_t); consequently, we obtain E(f_t) ≤ (2t + 1)^{−1/(mM)} E(f_0).
Since W_2(f_t, δ_{u_0})² = ∫_{R^d} |u − u_0|² f_t(u) du, we deduce from this an equilibration estimate in Wasserstein distance. Since mM ≥ 1, the convergence rate obtained is not as sharp as that of (3.14).
To complete the second part, we must show that there exist constants C_1 and C_2 such that the two inequalities below hold. The first inequality is proved in [1, lemma C.1]. The second inequality follows after taking the supremum (over the sphere |x| = 1) in the following equation, where we employ the triangle inequality. This completes the proof.
Using the same trick of passing to the equivalent distance d(•, •), we can show the following.
Lemma A.2. Let M 1 , M 2 be symmetric, positive-semidefinite matrices in R d×d . It holds that for a constant C that depends only on d.
Proof. In one dimension, the statement follows from an elementary identity. We can then pass to d(•, •) as follows, where S := {x : |x| = 1, x^T (M_1 + M_2) x > 0}; this leads to the statement after rearranging.