Bias and Consistency in Three-way Gravity Models

We study the incidental parameter problem for the ``three-way'' Poisson {Pseudo-Maximum Likelihood} (``PPML'') estimator recently recommended for identifying the effects of trade policies and in other panel data gravity settings. Despite the number and variety of fixed effects involved, we confirm PPML is consistent for fixed $T$ and we show it is in fact the only estimator among a wide range of PML gravity estimators that is generally consistent in this context when $T$ is fixed. At the same time, asymptotic confidence intervals in fixed-$T$ panels are not correctly centered at the true point estimates, and cluster-robust variance estimates used to construct standard errors are generally biased as well. We characterize each of these biases analytically and show both numerically and empirically that they are salient even for real-data settings with a large number of countries. We also offer practical remedies that can be used to obtain more reliable inferences of the effects of trade policies and other time-varying gravity variables, which we make available via an accompanying Stata package called ppml_fe_bias.


Introduction
Despite intense and longstanding empirical interest, the effects of bilateral trade agreements on trade are still considered highly difficult to assess. As emphasized in a recent practitioner's guide put out by the WTO (Yotov, Piermartini, Monteiro, and Larch, 2016), many current estimates in the literature suffer from easily identifiable sources of bias (or "estimation challenges"). This is not for a lack of awareness. Papers showing leading causes of bias in the gravity equation are often among the most widely celebrated and cited in the trade field, if not in all of Economics. 1 In particular, it is now generally accepted that trade flows across different partners are interdependent via the network structure of trade (the main contribution of Anderson and van Wincoop, 2003), that log-transforming the dependent variable is not innocuous (as argued by Santos Silva and Tenreyro, 2006), and-most relevant to the context of trade agreements-that earlier, puzzlingly small estimates of the effects of free trade agreements were almost certainly biased downwards by treating them as exogenous (Baier and Bergstrand, 2007).
As a consequence-and aided by some recent computational developments-researchers seeking to identify the effects of trade agreements have naturally moved towards more advanced estimation strategies that take on board all of the above concerns. 2 In particular, a "three-way" fixed effects Poisson Pseudo-Maximum Likelihood ("FE-PPML") estimator with time-varying exporter and importer fixed effects to account for network dependence and time-invariant exporter-importer ("pair") fixed effects to address endogeneity has recently emerged as a logical workhorse method for empirical trade policy analysis. 3 It also 1 For some context, if we start citation counts in 2003, Anderson and van Wincoop (2003) and Santos Silva and Tenreyro (2006) are, respectively, the most cited articles in the American Economic Review and the Review of Economics and Statistics. Paling only slightly in this exclusive company, Baier and Bergstrand (2007) is the 4th most-cited article in the Journal of International Economics, having gathered "only" 2,500 citations. Readers familiar with these other papers will also likely be familiar with Helpman, Melitz, and Rubinstein (2008)'s work on the selection process underlying zero trade flows, an issue we do not take up here.
2 Larch, Wanner, Yotov, and Zylkin (2019), Correia, Guimarães, and Zylkin (2020), and Stammann (2018) describe algorithms that enable fast estimation of the three-way models considered here. 3 Pair fixed effects are of course no substitute for good instruments. However, instruments for trade policy changes which are also exogenous to trade are understandably hard to come by. As discussed in Head and Mayer (2014)'s essential handbook chapter on gravity estimation, pair fixed effects have the advantage that the effects of trade agreements and other trade policies are identified from time-variation in trade within pairs. Causal interpretations follow if standard "parallel trend" assumptions are satisfied.
has clear potential application to the study of network data more generally, such as data on urban commuting or migration (e.g., Brinkman and Lin, 2019;Gaduh, Gracner, and Rothenberg, 2020;Allen, Dobbin, and Morten, 2018;Beverelli and Orefice, 2019).
However, one reason why some researchers may hesitate in embracing this estimator is the current lack of clarity regarding how the three fixed effects in the model may bias estimation, especially in the standard "fixed T " case where the number of time periods is small. Even though FE-PPML estimates can be shown to be asymptotically unbiased with a single fixed effect (a well-known result) as well as in a two-way setting where both dimensions of the panel become large (Fernández-Val and Weidner, 2016), the latter result does not come strictly as a generalization of the former one, leaving it potentially unclear whether a three-way model with a fixed time dimension should be expected to inherit the nice asymptotic properties of these other models.
Accordingly, the question we investigate in this paper is the extent to which the threeway FE-PPML estimator is affected by incidental parameter problems (IPPs). As is well known in both statistics and econometrics (Neyman and Scott, 1948;Lancaster, 2002), IPPs arise when estimation noise from estimates of fixed effects and other "incidental parameters" contaminates the scores of the main parameters of interest, inducing bias. In the worst case, this bias renders the estimates inconsistent, making estimation inadvisable.
As we will show, while inconsistency is actually not a problem for three-way FE-PPML, both the estimated coefficients and standard errors are affected by meaningful biases due to IPPs that researchers should be aware of.
To state our main results more precisely, in gravity settings where the number of countries (N ) goes to infinity and T is small, we find the following: 1. Consistency of point estimates of FE-PPML: The point estimates produced by threeway FE-PPML estimator in gravity settings are asymptotically consistent.

Inconsistency of other FE-PML estimators: FE-PPML is the only estimator in a set
of related FE-PML estimators sometimes considered in this context that is generally consistent. FE-Gamma PML, for example, should not be used because it is only consistent under strict assumptions.
3. Asymptotic bias: Point estimates of the three-way FE-PPML estimator are nonetheless asymptotically biased, meaning that the asymptotic distribution of the estimates is not centered at the truth as N → ∞. In other words, it approaches the truth "at an angle" asymptotically (see Figure 1 for an illustration.) 4. Biased standard error estimates: Estimates of cluster-robust sandwich-type standard errors are likewise asymptotically biased due to an IPP.

Bias corrections improve inferences: Simulations show that using analytical bias
corrections to address each of these biases leads to improved inferences. These corrections are available to use via the Stata package ppml_fe_bias.
To first explain our consistency results, our basic strategy involves using the first-order conditions of FE-PPML to "profile out" (solve for) the pair fixed effect terms from the firstorder conditions of the other parameters. Notably, this allows us to re-express the threeway gravity model as a two-way model in which the only remaining incidental parameters are the exporter-time and importer-time fixed effects. Three-way FE-PPML is therefore consistent in fixed-T settings for largely the same reasons the two-way models considered in Fernández-Val and Weidner (2016) are consistent, and we provide suitably modified versions of the regularity conditions and consistency results established by Fernández-Val and Weidner (2016) for the simpler two-way case.
At the same time, it does not also follow that Fernández-Val and Weidner (2016)'s earlier results for the asymptotic unbiased-ness of the two-way FE-PPML estimator similarly carry over to the three-way case when T is fixed. The key is that the resulting two-way estimator that is obtained after profiling out the fixed effects has its own special properties with respect to IPPs. When T is fixed, the estimation noise in the remaining exporter-time and importer-time fixed effects induces an asymptotic bias of order 1/N as N grows large, a result that is broadly consistent with most of the two-way settings studied in Fernández-Val and Weidner (2016). However, when instead both N and T grow large at the same rate, the estimator turns out to be unbiased asymptotically, analogous to what Fernández-Val and Weidner (2016) found in the two-way FE-PPML case.
The reason why asymptotic bias is a concern is that the asymptotic standard deviation is itself of order 1/(N √ T ). Thus, when T is fixed, both the bias in point estimates will be of comparable magnitude to their standard errors as N → ∞, causing the asymptotic distribution of estimates to be incorrectly centered as discussed above. In practice, this is a less severe problem than inconsistency, but it does mean that standard hypothesis tests for assessing statistical significance are not reliable. One of the objectives of this paper will be to adapt some of the leading remedies from the recent literature on "large T " IPPs (see, e.g., Arellano and Hahn, 2007) in order to re-center the asymptotic distribution of estimates and thereby restore asymptotically valid inferences. 4 The bias in the estimated standard errors is similar to one that has been found in two-way gravity settings by several recent studies (Egger and Staub, 2015;Jochmans, 2016;Pfaffermayr, 2019Pfaffermayr, , 2021. Intuitively, because the origin-time and destination-time fixed effects in the model each converge to their true values at a rate of only 1/ √ N (not 1/N ), the cluster-robust sandwich estimator for the variance has a leading bias of order 1/N (not 1/N 2 ), and standard errors in turn have a bias of order 1/ √ N . This latter type of bias is related to the general result that standard "heteroskedasticity-robust" variance estimators are downward-biased in small samples (see, e.g., MacKinnon and White, 1985; Imbens and Kolesar, 2016), including for PML estimators (Kauermann and Carroll, 2001), but is more severe in this setting due to an IPP. We should therefore be concerned that estimated confidence intervals may be too narrow in addition to being off-center.
For the bias in point estimates, we construct two-way analytical and jackknife bias corrections inspired by the corrections proposed in Fernández-Val and Weidner (2016;. For the bias in standard errors, we show how Kauermann and Carroll (2001)'s method for correcting the PML sandwich estimator may be adapted to the case of a conditional estimator with multi-way fixed effects and cluster-robust standard errors.
Our simulations confirm that these methods are usually effective at improving inferences.
The jackknife correction reduces more of the bias in point estimates than the analytical correction in smaller samples, but the analytical correction does a better job at improving coverage, especially when also paired with corrected standard errors.
For our empirical applications, we first estimate the average effects of a free trade agreement (FTA) on trade for a range of different industries using what would typically be considered a large trade data set, with 167 countries and 5 time periods. The biases we uncover vary in size across the different industries, but are generally large enough to indicate that our bias corrections should be worthwhile in most three-way gravity settings. For aggregate trade data (which yields results that are fairly representative), the estimated coefficient for FTA has an implied downward bias about 15%-22% of the estimated standard error, and the implied downward bias in the standard error itself is 4 The new literature on "large T " asymptotic bias in nonlinear FE models has emerged as a recent response to the well-known "fixed T " consistency problem first described in Neyman and Scott (1948).
Examples include Phillips and Moon (1999), Hahn and Kuersteiner (2002), Lancaster (2002), Woutersen (2002), Alvarez and Arellano (2003), Carro (2007), Arellano and Bonhomme (2009) about 11% of the original standard error. As a means of further demonstration, we also apply our corrections to replication data from several recent papers that have used threeway gravity models. This latter exercise reveals several instances in which our methods make a material difference for assessing statistical significance. It also highlights the possibility that the bias in standard errors can sometimes be severe, as much as 40% or more in some cases.
Aside from Fernández-Val and Weidner (2016)'s work on two-way nonlinear models, Pesaran (2006), Bai (2009), Hahn and Moon (2006), and Moon and Weidner (2017) have each conducted similar analyses for two-way linear models with interacted individual and time fixed effects. Turning to three-way models, Hinz, Stammann, and Wanner (2019) have recently developed bias corrections for dynamic three-way probit and logit models based on asymptotics suggested by Fernández-Val and Weidner (2018) where all three panel dimensions grow at the same rate. Though widely applicable, this approach is not appropriate for our setting because of the different role played by the time dimension when the estimator is FE-PPML. 5 In the network context, Graham (2017), Dzemski (2018), and Chen, Fernández-Val, and Weidner (2019) have studied asymptotic bias in network models with node-specific (possibly sender-and receiver-specific) fixed effects.
Chen, Fernández-Val, and Weidner (2019)'s analysis is espeically notable in that they allow these node-specific effects to be vectors rather than scalars, similar to the exportertime and importer-time fixed effects that feature in gravity models. Our bias expansions substantially differ from those of Chen, Fernández-Val, and Weidner (2019) because the equivalent outcome variable in our setting (trade flows observed over time for a given pair) is also a vector rather than a scalar and because we work with a conditional moment model where the distribution of the outcome may be misspecified.
In what follows, Section 2 first provides a discussion of why IPPs are a concern for gravity models and of the no-IPP properties of FE-PPML. Section 3 then establishes bias and consistency results for the three-way gravity model specifically and discusses how to implement bias corrections. Sections 4 and 5 respectively present simulation evidence and empirical applications. Section 6 concludes, and an Appendix adds further simulation results and technical details, including proofs.
5 Also related are the GMM-based differencing strategies for two-way FE models proposed by Charbonneau (2017) and Jochmans (2016). These strategies rely on differencing the data in such as way that the resulting GMM moments do not depend on any of the incidental parameters. In principle, these methods could be extended to allow for differencing across a time dimension as well in a three-way panel.

Gravity Models and IPPs
Gravity models are now routinely estimated using FE-PPML with multiple sets of fixed effects. As we discuss in this section, these practices follow naturally from the gravity model's theoretical microfoundations but are not without need for further scrutiny. In particular, because the underlying model is nonlinear, it is important to clarify that, while PPML is known to be free from incidental parameter bias in some special cases, it is by no means immune to IPPs in general. It will also be useful for us to provide some general discussion of IPPs and the different ways in which they may manifest.

Fixed Effects and Gravity Models
As documented in Head and Mayer (2014), the emergence of rich and varied theoretical foundations for the gravity equation has fueled a "fixed effects revolution" in the gravity literature over the last two decades. As such, we find it useful to briefly describe a simple trade model and discuss how it may be used to motivate an estimating equation with either two-way or three-way fixed effects.
To establish some notation we will use throughout the paper, we will consider a world with N countries and we will let i and j respectively be indices for exporter and importer.
For now, we will focus on deriving a two-way gravity model where the two fixed effects account for each country's multilateral resistance. Later, we will add a time dimension and a third fixed effect that absorbs all time-invariant components of trade costs.
To add some theoretical structure, suppose that trade flows are given by the following gravity equation: Here, y i := j y ij and y j := i y ij are the market sizes of the two countries, τ ij ≥ 1 is a bilateral trade cost, θ > 0 is the trade elasticity, and Π i and P j respectively are the outward and inward multilateral resistances from Anderson and van Wincoop (2003), which capture how bilateral trade flows depend on each country's opportunities for trade with third countries. More formally, these latter terms are derived from the following two relationships that are inherent to all general equilibrium gravity models: As shown, these terms respectively aggregate the exporter's ability to export goods to more desirable import markets and the importer's ability to import from more capable exporters. 6 For estimation, it is typical to parameterize the trade cost τ ij as depending exponentially on some variables of interest, i.e., where x ij are the components of trade costs whose effects we wish to estimate. Because not all trade costs are reflected in x ij , we also allow for "unobserved" trade costs via the idiosyncratic trade cost term ω ij . Combining (1) with (3) then delivers the following estimating equation: where α i = ln(y i /Π −θ i ) and γ j = ln(y j /P −θ j ) are origin and destination fixed effects that absorb market sizes and multilateral resistances and ω ij now provides a multiplicative error term. 7 When we introduce the three-way model, all of the terms shown in (4) will have a further subscript for time, and the the unobserved trade cost will have a time-invariant component that will motivate the use of an added ij fixed effect.
To motivate the arc of the rest of the paper, several points stand out from the estimation suggested by (5). First, the implied moment condition for estimation is which follows after imposing E(ω ij |x ij ,α i ,γ j ) = 1. 8 As discussed in Santos Silva and Tenreyro (2006), consistent estimation of the trade cost parameters in β therefore generally 6 This presentation of the gravity model readily conforms to the trade models used in Eaton and Kortum (2002) or Anderson and van Wincoop (2003), though the interpretation of θ differs across the two models.
With some minor modifications, this setup can also be made compatible with any of the theoretical gravity models considered in Head and Mayer (2014) or Costinot and Rodríguez-Clare (2014). To be clear, our econometric results do not require any particular microfoundation for the gravity equation. 7 Alternatively, it is sometimes common to write trade costs as a log-linear function, i.e, ln τ ij = x ij β + e ij , with e ij now reflecting unobserved (log) trade costs. Interestingly, these two ways of specifying the error term do not necessarily have equivalent implications for estimation. In the log-linear formulation, if the log-error term e ij is assumed to be heteroskedastic with mean zero, Jensen's inequality implies that estimation in levels will be biased and log-OLS will be consistent. 8 Note that consistent estimation of β actually does not require E(ω ij |·) = 1 in this case but rather where ω i and ω j could be country-specific components of unobserved trade costs that would be absorbed by the fixed effects. For the three-way model, one requires E(ω ijt |·) = ω it ω jt ω ij . requires a nonlinear model. Second, because unobserved trade costs enter the countryspecific terms α i and γ j through the system of multilateral resistances, we treat α i and γ j as unknown parameters that will be noisily estimated, raising concerns about a possible IPP. 9 As we go on to discuss, the FE-PPML estimator that is most often used in this context has some special robustness against IPPs, but this robustness does not hold for FE-PPML in general, especially once we deviate from the two-way gravity setting implied by (5).

The Incidental Parameter Problem
In the context of fixed effects models, IPPs occur when the estimation noise in the fixed effects contaminates the scores of the other parameters being estimated, inducing a bias.
This bias can manifest in a variety of different ways; thus it is useful to provide a generic characterization that can illustrate the different possibilities that may arise. To that end, let n be the total number of observations and let p be the total number of parameters being estimated, inclusive of any fixed effects. As described in Fernández-Val and Weidner (2018), what we need to be concerned with the number of observations that are available to estimate each fixed effect, i.e., n/p. More precisely, when appropriate regularity conditions are satisfied, the estimated β may generally be thought of as having the following bias and standard deviation: where b ∈ R and c > 0 are constants that depend on the model being estimated. As this presentation emphasizes, all estimators in nonlinear settings are generally biased in small samples, but this is not the same thing as saying that the bias always poses a problem for inferring statistical significance. As we will discuss, in larger samples, what matters is whether the bias disappears faster than the standard error as n → ∞. 10 9 As demonstrated in Pfaffermayr (2021), if we assume (i) there are no unobserved trade costs (such that ω ij does not enter the system in (2)) and (ii) the aggregate quantities y i and y j are perfectly observed, then it is better to regard α i and γ j as reflecting constraints rather than as incidental parameters. Since Fally (2015) shows that constrained PPML and FE-PPML produce the same estimates in this context, there is no concern about IPPs if these assumptions are met. 10 In the panel data literature, these results for the bias and standard deviation are usually derived not for β directly, but for the asymptotic distribution of β (because there are cases where β may not have a first or second moment, but nevertheless has a well-defined limiting distribution with finite moments).
We ignore this distinction for our heuristic discussion here.
To provide a simple taxonomy of the cases that can arise, consider first the standard textbook treatment of maximum likelihood estimation, where we usually have p fixed while n → ∞. In this case, the bias in β becomes asymptotically negligible as compared to the standard deviation, which crucially means that estimated confidence intervals can be expected to be centered at the truth when the data becomes sufficiently large. By contrast, in the classical IPP of Neyman and Scott (1948), the number of parameters grows at the same rate as the number of observations, implying that the bias does not converge to zero asymptotically. In that case, the fixed effect estimator is inconsistent.
The gravity model with two-way fixed effects then serves to illustrate a third possibility that will also be applicable to the results that follow for three-way gravity models. In the two-way gravity setting, p is on the order of 2N , where N is the number of countries, and n is on the order of N 2 . Consequently, the bias and standard deviation of β are given by In this case, as N → ∞, the estimated β is consistent (both the standard deviation and bias converge to zero), but we also have As we discuss below, the two-way PPML gravity estimator is a special case where we actually have that b = 0. However, for other two-way gravity estimators (such as twoway Gamma PML for example), we generally have that b = 0, meaning the bias will not disappear relative to the standard error as N → ∞. Compared to Neyman and Scott (1948), the IPP these estimators suffer from is not an inconsistency problem but rather an asymptotic bias problem, whereby the slow convergence of the fixed effects causes the asymptotic distribution for β to be incorrectly centered as it converges to the truth. 11 This version of the IPP is more benign, but ignoring the bias will nonetheless result in invalid inferences and test results. The "large T " panel data literature therefore discusses various methods for bias correction of β that restore asymptotically valid inference. Importantly, the degree to which inferences are biased depends on the bias constant b, which cannot be known beforehand without applying such a correction.
11 Consistency here follows from how the number of fixed effects grows only with the square root of the sample size, as discussed in Egger, Larch, Staub, and Winkelmann (2011). The bias in the asymptotic distribution for two-way models was proven by Fernández-Val and Weidner (2016), discussed below.  Figure 1: Kernel density plots of FE PPML estimates for 3 different models, using 500 replications.
Clockwise from top left, the 3 models are: three-way gravity model with y ijt = exp[α it + γ jt + η ij + x ijt β]ω ijt and T = 2. Respectively, these models exemplify inconsistency, no asymptotic bias, and asymptotic bias. The i and j dimensions of the panel both have size N in the latter two models. The true value of β is 1 (indicated by the vertical dotted lines) and the data is generated using Var(y|·) = E(y|·).
To provide a more visual illustration of these ideas, Figure 1 presents simulation results for the three cases we have just discussed: inconsistency (top-left), no asymptotic bias (top-right), and asymptotic bias (bottom-left). Since asymptotic bias will ultimately be our focus, it is worth noting from the figure how the estimates are consistent in this case-the distribution will collapse to the true value as N → ∞-but confidence bounds based on these estimates will clearly be inappropriate.

How FE-PPML is Different
Our discussion of IPPs thus far has been for the generic estimation of a nonlinear model.
However, the PPML estimator that is most commonly used to estimate gravity models actually behaves very differently than other estimators in this context. In the classic panel data setting with "one way" fixed effects, for example, FE-PPML has the very special property that the IPP bias constant b turns out to be zero, meaning that it is asymptotically unbiased in situations where other estimators tend to be inconsistent. As discussed in Wooldridge (1999), the reason behind this result is that the same estimator can be obtained from a multinomial model that does not depend on the fixed effects. 12 This special property of FE-PPML has important implications for estimating gravity models as well. For two-way gravity settings, the asymptotic bias of β when N → ∞ was worked out in Fernández-Val and Weidner (2016). They show that the two IPP contributions from α i and γ j "decouple" asymptotically, such that the overall bias can be decomposed as the sum of two bias terms that would be expected in a one-way setting, is important to clarify that a key feature of the two-way gravity model is that both fixed effect dimensions grow only with the square root of the panel size, such that the estimation noise in the estimated α i 's and γ j 's disappears asymptotically. As we discuss in the Appendix, if we instead consider a model where both fixed effects grow with n rather than with its square root, the IPPs associated with each fixed effect do not decouple from one another, and FE-PPML in this case is actually inconsistent.
To synthesize these points, FE-PPML has a very special property-one can condition out one of the fixed effects-but this property has an important limitation-the resulting multinomial model does not inherit the same no-bias properties as the original PPML estimator with respect to any further fixed effects. Both of these results will be fleshed out in more detail in the following section when we recast the three-way gravity model as 12 The earliest references to present versions of this result include Andersen (1970), Palmgren (1981), andHausman, Hall, andGriliches (1984 a two-way multinomial model in order to obtain an appropriate expression for the bias.
Doing so will also allow us to highlight another reason why FE-PPML is not immune to IPPs: even for the two-way gravity model, while the α i and γ j parameters do not induce an IPP bias in β, they nonetheless have implications for the estimated variance that are not innocuous; we thus will devote attention to this issue as well.

Results for the Three-way Gravity Model
To recap the sequence of results just described, we know that FE-PPML estimates with one fixed effect do not suffer from an IPP. We also know that FE-PPML may have an IPP in models with more than one fixed effect, but it is both consistent and asymptotically unbiased in two-way gravity settings where neither fixed effect dimension grows at the same rate as the size of the panel. As we will now show, each of these earlier results will be useful for understanding the more complex case of a three-way gravity model that adds a time dimension and a third set of fixed effects to the above two-way model. We also describe a series of bias corrections for the three-way model, including for the possible downward bias of the estimated standard errors.

Consistency
To formally introduce the three-way model, we add an explicit time subscript t ∈ {1, . . . , T } to y ij , x ij , and ω ij from the prior model and add a "country-pair"-specific fixed effect η ij , such that trade costs are now given by τ −θ ijt = e x ijt β+η ij ω ijt . All other elements in the original trade model likewise acquire a time subscript, meaning that α it = ln y it /Π −θ it and γ jt = ln y jt /P −θ jt also must be indexed by t. The model now reads where the three fixed effects now respectively index exporter-time, importer-time, and country-pair. 14 The unobserved trade cost ω ijt ≥ 0 continues to serve as an error term, such that y ijt = λ ijt ω ijt ≥ 0. We thus allow for zero trade flows. For the asymptotics 14 Note that a multiplicative error term is not necessary to deliver the moment condition in (8). We could instead have an additive error term ε ijt = y ijt − λ ijt . In this case, it would be more natural to think of it as coming from measurement error. In addition, note that we assume the true model is as written in (8) and assume away, e.g., any unobserved heterogeneity in β. Allowing for this type of heterogeneity is an important extension for future work to address.
Our strategy for showing the consistency of this estimator will capitalize on the special properties of FE-PPML discussed in the previous section. In particular, we will exploit the fact that not all of the fixed effect dimensions grow at the same rate as N increases.
The numbers of exporter-time and importer-time fixed effects each increase with N (as before), but the dimension of the pair fixed effect η increases with N 2 , since adding another country to the data adds another N − 1 pairs to the estimation. It therefore makes sense to first "profile out" (i.e., solve for) η so that we may deal with the remaining two fixed effects in turn. For given values of β, α, γ, maximizing over η gives us We therefore have with ij (β, α it , γ jt ) := T t=1 y ijt log µ ijt T s=1 µ ijs + terms not depending on parameters, (11) thus leaving us with the likelihood of a multinomial model where the only incidental parameters are α it and γ jt . Using (8), one can easily verify that there is no bias in the score of the profile log-likelihood ij (β, α it , γ jt ) when evaluated at the true parameters β 0 , α 0 it , and γ 0 jt . The reason for this is exactly the same as for the classic panel data setting discussed above. Furthermore, the remaining fixed effects α it and γ jt grow only with the square root of the sample size as N → ∞, implying that they are consistently estimated. This in turn leads us to the following result: Intuitively, this result follows because of how the special properties of FE-PPML allow us to rewrite the three-way gravity model as a two-way model without introducing a 1/T bias. The form of the bias in β is therefore the same as in (7), such that three-way FE-PPML is consistent as N → ∞ largely for the same reason two-way FE-PPML and other two-way PML gravity estimators are generally consistent. However, in the context of three-way estimators, we can also state a stronger result that applies more narrowly to FE-PPML in particular:

Proposition 2. Assume the conditional mean is given by
and consider the class of "three-way" FE-PML gravity estimators with FOC's given by β: where i, j = 1, . . . , N , t = 1, ..., T, and g( λ ijt ) is an arbitrary function of λ ijt that can be specialized to construct various PML estimators. For example, g( λ ijt ) = 1 delivers PPML, If T is fixed, then for β to be consistent under general assumptions about Var(y|x, α, γ, η), we must have that g(λ ijt ) is constant over the range of λ's that are realized in the data-generating process. That is, the estimator must be equivalent to FE-PPML.
In other words, three-way FE-PPML is unique among three-way PML estimators in that its consistency does not require strong assumptions about the conditional variance of y ijt . To draw an appropriate contrast, it is possible to obtain a closed form solution for the pair fixed effect η ij so long as g( λ ijt ) is of the form g( λ ijt ) = λ q ijt , where q can be any real number. Notably, this latter class of estimators not only includes FE-PPML (for which q = 0), but also includes other popular gravity estimators such as Gamma PML (q = −1) and Gaussian PML (q = 1). However, as we discuss in the Appendix, these other estimators are only consistent if the conditional variance is proportional to λ 1−q ijt , in which case they inherit the properties of their associated MLE estimators.

Asymptotic Bias
Because three-way FE-PPML inherits the consistency properties of the two-way estimator, one might expect that it also inherits its "no asymptotic bias" properties as well. However, this is where the limitations of PPML's no-IPP properties become apparent. While the profile log-likelihood in (10) is now of a similar form to the two-way models considered in Fernández-Val and Weidner (2016), notice that it no longer resembles the original FE-PPML log-likelihood. Their no-bias result for two-way FE-PPML therefore does not carry over to the three-way model, and it is possible to show that FE-PPML estimates have an asymptotic bias in this setting.
Before proceeding, it is helpful to first revisit the intuition established in Section 2.2 that shapes how we expect the bias to behave. As we have discussed, in a model with p parameters and n observations, the bias should be proportional to p/n, whereas the standard error should vary with 1/ √ n. After profiling out η, the resulting two-way model has ∼ 2N T parameters vs. ∼ N 2 T observations. If T is held fixed, we would expect the bias and the standard error to decrease at the same rate (1/N ), raising concerns about a possible asymptotic bias problem and guiding us as to its form. Readers should keep this intuition in mind in reading through the technical details that follow.
To illustrate more precisely where the bias comes from, it is necessary to examine how the estimated fixed effects enter the score for β using a Taylor expansion. To that end, first let φ := vec(α, γ) be a vector that collects all of the exporter-time and importer-time fixed effects, such that we can rewrite ij slightly as ij = ij (β, φ). We can then similarly define the function φ(β) as collecting the estimated fixed effects α and γ as functions of β. Next, we construct a second-order expansion of the score for β around the true set of fixed effects φ 0 and evaluated at the true parameter β 0 : =0 (reflects estimation error in FEs)
This expression is near-identical to a similar expansion that appears in Fernández-Val and Weidner (2016)-differing mainly in that ij is a vector rather than a scalar-and communicates the same essential insights: because the latter two terms in (12) are generally not equal to zero, the score for β is biased, with the bias depending on the interaction between the higher-order partial derivatives of ij and the estimation errors in α i and γ j as well as their variances and covariances.
Demonstrating the bias in this particular setting then requires that we introduce some additional notation, mainly to provide some shorthand for the higher-order partial derivatives of ij that appear in (12). To do so, we first find it convenient to let ϑ ijt := λ ijt / τ λ ijτ . We then define the T × 1 "score" vector S ij , the T × T "Hessian" matrix H ij and the T × T × T cubic tensor G ij (the "third partial"), with their respective elements given by where it should be understood that all of these terms are evaluated at the true values for all parameters. Explicit formulas for G ij,tsr are provided in the Appendix.
The value of defining these objects is that they allow us to easily form terms identified by (12) as being important for the bias of the score. For example, S ij allows us to obtain where we use the convention that G ij x ij,k is a T × T matrix with elements [G ij x ij,k ] st = r G ijrst x ijr,k . We also find it useful to define the expected HessianH ij = E(H ij | x ij ) and, similarly, the expected third partialḠ ij = E(G ij | x ij ). 16 Finally, we define the K-vector x ij as an appropriate two-way within-transformation of x ij that purges it of the fixed effects; see the Appendix for details.
Obtaining a tractable expression for the bias then involves following the logic of (12) and plugging in the just-defined objects S ij , H ij , G ij , and x ij where appropriate. Before doing so, we invoke the assumption that observations are serially correlated within pairs but independent across pairs, as is commonly assumed in the literature (see Yotov, Piermartini, Monteiro, and Larch, 2016.) This assumption turns out to cause the IPPs associated with α i and γ j to "decouple", 17 leading to the following proposition: 16 BecauseH ij is only positive semi-definite (not positive definite), we use a Moore-Penrose pseudoinverse whenever the analysis requires we work with an inverse ofH ij . Specifically, we have thatH ij ι T = 0, where ι T = (1, . . . , 1) is a T-vector of ones. Thus,H ij is only of rank T − 1 rather than of rank T . 17 In particular, all elements of the cross-partial objects where W N and Ω N are K × K matrices given by and B N and D N are K-vectors with elements given by where a † denotes a Moore-Penrose pseudoinverse.
The above proposition establishes the asymptotic distribution of the three-way gravity estimator as N → ∞, including the asymptotic bias (N −1) −1 W −1 N (B N +D N ). Intuitively, this bias can be decomposed as the product of the inverse expected Hessian with respect to β (i.e. W −1 N ), the rate of asymptotic convergence (essentially 1/N ), and the bias of the score from (12), which here is given by the combined term B N + D N . B N reflects the contribution to the bias from the noise in α i , whereas D N reflects the contribution from the noise in γ j . The first terms in both B N and D N come from the second term in (12), be shown to be asymptotically small for N → ∞. Thus, in what follows, B N reflects the contribution of the α i parameters to the bias and D N reflects the contribution of the γ j parameters. As we discuss in the Appendix, relaxing this assumption can change the expression of the bias.
reflecting the estimation error in the estimated fixed effects, and the second terms in B N and D N echo the third term in (12), reflecting their variance.
Thus, in the end, the bias reduces to essentially the same simple formula we gave in Section 2.2 for the bias of the two-way gravity model, i.e., are constants that do not vary with N . Importantly, and unlike in the two-way FE-PPML setting, the three-way model does not give us the no-bias result that B N = D N = 0, as the following discussion helps to illustrate.

Illustrating the Bias using the T = 2 Case
Admittedly, the complexity of the objects that appear in Proposition 3 may make it difficult to appreciate the general point that the three-way estimator is not asymptotically unbiased. One way to make these details more transparent is to focus our attention on the simplest possible panel model where T = 2. The convenient thing about this simplified setting is that the likelihood function ij can be reduced to just a scalar: ij = , and ∆γ j = γ j1 − γ j2 . These normalizations allow us to express all of the objects that appear in Proposition 3 as also just scalars, and we can therefore easily derive the following result:

The bias term B k N in Proposition 3 can then be written as
with an analogous expression also following for D k N .
Two points then stand out based on the above expression: (i) all bias terms in B k N and D k N generally depend on the distribution of y ij and do not depend on it in the same way; (ii) none of these terms generally equals 0. The first of these two observations can be seen from how the bias depends on the expected second moments of y ij (e.g., , marking an important difference from the models that were considered in Fernández-Val and Weidner (2016). 18 Among other things, the difficulty associated with estimating these second moments means that analytical bias corrections may not necessarily offer superior performance to distribution-free method such as the jackknife. The second observation mainly follows from the first. It can also be shown that j =iḠij ∆ x ij = 0, which ensures that the second term can never be zero.

What if T is Large?
While Proposition 3 only focuses on asymptotics where N → ∞, the three-way gravity panel also features a time dimension (T ), and it is interesting to wonder how the above results may depend on changes in T . As we show in the Appendix, large T only makes a difference for the asymptotic order of the bias of β if there is only weak time dependence between observations belonging to the same country pair, in the sense described by Hansen, 2007. 19 We will henceforth assume any time dependence is weak. The following remark then describes some additional asymptotic results for when T is large.

Remark 2.
Under asymptotics where T → ∞, we have the following: goes to zero at a rate of 1/(N T ). Therefore, because the standard error is of order 1/(N √ T ), there is no bias in the asymptotic distribution of β as N and T both → ∞.
To elaborate further, letting T → ∞ is obviously not sufficient for either α or γ to be consistently estimated and does not solve the IPP, as stated in part (i). However, as 18 The specific examples they use are the Poisson model, which is unbiased, and the probit model, which requires the distribution of y ij to be correctly specified. They also provide a bias expansion for "conditional moment" models that allow the distribution of y ij to be misspecified. Beyond this theoretical discussion, bias corrections for misspecified models have yet to receive much attention, however. 19 By "weak" time dependence, we mean that any such dependence dissipates as the temporal distance between observations increases. Alternatively, if observations are correlated regardless of how far apart they are in time, the standard error is always of order 1/N (see Hansen, 2007), and the same will also be true for the asymptotic bias. The latter is arguably a less natural assumption in this context, however.

part (ii) tells us, T still plays an interesting role in conditioning the bias when both N
and T jointly become large. Intuitively, because W −1 N is of order 1/T as T → ∞, whereas B N and D N are both of order 1, the bias in β effectively vanishes at a rate of 1/(N T ) as both N, T → ∞, such that it disappears asymptotically in relation to the order-1/(N √ T ) standard error. However, since T is usually small relative to N in this context, it remains to be seen whether these asymptotic results carry over to practical settings.

Downward Bias in Robust Standard Errors
Of course, even if the point estimates are correctly centered, inferences will still be unreliable if the estimates of the variance used to construct confidence intervals are not themselves unbiased. For PPML, confidence intervals are typically obtained using a "sandwich" estimator for the variance that accounts for the possible misspecification of the model. (2001), the PPML sandwich estimator is generally downward-biased in finite samples. Furthermore, for gravity models (both two-way and three-way), the bias in the sandwich estimator can itself be formalized as a kind of IPP. 20

However, as shown by Kauermann and Carroll
To illustrate the bias of the sandwich estimator in our three-way setting, recall that we can express the variance of β as Var( As is also true for the linear model (cf., MacKinnon and White, 1985;Imbens and Kolesar, 2016), the bias arises because plugin estimates for the "meat" of the sandwich Ω N depend on the estimated score variance E( S ij S ij ) rather than on the true variance E(S ij S ij ). Even though E( S ij S ij ) is a consistent estimate for E(S ij S ij ), it will generally be downwardbiased in finite samples. Notably, this bias may be especially slow to vanish for models with gravity-like fixed effects.
To see this, we follow the same approach as Kauermann and Carroll (2001). Specifically, we use the special case where E(S ij S ij ) = κH ij (such that Ω N = κW N , meaning PPML is correctly specified) to demonstrate that E( S ij S ij ) generally has a downward bias. Under this assumption, it is possible to show that the expected outer product of the fitted score E( S ij S ij ) has a first-order bias of captures the expected Hessian of the concentrated likelihood with respect to α and γ and where The two terms on the right-hand side of (13) are both negative definite, implying that the sandwich estimator is generally downward-biased-and definitively so if the model is correctly specified. Since we work with cluster-robust standard errors, a relevant comparison to draw here is with Cameron, Gelbach, and Miller (2008), who have previously shown that the cluster-robust sandwich estimator has a downward bias that depends on the number of clusters. In our setting, the standard Cameron, Gelbach, and Miller (2008) bias is reflected in the first term on the righthand-side of (13), which captures how the the bias depends on the variance of β. The second term, which arises because of an IPP, captures how much of the bias is due to the variance in the estimated origin-time and destination-time fixed effects in φ. The former term decreases with 1/N 2 -i.e., with the number of pairs/clusters-but the latter term only decreases with 1/N , since increasing N by 1 only adds 1 additional observation of each origin-time and destination-time fixed effect. 22 All together, this analysis implies that the estimated standard error for β will exhibit a bias that only disappears at the relatively slow rate of 1/ √ N . We should therefore be concerned that asymptotic confidence intervals for β may exhibit inadequate coverage even in moderately large samples, similar to what has been found for the two-way FE-PPML estimator in recent simulation studies by Egger and Staub (2015), Jochmans (2016), and Pfaffermayr (2019). Indeed, the bias approximation we have derived in (13) can be readily adapted to the two-way setting or even to more general settings with k-way fixed effects.
21 A detailed derivation of (13) is provided in the Appendix.

Bias Corrections for the Three-way Gravity Model
We now present two methods for correcting the bias in estimates: a jackknife method based on the split-panel jackknife of Dhaene and Jochmans (2015) and an analytical correction based on the expansion shown in Proposition 3. We also provide an analytical correction for the downward bias in standard errors.

Jackknife Bias Correction
The advantage of the jackknife correction is that it does not require explicit estimation of the bias yet still has a simple and powerful applicability. To see this, note first that the asymptotic bias we characterize can be written as where B β is a combined term that captures any suspected asymptotic bias contributions of order 1/N . The specific jacknife we will apply for our current purposes is a split-panel jackknife based on Dhaene and Jochmans (2015). As in Dhaene and Jochmans (2015), we want to divide the overall data set into subpanels of roughly even size and then estimate β (p) for each subpanel p. Given the gravity structure of the model, we first divide the set of countries into evenly-sized groups a and b. We then consider 4 subpanels of the form "(a, b)", where "(a, b)" denotes a subpanel where exporters from group a are matched with importers from group b. The other three subpanels are (a, a), (b, a), and (b, b). For randomly-generated data, we can define a and b based on their ordering in the data (i.e., a := i : i ≤ N/2; b := i : i > N/2). For actual data, it would be more sensible to draw these subpanels randomly and repeatedly. 23 The split-panel jackknife estimator for β, β J N , is then defined as This correction works to reduce the bias because, so long as the distribution of y ij and x ij is homogeneous across both the i and j dimensions of the panel, 24 each β (p) has a leading bias term equal to 2B β /N . The average β (p) across these four subpanels thus also has 23 This is just one possible way to construct a jackknife correction for two-way panels. We have also experimented with splitting the panel one dimension at a time as in Fernández-Val and Weidner (2016), but we find the present method performs significantly better at reducing the bias. 24 By "homogeneity" we mean that the vector (y ij , x ij , α i , γ j ) is identically distributed across both i a leading bias of 2B β /N and any terms depending on B β /N cancel out of (14). Thus, the bias-corrected estimate β J N only has a bias of order o p (N −1 ), which is obtained by combining the second-order bias from β with that of the average subpanel estimate. This latter bias can be shown to be larger than the original second-order bias in (3.4), but the overall bias should still be smaller because of the elimination of the leading bias term.

Analytical Bias Correction
Our analytical correction for the bias is based on the bias expression in Proposition 3. In the Appendix, we show how appropriate sample analogs W N , B N , D N of the expressions for W N , B N , D N can be formed. The resulting bias-corrected estimate is then given by It is possible to show that these plug-in corrections lead to estimates that are asymptotically unbiased as N → ∞. Still, for finite samples, it is evident that the bias in some of these plug-in objects could cause the analytical bias correction to itself exhibit some bias. For this reason, it is not obvious a priori whether the analytical correction will outperform the jackknife at reducing the bias in β. One clear advantage the analytical correction has over the jackknife is that it does not require any homogeneity restrictions on the distribution of y ij and x ij in order to be valid.

Bias-corrected Standard Errors
Under the assumption of clustered errors within pairs, a natural correction for the variance estimate is available based on (13). Specifically, let N . The corrected variance estimate is then given by and j, which is the appropriate translation of Assumption 4.3 in Fernández-Val and Weidner (2016) to our setting. This does allow (y ij , x ij ) to be heterogeneously distributed conditional on the fixed effects in the sense that the fixed effects themselves introduce heterogeneity into the model. Nonetheless, this is a strong assumption. One of the main advantages of the analytical bias correction is that it does not require such assumptions.
The logic of this adjusted variance estimate follows directly from Kauermann and Carroll (2001): if the PPML estimator is correctly specified (such that E(S ij S ij ) = κH ij ), then V U can be shown to eliminate the first-order bias in V ( β − β 0 ) shown in (13). It is not generally unbiased otherwise, but it is plausible that it should eliminate a significant portion of any downward bias under other variance assumptions as well.

Other Practicalities
As we have noted, implementations of our analytical corrections are available via our Stata command ppml_fe_bias. Since this command can be applied to data sets that do not conform exactly to our theoretical framework, we provide here some brief comments on its applicability in general cases that may not be explicitly covered in the notation and formulas used above. For example, it is worth pointing out that our corrections may be used with data sets that have missing values. They also continue to apply if the data includes the diagonal (i = j) terms versus treating them as missing or inapplicable.
Similarly, we can allow for the data to have unequal numbers of exporters and importers.
In the latter case, we need the numbers of exporters and importers to grow at the same rate asymptotically for our results to apply.
One practicality that is not covered in our current framework is the case of "four way" gravity models that have an added index for industry. Though a full analytical characterization of this type of model is left for future work, our Appendix describes how a modified version of the heuristic from Fernández-Val and Weidner (2018) may be used to assess the order of the bias and derive a suitable jackknife correction. Interestingly, this discussion reveals that the asymptotic bias problem may be more severe for four-way models than for three-way models. The heuristic we propose may be used to assess IPPs in other, non-trade settings as well.

Simulation Evidence
For our simulation analysis, we assume the following: (i) the data generating process (DGP) for the dependent variable is of the form y ijt = λ ijt ω ijt , where ω ijt is a log-normal disturbance with mean 1 and variance σ 2 ijt . (ii) β = 1. (iii) The model-relevant fixed effects α, γ, and η are each ∼ N (0, 1/16). (iv) x ijt = x ijt−1 /2 + α + γ + ν ijt , where ν ijt ∼ N (0, 1/16). 25 (v) Taking our cue from Santos Silva and Tenreyro (2006), we consider 4 different assumptions about the disturbance ω ijt : where we also allow for serial correlation within pairs by imposing such that the degree of correlation weakens for observations further apart in time. 26 The relevance of these various assumptions to commonly used error distributions is best described by considering the conditional variance Var(y ijt |x it , α, γ, η). For example, DGP I assumes that the conditional variance is constant, as in a Gaussian process with i.i.d disturbances. In DGP II, the conditional variance equals the conditional mean, as in a Poisson distribution. DGP III-which we will also refer to as "log-homoskedastic"-is the unique case highlighted in Santos Silva and Tenreyro (2006) where the assumption that the conditional variance is proportional to the square of the conditional mean leads to a homoskedastic error when the model is estimated in logs using a linear model. Finally, DGP IV provides a "quadratic" error distribution that mixes DGP II and DGP III and also allows for overdispersion that depends on x ijt . As shown by Santos Silva and Tenreyro (2006), this type of DGP tends to induce a relatively large degree of bias.
Tables 1 and 2 present simulation evidence comparing the uncorrected three-way FE-PPML estimator with results computed using the analytical and jackknife corrections described in Section 3.4. As in the prior simulations, we again compute results for a variety of different panel sizes-in this case for N = 20, 50, 100 and T = 2, 5, 10. 27 In 25 These assumptions on α, γ, η, x ijt , and ν ijt are taken from Fernández-Val and Weidner (2016).
Notice that x ijt is strictly exogenous with respect to ω ijt conditional on α, γ, and η. 26 The 0.3 that appears here serves as a quasi-correlation parameter. Replacing 0.3 with 1 would be analogous to assuming disturbances are perfectly correlated within pairs. Replacing it with 0 removes any serial correlation. Choosing other values for this parameter produces similar results. 27 Note that the trade literature currently recommends using wide intervals of 4-5 years between time periods so as to allow trade flows time to adjust to changes in trade costs (see Cheng and Wall, 2005.) Thus, for practical purposes, T = 10 may be thought of as a relatively "long" panel in this context that might span 40+ years. IPPs do not necessarily vanish for larger values of T , as discussed further below.
order to validate our analytical predictions regarding these estimates, we compute the average bias of each estimator, the ratios of the average bias to the average standard error and of the average standard error to the standard deviation of the simulated estimates, and the probability that the estimated 95% confidence interval covers the true estimate of β = 1. In particular, we expect that the bias in β should be decreasing in either N or T but should remain large relative to the estimated standard error and induce inadequate coverage for small T . We are also interested in whether the usual cluster-robust standard errors accurately reflect the true dispersion of estimates. Results for DGPs I and II are shown in Table 1, whereas Table 2 shows results for DGPs III and IV.
The results in both tables collectively confirm the presence of bias and the viability of the analytical and jackknife bias corrections. The average bias is generally larger for DGPs I and IV than II and III. As expected, it generally falls with both N and T across all the different DGPs, though only weakly so for DGP III (the log-homoskedastic case), which generally only has a small bias. 28 To use DGP II-the Poisson case, where PPML should otherwise be an optimal estimator-as a representative example, we see that the average bias falls from 3.840% for the smallest sample where N = 20, T = 2 to a low of 0.249% at the other extreme where N = 100, T = 10. For DGP IV, the least favorable of these cases, the average bias ranges from −6.1364% down to −1.878%. On the whole, these results support our main theoretical findings that β should be consistently estimated even for fixed T but has an asymptotic bias that depends on the number of countries and on the number of time periods.
Interestingly, while the average bias almost always decreases with T , the ratio of the bias to standard error usually does not, seemingly contrary to the expectations laid out in Remark 2. Evidently, increasing T does not automatically reduce the bias at a rate of 1/T . As we discuss in more detail below, researchers should thus be careful to note that the implications of Remark 2 do not necessarily apply to settings with small T or even moderately large T . Furthermore, the estimated cluster-robust standard errors themselves clearly exhibit a bias in all cases as well. Even when N = 100, SE/SD ratios are uniformly below 1; generally they are closer to 0.9 or 0.95, and for DGPs I and IV, they are often closer to 0.85 or even 0.8. Because of these biases, the simulated FE-PPML coverage ratios are unsurprisingly below the 0.95 we would expect for an unbiased estimator.
Bias corrections to the point estimates do help with addressing some, but not all, of 28 Numerically, we have found that the two terms that appear in both B N and D N in Proposition 3 tend to have opposite signs when the DGP is log-homoskedastic, thereby mitigating one another.
these issues. The jackknife generally performs more reliably than the analytical correction at reducing the average bias when compared across all values of N and T ; notice how, for the Poisson case, for example, the average bias left by the jackknife correction is never greater than 0.25% in absolute magnitude, whereas the analytical-corrected estimates still have average biases ranging between 0.01% and 1.12%. However, when N = 100, the analytical correction begins to closely match the jackknife, especially when T = 10.
All the same, both corrections generally have a positive effect, and the better acrossthe-board bias-reduction performance of the jackknife comes at the important cost of a relatively large increase in the variance. Thus, the analytical correction generally performs as well as or better than the jackknife in terms of improving coverage even in the smaller samples. Neither correction is sufficient to bring coverage ratios to 0.95, however, though corrected Gaussian-DGP estimates and Poisson-DGP estimates both exceed 0.94 using the analytical correction when N = 100 and T = 10, with the latter reaching 0.948. Table 3 then evaluates the efficacy of our bias correction for the estimated variance.
Keeping in mind that this correction is calibrated for the case of a correctly specified variance (which corresponds to DGP II), we would naturally expect that the effect of this correction should vary depending on the conditional distribution of the data. In that light, it is encouraging that we observe positive effects across all cases. The best results by far are for the DGPs I, II, and III, where combining the analytical bias correction for the point estimates with the correction for the variance yields coverage ratios that range between 0.925 and 0.952 when N is either 50 or 100 and generally get closer to the the target value of 0.95 as either N or T increases. These corrections lead to dramatic improvements in coverage for DGP IV as well, but there the remaining biases in both the point estimate and the standard error remain large even for N = 100 and T = 10.
To summarize, these simulations suggest that combining an analytical bias correction for β with a further correction for the variance based on (13)

What happens for larger values of T ?
Based on our Remark 2, one might expect that increasing the size of the time dimension should reduce the asymptotic bias in relation to the standard error. However, in the range of T values we used in Tables 1-3

Empirical Applications
For our main empirical application, we estimate the average effects of an FTA for a variety of different industries using a panel with a relatively large number of countries. The value of this exercise is that we expect that trade flows could be distributed very differently across different industries. Based on our results so far, this should lead to a range of differerent bias behaviors in the data.
Our trade data is from the BACI database of Gaulier and Zignago (2010), from which we extract data on trade flows between 167 countries for the years 1995, 2000, 2005, 2010, and 2015. Countries are chosen so that the same 167 countries always appear as both exporters and importers in every period; hence, the data readily maps to the setting just described with N = 167 and T = 5. We combine this trade data with data on FTAs from the NSF-Kellogg database maintained by Scott Baier and Jeff Bergstrand, which we crosscheck against data from the WTO in order to incorporate agreements from more recent years. 30 The specification we estimate is where y ijt is trade flows (measured in current USD), F T A ijt is a 0/1 dummy for whether or not i and j have an FTA at time t, and ω ijt is an error term. As we have noted, estimation of specifications such as (15) Table 4 presents results from FE-PPML estimation of (15), including results obtained using our bias corrections. The estimation is applied separately to 28 2 digit ISIC (rev. only 0.52 in this case. As shown in the left-hand panel of Figure 2, the reason is because the bias tends to decrease more slowly than the standard error as T increases while N is fixed under this DGP. 3) industries as well as to aggregate trade. While we do indeed see a range of different biases in the industry-level estimates, the results for aggregate trade flows, shown in the bottom row of Table 4, are fairly representative. To provide some basic interpretation, the coefficient on F T A ijt for aggregate trade is initially estimated to be 0.082, which equates to an e 0.082 − 1 = 8.5% average "partial" effect of an FTA on trade. 31 The estimated standard error is 0.027, implying that this effect is statistically different from zero at the p < 0.01 significance level. Our bias-corrected estimates do not paint an altogether different picture, but do highlight the potential for meaningful refinement. Both the analytical and jackknife bias corrections for β suggest a downward bias of 0.04-0.06, or about 15%-22% of the estimated standard error. As our bias-corrected standard errors show (in the last column of Table 4), the initially estimated standard error itself has an implied downward bias of 11% (i.e., 0.027 versus 0.030).
Turning to the industry-level estimates, the analytical bias correction more often than not indicates a downward bias ranging between 5%-20% of the estimated standard error, though exceptions are present on both sides of this range. Estimates for the Chemical and Furniture industries appear to be unbiased, for example, and some (such as Tobacco) are associated with an upward bias. On the other end of the spectrum, implied downward biases can also be larger than 20% of the standard error, as is seen for Petroleum (47%), Fabricated Metal Products (31%), Electrical Equipment (27%), and Agriculture (22%).
The biases implied by the jackknife are often even larger (see Fabricated Metal Products, for example), consistent with what we found in our simulations for smaller panel sizes.
One possible interpretation is that the jackknife-corrected estimates are giving us a less conservative alternative to the analytical corrections in these cases. Indeed, the general correspondence between the two sets of results adds validity to both methods. However, as we have noted, these jackknife estimates could also be reflecting non-homogeneity in the data and/or the higher variance introduced by the jackknife. Implied biases in the standard error, meanwhile, tend to range between 10%-20% of the original standard error, again with some exceptions. 31 The term "partial effect" is conventionally used to distinguish this type of estimate from the "general equilibrium" effects of an FTA, which would typically be calculated by solving a general equilibrium trade model where prices, incomes, and output levels (which are otherwise absorbed by the α it and γ jt fixed effects) are allowed to evolve endogenously in response to the FTA. In the context of such models, β can usually be interpreted as capturing the average effect of an FTA on bilateral trade frictions specifically, holding fixed all other determinants of trade.
To further illustrate the types of results that can occur, we also obtain replication data for several recent articles that have used three-way gravity models and re-examine their findings using our bias corrections. The results, reported in Table 5, help to demonstrate how these corrections can matter for assessing statistical significance. Instances where conventional significance levels are affected include the coefficients for EIA × CONTIG and EIA × LEGAL from Baier, Bergstrand, and Clance (2018) and the coefficient for log approval rating from Rose (2019). 32 The implied bias to standard error ratios are sometimes as large as 40%-45%, as occurs for the EIA × LANG coefficient from Baier, Bergstrand, and Clance (2018) and the Total EIA Effect from Bergstrand, Larch, and Yotov (2015). However, the largest effects are actually for the standard error, which is downward-biased by more than 40% in several cases (the Total FTA effect from Baier, Yotov, and Zylkin (2019), for example). Notably, there are meaningful differences found even for the data used in Larch, Wanner, Yotov, and Zylkin (2019), a large data set with 213 countries and 66 time periods. These results reinforce our earlier finding that bias corrections may be useful even in settings with seemingly large N and T .

Conclusion
Thanks to recent methodological and computational advances, nonlinear models with three-way fixed effects have become increasingly popular for investigating the effects of trade policies on trade flows. However, the asymptotic and finite-sample properties of three-way fixed effects estimators have not been rigorously studied, especially with regards to potential IPPs. The performance of the FE-PPML estimator in particular is of natural interest in this context, both because FE-PPML is known to be relatively robust to IPPs as well as because it is likely to be a researcher's first choice for estimating three-way gravity models. Our results regarding the consistency of PPML in this setting reflect these unique properties of PPML and support its current status as a workhorse estimator for estimating the effects of trade polices.
Given the consistency of PPML in this setting, and given the nice IPP-robustness properties of PPML in general, it may come as a surprise that three-way PPML estimates nonetheless suffer from an asymptotic bias that affects the validity of inferences. In theory, the bias should become less of a problem when the country and time dimensions are both large, but our experiments with the time dimension indicate the bias can be of comparable magnitude to the standard error even in ostensibly large trade data sets. Typical clusterrobust estimates of the standard error are also biased, implying estimated confidence intervals not only off-center but also too narrow.
These issues are not so severe that they leave researchers in the wilderness, but we do recommend taking advantage of the corrective measures we have described. In particular, we find that analytical bias corrections based on Taylor expansions to both the point estimates and standard errors generally lead to improved inferences when applied simultaneously. We caution that we have not found these corrections to be a panacea, however, and several avenues remain open for future work. For example, confidence interval estimates could be adjusted further to account for the uncertainty in the estimated variance-Kauermann and Carroll (2001) describe such a correction for the PPML case.
A quasi-differencing approach similar to Jochmans (2016) could provide another angle of attack, and a recent contribution by Pfaffermayr (2021) suggests that jackknife and bootstrap confidence interval methods hold promise as well. Turning to broader applications, the essential dyadic structure of our bias corrections could be easily adapted to network models that study changes in network behavior over time, including settings that involve studying the number of interactions between network members.   T=2 T=5 T=10  T=2 T=5 T=10  T=2  T=5 T=10 A. Gaussian DGP ("DGP I") Average bias (×100) . The data is generated using α it ∼ N (0, 1/16), γ jt ∼ N (0, 1/16), η ij ∼ N (0, 1/16) and and ν ijt ∼ N (0, 1/2). Results are shown for two different assumptions about V (y ijt ). The "Gaussian" DGP (panel A) assumes V (ω ijt ) = λ −2 ijt . The "Poisson" DGP (panel B) assumes V (ω ijt ) = λ −1 ijt .SE/SD refers to the ratio of the average standard error of of β relative to the standard deviation of β across simulations. Coverage probability refers to the probability β 0 is covered in the 95% confidence interval for β N T .
"FE-PPML" indicates uncorrected estimates. SEs allow for within-ij clustering.  T=2  T=5 T=10  T=2  T=5 T=10  T=2  T=5 T=10 A. Log-homoskedastic DGP ("DGP III") Average bias (×100) . The data is generated using α it ∼ N (0, 1/16), γ jt ∼ N (0, 1/16), η ij ∼ N (0, 1/16) and and ν ijt ∼ N (0, 1/2). Results are shown for two different assumptions about V (y ijt ). The "Log-homoskedastic" DGP (panel A) assumes V (y ijt ) = λ 2 ijt . The "Quadratic" DGP (Panel D) assumes ω ijt is log-normal with variance equal to λ −1 ijt + exp(2x ijt ). SE/SD refers to the ratio of the average standard error of of β relative to the standard deviation of β across simulations. Coverage probability refers to the probability β 0 is covered in the 95% confidence interval for β N T . "Analytical" and "Jackknife" respectively indicate Analytical and Jackknife bias-corrected FE-PPML estimates. "FE-PPML" indicates uncorrected estimates. SEs allow for within-ij clustering.     (2018) EIA 0.062 (0.039) 0.073 (0.044)* EIA is a dummy for the presence of an Economic Integration Agreement. The interaction terms are log distance and dummies for adjacency, common legal system, common religion, common language, and colonial history. Our estimation differs from BBF's in that we use PPML, whereas BBF use OLS with first-differences and pair-specific time trends. The specification we replicate is their Table 4 (2015) Total EIA Effect 0.521 (0.060)*** 0.546 (0.098)*** Reported coefficient is for the "Total EIA effect" inclusive of 4-and 8-year lags for the same specification as in their This table shows replications of several recent papers that use three-way gravity models with exporter-time, importer-time, and country-pair fixed effects. Standard errors shown in parentheses. Bias-corrected estimates use our analytical corrections for the point estimates and standard errors. * p < 0.10 , ** p < .05 , *** p < .01.

A Appendix
This Appendix first describes an additional simulation exercise based on the trade data. We then introduce additional notation, definitions, and other technical details that supplement the formal theorems and remarks presented in Section 3. Proofs of our formal results then follow, starting with a poof of Proposition 3, which characterizes the asymptotic distribution of β and its asymptotic bias. This proof naturally lends itself to further discussion of the "large T " results from Remark 2 as well as the consistency result from Proposition 1, which itself follows as a by-product of Proposition 3. We then demonstrate the uniqueness of this latter result as stated in Proposition 2 and highlight the general inconsistency of other three-way gravity estimators.
After the proofs, we include further supplementary discussions on the downward bias in the estimated standard errors, on allowing for conditional dependence across pairs in the trade data, and on how FE-PPML is affected by IPPs in more general settings beyond the gravity framework.

A.1 Simulation Based on Trade Data
As a additional simulation exercise, we revisit the BACI aggregate trade data we used in Section 5 and ask: if the estimated effect of an FTA and its standard error were indeed biased to the degrees implied by our bias corrections, would our corrections be successful at correctly identifying the bias and improving inference?
To answer this question, we start from the original aggregate trade data and FTA data but reconstruct the conditional mean λ ijt as though β = 0.086. That is, we first adjust the estimated λ ijt 's from the original estimation to account for the change in β and so that the FOC's for all fixed effects are consistent with β = 0.086. This gives us new "true" values of the conditional mean that we denote by λ ijt . The original data is therefore assumed to have been generated by y ijt = λ (1) ijt ω ijt , where the true disturbance ω ijt is backed out using ω ijt = y ijt /λ (1) ijt . Next, we choose a DGP for ω ijt that can reproduce the biases implied by our corrections. Taking our cues from DGP IV, which we found earlier to produce a downward bias in β, we consider a DGP where the conditional variance of y ijt has the form Var[y ijt |.] = aλ ijt + bF T Aλ 2 ijt . That is, we allow for some overdispersion that depends on the regressor of interest, as in DGP IV. We choose the two parameters a and b in order to come close to matching the following three values: (i) the bias in β, (ii) the standard deviation of β (assumed to be 0.03), (iii) the bias of the standard error of β. To keep things simple, all of our simulations sample new values for ω ijt only, holding λ (1) ijt fixed. As in the main text, we use 5,000 replications and assume ρ = 0.3. For our chosen values of a and b-a = 200, 000, b = 0.08-we obtain an average β of 0.0823, an average standard error of 0.0267, and a standard deviation of 0.0307. The uncorrected coverage is 0.9078.
When our preferred bias corrections are applied, they do not completely solve the coverage problem but do induce across-the-board improvements. The average corrected β using the analytical bias correction is 0.0842 and the corrected standard error is 0.0290.
Coverage improves as well, but only to 0.9210. As discussed in the main text, one important factor that limits the improvement in coverage is the fact that applying bias corrections to the point estimates increases their variance. In this case, the standard deviation of the corrected estimate is 0.0321.
Turning to the jackknife bias correction, we find as before that it does a superior job of bias reduction than the analytical correction, producing a average corrected estimate of 0.0851. However, the standard deviation of the jackknife-corrected estimates is 0.0371, echoing our previous finding that the improved bias reduction performance of the jackknife comes at a steep penalty in terms of increased variance. As a result, the coverage we obtain when we combine the jackknife with the corrected standard errors is only 0.8756.
Interestingly, if we only use the correction to the standard errors, i.e., without applying any correction to the point estimates, we obtain a coverage ratio of 0.9312, which is better than if we also use the analytical correction. It nonetheless remains true that using corrections to both the point estimates and standard errors leads to an improvement in coverage, as we have consistently found throughout our results. In general, these simulations reinforce our earlier conclusion that bias corrections, though helpful, are not necessarily a panacea to the issues we raise in the paper.

A.2 Additional Notation and Definitions
It is convenient to define the log-likelihood as a function of the index vector π ij as follows, In this appendix, we will also be more explicit than in the main text in distinguishing true parameter values β 0 , α 0 it , γ 0 jt , and the corresponding π 0 ij and ϑ 0 ijt , from their generic equivalents. For example, the S ij , H ij and G ij that were already defined in the main text can more formally be written as The T × K matrix x ij that was informally introduced in the main text as a two-way within-transformation of x ij , can be formally defined by subject to appropriate normalizations on α x i and γ x j (e.g. ι T α x i = ι T γ x j = 0, where ι T = (1, . . . , 1) is a T-vector of ones). Each within-transformed regressor vector x ij,k can be interpreted as containing the residuals left after partialing out x ij,k with respect to any iand j-specific components and weighting byH ij . 33

A.2.1 Analytical Bias Correction Formulas
The analytical bias correction discussed in Section 3.4 of the main text requires estimates of the expressions W N , B N , D N defined in Proposition 3. For this, we first require plugin objects x ij , S ij , H ij , H ij , and G ij -these objects are formed in the obvious way by 33 While we present the computation of x ij as a two-way within-transformation to preserve the analogy with Fernández-Val and Weidner (2016), each individual element x ijt,k can also be shown to be equivalent (subject to a normalization) to a three-way within-transformation of x ijt,k with respect to it, jt, and ij and weighting by λ ijt . Readers familiar with Larch, Wanner, Yotov, and Zylkin (2019) may find the latter presentation easier to digest.
replacing λ ijt with λ ijt and ϑ ijt with ϑ ijt := λ ijt / τ λ ijτ where needed. Then, the B N and D N are K-vectors with elements given by The replacement of N with N − 1 in B k N and D k N stems from a degrees-of-freedom correction. This correction is needed because creating plug-in values for the E S ij H ij x ij,k and E S ij S ij x ij,k objects that appear in Proposition 3 requires computing terms of the form E[y 2 ijt ] and E[y ijs y ijt ], as illustrated in Remark 1.

A.2.2 Details on large T bias expansion
We also want to explain the result in Remark 2 of the main text in more detail here by rewriting the bias terms B N and D N to illuminate the role of the time dimension. Using generic definitions for S ij , H ij , G ij , and x ij (e.g., S ij := ∂ ij /∂π ij , H ij := ∂ 2 ij /∂π ij ∂π ij , etc.), the formulas for the asymptotic distribution in Proposition 3 apply generally to M-estimators based on concave objective functions ij (β, α it , γ jt ). Unlike in the two-way FE-PPML case, these formulas do not reduce to zero when we further specialize them to the profiled Poisson pseudo-likelihood in (11), but we still find it instructive to do so (e.g., to discuss the large T limit from Remark 2). For that purpose, we define the T × T matrix M ij = I T − ϑ ij ι T . Furthermore, let Λ ij be the T × T diagonal matrix with diagonal elements λ ijt , and for i, j ∈ {1, . . . , N } define the T × T matrices The bias term B N = (B k N ) in Proposition 3 can then be expressed as and an analogous formula for D N follows by interchanging i and j appropriately. As long as there is only weak time dependence between observations the matrix objects R ij and Q i Λ ij M ij above are both of order 1 as T → ∞, such that both terms in brackets in (17) are likewise of order 1. This explains the result stated in Remark 2 of the main text. Given those differences-and before introducing any further complications-we briefly want to restate the main result in FW for the two-way pseudo-panel case. Outcomes Y ij , i, j = 1, . . . , N , conditional on all the strictly exogenous regressors X = (X ij ), fixed effect N -vectors α and γ, and common parameters β are assumed to be generated as

A.3 Proof of Proposition 3 Known result for two-way fixed effect panel models
where the conditional distribution f Y is known, up to the unknown parameters α i , γ j ∈ R and β ∈ R K . It is furthermore assumed that α i and γ j enter the distribution function only through the single index π ij = α i + γ j ; that is, the log-likelihood can be defined by The maximum likelihood estimator for β is given by Also, define the K-vector Ξ ij with components, k = 1, . . . , K, where here and in the following all expectations are conditional on regressors X = (X ij ), and on the parameters α, γ, β. For q ∈ {0, 1, 2}, the (within-transformation) differentia- Theorem 1. Assume that (i) Conditional on X, α 0 , γ 0 , β 0 the outcomes Y ij are distributed independently across (ii) The map (β, π) → ij (β, π) is four times continuously differentiable, almost surely.
All partial derivatives of ij (β, π) up to fourth order are bounded in absolute value by a function m(Y it , X it ) > 0, almost surely, uniformly over a convex compact set almost surely, for some ν > 0.
In addition, assume that the following limits exist where expectations are conditional on X, α, γ, β. Finally, assume that W > 0. Then, as

Remarks:
(a) This is just a reformulation of Theorem 4.1 in FW to the case of pseudo-panels, and the proof is provided in that paper. Since we consider only strictly exogenous regressors, all the analysis is conditional on X; and the bias term B simplifies here, since conditional on X (and the other parameters), we assume independence across both i and j. Thus, no Nickell-type bias (Nickell, 1981;Hahn and Kuersteiner, 2002) appears here, but we still have incidental parameter biases because the model is nonlinear (Neyman and Scott, 1948;Hahn and Newey, 2004). Thus, there is no need to adjust the terms that explicitly depend on N if some of the data are missing. So long as the missing data occur at random, applying the formulas as written should still generally be expected to deliver an asymptotically valid bias correction. A similar observation applies for missing values in the threeway model. That said, if the missing values occur in such a way that some of the α i 's or γ j 's appear only a small number of times in the data, they will tend to be estimated with a larger degree of estimation noise than the other fixed effects, which could affect the performance of bias corrections based on these formulas in practice.
(d) The above theorem assumes that the log-likelihood ij (β, α i + γ j ) for Y ij | X, α, γ, β is correctly specified. This is an unrealistic assumption for the PPML estimators in this paper, where we only want to assume that the score of the pseudolog-likelihood has zero mean at the true parameters, that is, This extension to "conditional moment models" is discussed in Remark 3 of FW. The statement of the theorem then needs to be changed as follows: where the definition of W is unchanged, but the expression of B = B 1 + B 2 , D = D 1 + D 2 and Ω now read These are the formulas that we have to use as a starting point for the bias results derived in this paper.
Our task in the following is to translate and generalize the conditions, statement, and proof of Theorem 1, as extended in (19) and (20), to the case of the three-way PPML estimator discussed in the main text.

Regularity conditions for Proposition 3
The following regularity conditions are required for the statement of Proposition 3 to hold.
(ii) The range of x ijt , α 0 it and γ 0 jt is uniformly bounded, and there exists ν > 0 such that Those assumptions are very similar to those in Theorem 1 above: Assumption A(i) is analogous to condition (i) in the theorem, except that we only impose the conditional mean of y ijt to be correctly specified, as already discussed in remark (c) above. Notice also that this assumption requires conditional independence across i and j, but we do not have to restrict the dependence of y ijt over t for our results.
We consider the Poisson log-likelihood in this paper, which after profiling out η ij gives the pseudo-log-likelihood function ij (β, α it , γ jt ) defined in equation (11). This loglikelihood is strictly concave and arbitrarily often differentiable in the parameters, so corresponding assumptions in Theorem 1 are automatically satisfied. Assumption A(ii) is therefore already sufficient for the corresponding assumptions (ii) and (iii) in Theorem 1.
Finally, Assumption A(iii) simply corresponds to the condition W > 0, which is just an appropriate non-collinearity condition on the regressors x ijt .

Translation to our main text notation
The main difference between Theorem 1 in the Appendix and Proposition 3 in the main text is that Theorem 1 only covers the case where π ij = α i + γ j is a scalar, while in our model in the main text α i , γ j and π ij = α i + γ j are all T -vectors. We can impose additional normalizations on those T -vectors, because the profile likelihood L(β, α, γ) in . . . , 1) is the T -vector of ones. 34 In the following we choose the normalizations ι T α i = 0 and ι T γ j = 0, implying ι T π ij = 0 for all i, j.
Accounting for this normalization we actually only have (T − 1) fixed effects α i and γ j for each i, j here. Theorem 1 is therfore directly applicable to the case T = 2, but for T > 2 we need to provide an appropriate extension.
The appropriate generalization of the operator D βα q i = D βγ q j in (18) to the case of vector-valued α i and γ j was described in Section 4.2 of Fernández-Val and Weidner (2018).
Remember the definition of ij (β, π ij ) = ij (β, α i , γ j ) and one achieves that the expected Hessian of L * (β, α, γ) = i,j * ij (β, α i , γ j ) is block-diagonal, in the sense that E ∂ βα i L * (β 0 , α 0 , γ 0 ) = 0 and E ∂ βγ j L * (β 0 , α 0 , γ 0 ) = 0 -the definition of α x i and γ x j by (16) in the main text exactly corresponds to those block-diagonality conditions. With those definitions, we then have that In particular, we find that our definitions of E D β ij (D β ij ) in the notation of Theorem 1 and equation (20). Thus, the asymptotic variance in (19) indeed corresponds to the asymptotic variance formula in Proposition 3.

Inverse expected incidental parameter Hessian
The asymptotic bias results that follow require that we first derive some key properties of N non-zero diagonal T × T blocks given by i∈N\{j}Hij . By contrast, the matrixH (αγ) consistents of blocks E −∂ α i γ j L(β 0 , φ 0 ) =H ij that are non-zero for i = j, because any two parameters α i and γ j jointly enter into T observations. The incidental parameter Hessian matrixH therefore has strong diagonal T × T blocks of order N , but also many off-diagonal elements of order one.
The pseudoinverse ofH crucially enters in the stochastic expansion for β below. It is therefore necessary to understand the asymptotic properties of this pseudoinverseH † . The following lemma shows thatH † has a structure analogous toH, namely, strong diagonal T × T blocks of order 1/N , and much smaller off-diagonal elements of order 1/N 2 . We The matrix D is block-diagonal, and its pseudoinverse D † is therefore also block-diagonal with T ×T blocks on its diagonal given by This result is crucial in order to derive the stochastic expansion of β. Indeed, as we will see below, once Lemma 1 is available, then the proof of Proposition 3 is a straightforward extension of the proof of Theorem 4.1 in FW. Lemma 1 is analogous to Lemma D.1 in FW, but our proof strategy for Lemma 1 is different here, because we need to account for the vector-valued nature of α i and γ j , which requires new arguments.
Proof of Lemma 1. # Expansion ofH † in powers of G: The matrixH is (minus) the expected Hessian of the profile log-likelihood L = i,j ij . Because in that objective function we have already profiled out the fixed effect parameters η ij we haveH ij ι T = 0 for all i, j, where ι T = (1, . . . , 1) is the T -vector of ones. This implies that The last equation describes 2N zero-eigenvectors ofH (i.e. the eigenvalue zero ofH has multiplicity at least 2N ). Because the original log-likelihood function of the Poisson model was strictly concave in the single index x ijt β + α it + γ jt + η ij it must be the case that any additional zero-eigenvalue ofH is due to linear transformations of the parameters α and γ that leave α it + γ jt unchanged for all i, j, t. 35 There is exactly one such transformation for every t ∈ {1, . . . , T }, namely the likelihood is invariant under α it → α it + c t and 35 Notice that any collinearity problem in the likelihood involving the regression parameters β is ruled out for sufficiently large sample sizes by our assumption that lim N →∞ W N > 0, which guarantees that the expected Hessian wrt β is positive definite asymptotically.
γ jt → γ jt − c t for any c t ∈ R. The expected HessianH therefore has additional zeroeigenvectors, which are given by the columns of the 2N where M ι T := I T − P ι T and P ι T := T −1 ι T ι T . In the last display we could have used the identity matrix I T instead of M ι T , but we want the columns of v to be orthogonal to the zero-eigenvectors already given by (22), which is achieved by using M ι T . As a consequence of this, we have rank(v) = T − 1; that is, since we already have (22) we only find T − 1 additional zero-eigenvectors here. Thus, the total number of zero eigenvalues ofH (i.e. the multiplicity of the eigenvalue zero) is equal to 2N + T − 1. It is easy to verify that indeed Equations (22) and (24) describe all the zero-eigenvectors ofH. The projector onto the null-space ofH is therefore given by where The Moore-Penrose pseudoinverse ofH therefore satisfies where the RHS is the projector orthogonal to the null-space ofH (i.e. the projector onto the span ofH). The definition of the Moore-Penrose pseudoinverse guarantees thatH † has the same zero-eigenvectors asH; that is, we also haveH † v = 0 andH † (I 2N ⊗ ι T ) = 0.
The last equation together with the symmetry ofH † implies that Next, similar to the above argument forH we have that the only zero-eigenvector of the T × T matrices j∈N\{i}Hij and i∈N\{j}Hij is given by ι T , and therefore we have which can equivalently be written as where P ι T := T −1 ι T ι T . Now, using (26) andH = D + G we havē Multiplying this with D † from the right, using (28) and (27), and bringingH † GD † to the By transposing this last equation we obtain and now plugging (29) into the RHS of (30) gives where in the second step we used that D † GP null = −P null , which follows from 0 =H P null = DP null +GP null by left-multiplication with D † and using that D † DP null = 0. This expansion argument forH † so far has followed the proof of Theorem 2 in Jochmans and Weidner (2019). We furthermore have here that D † (I 2N ⊗ P ι T ) = 0, becauseH ij ι T = 0, implying that D † P null = D † P v . The expansion in the last display therefore becomes with 2N T × T matrix v defined in (23). This expansion is the first key step in the proof of the lemma.
# Bound on the spectral norm ofH † : The term D † GH † GD † in the expansion (31) still containsH † itself. In order to bound contributions from this term we therefore need a preliminary bound on the spectral norm ofH † .
The objective function ij (β, π ij ) := ij (β, α it , γ jt ) in (11) is strictly convex in π ij , apart from the flat direction given by the invariance π ij → π ij + c ij ι T , c it ∈ R. This strict convexity together with our Assumption A(ii) that all regressors and parameters are uniformly bounded over i, j, N, T implies that for the T × T expected Hessian The last display states thatH ij is positive definite in all directions orthogonal to ι T .
Again, the lower bound b > 0 holds uniformly due to Assumption A(ii). The last display result can equivalently be written asH where ≥ means that the difference between the matrices is positive definite.
For all i, j ∈ N := {1, . . . , N } we then have where we also used (32). It is easy to show that for N > 2 the 2N × 2N matrix Q N has an eigenvalue zero with multiplicity one, an eigenvalue N − 2 with multiplicity N − 1, an eigenvalue N with multiplicity N − 1, and an eigenvalue 2(N − 1) with multiplicity one. Thus, the smallest non-zero eigenvalue of Q N is (N − 2). Also, the zero-eigenvector of Q N is given by v 0 := (ι N , −ι N ) , and therefore we have where P null is the projector onto the null-space ofH, as already defined above. From this and therefore for the spectral norm # Final bound on H † − D † max : Using (32) we find This together with our boundedness Assumption A(ii) implies that The definition of the 2N T × T matrix v in (23) implies that where we also used that M ι T max ≤ 1. In the following display, let e k = (0, . . . , 0, 1, 0, . . . , 0) be the k'th standard unit vector of dimension 2N T . We find that where the first line is just the definition of · max , the second step uses definition of the spectral norm H † , the third step is an obvious rewriting, the fourth step uses that the norm of 2N T -vector Ge k can at most be √ 2N T times the maximal absolute element of the vector, and the final step uses that T is fixed in our asymptotic and G max = O P (1) and also (33).
Next, for general 2N T ×2N T matrices A and B we have the bound (notice that · max is not a matrix norm) but because D is block-diagonal (with non-zero T × T blocks on the diagonal) we have for any 2N T × 2N T matrix A the much improved bound Applying those inequalities to the expansion ofH † − D † obtained from (31), and also using (34) and (35) and (36), we find that as N → ∞ (remember that T is fixed in our asymptotic.) This is what we wanted to show.

Proof of Proposition 3
The with P null defined in (25). Here, the penalty term φ P null φ guarantees strict concavity in (β, φ). However, in the following all derivatives of L(β, φ) are evaluated at the true parameters, and since we impose the normalization P null φ 0 = 0 the only derivative of 36 Since we have a concave objective function, we can apply Theorem B.3 in FW to obtain preliminary convergence results for both β and φ. That theorem guarantees that that the consistency condition on φ(β) in Assumption (iii) of Theorem B.1 in FW is satisfied under our Assumption A, and it also guarantees β − β 0 = O P (N −1/2 ), which is important to apply Corollary B.2 in FW to obtain the expansion result in our equation (37).

L(β, φ) where the penalty term gives a non-zero contribution is the incidental parameter
Hessian matrixH = E [−∂ φφ L(β 0 , φ 0 )] for which the penalty term provides exactly the correct regularization. However, instead of that regularization, we can equivalently use the pseudoinverse; namely we have for any c > 0. In all expressions below whereH † appears we could equivalently writē H † + 1 N P null , but the additional contributions from 1 N P null will always vanish because the gradient of L(β, φ) with respect to φ is orthogonal to P null .
By applying Theorem B.1 and its Corollary B.2 in FW we thus obtain where was already defined in Proposition 3, and we have U N := U (0) Here, * ij was defined in (21), all "bars" denote expectations conditional on X and φ, and all the partial derivatives are evaluated at the true parameters. We also defined L * (β, φ) := N i=1 j∈N\{i} * ij (β, α it , γ jt ). Remember that we use a different scaling of the (profile) likelihood function than FW; namely in (10)  The score term ∂ β * ij = x ij S ij has zero mean and finite variance and is independent across i and j, conditional on X and φ. By the central limit theorem we thus find Thus, the term U (0) N only contributes variance to the asymptotic distribution of β, but no asymptotic bias. By contrast, the term U (1) N only contributes bias to the asymptotic distribution of β, but no variance. Namely, one finds that with B N and D N as given in the proposition. The proof of (38) is exactly analogous to the corresponding discussion of those terms in the proof of Theorem 4.1 in FW, which we restated above as Theorem 1 (remember that for T = 2 our result here is indeed just a special case of Theorem 4.1 in FW.) Therefore, instead of repeating those derivations here, we provide in the following a slightly less rigorous, but much easier to follow, derivation of those bias terms.

Derivation of the asymptotic bias in Proposition 3
Remember that the main difference between Theorem 1 and our case here is that for us the incidental parameters α i and γ j are T -vectors, while in Theorem 1 the index π ij = α i + γ j is just a scalar. An easy way to generalize the asymptotic bias formulas in Theorem 1 and display (20) to vector-valued incidental parameters is to use a suitable parameterization for the incidental parameters α i and γ j . The formulas for B 1 and D 1 can most easily be generalized by parameterizing the incidental parameters as follows where α i and γ j are T − 1 vectors, and A i and C j are T × (T − 1) matrices that satisfy Let L(β, α, γ) = L(β, (A i α i ), (C j γ j )). This reparameterization guarantees that That is, the Hessian matrix with respect to the incidental parameters α i and γ j is normalized to be an identity matrix under that normalization. It can be shown that this implies that the incidental parameter biases B 1 and D 1 "decouple" across the T − 1 components of α i and γ j ; that is, the total contribution to the incidental parameter bias of β just becomes a sum over T − 1 contributions of the form B 1 and D 1 in (20). Thus, for k ∈ {1, . . . , K} we have where in the second step we used the fact that j E ∂α2 i,q ij = 1 according to (41) here are conditional on X (in the main text we always make that conditioning explicit), andH ij and x ij,k are non-random conditional on X; that is, we can also write this last expression as Analogously we find Next, to generalize the incidental parameter biases B 2 and D 2 in (20) to vector-values α i and γ j we again make a transformation (39), but this time we choose Notice that for a correctly specified likelihood we have the Bartlett identitiesH ij = E S ij S ij x ij , implying that (40) and (42) are identical for correctly specified likelihoods.
In general, however, the transformation now is different. Instead of normalizing the Hessian matrices to be identities, as in (41), the new transformation defined by (42) guarantees that Again, it can be shown that with this normalization the incidental parameter bias contributions B 2 and D 2 "decouple"; that is, each component of α where in the second step we used that i,q ij 2 = 1 according to (43), in the third step we rewrote the sum over q ∈ {1, . . . , T − 1} as a trace over the (T − 1) × (T − 1) matrix of third-order partial derivatives E(D β kαiα i ij ), in the fourth step we used that α i = A i α i , and in the final step we used the cyclicity of the trace and (42) and the definitions ofḠ ij , x ij,k , and the tensor-vector productḠ ij x ij,k (which, recall, is a T × T matrix).
Analogously we find We have thus translated all the formulas in Theorem 1 and in display (20) to the case of vector-valued α i and γ j to find exactly the expression for the asymptotic biases B k N = B 1,k + B 2,k and D k N = D 1,k + D 2,k in Proposition 3.

Rewriting the bias expressions as in Appendix A.2.2
In the following, we unpack the formulas provided in Appendix A.2.2 in order to provide additional detail on why the leading bias term is of order 1/(N T ) as both N and T → ∞ simultaneously. Remember that E(y ijt |x ijt , α it , γ ij ) = λ ijt := exp(x ijt β + α it + γ ij ) and ϑ ijt := λ ijt τ λ ijτ , and denote the corresponding T -vectors by y ij , λ ij and ϑ ij . It is convenient to define the T × T matrices which is the unique idempotent T ×T matrix (i.e. M ij M ij = M ij ) that satisfies rank(M ij ) = T − 1, M ij λ ij = 0, and ι T M ij = 0. Notice also that λ ij = Λ ij ι T , and therefore where t, s, r ∈ {1, . . . , T }.
This shows that W N has an additional sum over t, so W N increases linearly in T , and , which is the diagonal T × T matrix with diagonal entries λ ijt x * ijt,k . The first-order conditions of the optimization problem that defines which can also be written as These FOC's are only important to simplify the term B 2,k in what follows. We have where, in the second-to-last step, we used the definition of M ij , (44), that ι T D ij,k ι T = λ ij x * ij,k , and that D ij,k ι T = Λ ij x * ij,k ; and in the last step, we used that Λ ij x * ij,k = Λ ij M ij x ij,k and λ ij x * ij,k = 0. We also used the definition of Q i given in Appendix A.2.2. We then have for B k N = B 1,k + B 2,k that where we have now also used the definition of R ij from Appendix A.2.2 in order to simplify B 1,k . Under appropriate regularity conditions, the T × T matrices Q i and R ij each maintain diagonal elements of order one and off-diagonal elements of order 1/T 2 through their dependence on Var(y ij ). Therefore, all the numerators and denominators in the last expression for B k N remain of order one as T → ∞, such that B k N = O(1) as T → ∞, with an analogous result also following for D k N . Recalling that W N increases linearly with T , we thus conclude that the bias term is of order 1/(N T ) as both N and T grow large.

Comment on Proposition 1
We note that the consistency result from Proposition 1 also follows from the above proof of Proposition 3.

Remark 3. If the asymptotic bias in β is characterized by Proposition 3, then β is
consistently estimated as N → ∞.
As we have noted in the text, for this consistency result to hold, we need for the score of the profile log-likelihood ij (β, α it , γ jt ) from (11) to be unbiased when evaluated at the true parameters (β 0 , α 0 , γ 0 ). In particular, we need for there to be no incidental parameter bias term of order 1/T associated with the pair fixed effect η ij . As the following proof and subsequent discussion clarify, the FE-PPML estimator is quite special in this regard.

A.4 Proof of Proposition 2
To prove Proposition 2, it will first be useful to prove the following lemma: Lemma 2. Assume a "one way" panel data model with λ it = exp(x it β + α i ) and consider the class of FE-PML panel estimators with FOC's given by β: . . . , N , t = 1, ..., T, and g( λ it  To proceed, let the true data generating process be given by where λ it is the true conditional mean and with z it a randomly-generated variable distributed N (0, 1). ω it is therefore a heteroskedastic multiplicative disturbance that follows a log-normal distribution with E[ω it ] = 1 and Var(ω it ) = λ ρ it . The conditional mean of y it is in turn given by E[y it |x, α] = λ it and the conditional variance is given by Var(y it |x, α) = Var(y it |λ it ) = λ 2 it Var(ω it ) = λ ρ+2 it . Our focus is the exponent ρ, which governs the nature of the heteroskedasticity and can be any real number. With this in mind, it is useful to document the following results, Put another way, the expected value of the change in ω it with respect to ρ must always be zero because E[ω it ] = 1 regardless of ρ. Similarly, the expected change in the second moment of ω it must be λ ρ it ln λ it because this gives the change in the variance of ω it . 37 To facilitate the rest of the proof, we invoke the following conceit: the random disturbance term z it , once drawn from N (0, 1), is known and fixed, such that each ω it may be treated as a known transformation of the underlying value for z it given by (45). Among other things, this means we can always treat the partial derivatives ∂ω it ∂ρ and ∂y it ∂ρ = λ it ∂ω it ∂ρ as well-defined; similarly, we can treat the estimated parameters β and α i as deterministic functions of the variance parameter ρ with well-defined total derivatives d β dρ and d α i dρ . That is, for a given draw of z it 's, we can perturb how the corresponding ω it 's are generated and consider comparative statics for how estimates are affected. If β is consistent regardless of the variance assumption used to generate ω it , then small changes in ρ should have no effect on β asymptotically. Thus, our goal in the following is to determine if there are any estimators in this class other than FE-PPML under which lim N →∞ d β dρ = 0 in this experiment.
The next step is to totally differentiate the FOC's for β and α i with respect to a change in ρ. Let L denote the pseudo-likelihood function to be maximized. 38 For notational 37 Note here that The implied pseudo-likelihood function is given here by L : convenience, we can express the scores for β and α i as L β and L α i , such that their FOCs can respectively be written as L β = 0 and L α i = 0. Differentiating the FOC for β, we where L ββ is the matrix obtained from partially differentiating the score for β with respect to β, L βρ (a vector) is the partial derivative of L β with respect to ρ, and L βα i (also a vector) is its partial derivative with respect to α i . Applying a similar set of operations to the FOC for α i then gives where L α i α i and L α i ρ are scalars that respectively contain the partial derivatives of L α i with respect to α i and ρ. Plugging (49) into (48), we have where I is an identity matrix whose dimensions equal the size of β.
Let P henceforth denote the combined matrix object I − L −1 It is straightforward to show that that first term in (51), −P −1 L −1 ββ L βρ , converges in probability to a zero vector when N → ∞. To see this, note first that P and L ββ must be non-singular and finite for β to be at a maximum point of L and for d β dρ to exist. Furthermore, lim x it ] −1 must also be non-singular and finite. Slutsky's theorem then implies lim N →∞ −P −1 L −1 ββ L βρ → p 0 if lim N →∞ N −1 T −1 L βρ → p 0. Examining the vector L βρ more closely, we have lim N →∞ N −1 T −1 L βρ → p 0 then follows via standard arguments because E ∂ω it ∂ρ = 0 (by (46)). We may therefore focus our attention on the second term on the RHS in (51), Noting that L −1 α i α i must be < 0, in this case we consider the conditions under which lim N →∞ N −1 T −1 i L −1 α i α i L βα i L α i ρ similarly converges in probability to zero. The summation in this latter term may be expressed as Re-arranging this expression, we have that Focusing first on the second of the two summation terms in (52), we again apply This follows for the same reason lim N →∞ N −1 T −1 L βρ → p 0 above. The first summation term in (52) obviously → p 0 as well if the estimator is FE-PPML, in which case g ( λ it ) = 0.
To complete the proof, we just need to show that this term does not reduce to 0 if g ( λ it ) = 0. A final step gives us To elaborate, the terms where s = t vanish as N → ∞ because disturbances are assumed 39 The remaining details follow from (47). 40 We have now shown lim N →∞ d β dρ = 0 if and only if g ( λ it ) = 0. In other words, the estimator must be FE-PPML, which assumes g( λ it ) is a constant. For other FE-PML estimators, even if β is consistent for a particular ρ, it cannot be consistent for 39 Note that under FE-PPML, where g ( λ it ) = 0, the estimator is consistent even if disturbances are correlated. This is yet another reason why FE-PPML is an especially robust estimator.
all ρ because β does not converge to the same value for N → ∞ when we vary ρ. As we discuss below, this is what happens for FE-Gamma PML (where g( λ it ) = λ −1 it ) and some other similar estimators.
To be clear, the robustness of the FE-PPML estimator to misspecification is a known result established by Wooldridge (1999). However, to our knowledge, it has not previously been shown that FE-PPML is the only estimator in the class we consider that has this property. 41 At the same time, it is worth clarifying that FE-PPML is not the only estimator that is capable of producing consistent estimates of three-way gravity models.
Rather, it is the only estimator in the class we consider that only requires correct specification of the conditional mean and for the covariates to be conditionally exogenous in order to be consistent. The following discussion describes some known cases in which other estimators will be consistent.

A.5 Results for Other Three-way Estimators
Depending on the distribution of the data, there may be some other consistent estimator available aside from FE-PPML. In particular, if g( λ ijt ) is of the form g( λ ijt ) = λ q ijt , with q an arbitrary real number, the FOC for η ij has a solution of the form η ij = [ T t=1 µ q+1 ijt ] −1 T t=1 y ijt µ q ijt . It is therefore possible to "profile out" η ij from the FOC for β, just as in the FE-PPML case. As such, it is possible for the estimator to be consistently estimated, but only if the conditional variance is correctly specified (more precisely, we must have Var(y|x, α, γ, η) ∝ λ 1−q it , the equivalent of ρ = −1 − q.) In this case, the estimator is not only consistent, but should be more efficient as well.
An interesting example to consider in the gravity context is the Gamma PML (GPML) estimator, which imposes g( λ ijt ) = λ −1 ijt . Generally speaking, GPML is considered the ensuring that β does not depend on ρ for the large N, large T case. This follows because Alternatively, it is possible to extend the above result to an even more general class of estimators by considering estimators that depend on g( α i ) rather than g( λ it ). The same type of proof may be used to show that β depends on the variance assumption if g ( α i ) = 0. Furthermore, the estimator can be shown to be consistent if g ( α i ) = 0. primary alternative to PPML and OLS as an estimator for use with gravity equations (see Head and Mayer, 2014;Bosquet and Boulhol, 2015.) However, to our knowledge, no references to date on gravity estimation make it clear that, unlike in a two-way setting, the three-way FE-GPML estimator is only consistent when the conditional variance is correctly specified. 42 Thus, it is possible that researchers could mistakenly infer that the appeal of FE-GPML as an alternative to FE-PPML in the two-way gravity setting carries over to the three-way setting. 43 This is especially a concern now that recent computational advances have made estimation of FE-GLM models significantly more feasible.
To illuminate the unique IPP-robustness properties of FE-PPML in the three-way context, Fig. 3  As we would expect based on Proposition 2, FE-PPML is relatively unbiased across all four different assumptions considered for the distribution of the error term. The general inconsistency of the three-way OLS estimator-which is only unbiased for DGP III where the error term is log-homoskedastic-is also as expected. However, the reasons behind the bias in the OLS estimate are well-documented (see Santos Silva and Tenreyro, 2006) and 42 As discussed in Greene (2004), the fixed effects Gamma model is generally known not to suffer from an incidental parameter problem, similar to FE-Poisson. However, the result stated in Greene (2004) is for the Gamma MLE estimator, which restricts the conditional variance to be equal to the square of the conditional mean. The FE-Gamma PML estimator is consistent under the slightly more general assumption that the conditional variance is proportional to the square of the conditional mean.
43 For example, Head and Mayer (2014) where the distribution of ω ijt depends on the DGP and the true value of β is 1 (indicated by the vertical dotted lines). The size of the i and j dimensions is given by N = 50 and the t dimension has size T = 5. See text for further details.
do not have to do with the incidental parameters included in the model. The three-way FE-GPML estimator is also consistent under DGP III because it assumes the error term has a variance equal to the square of the conditional mean. Both OLS and GPML are also more efficient than PPML in this case. However, as the other three panels show, when this variance assumption is relaxed, three-way FE-GPML clearly suffers from an IPP, exhibiting an average bias equal to roughly half that of OLS in all three cases.
We have also performed some simulations with three-way FE-Gaussian PML, which imposes g( λ ijt ) = λ ijt . We do not show results for this other estimator because the HDFE-IRLS algorithm we used to produce the FE-PPML and FE-Gamma PML estimates frequently did not converge for the FE-Gaussian PML estimator. However, the results we did obtain were in line with our results for FE-GPML and with our discussion of Proposition 2 above: the FE-Gaussian PML estimates were consistent when the DGP for ω ijt was itself Gaussian (as in DGP I), but were inconsistent otherwise.

A.6 Allowing for Conditional Dependence across Pairs
The bias expansion in Proposition 3 allows for errors to be clustered within each pair (i, j), but assumes conditional independence of y ij and y i j for all (i, j) = (i , j ). This assumption is consistent with the standard practice in the literature of assuming that errors are clustered within pairs when computing standard errors (see Yotov, Piermartini, Monteiro, and Larch, 2016.) However, it is important to clarify that the results in Proposition 3 may change when other assumptions are used. For example, if we want to allow y ij and y ji (i.e., both directions of trade) to be correlated, then the bias results would not actually change, but we would need to modify the definition of Ω N to allow for the additional clustering; namely, we would need Notice, however, this is just one possibility. Similar adjustments could be made to allow for clustering by exporter or importer, for example, or even for multi-way clustering à la Cameron, Gelbach, and Miller (2011). In these cases, the bias would also need to be modified; specifically, one would have to modify the portions of D k N that B k N that depend on the variance of S ij to allow for correlations across i and/or j.

A.7 Showing Bias in the Cluster-robust Sandwich Estimator
For convenience, let x ij := (x ij , d ij ) be the matrix of covariates associated with pair ij, inclusive of the it-and jt-specific dummy variables needed to estimate α i and γ j .
Similarly, let b := (β , φ ) be the vector of coefficients to be estimated and let b be the vector of coefficient estimates. Note that we can write a first-order approximation for S ij which is consistent with the approximation provided in (13) Now we turn our attention to the outer product S ij S ij : Because we assume we are in the special case where FE-PPML is correctly specified, we We also have that E[S ij S ij ] = κH ij . Therefore, after applying expectations where appropriate, we have that which can be seen as extending Kauermann and Carroll (2001) which simplifies to the expression shown in (13).
Results for the two-way model. Though we have focused on the downward bias of the sandwich estimator for the three-way gravity model, it is also known to be biased for the standard two-way gravity model without pair fixed effects (Egger and Staub, 2015;Jochmans, 2016;Pfaffermayr, 2019). As it turns out, the analytics for the two-way and three-way models are very similar here, and we can easily adapt our results to the simpler two-way setting. The main change we would need to make is to replace H ij everywhere it appears with Λ ij , including in the definitions of x ij , W N , and W (φ) N . The rest of the derivations then follow in the same manner as for the three-way model. The resulting correction has been included in our ppml_fe_bias Stata package for users working with two-way gravity models. A version of this correction has been studied alongside other methods in a recent paper by Pfaffermayr (2021).

A.8 More Discussion of IPPs in FE-PPML Models
In this part of the Appendix, we wish to give a more expansive discussion of when IPPs may arise in case of an FE-PPML estimator. We have already reviewed the two-way and three-way gravity models in the main text; thus, here, we will focus first on the classic "one-way" FE panel setting where no IPP occurs. Doing so will allow us to draw a contrast with other, more complex models where IPPs could be a problem. As part of this discussion, we give some examples of panel models where FE-PPML is actually inconsistent, unlike the models covered in the main text.

The Classic (One-way) Setting
Consider a static panel data model with individuals i = 1, . . . , N , time periods t = 1, . . . , T , outcomes y it , and strictly exogenous regressors x it satisfying E(y it |x it , α i ) = λ it := exp(x it β + α i ).
The FE-PPML estimator maximizes i,t (y it log λ it + λ it ) over β and α. The corresponding FOC's may be written as where λ it := exp(x it β + α i ). Solving for α i and plugging the expression back into the FOC for β we find which, as long as (54) V is the asymptotic variance. The FE-PPML estimator therefore does not suffer from an IPP: even though α i is an inconsistent estimate of α i , the FE-PPML score for β has zero mean when evaluated at the true parameter β 0 , and β therefore converges in probability to β 0 without any asymptotic bias. This is a well known result that can also be obtained in the Poisson-MLE case by conditioning on t y it ; see Cameron and Trivedi (2015). For our purposes, it gives us a benchmark against which other, more complex models may be compared.

Examples where FE-PPML is Inconsistent
In the above "classic" setting, every observation is affected by exactly one fixed effect. In current applied work, it is common to specify models with what we will call "overlapping" fixed effects, where each observation may be affected by more than one fixed effects. Some standard examples include the gravity model from international trade (as is our focus in the main text) as well as other settings where researchers may wish to control for multiple sources of heterogeneity (e.g., firm and employee, teacher and student). Thus, it is important to clarify that the presence of overlapping fixed effects can easily lead to an IPP, even when the underlying estimator is Poisson or PPML. We give the following simple example: Example 1. Consider a model with three time periods T = 3 and two fixed effects α i and γ i for each individual: t = 1 : E(y i1 |x i1 , α i , γ i ) = λ i1 := exp(x i1 β + α i ), t = 2 : E(y i2 |x i2 , α i , γ i ) = λ i2 := exp(x i2 β + α i + γ i ), t = 3 : E(y i3 |x i3 , α i , γ i ) = λ i3 := exp(x i3 β + γ i ).
Example 2. In addition to i = 1, . . . , N and t = 1, . . . , T we re-introduce another panel dimension j = 1, . . . , J and consider E(y ijt |x ijt , α it , γ ij ) = λ ijt := exp(x ijt β + α it + γ ij ), where α it is now indexed by both i and t and our second fixed effect is similarly indexed by i and j. The FE-PPML estimator in this case maximizes i,j,t (y ijt log λ ijt + λ ijt ) over β, α and γ. We consider N → ∞ with both J and T fixed, e.g. J = T = 2.
In these examples, because the fixed effects are overlapping, we have that α enters into the FOC for γ, and vice versa. Therefore, when for a given value β we want to solve the FOC for α and γ we have to solve a system of equations, and the solutions become much more complicated functions of the outcome variable than in the one-way model. While having this type of co-dependence between the FOCs for the various fixed effects need not necessarily lead to an IPP, as we discuss next, it does create one in models where more than one fixed effect dimension grows at the same rate as the panel size, as is the case with α and γ in both of these examples.
In gravity settings, by contrast, the crucial distinction is that the dimensions of each fixed effect grow only with the square root of the sample size as the number of countries increases. As mentioned in the text, this ensures that the IPPs associated with each fixed effect "decouple" from one another, in the sense described by Fernández-Val and Weidner (2016). However, in the inconsistency examples just given, the estimation noise in the estimated α parameters will always depend on the estimation noise in the estimated γ parameters, and vice versa, even as N → ∞. Thus, decoupling cannot occur. For illustration, we have used the model from Example 1 as our example of inconsistency in our earlier Figure 1. We have also performed simulations for the model in Example 2 and found similar results.

A Suggested Heuristic for IPPs when the Estimator is FE-PPML
As has been made clear from our discussion and results, the usual generic bias heuristic shown in (6) from Fernández-Val and Weidner (2018) is generally not appropriate for FE-PPML. Indeed, because FE-PPML can be asymptotically unbiased in special cases, it may not be productive to try to boil down how their biases are likely to behave to a single formula. Instead, we propose the following approach: 1. If there are no fixed effect dimensions that grow proportionately with the sample, we expect FE-PPML to be unbiased asymptotically.
2. Otherwise, the likely order of the bias can be derived as follows: (a) Construct the equivalent multinomial model by profiling out the largest fixed effect dimension.
(b) Infer what the order of the asymptotic bias would be for the equivalent multinomial model by calculating p/n (i.e., as in (6)).
For example, in the three-way gravity model, the number of observations is on the order of N 2 T and the number of parameters is on the order of N 2 pair fixed effects plus 2N T exporter-time and importer-time fixed effects. However, after profiling out the pair fixed effects, we only have the 2N T exporter-time and importer-time fixed effects. Thus, we take p/n to be proportional to 1/N as N → ∞, implying an asymptotic bias of order 1/N . For further illustration, consider Examples 1 and 2 above. In these cases, even after profiling out α, one still finds that p is proportional to n as n → ∞, implying inconsistency.
As a contrast, consider the two-way gravity model. As we have just discussed, all of the fixed effects grow only with the square root of n in that case, implying it is asymptotically unbiased.
"Four-way" gravity models. As a more complicated example, consider the following "four-way" gravity model: This type of model may be used for trade data that is observed separately for different industries or commodities, which here are indexed by l = 1...L. α ilt , α jlt , and η ijl respectively are industry-level analogs of α it , α jt , and η ij from the three-way model. Thus, they allow multilateral resistance effects and cross-sectional heterogeneity in trade costs to vary by industry. The fourth fixed effect, ζ ijt , captures general changes in trade across all industries for a given pair. x ijlt is assumed to be an industry-specific policy variable of interest (e.g., tariffs). We assume that the error term ω ijlt exhibits correlation over time within the same exporter-importer-industry triplet but is independent across trade partners and across industries within the same exporter-importer pair.
The four-way model does not conform to the framework from our main analysis, but we can nonetheless use the above heuristic to infer the order of the bias and propose a correction. After profiling out the order-N 2 L exporter-importer-industry fixed effects, the model has on the order of 2N LT exporter-industry-time and importer-industry-time fixed effects and N 2 T exporter-importer-time fixed effects. The number of observations is on the order of N 2 LT . Following our discussion from Section 2.2, the bias is thus expected to be of the form Where b (α) , b (γ) , and b (ζ) are unknown constants. Two observations stand out. First, for consistency, we require both N and L to be large. For data sets where the number of industries is relatively small, the 1/L bias term associated with the ζ ijt fixed effect is likely to induce substantial bias. We will thus consider the implications for asymptotic bias as N and L grow large at the same rate. Second, the order of the standard error as N and L both → ∞ while T is fixed is 1/(N √ L). The ratio of the asymptotic bias to the standard error as N and L both → ∞ is expected to be of the form where c > 0 is a positive constant. This ratio diverges to infinity as N, L → ∞, implying that the standard error shrinks to zero faster than the bias does. This is a more severe form of the asymptotic bias problem than the one we found for the three-way model, where the bias and standard error both decreased at the same rate asymptotically. It is therefore advised that a jackknife correction should be used to reduce the bias. This can be done by holding out industries to inflate the 1/L bias while simultaneuously holding out countries to inflate the 1/N bias. For example, a jackknife sample with half the number of countries and half the number of industries will have an asymptotic bias of order 2/N + 2/L. A closely related model that we can also discuss is the case when the fourth fixed effect, ζ ijt , is not included in the model shown in (57). An example of when this might be desirable is when the policy variable of interest does not vary across industries (e.g., a trade agreement dummy). In this case, one can infer from the formula in (58) that we would expect an asymptotic bias of order 1/N , like what we found in the case of the threeway model. Furthermore, if only the number of countries is allowed to become large, while both L and T are held fixed, the standard error is also of order 1/N , and the behavior of the asymptotic bias is exactly like what we found for the three-way case. However, interestingly, if both N and L → ∞, we are back in the case where the bias-to-standard error ratio heads to infinity. Thus, the severity of the problem will depend importantly on the number of industries in both of these models.
Finally, note that if T is no longer fixed, so that N , L, and T all → ∞ jointly, we expect the special properties of PPML to cause the IPP bias to become more benign.
We know from Fernández-Val and Weidner (2016)'s earlier results for the two-way model and from our own results for the three-way model that the decoupling of the IPPs as all dimensions of the panel become large at the same rate eliminates the asymptotic bias in these settings. Future work can investigate to what extent this holds true for four-way panel models.