Revisiting the Identification of Wintertime Atmospheric Circulation Regimes in the Euro-Atlantic Sector

Atmospheric circulation is often clustered in so-called circulation regimes, which are persistent and recurrent patterns. For the Euro-Atlantic sector in winter, most studies identify four regimes: the Atlantic Ridge, the Scandinavian Blocking and the two phases of the North Atlantic Oscillation. These results are obtained by applying k-means clustering to the first several empirical orthogonal functions (EOFs) of geopotential height data. Studying the observed circulation in reanalysis data, it is found that when the full field data is used for the k-means cluster analysis instead of the EOFs, the optimal number of clusters is no longer four but six. The two extra regimes that are found are the opposites of the Atlantic Ridge and Scandinavian Blocking, meaning they have a low-pressure area roughly where the original regimes have a high-pressure area. This introduces an appealing symmetry in the clustering result. Incorporating a weak persistence constraint in the clustering procedure is found to lead to a longer duration of regimes, extending beyond the synoptic timescale, without changing their occurrence rates. This is in contrast to the commonly-used application of a time-filter to the data before the clustering is executed, which, while increasing the persistence, changes the occurrence rates of the regimes. We conclude that applying a persistence constraint within the clustering procedure is a superior way of stabilizing the clustering results than low-pass filtering the data.


Introduction
The study of atmospheric circulation, or weather, regimes has a long history.Starting from the 1940s, when the German weather service developed a set of weather types classifying the daily synoptic circulation (Deutscher Wetterdienst, 2019), the concept of weather regimes as an expression of the low-frequency variability of the atmospheric circulation has been a topic of research.The underlying concept is that the weather itself is a stochastic process, whose statistics are strongly conditioned on the weather regime.From around 1990 onwards different clustering methods have been used to identify these persistent and recurrent circulation patterns (e.g.Mo and Ghil, 1987;Vautard, 1990;Molteni et al., 1990), primarily focussing on the wintertime Northern Hemisphere.Later specific sectors of the Northern Hemisphere, primarily the Euro-Atlantic sector (e.g.Michelangeli et al., 1995;Kageyama et al., 1999), as well as the Southern Hemisphere (e.g.Mo, 2000), have been studied, along with the relation of circulation regimes with e.g.climate change (Corti et al., 1999) and regional weather (Cassou et al., 2005).
Initially different clustering methods, such as hierarchical clustering (e.g.Cheng and Wallace, 1993), using the analysis of the probability density function (e.g.Kimoto and Ghil, 1993), or kmeans clustering (e.g.Michelangeli et al., 1995), were used to identify the atmospheric circulation regimes.The last method, k-means clustering, has subsequently become the most used approach for identifying the regimes in atmospheric data (Hannachi et al., 2017).The number of clusters k has to be set a priori, making finding the optimal number of regimes part of the problem.Commonlyused methods to do this are the verification of significance by using synthetic datasets (e.g Dawson et al., 2012) or looking at the reproducibility and consistency when the algorithm is run multiple times (e.g.Michelangeli et al., 1995).Nearly always the data is first projected onto the first several Empirical Orthogonal Functions (EOFs), after which the clustering algorithm is applied to the time series of these EOFs (e.g.Vautard, 1990;Ferranti et al., 2015).In addition a number of studies apply a low-pass time filter to the data to remove the high frequency, noisy oscillations and focus on the low-frequency behaviour (e.g.Straus et al., 2007;Grams et al., 2017).This method enforces a higher persistence of the regimes compared to standard k-means clustering which is independent of the time-ordering of the data.
Since clustering methods represent a projection of the data to a lower dimensional state space, applying clustering to the already filtered data of EOFs means a projection of the data is done twice.Thirty years ago, this approach was necessary because computational limitations did not allow using the full field dataset.However this is no longer a constraint.Nevertheless most studies continue to follow the original approach and use EOFs.As EOFs give the modes associated with the most variability, while clusters give the recurrent patterns, the means of dimension reduction is quite different.The question thus arises of what the effect of this double filtering is on the resulting atmospheric circulation regimes.Similarly, applying a low-pass filter to remove the high-frequency behaviour before the cluster analysis also means the data is filtered twice.This is likely to not only affect the persistence, but also the occurrence of the found regimes and possibly the clusters themselves, thus also raising the question of how strong this effect is.
In this paper we compare the results of k-means clustering using EOF data with the results found for the full field data, for the case of the wintertime Euro-Atlantic sector.We pay special attention to the optimal number of clusters k, as it has a large influence on the regimes that are found.Furthermore, we study the effect of time-filtering on the found regimes, by comparing it with an adapted k-means clustering algorithm that incorporates a constraint on the regime duration to enforce persistence of the regimes.This novel algorithm does not change the data itself, but only the method to identify the regimes.Both comparisons give insight into the effect of filtering the data before applying the k-means clustering algorithm.We start by explaining the methods used, followed by a discussion of the differences and similarities of the results.In the end the findings are summarized and discussed.

Methods
We use 500 hPa geopotential height data from ERA interim on a 2.5 • by 2.5 • longitude-latitude grid for a domain covering the Euro-Atlantic sector, 20 • to 80 • N and 90 • W to 30 • E (Dee et al., 2011).Daily data (00:00 UTC) is considered for the months December through March using 39 years of data (1979 -2018).Deviations from a fixed background state are used throughout this period.The main argument for considering a fixed background state instead of a seasonally varying one is that when applying cluster analysis the data used is preferably as complete as possible to avoid any type of bias.This means that few to no assumptions, such as a seasonal cycle, are made in preparing the data to retain the information present in the data.Or, phrased differently, how can you compare two days if they are deviations with respect to a different background state?The risk of this approach is that seasonality affects the regimes that are found and introduces a bias in the occurrence and persistence.Thus there is a trade-off to be made between obtaining as large a sample size as possible whilst minimizing such effects.The rationale for the choice made is discussed in the supplementary information.

Clustering Methods
The method used for the identification of circulation regimes in this study is k-means clustering using the standard Euclidian distance (L 2 -norm) (Jain, 2010).This method is applied to both the full field dataset as well as the time series of the first 5, 10, 15 and 20 EOFs.Furthermore, the method is also applied to the full field data after applying a 5-and a 10-day low-pass filter to remove high-frequency oscillations.The results for this time-filtered dataset are compared with those obtained by applying an adapted k-means algorithm to the unfiltered data that incorporates a persistence constraint in the clustering procedure itself.This persistent clustering method is described in what follows.
Given a dataset {x t } t≤T ∈ R n , with t time and T the length of the dataset, the aim of any clustering method is to find a set of k cluster centres that accurately describe the dataset based on some measure.Let Θ = (θ 1 , ..., θ k ) be the set of parameters describing the k cluster centres.Here Θ represents the different circulation regimes for 500 hPa geopotential height anomaly data {x t } t≤T .To assess how well the cluster centres represent the data, a model distance functional g(x t , θ i ), giving the distance between a cluster centre and a data point, is required.We use the standard L 2 -distance weighted by the cosine of latitude (Chung and Nigam, 1999).In addition we consider the affiliation vector Γ = (γ 1 (t), ..., γ k (t)), which indicates the weight of a certain cluster at some point in time.
In practice γ i (t) is nearly always either zero or one, indicating to which cluster that point belongs.This is because a linear optimization problem always has an optimal solution on the boundary of the admissable set (Cottle and Thapa, 2017).For this reason the affiliation vector is in general not considered when k-means clustering is discussed.Here we do consider this vector because it allows for the incorporation of persistence in the clustering procedure.
The task of identifying the atmospheric circulation regimes best representing the data means one has to find the optimal parameters for the cluster centres Θ and the affiliations of the data Γ.This is done by minimizing the averaged clustering functional (Franzke et al., 2009) subject to This is what the k-means procedure is doing implicitly, where γ i (t) is assumed to be zero or one.
Finding the minimum of this functional minimizes the within-cluster variance as L is a measure of the distance between the cluster centres and the data points assigned to it.Because the within-cluster variance is minimized simultaneously for all clusters, the distance between data points assigned to different clusters becomes large.In other words; the between-cluster variance is maximized.This clustering functional does not yet incorporate any persistence; an arbitrary reshuffling of the data leads to exactly the same result.To include persistence in the clustering method we add a constraint on Γ that limits the number of transitions between regimes that is allowed (de Wiljes et al., 2014).This constraint on the number of transitions between regimes, or switches, that are permitted throughout the whole time-series is: for some constant C. The value of C gives twice the number of switches allowed (as every switch is counted twice), so e.g. an average cluster length of five days corresponds to C = 2×#days/5 ≈ 1900.
In Table 1 the average regime duration corresponding to several values of C are given.The rationale behind this constraint is that in a chaotic atmospheric circulation not every data point can be straightforwardly assigned to a cluster.Some data points can be in-between clusters or outliers, e.g.transitioning between clusters or extreme events.K-means clustering assigns these points to the nearest cluster (by distance), while it can be more sensible to assign it to the same cluster as its neighbours if the distance to that cluster is also quite small.This is exactly what the constraint in Equation 3 is doing for reasonable values of C. The minimization of the clustering functional L taking into account the persistence constraint is done in two steps which are iterated until convergence: The first part is done by linear programming using the Gurobi package for python (Gurobi Optimization LLC, 2019).The second part is done by k-means clustering.The computation is terminated when the difference between consecutive L becomes smaller than a set tolerance.
Because k-means clustering can only identify local minima we run the algorithm 500 times with different random initial conditions for the standard k-means algorithm.The initial condition at every location in space is drawn independently from a normal distribution around zero with the same standard deviation as the data.Note that this means there is no correlation in space to not make any assumptions on the spatial patterns of the regimes.The tolerance used depends on k and is 0.001/k 2 for the full field data and 0.01/k 2 for the EOF data.The algorithm including persistence is run 100 times with different initial conditions; the reduced number of runs is due to increased computation time by the incorporation of linear programming which is similar for the EOF and full field data.The final result is chosen to be the one with the smallest clustering functional L. This method in general works reasonably well, but even for the Lorenz 63 system (Lorenz, 1963) the "correct" clusters cannot be found for every realisation (simulation with different initial condition), as can be seen in the supplementary information.The data close to the cluster centres is always assigned correctly, but there is a significant uncertainty in assigning the data further away from the cluster centres.Thus it is important to be careful when applying k-means clustering and not blindly trust the result, especially as the method assigns every data point to a cluster even if it actually is in-between different clusters.

Information Criteria
For k-means clustering the number of clusters k has to be set a priori and the question is how to determine the best value for k.The main methods used to this end in the identification of atmospheric circulation regimes are tests by synthetic datasets (e.g.Straus et al., 2007), using a classifiability index (e.g.Plaut and Simonnet, 2001) or looking at the similarity of runs with different initial conditions (e.g.Jung et al., 2005).An alternative method is to use an information criterion (e.g.O' Kane et al., 2013).An information criterion is a tool from model selection which is used to identify the optimal model (Burnham and Anderson, 2004); it finds a balance between how well the model fits the data and the number of parameters needed, to prevent over-fitting.The optimal balance is where the information criterion is minimal.As the clusters are effectively a model representing the data, the concept can be applied here as well.In addition to allowing for finding the optimal number of clusters k, an information criterion also allows for finding the best constraint value C when persistence is incorporated in the clustering procedure.
The two information criteria that are used most widely are the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) (Burnham and Anderson, 2004).The AIC is based in information theory and is an approximation of how different two probability distributions (one for the data, one for the model) are (Akaike, 1973).It is given by where L( θ|data) is the likelihood of the optimal model θ given the data and K the number of parameters in the model.For cluster analysis the number of parameters is determined by the number of clusters, their dimension and the length of the data time series.The BIC is based on the limiting behaviour of Bayes estimators, which minimizes the expectation value of the loss (e.g.error), and reads (Schwarz, 1978): where n is the sample size, here being the dimension of the data (number of gridpoints or number of EOFs) times the number of days.Just as with the AIC, the BIC finds a balance between the (log-)likelihood and the number of parameters in the model.For both the AIC and BIC we refer to the second term as the penalty term as it penalizes the use of many parameters in finding the optimal model.
To compute the values of both information criteria the log-likelihood is needed.Assuming the errors of the model are independent and normally distributed the likelihood term can be written as (Burnham and Anderson, 2004) where σ2 = ˆ 2 t /n is the error variance for residuals ˆ t .This allows for a straightforward computation of both information criteria.
Which information criterion is best to use depends on the situation.As a rule of thumb the BIC is better suited for "small" models, while the AIC gives better results for "large" models (Shen and Ye, 2002).Whether a model is "small" or "large" is determined with respect to the dimension of the data considered.The argument behind this distinction in performance is that for a "small" model the penalty term in the AIC is weak and likely unable to sufficiently compensate the increase in likelihood when more parameters are used (resulting in a high optimal number of parameters), while the penalty term in the BIC is significantly stronger as it also depends on the sample size and can better balance the increasing likelihood (Burnham and Anderson, 2004).On the other hand for "large" models the penalty term in the BIC can be too strong leading to an unreasonably small optimal number of parameters, making the AIC a better choice.Here this means that we expect the BIC to perform better when the full field data is used, as the number of clusters is small with respect to the dimension of the data (latitude × longitude), while the opposite holds for the EOF data.There the dimension of the data (number of EOFs) is only slightly larger than the number of clusters, meaning the cluster model is "large" and the AIC is expected to give better results.When using these criteria one always has to judge whether the found minimum is physically reasonable and suitable for the purpose of the study.Furthermore, if the minimum of the information criterion is not very clear then further arguments, based on e.g.consistency, are needed to decide on the best k to use.

Results
First we compare the results of the regimes found using the EOF data and the full field data.In this we focus on the optimal number of regimes using the information criteria discussed and draw on similarity and consistency evaluations to provide further confidence in the result.Second we discuss the difference between enforcing persistence in the clustering method and the use of time-filtered data on the occurrence and persistence of the regimes.

Number of Regimes
The standard number of wintertime regimes identified over the Euro-Atlantic sector in literature is four (e.g Vautard, 1990;Cassou, 2008;Dawson and Palmer, 2015).Few studies question this number (e.g.Fereday et al., 2008).This optimal number of four clusters has always been found in the context of EOF data.Looking at the AIC, which is expected to give better results for EOFs as discussed in Section 2.2, in Figure 1a indeed a minimum at k = 4 is found when 20 EOFs are used, although the AIC is also small for k = 3.For lower numbers of EOFs the optimal number is found to be lower, while a higher number of EOFs leads to a higher optimum for k.This is to be expected because the use of a limited number of EOFs means that some variability of the original data is neglected.This loss of variability is larger for less EOFs and as a consequence less clusters are needed to account for the variability of the EOF data.The BIC has its minimum at k = 2 for every number of EOFs    considered, indicating the penalty term for the number of parameters is indeed too strong for the EOF data (Section 2.2).
Based on the AIC one might wonder why three regimes is not the standard in literature.The answer to this lies in the consistency of the results.There are two ways in which this can be assessed.First, one can look at the distribution of L for runs of the clustering algorithm with different initial conditions.Second, one can look at how similar the assignment of the data to the different regimes is between these runs (Fereday et al., 2008).This data similarity is computed by determining the number of data points that is assigned to the same regime for different runs.
Histograms of both consistency measures are shown in Figure 2 when the first 20 EOFs are used.When looking at the spread in L in Figure 2a clearly the spread for k = 3 is smallest, while for k = 4 even two peaks are visible.This double peak indicates that there exist a significant number of runs in which the value of L is only slightly smaller than the global minimum, but which have a significantly different data similarity.One additional year of data could possibly shift the value of this local minimum, making it the global minimum, which could result in different regimes.This is something to keep in mind when interpreting the results.
When looking at the data similarity in Figure 2b both distributions for k = 3 and k = 4 show a double peak.To assess whether we can distinguish these peaks we compute the mean and variance of the data similarity for all tests with ∆L < 0.01, where the bound is set based on the tolerance used Table 2: The mean (µ) and variance (σ 2 ) of the data similarity for data with L below a set threshold for both the EOF and full field results.For the full field results also the values for the odd and even years are given.The number of tests that are below the threshold is shown as well. in the k-means clustering (Section 2.1).The variance can be used as a measure of how consistent the results are, as a lower variance indicates the regimes are more similar.As can be seen on the left side of Table 2 k = 4 clearly proves to be more consistent, hence in literature four has become the standard number of regimes and we conclude that when one uses EOF data this indeed is the optimal number.However, the optimal number of regimes when using the full field dataset is not the same.The BIC, which is more suitable for the full field data than the AIC (Section 2.2), in Figure 1b points towards an optimum of k = 6.The AIC does not show a minimum in the range considered as the penalty term is not strong enough for the high dimensional full field data as discussed in Section 2.2.Since the differences in the value of the BIC for k ∈ {4, ..., 7} are small no value of k within this range can yet be ruled out.Therefore further arguments are needed to identify the optimal number of regimes.First we again use the distributions of L and the data similarity, including the variance measure, to assess the consistency of the results for different k.Second we look at the results of the clustering algorithm for a subset of the data, which is a standard approach to test the robustness of clustering methods to e.g.identify cats on photos (Jain, 2010).Here the results for the datasets of odd and even years are studied, as stationarity of the dataset cannot be assumed.
In Figure 3 the distribution of L and the data similarity are shown for a comparison with the optimal result (smallest L).The first aspect we note is that the spread for k = 7 in both L and the data similarity is large.This indicates the results found are not very consistent and we rule out k = 7 as a suitable number of regimes.We note that this does not improve significantly when the tolerance is reduced.The second aspect we observe is that, just as for the EOF data, a double peak in both L and the data similarity is found for k = 4. Looking at the distribution of L we see that for k = 6 there are significantly more tests with a very small value than for k = 5, as well as a smaller spread.On the other hand the data similarity shows a more pronounced peak near the total number of data points for k = 5 compared to the result for k = 6.Turning to the same measure as used for the EOF results, the variance of the data similarity for tests with ∆L < 0.001 is given in Table 2.The variance is minimal for k = 6, which was also indicated by the BIC as being optimal, although the difference with k = 4 is small.We rule out k = 5 as a suitable number of regimes based on the variance results.
The use of subsets of odd and even years could possibly provide further evidence in the choice between using four or six regimes.The distributions of L for the subsets of odd and even years can be found in the supplementary information.Here we restrict ourselves to the variance of the data similarity for ∆L in Table 2 as a measure of the consistency of the regimes.We note that the minimum value of L is 0.011 larger for the odd years and 0.028 for the even years compared to using the full dataset.For the odd years we find that k = 6 is the optimal number of regimes, while k = 4 is found to be even worse than k = 5 when studying the variance.On the other hand for the even years k = 4 is optimal, with k = 6 having a slightly larger variance.Because the values of the variance for k = 6 differ less between the odd and even years, we conclude this number of regimes is more consistent when full field data is used.The ambiguous results, especially for k = 4, raise the question whether half the dataset is of sufficient length to draw reliable conclusions about the clustering results.This also means that non-stationarity of the regimes due to e.g.climate change is difficult to study accurately using clustering methods.
The regimes that are obtained by using either a sufficient number of EOFs (10 or higher) or the full field data do not differ significantly for the same k.Similarly the occurrence rate and persistence of the regimes do not differ significantly.In Figure 4 the four regimes known from literature (e.g.Hannachi et al., 2012;Straus et al., 2007) are shown as obtained by applying k-means clustering on the full field data for k = 4.They are the Atlantic Ridge (AR), Scandinavian Blocking (SB) and the two phases of the North Atlantic Oscillation (NAO).The persistence and occurrence rate of these regimes can be found in Figures 6a and 6c labeled by 'Field'.The positive phase of the NAO is the most occurring regime, followed by SB.The high occurrence of the NAO+ regime may reflect the fact that it is the only regime associated with a northern low pressure area.Both phases of the NAO are found to be most persistent, while the AR exhibits the least persistence.
In Figure 5 the regimes found using k-means clustering on the full field data for k = 6 are shown.The first four regimes are in essence the same as those found for k = 4 in Figure 4. Small differences occur in the location of the maximum high or low pressure area for the AR and NAO+.The two additional regimes found have a low pressure area either in the central Atlantic or over Scandinavia.The first thus can be identified as the opposite phase of the AR and we refer to it as AR-, while the original regime is denoted by AR+.Similarly we refer to the second additional regime as SB-, as it represents the opposite phase of the SB regime (from now on denoted by SB+).The use of six clusters thus introduces a pleasing symmetry in the found regimes, with an equal number of regimes having a high and low pressure area in the north of the domain.
The occurrence rate and persistence of the six regimes show different behaviour than found for k = 4, as can be seen in Table 3. Instead of the NAO+ the SB+ is found to be the most occurring regime.The NAO+ is the second ranked regime in occurrence, albeit with a small, but significant difference relative to SB+.The other four regimes show similar occurrence rates.When looking at how long the regimes last, the NAO-remains the most persistent, with exactly the same value.The NAO+ however does lose some of its persistence, reducing it to a rate similar to that of the SB+.The AR-is found to be the second most persistent regime and the AR+ remains the least persistent.

Persistence
In Section 2.1 two methods to enforce persistence of the atmospheric circulation regimes have been discussed.The first method is the standard approach of applying a time-filter, and the second method is to include a persistence constraint in the clustering algorithm itself.The regimes found using these two methods do not differ substantially from those found and discussed in the previous section.When  a time-filter is applied to the data the regimes are found to be slightly weaker, but they do not show a significant difference in the configuration of high and low pressure areas.For the results using a persistence constraint differences in the regimes only emerge for very strong (unrealistic) constraints in the form of slight shifts in the location of the centres of high and low pressure.For weak (realistic) constraints the regimes found are the same as for the unconstrained method and no weakening is found.By a 'realistic' constraint we mean one that does not force data points into regimes which are a large distance away, but only switches those data points that are in-between different regimes.In practice these are constraints corresponding to an average regime duration below circa 9 days (the corresponding C can be found in Table 1).
To verify the reliability of the result we first check the consistency of the results by considering the distribution of L and the data similarity for results with different initial conditions.Because less runs are available for the constrained algorithm the measure used in Section 3.1 is less reliable, hence we restrict ourselves to a brief discussion of the distributions which are given in the supplementary information.The results for the constrained algorithm are quite similar to those of the unconstrained algorithm, which were shown in Figure 3.The stronger the constraint, i.e. the less transitions that are allowed (smaller C, longer average regime duration), the more differences emerge.The first differences are found for k = 5 for C ≈ 1800 and lower, where the peaks become less significant for both L and the data similarity.The next deviations for even stronger constraints occur for k = 6 where the peak for L becomes smaller.However, the clear peak in the data similarity remains.Only for C smaller than 1000 do the results for k = 4 start to deviate.This leaves intact the conclusions drawn about the number of clusters in the previous section.For the low-pass filtered data the main differences in the distribution of L are that there is no double peak for k = 4 and that the peak for k = 5 is weaker.These conclusions extend to the distribution of the data similarity for both k = 4 and k = 5.This indicates that, just as for the use of EOFs, time-filtering of the data leads to increased consistency for k = 4 compared to the unfiltered, but still persistent, result.
The effects of the time-filtering and persistence constraint method are shown in Figures 6a (k = 4) and 6b (k = 6).On the left of each panel the results for the constrained algorithm are shown for various C and as expected the persistence increases with decreasing C. The smaller the value of C, the less switches between regimes are allowed.The increase of persistence with decreasing C is approximately linear for all regimes, and starts at the 'raw' persistence of the regimes.The value of C where the persistence is the same as for the unconstrained method differs between k = 4 and k = 6.For k = 6 it is found for C = 2200, corresponding to an average regime duration of 4.3 days.For k = 4 there are less regimes to switch to meaning the 'raw' persistence is larger, and is found for C = 1800 which corresponds to an average regime duration of 5.3 days.
Comparing the results for time-filtered data with those of the constrained method in Figure 6 we see that using a 5-day low-pass filter corresponds to a constraint of roughly 2000 for k = 6 and 1400 for k = 4.This difference is mainly due to the stronger effect of the constraint for a larger number of clusters.For the 10-day filter the corresponding values of C are approximately 1400 and 1100 for k = 6 and k = 4 respectively.Note that the persistence of certain regimes differs slightly between the two methods.For example the AR+ regime is found to increase its persistence relatively stronger for the time-filtered data.
The occurrence rates of the different regimes are shown in Figures 6c and 6d for k = 4 and k = 6 respectively.We start by looking at the results of the constrained algorithm.The occurrence rate remains the same as for the unconstrained data, even for constraint values significantly stronger than the 'raw' persistence of the data.Only for very low C (strong constraints) do the occurrence rates start to differ.This indicates that the method causes a switching of the 'in-between-cluster' points to the cluster of their neighbours instead of the cluster they are slightly closer to.We regard the constraint as being 'weak' so long as the occurrence rates are not affected, and helping to identify true physical persistence.In contrast, the results for the time-filtered data show significant differences in the occurrence rates of the regimes.Especially for the 10-day filter the differences for e.g. the AR+ regime (k = 4) or the NAO+ regime (k = 6) are substantial.As this is the standard filter used in literature (e.g.Straus et al., 2017) it raises the question of how reliable the occurrence rates are and whether they are not solely a feature of the method used.In contrast, the inclusion of the constraint within the clustering procedure itself does not lead to such a bias and therefore provides a more robust way of finding persistent regimes, i.e. of isolating the signal from the noise.
When using the constrained algorithm one of the choices that needs to be made is which constraint value C is best to use.Here we base this choice on the BIC, as we did for finding the optimal number  The error bars indicate the maximum and minimum value of occurrence/persistence for clustering results with a slightly smaller L (bounds for the difference are {0.00968,0.00936, ..., 0.00232, 0.002, 0.002, 0.002} decreasing with increasing C, which are chosen sufficiently small to give similar regimes according to the data correspondence. of clusters k.In Figure 7 the BIC is shown for both k = 4 and k = 6.For k = 4 the minimum is found for C = 1400 and for k = 6 it is found for C = 1500.These constraint values correspond to an average regime duration of 6.8 and 6.3 days respectively, and match the point beyond which smaller values of C start to affect the occurrence rates (Figure 6c and 6d).This increases the confidence of the optimal value of C being around these values.Interestingly, the optimal average regime duration for k = 4 and k = 6 differs by less than 10% (∆C ≈ 1500 − 1400 = 100), whereas without the persistence constraint the average duration differs by 20% (∆C ≈ 2200 − 1800 = 400).This confirms that the persistence constraint is helping identify a physical signal that is less dependent on the number of clusters chosen.The range where the BIC is very close to its minimum is between 6.3 (C = 1500) and 7.9 (C = 1200) days for k = 4.For k = 6 this range is from 5.9 (C = 1600) to 6.8 (C = 1400) days.Twice this timescale, which is the minimum for recurrence of a regime, thus corresponds roughly to a period of 12 to 14 days.This is somewhat longer than the timescale of synoptic weather systems (Blackmon et al., 1977;Boljka et al., 2018).As this is an average there are a substantial number of longer lasting regimes showing persistence well beyond the synoptic timescale.

Conclusion and Discussion
In this study we have shown, using an information criterion and further arguments based on the consistency of the clustering result, that the traditional number of four clusters is not optimal for representing wintertime Euro-Atlantic weather regimes when full field data is used.The traditional approach of applying clustering to the first few EOFs involves a loss of information, which affects the number of regimes that is best to represent the data.The optimal number of regimes for the full field data was identified using the Bayesian Information Criterion (BIC), which finds a balance between how well the regimes fit the data and the number of parameters needed to describe them.We find that for the full field data, six regimes is the optimal choice.The two additional regimes are the opposite phases of the Atlantic Ridge and Scandinavian Blocking, introducing a pleasing symmetry in the found clusters.Furthermore, the dominant occurrence of the NAO+ when there are only four clusters, which likely is due to it being the only regime with a low pressure area up north, is reduced by the addition of two regimes that also have this feature.Therefore, six regimes allow for more variability in their representation of the circulation and prevent all data with a more zonal flow projecting onto the NAO+.
Furthermore we looked into ways to enforce persistence of the regimes.The standard approach in literature is to apply a low-pass filter to remove high frequency oscillations and focus on the persistent behaviour (e.g.Straus et al., 2017).This alters the data to which the clustering algorithm is applied, just as the use of EOFs does.We have shown that this leads to a significant change in the occurrence rate of the circulation regimes.A new method, which incorporates a persistence constraint in the algorithm itself, does not change the data while still enforcing persistent regimes.The results for this approach do not exhibit the change in occurrence rate found for the time-filtered data, as long as the constraint is not too strong, while still having an increased persistence.Therefore this method leads to a more robust and unbiased result compared to the time-filtering approach.
A choice that needs to be made in this adapted clustering method is the value of the constraint C. Using the BIC the optimal value of C is found to lie around an average regime duration of six to seven days.Interestingly, this matches the point beyond which smaller values of C start to affect the occurrence rates.Thus it can be viewed as a more accurate estimate of the physical persistence of the regimes than that provided by the raw data without the persistence constraint.Double this value, which is the minimum for recurrence of a regime, thus is slightly longer than that of synoptic weather systems (Blackmon et al., 1977;Boljka et al., 2018).This shows that the atmospheric circulation indeed exhibits persistence beyond the synoptic timescale, suggesting the presence of predictable low-frequency modes.
Both results indicate that care must be taken when applying filtering methods to the data before a clustering algorithm is applied.Clustering itself provides a way of dimension reduction, by projecting onto components representing recurrent patterns in the data.Since this is a method of filtering the data it seems ill-advised to apply this to already filtered data, as it is not clear what the effect of this double filtering is on the result.A similar argument holds for applying a time-filter to the data before clustering.Information is lost in this procedure, introducing a bias in the resulting circulation regimes and their occurrence rates.for k = 4.This decrease, even if just significant, needs to be kept in mind when considering the results but is not strong enough to provide an argument for reconsidering the period studied.The occurrence rate of the NAO+ also slightly decreases for k = 4, whereas for the other regimes no significant changes are found in the occurrence.Finally for k = 6 only the occurrence of the NAOregime shows a significant change, being smaller than when March is included in the data.This means that the NAO-occurs relatively more often in March and this is the only difference that is strong enough to possibly argue for excluding March from the period considered.However, because it is only one of many aspects that show such a change we decide to stick to the period December through March for the benefit of the additional data, meanwhile keeping the difference in mind when discussing the occurrence of the regimes.

B Technical Details B.1 Time-Filtering
There are numerous ways in which data can be filtered to get rid of the high-frequency oscillations.Therefore we detail the method used to obtain the filtered data.The low-pass filter we use is the sinc-function multiplied by the Blackman window (Smith, 2002).This window is applied because otherwise the filter never fully reaches zero.The Blackman window is given by: where N is the total number of points and n ∈ [0, N − 1].Multiplying the Blackman window with the sinc function results in the windowed sinc-filter: where f c is the cut-off frequency.The time-filtered data is computed by convolving the filter with the time-series for each grid point.the time-filter this is the cut-off frequency and for the persistent algorithm this is the value of the persistence constraint.The approach taken here for choosing these two parameters is mirrored to the approach in the main article.This means we a priori set the cut-off frequency for the time-filtering to 150 time steps, being the equivalent of the 10-day filter used for the atmospheric circulation regimes.On the other hand the value of the persistence constraint is determined using an information criterion.Both the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) are explored and in the end the BIC is chosen as the better criterion.This is because when the two criteria differ in the location of their minimum the clusters for the BIC-minimum are closer to the best result than those for the AIC-minimum.Note that the information criterion approach can also be taken to determine the optimal cut-off frequency for the time-filtering.

C Consistency Results
For the same realisation as in Figure S10 the results for the time-filtered data and clustering using a persistence constraint are shown in Figure S11.Both results show a clear improvement with respect to the standard approach in Figure S10b.The result for the time-filtered data shows a slightly different assignment of data to clusters for the transition trajectories, but other than that the results both are as desired.We note that for this realisation of the model the BIC shows a clearly identifiable minimum, which is not always the case.The discussed results show a realisation of the model in which both methods work nicely.This however is not the case for every realisation of the Lorenz 63 model.Already when applying k-means clustering to the y-z data the correct clusters are not always identified, as can be seen in Figure S12a.This result likely improves if more data is included, but as a limited amount of data is one of the difficulties of real world clustering it is important to note this limitation.Furthermore the BIC does not always point towards the correct result, which can be seen in Figure S12b.By looking at the clusters for different values of the constraint it is possible to identify a better value, but as this is impossible for the high-dimensional atmospheric data in which the circulation regimes are identified this is not a desirable option.We note that also for the time-filtered data the result is not always as good as shown in Figure S11a, although in general it is slightly more robust than the results for the persistence constraint.
Applying the clustering methods used to identify circulation regimes in atmospheric data to the Lorenz 63 system teaches us to be careful in relying too much on the outcome of the algorithm.Even for the 'simple' Lorenz 63 system the clustering algorithm does not always identify the correct clusters.For the even more complex atmospheric data in which the circulation regimes are identified this is likely an even larger difficulty.This does not mean the result is not useful, but it is important to be aware of the limitations of the method.
(a) The AIC for different numbers of EOFs.(b) The AIC and BIC for the full field data.

Figure 1 :
Figure 1: Information criteria for both the full field and EOF datasets for the range k = 2, ..., 10.Both the AIC and BIC, as given in Section 2.2, are shown for the full field data.For the EOF data only the AIC is shown.
(a) The clustering functional L. (b) The data similarity.

Figure 2 :
Figure2: Histograms for the clustering functional L and the data similarity with respect to the optimal (minimal L) result using the first 20 EOFs for k = 3, ..., 6.

Figure 3 :
Figure3: Histograms for the clustering functional L and the data similarity with respect to the optimal (minimal L) result using the full field data for k = 4, ..., 7.

Figure 4 :
Figure 4: The clustering result of the standard k-means algorithm applied to the full field data for k = 4.

Figure 5 :
Figure 5: The clustering result of the standard k-means algorithm applied to the full field data for k = 6.
(a) The persistence rates for k = 4.(b) The persistence rates for k = 6.(c)The occurrence rates for k = 4.(d) The occurrence rates for k = 6.

Figure 6 :
Figure 6: The occurrence and persistence rates of the different regimes for k = 4 and k = 6 for the clustering results including the persistence constraint C depending on the value of C. To the right the values for the unconstrained algorithm (field) and the 5-and 10-day low-pass filter are shown.The error bars indicate the maximum and minimum value of occurrence/persistence for clustering results with a slightly smaller L (bounds for the difference are {0.00968,0.00936, ..., 0.00232, 0.002, 0.002, 0.002} decreasing with increasing C, which are chosen sufficiently small to give similar regimes according to the data correspondence.

Figure S2 :
Figure S2: The clustering results of the standard k-means algorithm for the data including November applied to the full field for k = 4.

Figure S3 :
Figure S3: The clustering results of the standard k-means algorithm for the data without March applied to the full field for k = 6.

Figures
Figures showing the distributions of ∆L and the data similarity are shown for different datasets and clustering algorithms.

Figure S4 :
FigureS4: Histograms for the clustering functional L and the data similarity with respect to the optimal (minimal L) result using the odd years of the full field data for k = 4, ..., 7.

Figure S5 :
FigureS5: Histograms for the clustering functional L and the data similarity with respect to the optimal (minimal L) result using the even years of the full field data for k = 4, ..., 7.

Figure S6 :
FigureS6: Histograms for the clustering functional L and the data similarity with respect to the optimal (minimal L) result for the 10-day low-pass-filter results.

Figure S7 :
Figure S7: Histograms for the clustering functional L and the data similarity with respect to the optimal (minimal L) result for the constrained results with C = 2000.

Figure S8 :
FigureS8: Histograms for the clustering functional L and the data similarity with respect to the optimal (minimal L) result for the constrained results with C = 1500.

( a )
The clustering functional L.(b) The data similarity.

Figure S9 :
FigureS9: Histograms for the clustering functional L and the data similarity with respect to the optimal (minimal L) result for the constrained results with C = 1000.
(a) Applying a low-pass filter to the data for a cut-off frequency of 150 time steps.(b) Incorporating a persistence constraint in the clustering algorithm.

Figure S11 :
Figure S11: Clustering of the Lorenz 63 system in the x-z-plane using different methods to enforce persistence.
(a) Standard k-means applied to the y-z-data.(b) Incorporating a persistence constraint in the clustering algorithm.

Figure S12 :
Figure S12: The clustering approaches discussed do not always identify the correct clusters.

Table 1 :
The value of C with corresponding average regime duration.

Table 3 :
The values of the occurrence rate and persistence for the unconstrained result for both k = 4 and k = 6.