Spatial regression of multi-ﬁdelity meteorological observations using a proxy-based measurement error model

. High-resolution weather maps are fundamental components of early warning systems, since they enable the (near) real-time tracking of extreme weather events. In this context, crowd-sourced weather networks producing low-ﬁdelity observations are often the only type of data available at local (e


Data
For 29 June 2021, the KNMI issued a "code orange" weather warning for heavy precipitation in two provinces (i.e. Noord-Holland and Gelderland). We collect the official high-fidelity observations (N = 32) from the KNMI network (i.e. 08:00 to 08:00 LT -local time -CET the day next), and the lowfidelity observations (N ≈ 650) from personal weather stations (Kirk et al., 2021) from the WOW-NL network (https: //wow.knmi.nl/, last access: 30 May 2023). We also utilize high-resolution covariates (CBS, 2023), such as distance to coast, population, tree cover and rain radar (https://data.4tu. nl/articles/_/12675278/1, last access: 16 December 2022). The rationale is that, for example for wind speed, we expect a local effect on wind speed in areas with high percentage of tree cover, proximity of open water, etc.

Methodology
In the context of Bayesian spatial data analysis (Wikle and Berliner, 2007) for producing high-resolution maps (de Baar and Garcia-Marti, 2022; van Beekvelt, 2022), we are interested in predicting m grid values X of an individual variable, conditional on n observations y. In the presence of covariates, we have the prior where the m × m matrix P is the process covariance matrix (Sect. 2.3, Wikle andBerliner, 2007 andEq. 3, Hengl et al., 2007). We use a Gaussian kernel to model the prior covariance matrix P ij =σ 2 x exp − 1 2 h 2 ijθ −2 , withσ 2 x the estimated process variance and lag h ij the spatial distance between locations i and j . The correlation length scaleθ remains to be estimated. The m × q matrix Q contains the q covariates evaluated at the grid points, where we use the rain radar as a covariate for precipitation, and distance to coast, population and tree cover as covariates for temperature and wind speed. The q ×1 covariate weight vectorβ Q remains to be estimated. The possibility of using a non-Euclidian lag is explored in de Baar and Garcia-Marti (2022) and references therein.  The likelihood is interpreted as a model of the observational process (de Baar et al., 2014;Sect. 2.3, Wikle and Berliner, 2007) with H the n × m observation operator and B the m × b bias budget matrix containing b bias proxies evaluated at the observation locations. For the n×n noise covariance matrix, we assume spatially uncorrelated noise √ R = Nβ N I n×n , with N the n × p noise budget matrix containing p noise proxies evaluated at the observation locations. The b × 1 bias weight vectorβ B and the p ×1 noise weight vectorβ N remain to be estimated. Our rationale is that the -calibrated -high-fidelity observational process includes no bias and only limited uniform noise, while the low-fidelity observational process involves significant bias and noise budgets.
The novelty in this approach is that we define a bias budget and noise budget as part of the likelihood. Currently, the bias and noise budgets contain indicators for network type, but they could also include variable members like local population density, local sky view factor, station height above ground, etc. We then estimate the weights of the members of these budgets from the data y. In particular, we estimate the hyperparametersσ xβQ andβ B from an analytical maximum likelihood estimate (Hengl et al., 2007, Eq. 4), where we use the same approach for each individual weather variable. For the reason of robustness, we have opted to estimate the hyperparamtersθ andβ N from brute-force iterative leave-one-out cross-validation (James et al., 2013, Sect. 5.1), where, because of the bias in the low-fidelity data, we only leave out high-fidelity data during cross-validation. The resulting map, which is the result of the above Bayesian updating process, is then a conditional multivariate normal distribution p(X|y) = N (x,Ĉ), defined by the posterior mean x = E(X|y) and posterior covarianceĈ = cov(X|y), which are given by Hengl et al. (2007, Eqs. 5-6) and Wikle and Berliner (2007, Eqs. 5-7). In the posterior covariance, we take into account the uncertainty caused by the use of covariates (Hengl et al., 2007) and apply an inflation factor f = n/(n − n hyp ) to account for the observation-based estimation of n hyp hyperparameters.

Results and discussion
Figures 1-3 illustrate weather maps for 08:00-09:00 LT (CET). Here we show results for air temperature, wind speed and precipitation. Each figure shows (column 1) the maps based only on high-fidelity, (column 2) the maps based on high-fidelity and covariate data (column 3) the maps based on high-and low-fidelity, and (column 4) the maps based on high-fidelity, low-fidelity and covariate data. To demonstrate the general applicability of the methodology, we do not presently apply any transformations to the different variables; although this would be possible to do in the future.
In order to quantify the improvement achieved by including additional data, in the following sub-sections we use the terminology by Roache (1998). By validation, we mean that we compare the output of a computer code to observations that were not included as an input. By verification, we mean that we compare the output of one run of a computer code with a different run of the same code, but with different inputs or settings.

Cross-validation
We apply leave-one-out cross-validation of the high-fidelity KNMI stations to compute a root-mean-squared (RMS) prediction error (James et al., 2013). For robustness, we leave out the lowest and highest error before we compute the RMS prediction error. We repeat this for each of the 24 hourly time slices, and compute an overall RMS error. The results are summarized in Table 1. For air temperature and wind speed, we observe overfitting in method II (KNMI + cov), while method IV (KNMI + WOW + cov) gives a slight improvement. For precipitation, we see that method II already gives a good result, which is an indication that the rain radar is a very good covariate for precipitation by itself, even without including WOW data.

Verification and synergy effect
We have shown that high-resolution, high-confidence maps can be produced by combining high-fidelity data with a dense network of low-fidelity data and suitable covariates. However, the cross-validation in Sect. 3.1 is somewhat biased towards the "ideal" locations of the KNMI stations, which do not sample urban areas and forests, and do not always sample the locations of extreme precipitation. Therefore, from cross-validation we might not observe the improvement of the maps in those areas. To address this issue, in this subsection we run a verification study to quantify the improvement of the entire map, as achieved by including WOW data and/or covariates.
While in the cross-validation in Sect. 3.1 we used left-out observations as a reference for the output of our spatial regression, in the verification study in this sub-section we use a reference map to compare with. For each hour, we build this reference map by including all available data. Then, we run the spatial regression with a reduced number of WOW stations and/or a reduced number of covariates. Comparing these "reduced data" maps with the reference map provides us with an RMS difference for the entire country. The change in this RMS difference that we observe when increasing the amount of WOW stations and/or covariates gives us an idea of their contribution to the spatial regression output.
In Fig. 4a, we show the effect of including WOW data and/or covariates over time, while in Fig. 4b we show the same data in a scatter plot. Interestingly, we see a "synergy" effect, where adding WOW data and covariates at the same time gives a larger improvement than the sum of the individual improvements. The most likely reason for this is that, by adding crowd-sourced WOW stations, we are not only increasing spatial station density, but are also sampling the higher-value areas (e.g. cities, forests, rain showers) of covariates that we were not sampling before.

Competing interests.
The contact author has declared that none of the authors has any competing interests.
Disclaimer. Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Special issue statement.
This article is part of the special issue "EMS Annual Meeting: European Conference for Applied Meteorology and Climatology 2022". It is a result of the EMS Annual Meeting: European Conference for Applied Meteorology and Climatology 2022, Bonn, Germany, 4-9 September 2022. The corresponding presentation was part of session OSA3.2: Spatial climatology.