Advanced information criterion for environmental data quality assurance

Abstract. A new method for testing time series of environmental data for internal inconsistencies is presented. The method divides the dataset into several disjoint blocks. Each block is compared with the others by means of the blocks' estimated probability densities. To judge the differences, four measures are used and compared: Kullback-Leibler Divergence, Jensen-Shannon Divergence, Earth Mover's Distance and the Root Mean Square. From the resulting patterns, conclusions on possible inconsistencies in the data can be drawn. This paper shows some sensitivity tests and gives an example of an application to real data. Furthermore, it is shown which measure performs best for which type of error (shift in mean, shift in variance and rounding).


Introduction
When data measured from natural systems are used to draw conclusions about the observed system, data quality assurance is a very important factor. In this quality assurance process not only the metadata but also the data itself should be checked. For this kind of check, several generic methods are available, meaning that they may be applied to any data set without detailed knowledge of its specifics.
These generic methods are mostly based on rules described by Meek and Hatfield (1994). These rules check each data point separately for whether specified limits are exceeded (LIM), whether the number of successive unchanging elements exceeds a predefined number (NOC), or whether the rate of change between two successive data points exceeds a limit (ROC). Those rules were applied to several datasets with different methods for defining the parameters of the tests (e.g. Hubbard et al., 2005; Zahumensky, 2007; Jiménez et al., 2010; Durre et al., 2010). For data sets that are available on a regular basis, like those from meteorological networks, methods like "Complex Quality Control" (CQC), developed by Gandin (1988), homogenization (Peterson et al., 1998) or the approach of Mathes et al. (2008) might be more useful.
When the sources of data are unknown, a more general procedure is required. In this paper a newly developed method for this problem is presented. It is based on the analysis of the estimated probability density of the data, for which two basic approaches are possible. One looks at statistical moments (like mean, standard deviation or percentiles) and their development within the whole dataset. The other investigates the full distribution information. This is the approach pursued in this paper.
To avoid any preconditioning of the results, a nonparametric density estimation shall be the starting point. With the help of these estimates an evaluation of the two probability densities is performed. The assumption used is that two probability densities, characterizing the data in two time windows, are identical. A strict hypothesis test in the statistical sense is not performed.
In the next step a distance measure between two densities is defined. Standard methods like the Kolmogorov-Smirnov test are very sensitive to sample variability (Owen, 1995). Therefore, we would like to use more robust measures. These have to take into account the full structure of the estimated densities or their integrals, the probability distributions. This is in contrast to the Kolmogorov-Smirnov test, which only takes into account the difference at one point, namely the maximum deviation. There are several divergences available for comparing distributions. Those used in this paper are described in Sect. 2: Kullback-Leibler (Kullback and Leibler, 1951), Jensen-Shannon (Endres and Schindelin, 2003) and the Earth Mover's Distance (Rubner et al., 2000). In Sect. 3 some sensitivity studies are performed before an application is shown in Sect. 4. The paper ends with a discussion in Sect. 5 and is summarized in Sect. 6.
Published by Copernicus Publications.
A. Düsterhus and A. Hense: Advanced information criterion for environmental data quality assurance

Method
The basic method relies on dividing the dataset into blocks of block size $s_b$ and comparing every block to the others. This is carried out by comparing the blocks' normalized histograms, each with $n_b$ bins, as estimators of the underlying probability density. For each pair of blocks, the bins are uniformly distributed between the minimum and maximum of both blocks.
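The blocking and histogram step described above can be sketched as follows. This is a minimal illustration in Python with NumPy, not the authors' implementation: the function names `split_into_blocks` and `paired_histograms` are hypothetical, and dropping an incomplete trailing block is an assumption, since the text does not say how a remainder is handled.

```python
import numpy as np


def split_into_blocks(series, block_size):
    """Cut a 1-D series into disjoint blocks of length block_size (s_b).

    Assumption: trailing values that do not fill a complete block are dropped.
    """
    series = np.asarray(series, dtype=float)
    n_blocks = series.size // block_size
    return series[: n_blocks * block_size].reshape(n_blocks, block_size)


def paired_histograms(block_a, block_b, n_bins):
    """Normalized histograms of two blocks on a common set of n_bins (n_b) bins.

    The bin edges are uniform between the joint minimum and maximum of both
    blocks, so the two histograms are directly comparable bin by bin.
    """
    lo = min(block_a.min(), block_b.min())
    hi = max(block_a.max(), block_b.max())
    edges = np.linspace(lo, hi, n_bins + 1)
    counts_a, _ = np.histogram(block_a, bins=edges)
    counts_b, _ = np.histogram(block_b, bins=edges)
    return counts_a / counts_a.sum(), counts_b / counts_b.sum()
```

Because the bin edges depend on the pair of blocks, the histogram of a given block is recomputed for every comparison rather than stored once.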
To determine the difference between both histograms ($f, g \in \mathbb{R}^{n_b}$), the following distance measures are used.
Kullback-Leibler Divergence (KLD). KLD is based on the work described in Kullback and Leibler (1951). It is an asymmetric function of two histograms and is used here in the form given by Lin (1991). A problem obviously occurs when $g(x) = 0$ for any $x \in [1, n_b]$, because the logarithm of the ratio $f(x)/g(x)$ is then undefined. To prevent this, a prior estimate $a_p$ is introduced for every bin of both estimated probability densities. In the resulting histogram, $h_i$ is the value of bin $i$, $a_i$ is the number of observations in bin $i$ and $s_b$ is the total number of observations in the block. To couple $a_p$ to the number of observations $s_b$, $a_p$ is defined in terms of a small factor $a_f$ and $s_b$.
Jensen-Shannon Divergence (JSD). JSD is a symmetrization of the KLD (Endres and Schindelin, 2003). JSD and KLD are positive definite functionals, but neither is a "real" distance measure, because they do not obey the triangle inequality.
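The two divergences and the role of the prior can be illustrated with a short sketch. It rests on several assumptions that are not taken from the paper: the prior is applied as additive smoothing, $h_i = (a_i + a_p)/(s_b + n_b a_p)$; the natural logarithm is used; the JSD is written with the common factor 1/2; and $a_p$ is passed in directly because its exact coupling to $a_f$ and $s_b$ is not reproduced here. All function names are hypothetical.

```python
import numpy as np


def smoothed_histogram(counts, a_p):
    """Normalized histogram with a prior a_p added to every bin.

    ASSUMED form: h_i = (a_i + a_p) / (s_b + n_b * a_p), which keeps every
    bin strictly positive and the histogram normalized to one.
    """
    counts = np.asarray(counts, dtype=float)
    return (counts + a_p) / (counts.sum() + counts.size * a_p)


def kld(f, g):
    """Kullback-Leibler divergence D_KL(f || g) = sum_x f(x) log(f(x) / g(x)).

    Both inputs must be strictly positive, which the prior guarantees.
    """
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    return float(np.sum(f * np.log(f / g)))


def jsd(f, g):
    """Jensen-Shannon divergence as the symmetrized KLD against the mixture."""
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    m = 0.5 * (f + g)
    return 0.5 * kld(f, m) + 0.5 * kld(g, m)
```

Swapping the arguments of `kld` generally changes the result, whereas `jsd` is symmetric by construction.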

Earth Mover's Distance (EMD). EMD was developed as a solution of a transportation problem (Rubner et al., 2000). In contrast to KLD and JSD it does not rely on a bin-wise ratio. Rather, it determines how to transform one histogram into the other. To do this, the probability of every bin is seen as a mass that has to be transported. The EMD measures the minimal work that has to be invested for this task. Importantly, the distance between two bins is not neglected, but defined as $d(i, j) = |i - j| / n_b$. For a one-dimensional histogram, the EMD can be computed from $F$ and $G$, the cumulative distribution functions of $f$ and $g$ (Rabin et al., 2008). EMD is a true distance measure, being positive definite, symmetric and obeying the triangle inequality. EMD is a special case of the more general Wasserstein distance of probability density functions (Levina and Bickel, 2001).
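For one-dimensional histograms the transportation problem does not need to be solved explicitly. The sketch below uses the well-known result that, with ground distance $|i - j|$, the minimal work equals the summed absolute difference of the cumulative distributions; dividing by $n_b$ then accounts for the ground distance $d(i, j) = |i - j| / n_b$. The function name `emd_1d` is hypothetical, and the exact normalization used in the paper is an assumption.

```python
import numpy as np


def emd_1d(f, g):
    """Earth Mover's Distance between two normalized 1-D histograms.

    Uses the cumulative distributions F and G; the division by the number
    of bins reflects the assumed ground distance d(i, j) = |i - j| / n_b.
    """
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    cdf_f = np.cumsum(f)
    cdf_g = np.cumsum(g)
    return float(np.sum(np.abs(cdf_f - cdf_g)) / f.size)
```

Moving the whole mass from the first to the last of four bins, for example, costs `emd_1d([1, 0, 0, 0], [0, 0, 0, 1])`, which is the bin distance 3 divided by $n_b = 4$.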
Root Mean Square (RMS). RMS is used only as a reference in this paper, in its well-known definition.
When such a method is used to evaluate a dataset, a typical resulting plot consists of a two-dimensional array. Each entry is the result of a comparison of two parts of the dataset. On the diagonal, each part of the dataset is compared to itself and the value should be zero. This condition is fulfilled by all four distance measures. The rest of the array is filled with the distances between the histogram of every part and those of the others. All measures except the KLD deliver symmetric arrays.
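The construction of the two-dimensional array can be sketched as below, using the RMS as the simplest of the four measures. The names `rms` and `distance_matrix` are hypothetical, and the RMS is assumed to be the square root of the mean squared bin-wise difference; any other measure that takes two histograms can be passed in its place.

```python
import numpy as np


def rms(f, g):
    """Root mean square of the bin-wise difference of two histograms."""
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    return float(np.sqrt(np.mean((f - g) ** 2)))


def distance_matrix(blocks, n_bins, measure):
    """Compare every block with every other block.

    Entry [i, j] is measure(f_i, f_j), where both histograms share bins
    that span the joint range of blocks i and j.
    """
    n_blocks = len(blocks)
    dist = np.zeros((n_blocks, n_blocks))
    for i in range(n_blocks):
        for j in range(n_blocks):
            lo = min(blocks[i].min(), blocks[j].min())
            hi = max(blocks[i].max(), blocks[j].max())
            edges = np.linspace(lo, hi, n_bins + 1)
            f = np.histogram(blocks[i], bins=edges)[0] / blocks[i].size
            g = np.histogram(blocks[j], bins=edges)[0] / blocks[j].size
            dist[i, j] = measure(f, g)
    return dist
```

For a symmetric measure only the upper triangle would need to be computed; the full double loop is kept here so that an asymmetric measure such as the KLD can be used unchanged.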
In the next section sensitivity tests are performed in order to simulate the influence of the different distance measures on this method. Because the method delivers only relative results, it is necessary to define a quantity that makes the different measures comparable. Therefore the dataset is separated into two parts, each of which is given a different characteristic. When the blocks are compared to each other, it is then known which comparisons look at blocks with the same characteristics and which at blocks with different characteristics. To quantify the difference between these two groups of comparisons, the measure $x_{sd}$ (Eq. 7) is introduced. It is based on $\mu_{same}$ and $\sigma_{same}$, the mean and standard deviation of those distances that compare sections of the dataset with the same characteristics, and on $\mu_{diff}$ and $\sigma_{diff}$, the corresponding quantities for sections with different characteristics. Distances that are zero are neglected in the calculation of $x_{sd}$.
It is plausible that higher values of $x_{sd}$ mean that the differences in the data set are easier to detect than for lower values.
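A separation score of this kind can be sketched as follows. The group statistics follow the description above (mean and standard deviation of the non-zero distances in each group). The way they are combined into a single number is an assumption made purely for illustration: the ratio $(\mu_{diff} - \mu_{same}) / (\sigma_{same} + \sigma_{diff})$ is one plausible choice and is not claimed to be the form of Eq. (7). The function names are hypothetical.

```python
import numpy as np


def group_statistics(distances):
    """Mean and standard deviation of the non-zero distances in a group."""
    d = np.asarray(distances, dtype=float).ravel()
    d = d[d != 0.0]  # zero distances are neglected, as stated in the text
    return d.mean(), d.std()


def separation_score(same, diff):
    """Illustrative stand-in for x_sd.

    ASSUMED form: (mu_diff - mu_same) / (sigma_same + sigma_diff).
    The combination actually used in Eq. (7) may differ.
    """
    mu_same, sigma_same = group_statistics(same)
    mu_diff, sigma_diff = group_statistics(diff)
    return (mu_diff - mu_same) / (sigma_same + sigma_diff)
```

With this assumed form, a score above 1 means that the two group means are further apart than the sum of the two standard deviations.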

Sensitivity tests
In this section some characteristics of the methodology are discussed using simulated observations, and the different distance measures are compared. For the simulation a sample of 2000 realizations of a Gaussian distributed and normalized (mean zero, variance 1) random variable is used. The sample is split into two equally large subsamples, of which the second is subjected to a change. Afterwards, the method is applied with a block size $s_b = 100$ and $x_{sd}$ is calculated. In this calculation the comparisons with "different characteristics" are those of the first half (blocks 1 to 10) with the second half (blocks 11 to 20). For the comparisons with the "same characteristics", the comparisons of the second half with itself are used. In the next section, the change applied to the second half is a rounding to the first digit.
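The test setup can be sketched as below. The rounding is interpreted here as rounding to one decimal place, which is an assumption (the `decimals` parameter makes other interpretations easy to try), and the names `make_rounded_sample` and `pair_masks` are hypothetical.

```python
import numpy as np


def make_rounded_sample(rng, n=2000, decimals=1):
    """Standard normal sample whose second half is rounded.

    ASSUMPTION: 'rounding to the first digit' is read as one decimal place.
    """
    sample = rng.normal(loc=0.0, scale=1.0, size=n)
    half = n // 2
    sample[half:] = np.round(sample[half:], decimals)
    return sample


def pair_masks(n_blocks, first_changed):
    """Boolean masks over the block-by-block distance array.

    diff: first (unchanged) half compared with the second (changed) half.
    same: second half compared with itself (the diagonal is included here
          and drops out later because zero distances are neglected).
    """
    changed = np.arange(n_blocks) >= first_changed
    diff = ~changed[:, None] & changed[None, :]
    same = changed[:, None] & changed[None, :]
    return diff, same


rng = np.random.default_rng(42)
sample = make_rounded_sample(rng)
blocks = sample.reshape(20, 100)  # s_b = 100 gives 20 blocks
diff_mask, same_mask = pair_masks(n_blocks=20, first_changed=10)
```

Applying `diff_mask` and `same_mask` to a 20 by 20 distance array selects the two groups of comparisons from which the separation measure is computed.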

Influence of $a_f$
In the definition of KLD and JSD the value $a_f$ is used to set the amplitude of the prior for each bin. For Fig. 1, $x_{sd}$ is calculated for 100 different randomly drawn vectors, and the mean is shown for 200 different values of $n_b$ and eleven different values of $a_f$, which are distributed on a logarithmic scale.
In principle, better values are achieved for a higher number of bins. It is also better to use higher $a_f$ values, which is equivalent to a lower prior $a_p$ for each bin. Beyond $a_f = 100$, no further significant difference is detectable. That is why this value is chosen for the next tests with the KLD and the JSD. As a next step, a value of $x_{sd}$ has to be chosen above which two sections with different characteristics are clearly distinguishable. This is taken to be the case when $x_{sd}$ exceeds 1. In Sects. 3.2 and 3.3, $x_{sd} = 1$ is also used as the detection limit for inconsistencies within a dataset. The condition that $x_{sd}$ exceeds 1 is also used here to determine the number of bins $n_b$ of the histograms. It is fulfilled at approximately $n_b = 65$, which is used throughout the remainder of the paper.

Shift in mean
As a second sensitivity test, the detection of a regime shift is used. Unlike before, the second half of the tested dataset is not rounded; instead, a shift of $y_{sd}$ standard deviations is added. This $y_{sd}$ is varied in the range of 0 to 5 and the evaluation is carried out as before. The mean results for 100 vectors and their standard deviation with $n_b = 65$ are shown in Fig. 2. Here, the results measured in $x_{sd}$ are plotted against the added value measured in standard deviations, $y_{sd}$. The detection limit chosen before is indicated by a line at $x_{sd} = 1$. Since KLD is asymmetric, it delivers different results if the two input histograms are swapped. Therefore, only the better result of the KLD is shown.
The best detection result is achieved by the EMD. This distance measure is highly sensitive for low values of $y_{sd}$ and reaches the detection limit at about $y_{sd} \approx 0.4$. The three other distance measures are less sensitive and reach their detection limit at about twice the value of the EMD, $y_{sd} \approx 0.9$. For higher values of $y_{sd}$, the JSD measure detects shifts slightly better than the KLD. RMS proves to be worst in detecting the shifts.
Increasing the number of bins $n_b$ deteriorates these results except for EMD (not shown).
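The manipulation used in this test is simple enough to state as code. In the sketch below the function name `shift_second_half` and the grid of eleven $y_{sd}$ values are hypothetical; the text only gives the range 0 to 5, not the step.

```python
import numpy as np


def shift_second_half(sample, y_sd, sigma=1.0):
    """Return a copy of the sample whose second half is shifted in the mean.

    The shift is y_sd standard deviations; sigma = 1 for the normalized
    Gaussian sample used in the sensitivity tests.
    """
    shifted = np.array(sample, dtype=float)  # np.array copies its input
    half = shifted.size // 2
    shifted[half:] += y_sd * sigma
    return shifted


rng = np.random.default_rng(0)
sample = rng.normal(size=2000)
y_values = np.linspace(0.0, 5.0, 11)  # assumed grid of shifts
shifted_samples = [shift_second_half(sample, y) for y in y_values]
```

Each shifted sample would then be cut into blocks of $s_b = 100$ and evaluated with the four measures exactly as in the rounding test.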

Shift in variance
As before, the second half of the dataset is manipulated, but now it is multiplied by $y_{sd}$ rather than shifted by it, which increases the variance. The results are shown in Fig. 3 and are constructed as specified in Sect. 3.2. Once again EMD delivers the best results, reaching the detection limit at the smallest value, $y_{sd} = 1.5$. Next is the KLD, for which the better of the two possible ways of calculating this measure reaches the detection limit at around $y_{sd} = 2.0$. The RMS follows with $y_{sd} = 2.1$, and the worst results are delivered by the JSD, which reaches the detection limit at around $y_{sd} = 2.5$. At this point it is necessary to mention briefly that the asymmetry of the KLD plays a large role in this test. While the difference between choosing $D_{KL}(f\|g)$ and $D_{KL}(g\|f)$ can be neglected when a change in the mean occurs, in the case of a variance shift the inferior variant did not reach the detection limit below $y_{sd} = 5.0$.
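The asymmetry of the KLD under a change in variance can be made visible with a small experiment. The sketch below is illustrative only: it compares the two complete halves of one sample rather than blocks of 100, and it assumes an additive prior with a fixed, arbitrarily chosen value `a_p = 1e-4` and the natural logarithm. All function names are hypothetical.

```python
import numpy as np


def scale_second_half(sample, y_sd):
    """Return a copy of the sample whose second half is multiplied by y_sd."""
    scaled = np.array(sample, dtype=float)
    half = scaled.size // 2
    scaled[half:] *= y_sd
    return scaled


def smoothed_hist(values, edges, a_p):
    """Normalized histogram with an assumed additive prior a_p per bin."""
    counts = np.histogram(values, bins=edges)[0].astype(float)
    return (counts + a_p) / (counts.sum() + counts.size * a_p)


def kld(f, g):
    """Kullback-Leibler divergence D_KL(f || g) with the natural logarithm."""
    return float(np.sum(f * np.log(f / g)))


rng = np.random.default_rng(7)
sample = scale_second_half(rng.normal(size=2000), y_sd=3.0)
narrow, wide = sample[:1000], sample[1000:]

edges = np.linspace(sample.min(), sample.max(), 65 + 1)  # n_b = 65
f = smoothed_hist(narrow, edges, a_p=1e-4)
g = smoothed_hist(wide, edges, a_p=1e-4)

kl_narrow_wide = kld(f, g)  # D_KL(narrow || wide)
kl_wide_narrow = kld(g, f)  # D_KL(wide || narrow)
```

The two directions weight the bins differently: `kl_wide_narrow` is dominated by the tails, where the narrow histogram contains little more than the prior, whereas `kl_narrow_wide` is governed by the central bins that both histograms populate.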

Application
As previously, the tests are performed with all four different divergence measures, here on the maximum wind at Lindenberg station between 1991 and 2010. The parameters are set to $n_b = 65$ and $s_b = 365$. The latter choice serves to eliminate seasonal effects: it prevents the bias that would arise if one season were represented more often than another within a block. The results are shown in Fig. 4.
KLD (Fig. 4a) and JSD (Fig. 4b) in particular show a pattern of higher values in the years 1991, 1999 and 2000. RMS (Fig. 4c) also delivers such an indication, but in the result produced with the EMD (Fig. 4d) no evidence of these special time periods can be found.
One reason for these higher values is demonstrated for the period 1999 and 2000. Fig. 4e displays the time series for the period July 1998 to July 2001. Evidently, on 1 December 1998 a change in the recording procedure was introduced, after which the data were stored only to the nearest integer. This period ends at the beginning of April 2001, when another change in the recording procedure occurred. The same rounding of the data can be found in the dataset up to 1 June 1992, which explains the high values in 1991. This shows that EMD, in contrast to the remaining measures, is apparently insensitive to this type of change in the data. The reason will be discussed in the next section.

Discussion
The method for testing data quality presented in this paper offers data users a simple way to detect potential errors and discrepancies. We propose to use a set of measures derived from estimated probability densities (histograms). These have been tested on artificial data, with the tests showing a clear advantage in most situations for the EMD, which is a distance measure for probability densities.
The tests show that the different measures react differently to distinct types of changes. For example, the EMD is much more sensitive to potential regime shifts or changes in the variance of the data than KLD, JSD and RMS. This is rooted in the definition of EMD as a solution of the minimal work for the transportation problem. The focus is set on the distance over which the probability of one bin is "transported" to another. KLD, JSD and RMS simply compare the differences between the bins, without looking at the range. The same argument explains the better results of KLD and JSD for rounding problems. Because the range between the bins with different probabilities is so small, the difference in value matters more than the distance between the bins.
Regime and variance shifts are common phenomena in observational data sets. Therefore, a number of tests are available for these kinds of potential errors (Ducré-Robitaille et al., 2003). In contrast, rounding problems are mostly neglected, although they are a good indication of changes in measurement techniques. The presented method with the KLD or JSD as a measure delivers a good test for such changes.
Tests on internal consistency are an important part of a data quality assurance workflow. If it is known what type of data is under review, simple rules can be applied to highlight the problematic parts of a dataset. Examples are the ROC and NOC rules by Meek and Hatfield (1994). Others can be found in the framework of a complex quality assurance (Gandin, 1988; Graybeal et al., 2004) or homogenization (Peterson et al., 1998).
If there is no prior information on the data being checked, the task becomes more complicated. Of course, normalized limits can be checked (Hubbard et al., 2005).
All these tests validate only a single value, checking it against one or more recently measured values of the same measurement or measurement type. The approach presented here is different, because it evaluates complete datasets.
An additional advantage is the flexibility in choosing the blocks within a dataset. This makes it possible to perform these checks on data with two or more dimensions, such as model output.

Conclusions
In this paper a new method for data quality assurance is presented. It divides the dataset to be tested into disjoint blocks and then compares each block to the others. The comparison is based on the blocks' estimated probability densities. In order to determine the differences, four different distance measures are applied. While the Earth Mover's Distance delivers good results for the detection of regime and variance shifts in data, the Kullback-Leibler and Jensen-Shannon Divergences are best at detecting rounding problems.

Figure 1. Influence of $a_f$ on the results measured in $x_{sd}$, displayed as shades of gray, for the KLD (upper figure) and JSD (lower figure) for different numbers of bins $n_b$.

Figure 2. Results of the regime shift sensitivity test, with artificially included shifts in the mean. The shift is measured in terms of the standard deviation and shown on the x-axis. The curves show the averages and their respective uncertainties of the measure $x_{sd}$ in Eq. (7) for 100 randomly generated data sets (normally distributed with mean = 0 and sd = 1 before the shift) for the four different measures.

Figure 3. Results of the regime shift sensitivity test, with artificially included shifts in the variance. The shift is measured in terms of the standard deviation and shown on the x-axis. The curves show the averages and their respective uncertainties of the measure $x_{sd}$ in Eq. (7) for 100 randomly generated data sets (normally distributed with mean = 0 and sd = 1 before the shift) for the four different measures.

Figure 4. Analysis of the maximum wind at Lindenberg station between 1991 and 2010 with the four different measures (panels a-d). Also shown in panel (e) is the relevant section of the data between July 1998 and July 2001, where KLD, JSD and RMS show higher values.