Data validation procedures in agricultural meteorology – a prerequisite for their use

Abstract. Quality meteorological data sources are critical to scientists and engineers, to climate assessments and to climate-related decision making. Accurate quantification of reference evapotranspiration (ET0) in irrigated agriculture is crucial for optimizing crop production, planning and managing irrigation, and using water resources efficiently. Data validation ensures that the information needed has been properly generated, identifies incorrect values and detects problems that require immediate maintenance attention. The Agroclimatic Information Network of Andalusia currently provides daily estimates of ET0 using meteorological information collected by nearly one hundred automatic weather stations. This information is used by technicians and farmers to generate irrigation schedules. Data validation is essential in this context, so several quality control procedures have been applied at each station. Daily averages of several meteorological variables were analysed (air temperature, relative humidity and rainfall). The main objective of this study was to develop a quality control system for daily meteorological data that can be applied on any platform and is built on open source code. Each procedure either accepts the datum as true or rejects it and labels it as an outlier. The number of outliers for each variable is related to the dynamic range used in each test. Finally, the geographical distribution of the outliers was analysed. The study underscores that different ranges must be used for each station, variable and test to keep the error rate uniform across the region.


Introduction
Meteorological information is one of the most important tools used by agricultural producers in decision making (Weiss and Robb, 1986). Some of the applications for these climate data include crop water-use estimates, irrigation scheduling, integrated pest management, crop and soil moisture modeling, design and management of irrigation and drainage systems, and frost and freeze warnings and forecasts (Meyer and Hubbard, 1992).
Andalusia is located in the south of the Iberian Peninsula. The region is situated between the meridians 1° and 7° W and the parallels 37° and 39° N, with an area of around 9 Mha. The climate is semiarid, typically Mediterranean, with very hot and dry summers. In Andalusia 900 000 ha are irrigated (around 20 % of the cultivated area) under very different conditions (Gavilán et al., 2006).
Correspondence to: J. Estévez (jestevez@uco.es)
The Agroclimatic Information Network of Andalusia (RIAA in Spanish) was deployed to provide coverage of most of the irrigated areas of the region and to improve irrigation water management (De Haro et al., 2003). It is operated and maintained by the IFAPA (Agricultural Research Institute of the Regional Government of Andalusia). The network currently provides daily estimates of reference evapotranspiration (ET0) using meteorological information collected by nearly one hundred automatic weather stations (Gavilán et al., 2008). This information is easily accessible because it is published on the web: http://www.juntadeandalucia.es/agriculturaypesca/ifapa/ria/.
Meteorological data validation is very important for hydrological design and agricultural decision making, in particular for estimating irrigation schedules. The quality control system discussed herein was applied to 85 stations, summarized in Table 1. The remaining stations were installed recently and their data series were too short. The quality control system consists of procedures, or tests, against which the data are checked, setting data flags to provide guidance to end users. These flags indicate which tests the meteorological data have passed and which they have failed.
Published by Copernicus Publications.

Materials and methods

Source of data
The dataset used in the present study was obtained from the daily database of the RIAA and covers the period from 2004 to 2009. Each station is controlled by a CR10X datalogger (Campbell Scientific) and is equipped with sensors to measure air temperature and relative humidity (HMP45C probe, Vaisala), solar radiation (pyranometer SP1110, Skye), wind speed and direction (wind monitor RM Young 05103) and rainfall (tipping bucket rain gauge ARG 100). Air temperature and relative humidity are measured at 1.5 m and wind speed at 2 m above the soil surface. Data from the stations are transferred to the data-collecting seat (Main Center) using GSM modems and saved in a database. The Main Center is responsible for the quality control procedures that make up the routine maintenance program of the network, including sensor calibration and data validation.
The accuracy of ET0 calculations depends on the quality and integrity of the meteorological data used (Allen, 1996), so data quality control is necessary. Different procedures for quality assurance have been described by Meek and Hatfield (1994), Allen (1996), Shafer et al. (2000) and Feng et al. (2004). These tests are based on rules proposed by O'Brien and Keefer (1985). However, the tests applied in this study are based on statistical decisions and were conducted for 84 stations (Fig. 1), each using data from a single site only. Three procedures were tuned to the prevailing climate: seasonal thresholds, seasonal rate of change and seasonal persistence (Hubbard et al., 2005). These tests are related to station climatology at the monthly level, using dynamic limits for each variable. The tests were applied to the following variables: maximum, minimum and mean air temperature (Tx, Tn, Tm), maximum, minimum and mean relative humidity (RHx, RHn, RHm), and precipitation (Preci).

Theory
The THRESHOLD test is a quality control approach that checks whether the variable x falls within a specific range for the month in question. The test is given in Eq. (1):
$$\bar{x} - f\,\sigma_x \le x \le \bar{x} + f\,\sigma_x \qquad (1)$$
where $\bar{x}$ is the daily mean (e.g., the mean of maximum daily temperature for December) and $\sigma_x$ is the standard deviation of the daily values for the month in question. This relationship indicates that with larger values of f, the number of potential outliers decreases.
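As an illustration only, the threshold test can be sketched in a few lines of Python, assuming the accepted range is the monthly mean plus or minus f standard deviations. The function name `threshold_flags` and the temperature values are hypothetical, and the mean and standard deviation are computed here from the sample itself rather than from a multi-year station climatology.

```python
from statistics import mean, stdev


def threshold_flags(values, f):
    """Return True for each value outside mean +/- f * standard deviation."""
    centre = mean(values)
    spread = stdev(values)  # sample standard deviation of the daily values
    lower, upper = centre - f * spread, centre + f * spread
    return [not (lower <= v <= upper) for v in values]


# Hypothetical daily maximum temperatures (deg C) for one station and month.
december_tx = [14.2, 15.1, 13.8, 16.0, 14.9, 15.4, 13.5, 14.7, 15.8, 29.0]

flags_f2 = threshold_flags(december_tx, f=2.0)  # narrower range
flags_f3 = threshold_flags(december_tx, f=3.0)  # wider range
```

In this made-up sample the last value is flagged with f = 2 but accepted with f = 3, which matches the statement that larger f values yield fewer potential outliers.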
The STEP CHANGE test compares the change between successive observations. It checks whether the difference between consecutive values of the variable falls inside the climatologically expected lower and upper limits on the daily rate of change for the month in question. The step change test for variable x is given in Eq. (2):
$$\bar{d}_i - f\,\sigma_{d_i} \le d_i \le \bar{d}_i + f\,\sigma_{d_i} \qquad (2)$$
where $d_i = x_i - x_{i-1}$, i is the day, $\bar{d}_i$ is the mean of $d_i$ and $\sigma_{d_i}$ is the standard deviation of $d_i$ for the month in question.
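A minimal sketch of the step change test follows, assuming the limits are the mean day-to-day difference plus or minus f standard deviations of those differences. The function name `step_change_flags` and the data are hypothetical, and the statistics are taken from the sample itself for brevity.

```python
from statistics import mean, stdev


def step_change_flags(values, f):
    """Flag day-to-day differences outside mean(d) +/- f * stdev(d).

    The first day has no predecessor, so its entry is None (not tested).
    """
    diffs = [today - yesterday for yesterday, today in zip(values, values[1:])]
    centre, spread = mean(diffs), stdev(diffs)
    lower, upper = centre - f * spread, centre + f * spread
    return [None] + [not (lower <= d <= upper) for d in diffs]


# Hypothetical daily mean temperatures (deg C) with one spurious spike.
tm_series = [20.0, 20.5, 21.0, 20.8, 21.2, 30.0, 21.5, 21.0, 20.7, 21.1]

step_flags = step_change_flags(tm_series, f=1.5)
flagged_days = [i for i, flag in enumerate(step_flags) if flag]
```

A single spike produces two large differences (the jump up and the jump back), so both the spike and the following day are flagged in this sketch; deciding which of the two observations is at fault is left to further checking.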
The PERSISTENCE test checks the variability of the measurements. When the variability is too high or too low, the data should be flagged for further checking. If the sensor fails, it will often report a constant value and the standard deviation (σ) will become smaller. When the sensor is out for an entire period, σ will be zero. If the instrument works intermittently, it produces reasonable values interspersed with zero values, thereby greatly increasing the variability for the period. This test compares the standard deviation for the time period being tested to the expected limits as follows:
$$\bar{\sigma}_j - f\,\sigma_{\sigma_j} \le \sigma_j \le \bar{\sigma}_j + f\,\sigma_{\sigma_j} \qquad (3)$$
where $\sigma_j$ is the standard deviation of the daily values for each month (j) and year, $\bar{\sigma}_j$ is the mean of $\sigma_j$ and $\sigma_{\sigma_j}$ is the standard deviation of $\sigma_j$ for the month in question.
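The persistence test can be sketched as follows, assuming that the standard deviation of the daily values is computed for the same calendar month in each year and then compared with the mean of those standard deviations plus or minus f times their own standard deviation. The function name `persistence_flags`, the shortened five-day "months" and the humidity values are all hypothetical.

```python
from statistics import mean, stdev


def persistence_flags(month_by_year, f):
    """Flag years whose within-month variability is unusually low or high.

    month_by_year maps a year to the daily values of one calendar month.
    """
    sigmas = {year: stdev(vals) for year, vals in month_by_year.items()}
    sigma_values = list(sigmas.values())
    centre, spread = mean(sigma_values), stdev(sigma_values)
    lower, upper = centre - f * spread, centre + f * spread
    return {year: not (lower <= s <= upper) for year, s in sigmas.items()}


# Hypothetical mean relative humidity (%) for the same month in six years.
# The 2007 record is constant, as a stuck sensor would report.
rhm_july = {
    2004: [60, 65, 70, 62, 68],
    2005: [58, 66, 71, 63, 69],
    2006: [61, 64, 72, 60, 67],
    2007: [65, 65, 65, 65, 65],
    2008: [59, 67, 70, 61, 66],
    2009: [62, 63, 71, 64, 69],
}

persistence = persistence_flags(rhm_july, f=1.5)
flagged_years = [year for year, flag in persistence.items() if flag]
```

With only six years the statistics of the standard deviations are themselves poorly constrained, which is in line with the remark later in the paper that longer data series are needed for this test.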
When a datum is valid but is rejected by the tests, a Type I error is committed. If a datum is not valid but is accepted by the quality control procedures, a Type II error is committed. The results discussed in this paper only show the potential Type I errors.
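To make the two error types concrete, the short sketch below counts them for a case where the true validity of each datum is known. This is purely illustrative: the function name `count_errors` and both input lists are hypothetical, and in practice the true validity is not available for observed data.

```python
def count_errors(truly_valid, rejected):
    """Count Type I (valid but rejected) and Type II (invalid but accepted)."""
    pairs = list(zip(truly_valid, rejected))
    type_i = sum(1 for valid, rej in pairs if valid and rej)
    type_ii = sum(1 for valid, rej in pairs if not valid and not rej)
    return type_i, type_ii


# Hypothetical ground truth and test outcome for five data values.
truly_valid = [True, True, False, True, False]
rejected = [False, True, True, False, False]

type_i, type_ii = count_errors(truly_valid, rejected)
```

Here the second datum is a Type I error (valid but rejected) and the fifth is a Type II error (invalid but accepted).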
This system was developed in open source code under the GNU GPL (General Public License) and can be installed on any platform: Linux, Windows, Unix, Mac OS, Solaris, etc. PostgreSQL, PostGIS and PLpgSQL are the free technologies with which the quality procedures were developed.
PostgreSQL is an object-relational database management system (ORDBMS) based on POSTGRES version 4.2, developed at the University of California at Berkeley Computer Science Department (Stonebraker and Kemnitz, 1991). It supports a large part of the SQL standard and offers many modern features: complex queries, foreign keys, triggers, views, functions, procedural languages, etc. PostGIS is an extension to PostgreSQL that allows GIS (Geographic Information Systems) objects to be stored in the database. It includes support for a range of important GIS functionality, including full OpenGIS support, advanced topological constructs (coverages, surfaces, networks), desktop user interface tools for viewing and editing GIS data, and web-based access tools. Finally, PLpgSQL is a powerful procedural language used to specify a sequence of steps that are followed to produce an intended programmatic result. The use of SQL within PLpgSQL increases the power, flexibility and performance of the quality tests. The most important aspect of using this language is its portability: its functions are compatible with all the platforms that can run the PostgreSQL database system.

Results and discussion
The following figures show the number of potential Type I errors that would occur when using the specified tests with various f factors. The fraction of data flagged is represented on a log scale and refers to the whole network tested (85 stations). The general shape of the relationship between f and the fraction of data flagged is shown in Figs. 2, 3 and 4. The results obtained in this work are similar to those of Hubbard et al. (2005). The results for the threshold analysis indicate that approximately 2 % of the data would be flagged for maximum, minimum and mean temperature if an f value of 2.3 is used. For precipitation, 2 % of the data were flagged in this test for an f value of 3.1. These results are shown in Fig. 2a. The results in Fig. 2b show that the same fraction of data is flagged for minimum and mean relative humidity when an f value of 2.2 is used. In the same figure, for maximum relative humidity, this percentage of data would be flagged with an f value of 2.7. Similar figures are shown for the step change test (Fig. 3a and b) and the persistence test (Fig. 4a and b). The results for the persistence analysis indicate that approximately 1 % of the data would be flagged for all the variables if an f value less than 2.0 is used. This is a consequence of the need for longer data series to calculate the variability of the daily values for each month and year. For precipitation, the step test was not applied because of the discontinuous nature of rainfall. These results refer to the three tests applied to 85 automatic weather stations of the RIAA. It is important to remark that the fraction flagged for each f value was different for each station. These results show that it will be possible to select dynamic f values for each station and temporal scale and thus to fix a specific rate of Type I errors across the region.
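The relationship between f and the fraction of data flagged can be explored with a short sketch like the one below, shown here for the threshold test and assuming limits of the mean plus or minus f standard deviations. The function name `fraction_flagged`, the sample values and the chosen f grid are hypothetical and do not reproduce the results in the figures.

```python
from statistics import mean, stdev


def fraction_flagged(values, f):
    """Fraction of values lying outside mean +/- f * standard deviation."""
    centre, spread = mean(values), stdev(values)
    lower, upper = centre - f * spread, centre + f * spread
    flagged = sum(1 for v in values if not (lower <= v <= upper))
    return flagged / len(values)


# Hypothetical daily values for one variable at one station.
sample = [10, 12, 11, 13, 12, 11, 14, 9, 12, 20, 11, 12, 13, 10, 4, 12]

f_grid = [0.5, 1.0, 2.0, 2.5, 3.0]
curve = {f: fraction_flagged(sample, f) for f in f_grid}

for f, fraction in curve.items():
    print(f"f = {f:.1f}: fraction flagged = {fraction:.4f}")
```

Repeating this calculation per station and per variable gives the kind of curve from which an f value for a chosen flagging rate could be read off.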
The spatial distribution of the fraction of data flagged for an f value of 3 in the threshold and step tests was estimated using GIS techniques for all the variables. This analysis is very useful for visually studying the distribution of outliers across the region. The results for the threshold test for maximum temperature, using ordinary kriging interpolation, are shown in Fig. 5. This map shows that the fraction of data flagged is higher at coastal weather stations than at inland locations. This is caused by the different climate regimes. Maximum temperatures are lower at locations near the coast than at inland locations, where the air masses are not influenced by a large nearby water body (Mediterranean Sea or Atlantic Ocean).
The quality control system can dynamically generate this type of map using any GIS software at any time.
Sometimes, for scientific or other purposes, we cannot afford to reject too much data. It can then be very useful to fix the rate of potential outliers to be excluded from a model or study. To fix a specific fraction flagged in this example of maximum temperature (Tx), different f values should be used for each station. As can be seen in Fig. 5, using f = 3, the fraction of Tx data flagged ranged from nearly 0 (station located in the northeast of Jaén) to approximately 0.6-0.9 (coastal stations) across the Andalusia region. These automated validation procedures should be accompanied by other tasks such as field visits for maintenance routines, sensor calibration and manual inspection (Feng et al., 2004; Shafer et al., 2000). Manual inspection is crucial for ensuring an appropriate flagging process, since it adds human judgment and catches subtle errors that automated techniques may miss (Shafer et al., 2000).
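One possible way to choose a station-specific f for a fixed flagging rate is sketched below. It is an assumption for illustration, not the procedure used in this system: each value is converted to an absolute standardized deviation, and f is set to the largest score that must still be accepted so that no more than the target fraction is flagged. The function names, the two station series and the 10 % target are hypothetical.

```python
from statistics import mean, stdev


def standardized_scores(values):
    """Absolute deviation of each value from the mean, in standard deviations."""
    centre, spread = mean(values), stdev(values)
    return [abs(v - centre) / spread for v in values]


def f_for_target_rate(values, target):
    """Smallest f from the sample that flags at most `target` of the values."""
    scores = sorted(standardized_scores(values), reverse=True)
    allowed = int(target * len(values))  # number of values we may flag
    return scores[allowed] if allowed < len(scores) else 0.0


def fraction_flagged(values, f):
    """Fraction of values whose standardized score exceeds f."""
    scores = standardized_scores(values)
    return sum(1 for s in scores if s > f) / len(scores)


# Hypothetical Tx series (deg C) for two stations with different variability.
stations = {
    "station_a": [10, 12, 11, 13, 12, 11, 14, 9, 12, 20, 11, 12, 13, 10, 4, 12],
    "station_b": [18, 19, 18, 20, 19, 18, 19, 20, 19, 18, 19, 26],
}

target_rate = 0.10
f_by_station = {name: f_for_target_rate(v, target_rate) for name, v in stations.items()}
rate_by_station = {
    name: fraction_flagged(v, f_by_station[name]) for name, v in stations.items()
}
```

In this made-up example the two stations end up with clearly different f values while both stay at or below the 10 % target, which is the behaviour the text argues for.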

Summary and conclusions
In this study, the results of the validation tests applied to daily climatic data from 85 automatic weather stations varied modestly with climate type and significantly with the variable tested. It is essential to test the capability of validation procedures because quality control is a major prerequisite for using meteorological information. Several tests based on statistical decisions have been applied to meteorological data from the Agroclimatic Information Network of Andalusia (RIAA). The validated variables were maximum, minimum and mean air temperature (Tx, Tn, Tm), maximum, minimum and mean relative humidity (RHx, RHn, RHm) and precipitation (Preci). Although daily precipitation is known to follow a gamma distribution, it was included in these tests to give a reference point. The results obtained from running the quality control procedures showed high variability when different f values were used. It is essential to test the capability of these tests to produce flags when data are out of range or are internally or temporally inconsistent.
The use of open source code and General Public License technologies (GNU GPL) to develop the procedures allows any meteorological network to implement a similar system at zero cost. All the functions and algorithms can be read, rewritten or adapted by future users.
The possibility of dynamically mapping the percentage of errors for any variable is a powerful tool for visually studying the spatial distribution of the fraction of data flagged. These results show that it is necessary to select dynamic f values for each station and test in order to preselect a fixed rate of error detection across the Andalusia region. This quality control system can easily be used with any conventional GIS software. Treating the meteorological data as geographical variables using GIS techniques can be very useful for maintenance routines and sensor calibration.
Future work by the authors should include spatial consistency procedures and the introduction of seeded random errors to examine the detection of Type II errors.

Figure 5. Fraction of maximum temperature data flagged at f = 3 for threshold test.

Table 1. Summary of automated weather stations used in the study.