Statistical analysis of inter-arrival times of rainfall events for Italian Sub-Alpine and Mediterranean area

In this work a set of time-series of inter-arrival times of rainfall events, at daily scale, was analysed, with the aim to verify the issue of increasing duration of dry periods. The set consists of 12 time-series recorded at rain gauges in 1926–2005, six of them belong to an Italian Sub-Alpine area (Piedmont) and six to a Mediterranean one (Sicily). In order to overcome the problem related to limited sample size for high values of inter-arrival times, the discrete probability polylog-series distribution was used to fit the empirical data from partial (20 yr) time-series. Moreover, a simple qualitative trend analysis was applied to some high quantiles of inter-arrival times as well as to the average extent of rain clusters. The preliminary analysis seems to confirm the issue of increasing duration of dry periods for both environments, which is limited to the “cold” season.


Introduction
Rainfall modelling represents an important issue in several research areas, such as hydrology, meteorology and climatology, because of both theoretical and practical implications.In particular, the assessment of the probabilistic structure of rainy days occurrence allows estimating the probability of both rainy and dry periods.
One interesting way to characterize the temporal structure of rainfall is the use of inter-arrival times series, which is constituted by the succession of the times, T , elapsed between a rainy day and the immediately preceding one (Waymire and Gupta, 1981).Because of some typical properties of rainfall, as rainy days clustering and dry periods persistence, the observed T-samples usually present some statistical peculiarities: very-high variability and skewness, relative high frequency associate to T = 1, "monotonically" decreasing frequencies with sporadic sampling for high values of T .
The main purpose of this work is to find a sign of an existing trend for some high T-quantiles (long dry periods).While a large sample is already available for some yearlyaveraged features (i.e., the mean inter-arrival time), a con-sistent sample for high T-quantiles can be obtained only by considering a rainfall record of several years of observation; a sampling window of 20-yr has been judged large enough to fulfil this purpose.However, a time-series of any high Tquantile obtained following such procedure contains few elements only, because of the limited length of available rainfall records (≈ 100 yr of observations).In order to overcome this problem, a partial time-overlap between two successive 20yr sampling windows has been assumed.The introduction of this overlap allows obtaining more reliable time-series, but it makes tricky the detecting of possible trends using the procedures commonly adopted for yearly-averaged features.Sample T's-distributions have been fitted by (discrete) polylog-series probability distributions; that allowed defining un-biased high T-quantiles, removing the difficulty caused by "sparse sampling".Besides, the use of theoretical distributions instead of empirical ones provides a more robust analysis tool.Furthermore, in the context of eco-hydrology, modelling of rainfall process at daily scale enables to infer statistics for some derived processes, as the mean soil moisture and/or the actual evapotranspiration, allowing us to analyse possible trends.

Published by Copernicus Publications.
A set of 12 long time-series of inter-arrival times was analysed, six of them belong to an Italian Sub-Alpine area (Piedmont) and six belong to a Mediterranean one (Sicily); these two areas are characterized by different rainfall regimes and seasonality.Time-series were obtained by daily rainfall data recorded in the period 1926-2005, subdivided in 13 sub-sets of 20-yr duration progressively shifted of 5-yr.Time-series of high quantiles of associate probability distribution and theoretically-derived average cluster size were analysed for detecting possible temporal trend.

Fitting sample distribution of inter-arrival times through polylog-series probability distribution
Assuming that T-values are independent and identically distributed (i.i.d.), them can be studied as random variables that generate a renewal process (Feller, 1968), which can be seen as a generalisation of the Poisson process (implying an exponential inter-arrival times distribution).Several studies already suggested to interpret the rainfall occurrence process as a renewal process (see for example : Buishand, 1978;Foufoula-Georgiou and Lettenmaier, 1987); however, because seasonality character is frequently present in T-series, the renewal property was generally applied to each homogeneous period, distinctly.In order to describe statistical peculiarities of observed T-samples, some discrete probability distributions have been tested: the log-series distribution and the 2-parameter Extended-Yule (Martínez-Rodríguez et al., 2011) and polylog-series distributions (Gupta et al., 2008).
The 2-parameter polylog-series distribution allowed to capture the main characters of T's-samples (clustering, very high variability and skewness, persistence) with better accuracy than the other distributions; accordingly, following analyses referred to this distribution only, which is defined as: where w and s are the parameters of the polylog-series distribution, Li s (•) is the polylogarithm (or Jonquière's) function: It can be observed that for s = 1, the Jonquière's function (2) reduces to − ln(1 − w), and the polylog-series distribution reduces to the logarithm-series distribution (Feller, 1968), often used in the past in order to regularise dry spell data (Chatfield, 1966;Kottegoda and Rosso, 1997).
The hypothesis that T s are i.i.d. also implies that the expected cluster size, cl , can be derived by the following simple relationship: (3) where P 1 = P {T = 1}.Thus, a good estimate of the observed average cluster size mainly depends on the capability of the distribution to well reproduce P{T = 1}.The estimate of the two parameters was performed by means of the maximum likelihood method, which consists of the solution of a system of two equations in the unknown w and s (for the details see Gupta et al., 2008).Goodness of fit was tested according to Martinez-Rodriguez et al. (2011); these authors highlight that the classical chi-square test is not appropriate to evaluate the goodness-of-fit for distributions characterized by long tail (as polylog-series one), because expected frequencies less than 5 are common.Also, it has to be pointed out that grouping the classes with expected frequencies less than 5 could lead to some problems (Spierdijk and Voorneveld, 2007).So, an alternative procedure to test the goodness of fit was adopted by numerically reconstructing the null hypothesis of chi-square (according to polylogseries) on the base of a Monte Carlo simulation.A set of 2000 replicates of the observation was produced by randomly sampling from the theoretical discrete distribution (using Monte Carlo approach), and the p-value associated to the actual observation was computed from the numerically reconstructed null hypothesis (Martinez-Rodriguez et al., 2011).

Dataset
A set of 12 time-series of inter-arrival rainfall times was obtained by daily rainfall data recorded in the period 1926-2005 by 12 rain gauges.As it is shown in Fig. 1, 6 stations are located in the Sicily Island (hereafter referred as SIC), which is characterized by a typical Mediterranean climate, whereas the other 6 stations are placed in a sub-Alpine region including Piedmont and Aosta Valley Italian regions (from here on referred to as PMT).The 12 series were selected in order to satisfy the following criteria: (i) minimum gaps in the 80year recorded data (no gap filling was performed); (ii) homogeneity of the time series.Relatively to the latter point, both control of data quality and specific homogeneity tests have been carried out; in particular, the procedure suggested by Wijngaard et al. (2003) has been followed.Each time-series has been subdivided in 13 sub-sets of 20-yr duration; the first subset refers to the period 1926-1945, the other subsets were obtained by progressively shifting of 5-yr the sampling, then obtaining an overlap of 15 yr between two successive series (1931-1950, 1936-1955, . . . , 1986-2005).Moreover, the analysis was performed for three different datasets separately: (1) whole year (Y); (2) six-months relatively "warm" season (WS -from April to September) and (3) six-months relatively "cold" season (CS -from October to March).As it  is known, Sicily's rainfall regime is characterised by a strong seasonality, with a "warm" (dry) and a "cold" (wet) season.During the WS, rainfall is usually lower than atmospheric demand, whereas the converse is true during the CS.Rainfall regime in Piedmont and Aosta Valley is almost virtually opposite to the one in Sicily; indeed, WS is generally wetter than CS.In this phase, no additional disaggregation has been considered, mainly to not further reduce the sample size for high T-values (or for not increase the length of sampling window); therefore, similar analyses for three-month seasons have not been performed.

Analysis of inter-arrival times series distribution
The polylog-series distribution was fitted to the 468 available datasets (12 stations × 3 periods × 13 sub-series).Test of goodness of fit was performed separately for each data set, rejecting the null hypothesis when the p-value associated to the fitted distribution was < 0.05.
Synthetically, the chi-square test results indicate that the null hypothesis is frequently rejected for the yearly datasets for both SIC and PMT stations (74.4 % for SIC and 78.2 % for PMT), while only in few cases it is rejected for WS and CS data sets (14.1 % and 26.9 % for WS-SIC and WS-PMT, respectively; 18.2 % and 27.1 % for CS-SIC and CS-PMT, respectively).These results generally show how polylog distribution better agrees with SIC seasonal datasets than with PMT ones.
Figure 2 shows two examples of data fitting with polylogseries distribution; for both cases the null hypothesis has not be rejected and it can be seen how the visual agreement results satisfactory.A preliminary analysis of range of variability of distribution parameters w and s has been carried out.In particular, Fig. 3 reports the frequency distributions of w parameter, which is the one that mainly influences tail probabilities.Interestingly, w distributions of Sicily CS and Piedmont WS (Fig. 3a), as well as of Sicily WS and Piedmont CS (Fig. 3b), are rather similar in both range of variability and frequency distribution; that is likely due to the typical features of the two studies areas, which have somewhat opposite rainfall regimes.

Average cluster size
As previously introduced, the use of inter-arrival times allows characterizing the temporal structure of both rainy days and dry periods.A further test of the polylog fitting was performed in terms of average cluster size, computed with Eq. (3).In Fig. 4 observed vs. estimated cl are represented for the stations located in the two areas (SIC, Fig. 4a and PMT, Fig. 4b) and for the two periods (WS and CS).These plots highlight in general a good performance of the model, with a slightly systematic underestimation (< 4 % on average), that becomes significant only for some recorded high values (occurred during CS for both SIC and PMT).Moreover, a significant seasonality for cl , as well as a quite wide range of variability (from 1.2 to 2.7 days), was observed for SIC region.On the contrary, in PMT sites the observed range of variability of cl is quite similar for the two periods, with a wide overlap range from 1.5 to 2.0 days, and with values exceeding 2 only for CS period.This further confirm how the seasonality is more evident in SIC area than in PMT one in term of clustering of rainy days.

Modelling of high quantiles
Table 1 reports root mean square difference (RMSD) statistics computed for some high T-quantiles (0.80, 0.85, 0.90 and 0.95) by comparing data observed and polylog-series estimates; for both investigated area, available sample size is equal to 78 (13 periods × 6 stations).As can be noticed, RMSD values are generally very small, less than three days, with just few exceptions (SIC WS for P ≤ 0.95; PMT CS for P ≤ 0.90).The greatest discrepancies observed for SIC-WS could be likely explained by the reduction of the sample size with the increase of the quantiles considered.Similar considerations arise from the analysis of Fig. 5, where observed and estimated quantiles are compared for two stations (Corleone -WS and Torino -CS).Table 1.Mean (M) and root mean square difference (RSMD) for some quantiles, for both seasons and both investigated areas.M (RMSD) in days T (P < 0.80) T (P < 0.85) T (P < 0.90) T (P < 0.95)

Preliminary analysis of the trends
The results reported in the previous sub-sections synthetically showed: (i) the polylog distribution usually provides a good fitting of sample T distribution for seasonal (WS and CS) data; (ii) the polylog distribution is able to well estimate the average cluster sizes as well; (iii) RMSD for some high T-quantiles (0.80, 0.85, 0.90 and 0.95) are generally low (with the above-mentioned exceptions).Accordingly, both estimated high T-quantiles and average cluster sizes were considered a more robust mean to analyse the trend of such quantities than empirical ones.Moreover, the analysis can be also extended to very high quantiles, for which sample size is small.A simple qualitative trend analysis of the quantiles was performed in order to investigate the issue of increasing duration of dry periods, of which previous studies have been dealt with yet (Alpert et al., 2002;Dai et al., 1998); due to the overlapping window used in the reconstruction of samples, the analysis of trend cannot be performed with the usual methods, which are based on the hypothesis of independence among the single values of the series.For this reason, the analysis of trend is only qualitative at this point.
As synthetically shown in Table 2, a slightly increasing trend of very high T-quantile (P < 0.975) has been observed for almost all the PMT and SIC stations during CS periods.In general, no trend was observed for the WS period.It is noteworthy that no significant differences between the two  study areas were appreciated.As an example, in Fig. 6, time series of two estimated T-quantiles (P ≤ 0.9, P < 0.975), for two stations, Enna (SIC) and Cuneo (PMT), are reported.
A similar qualitative trend analysis has also been performed for the average cluster size (Table 2).Generally, no trend was observed for Piedmont, while contrasting results were observed for Sicily.As an example, Fig. 7 shows the time-series of the expected cluster size for two sites, Enna (SIC) and Torino (PMT); a noticeable decreasing trend was observed for Enna, especially for CS.

Conclusions
The capability to reproduce the inter-arrival times statistical structure through a 2-parameter discrete distribution (polylog-series) was tested in two Italian regions: a Mediterranean area and a Sub-Alpine one.
A chi-square test based on the Montecarlo procedure suggests that polylog-series distribution seems to well reproduce the observed data in the whole range of variability of times and associated frequencies when the dataset is split in two sub-period: a relatively "warm" season (April-September) and a relatively "cold" one (October-March).Poor performances were instead obtained when the whole year is analysed, mainly due to the strong seasonality characterising the two study areas.
The theoretical distribution allows also reproducing well some characteristics of rainfall as: average cluster sizes and high T-quantiles.In particular, the average cluster sizes results only slightly underestimated ( 5 %), whereas high quantiles (P ≤ 0.8 to 0.95) are generally well reproduced with the only exception for P ≤ 0.95 during WS (in SIC) and CS (in PMT).This latter result can be justified by the small sample size of empirical data for high inter-arrival times.
A preliminary qualitative analysis of the trends on high quantiles and average cluster size seems to confirm the issue of increasing duration of dry periods for both environments, which is limited to the "cold" season.
No trend was instead observed for the average cluster size in Sub-Alpine areas, while discordant trends were observed in Mediterranean stations.
The polylog-series seems to account for peculiarities of inter-arrival times of the two areas by simple dividing the year in two six-month periods (a relatively "warm" season from April to September and a relatively "cold" season from October to March).This result is promising for the use of polylog-series in ecohydrological models in those areas where classical exponential relationship fails to capture the characteristics of rainfall.However, in order to better catch the differences between the characters of seasonality in different climatic regions, the same procedure followed in this work could be applied to shorter periods (three months).Additionally, the requirement of several years to sufficiently sample high inter-arrival times (and the necessary overlapping between consecutive sampling windows) requires adhoc methodologies for the quantitative analysis of the trends that will be investigated in the next future.Finally, the possibility to address the observed limits of the adopted distribution by using a three-parameter Lerch distribution (which generalises the polylog-series distribution) have to be tested in future researches.

Figure 1 .
Figure 1.Spatial location of the rain gauges that provided rainfall time series used for the analysis.Left panel reports the 6 stations located in the Mediterranean area (SIC), whereas right panel reports the 6 stations located in the sub-Alpine (PMT) area.

Figure 2 .
Figure 2. Fitting frequency distribution of inter-event times, T , with the polylog-series distribution, for the warm season and for Enna (SIC -left panel) and Cuneo stations (PMT -right panel).Sample data refer to the period 1926-1945.

Figure 3 .
Figure 3. Frequency distribution of w parameter for the two studied areas (SIC and PMT), for warm season (WS) and cold season (CS).

Figure 4 .
Figure 4. Comparison between observed and estimated average cluster size cl for the Mediterranean (SIC, panel a) and sub-Alpine (PMT, panel b) areas in both cold (CS) and warm (WS) season.

Figure 5 .
Figure 5.Comparison between some observed and estimated high T-quantiles for two sites: (a) Corleone station in SIC during WS; (b) Torino station in PMT during CS.

Figure 6 .
Figure 6.Time-series of two high quantiles T {P < 0.9} and T {P < 0.975}, estimated by polylog-series distribution, for both season and for two sites: (a) Enna station (SIC) and (b) Cuneo station (PMT).

Figure 7 .
Figure 7. Time-series of expected cluster size estimated by polylog-series distribution, for both season and for two sites: (a) for Enna station (SIC); (b) for Torino station (PMT).

Table 2 .
Summary of the qualitative trend analysis carried out for a T-quantile (P < 0.975) and for the average cluster size.