Probabilistic end-to-end irradiance forecasting through pre-trained deep learning models using all-sky-images

This work proposes a novel approach for probabilistic end-to-end all-sky imager-based nowcasting with horizons of up to 30 min using an ImageNet pre-trained deep neural network. The method involves a two-stage approach. First, a backbone model is trained to estimate the irradiance from all-sky imager (ASI) images. The model is then extended and retrained on image and parameter sequences for forecasting. An open access data set is used for training and evaluation. We investigated the impact of simultaneously considering global horizontal (GHI), direct normal (DNI), and diffuse horizontal irradiance (DHI) on training time and forecast performance, as well as the effect of adding parameters describing the irradiance variability proposed in the literature. The backbone model estimates current GHI with an RMSE and MAE of 58.06 and 29.33 W m−2, respectively. When extended for forecasting, the model achieves an overall positive skill score reaching 18.6 % compared to a smart persistence forecast. Minor modifications to the deterministic backbone and forecasting models enable the architecture to output an asymmetrical probability distribution and reduce training time while leading to similar errors for the backbone models. Investigating the impact of variability parameters shows that they reduce training time but have no significant impact on the GHI forecasting performance for either deterministic or probabilistic forecasting, while simultaneously forecasting GHI, DNI, and DHI reduces the forecast performance.


Introduction
Model predictive control (MPC) of renewable energy systems relies on accurate predictions to correctly optimize the trajectory of the system's control actions (Riou et al., 2021; Maheri, 2014; Sachs and Sawodny, 2016; Dongol, 2019; Zhu et al., 2015; Tazvinga et al., 2013; Taha and Mohamed, 2016; Zhang et al., 2018; Mbungu et al., 2017). While some system components can be modeled using physics-based approaches, forecasting the electrical load and meteorological parameters in real time remains a challenge. Several approaches can be used to generate these forecasts, among them statistical, empirical, numerical, and artificial neural network (ANN) methods (Sachs, 2016; Dongol, 2019; Bozkurt et al., 2017; Bouktif et al., 2018; Yang et al., 2016; Maitanova et al., 2020; Telle et al., 2020; Bruno et al., 2019; Chaaraoui et al., 2021). ANNs have gained popularity in a variety of applications such as image classification, action recognition, and time-series forecasting. They have shown improved performance and robustness compared to statistical and physics-based approaches (Krizhevsky et al., 2012; He et al., 2015; Tan and Le, 2020; Devlin et al., 2019; Shoeybi et al., 2020; Brown et al., 2020; Radford et al., 2019; Qian et al., 2021). This trend led to the development of ANNs that use images from all-sky imagers (ASI) to predict the future solar irradiance. Current solutions consist of stand-alone ANN applications or a combination of ANN and physics-based approaches (Pothineni et al., 2019; Paletta et al., 2021; Yang et al., 2021; Fabel et al., 2022; Nouri et al., 2021; Hasenbalg et al., 2020). In addition, training ANNs with parametric or non-parametric uncertainties can provide information through symmetric probability distributions. Xiang et al. (2021) use a quantile regression method to determine the probability distribution, defining a set of 19 evenly distributed quantiles as the output of their ANN. Paletta et al.
(2022) treat their ANN forecasting model as a classifier to generate binned probability classes, defining 100 output classes equally distributed throughout their irradiance normalization range. Feng et al. (2022) use Bayesian model averaging (BMA), which requires a set of ANN models. The models differ in the input images they receive, which are augmented through occlusion perturbations, resulting in a global horizontal irradiance (GHI) forecast ensemble. This ensemble is then used to estimate the parameters of a probability density function (PDF) defining a normal distribution. Nie et al. (2023) propose a hybrid approach, combining a stochastic generative pre-trained transformer ANN and a physics-informed ANN for video prediction. An ensemble of possible image sequence predictions for each time step is generated, based on past images from an ASI. The predicted sequence ensembles are then used to estimate a GHI probability distribution through another ANN for each future time step. Nouri et al.
(2023) propose a more conservative approach, using a U-Net ANN for cloud segmentation and a stereoscopic cloud height estimation with two ASI. With a sequence of ASI images, the authors can then determine the cloud movements and extrapolate the expected direct normal irradiance (DNI) through ray tracing and a sun tracker measurement device, which decomposes the observed solar irradiance into its diffuse and direct components. The authors combine their method with a persistence forecasting method to enhance the forecast. To obtain probabilistic forecasts, the authors classify the past 15 min of irradiance data into eight variability classes. The variability classes are formalized through a set of variability parameters defining value boundaries for each class. Depending on the forecast horizon, the authors use past observations and/or deterministic forecasts for the variability classification in a sliding-window fashion. To determine which variability class represents which probability distribution, the authors use approximately 2 years of observation data and simulate forecasts on these data. By logging the forecast error and the corresponding variability class, the authors then determine a discretized probability distribution for each variability class, based on the magnitude of the forecast error. These discretized probability distributions are stored in a look-up table and fetched as reference when performing forecasts on the validation data set, by determining the variability class with the above-mentioned method.
Probability information can be harnessed in probabilistic MPC applications for renewable energy systems, enabling the incorporation of uncertainty into the state estimation, as well as the constraining and fine-tuning of control parameters (Sachs, 2016). Deploying current methods to perform such probabilistic forecasts through ASI increases both investment and maintenance costs, attributed to the need for irradiance measurement instruments like sun trackers or pyranometers (Nouri et al., 2021, 2023; Paletta et al., 2021). Furthermore, the prediction of a large number of parameters to define the probability distribution, or the training and inference of multiple models to estimate a smaller number of parameters, either impacts the model's performance by distributing ANN model parameters over a large set of output neurons or proportionally increases training and inference time depending on the size of the ensemble required. The exploitation of transformer models to predict image sequences can be a viable approach, but appears to use more computational resources than required by using the predicted images as a bridge to lower-resolution outputs. Also, asymmetric probability distributions appear more suitable for our use case than symmetric ones (Barnes et al., 2021; Feng et al., 2022).
In this research, we propose a method to exploit an ImageNet pre-trained ResNet50v2 backbone as a feature extractor for probabilistic ASI-based irradiance forecasting of up to 30 min. The feature extractor allows us to output irradiance values with minor modifications. The method consists of two stages. First, the backbone is trained to predict four parameters from an ASI image, defining an asymmetric irradiance probability distribution. Afterwards, we extend and retrain the model to perform the forecasting task using long short-term memory (LSTM) and densely connected layers. Our method therefore uses a two-stage training process, but performs inference with a one-stage model. It also facilitates the easy substitution of the feature extractor backbone with more powerful or more efficient models, such as MobileNet, DarkNet, EfficientNet, or ConvNeXt (Howard et al., 2017; Redmon and Farhadi, 2018; Tan and Le, 2020; Liu et al., 2022). We add relevant exogenous variables from the literature (Paletta et al., 2021; Yang et al., 2021) and investigate the impact of the variability parameters proposed by Schroedter-Homscheidt et al. (2018); instead of constraining the problem within variability classes, as done by Nouri et al. (2023), we allow the ANN to determine suitable parameters for performing the forecast. The impact of adding DNI and diffuse horizontal irradiance (DHI) on the forecasting and estimation performance is also investigated.
Unlike previous work, we avoid predicting complex high-dimensional outputs to determine distributions, or compute-intensive ensembling techniques, and instead restrict our output to four parameters, which are sufficient to fully define the asymmetric probability distribution as a continuous function, and train only one model in two stages instead of an ensemble of models. Our approach needs only a single ASI and requires a radiometer only for training. It can be deployed without the usage of a radiometer, greatly reducing maintenance and investment costs for implementation in the field. We report training and inference times to determine the model's potential for edge computing applications for MPC, and estimate the effort to train our model on a much larger and geographically diversified data set. The following research questions are addressed in this study:

Data

The data set, published in 2019, contains a quality-controlled record of the GHI, DNI, and DHI measurements and corresponding ASI images. The data range from 2014 to 2016 with a temporal resolution of 1 min and were collected in Folsom, California. For our application, the image resolution is scaled down to 224 × 224. To avoid trivial cases with no cloud coverage, the majority of clear-sky data points were filtered out with the clear-sky identification method proposed by Reno and Hansen (2016). Additionally, data points were flagged as clear-sky 30 min prior to and after detection to ensure that the model did not predict clear-sky data points, focusing in this study on the prediction of the more challenging non-clear-sky events. For the forecasting task, we also ensured that ASI images for the past hour were available to calculate the variability parameters proposed by Schroedter-Homscheidt et al.
(2018). From this set of ASI images, we then generated irradiance values with the backbone models. This research follows the assumption that no radiometer is available during the operation of the system; we therefore exploit the ASI as a radiometer. The data set was randomly separated into 868 training days, 108 validation days, and 108 testing days. Timestamps of the ASI images were rounded to the full minute, since the images were not taken at exactly the same minute as the irradiance measurements.
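The ±30 min clear-sky exclusion described above can be sketched as follows; this is a minimal illustration with a hypothetical function name, not the authors' published implementation:

```python
import numpy as np

def flag_clear_sky(is_clear, window=30):
    """Extend each detected clear-sky point to +/- `window` minutes.

    `is_clear` is a boolean array at 1-min resolution; any timestamp
    within `window` steps of a detection is also flagged, so that only
    the more challenging non-clear-sky periods remain for training.
    """
    is_clear = np.asarray(is_clear, dtype=bool)
    flagged = np.zeros_like(is_clear)
    for i in np.flatnonzero(is_clear):
        lo = max(0, i - window)
        hi = min(is_clear.size, i + window + 1)
        flagged[lo:hi] = True
    return flagged
```

A single detection therefore masks a 61 min block (the detection minute plus 30 min on each side) from the training pool.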

Methodology
In this section, we present the two-stage training approach, first introducing the backbone model in Sect. 3.1 and subsequently the forecasting model. The forecasting model has an additional variation, which adds the aforementioned variability parameters and serves to determine their impact on performance. The variability parameters are calculated from the backbone model's past and current irradiance estimations from the ASI images. Forecasting horizons of 5, 10, 20, and 30 min were evaluated for each forecasting model variation, considering horizons investigated in the literature and seamless prediction applications with numerical weather prediction methods (Pothineni et al., 2019; Nouri et al., 2021; Xiang et al., 2021; Paletta et al., 2021; Feng et al., 2022; Diagne et al., 2012; Kober et al., 2012; Owens and Hewson, 2018; Urbich et al., 2020).

Backbone model
The ResNet50v2 ANN architecture proposed by He et al. (2016) was originally designed as a classifier: it distinguishes a set of classes describing the content of an image and is trained on the ImageNet data set to validate its performance against other architectures (Deng et al., 2009). Minor modifications to the architecture allow the model to be exploited as a feature extractor, an approach practiced in other disciplines (Rezende et al., 2017; Ou et al., 2019; Ronneberger et al., 2015). Furthermore, a pre-trained model has already learned to extract features within images, such as edges, corners, or color gradients, to perform its task. The model can therefore be deployed to perform other tasks requiring a similar set of features (Ferreira et al., 2018; Du et al., 2018).
In the first step, we modified a pre-trained ResNet50v2 model by removing the output layer after the fifth convolutional block. This step reveals the feature vector, where each element describes a lower-level representation of a certain image section. These representations help the ANN fulfill its task; they are not pre-defined, and the ANN determines which features it requires to generate the desired output from the image. A pre-trained ResNet50v2 model has already learned to extract a set of these low-level representations from a classification task, which is helpful for the investigated application. A two-dimensional global average pooling layer serves to reshape the feature vector to appropriate dimensions (Chaaraoui et al., 2022).
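The global average pooling step can be sketched in a few lines; the function name is our own, and we assume the usual ResNet50v2 output shape (a 7 × 7 spatial grid of 2048 channels for a 224 × 224 input):

```python
import numpy as np

def global_average_pool(feature_map):
    """Collapse an (H, W, C) activation map into a length-C feature
    vector by averaging over the two spatial dimensions, as done by a
    2-D global average pooling layer."""
    return feature_map.mean(axis=(0, 1))
```

For the backbone this turns the final 7 × 7 × 2048 activation map into the 2048-element feature vector that the output neurons operate on.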
To transfer the classification task to a probabilistic regression, we use an approach proposed by Barnes et al. (2021). The output layer is replaced by four densely connected neurons representing the four parameters location (µ), tailweight (τ), skewness (γ), and scale (σ). These parameters define a sinh-arcsinh distribution, proposed by Jones and Pewsey (2009), which allows asymmetric probability distributions to be flexibly defined. µ defines the location of the irradiance distribution and gives an estimate of the solar position and its coverage by a cloud. σ scales the distribution and plays a major role in representing the model's confidence. τ and γ allow the model to create asymmetries by altering the distribution's skewness and kurtosis. We expect this asymmetry to be especially helpful for this particular forecasting task, simply by taking into account that certain cloud cover situations have a higher probability of decreasing (current irradiance is close to clear-sky) or increasing (current irradiance is the result of a cloud blocking the sun) the solar irradiance.
To constrain τ and σ to positive values, the ANN outputs log(τ) and log(σ) instead, using the logarithm as a bijective map: a negative tailweight or a negative spread of the distribution is not possible, and applying the exponential function to the corresponding outputs recovers τ and σ. The backbone architecture is illustrated in Fig. 1.
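The positivity constraint amounts to a simple decoding step after the four output neurons; a minimal sketch (the function name is hypothetical):

```python
import numpy as np

def decode_parameters(raw_mu, raw_log_tau, raw_gamma, raw_log_sigma):
    """Map the four raw network outputs to distribution parameters.

    The raw tailweight and scale outputs are interpreted as log(tau)
    and log(sigma); exponentiation maps any real-valued network output
    back to a strictly positive parameter.
    """
    return raw_mu, np.exp(raw_log_tau), raw_gamma, np.exp(raw_log_sigma)
```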
The task of the probabilistic backbone model is to estimate the probability distributions of the GHI, DNI, and DHI from an ASI image. For that, we first train the model to learn the relation between the image and the irradiance values before moving on to the forecasting task.
To examine the performance differences between this probabilistic approach and a deterministic approach, we exchange the layers after the global average pooling layer for a set of densely connected neurons that output a deterministic irradiance value. This deterministic architecture is shown in Fig. 2.
To output only the GHI, the top and bottom output branches after the global average pooling layer for DNI and DHI are removed.

Backbone model training configuration
A mirroring strategy is applied for our multi-GPU system, replicating the models on each GPU and dividing the batch among them. The weight update is performed after each training step by aggregating the gradients of all four replicas. Data shuffling is applied for each epoch as an augmentation technique. As a loss function, we implement the negative logarithmic likelihood (NLL) for the probabilistic models, as suggested by Barnes et al. (2021) and defined as:

NLL = −(1/N) Σ_{i=1}^{N} log P(y_i | x_i),    (1)

with x_i being the ith input from the batch input of the ANN, y_i being the corresponding real irradiance value, and P the probability density function of the sinh-arcsinh distribution, which is defined by the parameters µ, σ, γ, and τ. The deterministic models compute the loss with the mean squared error (MSE), defined as:

MSE = (1/N) Σ_{i=1}^{N} (y_i − ŷ_i)²,    (2)

with ŷ_i being the output of the ANN. Note that the losses of the models predicting GHI, DNI, and DHI simultaneously are equally weighted among the irradiance components.
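The NLL loss can be made concrete with an explicit sinh-arcsinh density; this sketch assumes the Jones and Pewsey (2009) convention X = µ + σ·sinh((asinh(Z) + γ)/τ) with Z standard normal, which may differ from the exact parameterization used in the implementation:

```python
import numpy as np

def sinh_arcsinh_pdf(x, mu, sigma, gamma, tau):
    """Density of the sinh-arcsinh distribution (Jones and Pewsey, 2009).

    Obtained by the change of variables Z = sinh(tau * asinh(z) - gamma)
    with z = (x - mu) / sigma; reduces to the normal density for
    gamma = 0, tau = 1.
    """
    z = (x - mu) / sigma
    s = np.sinh(tau * np.arcsinh(z) - gamma)   # transformed variable
    c = np.cosh(tau * np.arcsinh(z) - gamma)   # part of the Jacobian
    return (tau * c / (sigma * np.sqrt(2.0 * np.pi * (1.0 + z ** 2)))
            * np.exp(-0.5 * s ** 2))

def negative_log_likelihood(y, mu, sigma, gamma, tau):
    # Batch-averaged NLL, as in Eq. (1).
    return -np.mean(np.log(sinh_arcsinh_pdf(y, mu, sigma, gamma, tau)))
```

Minimizing this quantity jointly fits all four distribution parameters to the observed irradiance values.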
The weights are optimized with the adaptive moment estimation (ADAM) optimizer. While the deterministic models use a learning rate reducer and an early stopper, the probabilistic configurations do not use the early stopper, due to the low number of epochs necessary to train the models. Figure 3 shows the training pipeline of the backbone ANN model. The image is preprocessed by scaling the pixels to values between −1 and 1 (Abadi et al., 2015), and the irradiance values are normalized. Parameters for all four configurations can be found in Table 1.

Variability parameters
Schroedter-Homscheidt et al. (2018) utilize variability parameters from the relevant literature to classify 1 h of minutely radiation time-series into eight variability classes. These variability classes serve as indicators for different cloud situations, which differ in their impact on solar power production. In our study, we use these variability parameters as additional features for our forecasting model to determine whether they have an impact on forecasting performance. All variability parameters are derived from the GHI and DNI, unless stated otherwise, resulting in 15 variability parameters from the GHI and 13 variability parameters from the DNI:

2. V_Coimbra, proposed by Coimbra et al. (2013), is the root mean squared difference between the clear-sky index and the clear-sky index one time step ahead.

3. VI_Stein, proposed by Stein et al. (2012), divides the sum of the root squared differences between the current and previous irradiance by the sum of the root squared differences between the current and previous clear-sky irradiance.

4. The variability indices proposed by Perez et al. (2011), which take the absolute differences between two consecutive clear-sky indices and irradiance values throughout the whole hour and calculate the mean, standard deviation, and maximum value within the sequence of differences.

5. OVER5 and OVER10, representing the occurrences of GHI overshoots within the time-series sequence, counting only exceedances of either 5 % or 10 % over clear-sky as an overshoot occurrence. This variability parameter is only applied to GHI values, since overshoots are theoretically not possible with DNI.
6. Counting the changes in the sign of the first derivative (CSFD) in the time-series sequence, proposed by Kraas et al. (2013), where only those changes are counted that show a 15 % change from one local extreme to the other.
7. Using a time-series envelope function, proposed by Jung (2015). Initially, the irradiance time-series is divided into overlapping sub-sets, each spanning a duration of 4 min, and the minimum and maximum of each sub-set are identified. These points initially form the preliminary upper and lower envelopes, respectively. The sub-sets are then extended by 1 min of the original time-series sequence until a new minimum or maximum is identified, substituting the previously defined minimum or maximum of each sub-set. By methodically applying this process to each 4 min sub-set across the entire time-series, we establish two new time-series, the upper and lower envelopes, which effectively encapsulate the original irradiance time-series. Using these two envelope time-series, Schroedter-Homscheidt et al. (2018) calculate three variability parameters:

- The mean of each delta of the upper and lower envelope time-series, UML (Upper Minus Lower).
- The mean of each delta between the upper envelope time-series and the corresponding clear-sky radiation, UMC (Upper Minus Clear).
- The mean of the lower envelope time-series, LMA (Lower Minus Abscissa).
The variables in the above equations are defined as follows: N is the number of elements in the time-series sequence; g is the radiation component used to calculate the variability parameters, where the component used depends on the architecture described in Sect. 3.3; g_cs is the clear-sky value, calculated with the clear-sky model by Perez et al. (2002); and k_c is the clear-sky index of the corresponding radiation component used in the equation.
The variability parameters are clipped and min-max normalized according to the values shown in Table 2.
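A few of the steps above can be sketched compactly. The envelope here uses a simplified fixed 4 min rolling window rather than the adaptive extension of Jung (2015), the per-parameter clip bounds of Table 2 are passed as arguments, and all function names are our own:

```python
import numpy as np

def overshoot_counts(ghi, ghi_cs):
    """OVER5 / OVER10: minutes in which GHI exceeds the clear-sky GHI
    by more than 5 % or 10 % (defined for GHI only)."""
    ratio = np.asarray(ghi, float) / np.asarray(ghi_cs, float)
    return int(np.sum(ratio > 1.05)), int(np.sum(ratio > 1.10))

def envelope_parameters(g, g_cs, window=4):
    """UML, UMC, LMA from a simplified fixed-window envelope.

    A rolling `window`-minute maximum/minimum stands in for the
    adaptive envelope construction described in the text."""
    g = np.asarray(g, float)
    n = g.size - window + 1
    shifted = np.stack([g[i:i + n] for i in range(window)])
    upper, lower = shifted.max(axis=0), shifted.min(axis=0)
    uml = float(np.mean(upper - lower))                         # upper minus lower
    umc = float(np.mean(upper - np.asarray(g_cs, float)[:n]))   # upper minus clear
    lma = float(np.mean(lower))                                 # mean lower envelope
    return uml, umc, lma

def clip_and_normalize(v, v_min, v_max):
    """Min-max normalization after clipping to the Table 2 bounds."""
    v = np.clip(v, v_min, v_max)
    return (v - v_min) / (v_max - v_min)
```

A constant irradiance series, for instance, yields UML = 0 and an LMA equal to the series value, as expected for a variability-free hour.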

Forecast model
We extend the backbone model to perform the forecasting task by outputting the feature vectors for the current time step t and the past time step t − n from the backbone, with n being the forecasting horizon. These feature vectors are then concatenated to predict the future feature vector at time step t + n, using a set of LSTM and densely connected neuron layers.
To translate the predicted feature vector into a deterministic or probabilistic irradiance, we use the approach from the backbone model, shown in Fig. 1 (probabilistic) and Fig. 2 (deterministic).
Figure 4 illustrates the probabilistic approach we address in this section. This configuration outputs the GHI, DNI, and DHI simultaneously. The architectures for the probabilistic GHI-only model and the corresponding deterministic models can be found in Appendix A; they follow a concept analogous to the approach explained here. The backbone model is trained beforehand, as described in Sect. 3.1. The model is divided into four components (a), (b), (c), and (d).
Figure 4a shows how the variability parameters proposed by Schroedter-Homscheidt et al. (2018) are generated. These parameters are added later to our feature vectors. The variability parameters are derived from a sequence of GHI and DNI values from ASI images of the past time steps t to t − 60 min and t − n to t − n − 60 min, with n being the forecasting horizon. The deterministic GHI and DNI are estimated by the previously described probabilistic backbone model shown in Fig. 1. To obtain explicit values from the probabilistic distributions, the expectation µ_mean of the distribution is analytically calculated, with:

µ_mean = µ + σ sinh(γ/τ) P_{1/τ},  P_q = e^{1/4} (8π)^{−1/2} [K_{(q+1)/2}(1/4) + K_{(q−1)/2}(1/4)].

K_v(x) denotes the modified Bessel function of the second kind, with v being the order of the Bessel function (Amos, 1986; Barnes et al., 2021; Jones and Pewsey, 2009). From the above-mentioned sequence of deterministic GHI values, 15 variability parameters can be derived. The sequence of DNI values does not include the parameters counting the occurrences of 5 % and 10 % overshoots, resulting in 28 variability parameters in total. This component is not retrained in this stage, but is included in the inference.
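The analytic expectation can be evaluated directly with SciPy's modified Bessel function of the second kind; this sketch assumes the Jones and Pewsey (2009) parameter convention, so the authors' exact Eq. (16) may differ in detail:

```python
import numpy as np
from scipy.special import kv  # modified Bessel function of the second kind

def sinh_arcsinh_mean(mu, sigma, gamma, tau):
    """Analytic expectation of the sinh-arcsinh distribution:
    E[X] = mu + sigma * sinh(gamma / tau) * P(1 / tau), with
    P(q) = e^{1/4} / sqrt(8*pi) * (K_{(q+1)/2}(1/4) + K_{(q-1)/2}(1/4))
    (Jones and Pewsey, 2009)."""
    q = 1.0 / tau
    p_q = (np.exp(0.25) / np.sqrt(8.0 * np.pi)
           * (kv((q + 1.0) / 2.0, 0.25) + kv((q - 1.0) / 2.0, 0.25)))
    return mu + sigma * np.sinh(gamma / tau) * p_q
```

For a symmetric distribution (γ = 0) the expectation collapses to the location parameter µ, and a positive skewness shifts the mean above µ.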
Figure 4b shows how additional parameters for the current time step t and the past time step t − n are generated. Again, the previously trained backbone model is used to output the four parameters µ, τ, γ, and σ for GHI, DNI, and DHI. Additionally, the mean µ_mean, the median µ_median, and the standard deviation σ_SD are analytically calculated for GHI, DNI, and DHI and added as additional features. This process results in 21 parameters, and thus a total of 49 parameters added to our feature vectors. Additional clear-sky information is derived from a clear-sky model proposed by Perez et al. (2002), which determines the GHI, DNI, and DHI under clear-sky conditions. Additionally, the DHI is divided into its ground-diffuse and sky-diffuse components. The zenith θ_z, the azimuth σ_s, and their cosine and sine are also computed as additional features, as proposed by Paletta et al. (2021). However, we do not add past irradiance measurements, since we assume that no additional measurement equipment will be available while deploying this model in the field. This step results in a maximum total of 60 parameters; note that the number of parameters varies depending on the architecture used. This component is not retrained in this stage, but is needed when deploying the model for forecasting.
Figure 4c shows the only trainable component of this architecture, using a copy of our previously trained backbone model. The feature vectors and additional parameters from (a) and (b) are concatenated at the corresponding time stamps t and t − n. This feature sequence is then passed through a set of LSTM and densely connected neurons to predict the feature vector at t + n. The parameters for the sinh-arcsinh function are translated from the predicted feature vector in the same fashion as in the backbone model. Deterministic irradiance values are calculated using Eq. (16).

https://doi.org/10.5194/asr-20-129-2024 Adv. Sci. Res., 20, 129-158, 2024
Figure 4d shows a non-trainable component of the architecture that is not used during training or inference of the model. It is required to determine the reference feature vector for our loss function in Sect. 3.3.1. The backbone model generates the feature vector for each image used in training and adds the output of the clear-sky model by Perez et al. (2002) for time step t + n.

Training configuration
To address the research questions stated in Sect. 1, we divide the forecast model into configurations, differing in:

1. Prediction horizon n (5, 10, 20, 30 min).
2. Whether they output all three irradiance components (GHI, DNI, and DHI) or only one (GHI).
3. Whether the variability parameters are considered as additional inputs.
4. Stochastic or deterministic backbone model, with the outputs being stochastic or deterministic accordingly.
The batch size for all configurations is 16, with an initial learning rate of 5 × 10−4. For the deterministic models, we utilize a learning rate reducer with a reduction factor of 0.3, a patience of 8 epochs, and a minimum delta of 10−4. An early stopper stops the training with a minimum delta of 10−5 and a patience of 15 epochs. The maximal number of epochs before canceling the training is set to 100. For the stochastic model, we do not utilize a learning rate reducer or early stopper, since we observed a reduction of the epoch count to 1 epoch until convergence of the validation loss. This observation may be the result of training the backbone prior to extending it to a forecasting task. The weights are optimized with the ADAM optimizer, as for the backbone model.
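The plateau-based schedule described above can be sketched as a small replay loop; this is a hypothetical stand-in for the framework callbacks, using the thresholds assumed in the text:

```python
def train_schedule(val_losses, lr=5e-4, factor=0.3, lr_patience=8,
                   lr_min_delta=1e-4, stop_patience=15, stop_min_delta=1e-5):
    """Replay a sequence of validation losses and return the final
    learning rate and the epoch at which early stopping would trigger.

    The learning rate is multiplied by `factor` once the loss has not
    improved by `lr_min_delta` for more than `lr_patience` epochs;
    training stops after `stop_patience` epochs without an improvement
    of at least `stop_min_delta`.
    """
    best, lr_wait, stop_wait = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        lr_wait = 0 if loss < best - lr_min_delta else lr_wait + 1
        stop_wait = 0 if loss < best - stop_min_delta else stop_wait + 1
        best = min(best, loss)
        if lr_wait > lr_patience:
            lr, lr_wait = lr * factor, 0
        if stop_wait > stop_patience:
            return lr, epoch
    return lr, len(val_losses) - 1
```

On a loss curve that improves once and then plateaus, the learning rate is reduced after the 8-epoch patience elapses and training halts after the 15-epoch patience.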
Aside from the loss functions presented in Eqs. (1) and (2), we also add the loss between the predicted feature vector and the feature vector determined by the backbone model, weighted by β1 and β2. For the deterministic models, the summarized loss is calculated with β1 = 0.3 and β2 = 0.5 as:

L_det = β1 · MSE(feature vector) + β2 · MSE(irradiance),

and for the probabilistic models with β1 = 0.1 and β2 = 0.9 as:

L_prob = β1 · MSE(feature vector) + β2 · NLL(irradiance).

Note that the losses of the models predicting GHI, DNI, and DHI simultaneously are equally weighted among the irradiance components, as with the backbone models.
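The deterministic variant of this two-term loss can be sketched as follows; the weighted-sum form is our reading of the description above, and the function name is hypothetical:

```python
import numpy as np

def combined_deterministic_loss(feat_pred, feat_ref, y_pred, y_true,
                                beta1=0.3, beta2=0.5):
    """Weighted sum of a feature-space MSE (predicted feature vector
    vs. the backbone's reference vector) and the output-space MSE."""
    feature_loss = np.mean((feat_pred - feat_ref) ** 2)
    output_loss = np.mean((y_pred - y_true) ** 2)
    return beta1 * feature_loss + beta2 * output_loss
```

The probabilistic variant replaces the output-space MSE with the NLL of Eq. (1) and uses β1 = 0.1 and β2 = 0.9.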
Figure 5 shows the training pipeline of the forecasting ANN model. The images are again preprocessed through the Tensorflow ResNet50v2 preprocessing function, as with the backbone model. The external variables are extracted from the backbone model's outputs or calculated via the clear-sky model by Perez et al. (2002).

Evaluation metrics
In this study, we use the following evaluation metrics for the deterministic backbone and forecasting models:

1. Root mean squared error (RMSE) and normalized RMSE (nRMSE), penalizing errors exponentially with their magnitude, defined as:

RMSE = sqrt((1/N) Σ_{i=1}^{N} (ŷ_i − y_i)²),  nRMSE = RMSE / y_max.

2. Mean absolute error (MAE) and normalized MAE (nMAE), penalizing errors linearly with their magnitude, defined as:

MAE = (1/N) Σ_{i=1}^{N} |ŷ_i − y_i|,  nMAE = MAE / y_max.

3. Mean bias error (MBE) and normalized MBE (nMBE), in order to rule out a systematic bias in the model's output, defined as:

MBE = (1/N) Σ_{i=1}^{N} (ŷ_i − y_i),  nMBE = (ŷ_mean − y_mean) / y_mean,

with y_max being the maximum irradiance value over the entire data set, ŷ_mean the mean value over all predictions, and y_mean the mean over all irradiance values. For the stochastic backbone models, we additionally define the following metrics:

1. Empirical coverage of the distributions q_α, comparing the selected quantile ranges α = 0.95 to 0.05, 0.9 to 0.1, 0.8 to 0.2, 0.7 to 0.3, and 0.6 to 0.4 with the theoretical percentage of data points α within the quantiles, as suggested by Gneiting et al. (2007) and implemented by Xiang et al. (2021).

2. The sharpness S_α of each quantile range α, illustrated by a sharpness diagram, proposed by Bremnes (2004) and defined as:

S_α = (1/N) Σ_{i=1}^{N} (ŷ_{i,α,up} − ŷ_{i,α,low}),
with ŷ_{i,α,up} being the upper and ŷ_{i,α,low} the lower forecasted quantile of the selected quantile range α; q represents the absolute mean of all q_α. For the forecast models, we include a skill score metric used by Paletta et al. (2021) and Yang et al. (2021), which compares the forecasts to a smart persistence forecast, defined as:

ŷ_{i,spf} = g_meas,t · g_cs,t+n / g_cs,t,

with ŷ_{i,spf} being the irradiance forecast and g_cs,t+n the clear-sky irradiance forecast computed by the clear-sky model proposed by Ineichen and Perez (2002), with the forecast horizon n. The current irradiance measurement is denoted as g_meas,t and the current clear-sky irradiance as g_cs,t.
We then determine the nRMSE of the smart persistence forecast and compare it to the nRMSE of the ANN to calculate the forecast skill FS, with:

FS = 1 − nRMSE_ANN / nRMSE_spf.

As an example for better illustration, Fig. 6 shows the estimations for 1 d by the probabilistic three-parameter backbone model and the prediction error as a scatter plot, with the corresponding quantile range from 0.95 to 0.05 as a color gradient. It is important to note that, while a full day would have 1440 data points given the temporal resolution of 1 min, the data shown in this study are limited to timings between sunrise and sunset, hence the lower sample size N, as seen in Fig. 6 with N = 646. The full data set is evaluated in Sect. 4.1 for the backbone models and Sect. 4.2 for the forecasting models.
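The smart persistence reference and the resulting skill score can be sketched as follows; note that the normalization constant of the nRMSE cancels in the ratio, so plain RMSEs suffice (function names are our own):

```python
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2)))

def smart_persistence(g_meas_t, g_cs_t, g_cs_tn):
    """Persist the current clear-sky index and rescale it with the
    clear-sky irradiance at the forecast horizon."""
    return g_meas_t * g_cs_tn / g_cs_t

def forecast_skill(y_true, y_pred, y_spf):
    """FS = 1 - RMSE(model) / RMSE(smart persistence); positive values
    mean the model beats the persistence reference."""
    return 1.0 - rmse(y_true, y_pred) / rmse(y_true, y_spf)
```

By construction, the smart persistence forecast evaluated against itself yields FS = 0, and a perfect forecast yields FS = 1.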
A sharpness diagram for the probabilistic three-parameter model is shown in Fig. 7. The diagram provides an estimate of how narrow the quantile ranges are over the population of the forecast distributions: the narrower the quantile ranges and whiskers, the more confident the model is. Note that this diagram requires q to be within an acceptable range. High sharpness with high q means that the model is overconfident, while very low q together with low sharpness means that the model lacks confidence.
To determine the significance of the models' performance differences, a significance test proposed by Diebold and Mariano (2002) is performed. To determine the significance between the distribution means within the sharpness diagram, we use the independent trimmed t-test proposed by Yuen (1974), which is used to compare distributions with significantly differing variances. To test these variance differences, we perform Levene's test (Levene, 1960). When the variances do not differ significantly, Welch's t-test is performed (Welch, 1947). The p-value threshold for rejecting the null hypothesis is < 0.05.

Results
We used Tensorflow to design and train our ANNs (Abadi et al., 2015). To pre-process our image data, we used the imageIO and openCV libraries (Silvester et al., 2020; Bradski, 2000). Training was performed on the university's HPC cluster, with nodes consisting of 4 Nvidia HGX-A100 SXM4 GPUs with 80 GB memory connected by 600 GB s−1 NVLink, 2 AMD EPYC 7543 CPUs, and 512 GB of system memory. Each node performed the training of two models simultaneously. Inference was performed using 120 GB of system memory and 32 CPU cores, allowing 4 models to be evaluated simultaneously. We summarize our results using the metrics and statistical tests described in Sect. 3.4.

Backbone model
The backbone models show overall good performance in estimating the deterministic and probabilistic irradiance components. Figure 8 shows a scatter plot of the error between the observed and the estimated GHI for the deterministic GHI-only model. The model's error is in a reasonable range considering the 2 % error of the measurement device (Kern et al., 2023), with an RMSE of 58.06 W m−2 and an MAE of 29.93 W m−2. Additionally, the nRMSE of 0.04 and nMAE of 0.02 confirm that the model produces errors close to the measurement error of the radiometer itself. The MBE of 2.11 W m−2 can be neglected with regard to the measurement error. The observed and estimated GHI correlate closely, with R = 0.98.
For the simultaneous estimation of the GHI, DNI, and DHI, similar performance can be observed for the GHI estimation, as shown in Fig. 9. The RMSE of 59.31 W m−2 and MAE of 30.28 W m−2 are slightly higher compared to the GHI-only model. The MBE of 1.43 W m−2 is negligible, and the observation strongly correlates with the estimation (R = 0.98). Furthermore, the model can additionally estimate the DNI and DHI, with RMSEs of 87.30 and 21.98 W m−2 and MAEs of 42.67 and 12.99 W m−2, respectively. The MBE remains low, with 0.73 W m−2 for DNI and 0.89 W m−2 for DHI. The DNI and DHI estimations correlate closely with the observation: R = 0.97 and 0.98, respectively. However, a one-sided Diebold-Mariano significance test on the GHI estimations of both models showed that the GHI-only model performs significantly better than the three-parameter model, with p-value < 0.05 (Diebold and Mariano, 2002).
The probabilistic estimation of the GHI results in error metrics similar to those of the deterministic approach, with an RMSE of 58.36 W m−2, MAE of 29.33 W m−2, and MBE of −1.39 W m−2. Figure 10 compares the estimation with the observation as a scatter plot, with q0.95,0.05 colored according to its magnitude for each estimation, visualizing the confidence of each estimation. The figure shows that the lower the error, the higher the confidence of the estimation. When the probabilistic GHI-only model is compared with the deterministic model, a Diebold-Mariano significance test fails to reject the null hypothesis (p-value > 0.05), indicating that there is no significant performance difference between the two methods. A q of 13.84 % gives the mean percentage of estimations that are not inside their respective theoretical quantiles α.
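The coverage metric q can be read as the mean deviation between the nominal quantile levels α and the empirical coverage of the predicted quantiles. A hedged sketch, assuming this definition (the paper's exact formulation may differ) and using illustrative arrays:

```python
import numpy as np

def coverage_deviation(obs, q_pred, alphas):
    """obs: (n,); q_pred: (n, k) predicted quantiles; alphas: (k,) nominal levels."""
    empirical = (obs[:, None] <= q_pred).mean(axis=0)  # per-level empirical coverage
    return float(np.mean(np.abs(empirical - np.asarray(alphas))))
```

A perfectly calibrated model yields q = 0.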
Generating three probability distributions for GHI, DNI, and DHI simultaneously shows a more pronounced error increase for all three irradiance components when compared to the deterministic counterpart. Figure 11 shows the scatter plots for all three irradiance components. The errors are higher for all irradiance components: RMSE of 61.79 W m−2, MAE of 29.95 W m−2, and MBE of 6.21 W m−2 for GHI; RMSE of 99.69 W m−2, MAE of 44.27 W m−2, and MBE of −1.02 W m−2 for DNI; and RMSE of 23.99 W m−2, MAE of 12.75 W m−2, and MBE of 4.77 W m−2 for DHI. Correlation coefficients range from 0.96 to 0.98. As observed for the GHI-only model, q0.95,0.05 for each GHI estimation increases as the error increases. For DNI, however, this increase only occurs at higher irradiance values, while for DHI it only occurs at lower irradiance values. Furthermore, we can observe a decreased q of 7.7 %, favoring the three-parameter approach in this regard.
A Diebold-Mariano significance test confirms that the GHI-only model is significantly better than the three-parameter approach, with p-value < 0.05. The same result can be observed when testing for significance between the probabilistic and deterministic approaches, confirming that the deterministic three-parameter approach performs significantly better than its probabilistic counterpart, with p-value < 0.05.
Further investigations of the probability distributions are illustrated in the sharpness diagram in Fig. 12, grouping the GHI-only model and the components of the three-parameter model into the estimations' quantile ranges α. The model shows pronounced uncertainty in the estimation of the DNI, with the whisker range reaching almost half of the possible DNI range for q0.95,0.05. The other radiation components show more reasonable uncertainties. This difference may be the result of the higher gradients of the DNI compared to the DHI. A comparison between the models' GHI outputs shows an increased sharpness for the GHI-only model. An independent trimmed t-test (Yuen, 1974) confirms that the difference between the means of the GHI-only model and the GHI of the three-parameter model is significant for each quantile range α, with p-values < 0.05. Furthermore, Levene's test confirms significant differences in variance for all quantile ranges α, with p-values < 0.05 (Levene, 1960). The variance reduction observed with the GHI-only model is therefore significant and indicates that the model's confidence fluctuates less than that of the three-parameter model.
Table 3 also indicates that training times are acceptable and that inference takes less than one second. The three-parameter model shows lower training times per epoch and lower inference times in both the probabilistic and the deterministic configuration.

Forecast model
Extending the deterministic estimation task to a forecasting task, as described in Appendix A2, we observe a positive forecast skill FS over all forecasting horizons, with and without the addition of variability parameters. Training time decreases compared to the backbone model. This decrease is mainly caused by the lower number of data samples, resulting from the need for three consecutive pairs of ASI images and irradiance data plus 1 h of consecutive ASI images for the irradiance estimation. Additionally, data points and images are occasionally missing, and additional clear-sky data points are removed. A decrease of the training time per epoch can be observed when adding variability parameters to the training. Also, the model with variability parameters needs fewer epochs to trigger the early stopper. Inference times remain well below one second, although the computation of the variability parameters increases the inference time. Inference and training times are summarized in Table 6.

For the configuration without variability parameters, errors increase with the forecasting horizon, with RMSE ranging between 77.3 and 108.2 W m−2 and MAE between 41.73 and 64.55 W m−2. RMSEs are slightly lower for the 5, 10, and 30 min horizons when adding variability parameters to the model, while MAEs are higher for all forecasting horizons. A Diebold-Mariano significance test fails to reject the null hypothesis (p-value > 0.05), showing that the variability parameters do not significantly impact the model's performance. The MBE, between −6.07 and −1.53 W m−2, is negligible. Forecasts correlate highly with the observation, with R between 0.93 and 0.96 and forecast skills reaching up to 18.1 %. Similar results can be observed in the configuration with variability parameters. The metrics are summarized in Tables 4 and 5 and confirm that the estimation task can be extended to a forecasting task for different prediction horizons while maintaining a positive skill score. Figures 13 and 14 illustrate the errors as scatter plots for a 5 min forecasting horizon for both configurations.

Table 4. Evaluation metrics for the GHI-only deterministic forecasting model. Bold values: an upward arrow means higher is better, a downward arrow means lower is better, and an arrow toward 0 means closer to 0 is better.

Table 5. Evaluation metrics for the GHI-only deterministic forecasting model, with variability parameters. Bold values: an upward arrow means higher is better, a downward arrow means lower is better, and an arrow toward 0 means closer to 0 is better.

When forecasting all three irradiance parameters simultaneously, as shown in Fig. A3, a decrease of the training time per epoch can be observed when adding variability parameters to the training. Also, the number of epochs until triggering the early stopper decreases. Inference time remains, again, well below 1 s, although the calculation of the variability parameters increases the inference time, as with the GHI-only configuration. This increase is mainly caused by the additional calculation of variability parameters derived from estimated DNI values. Inference and training times are summarized in Table 7. A significant performance decrease compared to the GHI-only model can be observed, with p-value < 0.05 in a Diebold-Mariano test. Adding variability parameters to the model does not have any significant impact on the model's performance (p-value > 0.05). Evaluation metrics are summarized in Tables 8 and 9.
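The skill score FS against a smart-persistence reference can be sketched as below; the clear-sky-index scaling and FS = 1 − RMSE_model / RMSE_reference are common conventions and assumptions here, not quoted from the paper.

```python
import numpy as np

def smart_persistence(ghi_now, cs_now, cs_future):
    """Propagate the current clear-sky index to the forecast horizon."""
    kt = np.divide(ghi_now, cs_now,
                   out=np.zeros_like(ghi_now), where=cs_now > 0)  # clear-sky index
    return kt * cs_future

def forecast_skill(obs, model_fc, reference_fc):
    """FS = 1 - RMSE(model) / RMSE(reference); positive means the model wins."""
    rmse = lambda e: np.sqrt(np.mean(np.square(e)))
    return float(1.0 - rmse(model_fc - obs) / rmse(reference_fc - obs))
```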
The probabilistic GHI-only architecture in Fig. A1 shows that the forecasting errors in Table 10 increase significantly (p-value < 0.05) compared to the deterministic GHI-only model. When variability parameters are added to the model, no significant performance gain can be confirmed from the errors (see Table 11), with p-value > 0.05.
Figures 15 and 16 show the errors of the probabilistic models' 5 min forecasts as scatter plots. q0.95,0.05 is colored according to its error magnitude for each forecast. The plots show that q0.95,0.05 increases with forecast error. A comparison of q for both configurations shows that q for the forecasting horizons of 20 and 30 min is lower when variability parameters are added. The sharpness diagram in Fig. 17 reveals that the distribution means of all quantile ranges α are lower in the configuration without variability parameters. The significance of the reductions in the distribution means is confirmed using an independent t-test for distribution pairs with similar variances and an independent trimmed t-test for distributions with differing variances, with p-values < 0.05.
Table 12 shows inference and training times for the probabilistic GHI-only models. The inference time increases but still remains well below one second. The training time per epoch increases, and the difference caused by adding variability parameters decreases. For further elaboration, example forecasting scenarios and their corresponding image sequences are illustrated and discussed in the Supplement.

Conclusions
In this research, we present a novel approach to designing and training pre-trained neural networks to generate probabilistic solar irradiance forecasts from ASI images. The research questions formulated in Sect. 1 are answered as follows:

1. An estimation of the GHI from ASI images can be performed by exploiting an ImageNet pre-trained ResNet50v2 as a feature extractor.

3. Adding the DNI and DHI to the estimation task significantly increases the error of the GHI estimation. This is true for both the deterministic and the probabilistic model with an equal loss weighting for all three irradiance components.

Table 8. Evaluation metrics for the deterministic forecasting model for all irradiance components. Bold values: an upward arrow means higher is better, a downward arrow means lower is better, and an arrow toward 0 means closer to 0 is better.

Table 9. Evaluation metrics for the deterministic forecasting model for all irradiance components, with variability parameters. Bold values: an upward arrow means higher is better, a downward arrow means lower is better, and an arrow toward 0 means closer to 0 is better.

Table 11. Evaluation metrics for the GHI-only probabilistic forecasting model, with variability parameters. Bold values: an upward arrow means higher is better, a downward arrow means lower is better, and an arrow toward 0 means closer to 0 is better.
4. It is possible to generate irradiance probability distributions of the GHI without a significant error increase compared to the deterministic model.
5. Forecasts with positive skill are possible for deterministic and probabilistic GHI forecasts of up to 30 min.
Adding the DNI and DHI with equal loss weighting significantly increases the error of the GHI forecast, but a positive skill score is maintained. The forecast skill of the GHI-only model decreases significantly when transitioning to a probabilistic forecasting task but offers additional information about the model's confidence.

7. Inference times remain below one second, ranging from 184.92 to 290.74 ms for the forecasting models and from 63.07 to 77.94 ms for the backbone models.
The empirical coverage deviation q of the distributions' quantiles α produced by the probabilistic forecasting model might be reduced by utilizing calibration techniques on the neural network, such as defining q as a loss function. A proper quantification of the sharpness diagrams might also be incorporated into the loss function. The Brier score, proposed by Brier (1950), or a combination of these loss functions could also be an option to calibrate the model properly and thus maintain good empirical coverage and high sharpness.
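One candidate calibration term is the pinball (quantile) loss, which penalizes quantile estimates asymmetrically around the nominal level α; a NumPy sketch is shown below (a Keras translation is straightforward). Using it for this model is a suggestion consistent with the discussion above, not the implemented method.

```python
import numpy as np

def pinball_loss(y, q_pred, alpha):
    """Mean pinball loss for a predicted quantile at nominal level alpha."""
    diff = y - q_pred
    # Under-prediction is weighted by alpha, over-prediction by (1 - alpha).
    return float(np.mean(np.maximum(alpha * diff, (alpha - 1.0) * diff)))
```

Minimizing this loss over each α drives the predicted quantiles toward their nominal coverage.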
The observed training time reduction through the set of variability parameters proposed by Schroedter-Homscheidt et al. (2018) shows that meteorological findings should be considered not only for increasing ANN performance but also for reducing ANN training time. Additionally, predicting variability parameters rather than irradiance values can illuminate whether an ANN can extract such information from the input image sequence and how it does so.
Regarding training time, the same is true for adding the DNI and DHI to the estimation task. While the maximum reported training time of roughly 1 d might not be a long period for training ANNs, this could become a critical factor when training on a larger set of ASI images and irradiance data from all over the world. Furthermore, varying the loss weights among the GHI, DNI, and DHI in the three-parameter model, instead of treating all three parameters equally, could also lead to different results. This needs to be evaluated in future work.
Modifying the architecture of the LSTM and Dense layers for the forecast model could potentially enhance performance in predicting the target feature vector based on the current and past feature vectors.Experimenting with different configurations and sizes of these layers may reveal valuable insights into efficiency improvements and more accurate predictions.This approach involves exploring variations in layer parameters, such as the number of units in LSTM layers or the activation functions in Dense layers, to optimize the model's ability to learn and generalize from temporal data sequences.Additionally, incorporating a broader and more detailed series of input images could further refine the model's capacity to effectively learn and extract meaningful patterns from temporal data sequences.
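As a hedged illustration of the kind of variation discussed here, a minimal Keras LSTM + Dense head mapping three past feature vectors to a future feature vector could look as follows; all layer sizes, the sequence length, and the feature dimension are assumptions, not the paper's configuration.

```python
import tensorflow as tf

SEQ_LEN, FEAT_DIM = 3, 128                      # assumed sequence/feature sizes

inputs = tf.keras.Input(shape=(SEQ_LEN, FEAT_DIM))
x = tf.keras.layers.LSTM(64)(inputs)            # temporal aggregation of the sequence
x = tf.keras.layers.Dense(64, activation="relu")(x)
outputs = tf.keras.layers.Dense(FEAT_DIM)(x)    # predicted future feature vector
forecast_head = tf.keras.Model(inputs, outputs)
```

Varying the LSTM unit count or the Dense activations in such a head is one concrete way to run the configuration experiments proposed above.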
Pothineni et al. (2019) also reported an increase in forecast accuracy when training data from different geographical locations with differing atmospheric conditions are used. While this performance increase was only confirmed for deterministic forecasts, it is plausible that the same holds for probabilistic forecasts. Therefore, we are currently gathering ASI images and irradiance data from Ghana within the framework of the EnerSHelF project to test our findings on a more unique data set (Meilinger and Bender, 2023; Yousif et al., 2022). While this study confirmed the possibility of using ImageNet pre-trained ANN models for our application, it should also be possible to use the trained models as pre-trained backbones on more unique and sparse data sets as a transfer learning technique. The method also needs a cross-validation on different sites to determine the effectiveness of using the ASI as a radiometer during deployment.
It is also important to emphasize that this study filtered out most clear-sky situations from the dataset to focus on non-clear-sky conditions. Despite the ability of our model to predict clear-sky situations, as shown in the Supplement, predicting such clear-sky situations would be a proper approach to differentiate situations where more conservative methods are more suitable for predicting the irradiance. One approach would be to replace the labels in our method with a binary clear-sky label and train the model with a binary cross-entropy loss function. Labeling the images as clear-sky can be done via the clear-sky detection method used in this study. We have performed a preliminary study using a pre-trained ResNet50v2 classifier, modifying its output vector to output two classes instead of 1000. These two classes represent clear-sky and non-clear-sky. The general principle is shown in Fig. A4. After training this model, we achieve an accuracy of 93.2 %. Extending the model to a forecasting model, similar to the approach proposed in this study, we achieve a binary accuracy of 93.3 % for a 5 min forecasting horizon, 93.2 % for a 10 min forecasting horizon, 91.6 % for a 20 min forecasting horizon, and 90.6 % for a 30 min forecasting horizon. The architecture is shown in Fig. A5.
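The described two-class modification can be sketched in Keras as below; `weights=None` avoids the ImageNet download here (the study uses pre-trained weights, i.e. `weights="imagenet"`), and the input size is an assumption.

```python
import tensorflow as tf

# ResNet50v2 trunk; in the study the ImageNet weights would be loaded
# via weights="imagenet" instead of None.
base = tf.keras.applications.ResNet50V2(
    include_top=False, weights=None,
    input_shape=(224, 224, 3), pooling="avg")

# Two-unit head: [clear-sky, non-clear-sky] instead of 1000 ImageNet classes.
out = tf.keras.layers.Dense(2, activation="softmax")(base.output)
clear_sky_clf = tf.keras.Model(base.input, out)
clear_sky_clf.compile(optimizer="adam", loss="binary_crossentropy",
                      metrics=["binary_accuracy"])
```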
Leveraging the method of segmenting the ASI images through a U-Net to determine the presence of clouds, as proposed by Fabel et al. (2022), could also be a proper approach to differentiate between clear-sky and non-clear-sky conditions and apply proper labeling. Using other clear-sky products to label the data, such as the Copernicus Atmosphere Monitoring Service McClear model (Lefèvre et al., 2013), can also be considered.

Financial support. This research has been supported by the Bundesministerium für Bildung und Forschung (grant no. 03SF0567A-G).
Review statement.This paper was edited by Maurice Schmeits and reviewed by two anonymous referees.

Figure 1. Architecture of the backbone ANN, which generates asymmetric probability distributions for all three irradiance components.

Figure 2. Principle of the backbone ANN architecture to generate the deterministic irradiances for all three irradiance components.

Figure 3. Training pipeline of the backbone ANN model.

Figure 5. General training pipeline of the forecasting ANN model.

Figure 6. Estimation of the GHI for 20 February 2016, using the sinh-arcsinh distribution's quantiles α. The estimations are generated with the probabilistic three-parameter model, and the scatter plot shows the quantiles α = 0.95 to 0.05.

Figure 7. Sharpness diagram resulting from the quantiles α for 20 February 2016.

Figure 8. Scatter plot between observation and estimation of the deterministic GHI-only model over the whole test data set.

Figure 9. Scatter plots between observation and estimation of the deterministic three-parameter model for GHI (a), DNI (b), and DHI (c) over the whole test data set.

Figure 10. Scatter plot between observation and estimation of the probabilistic GHI-only model, with q0.95,0.05 colored according to magnitude, over the whole test data set.

Figure 11. Scatter plots between observation and estimation of the probabilistic three-parameter model for GHI (a), DNI (b), and DHI (c) over the whole test data set.

Figure 12. Sharpness diagram grouping the GHI-only and three-parameter approaches into the estimations' quantile ranges α.

Figure 13. Scatter plot between observation and the 5 min forecast of the deterministic GHI-only model, without variability parameters, over the whole test data set.

Figure 14. Scatter plot between observation and the 5 min forecast of the deterministic GHI-only model, with variability parameters, over the whole test data set.

Figure 15. Scatter plot between observation and the 5 min forecast of the probabilistic GHI-only model, without variability parameters, over the whole test data set.

Figure 16. Scatter plot between observation and the 5 min forecast of the probabilistic GHI-only model, with variability parameters, over the whole test data set.

Figure 17. Sharpness diagram grouping the 20 and 30 min forecasting configurations with variability parameters (w. var.) and without variability parameters (wo. var.) into the predictions' quantile ranges α.

2. With the training data used in our study, the training time of the deterministic backbone models is roughly up to 1 d. The probabilistic backbone models can be trained in even less time, with fewer epochs and less training time per epoch. Using all three irradiance components for the estimation task seems to decrease the training time per epoch. The deterministic forecasting models can be trained in well below 1 d. Adding variability parameters has been shown to reduce both the epoch count and the training time per epoch. The probabilistic forecasting models train much faster.

Table 10. Evaluation metrics for the GHI-only probabilistic forecasting model. Bold values: an upward arrow means higher is better, a downward arrow means lower is better, and an arrow toward 0 means closer to 0 is better.
6. Overall, the set of variability parameters proposed by Schroedter-Homscheidt et al. (2018) does not have a significant impact on the model's forecasting performance.

Figure A2. Architecture of the deterministic forecast model, forecasting the GHI. Blue: not trained, but part of training; green: trained; pink: not trained and not part of training.

Figure A3. Architecture of the deterministic forecast model, forecasting the GHI, DNI, and DHI simultaneously. Blue: not trained, but part of training; green: trained; pink: not trained and not part of training; yellow: not trained, part of training, and optional.

Figure A4. Principle of the backbone clear-sky classification model, based on a ResNet50v2 classifier. The output vector has a dimension of 2 instead of the original 1000. The first output is for clear-sky (1 = true, 0 = false); the second output is for non-clear-sky (1 = true, 0 = false). The model can output values between 0 and 1, giving information about the model's confidence.

Figure A5. Architecture of a probabilistic clear-sky forecast model. Blue: not trained, but part of training; green: trained; pink: not trained and not part of training; yellow: not trained, part of training, and optional. Panel (a) uses the ResNet50v2 backbone for estimating the current and past GHI, as in Fig. A2; panels (b) and (c) use the ResNet50v2 classifier from Fig. A4.

Table 1. Training configurations for the backbone models.

Table 2. Clipping and normalization values for variability parameters. * Only for variability parameters derived from GHI.

Table 3. Inference and training times for all four backbone configurations. Bold values: an upward arrow means higher is better, a downward arrow means lower is better, and an arrow toward 0 means closer to 0 is better.

Table 6. Inference and training times for the deterministic GHI-only configurations. Bold values: an upward arrow means higher is better, a downward arrow means lower is better, and an arrow toward 0 means closer to 0 is better.

Table 7. Inference and training times for the deterministic three-parameter configurations. Bold values: an upward arrow means higher is better, a downward arrow means lower is better, and an arrow toward 0 means closer to 0 is better.

Table 12. Inference and training times for the probabilistic GHI-only configurations. Bold values: an upward arrow means higher is better, a downward arrow means lower is better, and an arrow toward 0 means closer to 0 is better.
Code availability. The code relevant for the probabilistic output layers is referenced in Barnes et al. (2021) (https://doi.org/10.48550/ARXIV.2109.07250), specifically in Appendices C-E. The remaining code adheres to standard Tensorflow conventions and does not claim to be original in design. Essential configuration parameters are detailed within the paper. For further inquiries, requests, or assistance with the code, please contact the corresponding author.

Data availability. The images from the All-Sky Imager (ASI) and the associated irradiance values featured in this study are available for download at https://doi.org/10.5281/zenodo.2826938 (Carreira Pedro et al., 2019).

Supplement. The supplement related to this article is available online at: https://doi.org/10.5194/asr-20-129-2024-supplement.

Author contributions. Conceptualization: SC, SH, SM; methodology: SC, SH, SM; software: SC; validation: SC; formal analysis: SC, SH, SM; investigation: SC; resources: SC, SM, SH; data curation: SC; writing (original draft preparation): SC; writing (review and editing): SC, SH, SM; visualization: SC; supervision: SH, SM; project administration: SM; funding acquisition: SM.

Competing interests. The contact author has declared that none of the authors has any competing interests.

This article is part of the special issue "EMS Annual Meeting: European Conference for Applied Meteorology and Climatology 2022". It is a result of the EMS Annual Meeting: European Conference for Applied Meteorology and Climatology 2022, Bonn, Germany, 4-9 September 2022. The corresponding presentation was part of session OSA1.1: Forecasting, nowcasting and warning systems.

Disclaimer. Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors.