In this research, we aim to predict the production of pulses in six countries (Afghanistan, Bangladesh, China, India, Nepal and Pakistan). These countries were selected from the South Asian region on the basis of their higher production. The study uses annual data for the period 1961-2019, obtained from www.fao.org. To forecast pulses production up to the year 2027, we use three types of models, the Autoregressive Integrated Moving Average (ARIMA) model, the nonlinear autoregressive neural network (NNAR) model and the Exponential Smoothing (ETS) model, and compare their results. We use data from 1961-2015 for estimating and training the models and data from 2016-2019 to validate them. Before that, our methodology goes through several stages:
Data exploration
To visualize the features of the data (patterns, unusual observations, changes over time), we first plot the data and then summarize it through descriptive statistics. The normality of the data is tested using the Jarque-Bera statistic:
\[ JB = \frac{n}{6}\left(S^{2} + \frac{(K-3)^{2}}{4}\right) \tag{1} \]
Where
n: number of observations, S: skewness, K: kurtosis.
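As an illustration, the normality statistic built from n, S and K can be computed as sketched below. The sketch assumes the statistic is the Jarque-Bera statistic (the test named in the discussion of Table 1) and that skewness and kurtosis are computed from population (biased) moments. The function name and the sample values are hypothetical and are not taken from the study data.

```python
import math

def jarque_bera(values):
    """Return (JB, skewness, kurtosis) computed from population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    return jb, skew, kurt

# Hypothetical values, for illustration only.
sample = [1.0, 2.0, 3.0, 4.0, 5.0]
jb, s, k = jarque_bera(sample)

# Under normality JB is asymptotically chi-square with 2 degrees of freedom,
# whose upper-tail probability is exp(-x / 2).
p_value = math.exp(-jb / 2.0)
```

A probability below 0.05 would lead to rejecting normality at the 5% level. With only a handful of observations the asymptotic approximation is rough, so the example is meant to show the arithmetic rather than a reliable test.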
Data stationarity
A time series that has a trend or changing volatility is not stationary: its statistical properties depend on the time at which it is observed, so it cannot be predicted reliably in the long run. One way to determine whether a time series is stationary is to test for a unit root. In the Augmented Dickey-Fuller (ADF) test, the time series is described by the equation (Dickey and Fuller, 1981):
\[ \Delta y_t = c + \alpha t + \delta y_{t-1} + \sum_{i=1}^{p} \phi_i \Delta y_{t-i} + \varepsilon_t \tag{2} \]
Where
c: constant, α: coefficient on a time trend, δ: coefficient on the lagged level, φ_i: coefficients on the lagged differences, p: lag order of the autoregressive process, ε_t: error term. The ADF test is carried out under the null hypothesis δ = 0 (not stationary) against the alternative δ < 0 (stationary). If the null hypothesis is not rejected, we take the first difference to make the series stationary:
\[ \Delta y_t = y_t - y_{t-1} \tag{3} \]
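The logic of the unit root test and of first differencing can be sketched in a few lines. The sketch below is a simplified Dickey-Fuller regression with a constant only (no time trend and no lagged differences, so it is not the full augmented test), and the simulated series are hypothetical demo data, not the pulses data.

```python
import math
import random

def difference(series):
    """First difference: y_t - y_{t-1}."""
    return [series[t] - series[t - 1] for t in range(1, len(series))]

def dickey_fuller_t(series):
    """t-statistic of delta in  dy_t = c + delta * y_{t-1} + e_t  (OLS)."""
    dy = difference(series)
    x = series[:-1]
    n = len(dy)
    x_bar = sum(x) / n
    dy_bar = sum(dy) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (di - dy_bar) for xi, di in zip(x, dy))
    delta = sxy / sxx
    c = dy_bar - delta * x_bar
    rss = sum((di - c - delta * xi) ** 2 for xi, di in zip(x, dy))
    se_delta = math.sqrt(rss / (n - 2) / sxx)
    return delta / se_delta

# Hypothetical demo data: a random walk and a mean-reverting series.
random.seed(0)
shocks = [random.gauss(0.0, 1.0) for _ in range(200)]
random_walk = [0.0]
mean_reverting = [0.0]
for e in shocks:
    random_walk.append(random_walk[-1] + e)
    mean_reverting.append(0.2 * mean_reverting[-1] + e)

t_walk = dickey_fuller_t(random_walk)                    # unit root in levels
t_stationary = dickey_fuller_t(mean_reverting)           # stationary in levels
t_differenced = dickey_fuller_t(difference(random_walk)) # stationary after differencing
```

The statistic does not follow a Student t distribution under the null hypothesis; it has to be compared with Dickey-Fuller critical values (roughly -2.86 at the 5% level for the constant-only case in large samples). A strongly negative value rejects the unit root, which is what happens for the mean-reverting and the differenced series.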
Estimation of models
To forecast pulses production up to the year 2027, the following three types of models are used:
ARIMA model
ARIMA models are among the most widely used statistical models for time series forecasting; they work by describing the autocorrelation in the data (Box et al., 2015). As their name indicates (Autoregressive, Integrated, Moving Average), these models consist of three parts with orders (p, d, q).
Autoregressive (p) refers to predicting a variable using a linear combination of its past values. The model of order p can be written as:
\[ y_t = c + \beta_1 y_{t-1} + \beta_2 y_{t-2} + \cdots + \beta_p y_{t-p} + \varepsilon_t \tag{4} \]
Where
c: constant, β_1, ..., β_p: parameters of the model, p: lag order of the autoregressive process, ε_t: error term. Integrated (d) refers to the order of differencing needed to make the variable stationary, which is determined using the ADF test.
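Moving average (q) uses past forecast errors in a regression-like model. The model of order q can be written as:
\[ y_t = c + \varepsilon_t + \beta_1 \varepsilon_{t-1} + \beta_2 \varepsilon_{t-2} + \cdots + \beta_q \varepsilon_{t-q} \tag{5} \]
Where
c: constant, β_1, ..., β_q: parameters of the model, q: lag order of the moving average, ε_t: error term.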
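To make the roles of the autoregressive and integrated parts concrete, the following sketch forecasts with an ARIMA(1,1,0) model: the series is differenced once, an AR(1) model with a constant is fitted to the differences, and the forecasts are cumulated back to the level. It assumes ordinary least squares rather than the maximum likelihood estimation that statistical packages normally use, and the function names and the example series are hypothetical.

```python
def fit_ar1(series):
    """OLS estimates (c, beta) for  y_t = c + beta * y_{t-1} + e_t."""
    x = series[:-1]
    y = series[1:]
    n = len(y)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((a - x_bar) ** 2 for a in x)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    beta = sxy / sxx
    c = y_bar - beta * x_bar
    return c, beta

def arima_110_forecast(series, horizon):
    """Forecast `horizon` steps ahead: AR(1) on first differences, then integrate."""
    diffs = [series[t] - series[t - 1] for t in range(1, len(series))]
    c, beta = fit_ar1(diffs)
    last_level = series[-1]
    last_diff = diffs[-1]
    forecasts = []
    for _ in range(horizon):
        last_diff = c + beta * last_diff  # forecast of the next difference
        last_level += last_diff           # add it back to the level
        forecasts.append(last_level)
    return forecasts

# Hypothetical series whose differences halve at every step.
example = [0.0, 1.0, 1.5, 1.75, 1.875]
example_forecast = arima_110_forecast(example, 2)
```

Because the AR(1) coefficient is below one in absolute value here, the forecast differences shrink and the level forecasts flatten out, which is the typical behaviour of an ARIMA(1,1,0) forecast.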
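Whereas (d) is determined by the ADF test, (p) and (q) are determined from the autocorrelation function r(p) and the partial autocorrelation function R(p), which are given as follows:
\[ r(p) = \frac{\sum_{t=p+1}^{n} (y_t - \bar{y})(y_{t-p} - \bar{y})}{\sum_{t=1}^{n} (y_t - \bar{y})^{2}} \tag{6} \]
\[ R(p) = \mathrm{Corr}\left(y_t, y_{t-p} \mid y_{t-1}, \ldots, y_{t-p+1}\right) \tag{7} \]
Where
ȳ: sample mean of the series, n: number of observations.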
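The two functions used for choosing the orders can be computed directly from a series. The sketch below uses the usual sample autocorrelation and obtains the partial autocorrelations through the Durbin-Levinson recursion; the recursion is one standard way of computing them and is an assumption here, not necessarily the procedure used in the study. The function names are hypothetical.

```python
def acf(series, lag):
    """Sample autocorrelation at a given lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((y - mean) ** 2 for y in series)
    num = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return num / denom

def pacf(series, max_lag):
    """Partial autocorrelations for lags 1..max_lag (Durbin-Levinson recursion)."""
    r = [acf(series, k) for k in range(max_lag + 1)]
    phi_prev = []  # AR coefficients of the model of order k - 1
    result = []
    for k in range(1, max_lag + 1):
        num = r[k] - sum(phi_prev[j] * r[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(phi_prev[j] * r[j + 1] for j in range(k - 1))
        phi_kk = num / den
        phi_prev = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j] for j in range(k - 1)] + [phi_kk]
        result.append(phi_kk)
    return result

# Hypothetical values, for illustration only.
demo = [1.0, 2.0, 3.0, 4.0, 5.0]
demo_acf1 = acf(demo, 1)
demo_pacf = pacf(demo, 2)
```

In the usual reading, a partial autocorrelation that cuts off after lag p suggests an AR(p) term, while an autocorrelation that cuts off after lag q suggests an MA(q) term.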
NNAR model
Neural network autoregressive (NNAR) models are statistical models that allow complex nonlinear relationships when predicting a variable from its lagged values: the lagged values of the time series are used as inputs to a neural network. This method was previously suggested by Hyndman et al. (2012). These models are distinguished from ARIMA models by the presence of a hidden layer. The inputs to each hidden node are first combined linearly:
\[ z_j = b_j + \sum_{i=1}^{p} \omega_{i,j} \, y_{t-i} \tag{8} \]
In the hidden layer, this is then modified using a nonlinear function before it is output:
\[ s(z_j) = \frac{1}{1 + e^{-z_j}} \tag{9} \]
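A one-step forecast from such a network can be sketched as follows. The sketch assumes a sigmoid activation in the hidden layer and a linear output layer; the function names and all weights are hypothetical, and in practice the weights would be learned from the data.

```python
import math

def sigmoid(z):
    """Nonlinear activation applied in the hidden layer."""
    return 1.0 / (1.0 + math.exp(-z))

def nnar_forecast(lags, hidden_weights, hidden_biases, output_weights, output_bias):
    """One-step forecast from a single-hidden-layer autoregressive network.

    lags           -- the p most recent observations (y_{t-1}, ..., y_{t-p})
    hidden_weights -- one list of p weights per hidden node
    hidden_biases  -- one bias per hidden node
    output_weights -- one weight per hidden node (linear output layer, an assumption)
    """
    hidden = []
    for weights_j, bias_j in zip(hidden_weights, hidden_biases):
        z_j = bias_j + sum(w * x for w, x in zip(weights_j, lags))  # linear combination
        hidden.append(sigmoid(z_j))                                 # nonlinear transform
    return output_bias + sum(a * h for a, h in zip(output_weights, hidden))

# Hypothetical network with p = 2 lagged inputs and k = 2 hidden nodes.
demo_forecast = nnar_forecast(
    lags=[0.2, 0.1],
    hidden_weights=[[0.0, 0.0], [0.0, 0.0]],
    hidden_biases=[0.0, 0.0],
    output_weights=[1.0, 3.0],
    output_bias=0.5,
)
```

With all hidden weights set to zero, each hidden node outputs 0.5 regardless of the inputs, which makes the arithmetic easy to verify by hand. Multi-step forecasts would feed each forecast back in as a lagged input.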
Where
b_j and ω_{i,j}: parameters of the model, which are learned from the data. This model can be written as NNAR(p, k), where p is the number of lagged inputs and k is the number of nodes in the hidden layer. The model is a neural network that uses the observations (y_{t-1}, ..., y_{t-p}) as inputs for forecasting the output y_t, with k neurons in the hidden layer; the effect of seasonality is neglected because the data is annual. The optimal number of lags p, like (p, q) in the ARIMA model, is chosen using the Akaike information criterion (AIC), which is given as follows:
\[ AIC = 2m - 2\ln(\hat{L}) \tag{10} \]
Where
m: number of estimated parameters, \hat{L}: maximum value of the likelihood function. Note that this model does not impose any stationarity restriction, and therefore the random part is included in the predictions (Lama et al., 2021).
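The information criterion used for choosing the lag order can be sketched as below. The sketch assumes the standard definition of the AIC and, for the likelihood, Gaussian errors with the variance estimated from the residuals; the function names, the parameter counts and the residual values are hypothetical.

```python
import math

def gaussian_log_likelihood(residuals):
    """Maximized Gaussian log-likelihood given model residuals (an assumption)."""
    n = len(residuals)
    sigma2 = sum(e * e for e in residuals) / n
    return -0.5 * n * (math.log(2.0 * math.pi * sigma2) + 1.0)

def aic(log_likelihood, n_params):
    """Akaike information criterion: lower values indicate a better trade-off."""
    return 2.0 * n_params - 2.0 * log_likelihood

# Hypothetical residuals from two candidate models of different size.
residuals_small_model = [1.0, -1.0, 1.0, -1.0]
residuals_large_model = [0.9, -0.9, 0.9, -0.9]

aic_small = aic(gaussian_log_likelihood(residuals_small_model), n_params=2)
aic_large = aic(gaussian_log_likelihood(residuals_large_model), n_params=5)
best = "small" if aic_small < aic_large else "large"
```

The penalty term means that a larger model is preferred only if its fit improves enough to offset the extra parameters.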
ETS Model
Whereas ARIMA models describe the autocorrelation in the data, exponential smoothing (ETS) models are based on describing the trend in the data, as suggested by Holt (1957), Mishra et al. (2021), Devi et al. (2021) and Winters (1960).
ETS models are a systematic development in which exponential smoothing methods are embedded in a nonlinear dynamic model. These models are analyzed using state-space based likelihood calculations, with support for model selection and the calculation of forecast standard errors (Hyndman et al., 2002).
The model considers three main components of a time series: trend (T), seasonal (S) and error (E). The trend term reflects the long-term movement of the time series, and the error term is its unpredictable component. In our case, we do not consider the seasonal term because the data is annual. The components we need are combined in various additive and multiplicative combinations to produce y_t: we can have an additive model, y_t = T + E, or a multiplicative model, y_t = T × E. The options for the individual components are as follows:
E [A, M]
T [N, A, M, AD, MD]
S [N, A, M]
Where
N = None, A = Additive, M = Multiplicative, AD = Additive damped and MD = Multiplicative damped (damping uses an additional parameter to reduce the impact of the trend over time). The models that we are interested in estimating (after selecting S [N]) can be written as in the following Table A:
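The effect of the damping parameter can be illustrated with point forecasts from an additive damped trend model without a seasonal term. The sketch below uses Holt's recursions with a damping parameter phi; the simple initialization (first observation as level, first difference as trend) and the fixed smoothing parameters are assumptions made for illustration, whereas an actual ETS fit would estimate them by maximum likelihood. The function name and the data are hypothetical.

```python
def holt_damped_forecast(series, alpha, beta, phi, horizon):
    """Point forecasts from Holt's additive trend method with damping.

    alpha -- smoothing parameter for the level
    beta  -- smoothing parameter for the trend
    phi   -- damping parameter (phi = 1 gives the undamped linear trend)
    """
    level = series[0]
    trend = series[1] - series[0]
    for y in series[1:]:
        prev_level = level
        level = alpha * y + (1 - alpha) * (prev_level + phi * trend)
        trend = beta * (level - prev_level) + (1 - beta) * phi * trend
    forecasts = []
    damp = 0.0
    for h in range(1, horizon + 1):
        damp += phi ** h  # phi + phi^2 + ... + phi^h
        forecasts.append(level + damp * trend)
    return forecasts

# Hypothetical series, for illustration only.
demo_series = [1.0, 2.0, 3.0, 4.0, 5.0]
undamped = holt_damped_forecast(demo_series, alpha=0.5, beta=0.5, phi=1.0, horizon=3)
damped = holt_damped_forecast(demo_series, alpha=0.5, beta=0.5, phi=0.8, horizon=3)
```

With phi = 1 the forecasts continue the trend linearly; with phi below 1 the successive increments shrink, so the forecast levels off over the horizon.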
Performance indicators
To compare the prediction performance of the three models, we first test the validity of each model by calculating the mean absolute percentage error (MAPE) between the estimated data and the actual data over the testing period (2016-2019):
\[ MAPE = \frac{100}{n} \sum_{t=1}^{n} \left| \frac{y_t - \hat{y}_t}{y_t} \right| \tag{11} \]
Then we evaluate the in-sample performance of the model by calculating the root mean square error (RMSE) and the MAPE between the estimated data and the actual data over the training period (1961-2015):
\[ RMSE = \sqrt{\frac{1}{n} \sum_{t=1}^{n} \left( y_t - \hat{y}_t \right)^{2}} \tag{12} \]
Where
ŷ_t: the forecast value, y_t: the actual value, n: number of fitted observations. The last stage is to predict pulses production for the countries in the study sample until 2027. The model with the lowest values of RMSE and MAPE is considered the best, and the uncertainty of the forecasts is expressed through the 95% prediction interval, given by (Mishra et al., 2021):
\[ \hat{y}_{T+h|T} \pm Z_{\alpha/2} \sqrt{s_h} \tag{13} \]
Where
ŷ_{T+h|T}: the predicted value of the observation at T + h, Z_{α/2} = 1.96, s_h: forecast variance.
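The performance indicators and the prediction interval can be computed as in the following sketch. It assumes the standard definitions of MAPE and RMSE and a normal-approximation interval built from the forecast variance; the function names and the numbers are hypothetical and are not results from the study.

```python
import math

def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    n = len(actual)
    return 100.0 / n * sum(abs((a - p) / a) for a, p in zip(actual, predicted))

def rmse(actual, predicted):
    """Root mean square error, in the units of the data."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def prediction_interval(point_forecast, forecast_variance, z=1.96):
    """Normal-approximation interval; z = 1.96 corresponds to 95% coverage."""
    half_width = z * math.sqrt(forecast_variance)
    return point_forecast - half_width, point_forecast + half_width

# Hypothetical values, for illustration only.
actual = [100.0, 200.0]
predicted = [110.0, 180.0]
demo_mape = mape(actual, predicted)
demo_rmse = rmse(actual, predicted)
lower, upper = prediction_interval(100.0, 25.0)
```

MAPE is scale-free, which makes it convenient for comparing countries with very different production levels, while RMSE stays in the units of the series and penalizes large errors more heavily.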
Empirical study
To show the development and trends in pulses production for Afghanistan, Bangladesh, China, India, Nepal and Pakistan, we present the following Fig 1.
Fig 1 shows that both China and Nepal achieved significant, almost exponential growth in pulses production up to 2019. Afghanistan showed similar growth until 2015, after which production decreased. Bangladesh and Pakistan are in a similar situation: production grew until 1995 and then decreased, following a pattern similar to a second-order polynomial. For India, we note high volatility in pulses production during the studied period, with a downward trend in production. The descriptive statistics in Table 1 show the most important values of the changes in the data.
Table 1 shows that the data for Afghanistan, China, Nepal and Pakistan are not normally distributed, as the Jarque-Bera probability is less than the 5% significance level. Thus the mean and standard deviation are unreliable estimators here, because they are not robust to extreme values (they have a low breakdown point). These countries experienced significant changes in pulses production during the studied period, as shown by the maximum and minimum values and by the skewness and kurtosis coefficients. Pakistan shows the greatest change: production reached 414.4 thousand tons in 1989 before decreasing to 80.7 in 2014. We also note from Table 1 that the data for India and Bangladesh are normally distributed. Production of pulses in India, the largest producer among the six countries, changed from 1506 thousand tons in 1991 to 535.2 in 1992. To find out the effect of this volatility and these trends on the stationarity of the variables, we use the ADF test; the results are given in Table 2.
Table 2 shows that the series for Afghanistan and India are stationary in level, around a linear trend for Afghanistan and around a constant for India, as shown in Fig 1. For the rest of the countries, the high volatility during the studied period means that the series become stationary only at the first difference. With the aim of forecasting pulses production for the six countries up to 2027, we use the ARIMA, NNAR and ETS models: the data for 1961-2015 are used to estimate the models (training) and the data for 2016-2019 to verify their validity (testing). The following table shows the results of estimating the ARIMA model for the six countries:
Table 3 shows that the selected models have better out-of-sample prediction results, as the value of MAPE (testing) is less than MAPE (training) for all models except that of Afghanistan, where the model failed to predict the lower values after 2015. The best model is ARIMA (1,1,0) for Nepal, which achieved the lowest values of AIC, RMSE and MAPE among the selected models. We note from the table that there is no autocorrelation problem in the residuals of any model. We estimate the NNAR model for all countries and get the following results:
Table 4 shows that prediction using the NNAR models is better than prediction using ARIMA inside the sample (training), but this good performance decreases outside the sample. The best model among the selected models is NNAR (1,1) for India, which is better than the corresponding ARIMA model (Mishra et al., 2020). We estimate the ETS model for all countries and get the following results:
Table 5 shows that the best ETS model among the estimated models is (M,N,N) for Nepal, which has the lowest values of RMSE and MAPE both in and out of the sample, similar to the results of the ARIMA model. Among all selected models, we choose the best model for forecasting pulses production to 2027 for each country:
According to these models (Table 6), a forecast of pulses production up to 2027 is obtained, including the uncertainty in the forecast, as shown in the following Fig 2.
Fig 2 shows that pulses production in Afghanistan, China and India is expected to increase until 2027, while the volume of production is expected to remain relatively stable for the rest of the countries. The following table shows the expected production quantities for the six countries until 2027 with 95% prediction intervals (Table 7).
Table 7 shows that India is the largest producer of pulses among the six countries, with production expected to reach 1088.778 thousand tons in 2027, a growth rate of 15.73% over the period 2020-2027. In addition, Afghanistan and China have high growth rates of 25.19% and 11.95%, respectively.