Secondary data on the production of sugarcane in Assam have been collected for this study from various publications of the Directorate of Economics and Statistics, Govt. of Assam. The study is based on year-wise time series data on sugarcane production for a period of 60 years, from 1962-63 to 2021-22. The objective of this paper is to develop a suitable time series model to predict the future production of sugarcane in Assam.
Statistical modeling based on time series data has always been considered an important aid in analyzing and estimating the future values of a study variable, under the assumption that the past pattern will continue into the future. The process of estimating the future values of the study variable based on its past values is generally known as forecasting. Forecasting is an important aspect of the policy making process in fields like business, education, health, government etc.
A time series is a set of observations, each one being recorded at a specific time t (Peter Davis, 2001). Mathematically, a time series is defined by the functional relationship:
yt = f(t)
Where,
yt = Value of the variable under consideration at time t (t = 0, 1, 2, 3, ….).
In time series analysis, the recorded observations of the underlying variable and the ordering of those observations are of equal importance.
One of the most widely used stochastic time series models for forecasting is the Autoregressive Integrated Moving Average (ARIMA) model, popularly known as the Box-Jenkins methodology (Box and Jenkins, 1978). ARIMA models have been used extensively to forecast various agricultural productions across the globe. In India, ARIMA models have been used for forecasting sugarcane productivity (Kumar et al., 2017), egg production (Chaudhari and Tingre, 2015), milk production (Mishra et al., 2020), rice production (Mahajan et al., 2020) and fish production (Paul and Das, 2010).
The basic assumption of time series forecasting is that the series under study needs to be stationary. If the mean and variance of a time series are time-invariant, the series is considered stationary. An ARIMA model is a combination of Autoregressive (AR) and Moving Average (MA) terms, representing the series' own past values and the past errors respectively. The general notation for an ARIMA model is ARIMA (p,d,q), where p represents the number of autoregressive terms, d the number of times the series has to be differenced before it becomes stationary and q the number of moving average terms. ARIMA processes incorporate a broad range of non-stationary series that reduce to ARMA (Autoregressive Moving Average) processes when differenced finitely many times. Mathematically, a process {yt} is said to follow an ARIMA (p,d,q) if it satisfies a difference equation of the form:
φ(B)(1 - B)^d yt = θ(B)εt, where εt follows WN(0, σ²)
Where,
φ(B) and θ(B) = Polynomials of degrees p and q respectively in the backshift operator B.
d = Non-negative integration parameter.
εt = White noise error term with mean 0 and variance σ².
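The effect of the differencing operator (1 - B)^d is easy to see in a short sketch. The snippet below is illustrative Python (the study itself used R): applying first differencing to a series with a linear trend yields a constant series, i.e. one that is stationary in mean.

```python
def difference(series, d=1):
    """Apply d-th order differencing: repeatedly replace y_t with y_t - y_{t-1}."""
    out = list(series)
    for _ in range(d):
        out = [out[t] - out[t - 1] for t in range(1, len(out))]
    return out

# A linear trend (non-stationary in mean) becomes a constant series after
# one differencing, illustrating how (1 - B)^d removes polynomial trends.
trend = [5 + 2 * t for t in range(8)]   # 5, 7, 9, ..., 19
print(difference(trend, d=1))           # [2, 2, 2, 2, 2, 2, 2]
print(difference(trend, d=2))           # [0, 0, 0, 0, 0, 0]
```

Each differencing shortens the series by one observation, which is why d is kept as small as possible in practice.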
The Box-Jenkins methodology, i.e. ARIMA modeling of time series data, consists of four steps, viz. identification, parameter estimation, diagnostic checking and forecasting. A time series needs to be tested for stationarity at the identification stage. Different tests are available for testing the stationarity of a time series. This can be done by plotting the values of the variable under study against time points on a graph, or by plotting the values of the autocorrelation function (ACF) and partial autocorrelation function (PACF) for specific lags. The ACF of {yt} at lag k is defined as:
rk = Σ (yt - m)(yt-k - m) / Σ (yt - m)²
Where,
yt = Original series.
yt-k = Lagged series at lag k.
m = Mean of the data set.
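The ACF formula above can be computed directly. The following is an illustrative Python sketch (the actual analysis was carried out in R), with hypothetical series values:

```python
def acf(y, k):
    """Sample autocorrelation at lag k:
    r_k = sum((y_t - m)(y_{t-k} - m)) / sum((y_t - m)^2)."""
    n = len(y)
    m = sum(y) / n                      # mean of the data set
    num = sum((y[t] - m) * (y[t - k] - m) for t in range(k, n))
    den = sum((y[t] - m) ** 2 for t in range(n))
    return num / den

series = [3.0, 5.0, 4.0, 6.0, 5.0, 7.0, 6.0, 8.0]   # hypothetical data
print(acf(series, 0))                # 1.0 by construction
print(round(acf(series, 1), 3))      # lag-1 sample autocorrelation
```

By construction r0 = 1 and every rk lies between -1 and 1.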
The partial autocorrelation function between yt and yt-k is the autocorrelation between yt and yt-k after adjusting for yt-1, yt-2, …, yt-k+1 (Douglas et al., 2008).
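Sample partial autocorrelations are commonly obtained from the sample autocorrelations via the Durbin-Levinson recursion. The sketch below (illustrative Python, not the R code used in this study) implements the recursion and checks it on the theoretical ACF of an AR(1) process, whose PACF cuts off after lag 1:

```python
def pacf(r, K):
    """Partial autocorrelations phi_{k,k} for k = 1..K via the
    Durbin-Levinson recursion, given autocorrelations r[0..K] with r[0] = 1."""
    phi = {}                            # phi[(k, j)] coefficients
    pac = []
    for k in range(1, K + 1):
        if k == 1:
            phi[(1, 1)] = r[1]
        else:
            num = r[k] - sum(phi[(k - 1, j)] * r[k - j] for j in range(1, k))
            den = 1.0 - sum(phi[(k - 1, j)] * r[j] for j in range(1, k))
            phi[(k, k)] = num / den
            for j in range(1, k):
                phi[(k, j)] = phi[(k - 1, j)] - phi[(k, k)] * phi[(k - 1, k - j)]
        pac.append(phi[(k, k)])
    return pac

# For an AR(1) process the theoretical ACF is r_k = rho^k, and the PACF
# cuts off after lag 1: phi_{1,1} = rho, phi_{k,k} = 0 for k > 1.
rho = 0.6
r = [rho ** k for k in range(4)]
print([round(v, 6) for v in pacf(r, 3)])
```

This cut-off behaviour is exactly what is looked for in the PACF plot when choosing the AR order p.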
If the ACF plot is positive and shows a very slow linear decay pattern, the data are considered non-stationary (Nau, 2018). Statistical tests like the Dickey-Fuller test, Augmented Dickey-Fuller (ADF) test and Phillips-Perron test are also available for testing the presence of stationarity in a time series. In our study we have applied the ADF test for testing the stationarity of our data set, along with the ACF and PACF of the series (Dickey and Fuller, 1979). If the series is found to be non-stationary, it must be differenced d times in order to make it stationary. Once the time series becomes stationary, we determine the order of the AR and MA terms, i.e. the values of p and q in ARIMA (p,d,q). This is done by examining the ACF and PACF values of the stationary series. After identifying the values of p and q in the ARIMA (p,d,q) model, the parameters are estimated by maximum likelihood estimation (MLE) at the parameter estimation stage. Having chosen a particular ARIMA model and estimated its parameters, the next step is to check whether the chosen model fits the data reasonably well. The validation of the chosen model rests on two measures of goodness of fit: one based on the ACF of the residuals estimated from the model and the other on Akaike's Information Criterion (AIC) (Akaike, 1974). The ARIMA model with the smaller value of AIC is considered the better fit to the data. The AIC is computed as:
AIC = n ln(σ̂²) + 2k
Where,
σ̂² = Maximum likelihood estimate of the error variance.
k = Number of parameters in the model.
n = Sample size. In this study n is equal to 60.
As an alternative to AIC, the Bayesian Information Criterion (BIC), also called the Schwarz Information Criterion (Schwarz, 1978), is used to identify the best fitted model. BIC is computed as:
BIC = n ln(σ̂²) + k ln(n)
BIC does well at getting the correct order in large samples, whereas AIC tends to be superior in smaller samples where the relative number of parameters is large (McQuarrie and Tsai, 1998).
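Both criteria are simple functions of the estimated error variance and can be sketched in a few lines. The model names and variance values below are hypothetical, not results from this study (illustrative Python; the analysis itself was done in R):

```python
import math

def aic(sigma2_hat, k, n):
    """AIC = n * ln(sigma2_hat) + 2k, with sigma2_hat the ML estimate of
    the error variance, k the number of parameters, n the sample size."""
    return n * math.log(sigma2_hat) + 2 * k

def bic(sigma2_hat, k, n):
    """BIC = n * ln(sigma2_hat) + k * ln(n)."""
    return n * math.log(sigma2_hat) + k * math.log(n)

# Hypothetical error variances for two candidate models fitted to n = 60
# observations (the sample size of this study); the numbers are illustrative.
n = 60
candidates = {"ARIMA(1,1,0)": (2.50, 2), "ARIMA(1,1,1)": (2.48, 3)}
for name, (s2, k) in candidates.items():
    print(name, round(aic(s2, k, n), 2), round(bic(s2, k, n), 2))
# The candidate with the smaller AIC (or BIC) is preferred.
```

Note that for n = 60 the BIC penalty k·ln(60) ≈ 4.09k exceeds the AIC penalty 2k, so BIC favours more parsimonious models.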
Another method of testing the randomness of the residuals is plotting the ACF of the residuals against lags. If about 95% of the sample autocorrelations lie within the limits of ±1.96/√N, where N is the number of observations used to fit the model, the model is considered a good fit. In addition to plotting the individual sample autocorrelations of the residuals (rk), we can test the joint hypothesis that all the sample autocorrelation coefficients up to a certain lag are simultaneously equal to zero. This can be done using the Q statistic developed by Box and Pierce (1970), which is defined as:
Q = n Σ rk² (summed over k = 1 to m)
Where,
n = Sample size.
m = Lag length.
This Q statistic will be calculated for lag length m = 10 in this research work. In large samples, Q is approximately distributed as χ²(m), i.e. the Chi-square distribution with m degrees of freedom. We reject the hypothesis that all the ρk are zero if the computed Q exceeds the critical Q value from the Chi-square distribution at the chosen level of significance.
A variant of the Box-Pierce Q statistic is the Ljung-Box (LB) statistic (Ljung and Box, 1978), which is defined as:
LB = n(n + 2) Σ rk²/(n - k) (summed over k = 1 to m)
which also follows the Chi-square distribution with m degrees of freedom for large samples.
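Both statistics are computed directly from the residual autocorrelations. The sketch below uses hypothetical residual autocorrelations for n = 60 and m = 10 (illustrative Python, not the study's R code):

```python
def box_pierce(resid_acf, n):
    """Q = n * sum_{k=1}^{m} r_k^2 over the residual autocorrelations."""
    return n * sum(r ** 2 for r in resid_acf)

def ljung_box(resid_acf, n):
    """LB = n(n + 2) * sum_{k=1}^{m} r_k^2 / (n - k)."""
    return n * (n + 2) * sum(
        r ** 2 / (n - k) for k, r in enumerate(resid_acf, start=1)
    )

# Hypothetical residual autocorrelations at lags 1..10 for n = 60.
resid_acf = [0.10, -0.05, 0.08, 0.02, -0.11, 0.04, -0.03, 0.06, 0.01, -0.07]
q = box_pierce(resid_acf, n=60)
lb = ljung_box(resid_acf, n=60)
print(round(q, 3), round(lb, 3))
# Compare against the chi-square critical value with m = 10 degrees of
# freedom, 18.307 at the 5% level (from standard tables): values this
# small do not reject the hypothesis that all rho_k are zero.
```

Since n(n + 2)/(n - k) > n for every k ≥ 1, LB always exceeds Q, which is part of why LB has better small-sample behaviour.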
Once the chosen ARIMA model is found to be a reasonably good fit to the data, forecasting is executed using that model.
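As a sketch of how such forecasts are iterated, the snippet below produces h-step-ahead forecasts from an ARIMA(1,1,0) model with an assumed, purely hypothetical AR coefficient (illustrative Python; the actual fitted model and its forecasts appear in the results and discussion section):

```python
def forecast_arima110(series, phi, horizon):
    """h-step-ahead forecasts from a fitted ARIMA(1,1,0) model:
    the differenced series follows delta_t = phi * delta_{t-1} + e_t, so
    the forecast of each future difference is phi times the previous one,
    added back cumulatively to undo the differencing."""
    level = series[-1]
    delta = series[-1] - series[-2]   # last observed difference
    forecasts = []
    for _ in range(horizon):
        delta = phi * delta           # E[delta_{t+h}] = phi^h * delta_t
        level = level + delta
        forecasts.append(level)
    return forecasts

# Hypothetical series values and coefficient, not estimates from the paper.
history = [100.0, 104.0, 103.0, 107.0]
print(forecast_arima110(history, phi=0.5, horizon=3))   # [109.0, 110.0, 110.5]
```

Because |phi| < 1, the forecast differences shrink toward zero and the forecast level flattens out, as expected for a stationary differenced series.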
The best fitted ARIMA model has been developed from the collected data by executing the procedure stated above, and it is presented in the results and discussion section of this paper. R software has been used for the entire analysis of the collected data.