Data
For the current study on annual milk production (in million tonnes) in India, the data series for 1980 to 2019 was compiled from various issues of ‘Basic Animal Husbandry Statistics’, published by the Department of Animal Husbandry and Dairying, Ministry of Fisheries, Animal Husbandry and Dairying, Government of India, New Delhi, and from the website of the Ministry of Agriculture and Farmers Welfare, Government of India. The series was first divided into a training set and a testing set: the production data for 1980-2016 were used for model building, while the last 3 years’ data (2017-2019) were retained for post-sample evaluation.
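The hold-out split described above can be sketched in Python as follows. Only the year labels are used, so no production figures are implied; the helper name `split_holdout` is illustrative and not part of the study.

```python
def split_holdout(series, n_test):
    """Split an ordered series into a training part and the last n_test points."""
    if not 0 < n_test < len(series):
        raise ValueError("n_test must be between 1 and len(series) - 1")
    return series[:-n_test], series[-n_test:]


# Year labels only (1980..2019 inclusive); the milk production values
# themselves are not reproduced here.
years = list(range(1980, 2020))
train_years, test_years = split_holdout(years, n_test=3)
print(train_years[0], train_years[-1], len(train_years))  # 1980 2016 37
print(test_years)                                          # [2017, 2018, 2019]
```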
ARIMA model
In an ARMA model, the time series variable is expressed as a linear function of its own past values and of current and past random shocks. Combining the autoregressive and moving average processes gives greater flexibility in fitting actual time series data. An ARMA(p, q) model can be specified as (Box et al., 2015):
$$y_t = \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \varepsilon_t - \theta_1 \varepsilon_{t-1} - \theta_2 \varepsilon_{t-2} - \dots - \theta_q \varepsilon_{t-q}$$
Where,
$y_t$ and $\varepsilon_t$ are the actual observation and the random error at time t, respectively; $\phi_i$ (i = 1, 2, ..., p) and $\theta_j$ (j = 1, 2, ..., q) are the model parameters. The integers p and q are referred to as the orders of the model. The random shocks $\varepsilon_t$ are assumed to be independent and identically distributed with zero mean and constant variance $\sigma^2$.
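As a minimal sketch of the ARMA(p, q) relationship described above (not the estimation code used in the study), the following Python function computes one-step-ahead residuals for given parameter values. It assumes the Box-Jenkins sign convention for the moving average terms, a mean-centred series and zero pre-sample values; the name `arma_residuals` is illustrative.

```python
def arma_residuals(y, phi, theta):
    """One-step-ahead residuals e_t of an ARMA(p, q) model.

    Assumed form (Box-Jenkins sign convention):
        y_t = sum_i phi_i * y_{t-i} + e_t - sum_j theta_j * e_{t-j}
    Pre-sample observations and shocks are taken as zero.
    """
    p, q = len(phi), len(theta)
    e = []
    for t in range(len(y)):
        # Autoregressive part: weighted past observations.
        ar = sum(phi[i] * y[t - 1 - i] for i in range(p) if t - 1 - i >= 0)
        # Moving average part: weighted past shocks.
        ma = sum(theta[j] * e[t - 1 - j] for j in range(q) if t - 1 - j >= 0)
        prediction = ar - ma
        e.append(y[t] - prediction)
    return e


# A noise-free AR(1) path with phi = 0.5 leaves no residual after the first point.
print(arma_residuals([1.0, 0.5, 0.25], phi=[0.5], theta=[]))  # [1.0, 0.0, 0.0]
```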
A popular generalisation of ARMA models, which covers a wide class of non-stationary time series, is obtained by introducing differencing into the model. An ARIMA(p, d, q) model representing homogeneous non-stationary behaviour can be written as follows:
$$\phi(B)(1 - B)^d y_t = \theta(B)\varepsilon_t$$
Where,
B is the backshift operator defined as $B y_t = y_{t-1}$, d represents the order of differencing, and $\phi(B) = 1 - \phi_1 B - \dots - \phi_p B^p$ and $\theta(B) = 1 - \theta_1 B - \dots - \theta_q B^q$ are the autoregressive and moving average polynomials in B. In practice, d is usually 0, 1, or at most 2. The Box-Jenkins methodology (i.e., the ARIMA methodology) comprises three iterative steps, namely identification, parameter estimation and diagnostic checking (Durdu, 2010).
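The differencing operation $(1 - B)^d$ can be illustrated with a short Python sketch. The function names are illustrative, and the inverse step shown covers only the d = 1 case.

```python
def difference(y, d=1):
    """Apply (1 - B)^d to y, i.e. difference the series d times."""
    out = list(y)
    for _ in range(d):
        out = [out[t] - out[t - 1] for t in range(1, len(out))]
    return out


def undifference_once(last_level, diffs):
    """Rebuild levels from first differences, starting from the last observed level."""
    levels, level = [], last_level
    for dv in diffs:
        level += dv
        levels.append(level)
    return levels


print(difference([1, 4, 9, 16], d=1))   # [3, 5, 7]
print(difference([1, 4, 9, 16], d=2))   # [2, 2]
print(undifference_once(16, [9, 11]))   # [25, 36]
```

Forecasts made on the differenced scale have to be cumulated in this way to return to the original scale.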
Identification
The first step in ARIMA model building is to check and ensure stationarity of the series being analysed as the estimation procedures are available only for stationary series. In the next step, based on autocorrelation and partial autocorrelation patterns, one or several potential models are identified.
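The autocorrelation and partial autocorrelation patterns used at this stage can be computed as in the sketch below. It assumes the usual sample autocorrelation estimator and obtains the partial autocorrelations through the Durbin-Levinson recursion; both function names are illustrative.

```python
def sample_acf(y, max_lag):
    """Sample autocorrelations r_0..r_max_lag of a series."""
    n = len(y)
    mean = sum(y) / n
    dev = [v - mean for v in y]
    c0 = sum(v * v for v in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
            for k in range(max_lag + 1)]


def sample_pacf(acf):
    """Partial autocorrelations at lags 1..K from autocorrelations r_0..r_K
    (r_0 must equal 1), using the Durbin-Levinson recursion."""
    pacf, prev = [], []          # prev[j] holds phi_{k-1, j+1}
    for k in range(1, len(acf)):
        num = acf[k] - sum(prev[j] * acf[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(prev[j] * acf[j + 1] for j in range(k - 1))
        phi_kk = num / den
        prev = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)] + [phi_kk]
        pacf.append(phi_kk)
    return pacf


# Theoretical AR(1) autocorrelations 0.5**k give a PACF that cuts off after lag 1.
print(sample_pacf([1.0, 0.5, 0.25, 0.125]))
```

A PACF that cuts off after lag p with a slowly decaying ACF points towards an AR(p) structure, while the reverse pattern points towards an MA(q) structure.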
Parameter estimation
The identification stage yields one or more tentative models that appear to represent the available data adequately. Once a tentative model is specified, its parameters are estimated by maximum likelihood estimation (MLE).
Diagnostic checking
In this step, the residuals of the tentatively selected model are tested for white noise. If the residuals are not white noise, another candidate model is selected and the procedure is repeated until a valid model is found.
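The text does not name the white noise test that was used. One common choice, assumed here purely for illustration, is the Ljung-Box portmanteau statistic $Q = n(n+2)\sum_{k=1}^{h} r_k^2/(n-k)$, where $r_k$ is the lag-k residual autocorrelation. The sketch below computes only the statistic; the function name and the choice of maximum lag h are hypothetical.

```python
def ljung_box_q(residuals, max_lag):
    """Ljung-Box Q statistic over lags 1..max_lag of the residual series."""
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [e - mean for e in residuals]
    c0 = sum(v * v for v in dev)
    q = 0.0
    for k in range(1, max_lag + 1):
        r_k = sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
        q += r_k ** 2 / (n - k)
    return n * (n + 2) * q


# A strongly alternating residual series is clearly not white noise.
print(ljung_box_q([1.0, -1.0, 1.0, -1.0], max_lag=1))  # 4.5
```

Under the white noise hypothesis, Q is compared with a chi-square critical value with h - p - q degrees of freedom; a large Q indicates that the tentative model is inadequate.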
ARIMA-genetic algorithm approach
The genetic algorithm (GA), developed by John Holland and his collaborators (Holland, 1992), is an abstraction of biological evolution based on Charles Darwin’s theory of natural selection (Darwin, 1964). It is one of the most extensively used evolutionary algorithms in terms of the diversity of its applications, which range from graph colouring to pattern recognition and from financial markets to multi-objective engineering optimisation problems. Holland is widely credited as being the first to utilise crossover, recombination, mutation and selection in the study of adaptive and artificial systems. Indeed, a GA is incomplete as a problem-solving approach without these genetic operators. GAs have several advantages over classical optimisation algorithms; parallelism and the ability to handle complex problems are two of their most remarkable features.
GAs rely on the survival of the best individuals. Over generations, the fitness of the population improves and the best solution is finally obtained. Building a GA for the ARIMA problem, as presented in Fig 1, involves a number of steps (Ding, 2011; Ervural et al., 2016; Rathod et al., 2017; Yang, 2020).
String representation
Each chromosome consists of two parts, representing the AR (p) and MA (q) parameters, so that its total length equals (p + q).
Initial population
The initial population is chosen at random. The number of chromosomes in each generation is referred to as the population size, a key parameter for the performance of the GA. However, there is no standard rule for determining this size.
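A random initial population of chromosomes of length (p + q) can be sketched as follows. The search bounds of -1 and 1, the population size and the function name are assumptions made for illustration; the study does not report them in this section.

```python
import random


def init_population(pop_size, p, q, lower=-1.0, upper=1.0, seed=None):
    """Random chromosomes; the first p genes stand for AR parameters and the
    last q genes for MA parameters. Bounds are illustrative assumptions."""
    rng = random.Random(seed)
    return [[rng.uniform(lower, upper) for _ in range(p + q)]
            for _ in range(pop_size)]


population = init_population(pop_size=20, p=1, q=1, seed=42)
print(len(population), len(population[0]))  # 20 2
```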
Fitness evaluation
To initiate the estimation procedure, an objective function must be defined against which the fitness of each chromosome is evaluated. In the current study, the fitness function is specified in terms of the mean absolute percentage error (MAPE).
GA encoding
In order to perform crossover and mutation, the real values have to be represented as binary strings (0 and 1). The number of bits (n) in each variable is provided by:
Where,
The quality value is often set to 0.001.
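As a hedged illustration of binary encoding, one common rule (assumed here rather than taken from the source) chooses the smallest n for which the grid spacing $(X_{upper} - X_{lower})/(2^n - 1)$ does not exceed the quality value. The bounds and the function names below are hypothetical.

```python
import math


def n_bits(lower, upper, quality=0.001):
    """Smallest n with (upper - lower) / (2**n - 1) <= quality (assumed rule)."""
    return math.ceil(math.log2((upper - lower) / quality + 1))


def encode(x, lower, upper, bits):
    """Map a real value in [lower, upper] to a fixed-length binary string."""
    if not lower <= x <= upper:
        raise ValueError("x lies outside the search bounds")
    levels = 2 ** bits - 1
    x_dec = round((x - lower) / (upper - lower) * levels)
    return format(x_dec, "0{}b".format(bits))


bits = n_bits(-1.0, 1.0, quality=0.001)
print(bits)                            # 11
print(encode(-1.0, -1.0, 1.0, bits))   # 00000000000
print(encode(1.0, -1.0, 1.0, bits))    # 11111111111
```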
Selection
The fitness function is used to select chromosomes from the current population to create offspring for the next generation. The higher its fitness, the more likely a chromosome is to contribute one or more offspring to the next generation. Selecting too few chromosomes restricts the number of offspring in the next generation, while retaining too many can carry undesirable traits into the next generation. Hence, at least 50% of the population is retained in natural selection. Among the several methods available for the selection operation (Haupt and Haupt, 2004), the randomisation-based roulette wheel selection has been used in this study.
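Roulette wheel selection can be sketched as below: each chromosome is drawn with probability proportional to its fitness. Because a lower MAPE is better, the fitness values passed in are assumed to be a decreasing transform of MAPE (for example 1/MAPE); that transform and the function name are assumptions of this sketch.

```python
import random


def roulette_select(population, fitness, k, rng=random):
    """Draw k individuals with probability proportional to non-negative fitness."""
    total = sum(fitness)
    if total <= 0:
        raise ValueError("at least one fitness value must be positive")
    chosen = []
    for _ in range(k):
        r = rng.uniform(0.0, total)      # spin the wheel
        cumulative = 0.0
        for individual, f in zip(population, fitness):
            cumulative += f
            if r < cumulative:
                chosen.append(individual)
                break
        else:
            # Guard against rounding when r equals the total.
            chosen.append(population[-1])
    return chosen


rng = random.Random(0)
mape_values = [10.0, 2.5]                     # hypothetical MAPE of two chromosomes
fitness = [1.0 / m for m in mape_values]      # assumed transform: higher is better
picks = roulette_select(["a", "b"], fitness, k=1000, rng=rng)
print(picks.count("a"), picks.count("b"))     # "b" is drawn far more often than "a"
```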
Crossover
Crossover provides new offspring for the next generation by exchanging information between two randomly selected parent chromosomes. Crossover substantially improves GA in terms of exploration and diversification abilities with a view to obtaining the global optimum point. However, in most cases, crossover is not performed on all of the selected chromosomes. In practice, the choice of crossover probability is made between 0.6 and 1.0.
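The source does not state which crossover operator was used. The sketch below assumes single-point crossover on binary strings, applied with a crossover probability in the range mentioned above; the function name and the default of 0.8 are illustrative.

```python
import random


def single_point_crossover(parent_a, parent_b, p_cross=0.8, rng=random):
    """Swap the tails of two equal-length bit strings with probability p_cross."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    if len(parent_a) < 2 or rng.random() >= p_cross:
        return parent_a, parent_b        # no crossover: parents pass unchanged
    point = rng.randint(1, len(parent_a) - 1)   # cut strictly inside the string
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


rng = random.Random(3)
print(single_point_crossover("00000000", "11111111", p_cross=1.0, rng=rng))
```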
Mutation
After crossover, mutation, which acts as a random search, is applied to each offspring in order to avert premature convergence. It flips a bit selected at random from the total number of bits in the population matrix, with a small probability typically ranging between 0.001 and 0.1.
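Bit-flip mutation can be sketched as follows, with each bit flipped independently with a small probability. The default rate of 0.01 and the function name are assumptions within the range quoted above.

```python
import random


def mutate(bits, p_mut=0.01, rng=random):
    """Flip each bit of a binary string independently with probability p_mut."""
    return "".join(
        ("1" if b == "0" else "0") if rng.random() < p_mut else b
        for b in bits
    )


rng = random.Random(11)
print(mutate("0000000000", p_mut=0.1, rng=rng))
```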
GA decoding
To carry out further fitness evaluation, the string values are converted into their equivalent real values by decoding. This process is performed by utilising the equation:
Where,
x and Xdec represent the real and the decimal decoded value of the gene, respectively. Xlower and Xupper accordingly indicate the lower and upper bound of x.
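A common linear decoding rule, assumed here for illustration, maps the decimal value of an n-bit string back to the real interval as $x = X_{lower} + X_{dec}(X_{upper} - X_{lower})/(2^n - 1)$. The function name is hypothetical.

```python
def decode(bits, lower, upper):
    """Convert a binary string to its real value in [lower, upper]
    using the assumed linear mapping with 2**n - 1 grid intervals."""
    x_dec = int(bits, 2)                       # decimal value of the gene
    return lower + (upper - lower) * x_dec / (2 ** len(bits) - 1)


print(decode("0000", 0.0, 15.0))   # 0.0
print(decode("0101", 0.0, 15.0))   # 5.0
print(decode("1111", 0.0, 15.0))   # 15.0
```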
Fitness evaluation
Once selection, crossover, mutation and decoding have been performed, the new offspring are evaluated in the same manner as before.
Replacement
After evaluation, the parents need to be replaced by the new offspring. Replacement operations fall into two main types, viz., generational (non-overlapping) replacement and overlapping replacement. In the former, the parent population is replaced by the offspring population, except for the best individuals among the parents; in the latter, both the offspring and the parent populations compete, according to their fitness values, to survive into the next generation.
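The two replacement schemes can be contrasted in a short sketch. Each individual is paired with a cost to be minimised (such as MAPE); the number of retained parents (`n_elite`) and both function names are assumptions of this illustration.

```python
def generational_replacement(parents, parent_cost, offspring, offspring_cost, n_elite=1):
    """Offspring replace the parents, except for the n_elite best parents."""
    elite = sorted(zip(parent_cost, parents), key=lambda pair: pair[0])[:n_elite]
    kids = sorted(zip(offspring_cost, offspring), key=lambda pair: pair[0])
    survivors = elite + kids[:len(parents) - n_elite]
    return [individual for _, individual in survivors]


def overlapping_replacement(parents, parent_cost, offspring, offspring_cost):
    """Parents and offspring compete; the lowest-cost individuals survive."""
    pool = sorted(zip(parent_cost + offspring_cost, parents + offspring),
                  key=lambda pair: pair[0])
    return [individual for _, individual in pool[:len(parents)]]


parents, parent_cost = ["a", "b"], [1.0, 5.0]
offspring, offspring_cost = ["c", "d"], [2.0, 0.5]
print(generational_replacement(parents, parent_cost, offspring, offspring_cost))  # ['a', 'd']
print(overlapping_replacement(parents, parent_cost, offspring, offspring_cost))   # ['d', 'a']
```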
Termination of GA
Once a convergence criterion is met, for example when the maximum number of iterations is reached or the desired fitness value is obtained, the GA is terminated. Otherwise, the entire algorithm is repeated until the criterion is satisfied.
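The termination logic can be sketched as a loop with two stopping rules, a target cost and an iteration cap. The callables `evolve` (one full generation of selection, crossover, mutation and replacement) and `cost` are placeholders supplied by the caller, and the default limits are hypothetical.

```python
def run_until_converged(population, evolve, cost, max_iter=200, target=1e-6):
    """Repeat generations until the best cost reaches target or max_iter is hit.

    Returns the best individual found and the number of generations run.
    """
    best = min(population, key=cost)
    for generation in range(1, max_iter + 1):
        if cost(best) <= target:
            return best, generation - 1      # desired fitness value obtained
        population = evolve(population)
        best = min(population + [best], key=cost)
    return best, max_iter                    # iteration cap reached


# Toy usage: each "generation" simply halves every value; the cost is |x|.
best, generations = run_until_converged(
    [8.0], evolve=lambda pop: [x / 2 for x in pop], cost=abs, target=1.0)
print(best, generations)  # 1.0 3
```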
Assessment of forecasting accuracy
The forecasting ability of both models is assessed in terms of two widely used accuracy measures, viz., root mean square error (RMSE) and MAPE (Wong et al., 2005; Gonzalez-Vidal et al., 2019). RMSE measures the overall performance of a model and has the form:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(y_t - \hat{y}_t\right)^2}$$
Where,
$y_t$ and $\hat{y}_t$ represent the tth actual and predicted values in the test data set, respectively, and n denotes the size of the test data set. The second measure, i.e., MAPE, is the average percentage error of the point forecasts and is given by:
$$\mathrm{MAPE} = \frac{100}{n}\sum_{t=1}^{n}\left|\frac{y_t - \hat{y}_t}{y_t}\right|$$
Where,
the notations have the same meanings as above.
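Both accuracy measures are straightforward to compute, as in the following sketch. The numbers in the usage example are arbitrary illustrative values, not results from the study.

```python
import math


def rmse(actual, predicted):
    """Root mean square error over paired actual and predicted values."""
    n = len(actual)
    return math.sqrt(sum((y - f) ** 2 for y, f in zip(actual, predicted)) / n)


def mape(actual, predicted):
    """Mean absolute percentage error; actual values must be non-zero."""
    n = len(actual)
    return 100.0 * sum(abs((y - f) / y) for y, f in zip(actual, predicted)) / n


# Arbitrary illustrative values only.
actual, predicted = [100.0, 200.0], [110.0, 180.0]
print(rmse(actual, predicted))   # square root of 250
print(mape(actual, predicted))   # 10 per cent
```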