Data collection and digitization
Data for the Corriedale breed covering 11 years (2011-2021) were collected from the Mountain Research Station for Sheep and Goat, J&K, India. Economically important traits, namely birth weight, weaning weight, 6-month weight, 9-month weight and 12-month weight, as well as morphometric measurements at various ages, were taken into consideration for the research. Pedigree data were also collected. For the prediction of body weight, features such as morphometric measurements, body weights at earlier ages and other relevant factors such as breed and sex were acquired, constituting a total of 64 features.
Data cleaning
The raw data were cleaned manually and duplicate values were removed. Unreliable data points, such as records listing the same animal as both dam and sire, were removed, as were noisy rows and rows containing duplicate observations for the same animal identification number. Data cleaning was done in both Python and R.
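A minimal pandas sketch of these cleaning steps; the file name and the column names (animal_id, dam_id, sire_id) are assumptions for illustration, not the station's actual identifiers:

```python
import pandas as pd

# Hypothetical file and column names, for illustration only.
df = pd.read_csv("corriedale_records.csv")

# Drop exact duplicate rows.
df = df.drop_duplicates()

# Remove unreliable records where the same animal is listed
# as both dam and sire.
df = df[df["dam_id"] != df["sire_id"]]

# Remove duplicate observations for the same animal
# identification number, keeping the first record.
df = df.drop_duplicates(subset="animal_id", keep="first")
```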
Iterative imputation
Missing data cause bias to creep into the analysis, making it arduous to analyse (Barnard and Meng, 1999). Rows with too many missing variables were removed completely. For rows where only a few values were missing, imputation was used to fill in the missing variables. The missing values were treated as MAR (missing at random) values (Wu et al., 2004). Data imputation for the current dataset was done iteratively in Python using the scikit-learn open-source machine learning library (Pedregosa et al., 2011) with Bayesian ridge regression (MacKay, 1992; Tipping, 2001).
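A minimal sketch of this step with scikit-learn's IterativeImputer, whose default estimator is BayesianRidge; the row-removal threshold is an assumption, as the paper does not state one:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

# Remove rows with too many missing values (the 70% threshold
# is illustrative, not the study's reported cut-off).
df = df.dropna(thresh=int(0.7 * df.shape[1]))

# Impute the remaining MAR gaps iteratively: each feature with
# missing values is modelled on the others via Bayesian ridge
# regression, cycling until the estimates stabilise.
numeric_cols = df.select_dtypes(include=np.number).columns
imputer = IterativeImputer(estimator=BayesianRidge(), max_iter=10, random_state=0)
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])
```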
Winsorization
Outliers were detected using boxplots and histograms produced with the matplotlib library in Python (Hunter, 2007). To handle the outliers in the dataset, the winsorization technique was used. Winsorization was done in Python using the SciPy library (Gerard-Marchant, 2007).
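A sketch of the inspection and winsorization steps; the trait column and the 5% tail limits are illustrative assumptions:

```python
import matplotlib.pyplot as plt
from scipy.stats.mstats import winsorize

# Visual outlier checks for a trait (column name assumed).
df["weight_12m"].plot(kind="box")
plt.show()
df["weight_12m"].plot(kind="hist")
plt.show()

# Clip the lowest and highest 5% of values to the 5th and
# 95th percentiles (limits are illustrative).
df["weight_12m"] = winsorize(df["weight_12m"].to_numpy(), limits=(0.05, 0.05))
```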
Data types and feature encoding
For the machine learning models, two types of encoding were performed.
Label encoding
This type of encoding was done for variables that had too many categories or values.
One hot encoding
One hot encoding was applied to the nominal data present in the dataset.
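A sketch of both encodings; which columns receive which treatment is an assumption for illustration:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Label encoding for a high-cardinality variable
# (column name assumed).
le = LabelEncoder()
df["dam_id"] = le.fit_transform(df["dam_id"])

# One hot encoding for nominal variables such as breed and sex.
df = pd.get_dummies(df, columns=["breed", "sex"])
```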
Data normalization/standardization
Normalization was done in Python (Pedregosa et al., 2011). For multivariate data, this was done feature-wise, i.e., independently for each feature of the data.
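A minimal sketch of feature-wise scaling with scikit-learn; the paper does not state which scaler was used, so MinMaxScaler here is an assumption (StandardScaler would standardize instead):

```python
from sklearn.preprocessing import MinMaxScaler

# Scalers in scikit-learn operate feature-wise, i.e. each
# column is rescaled independently. MinMaxScaler maps each
# feature to [0, 1]; the choice of scaler is illustrative.
scaler = MinMaxScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
```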
Dimensionality reduction
In order to reduce the number of input variables in the dataset, dimensionality reduction was performed in Python. Two methods were used for this purpose.
1. Principal component analysis (Pearson, 1901): the PCA was fit on the training set, and the resulting transformation was applied to both the training and test sets (see the sketch after this list).
2. Feature selection: an optimal feature subset was selected as the one that optimized the scoring function, using scikit-learn (Pedregosa et al., 2011). Feature selection in Python was based on an F-test estimate of the degree of linear dependency between two numerical variables, the input and the output, with the task treated as a regression predictive modelling problem (Pedregosa et al., 2011).
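A sketch of both reduction methods; the component count, the number of selected features (k) and the split proportion are illustrative assumptions:

```python
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import train_test_split

# Predictors and target (the 12-month weight column name is
# assumed for illustration).
X = df[numeric_cols].drop(columns="weight_12m")
y = df["weight_12m"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 1. PCA: fit on the training set only, then apply the same
#    mapping to both partitions.
pca = PCA(n_components=10)  # component count is illustrative
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

# 2. Feature selection: score each input against the output
#    with an F-test of linear dependency and keep the best k.
selector = SelectKBest(score_func=f_regression, k=10)  # k is illustrative
X_train_fs = selector.fit_transform(X_train, y_train)
X_test_fs = selector.transform(X_test)
```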
Feature selection was done both for the original dataset and for the features extracted by PCA. As a result, three separate datasets were created for the prediction of body weight from morphometric measurements: principal component analysis (PCA), feature selection (FS), and PCA followed by feature selection (PCA+FS).
Multicollinearity was checked using pair plots in Python (Waskom, 2021). The variance inflation factor was also checked for all variables before analysis.
Principal component regression
Principal component regression was performed on both datasets, viz. PCA + FS and PCA. PCR was carried out by finding M linear combinations (also known as principal components) of the p predictors and employing least squares to fit the linear regression model, with the principal components used as predictors (Sutter et al., 1992).
The scoring criteria for evaluating the models were mean squared error, mean absolute error, coefficient of determination and correlation coefficient. The following data splits were used for constructing the model (a sketch of the workflow follows the list):
1. Testing data (10% of the dataset), training data (90% of the dataset), validation data (10% of training data).
2. Testing data (20% of the dataset), training data (80% of the dataset), validation data (10% of training data).
3. Testing data (20% of the dataset), training data (80% of the dataset), validation data (20% of training data).
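A sketch of the PCR workflow under split 2 (80/20, with 10% of the training data held out for validation); the component count M is an assumption:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# X: predictor matrix, y: 12-month body weight (as in the
# earlier sketch). Split 2: 80% training / 20% testing, then
# 10% of the training data held out for validation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=0)

# PCR: least squares on M principal components (M = 5 is
# illustrative, not the study's reported value).
pcr = make_pipeline(PCA(n_components=5), LinearRegression())
pcr.fit(X_tr, y_tr)

# Evaluate against the stated scoring criteria.
pred = pcr.predict(X_test)
print("MSE:", mean_squared_error(y_test, pred))
print("MAE:", mean_absolute_error(y_test, pred))
print("R2 :", r2_score(y_test, pred))
print("r  :", np.corrcoef(y_test, pred)[0, 1])
```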
Ordinary least squares
A prediction equation for the 12-month body weight was derived for both datasets using ordinary least squares in Python. The models used had the following structure: