Indian Journal of Animal Research, Volume 55, Issue 3 (March 2021): 359-363

Utilization of boosted classification trees for the detection of cows with conception difficulties

D. Zaborski1,*, W. Grzesiak1
1Department of Ruminants Science, West Pomeranian University of Technology, Szczecin-71270, Poland.
Cite article: Zaborski D., Grzesiak W. (2019). Utilization of boosted classification trees for the detection of cows with conception difficulties. Indian Journal of Animal Research. 55(3): 359-363. DOI: 10.18805/ijar.B-1103.
The present study was aimed at applying boosted classification trees to the detection of cows with potential conception problems. Nine hundred and eighteen artificial insemination records from Polish Holstein-Friesian Black-and-White cows were included. Each record consisted of nine predictor variables. The output variable was the conception difficulty class (good or poor). Sensitivity, specificity and accuracy on the test set were 82.93%, 84.46% and 83.91%, respectively. The most influential predictors of conception difficulty included calving interval, gestation length, body condition score, milk fat and protein content and age. Boosted classification trees could be applied as an on-farm decision support tool to identify cows with conception problems.
Artificial insemination (AI) is of great importance for cattle breeding and many factors affect its success or failure (Bhat and Bhattacharyya, 2012; Bhattacharyya and Hafiz, 2009; Klementová et al., 2017). These include genetic causes, milk production level, nutrition, energy balance, calving course, reproductive tract disorders, infectious and other diseases (lameness, mastitis), sire fertility, AI technique and farm management (Fenlon et al., 2017). Problems with conception may result in substantial financial losses (Grzesiak et al., 2010; Mokhtari et al., 2016). One way of detecting cows with such difficulties is the use of statistical methods, especially those from the field of data mining. Although several previous attempts have already been made to predict the probability of conception (Fenlon et al., 2017; Grzesiak et al., 2010, 2011; Hempstalk et al., 2015; Shahinfar et al., 2014), only a few studies on the use of ensemble tree-based methods can be found in the literature. Single decision trees may be incorporated into an ensemble system such as boosted classification trees. With a set of classification trees, the classifier can achieve a lower misclassification error and greater stability than a single decision tree (Wang et al., 2012). An important variation of this basic boosting strategy was introduced by Friedman (2002) and named stochastic gradient boosting.
        
Therefore, the aim of this study was to apply boosted classification trees to the detection of cows with potential conception problems and to indicate the most important predictors of conception difficulty.
Nine hundred and eighteen AI records from Polish Holstein-Friesian Black-and-White cows were included in the study. The initial dataset contained 1236 AI cases, but was subsequently reduced after eliminating erroneous or incomplete data. The AI records were collected during two calving seasons, only from healthy cows that completed their second lactation. The animals were housed in free-stall barns with outside runs accessible throughout the year. Average 4% fat-corrected milk yield (for a 305-day lactation) was 9958 kg. Each AI record consisted of the following predictor variables: X1 - HF - the percentage of Holstein-Friesian breed (%); X2 - LACT - lactation number; X3 - SEASON - AI season (spring-summer or autumn-winter); X4 - AGE - cow age at AI (in months); X5 - CLVI - the average calving interval (determined for a maximum of three previous production seasons and expressed in days); X6 - GL - gestation length (in days); X7 - BCSI - body condition score index (in points); X8 - FCM - 4% fat-corrected milk yield (in kg); and X9 - FAT_PROT - the average milk fat and protein content (%). The body condition score index was calculated as the difference between BCS (assessed on a five-point scale) at AI and the average BCS during the previous production season. The raw body condition scores were first transformed by assuming the value of 3.50 points as the optimum level and subtracting the difference between 3.50 and any higher BCS value from this optimal score. The output (dependent) variable Y was the conception difficulty class (CONC), defined as good (63.83% of all records) if a cow conceived after one or two AI, or poor (36.17% of the dataset) if more than two AI per conception were required. Descriptive statistics for the continuous predictors and the distribution of the categorical predictor variables are shown in Tables 1 and 2, respectively.
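As an illustration only (not part of the original analysis), the coding of the outcome variable described above can be expressed in a few lines of Python; the variable name and the AI counts below are hypothetical placeholders, not data from the study.

```python
# Illustrative sketch of the CONC coding described above (good = conception
# after one or two AI, poor = more than two); the AI counts are placeholders.
import numpy as np

ai_per_conception = np.array([1, 2, 3, 5, 2])            # hypothetical services per conception
conc = np.where(ai_per_conception <= 2, "good", "poor")   # output variable Y (CONC)
print(conc)                                               # ['good' 'good' 'poor' 'poor' 'good']
```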
 

Table 1: Descriptive statistics for continuous predictors.


 

Table 2: Distribution of the categorical variables used in the study.


 
The whole dataset of AI records was randomly divided into two subsets: a training set (L) used to construct the boosted trees model (688 AI records, 75% of the whole dataset) and a test set (T) utilized for the verification of its predictive performance (230 AI records; 25% of the dataset). In addition, a validation set (V) randomly sampled from the L set (approximately 30% of training cases) was used to stop the training process at a minimum error in order to prevent over-fitting.
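A minimal sketch of this partitioning is given below, using scikit-learn as an assumed stand-in for the Statistica workflow actually employed; X and y are synthetic placeholders for the nine predictors and the CONC labels.

```python
# Sketch of the L/T split described above (scikit-learn assumed, not the
# software used in the study); X and y are synthetic placeholders with a
# class balance roughly matching the paper (~36% "poor").
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(918, 9))                                      # stand-in for the nine predictors
y = np.where(X[:, 0] + 0.5 * rng.normal(size=918) > 0.35,
             "poor", "good")                                       # stand-in for the CONC labels

# L (75%, n=688) for training and T (25%, n=230) for independent verification;
# the validation set V (~30% of L) is carved out inside the boosting sketch below.
X_learn, X_test, y_learn, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
```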
        
The parameters of the boosted trees algorithm considered in the analysis were the a priori probabilities of class membership (estimated from the L set), the misclassification costs (equal), the learning rate that defines the weight with which the single trees are added to the model (0.1), the proportion of randomly selected cases (AI records) for a single tree at each iteration of model development (50%) and the maximum number of single trees in the model (1000). The training process continued until reaching the lowest average multinomial deviance on the V set. The final boosted trees model was verified on the independent T set.
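These settings map onto a stochastic gradient boosting implementation roughly as follows; the sketch uses scikit-learn's GradientBoostingClassifier as an assumed analogue of the Statistica boosted-trees module and continues the splitting sketch above.

```python
# Assumed scikit-learn analogue of the boosted-trees settings listed above;
# a sketch only, not the Statistica implementation used in the study.
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(
    learning_rate=0.1,        # weight with which each single tree is added
    subsample=0.5,            # 50% of AI records drawn at random for each tree
    n_estimators=1000,        # upper limit on the number of single trees
    validation_fraction=0.3,  # internal validation set V (~30% of L)
    n_iter_no_change=10,      # stop when the validation deviance stops improving
    random_state=0,
)
model.fit(X_learn, y_learn)   # L set from the splitting sketch above
```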
 
In order to assess its predictive performance, sensitivity (Se; the percentage of correctly detected cows from the poor category), specificity (Sp; the percentage of correctly classified cows from the good category) and accuracy (Acc; the percentage of correctly classified cows from both categories) were calculated (Zaborski et al., 2014). In addition, the posterior probabilities of true positive response (PSTP) and true negative response (PSTN), which quantify prediction reliability (the proportion of cows classified by the model to the poor or good category that really belonged to this category), were determined. Finally, a receiver operating characteristic (ROC) curve, showing graphically the relationship between sensitivity and the level of false alarms (1-specificity) at different cut-off values, was plotted (Fenlon et al., 2017). The curve going through the [0,0], [0,1] and [1,1] points indicates the best predictive performance [the area under the curve (AUC) equal to unity], whereas the diagonal line going through the [0,0] and [1,1] points shows a complete lack of discrimination abilities (AUC=0.5) (Keskin et al., 2005).
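Continuing the same assumed sketch, the measures defined above can be computed from the confusion matrix on the T set, with the poor class treated as positive.

```python
# Sketch of the performance measures defined above ("poor" = positive class);
# continues the assumed scikit-learn example, not the authors' actual code.
from sklearn.metrics import confusion_matrix, roc_auc_score

y_pred = model.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=["good", "poor"]).ravel()

se = tp / (tp + fn)            # sensitivity: correctly detected "poor" cows
sp = tn / (tn + fp)            # specificity: correctly classified "good" cows
acc = (tp + tn) / (tp + tn + fp + fn)
pstp = tp / (tp + fp)          # reliability of a "poor" prediction (PSTP)
pstn = tn / (tn + fn)          # reliability of a "good" prediction (PSTN)

p_poor = model.predict_proba(X_test)[:, list(model.classes_).index("poor")]
auc = roc_auc_score((y_test == "poor").astype(int), p_poor)
```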
 
Importance analysis, which made it possible to order the predictor variables according to their relative contribution to the determination of the conception difficulty class, indicated the most influential factors affecting AI effectiveness (Yao et al., 2013). Data were analyzed statistically by using Statistica software (v 13.1, Dell Inc., Tulsa, OK, USA).
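In the same assumed sketch, a comparable ranking can be read off the fitted ensemble; Statistica's importance measure is analogous but not identical to the one shown here.

```python
# Sketch of a predictor-importance ranking from the fitted ensemble (assumed
# scikit-learn workflow; in the study, CLVI, GL and BCSI ranked highest).
import pandas as pd

names = ["HF", "LACT", "SEASON", "AGE", "CLVI", "GL", "BCSI", "FCM", "FAT_PROT"]
importance = pd.Series(model.feature_importances_, index=names)
print(importance.sort_values(ascending=False))
```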
The final boosted trees model comprised 28 single-split trees. The classification matrix for the T set is presented in Table 3. Se (the percentage of correctly detected cows from the poor category), Sp (the percentage of correctly classified cows from the good category) and Acc (the percentage of correctly classified cows from both categories) on the independent T set (a randomly selected part of the original dataset used for the objective verification of the model's predictive performance) were 82.93%, 84.46% and 83.91%, respectively. These values were relatively high in comparison with other reports, which means that the developed model was quite effective in detecting cows with conception difficulties and generated a small number of false positives. In the study by Grzesiak et al. (2010) on the classification of AI outcomes in dairy cows using different types of models [discriminant function analysis, logistic regression, artificial neural networks and multivariate adaptive regression splines (MARS)], the values of Se, Sp and Acc on the T set amounted to 77.78 - 87.30%, 79.31 - 85.06% and 78.67 - 86.00%, respectively, so they were similar to those obtained in the present study. In their next study on the same subject, using the naïve Bayes classifier and classification and regression trees (CART), Grzesiak et al. (2011) observed Se, Sp and Acc of 72.0 - 83.0%, 86.0 - 90.0% and 85.0 - 90.0%, respectively. In this case, the Se determined in the present study was within the range reported by Grzesiak et al., (2011), whereas Sp and Acc were somewhat lower.
 

Table 3: Classification matrix for the test set (n=230).


 
In contrast, Fenlon et al., (2017), who used a logistic regression model encompassing a relatively large number of predictor variables, obtained Sp values on the T set of 48.12% and 48.87% for the base (including the most significant variables from a univariate analysis) and final (including additional combinations of predictor variables) models, respectively. They also noted the narrow central range of conception rates in the studied dairy herds. Likewise, the percentage of correctly classified cases (60.7 - 72.3% and 63.5 - 73.6% for primiparous and multiparous cows, respectively) in Holsteins (Shahinfar et al., 2014), resulting from the application of different machine learning algorithms (naïve Bayes, Bayesian networks, decision trees and random forest), was generally lower than the accuracy observed in the present study.
 
The values of PSTP and PSTN (indicating the reliability of predictions made by the model) in the present study were 74.73% and 89.93%, respectively. This showed that the detection of cows with potential conception problems after AI was quite reliable and a relatively small number of false alarms (cows that did not experience any difficulties but were indicated as problematic by the model) was generated by the boosted trees classifier. The PSTP and PSTN values reported by Grzesiak et al., (2010) amounted to 73.13 - 80.88% and 83.13 - 90.24%, respectively, whereas those observed by Grzesiak et al., (2011) in their next study ranged from 72.46% to 76.79% and from 87.68% to 92.00%, respectively. So, PSTP and PSTN determined in the present study were within the ranges presented by the cited authors. However, the PSTN values observed by Fenlon et al., (2017) were 56.56% and 56.84% for the base and final logistic regression models, respectively, which means that only every second service classified by their predictive model as successful was really effective. These results were clearly inferior to ours due to the reasons mentioned above.
 
The ROC curve (showing the relationship between Se and the level of false alarms, 1-Sp, at different cut-off values) with an AUC of 0.89 is shown in Fig 1. The AUC obtained in the present study confirmed the quite good discrimination ability of the boosted trees model. The best cut-off point, ensuring the highest sensitivity at the lowest level of false alarms, was 0.45. The AUC values reported by Grzesiak et al., (2010) in their study on conception failure prediction in dairy cows ranged from 0.87 to 0.91. These results are in line with those of the present study, whereas the AUC values observed by Fenlon et al., (2017) were 0.61 and 0.62 for the base and final logistic regression models, respectively, which reflected their considerably lower predictive performance. Some previous attempts at predicting conception success to a given insemination in an Irish population of Holstein-Friesian, Jersey, Norwegian Red and crossbred cows, using six different machine learning methods (C4.5 decision trees, naïve Bayes classifier, Bayesian networks, support vector machines, random forest and rotation forest) and a more traditional logistic regression model, were made by Hempstalk et al., (2015). In general, the greatest AUC values (ranging from 0.50 to 0.67, depending on the set of predictors and testing variant) were found by the cited authors for logistic regression, whose predictive performance was superior to that of all the remaining models. In particular, random forest, which is an ensemble method related to boosted trees, was characterized by a lower AUC (0.49 - 0.68) than logistic regression in the cited study and than the boosted trees model developed in the present study (AUC=0.89). However, Shahinfar et al., (2014) recorded a higher AUC value (averaged over five folds of cross-validation and equal to 0.75) for their random forest model, which was clearly superior to all other tested machine learning methods.
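For illustration, such a cut-off can be located on the ROC curve by maximizing Youden's J statistic (Se + Sp - 1); this criterion is an assumption and may differ from the exact rule applied in the study.

```python
# Sketch: choosing a probability cut-off on the ROC curve via Youden's J;
# the criterion is an assumption, not necessarily the rule used in the study.
import numpy as np
from sklearn.metrics import roc_curve

fpr, tpr, thr = roc_curve((y_test == "poor").astype(int), p_poor)
best = np.argmax(tpr - fpr)                       # maximize Se - (1 - Sp)
print(f"cut-off={thr[best]:.2f}, Se={tpr[best]:.2f}, Sp={1 - fpr[best]:.2f}")
```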
 

Fig 1: A receiver operating characteristic (ROC) curve for the boosted classification trees model.



The most important predictor variables identified in the present study are shown in Fig 2. The average calving interval (CLVI) exerted the greatest influence on the conception difficulty class, followed by GL, BCSI, FAT_PROT and AGE, whereas the effect of the remaining predictor variables was much smaller. A significant influence of CLVI and GL on conception rate has already been described by Grzesiak et al., (2010), who reported that calving interval was the most important predictor of AI outcome indicated by all the classifiers investigated in their study (discriminant function analysis, logistic regression, neural networks and MARS). However, GL was usually ranked lower by the above-mentioned models (the third or fourth position, depending on the classifier). The BCSI variable was also relatively significant in the cited study (the second position in the majority of cases), while FAT_PROT and AGE were indicated only by MARS (the sixth and fourth position, respectively). In the study on the application of the naïve Bayes classifier and CART to the detection of cows with conception difficulties (Grzesiak et al., 2011), calving-to-conception interval was the most influential factor affecting the conception class, according to which the first split in the decision tree was made. The next two splits were based on calving interval (the most important variable in the present study) and BCSI (the third most significant predictor in the present study).
 

Fig 2: Predictor importance for the boosted classification trees model.


 
Fenlon et al., (2017) used the logistic regression model for the prediction of successful AI outcome and considered a much larger set of potential predictors at the first stage of research (a univariate analysis). The base model included six predictor variables (lactation number, days in milk, interservice interval, calving difficulty score and predicted transmitting abilities for calving interval and milk production), whereas the final one (showing the best predictive performance) additionally included BCS at service. From among these factors, the greatest effect on the probability of conception was exerted by days in milk (after logarithmic transformation, odds ratio equal to 2.81) and BCS at service (odds ratio equal to 28.35). Similar results were obtained by Hempstalk et al., (2015), who found that more days in milk and higher BCS were associated with a greater chance of conception, whereas increasing parity (corresponding to LACT in the present study) and the number of AI during the previous and current lactation resulted in a lower probability of successful AI outcome. The later months of the mating season (corresponding to SEASON) were associated with the lowest likelihood of conception. Shahinfar et al., (2014) also identified a herd average conception rate, the incidence of ketosis, the number of previous unsuccessful inseminations, days in milk at breeding and the occurrence of mastitis as the most influential predictors of AI outcome.
 
The HF, FCM, LACT and SEASON variables had the smallest influence on the probability of conception. The first of these (HF) was indicated as the fifth most important variable by the artificial neural network in the study by Grzesiak et al., (2010). Fat-corrected milk yield, on the other hand, was identified as the fifth most influential predictor variable by MARS in the same study. Lactation number was ranked second and seventh by the neural networks and MARS, respectively. Finally, SEASON was not indicated as an influential predictor by any of the models (Grzesiak et al., 2010). Also, in the study by Fenlon et al., (2017), parity (corresponding to LACT in the present study) and predicted transmitting ability for milk yield were included in the base and final prediction models.
The statistical method used in the present study (boosted classification trees) could be applied as an on-farm decision support tool after its implementation in a computer program in order to help farmers identify cows with potential conception difficulties in advance and thus limit the financial losses associated with decreased fertility.
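As a hypothetical example of such on-farm use (continuing the assumed sketch above), a cow would be flagged when the predicted probability of the poor class exceeds the chosen cut-off:

```python
# Hypothetical on-farm usage sketch: flag a cow when the predicted probability
# of the "poor" class exceeds the cut-off (0.45 in the present study).
new_cow = X_test[:1]                               # placeholder record with the nine predictors
p_new = model.predict_proba(new_cow)[0, list(model.classes_).index("poor")]
if p_new > 0.45:
    print("Potential conception difficulty - consider closer reproductive management")
else:
    print("No conception difficulty expected")
```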
This work was supported by the Polish Ministry of Science and Higher Education (grant number 517-01-028-3962/17).

  1. Bhat, F.A. and Bhattacharyya, H.K. (2012). Management of metritis in crossbred cattle of Kashmir using oxytetracycline, cephalexin and prostaglandin F2α. Indian J. Anim. Res., 46: 187–189.

  2. Bhattacharyya, H.K. and Hafiz, A. (2009). Treatment of delayed ovulation in dairy cattle. Indian J. Anim. Res., 43: 209–210.

  3. Fenlon, C., O’Grady, L., Doherty, M.L., Dunnion, J., Shalloo, L., Butler, S.T. (2017). The creation and evaluation of a model predicting the probability of conception in seasonal-calving, pasture-based dairy cows. J. Dairy Sci., 100: 5550–5563. DOI: 10.3168/jds.2016-11830.

  4. Friedman, J.H. (2002). Stochastic gradient boosting. Comput. Stat. Data Anal., 38: 367–378. DOI: 10.1016/S0167-9473(01)00065-2.

  5. Grzesiak, W., Zaborski, D., Sablik, P., Pilarczyk, R. (2011). Detection of difficult conceptions in dairy cows using selected data mining methods. Anim. Sci. Pap. Rep., 29: 293–302.

  6. Grzesiak, W., Zaborski, D., Sablik, P., Żukiewicz, A., Dybus, A., Szatkowska, I. (2010). Detection of cows with insemination problems using selected classification models. Comput. Electron. Agric., 74: 265–273. DOI: 10.1016/j.compag.2010.09.001.

  7. Hempstalk, K., McParland, S., Berry, D.P. (2015). Machine learning algorithms for the prediction of conception success to a given insemination in lactating dairy cows. J. Dairy Sci., 98: 5262–5273. DOI: 10.3168/jds.2014-8984.

  8. Keskin, M., Kurtoglu, S., Kendirci, M., Atabek, M.E., Yazici, C. (2005). Homeostasis model assessment is more reliable than the fasting glucose/insulin ratio and quantitative insulin sensitivity check index for assessing insulin resistance among obese children and adolescents. Pediatrics, 115: e500–e503. DOI: 10.1542/peds.2004-1921.

  9. Klementová, K., Filipčík, R., Hošek, M. (2017). The effect of ambient temperature on conception and milk performance in breeding Holstein cows. Acta Univ. Agric. Silvic. Mendel. Brun., 65: 1515–1520. DOI: 10.11118/actaun201765051515.

  10. Mokhtari, M.S., Moradi Shahrbabak, M., Nejati Javaremi, A., Rosa, G.J.M. (2016). Relationship between calving difficulty and fertility traits in first-parity Iranian Holsteins under standard and recursive models. J. Anim. Breed. Genet., 133: 513–522. DOI: 10.1111/jbg.12212.

  11. Shahinfar, S., Page, D., Guenther, J., Cabrera, V., Fricke, P., Weigel, K. (2014). Prediction of insemination outcomes in Holstein dairy cattle using alternative machine learning algorithms. J. Dairy Sci., 97: 731–742. DOI: 10.3168/jds.2013-6693.

  12. Wang, J., Berzins, K., Hicks, D., Melkers, J., Xiao, F., Pinheiro, D. (2012). A boosted-trees method for name disambiguation. Scientometrics, 93: 391–411. DOI: 10.1007/s11192-012-0681-1.

  13. Yao, C., Spurlock, D.M., Armentano, L.E., Page, C.D., VandeHaar, M.J., Bickhart, D.M., Weigel, K.A. (2013). Random Forests approach for identifying additive and epistatic single nucleotide polymorphisms associated with residual feed intake in dairy cattle. J. Dairy Sci., 96: 6716–6729. DOI: 10.3168/jds.2012-6237.

  14. Zaborski, D., Grzesiak, W., Kotarska, K., Szatkowska, I., Jedrzejczak, M. (2014). Detection of difficult calvings in dairy cows using boosted classification trees. Indian J. Anim. Res., 48: 452-458. DOI: 10.5958/0976-0555.2014.00010.7.
