This paper uses raw agricultural data collected from the district agricultural office at Krishnagiri. The raw data set is then preprocessed and analysed using the R tool. The paper describes the process of retrieving the required data from the given soil data sets. The decision tree algorithm is applied to the given soil data set with the following steps.
1. Agricultural data is used as the input to the system.
2. The data are preprocessed, i.e. unwanted data are removed or required information is added, to make data mining easier.
3. The preprocessed data are used as input for further implementation.
4. Two data sets are created from the available data, one for training and the other for testing.
5. The decision tree algorithm and the C5.0 algorithm are applied (a minimal sketch of this workflow follows the list).
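The five steps above can be sketched in a few lines of R. This is only an illustrative outline, assuming the soil data sit in a CSV file named sdata.csv with a categorical crop column; the full listing appears later in this paper.
library(rpart) # decision tree
library(C50)   # C5.0
sdata <- read.csv("sdata.csv", header = TRUE)  # step 1: input data
sdata <- na.omit(sdata)                        # step 2: a simple form of preprocessing
set.seed(1234)                                 # steps 3-4: reproducible 70/30 split
ind <- sample(2, nrow(sdata), replace = TRUE, prob = c(0.7, 0.3))
train <- sdata[ind == 1, ]
test  <- sdata[ind == 2, ]
tree <- rpart(crop ~ ., data = train, method = "class") # step 5: decision tree
c50  <- C5.0(crop ~ ., data = train)                    # step 5: C5.0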
The data used for this paper were obtained for the year 2017 from a few taluks of Krishnagiri district of Tamil Nadu, India. The preliminary data collection was carried out for all the taluks of the district. Micronutrients are essential for plant growth and play an important role in balanced crop nutrition (Samundeeswari et al., 2017). They include copper (Cu), iron (Fe), manganese (Mn) and zinc (Zn). They are as important to plant nutrition as primary and secondary macronutrients, though plants require them in much smaller amounts. A lack of any one micronutrient in the soil can limit growth, even when all other nutrients are present in adequate amounts.
Decision tree algorithm
The decision tree algorithm is one of the supervised learning algorithms and can be used to solve both classification and regression problems. It builds a model from a training data set, which is then used to predict crop yield inferred from prior (training) data. The problem is solved using a tree representation. The primary challenge in implementing a decision tree is identifying which attribute to use as the root node and at each level. Different attribute selection measures exist to identify the attribute to be considered at each level.
The popular attribute selection measures are (a toy computation of both follows this list):
· Information gain
· Gini index
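As an illustration only (the numbers below describe a hypothetical split, not the Krishnagiri data), both measures can be computed in a few lines of R for a node holding 6 "yes" and 4 "no" samples that is split into children of size 6 and 4:
entropy <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) } # Shannon entropy
gini    <- function(p) 1 - sum(p^2)                         # Gini impurity
parent <- c(yes = 6, no = 4) / 10  # class proportions at the node
left   <- c(yes = 5, no = 1) / 6   # proportions in the left child (6 samples)
right  <- c(yes = 1, no = 3) / 4   # proportions in the right child (4 samples)
info_gain <- entropy(parent) - (0.6 * entropy(left) + 0.4 * entropy(right))
gini_drop <- gini(parent)    - (0.6 * gini(left)    + 0.4 * gini(right))
info_gain; gini_drop # about 0.256 and 0.163; higher values indicate a better split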
C5.0 algorithm
The C5.0 algorithm is the advanced version of C4.5. In the present study, the C5.0 decision tree algorithm is used as a popular and effective classifier for crop yield on the soil data set. C5.0 is an algorithm used to generate a decision tree (Hari Ganesh et al., 2015). The decision trees generated by C5.0 can be used for classification. C5.0 builds decision trees from a set of training data using the concept of information entropy; the training data is a set of already classified samples. At each node of the tree, C5.0 chooses the attribute that most effectively splits its set of samples into subsets enriched in one class or the other. The splitting criterion is the normalized information gain (difference in entropy), and the attribute with the highest normalized information gain is chosen to make the decision.
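Reusing the entropy helper and toy split from the sketch above, the normalized information gain (gain ratio) used by C5.0 can be illustrated as:
split_info <- entropy(c(0.6, 0.4)) # entropy of the split proportions themselves
gain_ratio <- info_gain / split_info # C4.5/C5.0-style normalized information gain
gain_ratio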
R programming
R (R Core Team, 2015b) is a free software environment for statistical computing and graphics. It provides a wide variety of statistical and graphical techniques, and can be easily extended with the 7,324 packages available on CRAN (as of October 20, 2015).
To help users find out which R packages to use, the CRAN Task Views are a good guide: they provide collections of packages for different tasks (a way to install a whole Task View is sketched after this list). Some Task Views related to data mining are:
➢ Machine Learning and Statistical Learning,
➢ Cluster Analysis and Finite Mixture Models
➢ Time Series Analysis,
➢ Natural Language Processing,
➢ Multivariate Statistics,
➢ Analysis of Spatial Data
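One convenient way to install every package of a Task View is the ctv package; a short sketch, assuming the CRAN name of the Machine Learning view ("MachineLearning"):
install.packages("ctv")
library(ctv)
install.views("MachineLearning") # installs all packages in the Machine Learning Task View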
RStudio
RStudio is an integrated development environment (IDE) for R that runs on various operating systems such as Windows, Mac OS X and Linux. It is a very useful and powerful tool for R programming, and readers are therefore suggested to use it. What you normally need is the RStudio Desktop open source edition, which is free of charge. When RStudio is launched for the first time, four panels are shown (Kanjana Devi et al., 2016).
➢ Source panel (top left), which shows your R source code. If you cannot see the source panel, you can open it by clicking menu “File”, “New File” and then “R Script”. You can run a line or a selection of R code by clicking the “Run” button at the top of the source panel, or by pressing “Ctrl + Enter”.
➢ Console panel (bottom left), which shows outputs and system messages displayed in a normal R console.
➢ Environment/History/Presentation panel (top right), whose three tabs respectively show all objects and functions loaded in R, a history of submitted R code, and presentations generated with R.
➢ Files/Plots/Packages/Help/Viewer panel (bottom right), whose tabs show respectively a list of files, plots, R packages installed, help documentation and local web content
(Hemageetha et al., 2016).
It is always good practice to begin R programming with an RStudio project, which is a folder to hold your R code, data files and figures. To create a new project, click the “Project” button at the top-right corner and then choose “New Project”. After that, select “Create project from new directory” and then “Empty Project”. After typing a directory name, which will also be your project name, click “Create Project” to create your project folder and files. If you open an existing project, RStudio will automatically set the working directory to the project directory, which is very convenient. After that, create three folders as below:
➢ Code, where to put your R source code.
➢ Data, where to put your datasets.
➢ Figures, where to put produced diagrams.
In addition to the above three folders, which are useful for most projects, you may create the additional folders below, depending on your project and preference (a snippet that creates all six folders follows this list):
➢ Raw data, where to put all raw data.
➢ Models, where to put all produced analytics models and
➢ Reports, where to put your analysis reports.
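These folders can also be created from within R itself; a small sketch (the lower-case folder names here are an arbitrary choice, not prescribed by RStudio):
for (d in c("code", "data", "figures", "rawdata", "models", "reports")) {
  if (!dir.exists(d)) dir.create(d) # create the folder only if it does not exist yet
}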
Coding for soil data using decision tree and C5.0
sdata <- read.csv("d:/data/sdata.csv", header = TRUE)
View(sdata) # view total data
summary(sdata) #data summary
dim(sdata) # dimension of the data
names(sdata) # display header of the data
library(Hmisc) # descriptive statistics; the four packages below are its dependencies
library(lattice)
library(survival)
library(Formula)
library(ggplot2)
range(sdata$CA) # range
mean(sdata$CA) # mean
quantile(sdata$CA) # quartiles and percentiles
var(sdata$CA) # variance
hist(sdata$CA) # histogram
library(magrittr) ## for pipe operations
sdata$CA %>% density() %>%
  plot(main = "Density of CA") # density plot of CA
library(dplyr)
sdata2 <- sdata %>% sample_n(50) # random sample of 50 rows
sdata2$crop %>% table() %>% pie() # pie chart of crop frequencies
tab <- sdata2$crop %>% table()
percentages <- tab %>% prop.table() %>% round(3) * 100
txt <- paste0(names(tab), " ", percentages, "%")
pie(tab, labels = txt) # pie chart with percentage labels
cov(sdata$CA, sdata$PH) # covariance
cor(sdata$CA, sdata$PH) # correlation
cov(sdata[, 2:10]) # covariance matrix of columns 2 to 10
boxplot(PH ~ crop, data = sdata)
with(sdata, plot(CA, MG, col = crop, pch = as.numeric(crop)))
#pairs(sdata)
library(scatterplot3d)
scatterplot3d(sdata$N, sdata$P, sdata$K) # 3D scatter plot of N, P and K
#Decision tree
dim(sdata)
head(sdata, 7)
set.seed(1234)
ind <- sample(2, nrow(sdata), replace=TRUE, prob=c(0.7, 0.3))
sdata.train <- sdata[ind==1,]
sdata.test <- sdata[ind==2,]
# train a decision tree
library(rpart)
myFormula <- crop ~ N + P + CA + MG # classify the crop from soil attributes
sdata.rpart <- rpart(myFormula, data = sdata.train, method = "class",
                     control = rpart.control(minsplit = 10))
predict_unseen <- predict(sdata.rpart, sdata.test, type = "class")
table_mat <- table(sdata.test$crop, predict_unseen) # confusion matrix on the test set
table_mat
accuracy_Test <- sum(diag(table_mat)) / sum(table_mat) # proportion correctly classified
print(paste("Accuracy for test", accuracy_Test))
print(sdata.rpart$cptable) # complexity parameter table with cross-validated errors
library(rpart.plot)
#rpart.plot(sdata.rpart, extra = 104) # alternative: show per-class probabilities
rpart.plot(sdata.rpart)
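The cptable printed above also supports pruning. As a follow-up sketch (not part of the original listing), the tree can be cut back at the complexity parameter with the lowest cross-validated error:
opt <- which.min(sdata.rpart$cptable[, "xerror"]) # row with the lowest xerror
cp  <- sdata.rpart$cptable[opt, "CP"]
sdata.prune <- prune(sdata.rpart, cp = cp) # pruned tree
rpart.plot(sdata.prune)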
# C5.0 algorithm
library(C50)
# fit a boosted C5.0 model with 10 boosting trials on the full data set
fit <- C5.0(crop ~ ., data = sdata, trials = 10)
# summarize the fit
print(fit)
# make predictions
predictions <- predict(fit, sdata)
# summarize accuracy
table(predictions, sdata$crop)
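The overall accuracy can be read off this table in the same way as for rpart; note that it is a resubstitution accuracy, since the model is evaluated on its own training data:
tab_c50 <- table(predictions, sdata$crop)
sum(diag(tab_c50)) / sum(tab_c50) # proportion of correctly classified samples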
#Bagging CART
library(ipred)
# fit model
fit <- bagging(crop~., data=sdata)
# summarize the fit
summary(fit)
# make predictions
predictions <- predict(fit, sdata, type = "class")
# summarize accuracy
table(predictions, sdata$crop)
summary(predictions)
library(caret)
# usage of caret::confusionMatrix, shown here for reference:
# confusionMatrix(data, reference, positive = NULL,
#                 dnn = c("Prediction", "Reference"), prevalence = NULL,
#                 mode = "sens_spec", ...)
lvs <- c("positive", "negative")
truth <- factor(rep(lvs, times = c(30, 2000)),
                levels = rev(lvs))
pred <- factor(c(
  rep(lvs, times = c(20, 10)),
  rep(lvs, times = c(180, 1820))),
  levels = rev(lvs))
xtab <- table(pred, truth)
print(confusionMatrix(xtab[2:1, 2:1]))