Image dataset acquisition
The experimental field was located in Hebei, China, where soybean plants were imaged under natural lighting conditions. The images were collected with the help of local farmers and agricultural experts. The dataset includes images of both healthy and wilted soybean plants, all taken from a single field with the same crop variety at a uniform growth stage. Wilting was observed under natural arid conditions caused by weather stress; no artificial drought or stress was induced during the experiment.
Training convolutional neural networks to identify soybean wilting requires a systematic and well-organized image dataset. The dataset used in this work comprises 6704 photos taken in soybean fields (Fig 1), each tagged with a label from a set of independently defined classes. The images cover five distinct stages of wilting, each labeled with a number from 0 to 4. Class 0 signifies healthy leaves without any wilting. Class 1 captures leaflets folding inward without a loss of firmness in the petioles or leaflets. In class 2, a slight loss of firmness is observed in the leaflet petioles of the upper canopy, whereas class 3 indicates a moderate loss of firmness in the upper canopy. Finally, class 4 denotes a severe loss of firmness throughout the entire canopy.
To train and evaluate the classifier on images of healthy and wilted soybean plants, the data were divided into three parts: training, validation and testing. Using an 80:10:10 ratio, 80% of the images were allocated for training, 10% for validation and the remaining 10% for testing the model's performance.
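The 80:10:10 partition can be illustrated with a short sketch. The source does not state how images were assigned to each part, so the random shuffle, the seed, the function name `split_dataset` and the file names below are assumptions made purely for illustration.

```python
import random

def split_dataset(items, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle items and split them into train/validation/test lists."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility (assumption)
    n_train = int(ratios[0] * len(shuffled))
    n_val = int(ratios[1] * len(shuffled))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # remainder goes to the test part
    return train, val, test

# Hypothetical file names standing in for the 6704 field photos.
paths = [f"img_{i:04d}.jpg" for i in range(6704)]
train, val, test = split_dataset(paths)
print(len(train), len(val), len(test))  # 5363 670 671
```

Because 6704 is not divisible by 10, the three parts cannot be exactly 80:10:10; in this sketch the remainder is assigned to the test part.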
Image preprocessing (annotation and labeling)
Uniform image size is important for reliable input data and computational efficiency during model training. Image preprocessing involves two primary operations: resizing and rescaling. Resizing standardizes all images to a consistent dimension, because differing dimensions can significantly affect model performance. Working with large image datasets in deep learning poses several challenges: higher power consumption, longer training times, the risk of overfitting the model to the training data and the possibility of exhausting the memory of the graphics processing unit (GPU). To mitigate these challenges, the images were resized using bicubic interpolation, implemented with Python libraries such as OpenCV. Downsampling in this way reduces the image size while preserving most of the visual information, which makes the data more manageable computationally. In this work, the images were scaled to a standard size of 224 × 224 pixels.
Image filtering techniques, such as the median filter, were applied to reduce image noise. This step mitigates focus issues and corrects undesirable artifacts captured during image acquisition. In addition, a low-pass filter attenuates high-frequency components while preserving low-frequency information. The images were then converted to grayscale. After that, the soybean plants were categorized according to their health condition, and each image was labeled with one of the classes 0 to 4.
Data augmentation
Python libraries, particularly Keras, were employed for data augmentation in the training of the CNN model. During training, each image fed into the network is generated from an original image. This creates additional data from the existing training samples and thereby increases the size of the training set.
During augmentation, each class (0 to 4) has a distinct set of image statistics. Class 0 contains 448 original images and 159 augmented images, for a total of 607 images. Class 1 consists of 894 original images supplemented by 358 augmented images, bringing its count to 1252 images. Class 2 starts with 1340 original images and gains 357 augmented images, for a total of 1697 images. Class 3 comprises 1788 original images augmented by 291, yielding 2079 images. Lastly, the 2234 original images of Class 4 are expanded with 289 augmented images, resulting in 2523 images. Consequently, the number of images after augmentation across all classes is 8158 (Fig 2).
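The per-class figures can be tallied with a few lines of Python. The counts are taken directly from the paragraph above; the variable names are illustrative.

```python
# (original, augmented) image counts per wilting class, as reported in the text.
counts = {
    0: (448, 159),
    1: (894, 358),
    2: (1340, 357),
    3: (1788, 291),
    4: (2234, 289),
}

per_class_total = {c: orig + aug for c, (orig, aug) in counts.items()}
n_original = sum(orig for orig, _ in counts.values())
n_augmented = sum(aug for _, aug in counts.values())

print(per_class_total)           # {0: 607, 1: 1252, 2: 1697, 3: 2079, 4: 2523}
print(n_original, n_augmented)   # 6704 1454
print(n_original + n_augmented)  # 8158
```

The tally also shows that the classes remain imbalanced after augmentation: class 4 has roughly four times as many images as class 0.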
Libraries and software tools
To determine the most suitable tool for implementing the CNN algorithm for categorizing soybean wilting images, an analysis of available software tools and their associated libraries was conducted. For the proposed CNN model, the Python programming language was chosen, together with its TensorFlow, Keras, NumPy, Matplotlib and OpenCV libraries. These tools and libraries provide a reliable environment for developing and experimenting with the CNN algorithm and support efficient image processing and analysis of the wilting image dataset.
Performance evaluation metrics
The performance of the detection and classification model is measured by the number of test images that the proposed model classifies correctly and incorrectly. Each classifier is assessed with statistical measurements based on the actual and predicted labels and the error rate. The performance evaluation metrics are defined as:
Accuracy
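It is the ratio of correct predictions, comprising true positives (TP) and true negatives (TN), to the total number of predictions, which also includes false positives (FP) and false negatives (FN):
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$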

Precision
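It measures the accuracy of positive predictions by dividing the number of correct positive outputs (true positives, TP) by the total number of predicted positive labels (true positives plus false positives, FP):
$$\text{Precision} = \frac{TP}{TP + FP}$$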
Recall
It expresses the ratio of true positives (TP) to the total number of actual positive instances in the dataset (true positives plus false negatives, FN):
$$\text{Recall} = \frac{TP}{TP + FN}$$
F1-score
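It is the harmonic mean of precision and recall and signifies the balance between them, with 1 denoting perfect performance and 0 indicating total failure:
$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$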
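For a five-class problem, these metrics are typically computed per class from a confusion matrix, treating each class in turn as the positive one. The sketch below assumes this one-vs-rest reading, which the source does not spell out; the function names and the toy labels are hypothetical.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """cm[t][p] counts samples of true class t that were predicted as class p."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

def evaluate(cm):
    """Return overall accuracy and per-class (precision, recall, F1)."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[c][c] for c in range(n)) / total
    report = {}
    for c in range(n):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(n)) - tp  # predicted c, actually another class
        fn = sum(cm[c]) - tp                       # actually c, predicted another class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[c] = (precision, recall, f1)
    return accuracy, report

# Toy labels for the five wilting classes (illustrative only, not results).
y_true = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
y_pred = [0, 1, 1, 1, 2, 2, 3, 4, 4, 4]
accuracy, report = evaluate(confusion_matrix(y_true, y_pred, 5))
print(accuracy)  # 0.8
for cls, (p, r, f1) in report.items():
    print(cls, round(p, 3), round(r, 3), round(f1, 3))
```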

System design
System architecture
The flow chart adopted for the implementation of the sequential CNN model for soybean wilting detection is presented in Fig 3.
Training model
A multilayer CNN model was developed to detect and distinguish the stages of soybean wilting. It uses convolutional layers to extract features, max-pooling layers for downsampling and dense layers for the final classification. The model aims to efficiently identify patterns and characteristics in soybean images to differentiate between healthy and wilted plants.
• Input layer: The input layer of the CNN model receives RGB (red, green and blue) images of 224 × 224 pixels, the size produced by the resizing step, and is used to differentiate between healthy and wilted plants across the five classes. This layer passes the image data to the first convolutional layer.
• Convolutional layer: The main function of this layer is to extract important features from the input images. It slides filters (kernels) over the incoming image and applies a convolution operation to extract relevant information.
For an input image (I) and a kernel (K), the 2D convolution operation is represented as follows:
$$(I * K)(i, j) = \sum_{m} \sum_{n} I(i - m, j - n)\, K(m, n)$$
where m and n stand for the kernel (K) coordinates and i and j for the image (I) coordinates.
• Max-pooling layer: This layer creates a smaller matrix by selecting the maximum value from each window of the filtered image, which reduces the spatial size of its input. It also helps to limit overfitting.
• Fully connected layer (FCL): The FCLs, also referred to as dense layers, operate on the output of the convolutional layers after it has been flattened into a one-dimensional array. There are five neurons in the classifier.
• Output layer: The output layer comprises five neurons with a softmax activation, one neuron per wilting class. The softmax activation function is chosen for this layer because it is recommended for multi-class classification tasks.
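The layer types listed above can be illustrated with a small NumPy forward pass on a single-channel toy image. This is a didactic sketch only: the 8 × 8 input, the random 3 × 3 kernel, the ReLU activation and the random dense weights are all assumptions, and deep learning frameworks usually implement the convolution step as cross-correlation (without flipping the kernel).

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2D convolution: the kernel is flipped, then slid over the image."""
    kh, kw = kernel.shape
    flipped = kernel[::-1, ::-1]
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * flipped)
    return out

def max_pool2d(x, size=2):
    """Non-overlapping max pooling; trailing rows/columns that do not fit are dropped."""
    h, w = x.shape[0] // size, x.shape[1] // size
    trimmed = x[:h * size, :w * size]
    return trimmed.reshape(h, size, w, size).max(axis=(1, 3))

def softmax(z):
    """Numerically stable softmax over a 1D vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
image = rng.random((8, 8))                         # toy single-channel input
kernel = rng.standard_normal((3, 3))               # stands in for a learned filter
features = np.maximum(conv2d(image, kernel), 0.0)  # ReLU (assumption), shape (6, 6)
pooled = max_pool2d(features)                      # shape (3, 3)
flat = pooled.ravel()                              # flattened to 9 values
weights = rng.standard_normal((5, flat.size))      # dense layer with five neurons
probs = softmax(weights @ flat)                    # one probability per class 0 to 4
print(pooled.shape, probs.shape, int(np.argmax(probs)))
```

The predicted class is the index of the largest softmax probability; the five probabilities always sum to 1.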
Feature extraction by proposed model
A convolutional neural network (CNN) is built as an integration of two core modules: convolutional and pooling layers that extract complex patterns, followed by fully connected layers that perform the classification (Fig 4). Together, these elements assign class probabilities to an input image, which determine the image's classification. Color is an important feature when identifying soybean wilting: color variance is a defining factor for crop wilting identification in convolutional approaches and is therefore carefully evaluated because of its noticeable impact. Deep learning (DL) achieves remarkable accuracy by learning features and classifiers automatically, in contrast to typical machine learning algorithms. The feature map is computed by:
$$y_i = b + \sum_{k} w_{ik} * x_k$$
where
y_i = ith output feature map,
b = bias term,
w_ik = filter applied to the input,
x_k = kth channel of the input image,
and * denotes the 2D convolution operation.
CNN for classification with hyperparameters
A sequential CNN model is designed to classify healthy and wilted soybean leaves. It performs well with image datasets. The batch size is set to 64, which determines the number of training examples processed in each training iteration. A larger batch size can improve computational efficiency but may also increase memory requirements. The images were resized to 224 × 224 pixels with three color channels (RGB). The sequential model is constructed layer by layer for effective image classification. The initial layer is the VGG16 architecture (Functional), which generates a 3D tensor output of dimensions (7, 7, 512) and contributes 14,714,688 parameters. After this, a Conv2D layer performs 2D spatial convolution, resulting in an output shape of (3, 3, 512) and 6,554,112 parameters. Subsequently, a MaxPooling2D layer executes max pooling, producing an output shape of (1, 1, 512). Next, the GlobalAveragePooling2D layer transforms this output into a 1D tensor with 512 dimensions. The model then has seven dense layers, listed here as (output units, parameters): dense_1 (512, 262,656), dense_2 (256, 131,328), dense_3 (128, 32,896), dense_4 (64, 8,256), dense_5 (32, 2,080), dense_6 (16, 528) and dense_7 (5, 85). These layers serve as the classifier, applying linear transformations and activation functions.
Three dropout layers, which have no trainable parameters, are interspersed throughout the model. These layers introduce randomness during training to reduce overfitting. Two batch normalization layers were used to normalize and scale the layer inputs, contributing 1,024 and 256 parameters. The model has a total of 21,707,909 parameters, including 14,072,005 trainable and 7,635,904 non-trainable parameters.
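The reported parameter counts can be cross-checked with plain arithmetic. The sketch below rests on inferences that the source does not state explicitly: a 5 × 5 kernel without padding for the extra Conv2D layer (which maps 7 × 7 to 3 × 3 and matches the reported count), batch normalization applied to 256 and 64 features, and VGG16 blocks 1 to 4 being frozen. It is one reading that is consistent with the numbers, not a confirmed description of the authors' configuration.

```python
def conv2d_params(kernel, in_ch, out_ch):
    """Weights plus one bias per output channel for a square kernel."""
    return kernel * kernel * in_ch * out_ch + out_ch

def dense_params(n_in, n_out):
    """Weight matrix plus one bias per output unit."""
    return n_in * n_out + n_out

# Standard VGG16 convolutional base: (filters, number of 3x3 conv layers) per block.
VGG16_BLOCKS = [(64, 2), (128, 2), (256, 3), (512, 3), (512, 3)]

def vgg16_block_params():
    params, in_ch = [], 3
    for filters, n_convs in VGG16_BLOCKS:
        block = 0
        for _ in range(n_convs):
            block += conv2d_params(3, in_ch, filters)
            in_ch = filters
        params.append(block)
    return params

blocks = vgg16_block_params()
vgg16_total = sum(blocks)                     # 14,714,688

extra_conv = conv2d_params(5, 512, 512)       # inferred 5x5 kernel: 6,554,112

widths = [512, 512, 256, 128, 64, 32, 16, 5]  # GAP output followed by dense_1..dense_7
dense = [dense_params(a, b) for a, b in zip(widths, widths[1:])]

bn_features = [256, 64]                       # inferred from 1,024 and 256 parameters
bn = [4 * f for f in bn_features]             # gamma, beta, moving mean, moving variance

total = vgg16_total + extra_conv + sum(dense) + sum(bn)
# Non-trainable: frozen VGG16 blocks 1-4 (assumption) plus the BN moving statistics.
non_trainable = sum(blocks[:4]) + sum(2 * f for f in bn_features)
trainable = total - non_trainable

print(total, trainable, non_trainable)  # 21707909 14072005 7635904
```

Under these assumptions, every figure quoted in the text is reproduced exactly, including the split between trainable and non-trainable parameters.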