The experiment was conducted from June 2024 to March 2026 at the Department of Computer Science and Engineering, Annamalai University, Tamil Nadu, India. This section describes the modules of the proposed approach. The overall pipeline is illustrated in Fig 1.
Dataset description
Plant village dataset
The Plant Village dataset contains 54,303 images of healthy and diseased leaves (Fig 2), categorized into 38 sets that cover 14 crops and 24 diseases. The customized dataset used in this work has 52 classes. Each image is in RGB format with dimensions of 256 × 256 × 3. After data pre-processing, 6,646 images from the Plant Village dataset are used for evaluation (3,323 images for the test set and 3,323 for the validation set). After augmentation to 200 images per class, the resulting 10,400 images across 52 classes form the training set. In total, 17,046 images are used in this research work. The dataset is available at: https://www.kaggle.com/emmarex/plantdisease.
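The split sizes quoted above can be cross-checked with a few lines of arithmetic. This is only a sanity check on the stated figures; the constant and function names are illustrative and do not come from the original work.

```python
# Figures quoted in the dataset description.
NUM_CLASSES = 52
AUGMENTED_PER_CLASS = 200   # training images per class after augmentation
TEST_IMAGES = 3323
VALIDATION_IMAGES = 3323


def split_sizes():
    """Return (train, validation, test, total) image counts."""
    train = NUM_CLASSES * AUGMENTED_PER_CLASS
    total = train + VALIDATION_IMAGES + TEST_IMAGES
    return train, VALIDATION_IMAGES, TEST_IMAGES, total


train, val, test, total = split_sizes()
print(f"train={train}, val={val}, test={test}, total={total}")
# prints: train=10400, val=3323, test=3323, total=17046
```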
Image pre-processing
The source images are in RGB (red, green and blue) format and include noisy samples. To enhance the quality of the input images, the method applies image pre-processing operations including scaling, edge detection and noise removal. Data augmentation (Adnan et al., 2023) is also carried out so that the model learns image features more robustly and predicts more accurately. Pre-processing also helps to reduce the computational load and allows the dataset to be processed more efficiently.
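As a rough illustration of the augmentation step, the sketch below applies a horizontal flip, a 90-degree rotation and a centre zoom to a single-channel image stored as a list of rows. The specific angle, the zoom factor and the function names are assumptions made for this example; the exact augmentation settings are not given in the text, and an RGB image would be handled by applying the same operation to each channel.

```python
def flip_horizontal(image):
    """Mirror a 2D image (list of rows) left to right."""
    return [row[::-1] for row in image]


def rotate_90_clockwise(image):
    """Rotate a 2D image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def zoom_center(image, factor=2):
    """Crop the central 1/factor region and enlarge it by pixel repetition."""
    h, w = len(image), len(image[0])
    ch, cw = h // factor, w // factor
    top, left = (h - ch) // 2, (w - cw) // 2
    crop = [row[left:left + cw] for row in image[top:top + ch]]
    zoomed = []
    for row in crop:
        wide = [p for p in row for _ in range(factor)]      # repeat columns
        zoomed.extend(list(wide) for _ in range(factor))    # repeat rows
    return zoomed


image = [[0, 1, 2, 3],
         [4, 5, 6, 7],
         [8, 9, 10, 11],
         [12, 13, 14, 15]]
augmented = [flip_horizontal(image), rotate_90_clockwise(image), zoom_center(image)]
```

Each call returns a new image and leaves the input untouched, so one source image can yield several training samples.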
Proposed MS-ViT
The general architecture of the proposed work is shown in Fig 3. A multi-scale vision transformer-based architecture is proposed to detect diseases in plant leaves. It uses a YOLOx-based detection head to construct a feature pyramid network (FPN), extract multi-scale features and predict the location and type of disease on plant leaves. Self-attention operates on three groups of inputs: queries (Q), keys (K) and values (V). The degree of similarity between the query and key vectors determines the weights of a weighted sum over the value vectors. The scaled dot-product attention (Vaswani et al., 2017) is defined in Eq. (1):
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V$ ...(1)
Where,
$d$ = The scaling factor.
$Q$ = Made up of $n_q$ query vectors.
$K$ and $V$ = Key and value vectors, respectively, with $n_k$ vectors each.
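A minimal sketch of scaled dot-product attention on small Python lists may help make the definition concrete. It assumes the standard form from Vaswani et al. (2017), with d taken as the length of each key vector; the function names are illustrative.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Q: n_q x d, K: n_k x d, V: n_k x d_v (lists of lists). Returns n_q x d_v."""
    d = len(K[0])  # assumed: d is the key dimension
    output = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output


Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [1.0, 0.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(scaled_dot_product_attention(Q, K, V))
```

Because both keys match the query equally in this example, the two value vectors receive equal weight and the output is their average.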
Global self-attention compares every patch with every other patch, so its cost grows quadratically with the number of patches. Window-based multi-head self-attention (W-MSA) (Liu and Zhang, 2025) is used to solve this problem; it executes self-attention inside localized, non-overlapping windows. After the initial patch partitioning, the patches are divided into windows and self-attention is calculated only within each window. As a result, complexity is reduced without sacrificing the ability to learn local features.
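The window partitioning and the resulting saving can be sketched as follows. The window size of 4 and the 8 × 8 token grid are hypothetical values chosen for the example, since the text does not state the window size; the function names are likewise illustrative.

```python
def window_partition(grid, window_size):
    """Split an H x W grid of tokens into non-overlapping square windows.

    Returns a list of windows, each a flat list of window_size**2 tokens.
    """
    h, w = len(grid), len(grid[0])
    if h % window_size or w % window_size:
        raise ValueError("grid size must be divisible by the window size")
    windows = []
    for top in range(0, h, window_size):
        for left in range(0, w, window_size):
            windows.append([grid[r][c]
                            for r in range(top, top + window_size)
                            for c in range(left, left + window_size)])
    return windows


def attention_pair_counts(h, w, window_size):
    """Query-key pairs evaluated by global attention vs windowed attention."""
    tokens = h * w
    global_pairs = tokens ** 2
    window_pairs = tokens * window_size ** 2
    return global_pairs, window_pairs


grid = [[r * 8 + c for c in range(8)] for r in range(8)]
windows = window_partition(grid, 4)
print(len(windows), attention_pair_counts(8, 8, 4))
# prints: 4 (4096, 1024)
```

Global attention scales with the square of the token count, while windowed attention scales linearly with it for a fixed window size, which is where the reduction in complexity comes from.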
Overall architecture of MSViT
In the proposed approach for plant leaf disease diagnosis, the input image of the ViT is first separated into patches, and each patch is encoded and represented by a learnable feature vector. The input image is denoted as W × H × C', where W is the width, H is the height and C' is the number of channels. The default value for the patch size is 4. Following patch partitioning and embedding, the initial input is transformed into $\frac{W}{4} \times \frac{H}{4} \times C$, with C being the embedding dimension. The input sequence length is then $\frac{W}{4} \times \frac{H}{4}$.
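The effect of patch partitioning on tensor shapes can be illustrated with a short calculation. It assumes that each non-overlapping 4 × 4 patch becomes one token, and it uses an embedding dimension of 128 (the first stage width listed later in this section) purely as an example value; the function name is illustrative.

```python
def patch_embedding_shape(width, height, embed_dim, patch_size=4):
    """Return ((grid_w, grid_h, embed_dim), sequence_length) after patch embedding."""
    if width % patch_size or height % patch_size:
        raise ValueError("image size must be divisible by the patch size")
    grid_w = width // patch_size
    grid_h = height // patch_size
    return (grid_w, grid_h, embed_dim), grid_w * grid_h


shape, length = patch_embedding_shape(256, 256, embed_dim=128)
print(shape, length)
# prints: (64, 64, 128) 4096
```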
A YOLOx-inspired module called BCSP is used, with BConv and CSP (Ge et al., 2021) making up its constituent parts. In a CSP, a BConv processes one of the two branches, while another BConv followed by a multi-layer BottleNeck module (Ge et al., 2021) processes the other branch; this last module is essentially a residual module. The two outputs are then combined and processed by a BConv. The input of the i-th stage is denoted as $I_i$, while the output is $O_i$. Furthermore, the i-th stage of the multi-stage ViT is represented by $\mathrm{ViTS}_i$. The procedure of up-dimensioning and cross-channel information interaction is defined in Eq. (2) and Eq. (3):
$O_i = \mathrm{ViTS}_i(I_i)$ ...(2)
$I_{i+1} = \mathrm{BCSP}(O_i, K_i, S_i, C_{in}, C_{out})$ ...(3)
Where the input and output channels are denoted by $C_{in}$ and $C_{out}$, respectively, and the convolution kernel size and stride are represented by $K_i$ and $S_i$, respectively. In the first three stages, $C_{out}$ is twice $C_{in}$ and the kernel size and stride are usually set to 1. The channel configurations for the four stages are {128, 256, 512, 1024}. Downsampling is defined as:
$M_i = \mathrm{BCSP}(O_i, K_i, S_i, C_{in}, C_{out})$ ...(4)
In Eq. (4), $M_i$ is the i-th scale feature, K = {1, 3, 5, 9} and S = {1, 2, 4, 8} are the convolution kernel sizes and strides, and $C_{in} = C_{out}$. As a consequence, multi-scale feature maps of different resolutions are produced. The resolutions of the four scales after downsampling are:

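$\frac{W}{4} \times \frac{H}{4}$, $\frac{W}{8} \times \frac{H}{8}$, $\frac{W}{16} \times \frac{H}{16}$ and $\frac{W}{32} \times \frac{H}{32}$. The output of the i-th FPN scale, denoted $F_i$, is: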
$F_i = 5 * \mathrm{CBL}[\mathrm{Concat}(M_i, \mathrm{UP}(\mathrm{CBL}(F_{i+1})))]$ ...(5)
$F_4 = 2 * \mathrm{CBL}[\mathrm{SPP}(2 * \mathrm{CBL}(M_4))]$ ...(6)
In Eqs. (5) and (6), UP denotes upsampling; CBL denotes a block made up of a convolution layer, a batch normalization layer and a leaky-ReLU activation function; and SPP denotes a block that combines a CBL block with repeated max-pooling operations and concatenation (Ge et al., 2021). A prefix such as 5 * CBL denotes five consecutive CBL blocks. The next step is a bottom-up fusion procedure. For the i-th scale-specific detection module, let $P_i$ stand for its input, which is specified by Eq. (7):
$P_i = 3 * \mathrm{CBL}[\mathrm{Concat}(F_i, \mathrm{BConv}(P_{i-1}))]$ ...(7)
Each scale-specific feature is fed into an independent detection branch, which independently predicts disease classes, bounding regions and confidence scores. The outputs of all detection branches are concatenated to form the final prediction result, enabling precise and robust detection of plant disease across varying scales and symptom sizes.
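The feature map sizes produced by the downsampling of Eq. (4) can be checked with the usual convolution output-size formula. The sketch below assumes a 256 × 256 input with a patch size of 4 (a 64 × 64 token grid that is kept unchanged between stages because the inter-stage stride is 1) and a padding of K // 2, which is not specified in the text; the function names are illustrative.

```python
KERNELS = [1, 3, 5, 9]                    # K in Eq. (4)
STRIDES = [1, 2, 4, 8]                    # S in Eq. (4)
STAGE_CHANNELS = [128, 256, 512, 1024]    # channel configuration of the four stages


def conv_output_size(size, kernel, stride, padding=None):
    """Spatial output size of a convolution along one dimension."""
    if padding is None:
        padding = kernel // 2   # assumed padding, not stated in the source
    return (size + 2 * padding - kernel) // stride + 1


def multi_scale_shapes(token_grid):
    """Return (side_length, channels) of each scale feature M_1..M_4."""
    return [(conv_output_size(token_grid, k, s), c)
            for k, s, c in zip(KERNELS, STRIDES, STAGE_CHANNELS)]


print(multi_scale_shapes(256 // 4))
# prints: [(64, 128), (32, 256), (16, 512), (8, 1024)]
```

Under these assumptions the four scales halve in resolution from one level to the next, which is the pyramid that the FPN of Eqs. (5) to (7) fuses top-down and then bottom-up.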
Multiclass leaf disease classification using EfficientNetB3
This research uses EfficientNetB3 to improve performance in multiclass leaf disease classification. The model belongs to the EfficientNet family of CNNs, which employs a compound scaling principle to size the network for the target input size: a single compound coefficient scales the network width, depth and resolution together in a balanced way (Thai-Nghe et al., 2021). The EfficientNetB3 model is constructed from mobile inverted bottleneck convolution (MBConv) units with kernel dimensions of 3 × 3 and 5 × 5, which decrease the computational load by a factor of approximately $f^2$, where f is the filter size.
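Two small calculations illustrate this paragraph. The first compares the multiply-accumulate cost of a depthwise separable convolution (the kind used inside MBConv) with that of a standard convolution; the second applies compound scaling for a given coefficient. The base values 1.2, 1.1 and 1.15 are the ones published with the original EfficientNet family, not values taken from this work, and the function names are illustrative.

```python
def separable_cost_ratio(kernel, out_channels):
    """Cost of a depthwise separable convolution relative to a standard one.

    Standard:  f*f*C_in*C_out per output position.
    Separable: f*f*C_in (depthwise) + C_in*C_out (pointwise).
    The ratio simplifies to 1/C_out + 1/f**2.
    """
    return 1.0 / out_channels + 1.0 / kernel ** 2


def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Depth, width and resolution multipliers for compound coefficient phi."""
    return alpha ** phi, beta ** phi, gamma ** phi


for f in (3, 5):
    ratio = separable_cost_ratio(f, out_channels=256)
    print(f"kernel {f}x{f}: about {1 / ratio:.1f}x fewer operations (f^2 = {f * f})")

depth, width, resolution = compound_scale(phi=1)
print(depth, width, resolution)
```

The reduction approaches f² only when the number of output channels is much larger than f², which is why the saving is best described as approximate.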
With 210 layers and an input shape of 300 × 300 × 3, this model learns complex features more effectively and generalizes better. An additional spatial attention module is used in this model to identify crucial areas in the feature map. Two separate 2D maps, $F'_{avg} \in \mathbb{R}^{1 \times H \times W}$ and $F'_{max} \in \mathbb{R}^{1 \times H \times W}$, are generated by aggregating channel information using average-pooling and max-pooling operations. These maps are then concatenated and convolved to generate a spatial attention map, $F'_S \in \mathbb{R}^{H \times W}$, which shows where to emphasize or suppress features. The spatial attention is computed as:
$F'_S = \sigma\left(f^{7 \times 7}\left([F'_{avg}; F'_{max}]\right)\right)$ ...(8)
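Eq. (8) can be sketched in plain Python as channel-wise pooling, a zero-padded 7 × 7 convolution over the two pooled maps, and a sigmoid. The kernel weights here are placeholders that would normally be learned, and the function names and the zero padding are assumptions made for this illustration.

```python
import math


def channel_pool(features):
    """features: C x H x W nested lists. Returns (avg_map, max_map), each H x W."""
    c, h, w = len(features), len(features[0]), len(features[0][0])
    avg_map = [[sum(features[k][i][j] for k in range(c)) / c for j in range(w)]
               for i in range(h)]
    max_map = [[max(features[k][i][j] for k in range(c)) for j in range(w)]
               for i in range(h)]
    return avg_map, max_map


def conv2d_same(maps, kernels, size=7):
    """Zero-padded convolution of stacked maps, one kernel per map, summed to H x W."""
    h, w = len(maps[0]), len(maps[0][0])
    pad = size // 2
    out = [[0.0] * w for _ in range(h)]
    for m, kern in zip(maps, kernels):
        for i in range(h):
            for j in range(w):
                acc = 0.0
                for di in range(size):
                    for dj in range(size):
                        r, c = i + di - pad, j + dj - pad
                        if 0 <= r < h and 0 <= c < w:
                            acc += m[r][c] * kern[di][dj]
                out[i][j] += acc
    return out


def spatial_attention(features, kernels):
    """Sigmoid of a 7 x 7 convolution over the [avg; max] pooled maps."""
    avg_map, max_map = channel_pool(features)
    conv = conv2d_same([avg_map, max_map], kernels)
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in conv]
```

The resulting H × W map holds values between 0 and 1 and would be multiplied element-wise with every channel of the feature map to emphasize or suppress locations.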
In Eq. (8), σ denotes the sigmoid activation function and $f^{7 \times 7}$ denotes a convolution layer with a 7 × 7 kernel. The model’s outputs are then passed through global average pooling and flattened. Next, a 256-neuron dense unit with ReLU activation and L1 and L2 regularization is added. A dropout rate of 0.4 is used to reduce overfitting. Lastly, a dense unit with as many neurons as there are classes completes the classifier, and a softmax predicts the probability of every category. The model is assessed using the categorical cross-entropy (CE) loss, which measures the difference between the actual labels and the predicted probabilities:
$CE = -\sum_{i=1}^{n} t_i \log(p_i)$ ...(9)
In Eq. (9), n is the total number of classes in the dataset under consideration, $t_i$ is the true label and $p_i$ is the predicted softmax probability for class i.
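The softmax output and the categorical cross-entropy loss can be written out in a few lines. This sketch assumes the standard form of the loss with one-hot true labels; the small clipping constant and the function names are choices made for this example.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(targets, probs, eps=1e-12):
    """CE = -sum_i t_i * log(p_i), with p_i clipped away from zero."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(targets, probs))


true_label = [0.0, 1.0, 0.0]            # one-hot label for the second class
probs = softmax([1.0, 3.0, 0.5])        # hypothetical class scores
print(categorical_cross_entropy(true_label, probs))
```

With a one-hot label, only the predicted probability of the true class contributes, so the loss shrinks towards zero as that probability approaches 1.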
Algorithm 1: MSViT-EfficientNetB3 for Multi-class leaf disease classification.
Input: 17,046 images from 52 classes drawn from the Plant Village dataset.
Output: Detection of multi-class plant diseases.
1. Begin.
2. Dataset preparation.
3. Load the customized dataset containing 150 images per class for 52 disease categories with size of 256 × 256 × 3.
4. Image pre-processing.
5. Apply normalization, edge detection and noise reduction to clean the input images.
6. Perform data augmentation (Rotation, flipping, zooming) to increase dataset variability and reduce class imbalance.
7. Multi-scale feature extraction using MSViT.
8. Use four-stage vision transformer layers with W-MSA to extract hierarchical features.
9. Use BCSP in each stage to downsample with multiple kernel sizes and strides and extract multi-scale features.
10. Fuse feature maps using FPN for better spatial resolution and disease localization.
11. Feed fused features into EfficientNetB3.
12. Classification layer.
13. Apply the FC and softmax layers to classify images into disease categories.
14. End.