Background: Early prediction of plant diseases enables timely management, reducing crop losses and promoting healthier yields. Deep learning (DL) models are effective tools for early prediction of plant diseases. Among these, EfficientNetB3-based models perform best for multi-class plant disease classification, but they are highly sensitive to noisy or poor-quality input data, such as images with background clutter, varying lighting conditions or partially damaged leaves. Such inputs may lead to incorrect classification results, reducing the model’s reliability in real-world agricultural scenarios.

Methods: To overcome these limitations, a multi-scale vision transformer (MSViT) model is proposed to capture multi-scale features and improve robustness against noise and background variations. A multi-stage ViT structure is implemented to extract hierarchical image features, while enhanced convolution layers down sample the output of each stage, enabling feature extraction at multiple scales. The size and stride of the convolution kernel are dynamically adjusted according to feature dimensionality to ensure effective representation learning across scales. To reduce computational complexity, window attention is introduced in each ViT stage with window sizes that vary in proportion to feature dimensions. The multi-scale features obtained from different stages are then fused using a feature pyramid network (FPN) to strengthen the localization and representation of disease-affected regions. For classification, a modified EfficientNet-B3 model is used to predict diseases. The complete framework is termed the multi-scale vision transformer for EfficientNetB3 (MSViT-EfficientNetB3).

Result: The proposed model achieves an accuracy of 94.2% on the PlantVillage dataset, outperforming existing models in multi-class plant disease classification.

Agriculture provides the raw resources and nourishment on which all forms of life depend (Dethier and Effenberger, 2012). Plant diseases are a principal factor affecting plant development. They may arise from several causes, including parasitic organisms, fluctuations in soil pH, frequent changes in environmental conditions and reduced moisture levels, all of which significantly affect crop output and growth (Keerthana et al., 2024; Hassan et al., 2022). Reducing plant diseases is essential for sustaining agricultural output and economic development. However, identifying and classifying the kinds of plant diseases that substantially affect growth and production remains a challenge (Fenu and Malloci, 2021).
       
Traditional prediction methods require manual observation by agriculturists, resulting in increased costs, prolonged assessment periods and substantial laboratory setups when operating in extensive agricultural fields (Khakimov et al., 2022). Previously, image processing was extensively used to predict plant diseases by analyzing photographs of diseased plants (Shafik et al., 2023; Buja et al., 2021). However, this approach yields lower prediction accuracy on large images and relies heavily on manual feature extraction.
       
DL algorithms are increasingly used for the recognition and categorization of plant diseases. Zhong et al. (2022) predicted cassava diseases using a DenseNet model. Arshad et al. (2023) used a U-Net topology to predict potato disease, which can reliably identify disease zones in plant images. InceptionV3 and vision transformers (ViT) have also been used for prediction and classification. Tiwari et al. (2024) developed a hybrid model combining ViT and a deep neural network (DNN) for disease detection in tomato leaves. Pinaria et al. (2026) developed various convolutional neural network (CNN) architectures for the prediction task. However, these models each focus on a single plant species and its diseases. Multi-class detection is a better fit for recognizing diseases across different kinds of plants.
       
Abinaya et al. (2023) developed the CAAR-UNet technique, which uses a cascading autoencoder with attention residuals for multiple plant diseases. Adnan et al. (2023) designed EfficientNetB3 adaptive augmented deep learning (EB3AADL) to predict numerous plant disorders. This method increases accuracy by pre-processing the images to reduce noise and increases diversity through image augmentation. A dual-encoder method was used for multi-class disease detection (Sharma and Kumar, 2026). Although the above-mentioned DL models have shown promise in plant disease classification, they are sensitive to noisy or low-quality input data. Some models perform poorly under background clutter, inconsistent lighting conditions or partially damaged leaves, which reduces classification accuracy.
       
To address these problems, a multi-scale context Vision Transformer (ViT) model is proposed to extract multi-scale features that make it robust to noise and background changes. The proposed approach integrates a multi-scale vision transformer (MSViT), which extracts hierarchical features at multiple spatial resolutions, with an attention-enhanced EfficientNet-B3 that performs multi-class plant disease classification. The MSViT module addresses the limitations of standard ViTs by employing multi-stage feature extraction with enhanced convolution layers to down sample outputs, enabling rich multi-scale representation. Window attention is applied stage by stage with dynamically adjusted window sizes to capture essential spatial information while reducing computational complexity. The multi-scale features obtained from different stages are fused using an FPN to improve localization and representation of disease-affected regions. For classification, a modified EfficientNet-B3 model is used to automatically focus on disease-relevant regions, enhancing prediction accuracy.
The experiment was conducted from June 2024 to March 2026 at the Department of Computer Science and Engineering, Annamalai University, Tamil Nadu, India. This section describes the modules involved in the proposed research. The pipeline of the proposed approach is illustrated in Fig 1.

Fig 1: Pipeline of the proposed model.


 
Dataset description
 
Plant village dataset
 
The PlantVillage dataset contains 54,303 images of healthy and diseased leaves, categorized into 38 sets that cover 14 different crops and 24 different diseases; examples are shown in Fig 2. The customized dataset used in this work has 52 classes. Each image has dimensions of 256 × 256 × 3, where the three channels represent the RGB format. After data pre-processing, 6,646 images from the PlantVillage dataset are used for testing and validation (3,323 images for the test set and 3,323 for the validation set). After augmentation to 200 images per class, the resulting 10,400 images of 52 classes are used as the training set. In total, 17,046 images are used in this research work. The dataset is available at: https://www.kaggle.com/emmarex/plantdisease.

Fig 2: Examples of leaf images from plant village database.


 
Image pre-processing
 
The initial database is in Red, Green and Blue (RGB) color format and includes noisy images. To enhance the quality of the input images, the method applies image pre-processing operations including scaling, edge detection and noise removal. Data augmentation (Adnan et al., 2023) is also carried out to improve the model’s learning of image features and the accuracy of prediction. Pre-processing also helps to reduce the computational load and allows the dataset to be processed more efficiently.
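The pre-processing and augmentation steps can be illustrated with a minimal sketch in plain Python. The text does not name the specific filters or augmentation parameters, so the 3 × 3 mean filter, the horizontal flip and the 90-degree rotation below are stand-ins chosen for illustration, and all function names are hypothetical. Images are represented as nested lists of grayscale values to keep the example free of dependencies.

```python
def normalize(image, max_value=255.0):
    """Scale pixel values to the range [0, 1]."""
    return [[px / max_value for px in row] for row in image]


def mean_filter3(image):
    """3 x 3 mean filter with edge replication (an assumed stand-in for noise removal)."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [image[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                      for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            row.append(sum(window) / 9.0)
        out.append(row)
    return out


def flip_horizontal(image):
    """Mirror the image left to right (flipping augmentation)."""
    return [row[::-1] for row in image]


def rotate90(image):
    """Rotate the image 90 degrees clockwise (rotation augmentation)."""
    return [list(col) for col in zip(*image[::-1])]


def augment(image):
    """Return the original image plus two augmented variants."""
    return [image, flip_horizontal(image), rotate90(image)]
```

In practice an image library would handle arbitrary rotation angles and zooming; the sketch only shows how each augmented copy adds a new training sample for the same class label.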
 
Proposed MS-ViT
 
The general architecture of the proposed work is shown in Fig 3. A multi-scale vision transformer-based architecture is proposed to detect diseases in plant leaves. The approach uses a YOLOX-style detection head to construct an FPN, extract multi-scale features and predict the location and type of disease on plant leaves. Self-attention operates on three groups of input features: queries (Q), keys (K) and values (V). The weighted sum of the value vectors is computed using the similarity between the query and key vectors. The scaled dot-product attention (Vaswani et al., 2017) is defined in Eq. (1):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V \tag{1}$$

Where,
d= The scaling factor (the dimension of the key vectors).
Q= Made up of $n_q$ query vectors.
K and V= Key and value vectors, respectively, each made up of $n_k$ vectors.
Global self-attention compares every patch with every other patch, so its cost grows quadratically with the number of patches. Window-based multi-head self-attention (W-MSA) (Liu and Zhang, 2025) is used to solve this problem; it executes self-attention inside localized, non-overlapping windows. After the initial patch partitioning, the patches are divided into windows and self-attention is calculated only within each window. As a result, complexity is reduced without sacrificing the ability to learn local features.
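To make the attention computation and the effect of windowing concrete, the following is a minimal single-head sketch in plain Python. It assumes Q = K = V equal to the raw tokens (the learned projection matrices and multiple heads are omitted), and that the token grid is evenly divisible by the window size; the function names are hypothetical and not taken from the original work.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out


def window_indices(height, width, window):
    """Group token indices of a height x width grid into non-overlapping windows."""
    windows = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            windows.append([r * width + c
                            for r in range(top, top + window)
                            for c in range(left, left + window)])
    return windows


def window_attention(tokens, height, width, window):
    """Run self-attention independently inside each window."""
    out = [None] * len(tokens)
    for idx in window_indices(height, width, window):
        local = [tokens[i] for i in idx]
        for i, vec in zip(idx, attention(local, local, local)):
            out[i] = vec
    return out


def score_count(height, width, window=None):
    """Number of query-key scores computed: global if window is None, else windowed."""
    n = height * width
    if window is None:
        return n * n
    per_window = window * window
    return (n // per_window) * per_window ** 2
```

For a 64 × 64 token grid, `score_count(64, 64)` counts 4096² pairwise scores, while `score_count(64, 64, 8)` counts 64 windows of 64² scores each, a 64-fold reduction under these assumptions.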

Fig 3: Overview architecture of multi scale-ViT with EfficientNetB3.



Overall architecture of MSViT
 
In the proposed approach for plant leaf disease diagnosis, the input image of the ViT is first separated into patches and each patch is encoded as a learnable feature vector. The input image has width (W), height (H) and number of channels (C') and is denoted as W × H × C'. The default patch size is 4. Following patch partitioning and embedding, the initial input is transformed into $\frac{W}{4} \times \frac{H}{4} \times C$, with C being the embedding dimension. The input length is then $\frac{W}{4} \times \frac{H}{4}$.
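The patch arithmetic can be checked with a short sketch. It assumes square, non-overlapping patches and an image size divisible by the patch size; the helper names are hypothetical. The learnable linear projection that maps each flattened patch to the embedding dimension C is not shown.

```python
def patch_grid(width, height, patch=4):
    """Return (patches per row, patches per column, sequence length)."""
    if width % patch or height % patch:
        raise ValueError("image size must be divisible by the patch size")
    cols, rows = width // patch, height // patch
    return cols, rows, cols * rows


def flatten_patch(image, top, left, patch=4):
    """Flatten one patch of an H x W x C' nested-list image into a single vector."""
    return [value
            for r in range(top, top + patch)
            for c in range(left, left + patch)
            for value in image[r][c]]


# A 256 x 256 x 3 image with patch size 4 gives a 64 x 64 grid of tokens.
cols, rows, length = patch_grid(256, 256)
```

Each flattened patch holds 4 × 4 × C' raw values (48 for an RGB image) before it is projected to C dimensions.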
 
A YOLOX-inspired module called BCSP is used, made up of BConv and CSP components (Ge et al., 2021). In a CSP, a BConv processes one branch, while another BConv followed by a multi-layer BottleNeck module (Ge et al., 2021) processes the other branch; this last module is essentially a residual module. The two outputs are then combined and processed by a BConv. The input of the ith stage is denoted as $I_i$ and its output as $O_i$. Furthermore, the ith stage of the multi-stage ViT is represented by $\mathrm{ViTS}_i$. The procedure of up-dimensioning and cross-channel information interaction is defined in Eq. (2) and Eq. (3):

$$O_i = \mathrm{ViTS}_i(I_i) \tag{2}$$

$$I_{i+1} = \mathrm{BCSP}(O_i, K_i, S_i, C_{in}, C_{out}) \tag{3}$$

Where the input and output channels are denoted by $C_{in}$ and $C_{out}$, respectively and the convolution kernel size and stride are represented by $K_i$ and $S_i$, respectively. In the first three stages, $C_{out}$ is twice $C_{in}$ and the kernel size and stride are usually set to 1. The channel configurations for the four stages are {128, 256, 512, 1024}. Down sampling is defined as:

$$M_i = \mathrm{BCSP}(O_i, K_i, S_i, C_{in}, C_{out}) \tag{4}$$
                                                                                   
In Eq. (4), $M_i$ is the ith scale feature, K = {1, 3, 5, 9} and S = {1, 2, 4, 8} are the convolution kernel sizes and strides and $C_{in} = C_{out}$. As a consequence, multi-scale feature maps of different resolutions are produced. With the patch size of 4 and the strides above, the resolutions of the four scales after down sampling are $\frac{W}{4} \times \frac{H}{4}$, $\frac{W}{8} \times \frac{H}{8}$, $\frac{W}{16} \times \frac{H}{16}$ and $\frac{W}{32} \times \frac{H}{32}$. The output of the ith FPN scale, denoted $F_i$, is:

$$F_i = 5 \times \mathrm{CBL}\left[\mathrm{Concat}\left(M_i, \mathrm{UP}\left(\mathrm{CBL}(F_{i+1})\right)\right)\right] \tag{5}$$

$$F_4 = 2 \times \mathrm{CBL}\left[\mathrm{SPP}\left(2 \times \mathrm{CBL}(M_4)\right)\right] \tag{6}$$

In Eq. (5) and Eq. (6), UP denotes up sampling, CBL denotes a block consisting of a convolution layer, a batch normalization layer and a leaky-ReLU activation function, n × CBL denotes n stacked CBL blocks and SPP denotes a CBL block followed by repeated max-pooling operations and concatenation (Ge et al., 2021). The next step is a bottom-up fusion procedure. For the ith scale-specific detection module, let $P_i$ denote its input, which is specified by Eq. (7):

$$P_i = 3 \times \mathrm{CBL}\left[\mathrm{Concat}\left(F_i, \mathrm{BConv}(P_{i-1})\right)\right] \tag{7}$$

Each scale-specific feature is fed into an independent detection branch, which independently predicts disease classes, bounding regions and confidence scores. The outputs of all detection branches are concatenated to form the final prediction result, enabling precise and resilient detection of plant disease across varying scales and symptom sizes.
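The way feature-map shapes flow through Eqs. (4) to (7) can be traced with a small shape-only sketch. Several details are assumptions made for illustration rather than statements from the original work: convolutions use padding of kernel // 2, every stage output keeps the patch-grid resolution (consistent with the stride of 1 in Eq. (3)), UP and BConv change resolution by a factor of 2, the stacked CBL blocks restore the channel count of the corresponding $M_i$ and $P_1 = F_1$. No actual convolution is computed.

```python
def conv_out(size, kernel, stride, padding=None):
    """Spatial size after a convolution. Padding defaults to kernel // 2 (an assumption)."""
    if padding is None:
        padding = kernel // 2
    return (size + 2 * padding - kernel) // stride + 1


def scale_sizes(image_size=256, patch=4, kernels=(1, 3, 5, 9), strides=(1, 2, 4, 8)):
    """Side length of M_1..M_4 in Eq. (4), assuming every O_i keeps the patch-grid size."""
    base = image_size // patch
    return [conv_out(base, k, s) for k, s in zip(kernels, strides)]


def top_down(m_shapes):
    """Eq. (5)-(6) at the shape level; m_shapes is [(channels, size), ...], finest first."""
    fused = [m_shapes[-1]]  # F_4: CBL and SPP blocks assumed to preserve the shape of M_4
    for channels, size in reversed(m_shapes[:-1]):
        _, coarser_size = fused[0]
        if coarser_size * 2 != size:  # UP is assumed to be 2x up sampling
            raise ValueError("adjacent scales must differ by a factor of 2")
        # Concat(M_i, UP(CBL(F_{i+1}))) then 5 x CBL, assumed to output M_i's channels
        fused.insert(0, (channels, size))
    return fused


def bottom_up(f_shapes):
    """Eq. (7) at the shape level, taking P_1 = F_1 (an assumption)."""
    outputs = [f_shapes[0]]
    for channels, size in f_shapes[1:]:
        _, finer_size = outputs[-1]
        if finer_size // 2 != size:  # BConv is assumed to down sample by a factor of 2
            raise ValueError("adjacent scales must differ by a factor of 2")
        outputs.append((channels, size))
    return outputs


sizes = scale_sizes()
m_shapes = list(zip((128, 256, 512, 1024), sizes))
p_shapes = bottom_up(top_down(m_shapes))
```

The check inside each loop is the useful part: concatenation in Eqs. (5) and (7) is only defined when the resampled neighbour matches the current scale spatially, which the kernel and stride sets K and S guarantee for a 256 × 256 input.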
 
Multiclass leaf disease classification using EfficientNetB3
 
This research uses EfficientNetB3 to improve performance in multiclass leaf disease classification. EfficientNetB3 belongs to the EfficientNet family of CNNs, which employs a compound scaling principle: network width, depth and resolution are scaled together in a balanced way through a single compound coefficient, so that the network is sized to match the target size (Thai-Nghe et al., 2021). To decrease the computational load by a factor of approximately $f^2$, where f is the filter size, the EfficientNetB3 model is constructed using mobile inverted bottleneck convolution (MBConv) units. These units use kernel dimensions of 3 × 3 and 5 × 5.
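The size of the saving can be worked through with a short calculation. The sketch below assumes that the stated factor refers to the usual saving of the depthwise-separable convolution inside an MBConv unit compared with a standard convolution; the layer sizes in the example are illustrative and not taken from EfficientNetB3.

```python
def standard_conv_macs(h, w, c_in, c_out, k):
    """Multiply-accumulate operations of a standard k x k convolution."""
    return h * w * c_in * c_out * k * k


def depthwise_separable_macs(h, w, c_in, c_out, k):
    """Depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = h * w * c_in * k * k
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise


def cost_ratio(c_out, k):
    """Closed form of separable / standard cost: 1 / c_out + 1 / k^2."""
    return 1.0 / c_out + 1.0 / (k * k)


# Illustrative layer: 38 x 38 feature map, 48 -> 96 channels, 3 x 3 kernel.
ratio = depthwise_separable_macs(38, 38, 48, 96, 3) / standard_conv_macs(38, 38, 48, 96, 3)
```

Because the 1 / c_out term is small for wide layers, the ratio approaches 1 / k², which is where a reduction of roughly f² comes from: close to 9 for 3 × 3 kernels and close to 25 for 5 × 5 kernels.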
       
With 210 layers and an input shape of 300 × 300 × 3, this model learns complex features and generalizes well. An additional spatial attention module is used in this model to identify crucial areas in the feature map. Two separate 2D maps, $F'_{avg} \in \mathbb{R}^{1 \times H \times W}$ and $F'_{max} \in \mathbb{R}^{1 \times H \times W}$, are generated by aggregating channel information using average-pooling and max-pooling operations. These maps are concatenated and convolved to generate a spatial attention map, $F'_S \in \mathbb{R}^{H \times W}$, which indicates where to emphasize or suppress features. The spatial attention is computed as:

$$F'_S = \sigma\left(f^{7 \times 7}\left([F'_{avg}; F'_{max}]\right)\right) \tag{8}$$
      
In Eq. (8), σ denotes the sigmoid activation function and $f^{7 \times 7}$ denotes a convolution layer with a 7 × 7 kernel. The model’s outputs are then passed through global average pooling and flattened. Next, a 256-neuron dense layer with ReLU activation and L1 and L2 regularization is added. A dropout rate of 0.4 is used to prevent overfitting. Lastly, a dense layer with as many neurons as there are classes is added and a softmax activation predicts the probability of every category. The model’s performance is assessed using the categorical cross-entropy (CE) loss, which measures the difference between the true labels and the predicted probabilities:

$$CE = -\sum_{i=1}^{n} t_i \log(p_i) \tag{9}$$

In Eq. (9), n is the total number of classes in the dataset, $t_i$ is the true label and $p_i$ is the predicted softmax probability for the ith class.
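The spatial attention map of Eq. (8) and the cross-entropy loss of Eq. (9) can be sketched in plain Python as follows. The feature map is a nested list indexed as [channel][row][column], the 7 × 7 kernel is passed in as a [2][7][7] nested list (in the real model these weights are learned), zero padding is assumed at the borders and the function names are hypothetical.

```python
import math


def spatial_attention(features, kernel):
    """Eq. (8): sigmoid(conv([avg over channels; max over channels])).

    features: nested list [C][H][W]; kernel: nested list [2][k][k].
    Returns an H x W map with values in (0, 1).
    """
    channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    avg_map = [[sum(features[c][y][x] for c in range(channels)) / channels
                for x in range(width)] for y in range(height)]
    max_map = [[max(features[c][y][x] for c in range(channels))
                for x in range(width)] for y in range(height)]
    pooled = [avg_map, max_map]
    k = len(kernel[0])
    pad = k // 2
    attention_map = []
    for y in range(height):
        row = []
        for x in range(width):
            total = 0.0
            for m in range(2):
                for dy in range(k):
                    for dx in range(k):
                        yy, xx = y + dy - pad, x + dx - pad
                        if 0 <= yy < height and 0 <= xx < width:  # zero padding
                            total += kernel[m][dy][dx] * pooled[m][yy][xx]
            row.append(1.0 / (1.0 + math.exp(-total)))
        attention_map.append(row)
    return attention_map


def categorical_cross_entropy(targets, probs, eps=1e-12):
    """Eq. (9): -sum_i t_i * log(p_i) for a one-hot target and softmax output."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(targets, probs))
```

Multiplying each channel of the feature map element-wise by the returned map would then emphasize or suppress spatial locations; with a one-hot target, the loss reduces to the negative log-probability assigned to the true class.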
 
Algorithm 1: MSViT-EfficientNetB3 for Multi-class leaf disease classification.
 
Input: 17,046 images from 52 classes derived from the PlantVillage dataset.
 
Output: Detection of multi class plant diseases.
 
1. Begin.
2. Dataset preparation.
3. Load the customized dataset containing 150 images per class for 52 disease categories with size of 256 × 256 × 3.
4. Image pre-processing.
5. Apply normalization, edge detection and noise reduction to clean the input images.
6. Perform data augmentation (Rotation, flipping, zooming) to increase dataset variability and reduce class imbalance.
7. Multi scale feature extraction using MSViT.
8. Use four-stage vision transformer layers with W-MSA to extract hierarchical features.
9. Use BCSP in each stage to down sample with multiple kernel sizes and strides to extract multi-scale features.
10. Fuse feature maps using FPN for better spatial resolution and disease localization.
11. Feed fused features into EfficientNetB3.
12. Classification layer.
13. Apply the FC and softmax layers to classify images into disease categories.
14. End.
Experimental setup
 
The experiments on multiclass plant disease classification using various DL models are presented in this section. The code is written in Python. The experiments were run on a Windows 10 64-bit laptop with an Intel® Core™ i5-4210 processor running at 3 GHz, 8 GB of RAM and a 1 TB hard drive. Following augmentation, the 52 classes comprise 17,046 images. Of these, 10,400 images were used for training and the remaining 6,646 were divided evenly between validation and testing (3,323 each). The proposed approach is applied to this dataset with the hyperparameter values listed in Table 1.

Table 1: Hyper parameters used in training of MSViT-EfficientNetB3 model.


       
The following assessment metrics may be used to evaluate the MSViT-EfficientNetB3 model’s classification efficiency:
 
Evaluation metrics
 
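Accuracy is the ratio of correctly classified leaf disease images to the total number of images evaluated.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{10}$$

In Eq. (10), TP (true positive) is the correct detection of the positive class, such as a healthy leaf identified as healthy. TN (true negative) is the correct detection of the negative class, such as a diseased leaf identified as diseased. FP (false positive) occurs when a negative sample is classified as positive (such as a diseased leaf being identified as healthy). FN (false negative) occurs when a positive sample is classified as negative (such as a healthy leaf being identified as diseased).

Precision: This is calculated using Eq. (11).

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{11}$$

Recall: Equation (12) is used to measure it.

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{12}$$

F1-score: It is determined by Eq. (13).

$$\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{13}$$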
Fig 4 shows the confusion matrix obtained from testing the proposed technique on 52 classes. Fig 5 illustrates the training and testing accuracy with respect to the number of epochs on the PlantVillage dataset. The accuracy of the MSViT-EfficientNetB3 model improves consistently in both training and testing as the number of epochs increases from 0 to 100. The comparable gains in training and testing accuracy indicate effective learning and low overfitting.
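As an illustration of how the evaluation metrics can be derived from a multi-class confusion matrix, the following sketch computes per-class TP, FP, FN and TN in a one-vs-rest fashion and then averages precision, recall and F1-score over classes. Macro averaging is an assumption here, since the averaging scheme is not stated in the text, and the function names are hypothetical. The matrix convention is that entry [i][j] counts samples of true class i predicted as class j.

```python
def per_class_counts(confusion, k):
    """Return (TP, FP, FN, TN) for class k, treating it as the positive class."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp
    fp = sum(confusion[i][k] for i in range(n)) - tp
    tn = total - tp - fn - fp
    return tp, fp, fn, tn


def macro_metrics(confusion):
    """Return (accuracy, macro precision, macro recall, macro F1-score)."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[k][k] for k in range(n)) / total
    precisions, recalls, f1_scores = [], [], []
    for k in range(n):
        tp, fp, fn, _ = per_class_counts(confusion, k)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1_scores) / n
```

For the multi-class case, accuracy is computed as the share of samples on the diagonal of the confusion matrix, which matches Eq. (10) when there are only two classes.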

Fig 4: Confusion matrix of the proposed MSViT-EfficientNetB3 model.



Fig 5: Performance evaluation of the proposed model’s training and testing accuracy on the PlantVillage dataset.


 
Performance analysis of proposed MSViT-EfficientNetB3 model with standard models
 
In this section, the performance of the MSViT-EfficientNetB3 classifier model on the PlantVillage dataset is compared to popular pre-trained models such as EfficientNetB0, EfficientNetB1, EfficientNetB2, EfficientNetB3 and EfficientNetB3+ViT.
       
Fig 6 shows how well the different standard models perform on the modified PlantVillage database for multiclass leaf disease classification. The analysis shows that the accuracy of the MSViT-EfficientNetB3 model exceeds that of EfficientNetB0, EfficientNetB1, EfficientNetB2, EfficientNetB3 and EfficientNetB3+ViT by 9.15%, 6.20%, 5.61%, 4.55% and 1.73%, respectively. This improvement can be attributed to the model’s ability to attend to the image regions that matter most for identifying plant diseases. By focusing on these regions, the model captures subtle disease-related patterns more effectively, which leads to more accurate and reliable classification of plant diseases. A detailed comparison of the findings from the proposed model with those of previous studies is presented in Table 2.

Fig 6: Evaluation of various standard models with the proposed model.



Table 2: Comparison of different existing models with proposed model.

To enhance accuracy and feature localization, this paper presents the MSViT-EfficientNetB3 model, which detects and classifies diseases across numerous categories of plant leaves. The model combines an MSViT with an attention-guided EfficientNet-B3. The MSViT module uses a hierarchical design that extracts features at multiple spatial resolutions with the help of adaptive down sampling and window attention, which preserves important spatial information and substantially lowers computational cost. The model identifies local and global disease properties by integrating stage-wise convolutions and an FPN to merge features. The EfficientNet-B3 classifier is further improved by a spatial attention module that highlights disease-relevant areas in the feature maps and yields more precise predictions. This combination enables the model to identify disease symptoms at a fine-grained level across different leaf types and conditions. The results of extensive experiments on the PlantVillage dataset indicate that the MSViT-EfficientNetB3 model achieves 97.12%, 97.42%, 98.21% and 98.7% accuracy, precision, recall and F1-score, respectively, which are better than those of current methods. Nevertheless, the model shows a minor performance reduction on low-contrast or highly noisy images. Future studies will explore more sophisticated transformer-based enhancement modules to further improve accuracy in problematic imaging conditions. Further, the proposed model will be extended to diseases affecting other plant parts and to leaves exhibiting multiple diseases.
The present study was supported by the Department of Computer Science and Engineering, Annamalai University, Tamil Nadu, India. The authors would like to thank the department for providing the facilities and support to conduct this research.
 
Disclaimers
 
The views and conclusions expressed in this article are solely those of the authors and do not necessarily represent the views of their affiliated institutions. The authors are responsible for the accuracy and completeness of the information provided, but do not accept any liability for any direct or indirect losses resulting from the use of this content.
The authors declare that there are no conflicts of interest regarding the publication of this article. No funding or sponsorship influenced the design of the study, data collection, analysis, decision to publish, or preparation of the manuscript.

  1. Abinaya, S., Kumar, K.U. and Alphonse, A.S. (2023). Cascading autoencoder with attention residual U-Net for multi-class plant leaf disease segmentation and classification. IEEE Access. 11: 98153-98170. doi: 10.1109/ACCESS.2023.3312345.

  2. Adnan, F., Awan, M.J., Mahmoud, A., Nobanee, H., Yasin, A. and Zain, A.M. (2023). EfficientNetB3-adaptive augmented deep learning (AADL) for multi-class plant disease classification. IEEE Access. 11: 45377-45392. doi: 10.1109/ACCESS.2023.3295678.

  3. Arshad, F., Mateen, M., Hayat, S., Wardah, M., Al-Huda, Z., Gu, Y.H. and Al-antari, M.A. (2023). PLDPNet: End-to-end hybrid deep learning framework for potato leaf disease prediction. Alexandria Engineering Journal. 78: 406-418. doi: 10.1016/j.aej.2023.07.040.

  4. Buja, I., Sabella, E., Monteduro, A.G., Chiriacò, M.S., De Bellis, L., Luvisi, A. and Maruccio, G. (2021). Advances in plant disease detection and monitoring: From traditional assays to in-field diagnostics. Sensors. 21(6): 2129. doi: 10.3390/s21062129.

  5. Dethier, J.J. and Effenberger, A. (2012). Agriculture and development: A brief review of the literature. Economic Systems. 36(2): 175-205. doi: 10.1016/j.ecosys.2011.09.003.

  6. Fenu, G. and Malloci, F.M. (2021). Forecasting plant and crop disease: An explorative study on current algorithms. Big Data and Cognitive Computing. 5(1): 2. doi: 10.3390/bdcc5010002.

  7. Ge, Z., Liu, S., Wang, F., Li, Z. and Sun, J. (2021). YOLOX: Exceeding YOLO series in 2021. CoRR. abs/2107.08430. doi: 10.48550/arXiv.2107.08430.

  8. Hassan, S.M., Amitab, K., Jasinski, M., Leonowicz, Z., Jasinska, E., Novak, T. and Maji, A.K. (2022). A survey on different plant diseases detection using machine learning techniques. Electronics. 11(17): 2641. doi: 10.3390/electronics11172641.

  9. Keerthana, K.G., Prashanth, S.J. and Babu, R.L. (2024). Tomato leaf curl disease: A review. Agricultural Reviews. 45(1): 132-136. doi: 10.18805/ag.R-2346.

  10. Khakimov, A., Salakhutdinov, I., Omolikov, A. and Utaganov, S. (2022). Traditional and current-prospective methods of agricultural plant diseases detection: A review. IOP Conference Series: Earth and Environmental Science. 951: 012002. doi: 10.1088/1755-1315/951/1/012002.

  11. Liu, W. and Zhang, A. (2025). Plant disease detection algorithm based on efficient Swin transformer. Computers, Materials and Continua. 82(2): 3045-3068.

  12. Pinaria, N.A.B., Ngangi, R.C. and Sompotan, S. (2026). Early detection of fungal diseases in tomatoes using convolutional neural networks. Agricultural Science Digest. 45(6): 1018-1022. doi: 10.18805/ag.DF-716.

  13. Salman, Z., Muhammad, A. and Han, D. (2025). Plant disease classification in the wild using vision transformers and mixture of experts. Frontiers in Plant Science. 16: 1522985. doi: 10.3389/fpls.2025.1522985.

  14. Shafik, W., Tufail, A., Namoun, A., De Silva, L.C. and Apong, R.A.A.H.M. (2023). A systematic literature review on plant disease detection: Motivations, classification techniques, datasets, challenges and future trends. IEEE Access. 11: 59174-59203. doi: 10.1109/ACCESS.2023.3283124.

  15. Sharma, A. and Kumar, A. (2026). Dual-encoder variational autoencoder for detection and classification of plant leaf diseases. Indian Journal of Agricultural Research. 60(6): 912-918. doi: 10.18805/IJARe.A-6451.

  16. Thai-Nghe, N., Tri, N.T. and Hoa, N.H. (2021). Deep Learning for Rice Leaf Disease Detection in Smart Agriculture. In International Conference on Artificial Intelligence and Big Data in Digital Era: Springer. pp. 659-670. doi: 10.1007/978-3-030-86192-3_55.

  17. Tiwari, M., Kumar, H., Prakash, N., Kumar, S., Neware, R., Tripathi, S. and Agarwal, R. (2024). Tomato disease detection using vision transformer with residual L1-norm attention and deep neural networks. International Journal of Intelligent Engineering and Systems. 17(1): 679. doi: 10.22266/ijies2024.0229.

  18. Tonmoy, M.R., Hossain, M.M., Dey, N. and Mridha, M.F. (2025). MobilePlantViT: A mobile-friendly hybrid ViT for generalized plant disease image classification. arXiv preprint. arXiv:2503.16628. doi: 10.48550/arXiv.2503.16628.

  19. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I. (2017). Attention Is All You Need. In Advances in Neural Information Processing Systems. pp. 5998-6008.

  20. Zhong, Y., Huang, B. and Tang, C. (2022). Classification of cassava leaf disease based on a non-balanced dataset using transformer-embedded ResNet. Agriculture. 12(9): 1360. doi: 10.3390/agriculture12091360.

Background: Early prediction in plant diseases enables timely management, reducing crop losses and promoting healthier yields. Deep learning (DL) models are effective tools for an early prediction of plant diseases. Although among these, EfficientNetB3 based models outperform for multi-class plant disease classification but is highly sensitive to noisy or poor-quality input data, such as images with background clutter, varying lighting conditions, or partially damaged leaves and may lead to incorrect classification results, reducing the model’s reliability in real-world agricultural scenarios.

Methods: To overcome these limitations, a multi-scale vision transformer (MSViT) model is proposed to capture multi-scale features for enhancing robustness against noise and background variations. A multi-stage ViT structure is implemented to extract hierarchical image features, while enhanced convolution layers are used to down sample the output of each stage, enabling feature extraction at multiple scales. The size and stride of the convolution kernel are dynamically adjusted according to feature dimensionality to ensure effective representation learning across scales. To reduce the computational complexity, window attention is introduced in each ViT stage with window sizes that vary in proportion to feature dimensions. The multi scale features obtained from different stages are then fused using a feature pyramid network (FPN) to strengthen the localization and representation of disease affected regions. For classification, a modified EfficientNet-B3 model is used to predict diseases. The complete framework is termed as multi-scale vision transformer for EfficientNetB3 (MSViT-EfficientNetB3).

Result: Finally, the proposed model achieves 94.2% of accuracy on Plant Village dataset outperforming existing models in multi class plant disease classification.

The provision of raw resources and nourishment for all forms of life is dependent on agriculture (Dethier and Effenberger, 2012). Plant diseases are the principal factor that substantially affects plant development which may occur due to several factors, including parasitic organisms, fluctuations in soil pH, frequent changes in environmental conditions and reduced moisture levels, all of which significantly affect crop output and growth (Keerthana et al., 2024; Hassan et al., 2022). Reducing the plant diseases is essential for sustaining agricultural output and economic development. But there exists a specific issue in identifying and classifying the kinds of plant illnesses (Fenu and Malloci, 2021) that substantially impact growth production.
       
Traditional prediction models need manual observation by agriculturists, resulting in increased costs, prolonged assessment periods and substantial laboratory setups when operating in extensive agricultural fields (Khakimov et al., 2022). Previously, image processing was extensively used to predict plant diseases by analyzing gathered photos of diseased plants (Shafik et al., 2023; Buja et al., 2021). However, this approach yields diminished prediction accuracy when handling huge pictures and significantly relies on human feature extraction.
       
There is an increase in the use of DL algorithms in the recognition as well as categorization of plant diseases. Disease predictions in cassava were made using a DenseNet model by (Zhong et al., 2022; Arshad et al., 2023) used a U-Net topology to predict a potato disease that can reliably identify disease zones in plant images. InceptionV3 and vision transformers (ViT) were used for prediction and classification. A hybrid model was developed (Tiwari et al., 2024) by a combination of ViT and deep neural network (DNN)for disease detection in tomato leaves. Various convolutional neural network (CNN) architectures developed (Pinaria et al., 2026) was developed for prediction task. However, these models focus on predicting single plant species and their diseases. To make the models better at detecting different kinds of plant diseases, multi-class detection will be best solution.  
       
A CAAR-UNet technique developed (Abinaya et al., 2023) which uses a cascading autoencoder with attention residuals for multiple plant diseases. An EfficientNetB3 adaptive Augmented Learning (EB3AADL) was designed (Adnan et al., 2023) to predict numerous plant disorders. This method increases the accuracy through pre-processing of the images to reduce noises and increases the diversity through augmentation of the images. Dual encoder method was used (Sharma et al., 2026) for multiclass disease detection. The above-mentioned DL models have shown promise in plant disease classification; they suffer from sensitivity to noisy or low-quality input data. Some models perform poorly due to background clutter, inconsistent lighting conditions or partially damaged leaves, which reduce classification accuracy.
       
To address these problems, a multi-scale context Vision Transformer (ViT) model is proposed to extract multi-scale features that make it robust to noise and changes in the background. The proposed approach integrates a multi-scale vision transformer (MSViT), which extracts hierarchical features at multiple spatial resolutions, with an attention-enhanced EfficientNet-B3 that performs multi-class plant disease classification. The MSViT module addresses the limitations of standard ViTs by employing multi-stage feature extraction with enhanced convolution layers that down sample the stage outputs, enabling rich multi-scale representation. Window attention is applied on a stage-by-stage basis with dynamically adjusted window sizes to capture essential spatial information while reducing computational complexity. The multi-scale features obtained from different stages are fused using an FPN to improve localization and representation of disease-affected regions. For classification, a modified EfficientNet-B3 model is used to automatically focus on disease-relevant regions, enhancing prediction accuracy.
The experiment was conducted from June 2024 to March 2026 at the Department of Computer Science and Engineering, Annamalai University, Tamil Nadu, India. This section describes the modules involved in the proposed research. The pipeline of the proposed approach is illustrated in Fig 1.

Fig 1: Pipeline of the proposed model.


 
Dataset description
 
Plant village dataset
 
The PlantVillage dataset contains 54,303 images of both healthy and diseased leaves, categorized into 38 sets that cover 14 different crops and 24 different diseases; examples are shown in Fig 2. The customized dataset used here contains 52 classes. Each image in the dataset has dimensions of 256 × 256 × 3, where the three channels represent the RGB format. After data pre-processing, a total of 6,646 images from the PlantVillage dataset are used for the test and validation sets (3,323 images for the test set and 3,323 for the validation set). After augmentation to 200 images per class, the resulting 10,400 images across the 52 classes are used as the training set. In total, 17,046 images are used in this research work. The dataset is available at: https://www.kaggle.com/emmarex/plantdisease.
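The split sizes quoted above can be cross-checked with a few lines of arithmetic. The sketch below only restates the numbers reported in the text; the constant names are illustrative and nothing is recomputed from the raw dataset.

```python
# Split sizes as reported in the text (not derived from the image files).
NUM_CLASSES = 52
TRAIN_PER_CLASS = 200   # augmented training images per class
TEST_IMAGES = 3323
VAL_IMAGES = 3323

train_images = NUM_CLASSES * TRAIN_PER_CLASS
held_out_images = TEST_IMAGES + VAL_IMAGES
total_images = train_images + held_out_images

print(train_images)     # 10400
print(held_out_images)  # 6646
print(total_images)     # 17046
```

The three totals agree with the 10,400 training images, 6,646 held-out images and 17,046 images overall stated in the dataset description.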

Fig 2: Examples of leaf images from plant village database.


 
Image pre-processing
 
The initial database used in this approach is color-coded in red, green and blue (RGB) and includes noisy images. To enhance the quality of the input images, the method applies image pre-processing operations including scaling, edge detection and noise removal. Data augmentation (Adnan et al., 2023) is also carried out so that the model learns image features better and predicts more accurately. It also helps to minimize the computational load and allows the datasets to be processed more effectively.
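As a rough illustration of these pre-processing steps, the following minimal sketch applies value scaling, a 3 × 3 median filter for noise removal and a horizontal flip as one augmentation. It is not the authors' pipeline: it assumes a single-channel image stored as nested lists, and the function names and the choice of a median filter are illustrative assumptions (edge detection is omitted for brevity).

```python
from statistics import median

def scale_to_unit(image):
    """Rescale 8-bit pixel values to the range [0, 1]."""
    return [[px / 255.0 for px in row] for row in image]

def median_filter_3x3(image):
    """Replace each pixel by the median of its 3x3 neighbourhood.

    Border pixels reuse the nearest valid pixel (index clamping).
    """
    h, w = len(image), len(image[0])
    out = []
    for r in range(h):
        row = []
        for c in range(w):
            window = [
                image[min(max(r + dr, 0), h - 1)][min(max(c + dc, 0), w - 1)]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
            ]
            row.append(median(window))
        out.append(row)
    return out

def horizontal_flip(image):
    """Mirror the image left to right (a simple augmentation)."""
    return [row[::-1] for row in image]

# A flat patch with one "salt" noise pixel in the middle (hypothetical data).
noisy = [
    [10, 10, 10],
    [10, 255, 10],
    [10, 10, 10],
]
denoised = median_filter_3x3(noisy)
scaled = scale_to_unit(denoised)
flipped = horizontal_flip([[1, 2, 3]])
```

In this toy input the isolated bright pixel is removed because eight of its nine neighbourhood values are 10, so the median is 10.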
 
Proposed MSViT
 
The general architecture of the proposed work is shown in Fig 3. A multi-scale vision transformer-based architecture is proposed to detect diseases in plant leaves. The approach builds on the YOLOX detection head, which is used to construct an FPN, extract multi-scale features and predict the location and type of disease on plant leaves. Self-attention operates on three groups of input features: queries (Q), keys (K) and values (V). A weighted sum of the value vectors is computed, with the weights given by the degree of similarity between the query and key vectors. The scaled dot-product attention (Vaswani et al., 2017) is defined in Eq. (1):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V \tag{1}$$

Where,
d = The scaling factor (the dimension of the key vectors).
Q = The query matrix, made up of n_q query vectors.
K and V = The key and value matrices, respectively, each made up of n_k vectors.
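The attention computation described above can be sketched in plain Python. This is a minimal, single-head illustration on nested lists rather than the model's implementation; the function names are illustrative, and d is taken to be the key dimension as in Vaswani et al. (2017).

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Return softmax(Q K^T / sqrt(d)) V for lists of row vectors."""
    d = len(K[0])  # key dimension used as the scaling factor
    outputs = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, V))
            for j in range(len(V[0]))
        ])
    return outputs

# One query that matches both keys equally, so the values are averaged.
result = scaled_dot_product_attention(
    Q=[[1.0, 0.0]],
    K=[[1.0, 0.0], [1.0, 0.0]],
    V=[[1.0, 2.0], [3.0, 4.0]],
)
```

Because both keys are identical here, the two attention weights are equal and the output is the mean of the two value vectors.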
Because global self-attention compares every patch with every other patch, its cost grows quadratically with the number of patches. Window-based multi-head self-attention (W-MSA) (Liu and Zhang, 2025) is used to solve this problem; it executes self-attention inside localized, non-overlapping windows. After the initial patch partitioning, the patches are divided into windows and self-attention is calculated only within each window. As a result, complexity is reduced without sacrificing the ability to learn local features.
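The saving from restricting attention to windows can be made concrete by counting query-key pairs. The sketch below is illustrative only: the 64 × 64 token grid and the window size of 8 are assumed values, not settings reported for the proposed model, and the function names are hypothetical.

```python
def window_partition(height, width, window):
    """Group the token coordinates of an H x W grid into non-overlapping windows."""
    assert height % window == 0 and width % window == 0
    windows = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            windows.append([
                (r, c)
                for r in range(top, top + window)
                for c in range(left, left + window)
            ])
    return windows

def global_attention_pairs(height, width):
    """Query-key pairs when every token attends to every token."""
    tokens = height * width
    return tokens * tokens

def windowed_attention_pairs(height, width, window):
    """Query-key pairs when tokens attend only within their own window."""
    num_windows = (height // window) * (width // window)
    tokens_per_window = window * window
    return num_windows * tokens_per_window ** 2

small = window_partition(4, 4, 2)
global_pairs = global_attention_pairs(64, 64)          # 16777216
windowed_pairs = windowed_attention_pairs(64, 64, 8)   # 262144
print(global_pairs // windowed_pairs)                  # 64
```

Under these assumed sizes, window attention evaluates 64 times fewer query-key pairs than global attention, and the gap widens as the grid grows while the window size stays fixed.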

Fig 3: Overview architecture of multi scale-ViT with EfficientNetB3.



Overall architecture of MSViT
 
In the proposed approach for plant leaf disease diagnosis, the input image of the ViT is separated into patches, and each patch is encoded and represented by a learnable feature vector. The input image is denoted as W × H × C', where W is the width, H is the height and C' is the total number of channels. The default value for the patch size is 4. Following patch partitioning and embedding, the initial input is transformed into $\frac{W}{4} \times \frac{H}{4} \times C$, with C being the embedding dimension. The input length is then $\frac{W}{4} \times \frac{H}{4}$.
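A short worked example shows what the patch size implies for the sequence length. The sketch assumes that each non-overlapping 4 × 4 patch becomes one token and uses the 256 × 256 image size from the dataset description; the function name is illustrative.

```python
def patch_grid(width, height, patch_size=4):
    """Return (columns, rows, token count) after non-overlapping patch partitioning."""
    assert width % patch_size == 0 and height % patch_size == 0
    cols = width // patch_size
    rows = height // patch_size
    return cols, rows, cols * rows

cols, rows, num_tokens = patch_grid(256, 256)
print(cols, rows, num_tokens)  # 64 64 4096
```

A 256 × 256 image therefore yields a 64 × 64 grid of patch embeddings, i.e. 4,096 tokens before any down sampling.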
 
A YOLOX-inspired module called BCSP is introduced, built from BConv and CSP components (Ge et al., 2021). In a CSP, one branch of the channels is processed by a BConv, while the other branch is processed by another BConv followed by a multi-layer BottleNeck module (Ge et al., 2021), which is essentially a residual module. The two outputs are then combined and processed by a BConv. The input of the ith stage is denoted as $I_i$ and its output as $O_i$. Furthermore, the ith stage of the multi-stage ViT is represented by $\mathrm{ViTS}_i$. The procedure of up-dimensioning and cross-channel information interaction is defined in Eq. (2) and Eq. (3):
 
$$O_i = \mathrm{ViTS}_i(I_i) \tag{2}$$

$$I_{i+1} = \mathrm{BCSP}(O_i, K_i, S_i, C_{in}, C_{out}) \tag{3}$$
 
Where $C_{in}$ and $C_{out}$ denote the input and output channels, respectively, and $K_i$ and $S_i$ denote the convolution kernel size and stride, respectively. In the first three stages, $C_{out}$ is twice $C_{in}$ and the kernel size and stride are usually set to 1. The channel configurations for the four stages are {128, 256, 512, 1024}. Down sampling is defined as:
 
$$M_i = \mathrm{BCSP}(O_i, K_i, S_i, C_{in}, C_{out}) \tag{4}$$
                                                                                   
In Eq. (4), $M_i$ is the ith scale feature, K = {1, 3, 5, 9} and S = {1, 2, 4, 8} are the convolution kernel sizes and strides, and $C_{in} = C_{out}$. As a consequence, multi-scale feature maps at different resolutions are produced, one resolution for each of the four scales after down sampling. The output of the ith FPN scale, denoted $F_i$, is given by:
        
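$$F_i = 5 \times \mathrm{CBL}\left[\mathrm{Concat}\left(M_i, \mathrm{UP}\left(\mathrm{CBL}(F_{i+1})\right)\right)\right] \tag{5}$$

$$F_4 = 2 \times \mathrm{CBL}\left[\mathrm{SPP}\left(2 \times \mathrm{CBL}(M_4)\right)\right] \tag{6}$$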
                                  
The symbols used in Eq. (5) and Eq. (6) are as follows: UP denotes up sampling; CBL denotes a block consisting of a convolution layer, a batch normalization layer and a leaky-ReLU activation function; and SPP (spatial pyramid pooling) denotes a CBL block followed by repeated max-pooling operations and concatenation (Ge et al., 2021). The next step is a bottom-up fusion procedure. For the ith scale-specific detection module, let $P_i$ stand for its input, which is specified by Eq. (7):
 
$$P_i = 3 \times \mathrm{CBL}\left[\mathrm{Concat}\left(F_i, \mathrm{BConv}(P_{i-1})\right)\right] \tag{7}$$
                                                     
Each scale-specific feature is fed into an independent detection branch, which independently predicts disease classes, bounding regions and confidence scores. The outputs of all detection branches are concatenated to form the final prediction result, enabling precise and robust detection of plant disease across varying scales and symptom sizes.
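The effect of the kernel sizes and strides in Eq. (4) on feature-map resolution can be sketched with the standard convolution output-size formula. This is an illustration under stated assumptions: the padding of (kernel − 1) / 2 and the 64 × 64 input map are not given in the text, and each kernel-stride pair is applied to the same hypothetical map purely to show how the stride sets the scale.

```python
KERNELS = [1, 3, 5, 9]                    # K from Eq. (4)
STRIDES = [1, 2, 4, 8]                    # S from Eq. (4)
STAGE_CHANNELS = [128, 256, 512, 1024]    # channel configuration of the four stages

def conv_output_size(size, kernel, stride, padding=None):
    """Spatial size after a convolution: floor((size + 2p - k) / s) + 1."""
    if padding is None:
        padding = (kernel - 1) // 2  # assumed padding, not stated in the text
    return (size + 2 * padding - kernel) // stride + 1

def multi_scale_sizes(size):
    """Output size for each (kernel, stride) pair applied to the same input map."""
    return [conv_output_size(size, k, s) for k, s in zip(KERNELS, STRIDES)]

sizes = multi_scale_sizes(64)
print(sizes)  # [64, 32, 16, 8]

# Each stage doubles the channel count of the previous one.
doubles = all(b == 2 * a for a, b in zip(STAGE_CHANNELS, STAGE_CHANNELS[1:]))
print(doubles)  # True
```

With the assumed padding, each stride halves the resolution of the previous scale, which is the pyramid of resolutions that the FPN then fuses.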
 
Multiclass leaf disease classification using EfficientNetB3
 
This research uses EfficientNetB3 to improve performance in multiclass leaf disease classification. EfficientNetB3 belongs to the EfficientNet family of CNNs, which follows a compound scaling principle: a single compound coefficient scales the width, depth and resolution of the network together in a balanced way, so that the network is sized to match the target size (Thai-Nghe et al., 2021). The EfficientNetB3 model is built from mobile inverted bottleneck convolution (MBConv) units, which decrease the computational load by a factor of f², where f is the filter size. These units use kernel dimensions of 3 × 3 and 5 × 5.
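For reference, the compound scaling rule can be written compactly. The formulation below follows the original EfficientNet paper rather than anything defined in this article, so the symbols $\phi$, $\alpha$, $\beta$ and $\gamma$ are introduced here only for illustration: $\phi$ is the compound coefficient, and $\alpha$, $\beta$, $\gamma$ are constants found by a small search on the baseline network.

$$d = \alpha^{\phi}, \qquad w = \beta^{\phi}, \qquad r = \gamma^{\phi}, \qquad \text{subject to } \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2, \quad \alpha \geq 1,\ \beta \geq 1,\ \gamma \geq 1$$

Here d, w and r are the multipliers for depth, width and input resolution. Because the cost of a convolutional network grows roughly linearly with depth and quadratically with width and resolution, the constraint means that increasing $\phi$ by one approximately doubles the computational cost.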
       
With 210 layers and an input shape of 300 × 300 × 3, this model learns complex features more effectively and generalizes better. An additional spatial attention module is used in this model to identify crucial areas in the feature map. Two separate 2D maps, $F'_{avg} \in \mathbb{R}^{1 \times H \times W}$ and $F'_{max} \in \mathbb{R}^{1 \times H \times W}$, are generated by aggregating channel information using average-pooling and max-pooling operations. These maps are then concatenated and convolved to generate a spatial attention map, $F'_S \in \mathbb{R}^{H \times W}$, which indicates where to emphasize or suppress features. The spatial attention is computed as:
 
$$F'_S = \sigma\left(f^{7 \times 7}\left([F'_{avg}; F'_{max}]\right)\right) \tag{8}$$
      
The sigmoid activation function is denoted by σ in Eq. (8) and the convolution layer with a 7 × 7 kernel size is represented by $f^{7 \times 7}$. The model’s outputs are then passed through global average pooling and flattened. Next, a 256-neuron dense unit with ReLU activation and both L1 and L2 regularization is added. A dropout rate of 0.4 is used to prevent overfitting. Lastly, a dense unit with as many neurons as there are classes completes the classifier, and a softmax layer predicts the probability of every category. The model’s efficiency is assessed using the categorical cross-entropy (CE) loss, which measures the difference between the actual labels and the predicted probabilities as follows:

$$L_{CE} = -\sum_{i=1}^{n} t_i \log(p_i) \tag{9}$$

 
In Eq. (9), n is the total number of classes in the dataset under consideration, $t_i$ is the true label and $p_i$ is the predicted softmax probability for the ith class.
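The spatial attention map of Eq. (8) can be illustrated with a small pure-Python sketch. It is not the trained module: the kernel weights below are placeholders chosen for readability (a real model learns them), zero padding is assumed, and the two-channel 7 × 7 convolution is written as the sum of one 7 × 7 kernel applied to the average map and one applied to the max map.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def spatial_attention_map(features, kernel_avg, kernel_max):
    """Return an H x W attention map for a C x H x W feature tensor (nested lists)."""
    channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    # Pool across channels to obtain the two 2D maps F'_avg and F'_max.
    avg_map = [[sum(features[c][r][col] for c in range(channels)) / channels
                for col in range(width)] for r in range(height)]
    max_map = [[max(features[c][r][col] for c in range(channels))
                for col in range(width)] for r in range(height)]
    k = len(kernel_avg)
    pad = k // 2
    attention = []
    for r in range(height):
        row = []
        for col in range(width):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    rr, cc = r + i - pad, col + j - pad
                    if 0 <= rr < height and 0 <= cc < width:  # zero padding
                        acc += kernel_avg[i][j] * avg_map[rr][cc]
                        acc += kernel_max[i][j] * max_map[rr][cc]
            row.append(sigmoid(acc))
        attention.append(row)
    return attention

def center_kernel(k, value):
    """Placeholder k x k kernel that is zero except at its center."""
    kernel = [[0.0] * k for _ in range(k)]
    kernel[k // 2][k // 2] = value
    return kernel

features = [
    [[0.0, 1.0], [2.0, 3.0]],   # channel 0
    [[4.0, 5.0], [6.0, 7.0]],   # channel 1
]
attention = spatial_attention_map(features, center_kernel(7, 1.0), center_kernel(7, 0.0))
neutral = spatial_attention_map(features, center_kernel(7, 0.0), center_kernel(7, 0.0))
```

With these placeholder kernels each attention value is simply the sigmoid of the channel-averaged feature at that position, and all-zero kernels give a neutral map of 0.5 everywhere; multiplying the feature map by the attention map is what emphasizes or suppresses each location.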
 
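The softmax output and the categorical cross-entropy loss described above can be sketched as follows. This is a minimal illustration with hypothetical logits for a three-class problem, not the training code; the small `eps` guard against log(0) is an added assumption.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(true_one_hot, probs, eps=1e-12):
    """Return -sum_i t_i * log(p_i) for a one-hot label and predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(true_one_hot, probs))

# Hypothetical logits; the true class is index 0.
probs = softmax([2.0, 1.0, 0.1])
loss = categorical_cross_entropy([1, 0, 0], probs)

# With two equally likely classes the loss equals ln 2.
uniform_loss = categorical_cross_entropy([1, 0], softmax([0.0, 0.0]))
```

Because the label is one-hot, only the probability assigned to the true class contributes, so the loss falls towards zero as that probability approaches 1.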
Algorithm 1: MSViT-EfficientNetB3 for Multi-class leaf disease classification.
 
Input: 17,046 images from 52 classes drawn from the PlantVillage dataset.
 
Output: Detection of multi class plant diseases.
 
1. Begin.
2. Dataset preparation.
3. Load the customized dataset containing 150 images per class for 52 disease categories with size of 256 × 256 × 3.
4. Image pre-processing.
5. Apply normalization, edge detection and noise reduction to clean the input images.
6. Perform data augmentation (Rotation, flipping, zooming) to increase dataset variability and reduce class imbalance.
7. Multi scale feature extraction using MSViT.
8. Use four-stage vision transformer layers with W-MSA to extract hierarchical features.
9. Use BCSP in each stage to down sample with multiple kernel sizes and strides to extract multi-scale features.
10. Fuse feature maps using FPN for better spatial resolution and disease localization.
11. Feed fused features into EfficientNetB3.
12. Classification layer.
13. Apply the FC and softmax layers to classify images into disease categories.
14. End.
Experimental setup
 
The experiments on multiclass plant disease classification using various DL models are presented in this section. The code is written in Python. The experiments were run on a Windows 10 64-bit laptop with an Intel® Core™ i5-4210 processor running at 3 GHz, 8 GB of RAM and a 1 TB hard drive. Following augmentation, the 52 classes comprised 17,046 images. Of these, 10,400 images were used for training and the remaining 6,646 were divided evenly between validation and testing (3,323 each). The proposed approach is applied to this dataset with the hyperparameter values listed in Table 1.

Table 1: Hyper parameters used in training of MSViT-EfficientNetB3 model.


       
The following assessment metrics are used to evaluate the MSViT-EfficientNetB3 model’s classification efficiency:
 
Evaluation metrics
 
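Accuracy: This is the ratio of correctly classified leaf disease images to the total number of images evaluated, as given in Eq. (10).

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{10}$$

In Eq. (10), TP (true positive) is the correct detection of a positive class, such as a healthy leaf identified as healthy. TN (true negative) is the correct detection of a negative class, such as a diseased leaf identified as diseased. FP (false positive) occurs when a negative sample is predicted as positive (such as a diseased leaf being identified as healthy). FN (false negative) occurs when a positive sample is predicted as negative (such as a healthy leaf being identified as diseased).
 
Precision: This is calculated using Eq. (11).

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{11}$$

Recall: Equation (12) is used to measure it.

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{12}$$

F1-score: It is determined by Eq. (13).

$$\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{13}$$
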
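The four metrics can be computed directly from the confusion counts. The sketch below uses made-up counts for a single positive-versus-negative split purely to show the arithmetic; they are not results from the proposed model, and the function name is illustrative.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall and F1-score from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=90, tn=85, fp=15, fn=10)
print(metrics["accuracy"])  # 0.875
print(metrics["recall"])    # 0.9
```

For a multi-class problem such as the 52-class task here, one common convention (an assumption, since the text does not state which averaging is used) is to compute these per class in a one-versus-rest fashion from the confusion matrix and then average them.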
Fig 4 shows the confusion matrix obtained by testing the proposed technique on the 52 classes. Fig 5 illustrates the training and testing accuracy with respect to the number of epochs on the PlantVillage dataset. The training and testing accuracy of the MSViT-EfficientNetB3 model improve consistently as the number of epochs increases from 0 to 100. The comparable gains in training and testing accuracy imply effective learning and low overfitting.

Fig 4: Confusion matrix of the proposed MSViT-EfficientNetB3 model.



Fig 5: Performance evaluation of proposed model’s training and testing accuracy on PlantVillage dataset.


 
Performance analysis of proposed MSViT-EfficientNetB3 model with standard models
 
In this section, the performance of the MSViT-EfficientNetB3 classifier model on the plant village dataset is compared to popular pre-trained models such as EfficientNetB0, EfficientNetB1, EfficientNetB2, EfficientNetB3 and EfficientNetB3+ViT.
       
Fig 6 shows how well the different standard models perform on the modified PlantVillage database for multiclass leaf disease classification. The analysis shows that the accuracy of the MSViT-EfficientNetB3 model exceeds that of EfficientNetB0, EfficientNetB1, EfficientNetB2, EfficientNetB3 and EfficientNetB3+ViT by 9.15%, 6.20%, 5.61%, 4.55% and 1.73%, respectively. This improvement can be attributed to the model’s ability to attend to the image regions that matter most when identifying plant diseases. By focusing on these regions, the model captures subtle disease symptoms more effectively, which leads to more accurate and reliable classification of plant diseases. A detailed comparison of the findings from the proposed model with those of previous studies is presented in Table 2.

Fig 6: Evaluation of various standard models with the proposed model.



Table 2: Comparison of different existing models with proposed model.

To enhance accuracy and feature localization, this paper presents the MSViT-EfficientNetB3 model, which detects and classifies diseases across numerous categories of plant leaves. The model combines an MSViT with an attention-guided EfficientNet-B3. The MSViT module uses a hierarchical design that extracts features at multiple spatial resolutions with the help of adaptive down sampling and window attention, which preserves important spatial information and substantially lowers computational cost. The model identifies both local and global disease properties by integrating stage convolutions with an FPN that merges the features. The EfficientNet-B3 classifier is further improved by a spatial attention module that highlights disease-relevant areas in the feature maps and yields more precise predictions. This combination enables the model to capture disease symptoms at a fine-grained level across different leaf types and conditions. Extensive experiments on the PlantVillage dataset indicate that the MSViT-EfficientNetB3 model achieves 97.12%, 97.42%, 98.21% and 98.7% accuracy, precision, recall and F1-score, respectively, which are better than those of the existing methods compared in Table 2. Nevertheless, the model shows a minor performance reduction when dealing with low-contrast or highly noisy images. Future studies will explore the incorporation of more sophisticated transformer-based enhancement modules to further improve accuracy in problematic imaging conditions. Further, the proposed model will be extended to diseases affecting other plant parts and to leaves exhibiting multiple diseases.
The present study was supported by the Department of Computer Science and Engineering, Annamalai University, Tamil Nadu, India. The authors would like to thank the department for providing the facilities and support to conduct this research.
 
Disclaimers
 
The views and conclusions expressed in this article are solely those of the authors and do not necessarily represent the views of their affiliated institutions. The authors are responsible for the accuracy and completeness of the information provided, but do not accept any liability for any direct or indirect losses resulting from the use of this content.
The authors declare that there are no conflicts of interest regarding the publication of this article. No funding or sponsorship influenced the design of the study, data collection, analysis, decision to publish, or preparation of the manuscript.

  1. Abinaya, S., Kumar, K.U. and Alphonse, A.S. (2023). Cascading autoencoder with attention residual U-Net for multi-class plant leaf disease segmentation and classification. IEEE Access. 11: 98153-98170. doi: 10.1109/ACCESS.2023.3312345.

  2. Adnan, F., Awan, M.J., Mahmoud, A., Nobanee, H., Yasin, A. and Zain, A.M. (2023). EfficientNetB3-adaptive augmented deep learning (AADL) for multi-class plant disease classification. IEEE Access. 11: 45377-45392. doi: 10.1109/ACCESS.2023.3295678.

  3. Arshad, F., Mateen, M., Hayat, S., Wardah, M., Al-Huda, Z., Gu, Y.H. and Al-antari, M.A. (2023). PLDPNet: End-to-end hybrid deep learning framework for potato leaf disease prediction. Alexandria Engineering Journal. 78: 406-418. doi: 10.1016/j.aej.2023.07.040.

  4. Buja, I., Sabella, E., Monteduro, A.G., Chiriacò, M.S., De Bellis, L., Luvisi, A. and Maruccio, G. (2021). Advances in plant disease detection and monitoring: From traditional assays to in-field diagnostics. Sensors. 21(6): 2129. doi: 10.3390/s21062129.

  5. Dethier, J.J. and Effenberger, A. (2012). Agriculture and development: A brief review of the literature. Economic Systems. 36(2): 175-205. doi: 10.1016/j.ecosys.2011.09.003.

  6. Fenu, G. and Malloci, F.M. (2021). Forecasting plant and crop disease: An explorative study on current algorithms. Big Data and Cognitive Computing. 5(1): 2. doi: 10.3390/bdcc5010002.

  7. Ge, Z., Liu, S., Wang, F., Li, Z. and Sun, J. (2021). YOLOX: Exceeding YOLO series in 2021. CoRR. abs/2107.08430. doi: 10.48550/arXiv.2107.08430.

  8. Hassan, S.M., Amitab, K., Jasinski, M., Leonowicz, Z., Jasinska, E., Novak, T. and Maji, A.K. (2022). A survey on different plant diseases detection using machine learning techniques. Electronics. 11(17): 2641. doi: 10.3390/electronics11172641.

  9. Keerthana, K.G., Prashanth, S.J. and Babu, R.L. (2024). Tomato leaf curl disease: A review. Agricultural Reviews. 45(1): 132-136. doi: 10.18805/ag.R-2346.

  10. Khakimov, A., Salakhutdinov, I., Omolikov, A. and Utaganov, S. (2022). Traditional and current-prospective methods of agricultural plant diseases detection: A review. IOP Conference Series: Earth and Environmental Science. 951: 012002. doi: 10.1088/1755-1315/951/1/012002.

  11. Liu, W. and Zhang, A. (2025). Plant disease detection algorithm based on efficient Swin transformer. Computers, Materials and Continua. 82(2): 3045-3068.

  12. Pinaria, N.A.B., Ngangi, R.C. and Sompotan, S. (2026). Early detection of fungal diseases in tomatoes using convolutional neural networks. Agricultural Science Digest. 45(6): 1018-1022. doi: 10.18805/ag.DF-716.

  13. Salman, Z., Muhammad, A. and Han, D. (2025). Plant disease classification in the wild using vision transformers and mixture of experts. Frontiers in Plant Science. 16: 1522985. doi: 10.3389/fpls.2025.1522985.

  14. Shafik, W., Tufail, A., Namoun, A., De Silva, L.C. and Apong, R.A.A.H.M. (2023). A systematic literature review on plant disease detection: Motivations, classification techniques, datasets, challenges and future trends. IEEE Access. 11: 59174-59203. doi: 10.1109/ACCESS.2023.3283124.

  15. Sharma, A. and Kumar, A. (2026). Dual-encoder variational autoencoder for detection and classification of plant leaf diseases. Indian Journal of Agricultural Research. 60(6): 912-918. doi: 10.18805/IJARe.A-6451.

  16. Thai-Nghe, N., Tri, N.T. and Hoa, N.H. (2021). Deep Learning for Rice Leaf Disease Detection in Smart Agriculture. In International Conference on Artificial Intelligence and Big Data in Digital Era: Springer. pp. 659-670. doi: 10.1007/978-3-030-86192-3_55.

  17. Tiwari, M., Kumar, H., Prakash, N., Kumar, S., Neware, R., Tripathi, S. and Agarwal, R. (2024). Tomato disease detection using vision transformer with residual L1-norm attention and deep neural networks. International Journal of Intelligent Engineering and Systems. 17(1): 679. doi: 10.22266/ijies2024.0229.

  18. Tonmoy, M.R., Hossain, M.M., Dey, N. and Mridha, M.F. (2025). MobilePlantViT: A mobile-friendly hybrid ViT for generalized plant disease image classification. arXiv preprint. arXiv:2503.16628. doi: 10.48550/arXiv.2503.16628.

  19. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I. (2017). Attention Is All You Need. In Advances in Neural Information Processing Systems. pp. 5998-6008.

  20. Zhong, Y., Huang, B. and Tang, C. (2022). Classification of cassava leaf disease based on a non-balanced dataset using transformer-embedded ResNet. Agriculture. 12(9): 1360. doi: 10.3390/agriculture12091360.
Published In
Agricultural Science Digest
