The model showed steady improvement in both training and validation accuracy across epochs (Fig 3). At Epoch 1, training accuracy was 0.3100 and validation accuracy 0.5100, with a high validation loss indicating that the model was still learning general patterns. By Epoch 10, training accuracy had increased to 0.8700 and validation accuracy to 0.8900, accompanied by a clear decrease in both training and validation loss, indicating that the model was learning meaningful features. At Epoch 25, training and validation accuracy both reached 0.9400, with a noticeable reduction in validation loss, showing stable performance and fewer misclassifications. Training accuracy reached 0.9600 by Epoch 50, while the highest validation accuracy (0.9600) was observed at Epoch 49; early stopping then restored the best-performing weights to prevent overfitting. The close alignment between training and validation accuracy suggests good generalization and controlled variance throughout training.
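The early-stopping behavior described above can be sketched in a few lines. This is a minimal illustration of the patience-based rule, not the actual training code; the validation-accuracy sequence and patience value are hypothetical:

```python
# Minimal early-stopping sketch: track the best validation accuracy and
# stop once it fails to improve for `patience` consecutive epochs, at which
# point the checkpointed (best) weights would be restored.

def early_stopping_train(val_accuracies, patience=5):
    """Return (best_epoch, best_accuracy) under a simple patience rule."""
    best_epoch, best_acc, wait = 0, float("-inf"), 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best_acc:
            best_epoch, best_acc, wait = epoch, acc, 0  # checkpoint weights here
        else:
            wait += 1
            if wait >= patience:
                break  # stop training; restore the checkpointed weights
    return best_epoch, best_acc

# Hypothetical validation-accuracy trace: the best epoch is identified
# even though training continues for several more epochs.
print(early_stopping_train([0.51, 0.89, 0.94, 0.96, 0.95, 0.95, 0.95, 0.95, 0.95]))
```

In Keras, the equivalent behavior is obtained with the `EarlyStopping` callback and `restore_best_weights=True`.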
To strengthen reliability, the training process was repeated five times with identical hyperparameters and controlled random initialization. Across these runs, the model achieved a mean validation accuracy of 95.31% ± 0.42% and a mean test accuracy of 95.18% ± 0.47%, demonstrating high stability and low performance variance.
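The run-to-run statistics can be reproduced with a short script. The five per-run accuracies below are hypothetical placeholders chosen to match the reported aggregate (the individual run results are not listed in the text), and the confidence interval uses a normal approximation:

```python
import statistics

# Hypothetical per-run validation accuracies (%); only the aggregate
# mean and standard deviation (95.31 ± 0.42) are reported in the paper.
runs = [94.80, 95.31, 95.82, 95.00, 95.62]

mean = statistics.mean(runs)
std = statistics.stdev(runs)  # sample standard deviation (n - 1 denominator)

# Normal-approximation 95% confidence interval for the mean.
half_width = 1.96 * std / len(runs) ** 0.5
print(f"{mean:.2f} ± {std:.2f} (95% CI: {mean - half_width:.2f}-{mean + half_width:.2f})")
```

With only five runs, a t-distribution critical value would give a slightly wider interval than the 1.96 used here.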
The progressive improvement pattern indicates that transfer learning using EfficientNetB0 effectively captured discriminative features. Rapid gains in the initial epochs reflect the advantage of pretrained weights as a strong initialization point. The model began to stabilize around Epoch 25, where training and validation accuracy aligned, indicating neither underfitting nor overfitting. Beyond this stage, improvements were gradual, with only marginal benefit from extended training. The consistent decline in validation loss further confirms efficient optimization. The low standard deviation across repeated runs further suggests that the dataset balance and augmentation strategy contributed to stable and reproducible learning behavior.
The confusion matrix presented in Fig 4 provides a detailed evaluation of the classification performance of the EfficientNetB0 model across the four leaf disease categories. The model demonstrates strong discrimination ability, with the highest accuracy observed for the Healthy class, where 402 images were correctly identified and no samples were misclassified as Rust. Chocolate spot also showed high recognition performance, with 359 correctly classified samples, although a small number were incorrectly predicted as Gall (19), Healthy (13) and Rust (10). Gall-infected leaves achieved similarly robust performance, with 380 correct predictions and only minor confusion with Chocolate spot (15) and Healthy (3), indicating close visual similarities in lesion structure between these categories. Rust exhibited 390 correct classifications, with only a few instances mistakenly labeled as Chocolate spot (4), Gall (2), or Healthy (4). The slightly lower recall observed for Chocolate spot can be attributed to several factors: early-stage chocolate spot lesions often resemble gall or rust infections; necrotic textures overlap under variable lighting conditions; and leaf veins or shadows partially occlude lesions in field images. These visual ambiguities increase confusion, particularly in borderline cases where disease symptoms are not fully developed. Overall, the matrix reflects excellent class-wise consistency, with misclassification rates remaining very low across all categories. The diagonal dominance of the matrix confirms that the EfficientNetB0 model successfully learned discriminative features corresponding to each disease type, demonstrating strong reliability and generalization when applied to real-world field images of faba bean leaves.
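Per-class recall follows directly from the confusion-matrix rows (correct predictions divided by the total true samples of that class). The sketch below uses the fully enumerated counts for Chocolate spot and Rust; the Gall and Healthy rows are only partially listed in the text, so they are omitted here:

```python
# Row-wise recall from the reported confusion-matrix counts.
rows = {
    # class: (correct, [misclassified counts for that row])
    "Chocolate spot": (359, [19, 13, 10]),  # predicted as Gall, Healthy, Rust
    "Rust":           (390, [4, 2, 4]),     # predicted as Chocolate spot, Gall, Healthy
}

recalls = {
    cls: correct / (correct + sum(errors))
    for cls, (correct, errors) in rows.items()
}
print(recalls)  # Chocolate spot: 359/401 ~= 0.8953, Rust: 390/400 = 0.9750
```

These values match the per-class recall figures reported in the classification results below.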
Table 1 reports confidence-aware performance statistics derived from repeated evaluations. The narrow confidence intervals observed across metrics further support the robustness of the proposed model.
The EfficientNetB0 classifier achieves an overall accuracy of 95.39%, demonstrating strong generalization on the unseen test images. Among the individual classes, Rust exhibits the highest precision (0.9701) and strong recall (0.9750), indicating that the model is highly effective in correctly identifying rust-infected leaves while minimizing false positives. The Healthy class achieves the highest recall (0.9950), with the model correctly recognizing nearly all healthy leaf samples, resulting in an excellent F1-score of 0.9734. The Gall class also shows balanced performance with precision and recall values of 0.9453 and 0.9500, respectively, reflecting stable detection capability despite certain visual similarities to Chocolate spot lesions. Although Chocolate spot has slightly lower recall (0.8953), its precision remains high (0.9472), suggesting that misclassifications are limited and occur mainly in borderline or visually overlapping cases. The macro and weighted averages, both close to 0.954, indicate consistent performance across all classes without any major imbalance effects. Overall, the results confirm that the EfficientNetB0 model provides reliable and robust classification of faba bean leaf diseases, capturing subtle morphological differences across disease types with high accuracy.
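The F1-scores above are the harmonic mean of precision and recall, and can be verified directly from the reported values; for example, using Rust's precision and recall:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rust: precision 0.9701, recall 0.9750 (from the classification report).
rust_f1 = f1_score(0.9701, 0.9750)
print(round(rust_f1, 4))  # ~0.9725
```

The macro average applies the same computation per class and averages the results with equal weight, while the weighted average weights each class by its support.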
The ROC analysis further confirms the high robustness of the EfficientNetB0 classifier (Fig 5). All classes exhibit near-perfect separability, with Gall, Healthy and Rust achieving an AUC of 1.00, reflecting flawless sensitivity–specificity trade-offs across decision thresholds. Chocolate spot also performs exceptionally well, with an AUC of 0.99, indicating minimal overlap between positive and negative predictions. The tightly clustered ROC curves near the upper-left corner highlight the model’s strong capability to correctly detect diseased and non-diseased leaves with very low false-positive rates. The micro-average AUC of 1.00 reinforces the consistency of the model when aggregating predictions across all classes, aligning with the high precision, recall and F1-scores reported in the classification metrics.
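For reference, the AUC values in Fig 5 can be interpreted as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. A minimal pairwise implementation, using illustrative one-vs-rest scores rather than the model's actual outputs:

```python
def roc_auc(scores, labels):
    """AUC via the Mann-Whitney U statistic: the fraction of
    positive-negative pairs ranked correctly (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A perfectly separable class (every positive outscores every negative)
# yields AUC = 1.0, as reported for Gall, Healthy and Rust.
print(roc_auc([0.99, 0.95, 0.90, 0.20, 0.10], [1, 1, 1, 0, 0]))
```

In practice a library routine such as scikit-learn's `roc_auc_score` computes the same quantity more efficiently via ranking.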
The PR curve analysis (Fig 6) provides additional insight into the model’s behavior under varying recall thresholds, which is especially important for datasets with class imbalance. The EfficientNetB0 model maintains high precision across nearly the entire recall range, with Healthy and Rust classes showing AP values above 0.99, indicating that almost all retrieved samples are correct even at high recall levels. Gall also demonstrates excellent stability with an AP of 0.9904, while Chocolate spot remains strong at 0.9759 despite being the most challenging class in the ROC evaluation. The sharp plateau of all curves near the upper boundary reflects minimal precision loss as recall increases. The micro-average AP of 0.9911 confirms the model’s consistent ability to distinguish diseased from non-diseased leaves, reinforcing the robustness observed in the classification report and ROC analysis.
In addition to predictive accuracy, computational efficiency was evaluated to support real-world deployment claims. The EfficientNetB0 model contains approximately 5.3 million trainable parameters and requires ~0.39 GFLOPs per inference. On the NVIDIA RTX 3050 GPU, the average inference time per image was 7.6 ms, while CPU-only inference averaged 48 ms per image. These results confirm the model’s suitability for near real-time disease diagnosis on edge devices and mobile platforms.
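Latency figures like those above are typically measured with a warm-up phase followed by an averaged timing loop. The sketch below uses a placeholder function in place of the actual model, so the numbers it prints are not the paper's measurements:

```python
import time

def average_latency_ms(infer, n_warmup=10, n_runs=100):
    """Average wall-clock inference time in milliseconds.
    Warm-up iterations are excluded to avoid one-time setup costs
    (graph compilation, cache population, GPU kernel loading)."""
    for _ in range(n_warmup):
        infer()
    start = time.perf_counter()
    for _ in range(n_runs):
        infer()
    return (time.perf_counter() - start) / n_runs * 1000.0

# Placeholder standing in for model.predict on one preprocessed image.
dummy_infer = lambda: sum(i * i for i in range(1000))
print(f"{average_latency_ms(dummy_infer):.3f} ms per inference")
```

For GPU timing, device synchronization before reading the clock is also needed, since kernel launches are asynchronous.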
Fig 7 provides a qualitative assessment of the EfficientNetB0 model’s performance by displaying sample predictions across all four classes. The model consistently assigns high confidence scores, typically above 0.97, indicating strong certainty in its predictions. Chocolate spot images exhibit dense reddish-brown lesions, which the model identifies with confidence values above 0.99, demonstrating its ability to learn the fine-grained textural patterns characteristic of this disease. Gall samples, marked by distorted and necrotic leaf tissue, are also classified correctly with confidence near 0.996, highlighting the model’s robustness in recognizing irregular morphological deformations. Healthy leaves, despite natural variations in lighting and leaf shape, are predicted accurately with confidence above 0.98, illustrating the model’s resilience to background noise. Rust images, identifiable by circular yellow-brown pustules, are similarly detected with high certainty (≥0.97). Failure cases, although limited, were observed primarily in samples exhibiting early-stage disease symptoms, where lesion development was incomplete and visual cues were subtle (Table 2). Additional misclassifications occurred in leaves showing mixed disease characteristics or strong shadowing effects caused by uneven natural sunlight, which altered color and texture distributions. These qualitative examples reinforce the quantitative results reported earlier, confirming that the model not only performs well statistically but also maintains reliable and interpretable predictions under diverse field-like conditions.
Fig 8 presents a comparative bar graph summarizing the performance of different convolutional neural network (CNN) models used for leaf disease classification across multiple studies. The authors, models, crops, dataset types and accuracies are shown together to highlight variation in methodological approaches and outcomes. Accuracy values range from 91.00% to 100%, showing strong performance across studies, but with notable differences influenced by model choice, crop type and dataset quality.
Abed et al. (2021) used DenseNet121 on manually collected bean leaves and achieved 98.31% accuracy. Sahu et al. (2021) applied VGG16 on 1,296 bean leaf images and obtained 95.31%. Joshi et al. (2021) developed VirLeafNet-1 for Vigna mungo leaves and reached 91.23%, one of the lower values in the comparison. Singh et al. (2023) employed EfficientNetB6 on a dataset of 1,295 bean images and reported 91.74%. Serttaş and Deniz (2023) achieved 98.33% accuracy using ResNet50 on 1,295 bean images, showing strong performance similar to DenseNet-based models. Kursun et al. (2023) reported the highest accuracy (100%) with DenseNet201, although the dataset contained only 60 dry bean images, which may have influenced the outcome.
Among faba bean studies, Salau et al. (2023) used an end-to-end CNN model and achieved 98.14%, Jeong and Na (2024) used a CNN on labeled leaf images, resulting in 91.00%, and Mostafa et al. (2025) used a sequential CNN and reached 98.92% accuracy with an expert-preprocessed dataset. The present study used EfficientNetB0 on 8,021 expert-labeled faba bean images and achieved 95.39%. This performance is higher than several earlier works but slightly lower than models trained on smaller or more specialized datasets.
Real-world deployment challenges
Field-based disease diagnosis presents substantially greater complexity than laboratory-controlled image classification. Variations in illumination caused by uneven sunlight, shadows from overlapping leaves, camera angle differences and background clutter can significantly alter leaf appearance. Such lighting variation was observed to reduce model confidence, particularly for early-stage infections where visual symptoms are subtle and spatially localized. Occlusion caused by overlapping foliage further complicates detection, as partial leaf visibility limits the availability of discriminative texture and color features. In addition, mixed infections, where multiple diseases co-occur on the same leaf, introduce ambiguous visual patterns that challenge single-label classification frameworks. These factors collectively explain why field-acquired datasets typically yield lower accuracy than laboratory datasets, despite more realistic deployment relevance.