In this work, both DenseNet121 and DenseNet201 were trained for 50 epochs and showed smooth and stable convergence. DenseNet121 improved gradually, rising from low initial values in the first epoch (≈25% training and ≈39% validation accuracy) and surpassing 90% validation accuracy by the 12th-14th epoch. Its performance continued to improve until it peaked at ≈96% validation accuracy, after which the curve stabilized. DenseNet201, being deeper and more expressive, converged faster: it exceeded 90% validation accuracy within the first 8-10 epochs and reached its maximum of ≈97% validation accuracy slightly earlier than DenseNet121. In both models, validation loss decreased consistently throughout training, indicating stable optimization without signs of overfitting. Early stopping and checkpointing ensured that the best epoch was preserved for final testing, making the comparison fair and reproducible. Overall, DenseNet201 required fewer epochs to reach its optimal performance, whereas DenseNet121 showed a slower but steady learning trajectory.
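The best-epoch selection described above can be sketched in plain Python. This is a minimal illustration and not the training code used in this work: the monitored quantity (validation loss), the `patience` value, the function name `select_best_epoch` and the toy loss history are all assumptions made for the example.

```python
def select_best_epoch(val_losses, patience=5):
    """Return (best_epoch, stopped_epoch) under simple early stopping.

    Epochs are 1-indexed. Training stops once `patience` consecutive epochs
    pass without a new minimum validation loss. The best epoch is the one
    whose weights a checkpoint would have kept.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New minimum: this is where a checkpoint would be saved.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses)


# Toy validation-loss history (hypothetical values, not from the study).
toy_history = [1.20, 0.80, 0.55, 0.40, 0.42, 0.41, 0.43]
best, stopped = select_best_epoch(toy_history, patience=3)
print(f"best epoch: {best}, stopped at epoch: {stopped}")
```

Evaluating the test set with the weights from `best` rather than from `stopped` is what keeps the comparison between the two models independent of how long each one happened to train.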
Fig 4 presents the combined confusion matrix results for both DenseNet121 and DenseNet201 models. DenseNet121 achieved strong classification performance, correctly identifying 380 Chocolate spot, 379 Gall, 400 Healthy and 389 Rust samples (1,548 of 1,605). Misclassifications were low, with Chocolate spot occasionally predicted as Gall (8), Healthy (2), or Rust (11) and Gall sometimes confused with Chocolate spot (16), Healthy (1), or Rust (4). Healthy samples showed minimal errors (1 to Chocolate spot and 3 to Rust), while Rust displayed small confusion with Chocolate spot (10) and Gall (1). DenseNet201 performed comparably, correctly predicting 367 Chocolate spot, 379 Gall, 398 Healthy and 397 Rust samples (1,541 of 1,605). Errors remained low, with Chocolate spot misclassified as Gall (13), Healthy (2), or Rust (19), while Gall had limited confusion with Chocolate spot (17), Healthy (2) and Rust (2). Healthy samples recorded only four errors in total and Rust achieved very high recall with just three misclassifications. Overall, the two models showed very similar confusion patterns, with most errors occurring between Chocolate spot and the Gall and Rust classes. DenseNet121 classified slightly more samples correctly overall, mainly because it made fewer Chocolate spot errors, whereas DenseNet201 was more accurate on Rust (397 vs. 389 correct).
Table 1 summarises the class-wise Precision, Recall and F1-Scores for both DenseNet121 and DenseNet201. DenseNet121 shows strong and balanced performance across all four classes. Chocolate spot achieves a Precision of 0.9337 and a Recall of 0.9476, indicating reliable detection with few missed cases. Gall reaches a Precision of 0.9768 and an F1-Score of 0.9619, reflecting high consistency in prediction. Healthy shows the highest scores (Precision 0.9926, Recall 0.9901) and demonstrates excellent discrimination. Rust also performs strongly, with an F1-Score of 0.9641. DenseNet201 shows similar trends. Chocolate spot remains accurate (F1-Score 0.9315), while Gall maintains an F1-Score of 0.9523. Healthy remains highly separable with an F1-Score of 0.9876, and Rust attains a Recall of 0.9925 and an F1-Score of 0.9683. Overall accuracy is close for the two models, with DenseNet121 at 0.9645 and DenseNet201 at 0.9601, indicating very similar overall reliability. MCC values support this consistency, with DenseNet121 scoring 0.9527 and DenseNet201 0.9470. Overall, both models show stable class-wise performance, with DenseNet121 slightly ahead in overall accuracy and MCC, while DenseNet201 achieves stronger Recall for Rust and remains competitive across all metrics.
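The class-wise scores, the overall accuracy and the MCC can all be recomputed from the confusion-matrix counts reported for Fig 4. The sketch below does this for DenseNet121 in plain Python. Two cells are not stated explicitly in the text (Healthy predicted as Gall and Rust predicted as Healthy); they are assumed to be 0 here, which makes the matrix sum to the 1,605 test images. The function names are illustrative and not taken from the study's code.

```python
import math

CLASSES = ["Chocolate spot", "Gall", "Healthy", "Rust"]

# Rows = true class, columns = predicted class (DenseNet121 counts from the text).
# Assumption: the two cells not mentioned in the text (Healthy -> Gall and
# Rust -> Healthy) are 0, so the rows sum to 401, 400, 404 and 400.
CM_DENSENET121 = [
    [380, 8, 2, 11],
    [16, 379, 1, 4],
    [1, 0, 400, 3],
    [10, 1, 0, 389],
]


def per_class_metrics(cm):
    """Return a list of (precision, recall, f1) tuples, one per class."""
    n = len(cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1))
    return results


def accuracy(cm):
    """Fraction of samples on the diagonal."""
    total = sum(sum(row) for row in cm)
    return sum(cm[k][k] for k in range(len(cm))) / total


def multiclass_mcc(cm):
    """Multiclass Matthews correlation coefficient from a confusion matrix."""
    n = len(cm)
    s = sum(sum(row) for row in cm)                          # all samples
    c = sum(cm[k][k] for k in range(n))                      # correct samples
    t = [sum(cm[k]) for k in range(n)]                       # true counts
    p = [sum(cm[i][k] for i in range(n)) for k in range(n)]  # predicted counts
    numerator = c * s - sum(pk * tk for pk, tk in zip(p, t))
    denominator = math.sqrt(
        (s * s - sum(pk * pk for pk in p)) * (s * s - sum(tk * tk for tk in t))
    )
    return numerator / denominator if denominator else 0.0


metrics = per_class_metrics(CM_DENSENET121)
acc = accuracy(CM_DENSENET121)
mcc = multiclass_mcc(CM_DENSENET121)

for name, (prec, rec, f1) in zip(CLASSES, metrics):
    print(f"{name:15s} precision={prec:.4f} recall={rec:.4f} f1={f1:.4f}")
print(f"accuracy={acc:.4f} mcc={mcc:.4f}")
```

Under these assumptions the recomputed values agree with the DenseNet121 figures quoted above to four decimal places, which is a useful consistency check between Fig 4 and Table 1.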
Fig 5 illustrates the ROC and Precision–Recall curves for both DenseNet121 and DenseNet201 in a one-vs-rest (OvR) setting. DenseNet121 shows excellent class-wise discrimination, with AUC values of 0.9937 for Chocolate spot, 0.9975 for Gall, 0.9998 for Healthy and 0.9972 for Rust. All curves stay very close to the upper-left boundary, indicating extremely low false-positive rates and strong separability across categories. The corresponding Precision–Recall curves also show high stability, with average precision (AP) values above 0.98 for all classes. Healthy and Rust achieve the highest AP values (0.9993 and 0.9945, respectively), showing near-perfect precision at all recall levels.
DenseNet201 demonstrates similarly strong performance. AUC values remain high, with 0.9936 for Chocolate spot, 0.9972 for Gall, 0.9997 for Healthy and 0.9971 for Rust. The ROC curves closely overlap with those of DenseNet121, confirming consistent model behavior. The Precision–Recall curves also remain close to the top-right corner, with AP values above 0.98 for every class. Healthy again shows the strongest curve (AP = 0.9993), while Rust and Gall maintain stable precision even at high recall. Overall, both models demonstrate exceptional discrimination ability, with DenseNet121 showing marginally higher AP for Rust, while DenseNet201 remains equally competitive across all classes.
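For readers unfamiliar with the one-vs-rest setting, the sketch below shows how a per-class AUC can be obtained from class probabilities: each class in turn is treated as the positive label and all others as negative. It is a plain-Python illustration on a tiny hypothetical set of six predictions, not the evaluation code or data of this study. The AUC is computed by direct pairwise comparison, which is equivalent to the area under the ROC curve but only practical for small inputs.

```python
def binary_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def ovr_auc(true_classes, prob_rows, n_classes):
    """One-vs-rest AUC for each class index in range(n_classes)."""
    aucs = []
    for k in range(n_classes):
        labels = [1 if y == k else 0 for y in true_classes]
        scores = [row[k] for row in prob_rows]
        aucs.append(binary_auc(labels, scores))
    return aucs


# Hypothetical predictions for six images and four classes
# (0 = Chocolate spot, 1 = Gall, 2 = Healthy, 3 = Rust).
toy_true = [0, 1, 2, 3, 0, 1]
toy_probs = [
    [0.70, 0.20, 0.05, 0.05],
    [0.30, 0.60, 0.05, 0.05],
    [0.05, 0.05, 0.85, 0.05],
    [0.10, 0.05, 0.05, 0.80],
    [0.25, 0.40, 0.05, 0.30],  # a Chocolate spot image scored low for its own class
    [0.20, 0.70, 0.05, 0.05],
]
toy_aucs = ovr_auc(toy_true, toy_probs, n_classes=4)
print(toy_aucs)
```

In this toy set only class 0 falls below 1.0, because one negative sample receives a higher class-0 score than one of the true class-0 samples. AUC values near 1, as reported above, mean such rank inversions are rare.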
Fig 6 presents qualitative prediction examples for both DenseNet121 and DenseNet201, showing correctly classified leaf images across all four categories. DenseNet121 achieves high confidence values for every sample, with confidence scores ranging from 0.9992 to 1.0000, indicating strong certainty in its predictions. The model accurately identifies the visual patterns associated with each disease: dark necrotic patches for Chocolate spot, swollen tissue formations for Gall, circular pustules for Rust and smooth, unaffected foliage for Healthy plants.
DenseNet201 displays similar reliability, producing confidence scores of 0.9998 to 1.0000 for all tested images. Its predictions remain precise across diverse lighting conditions and leaf textures, reflecting strong feature extraction and generalization. The clear visual differentiation in lesions, along with consistently high prediction certainty, shows that both models learn distinctive disease characteristics. Overall, the qualitative results support real-world applicability and are consistent with the strong classification performance observed in the quantitative evaluations.
The test dataset contained 1,605 images distributed across four classes, and both DenseNet121 and DenseNet201 were evaluated on the same independent test set under identical conditions. McNemar's test on the paired predictions gave a p-value of 0.4999, indicating no statistically significant difference between the models.
McNemar’s contingency analysis showed that 1,505 samples were correctly classified by both models, while 43 samples were correctly classified only by DenseNet121 and 36 only by DenseNet201; 21 samples were misclassified by both architectures. McNemar’s exact test yielded a p-value of 0.4999, indicating no statistically significant difference in classification performance between the two models (p > 0.05). Although DenseNet121 exhibited marginally higher accuracy (96.45%) than DenseNet201 (96.01%), the dominance of jointly correct predictions and the small imbalance between discordant pairs (43 vs. 36) suggest highly similar decision boundaries and comparable generalization capability. These findings indicate that the observed accuracy difference of 0.44 percentage points does not represent meaningful model superiority under the current experimental configuration. Statistical validation strengthens the rigour of the comparison and guards against overinterpreting minor numerical variations. Future investigations may incorporate full backbone fine-tuning, cross-validation and calibration analysis to explore subtle architectural advantages and make the performance assessment more robust.
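The reported test can be reproduced from the four contingency counts alone. The sketch below uses the standard two-sided exact (binomial) form of McNemar's test, which depends only on the two discordant counts. It is written in plain Python, and the function name `mcnemar_exact` is illustrative and not taken from the study's code.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts.

    Under the null hypothesis each discordant pair is equally likely to
    favour either model, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Contingency counts reported in the text.
both_correct = 1505
only_densenet121 = 43
only_densenet201 = 36
both_wrong = 21

total = both_correct + only_densenet121 + only_densenet201 + both_wrong
acc_121 = (both_correct + only_densenet121) / total
acc_201 = (both_correct + only_densenet201) / total
p_value = mcnemar_exact(only_densenet121, only_densenet201)

print(f"n={total} acc121={acc_121:.4f} acc201={acc_201:.4f} p={p_value:.4f}")
```

The jointly correct and jointly wrong samples do not enter the p-value at all. Only the 43 vs. 36 split among the 79 discordant samples matters, which is why a 0.44 percentage-point accuracy gap on 1,605 images is not significant.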
Previous studies have demonstrated the effectiveness of DenseNet architectures for plant disease detection across various crops and datasets. Table 2 presents a comparative evaluation of DenseNet-based models for plant leaf disease classification. Wei et al. (2022) reported 96.4% accuracy for DenseNet121 on PlantVillage, showing its consistent performance across multiple crops and devices. Bansal et al. (2021) obtained 96.25% accuracy using a DenseNet121-based ensemble for apple leaves, while Odounfa et al. (2025) reported 96.25% classification accuracy for DenseNet121 on chili leaves. Bajpai et al. (2023) achieved 97.78% accuracy by combining DenseNet201 with SVM, outperforming DenseNet121 alone (94%) and highlighting the benefits of hybrid approaches. Enhanced architectures, such as DenseNet201Plus for banana and black gram (Mazumder et al., 2024) and a hybrid CNN-ViT with DenseNet201 for apple and corn (Aboelenin et al., 2025), achieved higher accuracies, while a fine-tuned DenseNet121 on PlantVillage reached 99.81% (Andrew et al., 2022).
In comparison, this study applied DenseNet121 and DenseNet201 to a field-collected fava bean dataset, achieving 96.45% and 96.01% accuracy with MCC values of 0.9527 and 0.9470, respectively. DenseNet201 converged faster and showed slightly better recall for Rust, whereas DenseNet121 achieved slightly higher overall accuracy and MCC. These results indicate that the proposed models provide a practical, field-ready solution for fava bean disease detection, enabling early intervention, minimizing yield losses and supporting precision agriculture practices.