The performance of the CNN model was evaluated based on two different data split ratios: 80:10:10 and 70:15:15. Fig 3 presents the training and validation accuracy as well as the corresponding loss curves over 200 epochs for both configurations. For the 80:10:10 split, the model achieved a training accuracy of 93.80% with a training loss of 0.1576, while the validation accuracy reached 84.38% and the validation loss was 0.3984.
This result indicates that the model learned well from the training data while maintaining reasonable generalization on the validation set. In comparison, the 70:15:15 split resulted in an improved training accuracy of 98.79% and a significantly lower training loss of 0.0466. The validation accuracy also improved to 92.68%; however, the validation loss increased to 0.6901. The higher validation accuracy suggests better generalization, while the increased validation loss may indicate some degree of overfitting, a combination that typically arises when the model makes fewer errors but is more confidently wrong on the errors it does make. The larger validation set in this split may have contributed to a more robust model evaluation but also introduced greater variability.
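As a point of reference, both partitions can be generated with stratified splitting. The sketch below is a minimal illustration assuming scikit-learn and in-memory arrays X and y; the file names, seed, and splitting code are assumptions, as the paper does not specify its data pipeline.

```python
# Minimal sketch (not the paper's code): stratified 80:10:10 split with
# scikit-learn. X, y, the .npy file names, and the random seed are all
# illustrative assumptions.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.load("fish_images.npy")   # hypothetical preprocessed image tensor
y = np.load("fish_labels.npy")   # hypothetical integer class labels

# Hold out 20% first, then split that half-and-half into validation and test.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.50, stratify=y_hold, random_state=42)

# For the 70:15:15 configuration, change the first call to test_size=0.30;
# the second call still splits the hold-out set evenly.
```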
Fig 4 illustrates the confusion matrices for both split ratios, detailing the model’s ability to correctly classify different fish diseases. The model trained on the 70:15:15 split demonstrates improved classification performance, particularly for "Healthy Fish" and "Gill Disease," showing higher true positive values. However, the 80:10:10 split achieves more stable predictions, particularly for "Aeromoniasis" and "White Spot Disease," with lower misclassification rates.
Notably, the 70:15:15 model encounters more misclassifications for “Parasitic Diseases” and “Saprolegniasis,” likely due to higher variance in the training data distribution. The confusion matrices further confirm that although increasing the validation set size in the 70:15:15 split provides a better generalization estimate, it also introduces instability in certain disease classifications.
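A confusion matrix like those in Fig 4 can be computed directly from the predicted labels. The sketch below assumes a trained Keras-style model, the test arrays from the splitting sketch above, and the seven class names mentioned in the text; the class ordering is an assumption.

```python
# Minimal sketch: confusion matrix for the test set. The variables model,
# X_test, and y_test and the class ordering are assumptions.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

class_names = ["Aeromoniasis", "Gill Disease", "Healthy Fish",
               "Parasitic Diseases", "Red Disease", "Saprolegniasis",
               "White Spot Disease"]

y_pred = np.argmax(model.predict(X_test), axis=1)   # highest-probability class
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm, display_labels=class_names).plot(xticks_rotation=45)
plt.tight_layout()
plt.show()
```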
Tables 1 and 2 present the classification metrics at data split ratios of 80:10:10 and 70:15:15, respectively. At the 80:10:10 split ratio, several classes (e.g., Aeromoniasis, Gill disease, Parasitic diseases, Saprolegniasis and White spot disease) exhibit perfect or near-perfect precision (i.e., 1.0).
However, recall for Parasitic diseases is relatively low (0.4286), suggesting the model struggles to correctly identify all positive cases for this class. Healthy Fish also shows a lower precision (0.7188) but a very high recall (0.9583), indicating the model is good at detecting most healthy fish but sometimes misclassifies other classes as healthy. Overall, the 82.81% accuracy indicates solid performance but also reveals potential issues with imbalanced predictions. The high macro-average precision indicates that, on average, the model avoids false positives; however, the lower macro-average recall highlights its tendency to miss some positive samples in specific classes.
When the validation set is larger (70:15:15), the overall accuracy improves to 86.46% and both macro and weighted averages also increase, reflecting better overall performance. Classes like Aeromoniasis, Gill disease and Healthy Fish achieve both high precision and high recall, demonstrating that the model distinguishes these conditions effectively. However, some classes, such as Red disease and Saprolegniasis, show perfect precision (1.0) but relatively low recall (0.6667 and 0.5556, respectively), indicating that while the model rarely mislabels other diseases as Red disease or Saprolegniasis, it does fail to capture a portion of true positive cases in those classes.
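The per-class precision and recall, together with the macro and weighted averages reported in Tables 1 and 2, follow the structure of scikit-learn's standard classification report; a minimal sketch, reusing the assumed variables from above:

```python
# Minimal sketch: per-class precision/recall plus macro and weighted
# averages, matching the structure of Tables 1 and 2.
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred, target_names=class_names, digits=4))
```

Because the macro average weights every class equally, a single weak class (e.g., Parasitic diseases with a recall of 0.4286 at 80:10:10) pulls the macro recall well below the macro precision, exactly the asymmetry noted above.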
Fig 5 shows example outputs from the model’s classification of fish diseases, highlighting the actual disease label, the model’s predicted label and the confidence score for each sample. At the 80:10:10 split, the fish with White spot disease is correctly predicted with 99.94% confidence, the healthy fish is identified with 100% confidence and the fish with Red disease is classified at 99.55% confidence. At the 70:15:15 split, similarly high-confidence predictions for White spot disease, healthy fish and Red disease further illustrate the model’s capability to accurately recognize these conditions.
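Assuming the confidence score in Fig 5 is the maximum softmax probability (the paper does not define it explicitly), the per-sample output can be reproduced as follows:

```python
# Minimal sketch: actual label, predicted label, and confidence for a few
# test samples. Treating the max softmax probability as the confidence
# score is an assumption.
import numpy as np

probs = model.predict(X_test[:3])               # softmax outputs
for true_idx, p in zip(y_test[:3], probs):
    pred_idx = int(np.argmax(p))
    print(f"actual: {class_names[true_idx]:<20}"
          f"predicted: {class_names[pred_idx]:<20}"
          f"confidence: {p[pred_idx]:.2%}")
```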
Fig 6 compares the ROC curves and AUC values for each disease class under the two data splits (80:10:10 vs. 70:15:15). The left plot (80:10:10) shows that each disease class achieves a high AUC, ranging roughly from 0.92 to 0.99. This indicates that, for every disease, the model is able to maintain a high true positive rate (TPR) while keeping the false positive rate (FPR) relatively low. In particular, Red disease exhibits the highest AUC at about 0.993, suggesting the model is especially adept at distinguishing this class from others. Conversely, Saprolegniasis and White spot disease have slightly lower AUC values but still remain above 0.92, reflecting robust performance.
In the right plot (70:15:15), the AUCs remain high, with most classes showing slight improvements over the 80:10:10 split. For instance, Aeromoniasis improves from about 0.9655 to 0.9772 and Parasitic diseases also show a notable increase (0.9500 to 0.9900). As with the 80:10:10 split, Red disease remains one of the best-classified conditions (AUC ≈ 0.9924). Although Saprolegniasis and White spot disease maintain strong AUCs, they still rank slightly lower compared to the other classes. Overall, both ROC plots confirm the model’s strong discriminatory power across all disease classes. The consistently high AUC values, well above the 0.5 random-guess threshold, illustrate that the classifier performs significantly better than chance. The slight variations between the two splits echo earlier findings: while the 70:15:15 split can yield higher accuracy and AUC for certain classes, it may also show signs of overfitting in other metrics (such as validation loss).
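Multi-class ROC curves such as those in Fig 6 are conventionally built one-vs-rest, scoring each class’s softmax probability against the binarized labels; a minimal sketch under that assumption:

```python
# Minimal sketch: one-vs-rest ROC curve and AUC per disease class.
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve
from sklearn.preprocessing import label_binarize

y_scores = model.predict(X_test)                   # softmax probabilities
y_bin = label_binarize(y_test, classes=list(range(len(class_names))))

for i, name in enumerate(class_names):
    fpr, tpr, _ = roc_curve(y_bin[:, i], y_scores[:, i])
    plt.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.4f})")

plt.plot([0, 1], [0, 1], "k--", label="chance (AUC = 0.5)")  # random-guess baseline
plt.xlabel("False positive rate (FPR)")
plt.ylabel("True positive rate (TPR)")
plt.legend(loc="lower right")
plt.show()
```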
Fig 3 highlights that both splits show an upward trend in accuracy and a downward trend in loss. However, the 70:15:15 split exhibits more fluctuation in validation loss, hinting at greater sensitivity to the validation data. Although the 70:15:15 split yields a higher validation accuracy (92.68% vs. 84.38%) and a better overall accuracy (86.46% vs. 82.81%), the increased validation loss (0.6901 vs. 0.3984) suggests overfitting. This finding is further supported by the confusion matrices (Fig 4), where the 70:15:15 model correctly classifies more instances in some diseases but shows instability in others. Many classes achieve near-perfect precision, indicating the model’s caution in labeling diseases. However, certain classes (e.g., Parasitic diseases, Saprolegniasis) have lower recall, meaning the model fails to identify some positive cases. This discrepancy may be due to class imbalance or overlapping visual features among diseases.
Despite the potential for overfitting, the higher validation accuracy under the 70:15:15 split suggests it provides better generalization for most disease categories. Overall, the results indicate that while the 70:15:15 split leads to higher accuracy and AUC values for most classes, the 80:10:10 split exhibits more stable validation loss. Selecting the optimal approach may therefore depend on the specific application requirements (e.g., whether minimizing false negatives or false positives is more critical) and the available dataset size.
To further improve the model, data augmentation or gathering additional samples for underrepresented classes (e.g., Parasitic diseases, Saprolegniasis) could raise their low recall. Regularization methods such as dropout, batch normalization, or weight decay may help mitigate overfitting, which is particularly valuable in the 70:15:15 scenario. Finally, adjusting decision thresholds per class could help balance precision and recall for classes with skewed performance metrics.
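A minimal sketch of the latter two suggestions, assuming a Keras workflow: an illustrative CNN combining dropout, batch normalization, and L2 weight decay, followed by per-class decision thresholds applied to the softmax outputs. The layer sizes, input shape, class indices, and threshold values are assumptions, not the paper’s architecture.

```python
import numpy as np
from tensorflow.keras import layers, models, regularizers

# Illustrative CNN with the three regularizers discussed above; the
# architecture is an assumption, not the paper's.
cnn = models.Sequential([
    layers.Input(shape=(224, 224, 3)),
    layers.Conv2D(32, 3, activation="relu",
                  kernel_regularizer=regularizers.l2(1e-4)),  # weight decay
    layers.BatchNormalization(),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu",
                  kernel_regularizer=regularizers.l2(1e-4)),
    layers.BatchNormalization(),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dropout(0.5),           # randomly zeroes activations during training
    layers.Dense(7, activation="softmax"),
])

# Per-class decision thresholds: lowering the bar for the two weak-recall
# classes (assumed to sit at indices 3 and 5) trades some precision for
# fewer missed positives.
thresholds = np.full(7, 0.5)
thresholds[[3, 5]] = 0.3

def predict_with_thresholds(probs):
    """Pick the highest-probability class among those clearing their
    threshold; fall back to the plain argmax when none clears it."""
    cleared = probs >= thresholds
    masked = np.where(cleared, probs, -1.0)
    return np.where(cleared.any(axis=1),
                    np.argmax(masked, axis=1),
                    np.argmax(probs, axis=1))
```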