This is our submission to the Vision Capsule Endoscopy Challenge 2024 hosted by MISAHUB. Capsule endoscopy is a wireless endoscopy technique that produces 70,000 to 100,000 image frames per examination. A doctor must then examine each frame to determine the ailment the patient is suffering from, a task that is extremely monotonous, time-consuming, and prone to human error. The VCE challenge aims to solve this problem by having participants train a deep learning model that identifies the frames of interest to the doctor and classifies each frame as showing one of the multiple types of diseases possible in the GI tract. There are 10 classes of diseases in total covered in this challenge:
- Angioectasia
- Bleeding
- Erosion
- Erythema
- Foreign Body
- Lymphangiectasia
- Normal
- Polyp
- Ulcer
- Worms
For our solution, we trained a CNN-based model that classifies each frame into one of the 10 classes. Please check out our model weights and the corresponding paper below.
Download the best_accuracy.ckpt from here
Figshare paper
arXiv paper
Our proposed model uses an EfficientNet-B7 from project MONAI as the backbone, followed by two hidden linear layers with PReLU activations. The output is a 10-node linear layer with softmax activation.
Training and validation were done on the training and validation sets provided by MISAHUB here. The training set was extremely imbalanced: the largest class had over 28,000 images, while the smallest class (Worms) had just over 250. To address this issue we augmented the classes with fewer instances and randomly sampled images from the larger classes. We also tried focal loss and class weights to address the imbalance, but with poor results. The final model was trained with different augmentations for different classes:
- Erosion and Normal: 5,000 images each (Erosion was augmented, while Normal images were randomly sampled)
- Angioectasia and Polyp: 4,000 images each (after augmentation)
- Worms: 1,264 images (after augmentation)
- Remaining classes: 3,000 images each
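One of the balancing strategies mentioned above, class weights, is commonly implemented as normalized inverse-frequency weighting. The sketch below is a minimal version of that idea; the class counts passed in are illustrative, not the exact dataset counts.

```python
def inverse_freq_weights(counts: dict) -> dict:
    """Weight each class by 1/frequency, normalized so the mean weight is 1.0.

    Rare classes get weights above 1, common classes below 1, which scales
    their contribution to a weighted loss (e.g. CrossEntropyLoss(weight=...)).
    """
    inv = {cls: 1.0 / n for cls, n in counts.items()}
    scale = len(inv) / sum(inv.values())
    return {cls: w * scale for cls, w in inv.items()}


# Illustrative counts reflecting the imbalance described in the text:
# the largest class had over 28,000 images, Worms just over 250.
weights = inverse_freq_weights({"Normal": 28000, "Worms": 250})
```

As expected, the rare Worms class receives a far larger weight than Normal, which is the behavior class-weighted losses rely on.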
The loss and accuracy metrics on the training set were as follows:
(Figures: train accuracy vs. train steps; train loss vs. train steps)
The loss and accuracy metrics on the validation set were as follows:
(Figures: validation accuracy vs. validation steps; validation loss vs. validation steps)
The final results on the validation set can be seen through the confusion matrix below.
On the validation set, we achieved a micro accuracy of 0.845 and a macro accuracy of 0.643. The per-class F1-scores outperformed the VGG16 baseline model provided by MISAHUB on every class except Erythema, with an overall accuracy of 0.85 compared to the baseline's 0.71. The model performed poorly on the Erythema class, consistently confusing it with Erosion throughout validation. The comparison with the MISAHUB baseline is shown in the table below.

| Class | VGG16 Precision | VGG16 Recall | VGG16 F1-score | CapsuleNet Precision | CapsuleNet Recall | CapsuleNet F1-score | Support |
|---|---|---|---|---|---|---|---|
| Angioectasia | 0.33 | 0.50 | 0.40 | 0.88 | 0.54 | 0.67 | 497 |
| Bleeding | 0.51 | 0.57 | 0.54 | 0.84 | 0.50 | 0.62 | 359 |
| Erosion | 0.29 | 0.40 | 0.33 | 0.43 | 0.84 | 0.57 | 1155 |
| Erythema | 0.13 | 0.37 | 0.19 | 0.91 | 0.03 | 0.06 | 297 |
| Foreign Body | 0.33 | 0.67 | 0.44 | 0.90 | 0.63 | 0.74 | 340 |
| Lymphangiectasia | 0.37 | 0.51 | 0.43 | 0.83 | 0.61 | 0.70 | 343 |
| Normal | 0.96 | 0.78 | 0.86 | 0.97 | 0.91 | 0.94 | 12287 |
| Polyp | 0.21 | 0.38 | 0.26 | 0.32 | 0.63 | 0.43 | 500 |
| Ulcer | 0.48 | 0.81 | 0.61 | 0.99 | 0.74 | 0.85 | 286 |
| Worms | 0.60 | 0.69 | 0.64 | 0.71 | 1.00 | 0.83 | 68 |
| Accuracy | | | 0.71 | | | 0.85 | 16132 |
| Macro avg | 0.42 | 0.56 | 0.47 | 0.78 | 0.64 | 0.64 | 16132 |
| Weighted avg | 0.81 | 0.71 | 0.75 | 0.90 | 0.85 | 0.85 | 16132 |
Table 1: Result Comparisons
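The micro and macro accuracy figures quoted above can be reproduced directly from the CapsuleNet recall and support columns of Table 1: macro accuracy is the unweighted mean of per-class recall, while micro accuracy is the support-weighted mean (i.e. total correct over total samples).

```python
# CapsuleNet per-class recalls and supports, in Table 1 row order.
recalls = [0.54, 0.50, 0.84, 0.03, 0.63, 0.61, 0.91, 0.63, 0.74, 1.00]
supports = [497, 359, 1155, 297, 340, 343, 12287, 500, 286, 68]

# Macro accuracy: every class counts equally, so the rare-but-perfect
# Worms class and the near-zero Erythema class pull with equal force.
macro = sum(recalls) / len(recalls)

# Micro accuracy: each sample counts equally, so the huge Normal class
# (12,287 of 16,132 samples) dominates the result.
micro = sum(r * s for r, s in zip(recalls, supports)) / sum(supports)

print(round(macro, 3), round(micro, 3))  # → 0.643 0.845
```

The gap between 0.845 (micro) and 0.643 (macro) is itself a summary of the class imbalance: strong performance on the dominant Normal class masks the weak Erythema recall in the micro figure.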