1 Introduction
One of the most prevalent and deadliest diseases of the 21st century is cancer, which is caused by the uncontrolled growth of body cells. It is the second leading cause of death worldwide, causing around 9.6 million deaths every year, or about one-sixth of all deaths globally. Late diagnosis and prognosis of cervical cancer often lead to death without adequate treatment, mostly in poor and middle-income countries where the living index is low and healthcare infrastructure is insufficient.
The Papanicolaou smear, or Pap smear, test is the most common and widely used method in cervical cytology for the screening of abnormal lesions and cervical cancer. However, the assessment of cervical cytology requires expert physicians, which is expensive and time-consuming. Also, inter- and intra-observer variability in assessment, or an incorrect prognosis due to human error, may worsen the patient's condition and can even be fatal in some cases. To avoid ambiguity in diagnosis, people tend to seek the opinions of multiple experts. In the present work, we adopt a similar strategy for the automated detection of cervical cancer: multiple CNN classifiers generate predictions, and their decision scores are ensembled to reach the final prediction.
In the literature, there are some simple fusion schemes nannia2020ensemble that combine CNN features mostly by late fusion of several feature sets, specifically by majority voting, weighted probability ensembles, etc. These schemes combine different CNN features by simple addition, multiplication or averaging. Besides, experiments with different fusion methods reveal some optimum weights that have been used for a weighted mean of the decision scores obtained. Therefore, there remains an opportunity to optimize the fusion of different CNN or machine learning-based classifiers by adaptively adjusting the importance of every classifier for every single image. This can be done by conditioning the weight of one classifier upon the others before it, as is done in a fuzzy fusion method, which remains largely unexplored in the particular domain of cervical cytology. The proposed method performs better than other popularly used fusion methods, as described in Section 4.
The rest of the paper is organised as follows: Section 2 provides a brief literature survey of the existing classification approaches and fusion methods in the domain of cervical cytology; Section 3 is the detailed description of the proposed method, where we have implemented transfer learning-based decision score generation followed by fuzzy fusion; Section 4 contains information about the datasets used and the results we have obtained, along with a comparison with different existing methods; Section 6 is a brief description of the outcome of our experiments and the future improvements that can be made to enhance the classification performance further.
1.1 Overview and Contributions
The high rate of cervical cancer cases in the world, especially in developing and underdeveloped countries, is mainly due to inadequate screening. However, detection of cervical cancer is no easy feat: detecting a single case takes long hours, which makes regular population-wide screening an impossible task. This calls for automation of the detection procedure, and thus in this paper we propose a framework for reliable automated detection of cervical cancer employing deep neural networks and Ensemble Learning using Fuzzy Fusion. The overview of the proposed framework is shown in Figure 1.
The contributions of this paper are as follows:

The Sugeno Fuzzy Integral is introduced for the first time for cervical cell classification to fuse the decision scores of multiple CNN classifiers. Using it, the performance of three popular individual CNN classifiers, namely Inception v3 szegedy2016rethinking, ResNet34 he2016deep and DenseNet161 huang2017densely, is improved through the ensemble technique, which has therefore been used in the present research to make accurate predictions on the small-sized available datasets. We have performed experiments with several ensemble approaches, but the Sugeno Fuzzy Integral-based ensemble outperformed the traditional methods nannia2020ensemble since it is capable of using adaptive weights based on the confidence of predictions by the individual classifiers for each test sample, leading to superior predictions.

The proposed method has been tested on two publicly available datasets: the SIPaKMeD Pap Smear image dataset plissiti2018sipakmed and the Mendeley Liquid Based Cytology dataset hussain2020liquid. The SIPaKMeD dataset has both whole slide images (WSI) and single-cell images (SCI), and hence both types have been used separately for evaluation. Promising results have been obtained by the framework, suggesting that it is reliable for practical use.
Thus we have developed an automated framework for the classification of Pap-stained and Liquid-Based Cytology images using Deep Learning and a novel ensemble approach, a task that is otherwise laborious for cytotechnicians.
2 Literature Survey
For many decades, extensive research has been carried out to develop improved algorithms and methods for computer-aided diagnosis of cervical cancer mitra2021cytology. In the past few years, various machine learning algorithms have been proposed for the detection and classification of cancerous cell images, such as the Support Vector Machine used by Ashok et al. ashok2016comparison and the K-Nearest Neighbour classifier used by sharma2016classification.
Win et al. win2020computer proposed a method in which nuclei were detected using a shape-based iterative method, and the overlapping cytoplasm was separated by a marker-controlled watershed approach. Features were extracted from regions of segmented nuclei and a Random Forest classifier was used for feature selection. For classification, they used a bagging ensemble classifier, which combined the results of LDA, SVM, KNN, boosted trees and bagged trees. They achieved 98.27% accuracy in two-class and 94.09% accuracy in five-class classification on the SIPaKMeD dataset. Jia et al. jia2020detection proposed a new framework based on a strong feature Convolutional Neural Network-Support Vector Machine (CNN-SVM) model to classify cervical cells. Strong features extracted by the Gray-Level Co-occurrence Matrix and Gabor filters were fused with abstract features from the hidden layers of the CNN, and the fused features were input into the SVM for classification. Basak et al. basak2021cervical used a deep learning-based method wherein they extracted deep features from multiple CNN models and applied a two-step feature enhancement procedure using Principal Component Analysis (PCA) and Grey Wolf Optimizer (GWO) to reduce the dimensionality of the feature set for efficient classification.
Ensemble Learning is a popular technique to incorporate the salient features of multiple models. Kuko et al. kuko2019ensemble proposed a method of applying Random Forest classifiers with another layer of ensemble learning based on the rotation of the image. Each image is rotated 8 times by 45 degrees and 8 Random Forest models are trained. After classification, an ensemble voting technique tallies the votes of the 8 models and the most-voted class is selected as the final classification. This method achieved an accuracy of 90.37% on binary classification. Xue et al. xue2020application used a weighted voting-based ensemble learning (EL) method. They developed InceptionV3, Xception, VGG16 and ResNet50 based transfer learning (TL) structures and then, to enhance the classification performance, introduced a weighted voting-based EL strategy. An experiment on classifying benign cells from malignant ones on the Herlev dataset obtained an overall accuracy of 98.37%. Sarwar et al. sarwar2015hybrid used an ensemble system developed using Random Subset Space, Radial Basis Function network, Multiclass Classifier, Random Forest, Bagging, Rotation Forest, J48 graft, Ensemble of Nested Dichotomies (END), Decorate, PART, Random Committee, Filtered Classifier, Decision Table, Multiple Back-Propagation artificial neural network, and Naïve Bayes. The final classification decision is obtained by aggregating the output of all possible candidate trees for the multiclass problem. The overall accuracy of the system for the two-class problem was 98.57% on the Herlev dataset.
3 Proposed Method
In the present study, we have used three different CNN architectures with transfer learning to generate the confidence scores on the datasets: Inception v3, DenseNet161 and ResNet34. The decision scores from these classifiers have been fused using the Sugeno Fuzzy Integral to generate the final predictions. These steps are explained in detail in this section.
3.1 Inception v3
One of the most popular deep learning networks for transfer learning is Inception v3 szegedy2016rethinking, which consists of several inception blocks. It takes a fixed-size input image and produces feature maps of different dimensions in different layers. The inception block allows different filters to be applied to a single feature map for feature extraction. The features obtained with the different filters are concatenated and passed on to the next layer for deeper feature extraction. In this study, we have evaluated Inception v3 with the ReLU activation function. In each case, the model is trained for 100 epochs with cross-entropy loss, which is optimized by an SGD optimizer with a learning rate of 0.001.
3.2 DenseNet161
DenseNet was proposed by huang2017densely to address the vanishing gradient problem in deep neural networks. The building blocks of DenseNets are densely connected to each other, so fewer parameters need to be learnt by the network. These networks have a very narrow architecture and add small sets of feature maps. This network also takes a fixed-size input image. Similar to Inception v3, in our study we have trained this model with an SGD optimizer with a learning rate of 0.001, for 200 epochs. For DenseNet161 also, we have used the ReLU activation function.
3.3 ResNet34
ResNet he2016deep is also an advanced convolutional neural network with residual skip connections embedded in it. There are several versions of ResNet, including ResNet18, ResNet34, ResNet50 and ResNet152. Owing to the skip connections, the vanishing gradient problem is mitigated despite the depth of the architecture. Similar to DenseNet, every version of ResNet expects a standard input image dimension. We have evaluated ResNet34 in this study. To maintain consistency, the number of training epochs, the optimizer, the learning rate, etc. have been fixed to the values mentioned for the above CNNs.
3.4 Ensemble: Sugeno Fuzzy Integral
To leverage the strengths of multiple CNN classifiers instead of a single one, we propose an integration of multiple classifiers utilizing fuzzy fusion in this paper. As shown in Figure 1, the confidence scores from multiple classifiers are treated directly as the input of the fuzzy fusion. Fuzzy fusion has been used previously in pattern recognition tasks, specifically in classifier fusion wu2016fuzzy; liu2009machinery, and has shown promising results. However, no such applications in the domain of cervical cytology have been found so far. The fusion scheme harnesses additional information about a classifier, namely the uncertainty of its decision scores. Fuzzy measures generalize aggregation operators for a set of confidence values by attaching a weight to each source.
If $X = \{x_1, x_2, \ldots, x_N\}$ is the set of classifiers, the fuzzy measure is the worth value of a subset $S \subseteq X$ and, as introduced in sugeno1993fuzzy, takes values in the range [0, 1]; it can be represented by the function $g: 2^X \rightarrow [0, 1]$. $g(S) = 1$ represents that the classifiers in $S$ can be considered consistent, whereas $g(S) = 0$ represents that they cannot be trusted because their results are inconsistent. For all $A, B \subseteq X$, the fuzzy measure is characterized by the monotonicity property in Equation 1.
$$A \subseteq B \implies g(A) \leq g(B) \tag{1}$$
The fuzzy density $g_i = g(\{x_i\})$ is defined as the fuzzy measure of the set $S$ when $S$ contains a single element, and is the measure of the worth value of an individual classifier. Some studies have used fuzzy density values predefined based on the experience of the researcher; however, that does not ensure superior integration of the classifiers. Instead, following the original work of tong2016speech, we have set the fuzzy density values equal to the classifier accuracy measured on the test set, to give weight to the better classifiers and penalize the inferior ones. Following the work of tahani1990information, the Sugeno fuzzy measure has the additional characteristic that if $A \cap B = \emptyset$, then there is always a $\lambda$ such that:
$$g(A \cup B) = g(A) + g(B) + \lambda\, g(A)\, g(B) \tag{2}$$
The value of $\lambda$ can be obtained by solving the following equation, where $\lambda$ is greater than $-1$.
$$\lambda + 1 = \prod_{i=1}^{N} \left(1 + \lambda g_i\right) \tag{3}$$
where $N$ is the number of CNN classifiers, which is 3 in our case. $\lambda$ has the following characteristics:

$\lambda = 0$ when $\sum_{i=1}^{N} g_i = 1$

$-1 < \lambda < 0$ when $\sum_{i=1}^{N} g_i > 1$

$\lambda > 0$ when $\sum_{i=1}^{N} g_i < 1$
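As a worked illustration of Equation 3 (a sketch, not taken from the experiments in this paper), consider the two-classifier case. Writing $g_1$ and $g_2$ for the two fuzzy densities and $\lambda$ for the Sugeno parameter, the equation has a closed-form non-zero root. The values $g_1 = 0.1$ and $g_2 = 0.5$ are assumed here purely for illustration.

$$\lambda + 1 = (1 + \lambda g_1)(1 + \lambda g_2) \;\Longrightarrow\; \lambda = \lambda\,(g_1 + g_2) + \lambda^2 g_1 g_2 \;\Longrightarrow\; \lambda = \frac{1 - g_1 - g_2}{g_1 g_2} \quad (\lambda \neq 0)$$

$$g_1 = 0.1,\; g_2 = 0.5: \qquad \lambda = \frac{1 - 0.6}{0.05} = 8, \qquad (1 + 0.8)(1 + 4) = 9 = \lambda + 1$$

Because the two densities sum to less than one, the root is positive; densities summing to more than one would instead give a root between $-1$ and $0$.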
Among the existing fuzzy integral and aggregation methods, such as fuzzy min-max mesiar2008fuzzy, ordered weighted averaging operators like ordered weighted averaging OR (OWA-OR) cheng2012combining and ordered weighted averaging AND (OWA-AND) cho1995fuzzy, the Sugeno integral sugeno1993fuzzy and the Choquet integral murofushi1989interpretation, we have implemented the Sugeno and Choquet integrals in this work and have selected the better result of the two. The steps for calculating the fuzzy integrals are described below.
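First, the $N$ classifiers are sorted in descending order of their output scores:
$$h(x_1) \geq h(x_2) \geq \cdots \geq h(x_N) \tag{4}$$
where $h(x_i)$ represents the $i$-th largest output value of the classifiers for $i = 1, 2, \ldots, N$, and $N$ is the number of classifiers. Next, we calculate the Choquet and Sugeno fuzzy integrals by means of:
$$\text{Choquet} = \sum_{i=1}^{N} \left[ h(x_i) - h(x_{i+1}) \right] g(A_i), \qquad h(x_{N+1}) = 0 \tag{5}$$
$$\text{Sugeno} = \max_{1 \leq i \leq N} \, \min\left( h(x_i),\; g(A_i) \right) \tag{6}$$
where $g(A_i)$, the fuzzy measure of the set $A_i = \{x_1, \ldots, x_i\}$ of the $i$ top-ranked classifiers, follows from the definition of Sugeno fuzzy measures, with $g_i$ the fuzzy density of classifier $x_i$ and $\lambda$ the solution of Equation 3:
$$g(A_1) = g_1, \qquad g(A_i) = g_i + g(A_{i-1}) + \lambda\, g_i\, g(A_{i-1}), \quad 1 < i \leq N \tag{7}$$
Thus, through fuzzy integrals, the robustness is experimentally found to be higher than that of the previously obtained normalized softmax probabilities, and the time complexity of the algorithm depends on the number of classifiers and the number of classes. The pseudocode for computing the Sugeno Fuzzy Integral for the ensemble of CNN classifier decisions is shown in Algorithm 1.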
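The fusion steps described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`solve_lambda`, `sugeno_integral`, `choquet_integral`, `fuse_scores`) are hypothetical, the Sugeno parameter is found by simple bisection, at least two classifiers with densities strictly between 0 and 1 are assumed, and the example scores and densities are made up.

```python
from typing import Callable, List, Sequence, Tuple


def solve_lambda(densities: Sequence[float], iters: int = 200) -> float:
    """Non-zero root (> -1) of lambda + 1 = prod(1 + lambda * g_i), by bisection."""
    total = sum(densities)
    if abs(total - 1.0) < 1e-9:
        return 0.0  # densities already sum to one: additive measure

    def f(lam: float) -> float:
        product = 1.0
        for g in densities:
            product *= 1.0 + lam * g
        return product - (lam + 1.0)

    if total < 1.0:
        # Positive root: f < 0 between 0 and the root, f >= 0 beyond it.
        lo, hi = 0.0, 1.0
        while f(hi) < 0.0 and hi < 1e12:
            hi *= 2.0
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            if f(mid) < 0.0:
                lo = mid
            else:
                hi = mid
    else:
        # Root in (-1, 0): f > 0 left of the root, f <= 0 right of it.
        lo, hi = -1.0, 0.0
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            if f(mid) > 0.0:
                lo = mid
            else:
                hi = mid
    return 0.5 * (lo + hi)


def _rank(scores: Sequence[float], densities: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Sort scores in descending order and reorder the densities to match."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [scores[i] for i in order], [densities[i] for i in order]


def _cumulative_measures(ranked_densities: Sequence[float], lam: float) -> List[float]:
    """g(A_1), ..., g(A_N) from the lambda-measure recursion."""
    measures: List[float] = []
    for g_i in ranked_densities:
        if not measures:
            measures.append(g_i)
        else:
            prev = measures[-1]
            measures.append(g_i + prev + lam * g_i * prev)
    return measures


def sugeno_integral(scores: Sequence[float], densities: Sequence[float], lam: float) -> float:
    h, g = _rank(scores, densities)
    return max(min(h_i, m_i) for h_i, m_i in zip(h, _cumulative_measures(g, lam)))


def choquet_integral(scores: Sequence[float], densities: Sequence[float], lam: float) -> float:
    h, g = _rank(scores, densities)
    h_next = h[1:] + [0.0]  # h(x_{N+1}) = 0
    return sum((a - b) * m for a, b, m in zip(h, h_next, _cumulative_measures(g, lam)))


def fuse_scores(
    prob_rows: Sequence[Sequence[float]],
    densities: Sequence[float],
    integral: Callable[[Sequence[float], Sequence[float], float], float] = sugeno_integral,
) -> Tuple[int, List[float]]:
    """Fuse per-classifier class probabilities; one row per classifier."""
    lam = solve_lambda(densities)
    n_classes = len(prob_rows[0])
    fused = [integral([row[c] for row in prob_rows], densities, lam) for c in range(n_classes)]
    return max(range(n_classes), key=fused.__getitem__), fused


# Made-up example: two classifiers, two classes, densities 0.1 and 0.5.
label, fused = fuse_scores([[0.7, 0.3], [0.4, 0.6]], [0.1, 0.5])
print(label, fused)
```

In this toy example the two classifiers disagree, and the fused decision follows the classifier with the larger fuzzy density, which is the behaviour the adaptive weighting is meant to provide.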
4 Results and Discussion
In this section, we first briefly describe the two publicly available datasets used. Then we evaluate the performance of the proposed framework on these datasets and compare the results to other popular approaches used in literature to justify the viability of the method used.
4.1 Description of Datasets
In the present study, we have used two publicly accessible datasets: the SIPaKMeD Pap Smear dataset by Plissiti et al. plissiti2018sipakmed and the Mendeley Liquid Based Cytology (LBC) dataset by Hussain et al. hussain2020liquid, which are briefly described below.
4.1.1 SIPaKMeD Pap Smear Dataset
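Table 1: Distribution of images in the SIPaKMeD dataset.

| Class | Category | WSI | SCI |
|---|---|---|---|
| 1 | Superficial-Intermediate | 126 | 831 |
| 2 | Parabasal | 108 | 787 |
| 3 | Koilocytotic | 238 | 825 |
| 4 | Metaplastic | 271 | 793 |
| 5 | Dyskeratotic | 223 | 813 |
| | Total | 966 | 4049 |
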
The SIPaKMeD dataset consists of 4049 images of isolated cells that have been manually cropped from 966 cluster cell images of Pap smear slides, which are also included. The cells are classified into five different classes by expert cytopathologists. Normal cells are divided into two categories (superficial-intermediate and parabasal), abnormal but not malignant cells are divided into two categories (koilocytes and dyskeratotic), and the final category is benign (metaplastic) cells. Both the Whole Slide Images (WSI) and the Single Cell Images (SCI) have been used separately for the present study. The distribution of images in the dataset is shown in Table 1.
4.1.2 Mendeley Liquid Based Cytology Dataset
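Table 2: Distribution of images in the Mendeley Liquid Based Cytology dataset.

| Class | Category | Number of Images |
|---|---|---|
| 1 | NILM | 613 |
| 2 | HSIL | 113 |
| 3 | LSIL | 163 |
| 4 | SCC | 74 |
| | Total | 963 |
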
The Liquid Based Cytology (LBC) Dataset by Hussain et al. hussain2020liquid contains 963 images classified into four different classes. The Pap smear images were captured at 40x magnification and were collected and prepared using the liquid-based cytology technique from 460 patients. In this dataset, 613 images belong to the normal cell category and 350 images belong to the abnormal cell category. The distribution of these images is given in Table 2.
4.2 Metrics for Performance Evaluation
For our work, we have used Accuracy, Precision, Recall and F1-score for evaluating the performance of the proposed framework. True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN) are the basic elements that determine the values of these metrics, and they are defined as follows:

True Positive: The predicted result is positive and the sample is labelled as positive.

False Positive: The predicted result is positive, while the sample is labelled as negative. This is also called a Type I error.

True Negative: The predicted result is negative and the sample is labelled as negative.

False Negative: The predicted result is negative, while the sample is labelled as positive. This is also called a Type II error.
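Based on these four elements, we can calculate the metrics: accuracy, precision, recall and F1-score. For a multi-class system with $K$ classes, if we have a confusion matrix $M$, with the rows depicting the predicted class and the columns depicting the true class, these evaluation metrics can be formulated as in Equations 8, 9, 10 and 11.
$$\text{Accuracy} = \frac{\sum_{i=1}^{K} M_{ii}}{\sum_{i=1}^{K} \sum_{j=1}^{K} M_{ij}} \tag{8}$$
$$\text{Precision}_i = \frac{M_{ii}}{\sum_{j=1}^{K} M_{ij}} \tag{9}$$
$$\text{Recall}_i = \frac{M_{ii}}{\sum_{j=1}^{K} M_{ji}} \tag{10}$$
$$\text{F1-score}_i = \frac{2 \times \text{Precision}_i \times \text{Recall}_i}{\text{Precision}_i + \text{Recall}_i} \tag{11}$$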
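The metric definitions above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `classification_metrics` and the example matrix are assumptions, and the confusion matrix follows the convention stated in this section (rows are predicted classes, columns are true classes).

```python
from typing import Dict, List, Sequence


def classification_metrics(cm: Sequence[Sequence[int]]) -> Dict[str, object]:
    """Accuracy plus per-class precision, recall and F1 from a confusion matrix.

    cm[i][j] counts samples predicted as class i whose true class is j.
    """
    k = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(k)) / total

    precision: List[float] = []
    recall: List[float] = []
    f1: List[float] = []
    for i in range(k):
        predicted_as_i = sum(cm[i])                      # row sum
        truly_i = sum(cm[j][i] for j in range(k))        # column sum
        p = cm[i][i] / predicted_as_i if predicted_as_i else 0.0
        r = cm[i][i] / truly_i if truly_i else 0.0
        precision.append(p)
        recall.append(r)
        f1.append(2 * p * r / (p + r) if (p + r) else 0.0)

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up two-class example.
metrics = classification_metrics([[8, 1], [2, 9]])
print(metrics)
```

How the per-class values are combined into a single aggregate figure (for example, a macro or a weighted average) is not covered by this sketch.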
4.3 Implementation
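Table 3: Class-wise and aggregate results of the proposed framework on the three datasets.

| Dataset | Class | Precision (%) | Recall (%) | F1-Score (%) | Accuracy (%) |
|---|---|---|---|---|---|
| Mendeley LBC | High Squamous Intraepithelial Lesion | 100 | 93.75 | 96.77 | 93.75 |
| Mendeley LBC | Low Squamous Intraepithelial Lesion | 100 | 100 | 100 | 100 |
| Mendeley LBC | Negative for Intraepithelial Malignancy | 100 | 100 | 100 | 100 |
| Mendeley LBC | Squamous Cell Carcinoma | 88.24 | 100 | 93.75 | 100 |
| Mendeley LBC | Aggregate | 99.08 | 98.95 | 98.97 | 99.48 |
| SIPaKMeD WSI | Dyskeratotic | 91.67 | 97.78 | 94.62 | 97.78 |
| SIPaKMeD WSI | Koilocytotic | 100 | 87.5 | 93.33 | 87.5 |
| SIPaKMeD WSI | Metaplastic | 96.36 | 96.36 | 96.36 | 96.36 |
| SIPaKMeD WSI | Parabasal | 100 | 100 | 100 | 100 |
| SIPaKMeD WSI | Superficial-Intermediate | 89.66 | 100 | 94.55 | 100 |
| SIPaKMeD WSI | Aggregate | 95.54 | 96.33 | 95.77 | 96.33 |
| SIPaKMeD SCI | Dyskeratotic | 98.18 | 99.39 | 98.78 | 99.39 |
| SIPaKMeD SCI | Koilocytotic | 98.75 | 95.18 | 96.93 | 95.15 |
| SIPaKMeD SCI | Metaplastic | 95.73 | 98.74 | 97.21 | 98.74 |
| SIPaKMeD SCI | Parabasal | 100 | 100 | 100 | 100 |
| SIPaKMeD SCI | Superficial-Intermediate | 100 | 99.4 | 99.7 | 99.4 |
| SIPaKMeD SCI | Aggregate | 98.55 | 98.53 | 98.53 | 98.54 |
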
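Table 4: Results obtained by the base classifiers before the ensemble.

| Dataset | Model | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) |
|---|---|---|---|---|---|
| Mendeley LBC | Inception v3 | 97.96 | 97.95 | 97.56 | 97.75 |
| Mendeley LBC | DenseNet161 | 98.44 | 97.06 | 98.44 | 97.63 |
| Mendeley LBC | ResNet34 | 98.12 | 97.96 | 98.12 | 98.04 |
| SIPaKMeD WSI | Inception v3 | 88.27 | 88.63 | 89.97 | 89.13 |
| SIPaKMeD WSI | DenseNet161 | 93.08 | 92.38 | 93.08 | 92.69 |
| SIPaKMeD WSI | ResNet34 | 93.28 | 92.22 | 93.29 | 92.62 |
| SIPaKMeD SCI | Inception v3 | 94.34 | 94.31 | 94.38 | 94.31 |
| SIPaKMeD SCI | DenseNet161 | 97.28 | 97.28 | 97.29 | 97.28 |
| SIPaKMeD SCI | ResNet34 | 97.17 | 97.27 | 97.22 | 97.19 |
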
The datasets used in the present study have been split into train, validation and test sets in a 3:1:1 ratio. The three pretrained CNN models have been fine-tuned on the datasets by freezing the weights of the top 5 layers and training for 50 epochs. The probability distributions output by the models have been saved and fused for the final classification using the Sugeno Fuzzy Integral. The confusion matrices thus obtained on the test sets of the respective datasets are shown in Figure 2. The class-wise results and the aggregate results over all classes are tabulated in Table 3 for the three datasets used. The results obtained before the ensemble, that is, the results obtained by the base classifiers, are shown in Table 4.
4.4 Verification of Complementarity of Features
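Table 5: KLD and JSD between the decision scores of each pair of CNN classifiers on the Mendeley LBC dataset.

| Distribution P | Distribution Q | $D_{KL}(P \| Q)$ | $D_{JS}(P \| Q)$ |
|---|---|---|---|
| Inception v3 | DenseNet161 | 2.356 | 0.131 |
| DenseNet161 | Inception v3 | 0.611 | 0.131 |
| Inception v3 | ResNet34 | 6.055 | 0.115 |
| ResNet34 | Inception v3 | 0.540 | 0.115 |
| DenseNet161 | ResNet34 | 3.300 | 0.162 |
| ResNet34 | DenseNet161 | 0.884 | 0.162 |
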
To verify the complementary or dissimilar nature of the features of the pretrained models used to extract the confidence scores of the datasets, two statistical divergence metrics are used: the Kullback-Leibler Divergence (KLD) kullback1951information; kullback1997information and the Jensen-Shannon Divergence (JSD) menendez1997jensen.

Table 6: KLD and JSD between the decision scores of each pair of CNN classifiers on the SIPaKMeD WSI dataset.

| Distribution P | Distribution Q | $D_{KL}(P \| Q)$ | $D_{JS}(P \| Q)$ |
|---|---|---|---|
| Inception v3 | DenseNet161 | 0.502 | 0.152 |
| DenseNet161 | Inception v3 | 0.367 | 0.152 |
| Inception v3 | ResNet34 | 0.584 | 0.156 |
| ResNet34 | Inception v3 | 0.418 | 0.156 |
| DenseNet161 | ResNet34 | 0.212 | 0.132 |
| ResNet34 | DenseNet161 | 0.208 | 0.132 |

The KLD is a non-symmetric measure of dissimilarity between two probability distributions on the same probability space. Let there be a probability space $\mathcal{X}$ and two probability distributions $P$ and $Q$ on this space such that, for every discrete variable $x \in \mathcal{X}$, $P(x) > 0$ and $Q(x) > 0$. Then the discrete form of the KLD from $Q$ to $P$ is given as Equation 12, $\ln$ being the natural logarithm.
$$D_{KL}(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \ln \frac{P(x)}{Q(x)} \tag{12}$$
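Table 7: KLD and JSD between the decision scores of each pair of CNN classifiers on the SIPaKMeD SCI dataset.

| Distribution P | Distribution Q | $D_{KL}(P \| Q)$ | $D_{JS}(P \| Q)$ |
|---|---|---|---|
| Inception v3 | DenseNet161 | 0.355 | 0.129 |
| DenseNet161 | Inception v3 | 0.160 | 0.129 |
| Inception v3 | ResNet34 | 0.250 | 0.134 |
| ResNet34 | Inception v3 | 0.337 | 0.134 |
| DenseNet161 | ResNet34 | 0.178 | 0.118 |
| ResNet34 | DenseNet161 | 0.211 | 0.118 |
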
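As $D_{KL}(P \| Q) \neq D_{KL}(Q \| P)$ in general, a symmetric statistical divergence has been derived from the KLD, called the Jensen-Shannon Divergence (JSD). The JSD is effectively a smoothed form of the KLD. For the same probability distributions $P$ and $Q$ as mentioned above, let $M$ be another probability distribution such that $M = \frac{1}{2}(P + Q)$. Then the JSD (discrete form) is given by Equation 13.
$$D_{JS}(P \| Q) = \frac{1}{2} D_{KL}(P \| M) + \frac{1}{2} D_{KL}(Q \| M) \tag{13}$$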
The KLD and JSD measures between the decision scores of each pair of CNN classifiers are shown in Tables 5, 6 and 7 for the Mendeley LBC, SIPaKMeD WSI and SIPaKMeD SCI datasets, respectively.
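The two divergences used in this section can be sketched as follows for a single pair of discrete distributions. This is only an illustration: the function names and the example vectors are assumptions, both inputs are assumed to be strictly positive probability vectors of equal length, and how per-sample values are aggregated over a whole test set is not specified here.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Discrete Kullback-Leibler divergence D_KL(P || Q) with natural logarithm.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute zero.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: average KLD of P and Q to their mixture M."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Made-up softmax outputs of two classifiers for one image.
p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q), kl_divergence(q, p), js_divergence(p, q))
```

The KLD changes when its arguments are swapped, whereas the JSD does not, which is why each classifier pair has two KLD entries but a single JSD value.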
4.5 Comparison with different backbone CNNs
It has been mentioned earlier that we have evaluated the fuzzy measure over three popularly used pretrained CNNs, namely Inception v3, DenseNet161 and ResNet34. The results obtained by combinations of these models on all three datasets are given in Figure 3. It can be observed that for SIPaKMeD SCI the best performance is achieved by combining the probability distributions of all three neural networks, whereas for the SIPaKMeD WSI and Mendeley LBC datasets the ensemble of the Inception v3 and DenseNet161 models achieves the best result.
4.6 Comparison with Other Ensemble Approaches






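Table 8: Accuracy (%) of different ensemble approaches on the three datasets.

| Ensemble Method | Mendeley LBC | SIPaKMeD WSI | SIPaKMeD SCI |
|---|---|---|---|
| Majority Voting | 95.68 | 94.37 | 97.64 |
| Average | 94.76 | 93.88 | 97.29 |
| | 98.96 | 95.11 | 98.03 |
| Product Rule | 92.15 | 93.37 | 97.29 |
| Maximum Rule | 92.15 | 94.89 | 97.54 |
| | 98.96 | 95.41 | 98.40 |
| Sugeno Fuzzy Integral (proposed) | 99.48 | 96.33 | 98.54 |
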
The same probability distributions have been used to compute the predictions on the datasets using some popular ensembling procedures. The results thus obtained have been tabulated in Table 8. Among the ensemble techniques used, the weighted probability average ensemble with weights {0.5, 2.0, 1.0} for {Inception v3, DenseNet161, ResNet34} (weights set experimentally) gave results closest to the Sugeno Fuzzy Integral ensemble. The Sugeno Fuzzy Integral fuses {Inception v3, DenseNet161} for the Mendeley LBC and SIPaKMeD WSI datasets and {Inception v3, DenseNet161, ResNet34} for the SIPaKMeD SCI dataset. The fuzzy measures have been set through extensive experiments on multiple runs of the framework.
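For reference, the conventional ensemble rules named above can be sketched as follows. This is a generic illustration rather than the code used for Table 8: the function names and the example probabilities are assumptions, ties in majority voting are broken in favour of the class voted for first, and only the weights {0.5, 2.0, 1.0} come from the text.

```python
from collections import Counter
from math import prod
from typing import List, Sequence


def argmax(values: Sequence[float]) -> int:
    return max(range(len(values)), key=values.__getitem__)


def majority_vote(probs: Sequence[Sequence[float]]) -> int:
    """Each classifier votes for its top class; the most frequent vote wins."""
    votes = Counter(argmax(p) for p in probs)
    return votes.most_common(1)[0][0]


def weighted_average(probs: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Weighted mean of the class probabilities across classifiers."""
    total = sum(weights)
    n_classes = len(probs[0])
    return [sum(w * p[c] for w, p in zip(weights, probs)) / total for c in range(n_classes)]


def average_rule(probs: Sequence[Sequence[float]]) -> List[float]:
    return weighted_average(probs, [1.0] * len(probs))


def product_rule(probs: Sequence[Sequence[float]]) -> List[float]:
    return [prod(p[c] for p in probs) for c in range(len(probs[0]))]


def maximum_rule(probs: Sequence[Sequence[float]]) -> List[float]:
    return [max(p[c] for p in probs) for c in range(len(probs[0]))]


# Made-up scores from three classifiers (Inception v3, DenseNet161, ResNet34 order).
scores = [[0.60, 0.40], [0.30, 0.70], [0.55, 0.45]]
print(majority_vote(scores))
print(argmax(average_rule(scores)))
print(argmax(weighted_average(scores, [0.5, 2.0, 1.0])))
print(argmax(product_rule(scores)))
print(argmax(maximum_rule(scores)))
```

Unlike the fuzzy integral, every rule here applies the same fixed weights to all test samples, regardless of how confident each classifier is on a given image.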
4.7 Results with different fuzzy measures
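Table 9: Accuracy (%) on the SIPaKMeD SCI dataset for different fuzzy measures assigned to the three classifiers.

| Fuzzy Measure: Inception v3 | Fuzzy Measure: DenseNet161 | Fuzzy Measure: ResNet34 | Accuracy (%) |
|---|---|---|---|
| 0.5 | 0.5 | 0.1 | 95.33 |
| 0.5 | 0.1 | 0.5 | 95.36 |
| 0.1 | 0.5 | 0.5 | 98.54 |
| 1 | 0.5 | 0.5 | 97.52 |
| 0.5 | 1 | 0.5 | 95.36 |
| 0.5 | 0.5 | 1 | 97.57 |
| 0.5 | 1 | 0.1 | 97.52 |

Table 10: Accuracy (%) on the SIPaKMeD WSI dataset for different fuzzy measures assigned to the two classifiers.

| Fuzzy Measure: Inception v3 | Fuzzy Measure: DenseNet161 | Accuracy (%) |
|---|---|---|
| 1 | 0.5 | 96.33 |
| 0.5 | 1 | 90.31 |
| 0.5 | 0.1 | 30.31 |
| 0.1 | 0.5 | 94.36 |

Table 11: Accuracy (%) on the Mendeley LBC dataset for different fuzzy measures assigned to the two classifiers.

| Fuzzy Measure: Inception v3 | Fuzzy Measure: DenseNet161 | Accuracy (%) |
|---|---|---|
| 1 | 0.5 | 99.48 |
| 0.5 | 1 | 79.48 |
| 0.5 | 0.1 | 79.58 |
| 0.1 | 0.5 | 87.96 |
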
4.8 Comparison with Existing Models
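Table 12: Comparison of the proposed approach with existing works on the SIPaKMeD dataset.

| Work | Approach | SCI Accuracy (%) | WSI Accuracy (%) |
|---|---|---|---|
| Kiran GV et al. gv2019automatic | Feature Extraction and PCA | 99.63 | 96.37 |
| Shi et al. shi2019graph | Graph Convolutional Network | 98.37 | - |
| Plissiti et al. plissiti2018sipakmed | | 95.35 | - |
| Win et al. win2020computer | | 94.09 | - |
| Sevi et al. sevihealth | CNNs | 88.40 | - |
| Proposed approach | Sugeno Fuzzy Integral Ensemble | 98.54 | 96.33 |
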
In this section, we compare the performance of our proposed approach with previously reported works. No works have been reported on the Mendeley dataset so far; therefore, works on the SIPaKMeD dataset are presented for comparison. The comparative results on SIPaKMeD are given in Table 12. Shi et al. shi2019graph achieve an impressive result of 98.37% on the 5-class SIPaKMeD dataset. In kiran2019automatic, the reported accuracy on SIPaKMeD WSI is 96.37%, which is almost the same as ours. Kiran GV et al. gv2019automatic extracted features from the ResNet34 CNN model using transfer learning and applied Principal Component Analysis to the penultimate feature layer of the CNN for the final feature set selection and classification. They achieved an accuracy of 99.63% on the SIPaKMeD SCI dataset and 96.37% on the SIPaKMeD WSI dataset employing 5-fold cross-validation. However, we have implemented a different approach that does not require feature extraction and takes the opinion of multiple experts (CNN models), which makes the performance robust across the different datasets used, and is computationally efficient while keeping classification performance on par with the state of the art. The proposed approach outperforms most of the works reported so far, which supports the reliability of the model.
4.9 GradCAM Analysis
In this section, we use Gradient-weighted Class Activation Mapping (Grad-CAM) by Selvaraju et al. selvaraju2017grad to visually represent the distinguishing regions in the single-cell and whole-slide Pap-stained images that enable the base classifiers to make their predictions. The results are shown in Figure 4 for the three datasets used in this study. Grad-CAM uses the gradients of the class score with respect to the feature maps of the last convolutional layer as weights to calculate the contribution of each feature map towards the class prediction made by the CNN classifier.
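The weighting step of Grad-CAM can be sketched on plain nested lists as below. This is a toy illustration under stated assumptions: the feature maps of the last convolutional layer and the gradients of the class score with respect to them are assumed to have been extracted already (for instance with hooks in a deep learning framework), and the function name `grad_cam` and the example values are hypothetical.

```python
from typing import List

Map2D = List[List[float]]


def grad_cam(activations: List[Map2D], gradients: List[Map2D]) -> Map2D:
    """Combine K feature maps into one heatmap.

    Each map is weighted by the spatial average of its gradient, the weighted
    maps are summed, and negative values are clipped to zero (ReLU).
    """
    height, width = len(activations[0]), len(activations[0][0])
    heatmap = [[0.0] * width for _ in range(height)]
    for feature_map, grad in zip(activations, gradients):
        # Importance weight of this feature map for the target class.
        alpha = sum(sum(row) for row in grad) / (height * width)
        for i in range(height):
            for j in range(width):
                heatmap[i][j] += alpha * feature_map[i][j]
    return [[max(0.0, value) for value in row] for row in heatmap]


# Two made-up 2x2 feature maps: the first supports the class, the second opposes it.
acts = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 2.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
print(grad_cam(acts, grads))
```

In practice the resulting heatmap is upsampled to the input resolution and overlaid on the image, which this sketch leaves out.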
For all three datasets in Figure 4, it can be noted that the three classifiers focus on different regions of the corresponding original image. For example, in Figure 4(i), for the SIPaKMeD SCI dataset, the ResNet34 model (Figure 4(l)) puts attention solely on the nucleus of the cell, while DenseNet161 (Figure 4(k)) focuses on the nucleus as well as on the cytoplasm of the cell. Inception v3 in Figure 4(j) focuses on the outliers and on the nucleus. Clearly, the three models take into account different aspects of the image. Thus, when the ensemble of these three models is computed, the prediction draws on the complementary information provided by the different classifiers, and a superior prediction is made. Similarly, for the whole slide images of the Mendeley LBC dataset (Figure 4(a)) and the SIPaKMeD WSI dataset (Figure 4(e)), different classifiers focus on different cells within the slide, which further enables the ensemble model to aggregate the discerning information from the base learners to compute a prediction.
4.10 Error Analysis
The proposed framework shows robust and reliable performance in the cervical cytology image classification task. For example, Figure 5 shows instances where some individual classifiers predicted wrong classes while the ensemble approach made correct predictions. Figure 5(a) shows a test image from the SIPaKMeD SCI dataset belonging to the class "Superficial-Intermediate", which was predicted correctly by the fuzzy ensemble method despite the image containing multiple nuclei. This image was classified incorrectly by Inception v3 and ResNet34, but correctly by DenseNet161. The confidence of DenseNet161 in its prediction for this instance was much higher than that of Inception v3 and ResNet34 in theirs. This led the ensemble method to give priority to the DenseNet161 model's decision and predict the sample to be "Superficial-Intermediate". Similarly, Figure 5(b) shows a sample from the SIPaKMeD WSI dataset which was predicted correctly by DenseNet161 to be "Metaplastic" but wrongly by Inception v3 as "Dyskeratotic". Figure 5(c) shows a correct prediction from the Mendeley LBC dataset as belonging to the class "NILM", where DenseNet161 made the correct prediction while Inception v3 predicted it to be "HSIL".
Figure 6 shows the only misclassified sample from the Mendeley LBC dataset. The image belongs to the class HSIL but was predicted to be of class SCC by the proposed model.
Figure 7 shows some misclassified samples from the SIPaKMeD WSI dataset. The most probable reason for the wrong classification, in this case, is the presence of several types of cells in a single image. For example, in Figure 7(a), the number of "Metaplastic" cells is greater than the number of "Dyskeratotic" cells, which led to the originally "Dyskeratotic" class image being classified as "Metaplastic". The case is reversed for Figure 8(c), where most cells are of the "Dyskeratotic" class, and thus an originally "Metaplastic" class image is classified as "Dyskeratotic". These wrong predictions might be due to the improper placement of these images in the classes while creating the dataset.
Figure 8 shows some misclassified samples from the SIPaKMeD SCI dataset. The possible reasons for the misclassifications are the quality of the image, resulting in unclearly visible nuclei as in Figure 8(a) and (b), and the presence of multiple cell nuclei in the image, which is not desired in a single-cell image dataset, as in Figure 8(c) and (d).
5 Statistical Analysis: McNemar’s Test
Table 13: Results from McNemar's Test: the null hypothesis is rejected for every case.

| Compared with | p-value (one column per dataset) | | |
|---|---|---|---|
| Inception v3 | 0.00012 | 1.88E-08 | 0.0005 |
| DenseNet161 | 0 | 0 | 0.0455 |
| ResNet34 | 0.0217 | 0.0036 | 0.00433 |

McNemar's test dietterich1998approximate has been performed to justify the viability of the proposed framework with respect to the constituent models in the ensemble. Table 13 shows the results of the test on all three datasets. The p-value is lower than 0.05 (5%) in all the cases, and hence the null hypothesis can be rejected, indicating that the predictions of the proposed framework differ significantly from those of the individual models used to form the ensemble.
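A minimal sketch of McNemar's test between two sets of predictions is given below. It assumes the continuity-corrected chi-squared form with one degree of freedom; whether this exact variant was used for Table 13 is not stated, and the function name `mcnemar_test` and the example labels are hypothetical.

```python
import math
from typing import Sequence, Tuple


def mcnemar_test(
    y_true: Sequence[int], pred_a: Sequence[int], pred_b: Sequence[int]
) -> Tuple[float, float]:
    """Return (statistic, p_value) for the continuity-corrected McNemar's test.

    Only samples on which exactly one of the two models is correct are counted.
    """
    only_a_correct = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if a == t and b != t)
    only_b_correct = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if a != t and b == t)
    discordant = only_a_correct + only_b_correct
    if discordant == 0:
        return 0.0, 1.0  # the two models never disagree in correctness
    statistic = (abs(only_a_correct - only_b_correct) - 1) ** 2 / discordant
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Made-up example: model A is always right, model B is wrong on 10 of 20 samples.
labels = [0] * 20
model_a = [0] * 20
model_b = [1] * 10 + [0] * 10
print(mcnemar_test(labels, model_a, model_b))
```

A p-value below the chosen significance level (0.05 in this section) leads to rejecting the null hypothesis that the two models have the same error rate.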
6 Conclusions & Future Work
In this paper, we propose a fuzzy fusion-based CNN integration method to address the problem of classification of Pap smear-based cervical cytology images. The decision scores obtained from CNN classifiers are used as the input of the fuzzy integral to perform the final classification. With classification accuracies of 99.48%, 96.33%, and 98.54% on the Mendeley LBC, SIPaKMeD whole slide images (WSI) and SIPaKMeD single-cell images (SCI) datasets respectively, our proposed method has shown superior performance compared to other simple fusion methods and has outperformed several existing methods on these datasets. The proposed Sugeno Fuzzy Integral-based ensemble is the first such implementation in this domain, and its adaptive weighting system based on the confidence scores of contributing classifiers makes it perform better than the traditional ensemble schemes previously used in the literature, as evident from Table 8.
Graph convolution networks (GCN) and attention-gated networks have also shown promising performance in several domains, which motivates us to experiment with fuzzy fusion-based methods on these testbeds in the future. The fuzzy measures are selected based on the individual classifier performance on test sets, which is not the optimal solution. Hence, we further plan to implement an evolutionary metaheuristic optimization algorithm for the selection of the fuzzy measures of the classifiers, which might further improve the overall classification performance. We might also incorporate other CNN classifiers to form the ensemble in the future.