GLER-Unet: An ensemble network for hard exudates segmentation

The detection of hard exudates in diabetic retinopathy is an active topic in medical image segmentation. To address the irregular shapes and varying sizes of lesion areas in the hard exudates segmentation task, as well as the few-shot learning challenge common in medical image segmentation, a Global-Local Ensemble Robust U-Net (GLER-Unet) is proposed. The network consists of three parts: a Global Contour Extraction network, trained on complete images, that extracts long-range semantics and hard exudate contours; a Local Refined Feature Segmentation network, trained on image patches, that extracts local refined segmentation rules; and a Feature Revise network that fuses the features extracted by the first two networks and generates binary masks. The proposed method obtains Dice, TPR and PPV of 0.8741, 0.8752 and 0.8730 on E-Ophtha and 0.8960, 0.8964 and 0.8956 on IDRiD, respectively. The proposed method also shows strong robustness in cross-dataset testing, outperforming the other baseline models.


Introduction
Diabetic retinopathy (DR) is a complication of diabetes that causes retinal blood vessels to swell and leak fluid and blood, and can even lead to vision loss in advanced stages [1]. Hard exudates are a common lesion in DR. They appear as bright yellow spots with sharp edges on the surface of the retina and are caused by plasma leakage. Because hard exudates are an important sign for identifying DR, their automatic segmentation is of practical value for improving the efficiency of DR diagnosis and reducing the rate of manual misdiagnosis.
Xue et al. [2] proposed an improved Mask R-CNN network to extract microaneurysms and exudates. Guo et al. [3] proposed the L-SEG network for simultaneous segmentation of exudates, haemorrhages and microaneurysms. To improve computational efficiency, many researchers [4][5][6] also use cropped patches as network input for training. However, this training method reduces the network's ability to extract long-range semantics, resulting in some loss of accuracy.
The irregular shape of hard exudate lesion areas makes feature extraction difficult. In addition, the prevalence of few-shot learning conditions in hard exudates segmentation datasets makes networks less robust and difficult to apply in real diagnosis. To solve these problems, this paper proposes a global-local ensemble network for hard exudates segmentation.

Method
The structure of GLER-Unet is shown in Figure 1. It consists of three parts: the Global Contour Extraction (GCE) network, the Local Refined Feature Segmentation (LRFS) network and the Feature Revise (FR) network.

Global contour extraction network
Trained on complete fundus images, GCE extracts long-range semantics and the complete contours of hard exudates, which constrains the region to be segmented. The varying shapes and sizes of hard exudates require the ability to extract multi-scale features of arbitrary shape. At the same time, GCE also needs to pay more attention to features extracted by the shallow layers, such as boundaries.
The GCE network, shown in Figure 2, inherits the architecture of U-Net [7]. In the encoder, GCE replaces the normal convolutions with deformable convolutions [8] and inception blocks [9]. Deformable convolution gives GCE the ability to adapt the receptive field to the shape of the target lesion, which is most evident in shallow features. The inception block enhances multi-scale feature extraction. In the decoder, GCE uses normal convolutions and deformable convolutions to restore the features.

Local refined feature segmentation
Trained on patches cropped from the original images, LRFS extracts local fine segmentation rules. LRFS uses U-Net++ as its backbone, and multiple dilated convolutions [10] of different sizes are stacked in a sawtooth pattern to form a dilated conv layer that replaces the original conv layer. The dilated conv layer can extract multi-scale features without increasing the amount of computation. The structure of the encoders and decoders in LRFS is shown in Figure 3.
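The claim that stacked dilated convolutions widen the view at no extra cost can be checked with a little arithmetic. The sketch below is only an illustration: the 3x3 kernel size and the rates 1, 2, 5 are assumptions chosen as a typical sawtooth pattern, not values taken from the paper, and the helper names are hypothetical.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (pixels along one axis) of stacked stride-1 dilated convs."""
    rf = 1
    for d in dilations:
        # A k-tap kernel with dilation d spans (k - 1) * d + 1 pixels,
        # so each stacked layer widens the field by (k - 1) * d.
        rf += (kernel_size - 1) * d
    return rf


def weights_per_layer(kernel_size, in_ch, out_ch):
    """Weight count of one 2-D conv layer (bias ignored).

    The dilation rate does not appear here: dilation spreads the taps
    apart but does not add any.
    """
    return kernel_size * kernel_size * in_ch * out_ch


# Three ordinary 3x3 convs versus three 3x3 convs with assumed rates 1, 2, 5.
plain = receptive_field(3, [1, 1, 1])
sawtooth = receptive_field(3, [1, 2, 5])
print(plain, sawtooth)
```

Under these assumptions both stacks have the same number of weights per layer, yet the dilated stack sees a noticeably larger neighbourhood around each pixel.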

Feature revise
The FR network fuses the features extracted by the GCE and LRFS networks and maps them to a binary mask. First, the features extracted by GCE and LRFS are weighted and fused by a Sigmoid and then mapped to a non-binary mask at the original resolution through a fully connected layer. The non-binary mask is then transformed into the final binary mask by a Residual Rebuilt Module (RRM). The RRM, shown in Figure 4, is a simple network based on residual blocks. Each residual block is composed of a conv layer, a batch normalization layer and a PReLU activation layer. The RRM gives FR the ability to adjust network layers adaptively and to realize the mask mapping.
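The text does not spell out how the Sigmoid weighting is computed, so the following is only one possible reading: a single gate value, squashed by a sigmoid, blends the two feature maps element by element. The function names, the scalar `gate_logit` and the toy 1x2 feature maps are all assumptions made for illustration; the fully connected layer and the RRM are not modelled here.

```python
import math


def sigmoid(x):
    """Standard logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse(global_feat, local_feat, gate_logit):
    """Blend two equally sized 2-D feature maps with a sigmoid gate.

    gate_logit is a hypothetical learnable scalar: large positive values
    favour the global (GCE-like) map, large negative values favour the
    local (LRFS-like) map.
    """
    g = sigmoid(gate_logit)
    return [
        [g * a + (1.0 - g) * b for a, b in zip(row_g, row_l)]
        for row_g, row_l in zip(global_feat, local_feat)
    ]


# With gate_logit = 0 the gate is 0.5, so the result is the plain average.
fused = fuse([[0.9, 0.1]], [[0.7, 0.5]], gate_logit=0.0)
print(fused)
```

In a trained network the gate would more likely be a learned tensor than a single scalar; the scalar is used here only to keep the sketch short.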

Training strategy
The training process of GLER-Unet can be divided into two stages: feature extraction network training and FR network training. Feature extraction network training covers GCE and LRFS, which are trained independently without interfering with each other. Once each network reaches its optimal state, its parameters are frozen and its classifier is removed, and then the FR network is trained. Therefore, GCE and LRFS are used only for inference during FR network training.

Loss function
The segmentation task here is a pixel-wise binary classification task, and Binary Cross Entropy (BCE) Loss is a common loss function in semantic segmentation that can be applied to both the GCE and LRFS networks. Focal Loss can also be used to address the extremely uneven distribution of foreground and background samples in hard exudates images. Thus, a weighted combination of Focal Loss and BCE is used as the loss function of the network. The formula is as follows: $L = \lambda_1 L_{Focal} + \lambda_2 L_{BCE}$ (1), where $\lambda_1$ and $\lambda_2$ are the weights of the two terms.
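A minimal sketch of such a weighted loss on per-pixel probabilities is given below. The weights `w_focal` and `w_bce`, the focal parameters `gamma` and `alpha`, and the function names are assumptions for illustration; the paper's actual values are not stated in this section.

```python
import math


def bce(p, y, eps=1e-7):
    """Binary cross entropy for one pixel: p is the predicted lesion probability, y is 0 or 1."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def focal(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Focal loss for one pixel; gamma and alpha are assumed defaults."""
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if y == 1 else 1.0 - p              # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing factor
    # (1 - p_t) ** gamma shrinks the loss of pixels that are already easy.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def combined_loss(probs, labels, w_focal=0.5, w_bce=0.5):
    """Mean over pixels of a weighted sum of focal loss and BCE (weights are hypothetical)."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += w_focal * focal(p, y) + w_bce * bce(p, y)
    return total / len(probs)


# A confident, correct prediction should score lower than a confident, wrong one.
print(combined_loss([0.9, 0.1], [1, 0]), combined_loss([0.1, 0.9], [1, 0]))
```

The focal term contributes little for well-classified background pixels, which is why it helps when lesion pixels are a small fraction of the image.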

Evaluation
The model structure of GLER-Unet is relatively complex, and each sub-network plays a different role. Therefore, Dice, TPR, PPV, OR and UR are used to evaluate the model from several perspectives. In the metric definitions, X is the ground truth and Y is the prediction. True Positive (TP) is the number of lesion pixels correctly predicted as lesion, False Negative (FN) is the number of lesion pixels wrongly predicted as background, and False Positive (FP) is the number of background pixels wrongly predicted as lesion.
The Dice coefficient describes the similarity of two samples. The True Positive Rate (TPR) describes the proportion of true lesion pixels that are identified. The Positive Predictive Value (PPV) describes the proportion of pixels identified as lesion that are true lesions. The Over Segmentation Rate (OR) and Under Segmentation Rate (UR) describe, respectively, the ratio of predicted pixels that lie outside the ground truth and the ratio of ground-truth pixels that are missing from the prediction.
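The five metrics can be sketched on binary masks as follows. Dice, TPR and PPV use their standard definitions. The denominators used for OR and UR (ground-truth pixels plus over-segmented pixels) are an assumption based on one common convention and may differ from the formulas the authors used. The function names are hypothetical.

```python
def confusion_counts(truth, pred):
    """Count TP, FP and FN over two equally sized 2-D binary masks."""
    tp = fp = fn = 0
    for row_t, row_p in zip(truth, pred):
        for t, p in zip(row_t, row_p):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    return tp, fp, fn


def safe_div(num, den):
    """Return 0.0 instead of raising when a mask is empty."""
    return num / den if den else 0.0


def metrics(truth, pred):
    tp, fp, fn = confusion_counts(truth, pred)
    return {
        "dice": safe_div(2 * tp, 2 * tp + fp + fn),  # 2|X ∩ Y| / (|X| + |Y|)
        "tpr": safe_div(tp, tp + fn),
        "ppv": safe_div(tp, tp + fp),
        # Assumed convention: both rates are normalised by
        # (ground-truth pixels + over-segmented pixels).
        "or": safe_div(fp, tp + fn + fp),
        "ur": safe_div(fn, tp + fn + fp),
    }


truth = [[1, 1, 0],
         [0, 1, 0]]
pred = [[1, 0, 0],
        [0, 1, 1]]
print(metrics(truth, pred))
```

In this toy example one lesion pixel is missed and one background pixel is wrongly marked, so OR and UR come out equal, which is the balanced case discussed in the ablation study.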

Ablation experiment
We conducted ablation experiments on each part of the ensemble model to verify the effectiveness of the sub-networks; the results are shown in Table 1. GCE tends to under-segment: its Dice coefficient is lower than that of LRFS but its PPV is higher, which indicates that the GCE network favors the accuracy of pixel classification within the segmented region. In contrast, LRFS tends to over-segment: its Dice coefficient and TPR are generally higher, which indicates that LRFS's segmentation strategy is to cover as much of the real lesion region as possible, at the cost of a relatively lower PPV. Combining the GCE and LRFS networks further improves accuracy, while the OR and UR become more balanced and relatively low, indicating that the ensemble network absorbs the advantages of both GCE and LRFS and strikes a balance between them. The main function of the FR network is to fuse the features, but numerically the FR network also further balances the OR and UR.
Figure 5 shows the segmentation results of each component. The segmentation result of the LRFS network contains a lot of noise, classifying irrelevant areas as lesion areas, but this also allows the LRFS network to cover most lesion areas. In terms of the shape and number of lesion areas, the segmentation result of the GCE network is closer to the ground-truth label and has less noise.