Ultrasound image analysis using deep neural networks for discriminating between benign and malignant ovarian tumors: comparison with expert subjective assessment

Abstract Objectives To develop and test the performance of computerized ultrasound image analysis using deep neural networks (DNNs) in discriminating between benign and malignant ovarian tumors and to compare its diagnostic accuracy with that of subjective assessment (SA) by an ultrasound expert. Methods We included 3077 (grayscale, n = 1927; power Doppler, n = 1150) ultrasound images from 758 women with ovarian tumors, who were classified prospectively by expert ultrasound examiners according to IOTA (International Ovarian Tumor Analysis) terms and definitions. Histological outcome from surgery (n = 634) or long‐term (≥ 3 years) follow‐up (n = 124) served as the gold standard. The dataset was split into a training set (n = 508; 314 benign and 194 malignant), a validation set (n = 100; 60 benign and 40 malignant) and a test set (n = 150; 75 benign and 75 malignant). We used transfer learning on three pre‐trained DNNs: VGG16, ResNet50 and MobileNet. Each model was trained, and the outputs calibrated, using temperature scaling. An ensemble of the three models was then used to estimate the probability of malignancy based on all images from a given case. The DNN ensemble classified the tumors as benign or malignant (Ovry‐Dx1 model); or as benign, inconclusive or malignant (Ovry‐Dx2 model). The diagnostic performance of the DNN models, in terms of sensitivity and specificity, was compared to that of SA for classifying ovarian tumors in the test set. Results At a sensitivity of 96.0%, Ovry‐Dx1 had a specificity similar to that of SA (86.7% vs 88.0%; P = 1.0). Ovry‐Dx2 had a sensitivity of 97.1% and a specificity of 93.7%, when designating 12.7% of the lesions as inconclusive. By complimenting Ovry‐Dx2 with SA in inconclusive cases, the overall sensitivity (96.0%) and specificity (89.3%) were not significantly different from using SA in all cases (P = 1.0). Conclusion Ultrasound image analysis using DNNs can predict ovarian malignancy with a diagnostic accuracy comparable to that of human expert examiners, indicating that these models may have a role in the triage of women with an ovarian tumor. © 2020 The Authors. Ultrasound in Obstetrics & Gynecology published by John Wiley & Sons Ltd on behalf of International Society of Ultrasound in Obstetrics and Gynecology.

(International Ovarian Tumor Analysis) terms and definitions. Histological outcome from surgery (n = 634) or long-term (≥ 3 years) follow-up (n = 124) served as the gold standard. The dataset was split into a training set (n = 508; 314 benign and 194 malignant), a validation set (n = 100; 60 benign and 40 malignant) and a test set (n = 150; 75 benign and 75 malignant). We used transfer learning on three pre-trained DNNs: VGG16, ResNet50 and MobileNet. Each model was trained, and the outputs calibrated, using temperature scaling. An ensemble of the three models was then used to estimate the probability of malignancy based on all images from a given case. The DNN ensemble classified the tumors as benign or malignant (Ovry-Dx1 model); or as benign, inconclusive or malignant (Ovry-Dx2 model). The diagnostic performance of the DNN models, in terms of sensitivity and specificity, was compared to that of SA for classifying ovarian tumors in the test set.

INTRODUCTION
Ovarian cancer is the most lethal gynecological malignancy, with a global 5-year survival of 45% 1 . Almost 10% of asymptomatic postmenopausal women have an ovarian lesion, often detected incidentally, of which only 1% are malignant 2 . Over 50% of ovarian tumors occur in fertile women 3 , in whom unnecessary or extensive surgery may cause fertility loss. Thus, there is a need to determine precisely the risk of malignancy to individualize and optimize treatment. Benign masses can be managed conservatively with ultrasound follow-up or minimal invasive laparoscopy, while preserving fertility 3 . Women with suspected ovarian cancer should be referred directly to a gyneoncology treatment center, as surgical treatment of such patients by gynecological oncologists is associated with higher likelihood of complete tumor removal and improved survival rate 4 .
Expert ultrasound examination has become the main imaging technique for assessing ovarian lesions 5 . The diagnostic accuracy of ultrasound is better in the hands of experts than in the hands of less experienced doctors 6 , however, there is a shortage of expert examiners. Recent advances in computerized diagnostics have been powered by deep neural networks (DNNs), a class of machine-learning algorithms that can learn complex representations of data from compositions of many simple non-linear units. This approach is a paradigm shift, in which the input to the model is not hand-designed, as in the past, but raw data 7 . DNNs have been shown to be able to discriminate between benign and malignant tumors in other domains, such as computed tomography (lung cancer) 8 , photographic imagery (skin cancer) 9 and mammography (breast cancer) 10 , with performance on a par with that of expert radiologists. Using ultrasound images, DNNs have shown promising results in diagnosing breast and thyroid tumors [11][12][13] , although the field is still unexplored when it comes to ovarian tumors. Training DNNs requires large volumes of labelled data, which is often a scarce resource in the medical field. To overcome this limitation, transfer learning from ImageNet 14 , a large labelled dataset of roughly 1 million natural images, has become a standard practice for deep learning in medical applications 15 .
The aims of this study were to develop and test a DNN-based computerized ultrasound image analysis model for discriminating between benign and malignant ovarian tumors and to compare its diagnostic accuracy with that of expert subjective assessment (SA) 16 and IOTA (International Ovarian Tumor Analysis) simple rules 17,18 and simple-rules risk 19 .

Dataset
We included retrospectively 3077 (grayscale, n = 1927; power Doppler, n = 1150) ultrasound images from 758 women with ovarian lesions. All women had undergone structured expert ultrasound assessment prior to surgery, at the gynecological ultrasound departments of the Karolinska University Hospital (tertiary referral center) and Södersjukhuset (secondary/tertiary referral center) in Stockholm, Sweden, between 2010 and 2019. The examinations were performed by one of six examiners with substantial experience (7-23 years) in the assessment of adnexal lesions. All examiners were certified (having undergone both theoretical and practical assessment) as second-opinion expert sonographers, i.e. expert examiners, by the Swedish Society of Obstetrics and Gynecology (SFOG). Every case was assessed by a single examiner. Ethical approval was obtained by the local ethics committee (DNR 2010/145, 2011/343).
Eligible criteria were surgery within 120 days after the ultrasound examination (n = 634) or ultrasound follow-up for a minimum of 3 years or until resolution of the lesion (n = 124). None of the women undergoing follow-up was diagnosed with a malignant lesion, and thus, they were presumed to have a benign diagnosis. At the time of examination, a standardized protocol was filled out, in which the tumors were classified as benign or malignant using expert SA, the perceived certainty in the assessment was reported (uncertain vs certainly/probably benign/malignant), and the tumors were classified according to the IOTA simple rules. The IOTA simple rules include five benign and five malignant criteria. A lesion is considered as benign or malignant if only benign or malignant features, respectively, are present, and is classified as inconclusive if both benign and malignant features are present or none of the features is observed. Inconclusive lesions (approximately one in four cases) require a second-stage test, usually SA by an expert examiner. If SA is not available, these lesions are considered potentially malignant 17,18 . In order to be able to classify all lesions based on the simple-rules assessment, we applied retrospectively the IOTA logistic regression model simple-rules risk (SRR) to obtain a risk score for every case 19 . We used a cut-off of 0.2 for SRR to define women with suspected malignancy, based on the original study 19 .
All examinations were performed using high-end ultrasound systems, namely GE Voluson E8 or E10 with a 5-9-or 6-12-MHz transducer (GE Healthcare, Zipf, Austria) or Philips IU22 or EPIQ with a 3-10-MHz transducer (Philips Medical Systems, Bothell, WA, USA). Eligible cases were selected randomly from our ovarian tumor database. Benign and malignant cases were selected separately, to obtain a case-mix with at least 40% malignant cases (as the clinical incidence of malignancy among women undergoing expert ultrasound assessment prior to surgery was 40% at Södersjukhuset and 49% at Karolinska University Hospital during the study period). In the selected cases, representative grayscale and power Diagnosis of ovarian tumors using deep learning 157 Doppler ultrasound images (median, 3 (interquartile range, 3-5) images per case), in which the whole lesion was adequately shown, were selected and downloaded in jpeg format from the hospital image database systems. The images were deidentified and then cropped manually by the reviewer to the region of interest (ROI), which in the vast majority of images involved removing merely the outer borders and occasionally also excluding surrounding structures, such as the uterus ( Figure S1).
Following standard practice, our dataset was split into a training set (n = 508; 314 benign and 194 malignant), a validation set (n = 100; 60 benign and 40 malignant) and a test set (n = 150; 75 benign and 75 malignant, with three images per case). The training set was used to learn the parameters of the models; the validation set was used to estimate the prediction error for hyperparameter tuning and model selection; and the test set was used to assess independently the generalization error for the final chosen models, preventing possible overfit to the training data. To ensure accurate results, only patients with a histological diagnosis obtained by surgery were included in the test set. The histological outcome for all women who underwent surgery, and separately for those who were included in the test set, is shown in Table 1.

Data processing
Images with a shape of (224, 224, 3), i.e. square images measuring 224 by 224 pixels with three color channels (RGB), were used as input to the DNN models. Therefore, the images were downsampled and resized accordingly, using nearest-neighbor interpolation. Since the images were not at a uniform scale, the resulting physical resolution varied. Furthermore, the pixel values were standardized channel-wise to have zero mean and unit variance over the training dataset. The same values were also used to standardize the validation and test sets. Data augmentation was performed during training for model generalization by expanding the available training data and mimicking shifts in image properties in unseen domains. The transformations used in the augmentation process can be divided into three main categories 20 , based on the aspect of the image that is altered: image quality, spatiality and appearance. Low image quality is characterized mainly by blurriness and low resolution, caused by scanner motion, low scanner resolution or lossy image compression. We tried to imitate this by adding Gaussian noise (std, 0.02), jpeg compression (0-20%) and shift in sharpness (± 20%). Transformations related to the spatial shape of the images comprised horizontal and vertical flips, rotation (multiples of 90 • ) and cropping (0-10%). These transformations served the purpose of simulating the variability in the shape and position of organs, the size of patients and ROI-related cropping. The appearance-related transformations were shifts in brightness (± 20%), contrast (± 20%) and color (± 10%), which mimicked the differences between ultrasound systems and settings.

Model building
We used a transfer-learning approach on ImageNet 14 pre-trained deep-learning models VGG16 21 , ResNet50 22 and MobileNet 23 . Each model was fine-tuned on the dataset of transvaginal ultrasound images of ovarian tumors. The output probabilities were then calibrated independently using temperature scaling 24 , a post-processing technique used to better align the confidence scores with the underlying class probabilities ( Figure S2). The three models were then combined into an ensemble, using a soft voting scheme of averaging the probabilities from the models, to improve performance over the individual models. Temperature scaling ensured more reliable estimates of the probability of malignancy for each model, which is desirable when building an ensemble, as it reduces the problem of difference in (over-) confidence between the models. The ensemble was then used to estimate the probability of malignancy of a given case, by averaging the predictions from all representative images from that case. Using the prediction from the ensemble, tumors were classified as benign or malignant (Ovry-Dx1) or as benign, inconclusive or malignant (Ovry-Dx2), by setting thresholds on the predicted probability of malignancy. The performance of the DNN models was compared to that of SA, based on their sensitivity and specificity in discriminating between benign and malignant lesions in the test set. The probability threshold for Ovry-Dx1 was intended to be set so as to give an optimal balance between sensitivity and specificity, with a sensitivity close to that of SA. Since this value was near 0.5 and its uncertainty large, the threshold was simply set to 0.5, with cases above this threshold classified as malignant. For Ovry-Dx2, the probability thresholds were set to 0.4 and 0.6 (i.e. cases with a predicted probability of malignancy between 0.4 and 0.6 were classified as inconclusive), which was shown 158 Christiansen et al.
to result in a reasonable balance between performance and fraction of excluded cases in the validation set.

Training process
First, in each model, the original classifier for the ImageNet classes was replaced by a binary classifier, consisting of a fully connected layer of 1024 hidden nodes (512 for MobileNet), followed by ReLU-activation, dropout of 0.5 (0.2 for VGG16) and a final fully connected softmax layer with two nodes representing the benign and malignant outputs. In the first training step, the weights of the convolutional base of the original models were frozen, thereby training only the new binary classifier, using an initial learning rate of 0.02 (0.002 for VGG16). Then, the layers of the convolutional base were unfrozen and fine-tuned one by one, until no more improvement could be seen. An initial learning rate of 2 × 10 − 4 was used in this step. All models were trained using backpropagation 25 by stochastic gradient descent, with Nesterov momentum 26 of 0.9 and a batch size of 32 images, on an NVIDIA RTX 2080 graphics card. Imbalance in the number of grayscale and power Doppler images, from benign and malignant cases, was addressed by training with weighted binary cross-entropy loss. We used a learning-rate decay of 0.5 after every four consecutive epochs of no improvement in validation accuracy. In both steps of training, we used early stopping to monitor the performance of the model on the validation set and select the optimal point to stop training ( Figure S3). This is the point at which the model performance on the validation set stops improving, and any additional training overfits the training data at the expense of increased generalization error. The hyperparameters, such as the learning rate, dropout rate and the number of hidden nodes, were chosen to maximize performance on the validation set.
Similar to most modern deep-learning architectures, ResNet50 and MobileNet use batch normalization 27 to improve training performance and stability. When using transfer learning on a model with batch normalization layers, the difference between the source domain (natural images) and target domain (ultrasound images) must be addressed. If the parameters of these layers are frozen during training of the new classifier, the model will be using the exponential moving average (EMA) statistics from the source domain for normalization at test time. These statistics will differ significantly from the statistics of the batches during training of the new classifier in the target domain, leading potentially to a large drop in performance. To align the training and test behavior of the model, the EMA statistics of the batch normalization layers were updated continually during fine-tuning of the models in the target domain.

Evaluation of possible image bias
Caliper measurements were present in ∼80% of both benign and malignant images in the dataset. To rule out any potential bias introduced by the calipers, the final ensemble model (Ovry-Dx1) was evaluated on images with and without calipers. Since we were not interested in the absolute, but rather the relative performance, we used both the validation and test datasets to obtain a more accurate measurement of the effect. Furthermore, only the grayscale images were used, since very few power Doppler images contained calipers. On the images with and without calipers, the sensitivities (88.8% vs 89.6%) and specificities (79.1% vs 82.4%) of Ovry-Dx1 were similar, and the outputs of the network were statistically indistinguishable (P = 0.86 and P = 0.50). This indicates that the presence of caliper measurements does not play a significant role in assisting the model in predicting malignancy of ovarian tumors.
To rule out potential bias related to the use of power Doppler images in addition to using only grayscale images, the final ensemble model (Ovry-Dx1) was evaluated on the test set for each individual image alone. On the power Doppler and grayscale images, the specificities (85.7% vs 85.1%) were similar, while the sensitivity was higher for power Doppler images compared to grayscale images (88.2% vs 82.9%). Furthermore, the outputs of the network on the benign images were statistically indistinguishable (P = 0.61), while the outputs of the network on the malignant images were statistically higher for the power Doppler images (P = 0.03). Based on this, the model's response to benign images was similar for both modalities. There does appear to be a slight difference in the network's response towards malignant images in favor of the power Doppler modality, but the higher sensitivity may be explained by the addition of blood flow information. In practice, this means that cases with power Doppler images, in addition to grayscale images, will likely see an increase in sensitivity without a drop in specificity.

Statistical analysis
To compare the performance of the DNN models to that of SA in discriminating between benign and malignant tumors in the test set, the sensitivity, specificity, accuracy and area under the receiver-operating-characteristics (ROC) curve (AUC), with their 95% CI, were calculated. The 95% CI for the sensitivity, specificity and accuracy where estimated by the Jeffreys interval 28 , while the 95% CI for the AUC and the fraction of excluded cases were estimated by bootstrapping. Comparison of the sensitivity, specificity and accuracy of the DNN models with that of SA, simple rules and SRR was performed using McNemar's test for paired categorical data. Evaluation of possible image bias, from caliper measurements and power Doppler images, was based on the Mann-Whitney U-test for unpaired non-parametric data. Statistical analysis was performed using IBM SPSS Statistics for Windows, version 26.0 (IBM Corp., Armonk, NY, USA). All tests were two-sided and P-values < 0.05 were considered statistically significant.

RESULTS
The performance statistics on the test set of the three DNN models (VGG16, ResNet50 and MobileNet) and the ensemble of these models are shown in Figure 1 and the corresponding ROC curves in Figure 2. The ensemble model was the overall best performing model, achieving an AUC of 0.950 (95% CI, 0.906-0.985) when all cases in the test set were classified as benign or malignant (Ovry-Dx1), and an AUC of 0.958 (95% CI, 0.911-0.993) when excluding 12.7% (95% CI, 7.3-18.0%) of cases as inconclusive (Ovry-Dx2). In Table 2, we present the diagnostic performance of SA, simple rules, SRR and the DNN models (Ovry-Dx1 and Ovry-Dx2) for differentiating between benign and malignant ovarian lesions in the test set. At a sensitivity of 96.0%, Ovry-Dx1 had a specificity similar to that of SA (86.7% vs 88.0%; P = 1.0). Ovry-Dx2 had a sensitivity of 97.1% and a specificity of 93.7%, when designating 12.7% (95% CI, 7.3-18.0%) of the lesions as inconclusive. By complementing Ovry-Dx2 with SA in inconclusive cases, the overall sensitivity (96.0%) and specificity (89.3%) were not significantly different from using SA in all cases (P = 1.0). Use of the simple rules classified 27.3% of cases as inconclusive. At a sensitivity of 96.0%, Ovry-Dx1 had a significantly higher specificity (86.7% vs 66.7%; P = 0.003) and accuracy (91.3% vs 81.3%; P = 0.006) than the simple rules with inconclusive cases classified as malignant ( Table 2). At a cut-off of 0.2, SRR had the same performance as the simple rules with   inconclusive cases classified as malignant ( Table 2). The cut-off of 0.2 was chosen based on the results of the original study 19 , and furthermore, any cut-off between 0.153 and 0.325 would have resulted in the same performance, since no case was given a prediction in this interval. SRR achieved an AUC of 0.921 (95% CI, 0.876-0.958). Table 3 shows the histological outcome of cases classified as inconclusive by Ovry-Dx2 and simple rules and cases that were difficult to classify by SA. Of the 19 cases classified as inconclusive by Ovry-Dx2, 10 were also considered difficult to classify by SA (Table 2, Figure 3a,b). Figure 3c,d shows the other nine cases that were classified as inconclusive by Ovry-Dx2, of which two were also considered inconclusive by simple rules. Figure 4 shows the six cases that were misdiagnosed by Ovry-Dx2, two of which were also misdiagnosed by SA.

DISCUSSION
We have shown that ultrasound image analysis using DNNs can predict ovarian malignancy with a diagnostic accuracy comparable to that of human expert examiners. Furthermore, we found a substantial overlap between lesions difficult to classify by SA and by Ovry-Dx2, indicating that both the expert and the DNNs recognize features (e.g. papillary projections, multilocular lesions with > 10 loculi) known to signify diagnostic difficulties 29 .
While our work is unprecedented in applying deep learning for classification of adnexal masses, a few pilot studies with a limited number of cases have explored the use of handcrafted image descriptors, in combination with support vector machines (SVM), for classification. Two studies 30, 31 on 177 patients used local binary patterns (LBP) 32 and pixel intensity histograms as Table 2 Diagnostic performance of expert subjective assessment (SA) 16 , IOTA simple rules (SR) 17,18 , IOTA simple-rules risk (SRR) 19   texture features; the first reported an accuracy of 76%, while excluding 18.3% of the cases as inconclusive 30 , and the second a sensitivity of 77% and specificity of 77% 31 , for classifying ovarian masses. A later study 33 on the same dataset achieved an increased performance (sensitivity 91%, specificity 83%, AUC of 0.874) using Fourier transform-based feature descriptors and SVM for classification. While the latter results are promising, these studies all share the important limitation that they relied on manual segmentation of tumors by an ultrasound expert. The selection of a rectangular ROI in our own work is of an entirely different nature, as it requires
significantly less involvement and domain expertise by the operator. While we identified other related papers 34 , we have omitted these due to methodological shortcomings, e.g. reporting classification performance on the same data used for selection of image features. Finally, the key to the success of DNNs is their ability to learn highly representative features, on multiple scales and levels of abstraction, directly from large datasets of raw images. This leads to features of greater discriminative capacity compared to conventional handcrafted descriptors 35 .
An advantage of our model is that it is simple to use, as any center could upload a set of deidentified images directly from the workstation or hospital computer, to a cloud platform hosting the model, without the need to first assess subjectively the images or provide additional patient data. The greatest clinical benefit of a diagnostic DNN model for classification of ovarian tumors would be in the hands of non-expert examiners; however, it may also be useful to experts as a second reader. Many centers and private practitioners have limited access to a second opinion by an ultrasound expert; therefore, they might use simple rules, designating inconclusive cases as malignant, or use SRR in order to reach an acceptable sensitivity and be able to manage all patients. In this setting, the specificity could potentially be improved by 20 percentage points (from 66.7% for both simple rules strategies, to 86.7%) by instead using Ovry-Dx1 in all cases (Table 2). In addition to high sensitivity, a high specificity is also of utmost importance, as many lesions are detected incidentally in asymptomatic women. Unnecessary surgery misuses the resources of the healthcare system and may cause morbidity, and in some cases even fertility loss. In a large randomized screening study for ovarian cancer 36 , one-third of women with false-positive findings underwent adnexal surgery. Of these, 15% experienced at least one major complication, highlighting the importance of minimizing false-positive diagnoses in an asymptomatic population. By adjusting thresholds on the DNN predictions, as we have done for Ovry-Dx2, it is possible to optimize sensitivity, specificity and the fraction of inconclusive cases, depending on the setting (i.e. low-or high-resource) and the population (i.e. low or high risk of disease).
A strength of this study is that all cases were classified prospectively by expert ultrasound examination prior to surgery or long-term ultrasound follow-up, which enabled us to compare the performance of the DNNs to that of SA. The proportion of cases classified as inconclusive by simple rules (27.3%) is slightly higher than that reported in previous studies (22.5% 17 and 24.0% 18 ), indicating that the number of difficult cases in our test set is similar to or higher than that in other populations. While no direct comparison can be made, the sensitivity (96.0%) and specificity (88.0%) of SA in our test set, are comparable to the reported performance in a large IOTA multicenter study including 1938 cases (sensitivity 90.4%, specificity 92.7%) 17 .
We included a wide range of ovarian pathology with high-quality images, optimizing DNN model development. However, homogeneity of the image quality can be seen as a weakness of this study, as the majority of images were obtained by the same examiner using the GE Voluson E8 or E10 systems. Thus, it remains to be shown if the models perform equally well on images acquired by other expert centers, less experienced examiners or by examiners not using high-end equipment. Evaluating cases based on batches of images led to better diagnostic accuracy compared to using single images. A next step could be to explore if the use of two-dimensional videoclips or three-dimensional volumes could further improve evaluation performance. Additional data, especially from diagnoses with low prevalence, could allow training of a DNN for multiclass diagnosis-specific classification.
In this study, images were cropped manually, mainly by removing the outer borders; a task that would render itself suitable for auto-cropping. As the selection of an acceptable ROI requires only the ability to locate the tumor in the recorded image, which is a requirement for recording the image in the first place, this coarse cropping does not depend on expertise beyond that which is already inevitable for image acquisition. Still, despite the fact that manual ROI selection is performed in most studies on deep learning in the medical domain 37 , the necessity of manual cropping and the potential benefit of auto-cropping should be explored in future studies. Finally, our extensive spatial data augmentation during training is further likely to increase robustness and flexibility in the ROI selection.
We would like to emphasize the importance of external validation of the Ovry-Dx1 and Ovry-Dx2 models on images from other gynecological centers, in order to evaluate the limitation in generalization and the potential for multisite deployment. Thus, the next step is to validate externally these models in a large multicenter setting, which is already underway (ISRCTN51927471). It should be stressed that automated image analysis should only be used to assist in the triage of patients and not to make a final diagnosis. Nevertheless, our results clearly indicate that DNNs have the potential to be clinically useful in the triage of women with an ovarian tumor.

SUPPORTING INFORMATION ON THE INTERNET
The following supporting information may be found in the online version of this article:

Figure S1
Cropping of ultrasound images to standardized dimensions of 4:3 by selecting region of interest (ROI).

Figure S2
Reliability diagrams for VGG16-based model, before and after calibration, showing accuracy plotted against confidence of the model. Confidence is the predicted probability for the most probable class (benign or malignant). Since classification is binary, it will always be above 0.5. In an overconfident model, confidence exceeds accuracy.