Introduction to artificial intelligence in ultrasound imaging in obstetrics and gynecology

ABSTRACT Artificial intelligence (AI) uses data and algorithms in an attempt to draw conclusions that are as good as, or even better than, those drawn by humans. AI is already part of our daily life; it is behind face recognition technology, speech recognition in virtual assistants (such as Amazon Alexa, Apple's Siri, Google Assistant and Microsoft Cortana) and self-driving cars. AI software has been able to beat world champions in chess, Go and, recently, even poker. Relevant to our community, it is a prominent source of innovation in healthcare, already helping to develop new drugs, support clinical decisions and provide quality assurance in radiology. The list of medical image-analysis AI applications with USA Food and Drug Administration or European Union (soon to fall under European Union Medical Device Regulation) approval is growing rapidly and covers diverse clinical needs, such as detection of arrhythmia using a smartwatch or automatic triage of critical imaging studies to the top of the radiologist's worklist. Deep learning, a leading tool of AI, performs particularly well in image pattern recognition and, therefore, can be of great benefit to doctors who rely heavily on images, such as sonologists, radiographers and pathologists. Although obstetric and gynecological ultrasound are two of the most commonly performed imaging studies, AI has had little impact on this field so far. Nevertheless, there is huge potential for AI to assist in repetitive ultrasound tasks, such as automatically identifying good-quality acquisitions and providing instant quality assurance. For this potential to thrive, interdisciplinary communication between AI developers and ultrasound professionals is necessary. In this article, we explore the fundamentals of medical imaging AI, from theory to applicability, and introduce some key terms to medical professionals in the field of ultrasound. We believe that wider knowledge of AI will help accelerate its integration into healthcare.
© 2020 The Authors. Ultrasound in Obstetrics & Gynecology published by John Wiley & Sons Ltd on behalf of the International Society of Ultrasound in Obstetrics and Gynecology.


Introduction
Artificial intelligence (AI) is described as the ability of a computer program to perform processes associated with human intelligence, such as reasoning, learning, adaptation, sensory understanding and interaction 1 . In his seminal paper published in 1950 2 , Alan Turing introduced a test (now called 'the Turing test') in which, if an evaluator cannot distinguish whether intelligent behavior is exhibited by a machine or a human, the machine is said to have passed the test. John McCarthy coined the term 'artificial intelligence' soon after 3 . The journal Artificial Intelligence commenced publication in 1970, but it took several years for computing power to match theoretical possibilities and allow development of modern algorithms.
In simple terms, traditional computational algorithms are software programs that follow a fixed sequence of rules and perform an identical function every time, like an electronic calculator: 'if this is the input, then that is the output'. In contrast, an AI algorithm learns the rules (the function) from training data (the input) presented to it. Major milestones in the history of AI include the Deep Blue computer outmatching the world chess champion, Garry Kasparov, in 1997 and AlphaGo defeating one of the best players (ranked 9-dan) of the ancient Chinese game of Go, Lee Sedol, in 2016 4 .
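To make this contrast concrete, the following purely illustrative sketch (our own toy example, not any cited system) compares a fixed, programmer-written rule with a rule estimated from example input-output pairs by ordinary least squares:

```python
# Traditional algorithm: the rule is written by the programmer.
def fixed_rule(x):
    return 2 * x + 1  # 'if this is the input, then that is the output'

# 'AI'-style algorithm: the rule (here a line y = w*x + b) is
# estimated from training data rather than hard-coded.
def learn_rule(pairs):
    """Estimate w and b for y = w*x + b by ordinary least squares."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var = sum((x - mean_x) ** 2 for x, _ in pairs)
    w = cov / var
    b = mean_y - w * mean_x
    return lambda x: w * x + b

training_data = [(0, 1), (1, 3), (2, 5), (3, 7)]  # secretly y = 2x + 1
learned = learn_rule(training_data)
print(fixed_rule(10), round(learned(10), 2))  # both give 21
```

Both functions return 21 for an input of 10; the difference is that the second recovered its rule from the data alone, which is the essence of 'learning' in the sense used here.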
Both chess and Go are games that require strategy, foresight and logic, all of which are qualities typically attributed to human intelligence. Go is considered much more difficult for computers than chess, because it involves far more possible moves (approximately 8 million choices for three moves, as opposed to 40 000 for chess). The victory in Go reflects progress in computational algorithms, improved computing infrastructure and access to enormous amounts of data. The same evolution has led to several widely popularized AI consumer applications, including autocomplete in Google search, virtual assistants (such as Alexa, Cortana, Google Home and Siri), personalized shopping recommendations, the emergence of self-driving cars and face recognition (for instance, searching by a face in Google Photos).
In clinical medicine, the interest (and recent hype) in AI technologies stems from their potential to transform healthcare by deriving new and important insights from the vast amount of digital data generated during delivery of healthcare. Promising medical AI applications are emerging in the areas of screening 5,6 , prediction 7-9 , triage 10,11 , diagnosis 12,13 , drug development 14,15 , treatment 16,17 , monitoring 18 and imaging interpretation 19,20 . Several original studies published in this Journal have used AI methodology to evaluate adnexal masses 21 , the risk of lymph node metastases in endometrial cancer 22 , pelvic organ function 23,24 and breast lesions 25-27 , assess aneuploidy risk 28 , predict fetal lung maturity 29 , perinatal outcome 30 , shoulder dystocia 31 and brain damage 32 , estimate gestational age in late pregnancy 33 and classify standard fetal brain images as normal or abnormal 34 (Table 1). The number of AI-related papers is increasing; at the 29th World Congress of the International Society of Ultrasound in Obstetrics and Gynecology (ISUOG) in 2019, there were 14 abstracts specifically mentioning AI, in comparison to a total of 13 abstracts in the preceding six ISUOG World Congresses (2013-2018).
As with any scientific discipline, the AI scientific community uses technical language and terminology that can be difficult to understand for those outside the area. This, together with the rapid advancement of the field, can make it challenging for other disciplines to keep abreast of developments in AI. Indeed, one of the key concerns that has been expressed regarding AI in medicine is that there are relatively few interdisciplinary professionals who work at the interface of AI and medicine and can 'translate' between the two 35 . A recent review of 250 AI papers emphasized the need for greater collaboration between computational scientists and medical professionals to generate more scientifically sound and impactful work integrating knowledge from both domains 36 .
To contribute to this discussion, this article aims to explain key AI-related concepts and terms to clinicians in the field of ultrasound in obstetrics and gynecology. For simplicity, we use the general term 'artificial intelligence (AI)', which is commonly used by others in the field, although most articles referring to AI in clinical medicine are based on deep learning, a subset of AI (Box 1, Figure 1). It is also important to appreciate that relatively few AI-based ultrasound applications have advanced the whole way from academic concept to clinical application and commercialization. Therefore, we also use examples from radiology, being our closest sister field.

Artificial intelligence and medical imaging
The current interest in AI in medical imaging stems from major advances in deep learning-based 'computer vision' over the past decade. The field of computer vision concerns computers that interpret and understand the visual world. Within computer vision, object recognition ('what can I see in this image?') is a key task which can be posed as an image classification problem. Researchers in this field use 'challenge' datasets to benchmark the progress in accuracy of image classification. One such challenge dataset, called the ImageNet project, is a database of more than 14 million images of everyday (non-medical) objects that have been labeled by humans into more than 20 000 categories. This large database was first made available to the scientific community in 2010 to train algorithms for image classification. In 2015, the ImageNet annual competition reached a milestone when the error rate of automatic classification of images dropped below 5%, which is the average human error rate (Figure S1) 17 . This was largely due to advances in deep learning, the branch of AI that learns from large amounts of data.
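For illustration, the error rate used to score such benchmarks is simply the fraction of images assigned the wrong category. A minimal sketch (our own, with made-up labels):

```python
def error_rate(predicted, truth):
    """Fraction of labels the classifier got wrong."""
    wrong = sum(p != t for p, t in zip(predicted, truth))
    return wrong / len(truth)

# Toy benchmark with five 'images' (labels are illustrative only)
truth = ["cat", "dog", "car", "cat", "tree"]
predicted = ["cat", "dog", "car", "dog", "tree"]  # one mistake
print(error_rate(predicted, truth))  # 0.2, i.e. a 20% error rate
```

The ImageNet competition scores are computed on millions of images and thousands of categories, but the principle is the same comparison of predicted against human-assigned labels.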
Deep learning excels in pattern recognition and we believe that medical professions that rely on imaging will be the first to see the benefits of this tool (Appendix S1). One of the largest driving forces behind AI in medical imaging is the enormous amount of digital data generated around the world that may be useful in training algorithms. As of May 2020, there are more than 50 deep learning-based imaging applications 37 approved by the USA Food and Drug Administration (FDA) or the European Union, spanning most imaging modalities, including X-ray, computerized tomography (CT), magnetic resonance imaging, retinal optical coherence tomography and ultrasound. Approved AI applications are designed to provide increased productivity by performing automated screening, assisting in diagnosis or prioritizing a radiology study that needs to be 'at the top of the list'. Applications include identification of cerebrovascular accidents, diabetic retinopathy, skeletal fractures, cancer, pulmonary embolism and pneumothorax 37 . Recently, the first ultrasound AI application that guides the user received FDA approval; the software uses AI

Box 1 Glossary of commonly used artificial intelligence terms
Artificial intelligence (AI) refers to a machine or software performing tasks that would ordinarily require human brainpower to accomplish, such as making sense of spoken language, learning behaviors or solving problems (Figure 1). This means that an AI program can learn from real-world data as well as experience, and encompasses the capacity to improve its performance given more data. Nevertheless, there is no accepted definition of AI, and therefore, the term is often misused 71 . AI can be broken down into general AI, which is human-like intelligence (i.e. the ability to think, learn and reason), and narrow AI, which is the ability to perform a specific task (e.g. image detection, translation, chess-playing).
Convolutional neural networks (CNNs) are a class of artificial neural network: computational algorithms loosely inspired by the biological neural networks that constitute animal brains, consisting of multiple layers of artificial neurons (Figure 1). A CNN can be represented as a system of hidden connections between input and output. CNNs have the ability to determine the relationship between an input (such as a brain computerized tomography (CT) scan) and a label (presence or absence of hemorrhage). This is in contrast to traditional software, in which predetermined logic rules map specific stimuli to outputs. In reality, artificial neurons bear little resemblance to human neurons.
Black box is the term often used to describe the process occurring inside the hidden layers of CNNs. Suppose, for example, that a new AI product is launched that aims to detect intracranial hemorrhage. When this software reads a CT scan that shows signs of intracranial hemorrhage, it will correctly report evidence of intracranial hemorrhage to the care team, yet it may not explain why it reached this conclusion. There is an ongoing effort to provide 'explainability' to AI, i.e. to report the 'how' in addition to the result (Explainable AI).
Explainable AI is an emerging subfield of AI that attempts to explain how black box decisions of AI systems are made. Explainable AI aims to understand the key steps involved in making computational decisions. This should theoretically allow decisions taken by an algorithm to be understood by end-users.
Model, application or algorithm are all terms used interchangeably for the ready-to-use AI software/product.
Machine learning is a branch of AI, defined by the ability to learn from data without being explicitly programmed (Figure 1). Machine learning can be understood as a statistical method that gradually improves, by extracting patterns from data, as it is exposed to more data. Deep learning is a branch of machine learning (Figure 1). In deep learning, the input and output are connected by multiple hidden layers of connections, as in CNNs. Deep learning involves learning from vast amounts of data and performs especially well in pattern recognition within data; therefore, it can be particularly helpful in medical imaging. Deep learning is usually divided into two major classes: 1) Supervised learning, in which labeled (annotated) data are used as input to a CNN (Appendix S1). For example, to build an application detecting brain hemorrhage on a CT scan, the CNN is first trained using labeled data, i.e. normal scans and scans with hemorrhage, each labeled with the correct diagnosis by a radiologist (label = hemorrhage present/absent). Following training using the training dataset, the CNN is evaluated using a test dataset of unlabeled data: new CT scans (not contained in the training dataset), with and without hemorrhage, for which the labels are withheld. The CNN outputs its prediction for each test scan. After validation of the prediction accuracy, the model is ready to use. For instance, the final model is software that can read a brain CT scan (input = CT scan) and decide whether intracranial hemorrhage is present (output = hemorrhage present/absent).
2) Unsupervised learning is a training process that does not require labeling. This avoids the time-consuming, labor-intensive and expensive human labeling process. In the intracranial hemorrhage example, the learning input would be CT scans of patients with and without hemorrhage that are not labeled (i.e. the machine is never told whether bleeding is absent or present). The CNN learns by clustering scans that look similar to one another (learning from similarities and differences), which should result in classifying images as either hemorrhage or no hemorrhage.
Big data: In order to achieve good performance, supervised AI applications require a large volume of labeled training data (usually images) from which to learn. Establishing a clinically relevant, well-curated dataset that can be used to train an algorithm can be a very time-intensive process, and the accuracy of such curation determines the quality of the derived model.
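As a deliberately simplified illustration of the supervised workflow described above (our own toy example, in which a single learned intensity threshold stands in for the CNN and each 'scan' is reduced to one number):

```python
# Toy supervised learning: each 'scan' is reduced to a single feature
# (say, mean pixel intensity) and labeled by a 'radiologist'.
train = [(0.2, "normal"), (0.3, "normal"),
         (0.7, "hemorrhage"), (0.9, "hemorrhage")]

def fit(data):
    """'Training': place a decision threshold midway between class means."""
    pos = [x for x, label in data if label == "hemorrhage"]
    neg = [x for x, label in data if label == "normal"]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: "hemorrhage" if x > threshold else "normal"

model = fit(train)

# 'Testing': evaluate on scans never seen during training.
test = [(0.25, "normal"), (0.80, "hemorrhage")]
accuracy = sum(model(x) == label for x, label in test) / len(test)
print(accuracy)  # 1.0 on this toy test set
```

A real CNN learns millions of parameters from thousands of images rather than one threshold from four numbers, but the train-then-test structure is exactly this.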
to help the user capture images of acceptable diagnostic quality during adult echocardiography 38 . The market for AI applications in medical imaging alone is forecast to top $2 billion by 2023 39 .
What about ultrasound? Ultrasound AI software needs to fit into the workflow differently from, for example, software for the analysis of a CT scan; in ultrasound, real-time analysis at the point of acquisition is ideally needed, while in CT, automated reading is needed only at the end of the examination. Compared with the image acquisition and analysis abilities of a sonologist, no known current AI method is generic enough to be applied to a wide range of tasks (e.g. an AI application designed for the second trimester is unlikely to be applicable to the first-trimester scan). For each ultrasound task, there are several image acquisition and analysis capabilities that can be met by an AI application, including classification ('what objects are present in this image?'), segmentation ('where are the organ boundaries?'), navigation ('how can I acquire the optimal image?'), quality assessment ('is this image fit for purpose to make a diagnosis?') and diagnosis ('what is wrong with the imaged object?'). Active academic research and emerging examples of AI-assisted applications for ultrasound include plane-finding (navigation) and automated quantification for analysis of the breast, prostate, liver and heart 40-42 . In obstetric and gynecological ultrasound, promising workload-changing advancements include automatic detection of standard planes and quality assurance in fetal ultrasound 43-45 , detection of endometrial thickness in gynecology 46 and automatic classification of ovarian cysts (Table 1).
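Of the capabilities listed above, segmentation is the easiest to caricature. The following toy sketch (ours; real AI segmentation learns far richer criteria than a fixed threshold) labels each pixel of a tiny synthetic 'image' as structure or background:

```python
# A tiny 4x4 'ultrasound image' (pixel intensities in [0, 1])
image = [
    [0.1, 0.2, 0.1, 0.0],
    [0.1, 0.8, 0.9, 0.1],
    [0.2, 0.9, 0.8, 0.1],
    [0.0, 0.1, 0.2, 0.1],
]

def segment(img, threshold=0.5):
    """Return a binary mask: 1 = inside the structure, 0 = background."""
    return [[1 if px > threshold else 0 for px in row] for row in img]

mask = segment(image)
area = sum(map(sum, mask))  # pixels inside the segmented boundary
print(area)  # 4 bright pixels form the 'organ'
```

Whatever the method, the input/output shape is the same: an image goes in, and a per-pixel labeling of the organ boundary comes out, from which measurements such as area can be derived.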

Challenges
The introduction of AI into clinical practice offers many potential benefits, but there are also many challenges and uncertainties that may raise concerns.
The impact of AI on jobs is among the most widely discussed concerns 47-49 . Major technological advances frequently impact the job market, and the current wave of AI-based automation is no exception. However, this does not automatically imply technological unemployment; rather, it may trigger a transformation in the way we work, resulting in professional realignment. AI can enhance both the value and the professional satisfaction of sonographers and maternal-fetal medicine experts by reducing the time needed for routine tasks and allowing more time to perform tasks that add value and influence patient care 49,50 . An important advantage that machines have over humans is reproducibility: machines retain absolute consistency over time, whereas the performance of a clinician varies depending on many factors, such as years of experience, fatigue or simple distractions, such as a late-running clinic or a ringing phone. Additionally, an AI application has higher capacity, theoretically being able to read thousands of scans, while a radiographer reads 50-100 scans per day 49 . Evidence in the
literature suggests that the first wave of AI applications is likely to constitute assistive technology, taking over repetitive tasks to improve consistency, such as reading radiographs 51 . Specifically in ultrasound, automation will assist in shortening the total scan duration by removing the need for some of the tiresome or 'simple' repetitive tasks, such as acquiring standard planes, annotation or adjustment of calipers (Table 1). This may allow more time to analyze additional scan planes or to communicate the results to patients 52 . Automation should also be seen in the context of a global shortage of imaging experts, including sonographers and radiographers, while demand for diagnostic imaging is rising 53 .
Applicability is another concern relating to the implementation of AI in clinical medicine. Imaging features alone are often not sufficient to determine diagnosis and management. Consider, for instance, an AI application developed to report on ovarian cysts that is designed to produce a binary outcome of malignant features being absent or present based on an ovarian imaging training dataset. Clinicians also take into account the clinical context, including many factors such as age, menopausal status and familial risk factors, when making a diagnosis. While it could be argued that the clinician may be biased by clinical information, this example highlights the importance of understanding when an AI solution is applicable and when it is not. AI models can account only for information 'seen' during training, so in this example, non-imaging clinical information is not taken into account by the AI model. Hence, an important emerging area of healthcare AI research focuses on building AI models that integrate imaging and electronic health record data for 'personalized diagnostic imaging' 54,55 .
Another fear, which is largely unwarranted, relates to adaptable systems, which are AI applications that continue to learn, adapt and optimize based on new data and hence may jeopardize the application's safety. Regulatory bodies, including the FDA, currently approve only AI applications with models that have 'locked' parameters 56,57 . This means that all current AI applications are static models that can no longer adapt, and therefore, the approved product does not change over time.
The 'black box' design of AI applications is attractive at one level, as there is no need to understand how the complex non-linear optimization works, but is also a source of concern, as clinicians want to understand any associated bias and likely modes of failure. Most AI models are derived using 'supervised learning', meaning that the model learns from data annotated by humans (Box 1). Since human involvement can potentially introduce bias to the learning process, the resulting model could also be biased. Understanding model bias is an important aspect of AI model design and an active area of research 58 . For example, as operators seem to be at risk of expected-value bias when acquiring fetal biometry measurements, an algorithm trained by supervised learning to measure fetal biometry on standard planes might end up with a built-in bias when calculating these measurements automatically 59 . To better understand AI model bias, as well as to provide insights into how AI algorithms make decisions, 'explainable AI' is an emerging subfield of AI research aiming to demystify the black box design.
Deep learning excels in pattern recognition, but it is important to recognize that most methods are supervised (i.e. the training data are annotated manually). Manual annotation is resource-intensive and often subjective. Most academic publications use data annotated a single time by one or more human annotators, which means that the derived model will be biased by (or skewed towards) the annotators' preferred method of annotation. If, instead, each image is annotated by multiple humans, rules are needed on how to reach consensus when their annotations differ, and there is no single accepted way to do this. Thus, one can appreciate that the process of annotation and subsequent data cleaning is resource-intensive and determines the success of model performance. Furthermore, traditional deep-learning methods require a considerable volume of data, which is not always available, to build accurate models. There are ways to address this limitation, which are the subject of current research on deep learning in medical imaging. These include using pre-trained models, which essentially allow initialization of the parameters of a new model with those of a model built for another problem, and then allowing new data to update the model parameters. Another issue deep-learning scientists have to consider is deployability, as traditional deep-learning models can have millions of parameters and occupy a large amount of computer memory. Models can be reduced in size empirically, and there is an emerging interest in designing small deep neural networks, such as MobileNet and SESNet, as the backbone for deployable AI application models.
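The idea of initializing a new model with pre-trained parameters can be sketched with a toy one-parameter model (our own illustration, unrelated to any cited method): parameters fitted on a plentiful 'source' dataset serve as the starting point for training on a scarce 'target' dataset.

```python
# Toy 'pre-training and fine-tuning' with a one-parameter model y = w * x.
def train(data, w_init=0.0, lr=0.01, steps=200):
    """Gradient descent on mean squared error, starting from w_init."""
    w = w_init
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

big_dataset = [(x, 3.0 * x) for x in range(1, 6)]  # plentiful 'source' data
small_dataset = [(1, 3.2), (2, 6.4)]               # scarce 'target' data

w_pretrained = train(big_dataset)                      # converges near 3.0
w_finetuned = train(small_dataset, w_init=w_pretrained)  # nudged toward 3.2
print(round(w_pretrained, 2), round(w_finetuned, 2))  # 3.0 3.2
```

Real transfer learning copies millions of network weights rather than one coefficient, but the principle is the same: the scarce dataset only has to adjust an already-sensible model, not build one from scratch.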
Unfortunately, there are high expectations of AI applications which have yet to be backed up by wide-scale convincing multicenter clinical studies and, when appropriate, randomized clinical trials. An interesting overview of the current standards of AI research in medical imaging is provided in a recent publication 60 . Indeed, most of the reported AI applications to date use data from a single site and focus on algorithm performance rather than looking at clinical utility or health economics 60 . It is particularly challenging to assess an AI model when the accuracy of a human expert for the same task is difficult to determine or is unknown 60 . It is important to appreciate that healthcare AI is an emerging technology and, as such, it will take time to determine the best ways to validate and regulate AI applications. Towards this goal, a recent multinational academic report addressing both medical and non-medical AI systems, entitled 'Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims' 61 , provides a list of measures and mechanisms for AI developers and regulatory bodies to ensure responsible AI development. Among the recommendations, the report calls for introduction of third-party auditing of AI systems, creating a system for reporting AI incidents and encouraging researchers in academia to verify claims made by industry.
No discussion about AI would be complete without mentioning ethics 62 . Recently, the classic theoretical 'trolley problem' experiment was applied to self-driving cars, as part of an online experimental platform designed to explore human perspectives on moral decisions made by autonomous vehicles 63 . The question is whose safety should be prioritized in the event of an accident. Essentially, the problem asks: if your car's brakes suddenly fail as you speed toward a crowded crosswalk, and you are confronted with the dilemma of veering right and crashing into an old man or veering left and crashing into a concrete wall, killing the car driver and passengers, what would you choose? Now, what if instead of an old man, it was a woman pushing a stroller or a homeless person crossing the road? Human drivers who are badly injured or die in a car crash cannot report whether they were faced with a dilemma. However, self-driving cars can be programmed to act in a certain way. Similarly, the use of AI-based solutions may raise several moral questions in medicine 64,65 : would we trust computers to screen for disease, prioritize treatment, diagnose, treat, discharge? Would we let a fully automated AI-based solution choose the patient to occupy the only available intensive care unit bed?
Ethical concerns also surround the issue of privacy 65,66 . Developing AI applications typically requires a large volume of data about patients and their diagnoses. Such personal data are usually collected by health authorities or hospitals. Under what conditions (if any) should hospitals be allowed to share patient data with developers of AI solutions, who may be commercial entities? If healthcare data are completely anonymized, does a patient need to expressly consent to their use for such improvements in healthcare? These questions, which relate to data governance and privacy, are not unique to healthcare AI and are currently being debated widely by regulators, policy-makers, technologists and technology end-users (including the public). An emerging technology area, called privacy-enhancing technologies, may offer data-sharing and analysis options to reduce some of the current barriers and concerns.
Potential professional liability for physicians using AI is another challenge 67 . Should hospitals and doctors be accountable for decisions that an AI application makes? Information provided by an AI application may be used to inform clinical management, diagnosis or treatment. However, algorithms, like humans, can err. Let us suppose that an AI algorithm classifies an ovarian cyst as most likely benign and recommends follow-up imaging in 6 months according to the standard of care; at the next appointment, the patient is diagnosed with metastatic ovarian cancer and retrospective image review suggests that the 'cyst' may have had malignant features previously. This raises the question: who is liable when AI-based diagnosis is incorrect? Questions of this kind are currently being considered by regulators, in consultation with legal professionals, medical professionals and AI developers in the industry.

Research in context
As we begin to see more interdisciplinary research related to AI in clinical medicine, difficulties arise when readers and reviewers with a clinical background attempt to assess critically the methodology of scientific AI papers in a field that is, for now, largely unfamiliar to many medical professionals. How can the clinical research community ensure that highly technical aspects of a scientific work have been conducted and presented correctly 68 ? Ultrasound professionals understand the full meaning of 'sonographer with 10 years of experience' or 'images were reviewed by two specialists', but may struggle with descriptions such as 'A feed-forward network of neurons consisting of a number of layers that are connected to each other was built.' 28 or 'To train the model, we first provided the sample input, x, to the first layer and acquired the best parameters (W, b) and activated the first hidden layer, y, and then utilized y to predict the second layer.' 30 . When assessing the clinical effectiveness and legitimacy of scientific work for publication, several crucial questions should be raised, including: which of the authors are AI scientists and what is their experience; how were the training and test data acquired; what were the input variables; how was the algorithm trained; how was the algorithm evaluated and validated, and was the validation internal or external; and are the results reproducible? We believe that one simple solution is to include on the Editorial Boards of journals technical reviewers with expertise in AI, who are able to ensure the soundness of the technical aspects of a paper and assess interdisciplinary research.
To facilitate reporting of AI trials, the CONSORT (Consolidated Standards of Reporting Trials) and SPIRIT (Standard Protocol Items: Recommendations for Interventional Trials) steering groups are expected to publish the first international consensus-based reporting guidelines for clinical trials evaluating AI interventions in 2020 69 .

Summary
AI uses data and algorithms to derive computational models of tasks that are often as good as (or better than) humans. AI is already a part of our daily life and is a prominent source of innovation in healthcare, helping to develop new drugs, support clinical decisions and provide quality assurance. Deep learning performs particularly well in image pattern recognition and solutions based on this approach can benefit healthcare professionals who depend heavily on information obtained from images, such as radiographers, pathologists and sonologists.
We have presented an overview of AI technology and some of the issues related to the introduction of this emerging technology into clinical practice, in the context of ultrasound in obstetrics and gynecology. At this stage, AI applications are in the early stages of deployment and a systematic review would be premature. In addition, performing a clinical systematic review in this area is challenging because most of the published peer-reviewed scientific articles appear in the engineering literature, which usually focuses on the AI methodology, and few studies
have assessed clinical applicability. Lastly, algorithms and results of approved AI applications are often not published in scientific journals due to commercial sensitivities.
In the past, advances in women's ultrasound have been achieved largely through better imaging, advances in education and training, adherence to guidelines and standards of care, and improvement of genetic technologies 70 . Despite all these advances, the fundamental way in which ultrasound images are acquired and interpreted has remained relatively unchanged. AI opens an opportunity to introduce into the patient-carer relationship a third 'participant' that is able to contribute to healthcare. Improved quality, through automatic categorization or interpretation of images and ensuring that images are fit for purpose, can increase confidence in imaging-based diagnosis. In high-income settings, this could contribute to healthcare efficiency and workflow improvements in screening. In under-resourced settings, it opens the prospect of strengthening ultrasound imaging by replicating basic obstetric ultrasound where there is none, which could allow, for example, gestational-age estimation or diagnosis of placenta previa. For this potential to be realized, interdisciplinary communication between AI developers and ultrasound professionals needs to be strengthened. A greater understanding of how AI methods work is important to enable clinicians to trust AI solutions. To ensure seamless integration of AI, medical professional organizations should start considering how AI affects them, recommend that physicians publish their experiences of using AI technologies, and consider appropriate guidelines or committees on aspects of AI.