Machine learning improves early prediction of small‐for‐gestational‐age births and reveals nuchal fold thickness as unexpected predictor

To investigate the performance of the machine learning (ML) model in predicting small‐for‐gestational‐age (SGA) at birth, using second‐trimester data.


| INTRODUCTION
A small-for-gestational-age (SGA) newborn is defined as one with a birth weight below the 10th centile. 1 SGA can result from fetal growth restriction, where the fetus fails to achieve its individual growth potential due to placental disorders, maternal medical conditions, genetic factors, and external factors such as smoking. 1 Newborns with birth weights below the third centile experience a 5-10 times higher rate of mortality 2,3 and exhibit a greater risk of developing chronic diseases in adult life. [4][5][6] Identifying SGA fetuses before delivery is clinically important: in a population-based study by Gardosi et al., the main risk factor associated with stillbirth was SGA that went unidentified before birth. 7 Lindqvist and Molin also reported that if SGA fetuses were identified before birth and were given proper antenatal surveillance, their risk of adverse outcomes could be reduced fourfold. 8 Early prediction of SGA will provide the opportunity to select patients for follow-up, allowing closer monitoring and better clinical management. Furthermore, new pharmacological treatments for SGA are being investigated, 9,10 and if they prove successful, predicting SGA at an earlier stage would be useful to identify subjects for treatment. Clinical risk prediction models, such as logistic regression, are common in medical domains because they allow interpretation of model parameters. However, machine learning (ML) can learn data patterns in higher dimensions [11][12][13] and has been reported to have significantly better predictive ability. 14 This explains its popularity in the biomedical domain.
Several past studies have used ML models to predict SGA. Features used for prediction in these studies include ultrasound biometric measurements, 15,16 umbilical Doppler blood flow, 17,18 pregnancy risk factors, sociodemographic and maternal characteristics and medical history, 19,20 and pregnancy-associated plasma protein A and placental growth factor. 21 The objective of this study is to evaluate the performance of ML in predicting SGA at birth, using second-trimester fetal ultrasound scans. We show that ML significantly improves the prediction of SGA at birth compared to prediction via clinical guidelines. Moreover, our ML models unexpectedly reveal nuchal fold thickness to be an important parameter in predicting SGA at birth.

| Subjects
The study protocol was approved by the National Healthcare Group (Singapore) Domain Specific Review Board. This was a retrospective study of women with singleton pregnancies who had routine antenatal care and delivered at the National University Hospital, Singapore (NUHS). All data collected were de-identified and a study number was assigned to every case. The percentile of birth weight was computed based on the INTERGROWTH-21st chart, 22 taking into account the gestational age at birth and sex. Cases where fetuses exhibited cardiovascular, structural or chromosomal abnormalities were excluded from the analysis. After filtering out cases with incomplete information, we had a total of 145 cases with birth weights above the 10th centile, 105 cases with birth weights between the third and 10th centiles, and 97 cases with birth weights below the third centile. Newborn babies with birth weights above the 10th centile were classified as appropriate-for-gestational-age (AGA).
Newborn babies with birth weights below the 10th centile were classified as SGA at birth. We categorized SGA at birth into two groups: (i) SGA, newborns with birth weights below the 10th centile, and (ii) severe SGA, newborns with birth weights below the third centile.
Ultrasound measurements were performed in a standard manner by trained sonographers, following the guidelines of the International Society of Ultrasound in Obstetrics and Gynecology (ISUOG). 23 Pregnancies were dated by last menstrual period and confirmed by first-trimester ultrasound measurements of fetal crown-rump length. 24 Our analysis used only one set of measurements for each fetal subject. Patient characteristics are described in Table 1.
gestational age, we added gestational age (GA) as an additional input (the 16th parameter) in the model, to ensure that the model could account for and adjust for this variability. This brought the total number of parameters to 16. In this study, we aimed to investigate the model's performance in predicting SGA and severe SGA infants at birth using second-trimester data, via two separate experiments. A model to predict SGA infants was trained in the first experiment, and a separate model was trained to predict severe SGA in the second experiment.
We randomly split the data into the training and validation set (80% of data) and the testing set (20% of data). In the testing dataset, In order to ensure that different types of data had similar orders of magnitude and that the data were free of measurement units, data for every parameter were scaled to have zero mean and unit standard deviation, according to the training data distribution. Next, training data in the class (healthy/disease) with the smaller sample size were randomly duplicated (oversampling) so that the number of samples in both classes was balanced. This was essential to improve the model's accuracy. 31 Note that only the training data were oversampled for the model training process; the testing data were not.
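The scaling and oversampling steps described above can be sketched in a few lines of library-free Python. This is an illustrative sketch only, not the study's code: the function names (`fit_scaler`, `apply_scaler`, `oversample`) and the toy numbers are assumptions made up for this example. It shows the two points the text stresses: the scaling statistics come from the training data alone, and only the training data are oversampled.

```python
import random
import statistics


def fit_scaler(train_rows):
    """Return per-column (mean, std) computed from the training rows only."""
    columns = list(zip(*train_rows))
    return [(statistics.mean(c), statistics.pstdev(c)) for c in columns]


def apply_scaler(rows, params):
    """Scale rows to zero mean and unit standard deviation using training statistics."""
    return [
        [(value - mean) / std if std > 0 else 0.0
         for value, (mean, std) in zip(row, params)]
        for row in rows
    ]


def oversample(rows, labels, seed=0):
    """Randomly duplicate minority-class rows until both classes have equal counts."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Toy data (made up): two features per case, label 1 = SGA, 0 = AGA.
train_X = [[20.0, 1.0], [21.0, 1.2], [22.0, 0.8], [19.0, 1.4], [20.5, 0.9]]
train_y = [0, 0, 0, 1, 1]
test_X = [[21.5, 1.1]]

params = fit_scaler(train_X)
train_scaled = apply_scaler(train_X, params)
test_scaled = apply_scaler(test_X, params)        # test data reuse the training statistics
bal_X, bal_y = oversample(train_scaled, train_y)  # only the training set is oversampled
```

Fitting the scaler on the training rows and reusing it for the test rows keeps information about the test distribution out of the model, which is the reason for scaling "according to the training data distribution".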
During training, we performed five- and 10-fold cross-validation to determine the best hyper-parameters for every model. 32 Hyper-parameters are variables that determine the structure of an ML model, and each model had different hyper-parameters to tune. Details of the hyper-parameters for each model and their tuning process are given in the Supplementary Document. In the fivefold cross-validation, the training and validation dataset (80% of all data) was evenly divided into five sets: four sets were used for training and one set was used for validation. This process was repeated five times for each hyper-parameter optimization process. 33 Fivefold and 10-fold cross-validation gave very similar model performance, judging by the area under the receiver operating characteristic curve, which is reported in Table S2 in the Supplementary Document. The results shown here were based on fivefold cross-validation.
In each fold, we used the Grid Search method 34 to test different sets of hyper-parameters. The hyper-parameter combination that yielded the highest average validation accuracy (averaged across the k-fold cross-validation) was used to train the model again using the training and validation datasets (i.e., the 80% of data, 116 AGA and The process of data preprocessing and model development is summarized in Figure 2. Analyses of RF and SVM were performed using the Scikit-learn package v0.19 in Python (v2.7), 34 while analysis of MLP was performed using the Keras package v2.0.6. 35
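The tuning procedure (k-fold cross-validation wrapped in a grid search, keeping the combination with the highest average validation accuracy) can be illustrated without any ML library. The sketch below is a hedged stand-in, not the study's implementation: the study used Scikit-learn and Keras with RF, SVM and MLP models, whereas the k-nearest-neighbour classifier, the `n_neighbors` grid, the function names and the toy data here are all assumptions chosen to keep the example self-contained.

```python
import itertools
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and deal them into k near-equal validation folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def knn_predict(train_X, train_y, x, n_neighbors):
    """Majority vote among the n_neighbors closest training points (stand-in model)."""
    order = sorted(range(len(train_X)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)))
    votes = [train_y[i] for i in order[:n_neighbors]]
    return max(set(votes), key=votes.count)


def cv_accuracy(X, y, n_neighbors, folds):
    """Average validation accuracy across folds for one hyper-parameter setting."""
    scores = []
    for fold in folds:
        held_out = set(fold)
        tr = [i for i in range(len(X)) if i not in held_out]
        tr_X, tr_y = [X[i] for i in tr], [y[i] for i in tr]
        correct = sum(knn_predict(tr_X, tr_y, X[i], n_neighbors) == y[i] for i in fold)
        scores.append(correct / len(fold))
    return sum(scores) / len(scores)


def grid_search(X, y, grid, k=5):
    """Try every combination in the grid; return the best setting and all scores."""
    folds = k_fold_indices(len(X), k)
    names = sorted(grid)
    results = {}
    for values in itertools.product(*(grid[name] for name in names)):
        setting = dict(zip(names, values))
        results[values] = cv_accuracy(X, y, folds=folds, **setting)
    best = max(results, key=results.get)
    return dict(zip(names, best)), results


# Made-up, well-separated toy data: class 0 near x = 0, class 1 near x = 5.
X = [[0.1 * i, 0.0] for i in range(10)] + [[5.0 + 0.1 * i, 0.0] for i in range(10)]
y = [0] * 10 + [1] * 10
best_setting, scores = grid_search(X, y, {"n_neighbors": [1, 3, 5]}, k=5)
# The selected setting would then be refitted on all training and validation data.
```

The same folds are reused for every hyper-parameter combination, so differences in the averaged accuracy reflect the hyper-parameters rather than the particular split.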

| Comparison between ML model and doctor's diagnosis
Fetuses were diagnosed as at risk of SGA in the second trimester if one of the following criteria was met, in accordance with the RCOG guidelines: 36
• Increased uterine arterial pulsatility index (Ut PI > 95th centile), or
• Estimated fetal weight or abdominal circumference below the 10th centile
These cases were deemed to be cases of physician-predicted disease. Here, we focused on the predictive properties of fetal ultrasound markers, and as such, other risk factors in the guidelines, such as maternal medical history and blood markers, were not considered. For a fair comparison between machine learning models and clinical diagnosis, we used the same input parameters (GA, estimated fetal weight [EFW], abdominal circumference [AC], and Ut PI) to train the machine learning models, following an approach similar to that described in Figure 2. The models' performances were evaluated using the testing datasets and compared with the doctor's diagnosis on the same testing datasets.
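To make the baseline concrete, the screening rule above and the three metrics used for the comparison (sensitivity, specificity and accuracy) can be written out as a short Python sketch. The function names, the assumption that each measurement is already expressed as a centile, and the six example cases are hypothetical and serve only to illustrate the calculation.

```python
def clinician_flags_sga(efw_centile, ac_centile, ut_pi_centile):
    """Apply the screening rule quoted above: any one criterion flags the fetus."""
    return ut_pi_centile > 95 or efw_centile < 10 or ac_centile < 10


def sensitivity_specificity_accuracy(predicted, actual):
    """Compute the three summary metrics from paired boolean predictions and outcomes."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    tn = sum((not p) and (not a) for p, a in zip(predicted, actual))
    fp = sum(p and (not a) for p, a in zip(predicted, actual))
    fn = sum((not p) and a for p, a in zip(predicted, actual))
    return tp / (tp + fn), tn / (tn + fp), (tp + tn) / len(actual)


# Hypothetical cases: (EFW centile, AC centile, Ut PI centile, SGA at birth?)
cases = [
    (8, 12, 50, True),    # flagged by low EFW, truly SGA -> true positive
    (40, 35, 97, True),   # flagged by high Ut PI, truly SGA -> true positive
    (30, 25, 60, True),   # not flagged, truly SGA -> false negative
    (55, 9, 40, False),   # flagged by low AC, AGA -> false positive
    (60, 50, 30, False),  # not flagged, AGA -> true negative
    (45, 70, 80, False),  # not flagged, AGA -> true negative
]
predicted = [clinician_flags_sga(e, a, u) for e, a, u, _ in cases]
actual = [outcome for *_, outcome in cases]
sens, spec, acc = sensitivity_specificity_accuracy(predicted, actual)
```

Evaluating the rule and an ML model on the same held-out cases with the same three metrics is what allows the like-for-like comparison described in the text.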

| ML model performance
From the experiments using all 16 parameters, we extracted the relative importance of each parameter in predicting SGA and severe SGA via the RF importance score and SVM coefficient magnitude.
Both RF and SVM showed that nuchal fold thickness played a significant role in successful prediction (importance ranking in Table 3). We proceeded to test this further by comparing models with and without nuchal fold thickness.
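The idea of ranking parameters by how much the prediction accuracy suffers without them can be illustrated with a model-agnostic permutation importance. This is a stand-in sketch and not the RF importance score or SVM coefficient magnitude used in the study: the function names, the hand-written `toy_model` and the data are assumptions invented for this example.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            shuffled = [row[:j] + [value] + row[j + 1:] for row, value in zip(X, column)]
            drops.append(baseline - accuracy(predict, shuffled, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical fitted model: it looks only at feature 0 and ignores feature 1.
def toy_model(row):
    return 1 if row[0] < 0.0 else 0


# Made-up standardized data: feature 0 separates the classes, feature 1 is noise.
X = [[-1.0, 0.3], [-0.8, -0.2], [-1.2, 0.9], [-0.5, -0.7], [-0.9, 0.1],
     [0.7, 0.4], [1.1, -0.5], [0.9, 0.8], [0.6, -0.1], [1.3, 0.2]]
y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

scores = permutation_importance(toy_model, X, y)
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
```

A feature the model does not rely on yields an importance of zero, while a feature the model depends on yields a positive drop, which is the basis for an importance ranking of the kind reported in Table 3.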
It is known that biometric measurements, such as HC, AC, FL and EFW, and Doppler measurements (uterine RI and PI) are correlated with SGA. As such, we selected the following combinations as models without nuchal fold thickness (NF).

| Statistical analysis
Statistical analysis was performed to test if there were differences between the three groups (BW > 10th centile, BW between third and 10th centile and BW < third centile) in terms of the 16 parameters, birth weight and gestational age at birth, using one-way ANOVA with the Tukey post hoc test. The data were deemed significantly different if p < 0.05.
We performed a multinomial logistic regression analysis to examine the contribution of each parameter (or independent variable) to the outcome variables. 37 The variance inflation factor (VIF) was computed for every independent variable. The VIF is a measure of multicollinearity among the independent variables in a multiple regression model; it was used to detect variables that were correlated and explained the same variance within the datasets, so that redundant variables could be discarded. Independent variables with a VIF of more than 10 were excluded from the logistic regression analysis to ensure reliable estimates of the regression coefficients. 38
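For reference, the VIF of the j-th independent variable is conventionally defined from the coefficient of determination obtained when that variable is regressed on all the other independent variables. The symbol R_j^2 is introduced here for illustration and does not appear in the original text.

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
```

Under this definition, the exclusion threshold of VIF > 10 corresponds to R_j^2 > 0.9, that is, more than 90% of the variance of that variable is explained by the other independent variables.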

| Patient characteristics
Patient characteristics are shown in Table 1. Low birth weight babies (Group B and Group C) were born earlier with significantly lower birth weight compared to AGA (Group A). Sonographic measurements such as abdominal circumferences and nuchal fold thickness were considerably lower in Groups B and C compared to Group A. The estimated fetal weights and transverse cerebellar diameters were significantly lower in Group B compared to Group A. For Group C, these parameters were lower than Group A, but the differences were not statistically significant. Uterine RI and PI were significantly elevated in Group C but not in Group B. Other parameters showed no significant differences among the three groups.
The logistic regression model showed that there were three significant predictors: nuchal fold thickness (odds ratio = 0.53, p < 0.001), abdominal circumference (odds ratio = 0.52, p = 0.007) and uterine PI (odds ratio = 1.56, p = 0.004). Abdominal circumference and uterine PI were known to be related to SGA, 1,39 but it was surprising to note that a thin nuchal fold also had a predictive relationship.

| Comparison between ML model and doctor's diagnosis
Comparisons between the predictive capabilities of clinical diagnosis and machine learning models (four-parameter version), based on their sensitivity, specificity and accuracy, are shown in Table 2. The accuracy of the models in predicting disease was consistently higher than that of the clinical diagnosis. In the first experiment for SGA prediction, MLP had the best predictive ability, achieving an accuracy of 71%, which was 7 percentage points higher than the clinical diagnosis, which had an accuracy of 64%. In the second experiment for severe SGA prediction, SVM had the highest accuracy among all the models, 73%, which was 25 percentage points higher than the clinical diagnosis, which had an accuracy of 48%.

| ML model performance
With the 16-parameter models, the RF model provided an importance score that described how much the accuracy dropped when a variable was excluded from the model. 40 The SVM coefficients described the coordinates of the vector orthogonal to the hyperplane and can be interpreted similarly as the importance of the parameters. 41  increased to 0.78. For severe SGA prediction (Table 5), similar observations were obtained: models with nuchal fold thickness showed consistently higher accuracy than models without nuchal fold thickness.
The model with the highest prediction accuracy for both SGA and severe SGA prediction was the SVM model, achieving accuracies of 78% and 83%, respectively (Tables 4 and 5). For both SGA and severe SGA prediction, the input parameter combination that yielded the highest accuracy was gestational age, estimated fetal weight, uterine RI, uterine PI and nuchal fold thickness.

| DISCUSSION
In this study, we used birth weight as a surrogate outcome to identify infants who were at increased risk of adverse perinatal outcomes, because birth weight has been shown to have a direct relation with perinatal mortality and morbidity. 2,43,44 We utilized machine learning models to predict SGA and severe SGA at birth, using second-trimester data. We compared our models' performance to a baseline, the performance of clinical diagnosis, and showed that machine learning could improve the prediction accuracy by approximately 10%-20% compared to the clinical diagnosis (Table 2), using similar parameters. However, as Tables 3 and 4 would show, there were other sets of input parameters that enabled even higher machine learning accuracies, although increasing the number of parameters did not necessarily improve accuracy.
There have been past attempts to use machine learning to predict growth-restricted fetuses using ultrasound scans alone. 15,17,18 These studies utilized measurements from later in gestation than ours and reported an average accuracy of 80%-90%.
Our findings corroborate theirs, as we achieved an accuracy of 83%. Our combined findings thus suggest that machine learning is a promising approach for detecting SGA infants. By utilizing only second-trimester measurements, we showed that early prediction of SGA is possible. Past studies, however, placed more emphasis on umbilical Doppler measurements, due to their known relationship with SGA, and did not include nuchal fold measurements.
Our results indicated that including nuchal fold thickness can improve the model's prediction performance. One previous study 15 demonstrated, interestingly, that including measurements taken over consecutive weeks (up to 4 weeks) in the modeling could improve predictive performance. This approach could be a good strategy for improving our current models in the future.
Our study similarly found that uterine Doppler indices were important predictors of SGA, but this was unsurprising, as many previous studies showed uterine Doppler indices to be well correlated with adverse fetal outcomes and SGA, 39,[45][46][47] likely due to alterations in the placental circulation resistance during SGA. 48,49 The interesting part of our findings, however, was that utilizing only  Increased nuchal fold thickness is traditionally a sonographic marker for fetal chromosomal and structural abnormalities. 51,52 Interestingly, in our study, it was reduced nuchal fold thickness that was found to contribute to SGA prediction. There is scarce literature on reduced fetal nuchal fold thickness, but a recent study in adults found that nuchal fold thickness was positively correlated with body mass index. 53 We hypothesize that the reduced nuchal fold thickness in SGA is likely due to reduced nuchal adipose tissue, which would be a logical consequence of the reduced growth rates.
We propose that investigations into the physiology of nuchal fold thickness in SGA fetuses are warranted and could lead to interesting findings.
It has been consistently shown that SGA babies have higher rates of fetal death, preterm birth, hypertension or preeclampsia. [54][55][56] It was reported that there was a significant association between second-trimester SGA and third-trimester SGA, suggesting that growth restriction started earlier than the third trimester in pregnancy. 57 With our models, early prediction of SGA could be made, and prospective controlled studies of prophylactic treatment (such as aspirin or antioxidants) and optimal nutritional supplementation 58 could be conducted in the high-risk population. Furthermore, efforts are underway to develop pharmacological 59 and other therapies 9,60,61 for growth-restricted fetuses, and an early detection tool will be invaluable for selecting subjects should a therapy prove effective. Much effort has also been made to understand this idiopathic disease using various tools, such as computational modeling, 62,63 biomechanics analyses [64][65][66] and genetic analyses. 66 Early detection of SGA would allow clinicians to monitor at-risk fetuses more closely before obvious features develop, which could improve perinatal outcomes and could lead to the discovery of other underlying factors causing the disease that may otherwise be missed.
This study, however, is only a first step in improving the prediction of SGA babies at earlier gestation. There are several ways to extend the current study, such as increasing the sample size and including other risk factors, such as pregnancy history, maternal blood pressure, smoking status, pregnancy complications such as preeclampsia and gestational diabetes mellitus, and pregnancy-associated plasma protein-A (PAPP-A) level. 36 With these additional data, it may be possible to further increase the accuracy and the generalizability of the models.
SAW ET AL.
To enable readers to test our machine learning algorithm, we provide an Excel file in the supplementary material that can compute the probability of a baby being born healthy or SGA using the MLP model. Further, the Python code for our models is available at https://github.com/shiernee/Small-for-Gestational-Age_Prediction.
We emphasize that these tools should be used for reference only, and not yet for diagnostic purposes.
A further limitation of this study was that we were unable to evaluate inter- and intra-observer variability of the measurements because it was a retrospective study. However, ultrasound measurements were performed by experienced sonographers in our hospital. Further, since the data had already been de-identified, there could be subjects in our study who had more than one pregnancy. Also, there was a small but significant difference in the gestational age at birth among the three cohorts. While we ensured that no subjects were born preterm or had detectable abnormalities other than SGA, we might not have ruled out all possible underlying abnormalities.

| CONCLUSION
We showed that machine learning can improve the accuracy of early prediction of SGA at birth and presented the combination of parameters that can optimize predictive capabilities. Furthermore, our machine learning modeling unexpectedly revealed a thin nuchal fold to be an important predictive parameter for SGA at birth, one that may have been neglected in the past. As such, it might be worthwhile to pay attention to this parameter during second-trimester antenatal screening. Early detection of SGA is important because it can lead to improved outcomes via management strategies, enable the detection of underlying diseases, and will be invaluable in the future when intrauterine therapeutic and preventive strategies become available.