External validation and clinical utility of prognostic prediction models for gestational diabetes mellitus: A prospective cohort study

Abstract Introduction We performed an independent validation study of all published first trimester prediction models, containing non‐invasive predictors, for the risk of gestational diabetes mellitus. Furthermore, the clinical potential of the best performing models was evaluated. Material and methods Systematically selected prediction models from the literature were validated in a Dutch prospective cohort using data from Expect Study I and PRIDE Study. The predictive performance of the models was evaluated by discrimination and calibration. Clinical utility was assessed using decision curve analysis. Screening performance measures were calculated at different risk thresholds for the best model and compared with current selective screening strategies. Results The validation cohort included 5260 women. Gestational diabetes mellitus was diagnosed in 127 women (2.4%). The discriminative performance of the 12 included models ranged from 68% to 75%. Nearly all models overestimated the risk. After recalibration, agreement between the observed outcomes and predicted probabilities improved for most models. Conclusions The best performing prediction models showed acceptable performance measures and may enable more personalized medicine‐based antenatal care for women at risk of developing gestational diabetes mellitus compared with currently applied strategies.


| INTRODUCTION
Gestational diabetes mellitus (GDM) is a common condition during pregnancy. The prevalence has increased over recent years and varies considerably between studies (2%-25%), as it depends on the population studied, the screening method employed and the diagnostic criteria used. 1 GDM is a risk factor for maternal and perinatal complications such as preeclampsia, macrosomia, shoulder dystocia and neonatal hypoglycemia. 2 Long-term risks, ie, development of diabetes mellitus (DM) type 2 in both mother and offspring, primarily contribute to the global burden of disease. 3 Consequences of GDM are often already present at the time of diagnosis (eg, large-for-gestational-age [LGA] infant), as the disorder is mostly asymptomatic. 4 Therefore, early identification of pregnant women with GDM, usually by an oral glucose tolerance test (OGTT), is essential, as early diagnosis and clinical management improve pregnancy outcomes. 5 Internationally, however, there is no consensus about whether to screen all women for GDM (universal screening) or only women with prespecified risk factors (selective screening). 6 Universal screening has a high detection rate but may also lead to an increased burden for women as well as for healthcare resources. Although selective screening reduces the number of women to be screened, a drawback of current risk strategies is that cases are missed at an early stage. Current risk criteria lists are limited by the fact that risk indicators are used independently without taking into account the strength of the different risk factors in relation to GDM. 7,8 Furthermore, the risk factors are often treated categorically (eg, body mass index >30 kg/m²), leading to loss of information that could be obtained using continuous data. 9
Prediction models may be more accurate in identifying women at high risk for GDM as multiple risk factors are combined in an algorithm, taking into account the risk-dependent weight of each risk factor and possible interrelations. 10 By calculating a probability on a continuous scale, a particular trade-off between sensitivity and specificity can be chosen. In addition, prognostic prediction models may also constitute a basis for personalized medicine, guiding the planning of antenatal care and the targeting of preventive strategies. 11 A substantial number of prediction models for the risk of GDM have been developed, 12 but to our knowledge none of these is routinely used in clinical practice. Validation of prediction models in independent populations is a crucial step before implementation in clinical practice. 13 Only a few studies externally validated models for GDM, and most validated only up to five models. 14-18 A first comparison of multiple non-invasive early prediction models for the risk of GDM in an independent cohort was published in 2016. 19 Most of the prediction models showed acceptable discrimination and calibration.
In this study, we performed a fully independent validation study of all published first trimester prediction models, containing non-invasive predictors, for the risk of GDM in a Dutch prospective cohort study. In addition, and in contrast to the previously published external validation effort, we evaluated the clinical potential of the best performing models and compared it with the performance of current screening strategies.

| Selection of prediction models
We performed a systematic search in PubMed to identify prediction models, based on routinely collected parameters and applicable in the first trimester of pregnancy, for the risk of GDM. The search was updated until 13 April 2017. The search strategy and eligibility criteria have been published elsewhere. 20

| Validation cohort
Two population-based prospective cohorts of pregnant women were used for the validation sample: the Expect Study I and the PRIDE (PRegnancy and Infant DEvelopment) Study. Women with any type of preexisting DM were excluded from the analysis.

| Expect Study I
We performed a multicenter prospective cohort study with the primary objective to validate published first-trimester prediction models for adverse pregnancy outcomes. Six hospitals and 36 midwifery practices in the south-eastern part of The Netherlands recruited pregnant women less than 16 weeks of gestation and aged 18 years or older between 1 July 2013 and 1 January 2015, with follow-up until 31 December 2015. Pregnancies ending in miscarriage or termination at <24 weeks of gestation, or for which no outcome data were available, were excluded. Eligible pregnant women were invited to complete two web-based questionnaires (paper-based upon request), one at <16 weeks of gestation and one at ≥6 weeks after the due date. Medical records and discharge letters were requested from healthcare providers. A detailed description of the Expect Study I has been published elsewhere. 20

| PRIDE Study
The PRIDE Study is an ongoing large, Dutch prospective cohort study among pregnant women. Full details of the study have been published previously. 21 Pregnant women aged ≥18 years were asked to complete web-based questionnaires, paper-based upon request, at baseline (questionnaire 1; 8-12 weeks of gestation), during gestational weeks 17 (questionnaire 2) and 34 (questionnaire 3), and 2 (questionnaire 4) and 6 (questionnaire 5) months after the due date.
Permission was asked to obtain medical records.
Pregnancies enrolled between July 2011 and May 2016 were included in this study. We excluded pregnancies at ≥16 weeks of gestation at completion of the baseline questionnaire, miscarriages, terminations at <24 weeks of gestation and pregnancies with no follow-up data on outcomes (questionnaire 4 or medical record). If women participated in both studies, the duplicate pregnancy was removed from the PRIDE Study cohort (n = 3).
Medical records were obtained for women who gave permission (~75%) and who had an estimated due date before 1 March 2015.

| Predictor variables
The variables in the included prediction models for GDM were extracted from the web-based questionnaires: pregnancy questionnaire 1 (Expect Study I) and baseline questionnaire (PRIDE Study).
In both studies, blood pressure was measured according to routine antenatal care and self-reported in the questionnaire. In the Expect Study I, most predictor variables were defined according to the original articles. Although the primary goals of the PRIDE Study do not include prediction of pregnancy complications, most predictors were measured similarly. The original articles had different definitions for family history of DM. For comparability, and because no distinction was made between the types of DM in the PRIDE Study, we defined two proxy variables for family history of DM: a first-degree relative with any type of DM and a second-degree relative with any type of DM. The latter predictor was imputed for PRIDE Study participants, as only family history of first-degree relatives was assessed. We also redefined the predictor poor obstetric outcome (model of Teede et al), as history of antepartum hemorrhage, shoulder dystocia and neonatal death was not assessed. A detailed description of the predictor definitions is provided in Table S1. In both cohorts, the outcome was considered present if the postpartum questionnaire or the medical record recorded a diagnosis of GDM. For PRIDE Study participants, we also examined questionnaires 2 and 3 for a diagnosis of GDM. In the Expect Study I, we contacted the obstetric care providers in case of discrepancies between the two data sources (n = 29). The postpartum questionnaire was used as the reference standard to resolve discrepancies in the PRIDE Study (n = 2).

| Statistical analyses
There is no explicit rule for the required sample size for studies externally validating prediction models. Vergouwe et al recommend a minimum of 100 events and 100 non-events. 22 Missing data were imputed to prevent biased results. 23 Stochastic regression imputation with predictive mean matching as the imputation model was used to substitute missing predictor variables in the observed population.
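To illustrate the imputation step, the following is a minimal sketch of predictive mean matching for a single continuous predictor. It is not the authors' implementation: the function name `pmm_impute`, the number of donors `k` and the linear working model are assumptions made for this example, and the random draw of regression parameters used in full stochastic imputation is omitted for brevity.

```python
import numpy as np

def pmm_impute(x, covariates, k=5, rng=None):
    """Impute missing values of x (NaN) by predictive mean matching.

    Hypothetical helper: a linear model of x on the covariates gives a
    predicted mean for every subject; each missing value is replaced by the
    observed value of a donor drawn at random from the k observed subjects
    whose predicted means are closest.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    design = np.column_stack([np.ones(len(x)), covariates])
    observed = ~np.isnan(x)

    # Fit the working model on complete cases and predict for everyone.
    beta, *_ = np.linalg.lstsq(design[observed], x[observed], rcond=None)
    predicted = design @ beta

    imputed = x.copy()
    donor_pred, donor_val = predicted[observed], x[observed]
    for i in np.flatnonzero(~observed):
        nearest = np.argsort(np.abs(donor_pred - predicted[i]))[:k]
        imputed[i] = donor_val[rng.choice(nearest)]
    return imputed
```

Because every imputed value is borrowed from an observed subject, the imputed data stay within the range of plausible observed values, which is the main appeal of this approach for skewed or bounded predictors.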
We calculated the individual probabilities of developing GDM for all subjects using the original prediction model algorithms (Table S2). The predictive performance of each model was quantified by measures of discrimination and calibration. We determined discrimination by the area under the receiver operating characteristic curve (AUROC) with 95% confidence interval (CI). Discrimination is the ability of the model to correctly separate women who develop GDM from those who will not. Calibration, the agreement between the predicted probabilities of the model and the observed outcomes, was assessed graphically by calibration plots and by calculation of calibration-in-the-large and the calibration slope. Calibration-in-the-large indicates whether predictions are systematically too high or too low. 10 The slope measures the average strength of the predictor effects. 10 The calibration plot should ideally follow the 45° line with an intercept of 0 (calibration-in-the-large) and a slope of 1. 10 The women were ordered with respect to their predicted probability and subsequently divided into 10 groups of roughly equal size.
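The performance measures described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the software used in the study: the function names are hypothetical, the AUROC is computed as the proportion of case/non-case pairs in which the case has the higher predicted risk, calibration-in-the-large is taken as the intercept of a logistic model with the linear predictor as offset, and the calibration slope as the coefficient of the linear predictor in a logistic model.

```python
import numpy as np

def logit(p):
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1 - 1e-12)
    return np.log(p / (1 - p))

def auroc(y, p):
    """Share of (case, non-case) pairs ranked correctly; ties count as 0.5."""
    y, p = np.asarray(y), np.asarray(p, dtype=float)
    controls = np.sort(p[y == 0])
    cases = p[y == 1]
    lower = np.searchsorted(controls, cases, side="left")   # controls below the case
    upper = np.searchsorted(controls, cases, side="right")  # controls below or tied
    return (lower.sum() + 0.5 * (upper - lower).sum()) / (len(cases) * len(controls))

def fit_logistic(X, y, offset=None, n_iter=25):
    """Maximum likelihood logistic regression by Newton-Raphson."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    offset = np.zeros(len(y)) if offset is None else np.asarray(offset, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1 / (1 + np.exp(-(X @ beta + offset)))
        weights = mu * (1 - mu)
        hessian = X.T @ (X * weights[:, None])
        beta = beta + np.linalg.solve(hessian, X.T @ (y - mu))
    return beta

def calibration(y, p):
    """Return (calibration-in-the-large, calibration slope)."""
    lp = logit(p)
    ones = np.ones(len(lp))
    citl = fit_logistic(ones[:, None], y, offset=lp)[0]       # slope fixed at 1
    slope = fit_logistic(np.column_stack([ones, lp]), y)[1]
    return citl, slope

def calibration_groups(y, p, n_groups=10):
    """Mean predicted risk and observed proportion per risk-ordered group."""
    y, p = np.asarray(y, dtype=float), np.asarray(p, dtype=float)
    groups = np.array_split(np.argsort(p), n_groups)
    return [(p[g].mean(), y[g].mean()) for g in groups]
```

A negative calibration-in-the-large indicates that predicted risks are on average too high, and a slope below 1 indicates predictions that are too extreme; plotting the pairs returned by `calibration_groups` against the 45° line gives the calibration plot.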
We recalibrated the prediction models (adjustment of intercept and slope) using the linear predictor as the only covariate. 24 We performed a subgroup analysis among nulliparous women.
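Recalibration of this kind can be sketched as a logistic regression of the observed outcome on the original linear predictor, after which the fitted intercept and slope are used to update the predicted probabilities. The code below is a minimal sketch under that assumption; the function name `recalibrate` and the Newton-Raphson fitting routine are illustrative choices rather than the authors' software.

```python
import numpy as np

def recalibrate(y, p_original, n_iter=25):
    """Re-estimate intercept a and slope b in logit(p_new) = a + b * lp.

    lp is the linear predictor (logit of the original predicted risk).
    Returns (a, b, p_new).
    """
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p_original, dtype=float), 1e-12, 1 - 1e-12)
    lp = np.log(p / (1 - p))
    X = np.column_stack([np.ones(len(lp)), lp])

    # Newton-Raphson for the two-parameter logistic model.
    beta = np.zeros(2)
    for _ in range(n_iter):
        mu = 1 / (1 + np.exp(-(X @ beta)))
        weights = mu * (1 - mu)
        hessian = X.T @ (X * weights[:, None])
        beta = beta + np.linalg.solve(hessian, X.T @ (y - mu))

    p_new = 1 / (1 + np.exp(-(X @ beta)))
    return beta[0], beta[1], p_new
```

Because only two parameters are re-estimated, the ranking of women by predicted risk is unchanged whenever the fitted slope is positive, so recalibration improves calibration without altering the AUROC.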
For comparability of the models, we used the validation cohort with our inclusion and exclusion criteria. A sensitivity analysis was performed to assess the predictive performance of each model according to their additionally defined eligibility criteria. We also assessed the performance measures in the Expect Study I and the PRIDE Study separately.
The potential clinical utility was evaluated for the best discrim-

| Selection of prediction models
The search strategy identified 530 articles. We selected 18 articles that fulfilled the eligibility criteria. We excluded seven papers because the algorithm was not available (n = 3) or the model was already published in one of the included articles (n = 4) (File S1). Reference cross-checking yielded two additional studies, so 12 articles were included in this validation study. 17,18,26-35 The models were published between 1997 and 2017, and were developed in nine different countries. Eight studies used a prospective cohort design, two studies a retrospective cohort design, and two studies were developed in a case-control study population. Almost all studies (n = 11) used universal screening to detect GDM, but the type of screening strategy differed between the studies. Five studies used a glucose challenge test, four studies a random glucose test, and three studies an OGTT.
In most studies (n = 9), GDM was diagnosed using a 2-hour 75-g OGTT; however, the diagnostic criteria varied between studies. The number of predictors in the published prediction models varied between two and nine.

| Validation cohort
The validation cohort included 5260 pregnancies, 2603 pregnancies The overall prevalence of an LGA infant in the validation cohort was 9.6%. Population characteristics are presented in Table 1. The imputed validation cohort did not materially differ from the observed cohort (with missing data) (Table S4).
We also evaluated the relatedness between the original cohorts and the validation sample (Table S5).
Figure 2 shows the decision curve analysis of the four best performing models. These models had a positive net benefit compared with classifying all or no women as at high risk for GDM for risk thresholds ranging between 1% and 55%. Sensitivity, specificity, and positive and negative predictive values were estimated at different clinically useful risk thresholds for the model of Nanda et al (Table 3). At a low risk threshold (ie, 2%), we observed a high sensitivity and a high negative predictive value, suggesting a strong ability to rule out GDM in women who are classified as low risk. At this high sensitivity, however, many women will be unnecessarily classified as high risk (high false-positive rate). A risk threshold above 5% leads to a drastically lower sensitivity, so a large proportion of women who will develop GDM would be incorrectly classified as having a low risk.
GDM are well known and treatment is proven to be effective. 3,5 However, robust evidence is lacking on reduction of more serious maternal and perinatal complications as well as on the long-term benefit of treatment, such as reduced incidence of type 2 DM. 5
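The screening measures and the net benefit underlying a decision curve can be computed from the predicted risks at a chosen risk threshold. The sketch below uses the standard net benefit definition, true positives per woman minus false positives per woman weighted by the odds of the threshold; the function names and the convention that a predicted risk equal to or above the threshold counts as high risk are assumptions made for this example, and the numbers in the usage note are invented.

```python
import numpy as np

def screening_measures(y, p, threshold):
    """Sensitivity, specificity, PPV, NPV and net benefit at one risk threshold."""
    y = np.asarray(y)
    high_risk = np.asarray(p, dtype=float) >= threshold  # assumed classification rule
    tp = int(np.sum(high_risk & (y == 1)))
    fp = int(np.sum(high_risk & (y == 0)))
    fn = int(np.sum(~high_risk & (y == 1)))
    tn = int(np.sum(~high_risk & (y == 0)))
    n = len(y)
    odds = threshold / (1 - threshold)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp) if tp + fp else float("nan"),
        "npv": tn / (tn + fn) if tn + fn else float("nan"),
        "net_benefit": tp / n - fp / n * odds,
    }

def net_benefit_treat_all(y, threshold):
    """Net benefit of classifying every woman as high risk."""
    prevalence = float(np.mean(y))
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)

def decision_curve(y, p, thresholds):
    """Net benefit of the model and of the treat-all strategy per threshold."""
    return [
        (t, screening_measures(y, p, t)["net_benefit"], net_benefit_treat_all(y, t))
        for t in thresholds
    ]
```

For example, `decision_curve(y, p, np.arange(0.01, 0.56, 0.01))` would trace the model and treat-all curves over thresholds from 1% to 55%; the strategy of classifying no woman as high risk has a net benefit of 0 at every threshold.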

| DISCUSSION
Moreover, a prognostic prediction model provides opportunities
The main strengths of our study are the large sample size, sufficient number of cases and the multicenter prospective cohort design.
A cohort study represents the most powerful design for external validation, but selection may bias the generalizability of the results. 39
Another limitation to be mentioned is that the OGTT was only performed as a screening tool in women at high risk for GDM according to the Dutch national guideline. 7 Nevertheless, diagnosis of GDM was based on review of medical records and the postpartum questionnaire, which allowed us to detect all diagnosed cases of GDM, including late diagnosis of GDM. In our study, 65% of the women with a diagnosis of GDM fulfilled the Dutch criteria for screening, indicating that 35% of our cases were most likely detected outside of selective screening (ie, glucose measurement after sonographic diagnosis of fetal macrosomia or polyhydramnios). Still, cases of GDM may have been missed in asymptomatic women. False-negatives can lead to an underestimation of the c-statistic. 44 Nationwide data on the prevalence of GDM in the Netherlands are scarce, but the estimated prevalence varies between 2% and 5%. A study by Van Leeuwen et al, in which universal screening with the same diagnostic criteria was performed in a fairly comparable Dutch pregnant population, showed a similar prevalence of GDM. 32 We recognize that this prevalence is low compared with other countries. A meta-analysis reported an overall prevalence in Europe of 5.4% (3.8%-7.8%), with the lowest prevalence in Northern Europe. 45 Prevalence rates are affected by the different screening and diagnostic criteria used as well as by population characteristics. 46 Internationally there is no consensus regarding the optimal cut-off points for diagnosing GDM. Prevalence rates are substantially higher when using lower glucose levels as recommended by the International Association of Diabetes and Pregnancy Study Groups (IADPSG). 47 Tran et al calculated the discriminative performance of the model for different diagnostic criteria and showed no substantial difference between the IADPSG and WHO 1999 criteria. 28 In the end, a head-to-head comparison, as performed in this study, allows for a fair comparison of the performance of prediction models in a particular population with specific screening and diagnostic criteria and is necessary before a model can be implemented in clinical practice.

| CONCLUSION
The best performing prediction models showed acceptable performance measures and may enable more personalized medicine-based antenatal care for women at risk of developing GDM compared with currently applied strategies. A next step is to investigate the impact of implementing the best model with risk-dependent care in clinical practice.

CONFLICT OF INTEREST
None.