International standards for fetal brain structures based on serial ultrasound measurements from Fetal Growth Longitudinal Study of INTERGROWTH‐21st Project

To create prescriptive growth standards for five fetal brain structures, measured using ultrasound, in healthy, well‐nourished women at low risk of impaired fetal growth and poor perinatal outcome, taking part in the Fetal Growth Longitudinal Study (FGLS) of the INTERGROWTH‐21st Project.


INTRODUCTION
In most settings, the anatomy of the fetal brain is assessed routinely as part of the mid-trimester anomaly scan at around 20 weeks' gestation, the main aims being to demonstrate anatomical integrity and diagnose abnormalities of the central nervous system (CNS). Measurement of intracranial structures forms part of the assessment, and includes the width of the atrium of the lateral ventricle measured posteriorly (PV) and cisterna magna (CM) 1,2 . On more advanced neurosonography, undertaken for indications such as a previous or suspected abnormality, other structures, e.g. the Sylvian fissure (SF), are examined either earlier in cases of a previous abnormality or late in pregnancy to assess gyration and sulcation patterns, which change with advancing gestational age [3][4][5][6][7][8] . Fetal brain structures can be evaluated by assessing their appearance subjectively or measured quantitatively, which is recommended whenever possible, as subjective assessment is associated with higher variability 2 . Currently, the normality of any measurements obtained is evaluated in relation to one of several reported reference charts for fetal brain structures 2 . However, many studies reporting reference charts have important methodological limitations 9 . There can also be a lack of consistency in the interpretation of ultrasound images of the fetal CNS, leading to inconsistent clinical management, if the same measurement from a fetus is plotted on two different charts. These issues are generic to the measurement of all fetal anatomical structures, as reported in systematic reviews of studies aimed at creating charts for fetal biometry and pregnancy dating 10,11 .
To overcome these issues with regard to ultrasound assessment of the fetal brain, we have followed, as before, World Health Organization (WHO) recommendations and adopted a prescriptive approach to the construction of international size standards for five fetal brain structures, as a secondary analysis of data collected in the Fetal Growth Longitudinal Study (FGLS), one of the key components of the INTERGROWTH-21st Project (www.intergrowth21.org.uk) 12. Three of the brain structures relate to clinical evaluation of cerebrospinal fluid, namely the PV, CM and anterior horn of the lateral ventricle (AV) 1; the two other structures are clinically relevant to the assessment of gyration and sulcation, namely the parieto-occipital fissure (POF) and the SF. The international standards produced complement those published previously for early and late pregnancy dating 13,14, fetal growth and estimated fetal weight 15,16, symphysis-fundal height 17, gestational weight gain 18, neonatal size and body composition 19,20 and postnatal growth of preterm infants 21.

Study population
INTERGROWTH-21st is an international, multicenter, population-based project, conducted between 2009 and 2016 in eight delimited geographical areas: Pelotas (Brazil), Turin (Italy), Muscat (Oman), Oxford (UK), Seattle (USA), Shunyi County in Beijing (China), the central area of Nagpur (India) and the Parklands suburb of Nairobi (Kenya). In the FGLS, serial two-dimensional (2D) and three-dimensional (3D) fetal scans were performed every 5 ± 1 weeks from 14 + 0 weeks' gestation to delivery 15. Women participating in the study, who initiated antenatal care before 14 weeks' gestation, were selected based upon the WHO recommended criteria for optimal health, nutrition, education and socioeconomic status needed to construct international standards 22,23. Hence, they had a low-risk pregnancy that fulfilled well defined and strict inclusion criteria at both population and individual levels 23. Briefly, the individual inclusion criteria were maternal age between 18 and 35 years, body mass index ≥ 18.5 kg/m² and < 30 kg/m², naturally conceived singleton pregnancy, normal pregnancy history without relevant past medical history, no evidence of socioeconomic constraints likely to impede fetal growth, no use of tobacco or recreational drugs and no heavy alcohol consumption. Women also had to have a known date of their last menstrual period (LMP), with regular cycles without the use of hormonal contraceptives or breastfeeding in the 2 months before pregnancy. Gestational age was LMP-based provided that standardized ultrasound measurement of crown-rump length between 9 + 0 and 14 + 0 weeks was in agreement within 7 days 24.
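As an illustration only, the numeric parts of the individual inclusion criteria can be written as a simple check. This is a minimal sketch, not the study's enrollment software: the `Candidate` fields and the function name are hypothetical, the age bounds are assumed to be inclusive, and the qualitative criteria (medical history, socioeconomic constraints, tobacco, alcohol) are not modeled.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record holding only the numeric screening fields."""
    age_years: float
    bmi: float          # body mass index in kg/m^2
    ga_lmp_days: int    # gestational age from last menstrual period, in days
    ga_crl_days: int    # gestational age estimated from crown-rump length, in days


def meets_numeric_criteria(c: Candidate) -> bool:
    """Check age, BMI and LMP/CRL dating agreement as described in the text."""
    age_ok = 18 <= c.age_years <= 35                       # assumed inclusive bounds
    bmi_ok = 18.5 <= c.bmi < 30                            # >= 18.5 and < 30 kg/m^2
    dating_ok = abs(c.ga_lmp_days - c.ga_crl_days) <= 7    # agreement within 7 days
    return age_ok and bmi_ok and dating_ok


print(meets_numeric_criteria(Candidate(28, 22.0, 84, 80)))
```

How the crown-rump length is converted to a gestational age is left outside the sketch; the function simply takes both estimates in days.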
In the FGLS all ultrasound scans were performed by sonographers who were trained, standardized and audited regularly 25,26. The same type of commercially available ultrasound equipment (Philips HD-9; Philips Ultrasound, Bothell, WA, USA), with curvilinear abdominal 2D transducers (C5-2, C6-3) and a curvilinear abdominal 3D transducer (V7-3), was used for all growth scans. For the purposes of the INTERGROWTH-21st Project, the manufacturer reprogrammed the machines' software to ensure that the measurement values did not appear on the screen during the scan in order to reduce operator 'expected value' bias. A detailed description of the ultrasound methodology has been reported previously 25.
Infants from sites that participated in the follow-up study (Brazil, India, Italy, Kenya and the UK) were assessed at the ages of 1 and 2 years to obtain a detailed evaluation of growth, nutrition, morbidity and motor development. These data were collected by interviewing parents and by assessment by a certified examiner. Achievement of each milestone ('sitting without support', 'standing with assistance', 'hands-and-knees crawling', 'walking with assistance', 'standing alone' and 'walking alone') was considered normal if the age at achievement was within the expected WHO window (less than the 99th centile for each of the expected windows) 27.
The INTERGROWTH-21st Project was approved by the Oxfordshire Research Ethics Committee 'C' (ref: 08/H0606/139), the research ethics committees of the individual institutions and the regional health authorities in which the project was implemented; all the women involved gave written informed consent.

Structures measured and sample-size considerations
The fetal brain structures were measured on ultrasound images extracted from 3D volumes of the fetal head, acquired at all eight participating sites. The decision regarding which structures to evaluate was based on a combination of factors: an extensive scoping exercise and review of the literature demonstrating their clinical utility 9 ; structures that can be assessed in axial planes that are acquired routinely; and a pilot study involving 90 ultrasound volumes assessing feasibility and reproducibility.
The sample size was based on pragmatic and statistical considerations. The main pragmatic consideration was the considerable length of time required for volume upload, manipulation, plane extraction and measurement (20 min per volume on average). As a result, we decided to take a random sample from the entire FGLS cohort, bearing in mind the need for precision at the 5th and 95th centiles. A sample of 300 scans would obtain a precision of 0.1 SD at the 5th or the 95th centile 28. Using conservative estimates, we assumed a possible 5% exclusion rate due to loss to follow-up in pregnancy or at birth, withdrawal of consent, miscarriage, stillbirth, maternal death, fetal or neonatal structural abnormality or severely abnormal outcome at 2-year follow-up, which was defined as any of the following: meningitis, hearing loss, blindness or major visual problems, seizures, cerebral palsy, neurological disorders, malignancy, malaria, tuberculosis, hepatitis, HIV/AIDS, cystic fibrosis or hemolytic conditions. We also assumed that, in up to 30% of cases, all five structures might not be measurable (based on a conservative estimate, as the actual upper limit of the confidence interval from the pilot study was 20%, primarily due to movement artifact). Based on these assumptions, we estimated that 451 3D volumes would lead to a minimum of 300 measurements for each structure (451 × 0.95 × 0.70 ≈ 300). Therefore, we selected 451 3D volumes from the overall FGLS population using computer randomization, ensuring an equal distribution among the eight participating sites and of ultrasound volumes across pregnancy (range: 15-36 weeks' gestation). The random selection was performed using SAS software© (SAS Institute Inc., Cary, NC, USA).
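The stated precision can be checked approximately with a common large-sample formula for the standard error of a centile of normally distributed data, SE ≈ SD × sqrt((1 + z²/2)/n), where z is the standard normal deviate of the centile. The sketch below assumes this approximation; whether the cited reference uses exactly this expression is an assumption, and the function name is illustrative.

```python
import math
from statistics import NormalDist


def centile_se_in_sd_units(n: int, centile: float) -> float:
    """Approximate standard error of a centile, expressed in units of SD.

    Assumes normally distributed measurements and the large-sample
    approximation SE = SD * sqrt((1 + z^2 / 2) / n).
    """
    z = NormalDist().inv_cdf(centile / 100.0)
    return math.sqrt((1.0 + z * z / 2.0) / n)


se_5th = centile_se_in_sd_units(300, 5)
se_95th = centile_se_in_sd_units(300, 95)
print(f"SE at 5th centile:  {se_5th:.3f} SD")
print(f"SE at 95th centile: {se_95th:.3f} SD")
```

With n = 300 the approximation gives a little under 0.09 SD at either centile, which is consistent with the precision of about 0.1 SD quoted in the text.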
The study was cross-sectional, as only one volume per pregnancy was included.

Volume acquisition, offline analysis and quality control
Detailed descriptions of the volume acquisition methods are provided elsewhere 25,29. Briefly, head volumes were acquired at the level of the axial transthalamic plane. Five predefined quality-control criteria for the transthalamic plane had to be satisfied to acquire the volume (Table 1; Figure 1) 26. Acquisition was undertaken with the volume data box and angle of sweep (usually 70°) adjusted to include the entire head during fetal quiescence, and with the mother asked to hold her breath and the transducer held steady. The real-time image was observed during acquisition to confirm that the sweep included the entire head with no maternal or fetal movement during the sweep, otherwise the process was repeated. All data were then sent to the Ultrasound Quality Coordinating Unit in Oxford. Offline analysis was undertaken by four experienced sonographers at the coordinating unit. All were trained in neurosonography and their training specifically standardized for the purposes of this study in volume manipulation for plane reconstruction and measurement (Videoclip S1, Figure 1). The volume manipulations and measurements were performed using the software of the ultrasound machines' manufacturer or an open-source image analysis software program (Medical Imaging Interaction Toolkit MITK, version 0.12.2; German Cancer Research Center, Division of Medical and Biological Informatics, www.mitk.org) 30. This was done because the open-source software was more 'user friendly'. Comparability of measurements between the manufacturer's software and the open-source image-analysis software program was confirmed (mean reproducibility was within 0.7 mm).
All sonographers were blinded to the measurements during the study. In addition, strict quality control was undertaken in the whole sample: image quality criteria were used to ensure the maximum possible score for each extracted plane (Table 1) before measurement of the following five structures: the POF and SF in the transthalamic plane; the AV and PV in the transventricular plane; and the CM in the transcerebellar plane. The POF, SF, AV and PV were measured in the distal hemisphere of the respective plane (because of poorer visualization in the proximal hemisphere). Further details of volume manipulation and caliper placement are given in Appendix S1.

Reproducibility
Reproducibility was assessed in a subset of 90 volumes. The first sonographer uploaded the volume, manually extracted the three planes and measured the five structures twice (intraobserver reproducibility for plane reconstruction and measurement acquisition). A second sonographer re-uploaded the same volume and repeated this process (this second set of data was used to assess interobserver reproducibility for plane reconstruction and measurement acquisition). To assess the contribution of caliper replacement, the second sonographer replaced the calipers on still images and repositioned them to measure all structures in each plane stored by the first sonographer (interobserver reproducibility for caliper replacement on stored images). As in the main study, all sonographers were blinded to their own and the other sonographer's measurements during the reproducibility study.

Statistical analysis
We followed the modeling approach used previously by our group to construct fetal growth charts 15 . In summary, fractional polynomials that model the means and SD were used to model biometric measurements of brain structures as a function of gestational age. Our overall aim was to produce centiles that change smoothly with age and maximize simplicity without compromising model fit. Goodness of fit was assessed by Q-Q plots and a scatterplot of Z-scores by gestational age. Mean differences between the observed and fitted centiles were also calculated.
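To make the modeling approach concrete, the following minimal sketch fits a second-degree fractional polynomial to the mean by least squares over the conventional power set, then models the SD from scaled absolute residuals. It is an illustration under stated assumptions, not the authors' analysis code: the data are synthetic, the function names are hypothetical, the SD is modeled as linear in gestational age for brevity, and the scaled-absolute-residuals method is one common way of modeling SD that the text does not confirm was used here.

```python
import itertools
import numpy as np

# Conventional fractional-polynomial power set; 0 denotes the natural logarithm.
POWERS = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)


def fp_term(x, p):
    """Return x**p, with p == 0 meaning ln(x)."""
    return np.log(x) if p == 0 else x ** p


def fp2_design(x, p1, p2):
    """Design matrix (intercept plus two terms) for a second-degree FP."""
    t1 = fp_term(x, p1)
    # By convention, a repeated power uses x**p and x**p * ln(x).
    t2 = t1 * np.log(x) if p1 == p2 else fp_term(x, p2)
    return np.column_stack([np.ones_like(x), t1, t2])


def fit_best_fp2(x, y):
    """Return (rss, (p1, p2), coefficients) with the lowest residual sum of squares."""
    x = np.asarray(x, dtype=float)
    best = None
    for p1, p2 in itertools.combinations_with_replacement(POWERS, 2):
        X = fp2_design(x, p1, p2)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = float(np.sum((y - X @ beta) ** 2))
        if best is None or rss < best[0]:
            best = (rss, (p1, p2), beta)
    return best


def fit_sd_linear(x, residuals):
    """Model SD as linear in x using absolute residuals scaled by sqrt(pi/2)."""
    scaled = np.abs(residuals) * np.sqrt(np.pi / 2)
    slope, intercept = np.polyfit(x, scaled, 1)
    return intercept, slope


# Synthetic data only: a made-up mean curve with constant SD of 0.8 mm.
rng = np.random.default_rng(0)
ga = rng.uniform(15, 36, 400)                       # gestational age in weeks
y = 2.0 + 0.5 * ga - 0.004 * ga ** 2 + rng.normal(0, 0.8, ga.size)

rss, powers, beta = fit_best_fp2(ga, y)
resid = y - fp2_design(ga, *powers) @ beta
sd_intercept, sd_slope = fit_sd_linear(ga, resid)
print("selected powers:", powers)
print("modeled SD at 25 weeks:", sd_intercept + sd_slope * 25)
```

In practice model selection would also weigh simplicity against fit, as the text describes, rather than choosing on residual sum of squares alone.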
For the reproducibility study, Bland-Altman plots were used to quantify the level of agreement and variability in the measurements. Differences between and within observers were expressed in absolute values (mm). Analysis was performed using Stata 11 (StataCorp., College Station, TX, USA).
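The Bland-Altman quantities referred to above are the mean difference (bias) and the 95% limits of agreement, calculated as bias ± 1.96 × SD of the paired differences. A minimal sketch follows; the paired measurements are hypothetical values chosen for illustration, not study data.

```python
from statistics import mean, stdev


def bland_altman(first, second):
    """Return (bias, lower limit, upper limit) for paired measurements in mm."""
    diffs = [a - b for a, b in zip(first, second)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample SD of the paired differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical paired measurements (mm) from two observers.
observer_1 = [5.1, 6.3, 7.0, 8.2, 9.4]
observer_2 = [5.0, 6.6, 6.8, 8.5, 9.1]

bias, lower, upper = bland_altman(observer_1, observer_2)
print(f"bias = {bias:.2f} mm, 95% limits of agreement = {lower:.2f} to {upper:.2f} mm")
```

The same function applies to intraobserver comparisons by passing the first and second measurements made by one sonographer.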

RESULTS
After exclusions, 442/451 (98.0%) volumes were used to reconstruct planes and create the fetal brain charts (Figure 2). No congenital malformations were detected antenatally or postnatally in the selected fetuses, and no infants met the exclusion criteria set for the 2-year follow-up. As expected, given the random selection, maternal demographics and pregnancy outcomes were similar to those in the overall FGLS population, confirming a low risk of perinatal complications (Table S1).
Of the 442 infants, 297 (67.2%) were assessed by their parent(s) at 1 year of age; of these, 289 (97.3%) were also assessed by a certified examiner at a mean age of 12.3 months (range, 10.9-19.4 months). As reported by the parent(s), 99% of the infants had entirely normal motor development. Only three (1%) infants did not achieve the milestones 'sitting without support' and 'standing with assistance'; brain structure measurements for these children were between the 5th and 95th centiles. There was overall good agreement between the achievement of milestones as reported by the parent(s) and that found on assessment by a certified examiner (average agreement, 96% (range, 92-100%)). Reassuringly, in almost all cases in which disagreement between the two assessments was present, the examiner reported more precocious milestone achievement than did the parent(s), confirming the low risk for abnormal long-term outcome in our cohort. Follow-up at 2 years of age was available in 304 children; the findings of this detailed assessment demonstrate comparability with the morbidity reported in children from the overall FGLS cohort who underwent motor and neurodevelopment assessment (Table S2; Figure 3) 31. The mean and SD of the children's weight, length and head circumference at 2 years of age were 12.3 ± 1.7 kg, 87.4 ± 3.7 cm and 47.7 ± 1.6 cm, respectively, and Z-scores were within the expected values of the WHO Child Growth Standards. Motor development for the two milestones not reached by the age of 1 year ('standing alone' and 'walking alone') was confirmed as normal at 2 years in 99% and 98%, respectively.
In total, 2439 measurements of fetal brain structures were acquired. On average, structures were measurable in a high-quality extracted plane in 90% of cases, the CM being the structure measurable the least frequently. After removal of outliers, measurements were available to create centiles for the POF, SF, AV, PV and CM in

[Figure: panels (a)-(e) showing parieto-occipital fissure, Sylvian fissure, AV, PV and cisterna magna measurements (mm) plotted against gestational age (weeks).]

The best fitting powers were provided by second-degree fractional polynomials and further modeled in a multilevel framework to account for the cross-sectional design of the study. The gestational-age-specific smoothed centiles for the POF, SF, AV, PV and CM are presented in Figure 4 and Tables 2-6. One fetus had a PV > 10 mm and four had a CM > 10 mm; all had normal perinatal outcome.
Both visual assessment of scatterplots of Z-scores by gestational age and goodness-of-fit tests, assessed by gestational-age-specific comparisons of empirical centiles with smoothed centile curves, showed good agreement.
The equations for the mean and SD from the fractional polynomial regression models for each structure measured are presented in Table 7, allowing for the calculation of any desired centile according to gestational age in exact weeks.
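Given a mean and SD evaluated at a particular gestational age, any centile or Z-score follows from the standard normal distribution. The sketch below assumes the measurement is modeled as normally distributed on the scale of the equations; if an equation is expressed on a transformed scale, the result would need to be back-transformed. The mean and SD values used are hypothetical placeholders, not values from Table 7, and the function names are illustrative.

```python
from statistics import NormalDist


def centile_value(mean_mm: float, sd_mm: float, centile: float) -> float:
    """Measurement (mm) at the requested centile: mean + z * SD."""
    z = NormalDist().inv_cdf(centile / 100.0)
    return mean_mm + z * sd_mm


def z_score(measured_mm: float, mean_mm: float, sd_mm: float) -> float:
    """Number of SDs by which a measurement differs from the expected mean."""
    return (measured_mm - mean_mm) / sd_mm


def centile_of(measured_mm: float, mean_mm: float, sd_mm: float) -> float:
    """Centile (0-100) corresponding to a measurement."""
    return 100.0 * NormalDist().cdf(z_score(measured_mm, mean_mm, sd_mm))


# Hypothetical mean and SD at one gestational age (placeholders only).
mean_mm, sd_mm = 7.0, 1.0
p5 = centile_value(mean_mm, sd_mm, 5)
p95 = centile_value(mean_mm, sd_mm, 95)
print(f"5th centile: {p5:.2f} mm, 95th centile: {p95:.2f} mm")
print(f"centile of an 8.0 mm measurement: {centile_of(8.0, mean_mm, sd_mm):.1f}")
```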
Results of the reproducibility study are shown in Table 8. All measurements were reproducible within less than 3 mm or 12% (all mean differences were less than 0.1 mm or 0.5%). The greatest proportion of variability was due to caliper replacement, accounting for more than 50% of the intra- and interobserver variability for measurements of all structures, as observed previously for fetal biometry measurements (Figure S1) 32.

DISCUSSION
We have produced international size standards for ultrasound measurements of clinically relevant fetal brain structures. The study population consisted of women at low risk of adverse pregnancy and perinatal outcomes 15 . Unlike previous studies reporting fetal brain standards, we followed up the infants and demonstrated satisfactory growth and development at 1 and 2 years of age, confirming that our initial selection criteria met the WHO requirements for constructing international growth standards 12,31 . The sequence and timing of the attainment of neurodevelopmental milestones and associated behaviors in early childhood were very similar to those reported previously by our group, i.e. we have demonstrated that there are similarities across diverse geographical regions, as long as nutritional and health needs are met 31 .
We performed a systematic review of the literature that assessed the methodology used to create fetal brain structure charts 9. This showed that some studies did not strictly adhere to plane standardization. Using different planes for fetal head biometry can lead to significant measurement differences 33. In some studies, landmarks for plane acquisition were not specified [34][35][36][37][38][39][40][41][42][43][44], while in others, various oblique planes with numerous landmarks were proposed 45,46. One of the strengths of our study is the use of standardized axial planes recommended in routine clinical practice for biometry assessment (Table 1). We believe that this approach of using standardized planes improves reproducibility, a view that is supported by the findings of previous studies 46,47. In our case, this led to a high proportion of structures that could be measured on stored volumes (90% on average) and resulted in reproducible measurements, with 95% limits of agreement within < 3 mm (or < 12%) (Table 8). Studies involving experts in neurosonography report similar results in visualizing structures from volume analysis 48. This is in contrast to previous studies on subjective assessment of brain fissures, which report variable results in terms of reproducibility (kappa coefficients varying from 0.56 to 0.95) 45,49. Improving reproducibility was also one of the aims of our study, in order to move to quantitative assessment of fetal brain development 45,46,50. To achieve our objectives, we used international guidelines to obtain measurements of the PV and CM 1,2, and we provide detailed methods for AV, POF and SF measurements, based on existing publications (Appendix S1), as we were unable to find generally accepted guidelines.
[Table 7: Equations for estimation of mean and SD (mm) of each fetal brain structure measurement, according to exact gestational age (GA) (weeks). Columns: Structure; Equation. Ln, natural logarithm.]

Our study overcomes many of the methodological limitations of previous studies 9. These include a high risk of bias in the selection of the population, ultrasound protocol and data analysis. For example, fewer than 10% of previous studies reported on maternal and fetal inclusion/exclusion criteria, pregnancy outcome or ultrasound quality control. Goodness of fit of the model to create the charts was reported in only 35% of the studies.
Most importantly, no studies reported long-term infant outcomes, most probably owing to their retrospective descriptive design (30%); thus, data were often not collected specifically for the purpose of the study. Not surprisingly, these are some of the same challenges seen in previous studies to construct fetal biometry charts 10,11 . Nevertheless, some previous studies did have a relatively low risk of methodological bias, and the ranges of our observed measurements did not differ substantially from their findings 34,51-54 .

Strengths and limitations
A large number of sonographers were involved in this study; however, this more accurately reflects clinical practice 55 . In addition, the quality of the images obtained in the study was of a high standard and in accordance with a predefined protocol 25 . We set near-optimal conditions for scanning to minimize the potential contribution of confounding factors, which could also be seen as a strength.
It is possible that measurements acquired on planes extracted from 3D volumes are not equivalent to measurements made from 2D image acquisition. Although volumetry is associated with a high degree of variability if not standardized 50 , once rigorous methodology is adopted, 2D measurements from reconstructed planes can be as reproducible as measurements obtained in real time 29,37 .
A key strength of our study is that we adopted a prescriptive design, as recommended by the WHO. We identified urban regions in which women were at low risk of perinatal complications; participants were then enrolled within these regions based on their individual characteristics. All ultrasound measurements were taken specifically for the purpose of constructing international standards with standardization of all study sites, using centrally trained staff and specially adapted ultrasound equipment to allow masking of measurements. For the offline analysis, we developed a novel quality-control strategy. The most appropriate statistical methods were used to analyze the dataset.
It could be argued that only longitudinal data should be used to assess fetal growth. However, given the design of FGLS, in which women mostly had an equal number of visits during pregnancy and these visits were according to what was prespecified in the protocol, cross-sectional data were acquired in order to ensure a representative number of brain-structure measurements per gestational week. The fitted model took this into account.
The INTERGROWTH-21st Project and WHO Multicentre Growth Reference Study have demonstrated previously the generalizability across geographically diverse international populations of anthropometric standards produced using the prescriptive approach 12,31,56. Follow-up of infants in the FGLS cohort has also been reported, and demonstrates strong similarities across sites when assessed by variance components analysis and standardized site differences, showing that the sequence and timing of attainment of neurodevelopmental milestones and associated behaviors in early childhood are probably innate and universal 31.

Conclusions
We report international standards for the size of five fetal brain structures throughout gestation. These standards use reproducible and highly controlled ultrasound measurements, and were created using a prospective cohort of fetuses that was followed up into childhood. Clinical use of such objective measurements may help to improve the screening and diagnostic performance of prenatal ultrasound. It should also allow a unified approach to fetal assessment by integrating with other standards from the same population and result in a common language when describing aberrations from expected norms 57,58 . The proposed standards should not replace currently accepted cut-off values for triggering referral or further investigation; for example, we do not propose that we should redefine the diagnosis of antenatally diagnosed ventriculomegaly. This is because previous studies on the association between infant outcome and antenatally detected congenital brain abnormalities cannot simply be replicated [57][58][59] .

ACKNOWLEDGMENTS
This project was supported by a generous grant from the Bill & Melinda Gates Foundation to the University of Oxford (Oxford, UK), for which we are very grateful. Aris T. Papageorghiou is supported by the National Institute for Health Research (NIHR) Oxford Biomedical Research Centre (BRC).
We thank the Health Authorities in Pelotas, Brazil; Beijing, China; Nagpur, India; Turin, Italy; Nairobi, Kenya; Muscat, Oman; Oxford, UK; and Seattle, USA, who facilitated the project by allowing participation of these study sites as collaborating centers. We are grateful to Philips Medical Systems, who provided the ultrasound equipment and technical assistance throughout the project. We thank MedSciNet U.K. Ltd for setting up the INTERGROWTH-21st website and for the development, maintenance and support of the online data management system.
Finally, we thank the parents and infants who participated in the studies and the more than 200 members of the research teams who made implementation of this project possible.

SUPPORTING INFORMATION ON THE INTERNET
The following supporting information may be found in the online version of this article:
Videoclip S1 Demonstration of methodology for volume manipulation and caliper placement for measurement acquisition of fetal brain structures using MITK software.
Figure S1 Bland-Altman plots showing intra- (a) and inter- (b) observer reproducibility for volume manipulation and caliper placement for measurement acquisition of fetal brain structures, and interobserver reproducibility for caliper replacement on stored images (c).
Appendix S1 Detailed methodology for 3D volume manipulation and caliper placement