Incidences of obstetric outcomes and sample size calculations: A Danish national registry study based on all deliveries from 2008 to 2015

Abstract
Introduction: In high‐income countries the majority of pregnancies have a good outcome, and many adverse obstetric outcomes rarely occur. This makes demonstrating clinically relevant and statistically significant effects of new interventions a challenge. The objective of the study was to report incidences of important obstetric outcomes and to calculate sample sizes for tentative studies.
Material and methods: The study was a registry‐based study. Data were retrieved from the Danish Medical Birth Registry and included all deliveries in Denmark from 2008 to 2015. The total population included 465 919 deliveries. The study population comprised intended vaginal deliveries with a single fetus in cephalic presentation at term (n = 381 567). Incidences were reported for 20 outcomes considering the relevance for the patients and the severity of the outcomes. We calculated the sample sizes required in tentative obstetric studies to detect risk reductions of 25 and 50%, for tests at the 5% level, using a power of 80 and 90%. For the randomized controlled trials we calculated the sample size required for comparing two proportions with equal‐sized groups. For the cohort study we calculated the sample size also required for two proportions but with unequal sized groups. Outcome measures for sample size calculation were neonatal mortality, Apgar score <7 at 5 minutes and emergency cesarean section.
Results: The incidence of neonatal mortality, Apgar score <7 at 5 minutes and emergency cesarean section was 0.05, 0.58 and 10.5%, respectively. Using neonatal mortality as the outcome in a tentative randomized controlled trial with an expected risk reduction of 50% and power of 80%, our calculation showed a sample size of 195 036 deliveries. Using Apgar score <7 at 5 minutes or emergency cesarean section as the outcome, 16 254 and 818 deliveries, respectively, were required. In tentative cohort studies, the required sample sizes were larger due to the unequal proportion of exposed/non‐exposed women.
Conclusions: Most adverse obstetric outcomes occur rarely; thus, very large sample sizes are required to achieve adequate statistical power in randomized controlled trials. Multicenter studies, international collaborations or alternative study designs to randomized controlled trials could be considered.


| INTRODUCTION
In high-income countries the majority of pregnancies have a good outcome, and many adverse obstetric outcomes rarely occur. This makes demonstrating clinically relevant and statistically significant effects of new interventions a challenge.
Randomized controlled trials (RCTs) are considered the gold standard for establishing causal inference in healthcare interventions, and are therefore frequently applied as study designs. 1,2 If RCTs have adequate statistical power, the expectation is that significant differences between groups will be a result of the intervention. 1 To ensure the quality of scientific work, calculating and reporting a study's sample size is fundamental. 2 However, sample size calculations are sparsely reported in scientific papers, and many trials do not achieve the target sample size stated before starting the trial. 3 When a study is underpowered, there is a risk of not finding the true difference between the groups, affecting the quality of the study. 4 Two obstetric papers from 1997-2000 on the introduction of continuous electronic fetal monitoring in obstetric care and on the potential bias when comparing small and large maternity institutions when studying stillbirth rates, respectively, discuss the implications of study design, rare outcome measures and large sample sizes. 5,6 The two studies concluded that, for rare outcomes, very large sample sizes were needed to detect statistically significant differences between study groups. Even though these two studies emphasize the implications of sample size calculation in obstetric outcomes, many researchers still include rare outcomes in their study design without having sufficient power to do so.
Our study had three objectives: first, to report incidences of 16 obstetric outcomes; secondly, to calculate sample sizes for tentative studies using three selected outcomes: neonatal mortality, Apgar score <7 at 5 minutes and emergency cesarean section (ECS); and thirdly, to discuss the implications for study design in obstetrics when choosing outcome measures.

| Population and study design
The study was a registry-based study. Data were retrieved from the Danish Medical Birth Registry and included all deliveries in Denmark from 2008 to 2015. The Danish Medical Birth Registry contains information on all deliveries in Denmark, thus providing data on the mother, the child, the pregnancy and the delivery.
The annual number of deliveries in Denmark is approximately 60 000, with 96-98% of deliveries in public hospitals and 2-3% at home, mostly attended by midwives from public maternity departments. 7 Currently there are 23 public maternity departments in Denmark, all with access to specialists in obstetrics and anesthesiology. Midwives attend all deliveries and an obstetrician is only involved in the event of complications.
In this study we operated with two populations: the total population and the study population. We used the former, which included all deliveries in Denmark in the study period, to report incidences of important obstetric outcomes.
For calculating sample sizes for tentative studies, we used a study population that included all intended vaginal deliveries with a term (gestational age ≥37 weeks) singleton in cephalic presentation.
Stillbirths and homebirths were excluded. Incidences of the obstetric outcomes were also reported for the study population.

| Outcome measures
It is the scientific question raised that defines whether an obstetric event is an intervention, an outcome or even a population, not the event itself. In our study we have defined 16 obstetric outcomes that are relevant and commonly used in the literature, and we are aware that in other studies these outcomes could be defined as interventions or constitute a study population.
We chose 16 obstetric outcomes for reporting incidences in the total population: preeclampsia; hemolysis, elevated liver enzymes and low platelets (HELLP) syndrome; eclampsia; induction of labor; oxytocin augmentation; umbilical cord prolapse; shoulder dystocia; vacuum extraction; ECS; postpartum hemorrhage ≥1000 mL; manual exploration of the uterus; stillbirth; Apgar score <7 at 5 minutes; preterm delivery <37 weeks of gestation; low birthweight <2500 g; and neonatal mortality. We chose these outcomes considering the relevance for the patients and the severity of the outcomes. For the study population we report incidences of 14 obstetric outcomes.

KEYWORDS
emergency cesarean section, incidence, methods, obstetric outcome, obstetrics, pregnancy outcome, research design, sample size

Key message
The majority of obstetric outcomes occur with a very low incidence. Our sample size calculations showed that when using rare obstetric outcomes large sample sizes were required. Multicenter studies, international collaborations or alternative study designs to randomized controlled trials could be considered.
For sample size calculation in tentative studies, we chose three outcomes from the core outcome set for key stakeholders in maternity care: neonatal mortality, Apgar score <7 at 5 minutes and ECS. 8 The chosen outcomes reflect different incidences: one extremely rare, one rare and one more common.
Neonatal mortality is defined as death before the age of 28 completed days after live birth. 9 Apgar score is used to assess the condition of the newborn at 1 and 5 minutes after birth and it is a validated predictor of neonatal survival. 10 The Apgar score at 5 minutes is the best predictor of neonatal survival. 10 Cesarean section is linked to a wide range of complications, such as uterine rupture and abnormally invasive placenta, which lead to a higher risk of maternal and neonatal morbidity and mortality. 11

| Statistical analyses
The data have been used as part of another study. 12 Before analysis, the dataset was checked for logical errors. We recoded missing data for maternal weight and height with unrealistic values and checked whether there was consistency between the diagnosis of the delivery and the surgical intervention or procedure coded.
The selected outcomes are reported as incidences, both for the total population and for the study population. The incidences of neonatal mortality, Apgar score <7 at 5 minutes and ECS in the study population formed the basis for the tentative sample size calculations. We calculated the sample sizes for the comparison of two proportions. This calculation requires the expected proportion (incidence) of the outcome, the expected intervention effect and the maximum acceptable risk of statistical errors (type I and type II), all of which the researcher must specify. The statistical tool is provided in Supporting Information Appendix S1.
We calculated the sample size necessary for tentative RCTs and cohort studies to be able to detect risk reductions of 25 and 50% at the 5% level with a power of 80% or 90%. For the RCTs we calculated the sample size required for comparing two proportions with equal-sized groups (i.e. 1:1 ratio), whereas for the cohort study we calculated the sample size required for comparing two proportions with unequal-sized groups. We calculated sample sizes for proportions of exposed women of 5, 10 and 25%. 13 Incidences were computed using IBM SPSS® version 24 (IBM Corp., Armonk, NY, USA) and sample size calculations were made using SAS® software package version 9.4 (SAS Institute, Cary, NC, USA). As missing data were rare, imputation was not applied.
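The calculation described above can be illustrated with a minimal Python sketch. It assumes a two-sided z-test with the standard unpooled normal-approximation formula, which is not necessarily the method implemented in the SAS procedure used for this study; the function name `total_sample_size` and its parameters are hypothetical. Because the incidences quoted in the text are rounded, the results should be close to, but not identical with, the reported sample sizes.

```python
import math
from statistics import NormalDist


def total_sample_size(p_control, risk_reduction, alpha=0.05, power=0.80,
                      exposed_fraction=0.5):
    """Approximate total N for comparing two proportions (two-sided z-test).

    Assumption: unpooled normal approximation; this may differ from the
    method behind the SAS results reported in the paper.
    exposed_fraction=0.5 corresponds to a 1:1 RCT; smaller values mimic a
    cohort study with few exposed women.
    """
    # Expected risk in the intervention/exposed group after the risk reduction
    p_exposed = p_control * (1.0 - risk_reduction)
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    f = exposed_fraction
    # Variance of the difference in proportions, multiplied by the total N
    variance_term = (p_exposed * (1.0 - p_exposed) / f
                     + p_control * (1.0 - p_control) / (1.0 - f))
    n_total = ((z_alpha + z_beta) ** 2 * variance_term
               / (p_control - p_exposed) ** 2)
    return math.ceil(n_total)
```

For example, `total_sample_size(0.105, 0.5)` corresponds to ECS in a 1:1 RCT with 50% risk reduction and 80% power, and `total_sample_size(0.0058, 0.5, exposed_fraction=0.10)` to Apgar score <7 at 5 minutes in a cohort with 10% exposed women.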

| Ethical approval
Approval was obtained from the Danish Data Protection Agency (file no.: 2012-58-0004). As this was a registry-based study, ethical approval was not required according to the Danish Research Ethics Committee Law. 14

TABLE 1 Sociodemographic characteristics of the total population and the study population in Denmark from 2008 to 2015

| RESULTS
The study population included all term singletons (≥37 weeks of gestational age) with intended vaginal cephalic delivery. There were missing data for Apgar score <7 at 5 minutes in 1260 deliveries (0.3%). There were no missing data for the variables neonatal mortality or ECS. Table 1 shows the sociodemographic characteristics of the total population and the study population. In general, Danish women were most likely to deliver at term, to be 25-34 years of age, to be non-smokers, and to have a normal body mass index (i.e. 18.5-24.9 kg/m²). Table S1 reports incidences of the obstetric outcomes stratified by year. Figure 1 and Table 3 report the sample sizes calculated for tentative RCTs and cohort studies. As shown, the incidence of the outcome measure affected the sample size. When using neonatal mortality with an incidence of 0.05% as the outcome in a tentative RCT with an expected risk reduction of 50% and power of 80%, our sample size calculation showed a required sample size of 195 036 deliveries. Using Apgar score <7 at 5 minutes, with an incidence of 0.58%, as the outcome in a tentative RCT, with the same risk reduction and same power, 16 254 deliveries were required. For ECS with an incidence of 10.5%, 818 deliveries were required for a tentative RCT. Figure 1 and Table 3 also report the sample sizes required for studies with a power of 90% and for studies with a risk reduction of 25%.
Our results illustrate that a lower expected risk reduction increased the sample sizes. Using neonatal mortality as the outcome in a tentative RCT with 80% power and changing the risk reduction from 50% to 25% resulted in a more than fourfold increase in the required sample size, from 195 036 to 916 518 deliveries. Using Apgar score <7 at 5 minutes or ECS as outcomes, a similar increase in the required sample size was seen when the expected risk reduction changed from 50% to 25% (Figure 1, Table 3).
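The size of this increase can be motivated with a short derivation. The notation is an assumption introduced here for illustration (it does not appear in the paper): p_1 is the baseline incidence, r the relative risk reduction, p_2 = p_1(1 - r) the incidence under the intervention, and n the sample size per group under the usual normal approximation for two proportions.

```latex
% Assumed notation: p_1 = baseline incidence, r = relative risk reduction,
% p_2 = p_1 (1 - r), n = sample size per group (normal approximation).
\[
n \;\propto\; \frac{p_1(1-p_1) + p_2(1-p_2)}{(p_1 - p_2)^2}
\;\approx\; \frac{p_1 + p_1(1-r)}{p_1^{2}\, r^{2}}
\;=\; \frac{2 - r}{p_1\, r^{2}}
\qquad (p_1 \ll 1)
\]
\[
\frac{n(r = 0.25)}{n(r = 0.50)} \;\approx\; \frac{1.75 / 0.0625}{1.5 / 0.25}
\;=\; \frac{28}{6} \;\approx\; 4.7
\]
```

Under this approximation, halving the risk reduction multiplies the required sample size by about 4.7 for rare outcomes, in line with the change from 195 036 to 916 518 deliveries, and the required sample size is roughly inversely proportional to the baseline incidence.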
Our results furthermore illustrate that changing the power from 90% to 80% had a small impact on the required sample sizes. Using Apgar score <7 at 5 minutes as the outcome in a tentative RCT with 50% risk reduction and 80% power instead of 90%, the required sample size decreased from 21 758 to 16 254 deliveries. Using ECS as the outcome in an RCT with 50% risk reduction and 80% power instead of 90%, the required sample size decreased from 1092 to 818 deliveries (Figure 1, Table 3).
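The modest effect of power can be checked with a small sketch. Under the normal approximation, the required sample size is proportional to the squared sum of the two standard normal quantiles, so the ratio between 90% and 80% power does not depend on the outcome incidence. The function name `power_inflation_factor` is hypothetical, and the approximation is an assumption rather than the paper's stated method.

```python
from statistics import NormalDist


def power_inflation_factor(power_low=0.80, power_high=0.90, alpha=0.05):
    """Ratio of required sample sizes at two power levels (two-sided test).

    Assumption: normal approximation, where n is proportional to
    (z_{1-alpha/2} + z_{power})^2 and everything else cancels in the ratio.
    """
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_low = NormalDist().inv_cdf(power_low)
    z_high = NormalDist().inv_cdf(power_high)
    return ((z_alpha + z_high) / (z_alpha + z_low)) ** 2
```

With the default arguments the factor is roughly one third above 1, which is comparable to the ratios 21 758 / 16 254 and 1092 / 818 reported above.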
The study design furthermore affected the required sample sizes. When changing the proportion of exposed women from 50% to a smaller proportion, larger sample sizes were required. If Apgar score <7 at 5 minutes was used as the outcome in a tentative cohort study with 25% exposed instead of an RCT (50% exposed) with a power of 80% and risk reduction of 50%, the required sample size increased from 16 254 deliveries to 20 112 deliveries (Table 3). The same tendency was seen when the proportion of exposed women was even smaller. If the proportion of exposed was 10%, the required sample size was 39 680, and if the proportion of exposed was 5%, the sample size was further increased to 73 680 deliveries (Table 3).

Table notes: The total population included all deliveries in Denmark from 2008 to 2015 with gestational age 20+0 to 45+0 weeks. In the event of multiple fetuses in one pregnancy, an outcome among one or more of the children counts. The study population included all term (≥37 weeks' gestational age), singleton, intended vaginal cephalic deliveries in Denmark from 2008 to 2015.

Total sample sizes required for tentative classical randomized controlled trials with an allocation of 1:1 ratio (i.e. 50% exposed) and cohort studies with a proportion of exposed of 5, 10 and 25%.

| DISCUSSION
We found that the majority of obstetric outcomes occurred at a very low incidence. Our sample size calculations showed that the choice of study design, the outcome incidence and the change from 50% to 25% in the risk reduction all contributed to the required sample size of the tentative studies. Changing the power from 90% to 80% did not have a large impact on the required sample size.
A strength of our study is that the reported incidences are based on a large data source of 465 919 deliveries and the sample size calculations on a data source of 381 567 deliveries. The deliveries represent the period from 2008 to 2015, making the data fairly current.
Registry-based research always involves the uncertainty associated with inaccurate reporting. Several studies show, however, that data from the Danish Medical Birth Registry are valid in terms of diagnosis on most well-defined outcomes, such as preeclampsia, birthweight, oxytocin augmentation of labor, vacuum extraction and cesarean section. 15,16
The study demonstrated that low incidence of the outcome affected the sample size. In a Danish setting, an RCT with 90% power to show a significant 25% reduction in Apgar score <7 at 5 minutes would take more than 2 years and require the inclusion of all deliveries. However, it might not be possible to include all eligible patients, and some may not want to participate in the study. Thus, the time it takes to recruit patients will be longer than anticipated. This requires the researcher in the planning phase of an RCT to be realistic about recruitment and retention of participants in the study. This could be assessed through a feasibility study.
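The recruitment argument can be made concrete with a rough sketch. The function name `recruitment_years` and the `participation_rate` parameter are hypothetical; the only figures taken from the text are the study population of 381 567 deliveries over 8 years and the required sample size of 195 036 deliveries for neonatal mortality (50% risk reduction, 80% power).

```python
def recruitment_years(required_n, eligible_per_year, participation_rate=1.0):
    """Years needed to recruit required_n deliveries.

    participation_rate is a hypothetical planning assumption: the share of
    eligible deliveries that can actually be included in the trial.
    """
    if not 0.0 < participation_rate <= 1.0:
        raise ValueError("participation_rate must be in (0, 1]")
    if eligible_per_year <= 0:
        raise ValueError("eligible_per_year must be positive")
    return required_n / (eligible_per_year * participation_rate)


# Eligible deliveries per year, derived from the study population in the text
ELIGIBLE_PER_YEAR = 381_567 / 8

# Nationwide recruitment with full participation versus 50% participation
years_full = recruitment_years(195_036, ELIGIBLE_PER_YEAR)
years_half = recruitment_years(195_036, ELIGIBLE_PER_YEAR, participation_rate=0.5)
```

Even with every eligible delivery in the country included, the neonatal mortality example would take about four years, and the duration doubles if only half of the eligible women participate.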
Our sample size calculations showed that a major contributor to the required sample size was the change from 50% to 25% in the risk reduction. Applying a risk reduction of 50% instead of 25% to the sample size calculation in a tentative RCT with a rare outcome such as Apgar score <7 at 5 minutes would still require a large-scale multicenter study. Multicenter studies have the advantage of including more participants in shorter time. However, multicenter studies are considerably more complex to run than single-site studies. Furthermore, the sample size calculations depend upon the assumption that the differences between the compared interventions in the centers are unbiased estimates of the same quantity. Based on previous studies, a reduction in risk of 50% or 25% in Apgar score <7 at 5 minutes is probably unrealistic. 17,18 A realistic and still clinically important reduction in Apgar score <7 at 5 minutes might be 10%, which would require an even larger sample size.
With a more common outcome such as ECS, conducting RCTs is more feasible because smaller sample sizes are required to achieve adequate power. This might explain why ECS is often seen as an outcome in obstetric studies. 17,19-21 ECS may be a relevant outcome, but it is also easier to obtain sufficient power to show statistically significant results than with a rarer outcome.
Furthermore, in many studies an effect on the more common outcomes is often found and the interpretation is that a given intervention has only affected these common outcomes. The intervention, however, could potentially also have affected the rare outcomes, but the study might be underpowered to show this effect.
Meta-analyses of RCTs are a way of increasing the power of the estimated intervention effect. However, meta-analyses are, like single-site studies, prone to risk of systematic and random error. 22,23
Composite outcomes, which are sometimes used in studies with rare outcomes, 24-26 combine several variables that are considered to be equivalent into one outcome to increase the total incidence of these outcomes. Composite outcomes enable the study to be performed with a smaller sample and/or in less time. However, composite outcomes often provide an unclear reflection of the effect because the outcomes are not necessarily equivalent in terms of severity or measurements, and it is possible that the exposure increases the risk of one complication and decreases the risk of another. In the latter situation, the possible effect of the exposure may be camouflaged. 27
RCTs are considered the gold standard for establishing causality between exposure and outcome in healthcare interventions. [...] easier to detect an increase in the incidences of cesarean section than a reduction in morbidity because of the different sample sizes and time needed to detect a significant change in the two outcomes. 5 Moster et al demonstrated that large sample sizes were needed when comparisons of safety between different sizes of delivery units were made for low-risk pregnancies, including stillbirth as the outcome measure. 6 Our findings furthermore provide insights into sample sizes in relation to study design in both rare and more common obstetric outcomes.

| CONCLUSION
Based on Danish national data from an 8-year period, we found that several obstetric outcomes occur rarely. Consequently, very large sample sizes are required to achieve adequate statistical power in tentative RCTs. This necessity entails a risk of studies being underpowered or only showing an effect on common outcomes when an effect on rare outcomes might also exist.
Focusing on international multicenter collaboration and prioritizing a feasible study design can provide high quality evidence when investigating rare outcomes.

ACKNOWLEDGMENT
We would like to express our deepest gratitude to senior advisor Steen Rasmussen, MSc (econ), MPH, Rigshospitalet, for managing the national registry data.

CONFLICT OF INTEREST
The authors have stated explicitly that there are no conflicts of interest in connection with this article.