PREFACE: In silico pipeline for accurate cell‐free fetal DNA fraction prediction

Abstract
Objective: During routine noninvasive prenatal testing (NIPT), cell‐free fetal DNA fraction is ideally derived from shallow‐depth whole‐genome sequencing data, avoiding the need for additional experimental assays. The fraction of reads aligned to chromosome Y enables proper quantification for male fetuses; for female fetuses, advanced predictive procedures are required. This study introduces PREdict FetAl ComponEnt (PREFACE), a novel bioinformatics pipeline to establish fetal fraction in a gender‐independent manner.
Methods: PREFACE combines the strengths of principal component analysis and neural networks to model copy number profiles.
Results: For sets of roughly 1100 male NIPT samples, a cross‐validated Pearson correlation of 0.9 between predictions and fetal fractions according to Y chromosomal read counts was noted. PREFACE enables training with both male and unlabeled female fetuses. Using our complete cohort (n_female = 2468, n_male = 2723), the correlation metric reached 0.94.
Conclusions: Allowing individual institutions to generate optimized models sidelines between‐laboratory bias, as PREFACE enables user‐friendly training with a limited amount of retrospective data. In addition, our software provides the fetal fraction based on the copy number state of chromosome X. We show that these measures can predict mixed multiple pregnancies, sex chromosomal aneuploidies, and the source of observed aberrations.

Several fetal gender-independent methodologies have been described to assess FF. Prior parental genomic information often facilitates some of these procedures: eg, paternal or maternal homozygous loci that are found to be partly heterozygous in maternal blood during pregnancy form a precise platform to quantify FF. 9-11 Nonetheless, parental priors are not always required: using binomial mixture modeling, fetal and maternal clusters of single nucleotide polymorphisms also reflect FF, yet a higher sequencing depth is required. 12 Likewise, different inputs, such as molecule size (cell-free fetal DNA fragments are often shorter) and methylation patterns (some fetal sites are hypermethylated), enable FF prediction. 13-16 Routine NIPT is converging towards a cost-effective recipe, with automated back-end computational pipelines expecting mostly single-end shallow-depth whole-genome sequencing data (sWGS; 0.1-1x coverage) to determine copy number alterations. 17 The FF-determining techniques discussed above imply the need for additional laboratory steps and/or (currently) nonfeasible deep sequencing. Therefore, a handful of tools have been developed to predict FF based exclusively on sWGS data. The copy number state of the X chromosome, and especially the number of observed Y chromosomal reads, form popular foundations to calculate FF; here, these are referred to as fetal fraction based on chromosome X (FFX) and fetal fraction based on chromosome Y (FFY), respectively. 18,19 Unfortunately, they are only informative for male fetuses. Accordingly, two other approaches have been described to predict FF without relying on the gonosomes. One of these exploits nucleosome positions, hypothesizing that shorter fetal fragments are caused by differential nucleosome packaging. 20 The spatial distribution of mapped reads should represent FF; however, the reported performance of the predictive model seems rather unsatisfactory. 19 Finally, SeqFF, which uses a model designed directly on bin-wise copy number features of more than 25 000 pregnant women, reports accurate FF determination, with a Pearson correlation between predictions and FFY of 0.932. 21 The inventors state that cell-free fetal and maternal fragments are not uniformly distributed across the human reference genome: small differences in local read counts are predictive for FF. Aside from the seemingly excessive number of required male training samples, the software does not provide a training option. Therefore, users are restricted to a pretrained alternative. Because of inevitable differences in laboratory and computational procedures between training and test cases, the correlation is expected to be lower than what is claimed.
Applying similar biological principles as used by SeqFF, we

| Library preparation and sequencing
Blood samples were collected in 10-mL cell-free DNA BCT tubes (Streck) or PAXgene Blood DNA Tubes (Qiagen). Within 24 hours after collection, plasma isolation was executed by centrifugation (4°C; 10 minutes at 1600 g; 10 minutes at 16 000 g, or 15 minutes at 1900 g, respectively). The supernatant was transferred to a new tube and cfDNA was extracted from 3.5-mL plasma using the Maxwell RSC ccfDNA Plasma Kit (Promega), following the manufacturer's instructions.
Using 25 μL of cfDNA, library preparation was executed on a Hamilton Star liquid handler using the NEXTflex Cell Free DNA-Seq Library Prep Kit (Bioo Scientific) and NEXTflex DNA Barcodes (Bioo Scientific). After pooling, cluster generation and sequencing were completed on a cBot 2 and a HiSeq 3000 system (Illumina), respectively.
The minimal number of reads (single-read; 50-cycle mode) per sample was set to 15 million.

| Copy number profiling
Raw reads were mapped by Bowtie 2 onto human reference genome GRCh38 (and GRCh37, for SeqFF compliance), using the fast-local flag. 22 Biobambam's bamsormadup was used to mark duplicate reads and to sort the resulting BAM files. 23 Indexing was executed by SAMtools. 24 To reliably deduce normalized bin-wise log2 ratios from sWGS data, we preferred WisecondorX, considering it yields superior copy number profiles, as shown by our group in earlier work. 25 These ratios represent the relation between the observed (numerator) and expected (denominator) number of reads, the latter matching the diploid state. Since these values are subject to Gaussian noise, a resolution of 100 kb was selected to yield reasonable noise levels in

What's already known about this topic?
• Cell-free fetal DNA fraction is an important estimate during noninvasive prenatal testing (NIPT).
• Most techniques to establish fetal fraction require experimental procedures, which impede routine execution.
What does this study add?
• PREFACE is a novel software to accurately predict fetal fraction based solely on shallow-depth whole-genome sequencing data, the fundamental base of a default NIPT assay.

Regions without resulting information were interpreted as loci of undeterminable copy number, as defined by WisecondorX.
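The log2 ratio described in this section can be illustrated with a few lines of Python. This is only a sketch of the ratio definition (observed over expected reads per bin), not of the WisecondorX normalization itself. The function name `log2_ratios`, the toy read counts, and the choice to mark bins with non-positive counts as undeterminable (`None`) are assumptions made for illustration.

```python
import math


def log2_ratios(observed, expected):
    """Return bin-wise log2(observed / expected).

    Bins with missing or non-positive counts are returned as None,
    mimicking loci of undeterminable copy number (an assumption here).
    """
    ratios = []
    for obs, exp in zip(observed, expected):
        if obs is None or exp is None or obs <= 0 or exp <= 0:
            ratios.append(None)
        else:
            ratios.append(math.log2(obs / exp))
    return ratios


# Toy normalized read counts for four 100 kb bins (hypothetical values).
observed = [120.0, 95.0, 0.0, 210.0]
expected = [100.0, 100.0, 100.0, 105.0]
ratios = log2_ratios(observed, expected)
print(ratios)
```

A ratio of 0 corresponds to the expected diploid state; positive and negative values point to relative gains and losses, respectively.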

| Response variable FFY
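For male fetuses, the FF is linearly proportional to the read depth-corrected mean number of observed Y reads ($Y_{\mathrm{NIPT,male}}$). In the formula below, the prior or naive FFY is interpreted as a $Y_{\mathrm{NIPT,male}}$ observation between the median of a set of male liquid biopsies (LBs), $\tilde{Y}_{\mathrm{LB,male}}$ (FFY = 100%), and female background noise, $\tilde{Y}_{\mathrm{NIPT,female}}$ (FFY = 0%). For female fetuses, the prior FFY is set to 0.

$$\mathrm{FFY}_{\mathrm{prior}} = \frac{Y_{\mathrm{NIPT,male}} - \tilde{Y}_{\mathrm{NIPT,female}}}{\tilde{Y}_{\mathrm{LB,male}} - \tilde{Y}_{\mathrm{NIPT,female}}} \times 100\%$$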
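The prior FFY can be read as a linear interpolation between two anchors, which the following Python sketch makes explicit. It assumes that the female background and the male liquid biopsy reference are both summarized by their medians; the function name `prior_ffy`, the `is_male` argument, and all numeric values are hypothetical.

```python
from statistics import median


def prior_ffy(y_sample, y_male_lb, y_female_nipt, is_male=True):
    """Prior (naive) FFY in percent for one NIPT sample.

    y_sample      : read depth-corrected Y read measure of the sample
    y_male_lb     : same measure for male liquid biopsies (anchor: 100%)
    y_female_nipt : same measure for female-fetus NIPT samples (anchor: 0%)
    """
    if not is_male:
        return 0.0  # the text sets the prior FFY of female fetuses to 0
    upper = median(y_male_lb)
    lower = median(y_female_nipt)
    return 100.0 * (y_sample - lower) / (upper - lower)


# Hypothetical reference values and one male NIPT sample.
male_lb = [0.0040, 0.0042, 0.0041]
female_nipt = [0.0001, 0.0001, 0.0002]
ffy = prior_ffy(0.0005, male_lb, female_nipt)
print(round(ffy, 2))
```

With these toy numbers, the sample lies one tenth of the way between the female background and the male reference.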
As previously reported, masking the Y chromosome prior to calculating FFY increases the precision. 18,19 We took this concept one step further by creating a model that provides a weighted selection of the most appropriate set of Y windows. This way, a large increase in power to separate males from females was noted. We believe hypervariable FF-unrelated bins are down-weighted, yielding a presumably more accurate FFY overall. A general linear model with lasso regularization ($\lambda = 10^{-4}$) was selected, using the read depth-normalized number of reads at 5 kb Y bins as explanatory parameters, and the prior FFY as a response variable (Figure S1). The fitted model parameters were retrieved to infer a final FFY, as shown below.

$$\mathrm{FFY}_{\mathrm{final}} = \beta_0 + \sum_{k=1}^{n} \beta_k y_k$$

Above, $\beta_0$ is the intercept, $\beta_k$ indicates the beta estimate for bin $k$, whereas $y_k$ represents the observed normalized number of reads at the same locus. Chromosome Y has $n$ bins ($n$ = 11 447). Note that the final FFY was calculated using a cross-validation strategy: different models were trained to circumvent overlap between train and test cases. An overall model determined that 10.76% of chromosome Y remained available for FFY determination ($\beta_k \neq 0$). The Pearson correlation between the prior and final FFY was 0.985 for male fetuses.
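Once the lasso coefficients are known, inferring the final FFY is a weighted sum over Y bins, and the share of informative bins is the fraction of nonzero coefficients. The sketch below illustrates both steps in Python. It does not fit the lasso model; the coefficients, bin values, and function names (`final_ffy`, `informative_fraction`) are hypothetical.

```python
def final_ffy(intercept, betas, y_bins):
    """Final FFY as intercept plus the beta-weighted sum of Y bin read counts."""
    if len(betas) != len(y_bins):
        raise ValueError("betas and y_bins must have the same length")
    return intercept + sum(b * y for b, y in zip(betas, y_bins))


def informative_fraction(betas):
    """Fraction of bins kept by the lasso, ie, bins with a nonzero beta."""
    return sum(1 for b in betas if b != 0) / len(betas)


# Hypothetical fitted parameters for four bins; two were shrunk to zero.
intercept = 0.5
betas = [0.0, 2.0, 0.0, 1.5]
y_bins = [3.0, 4.0, 5.0, 2.0]

ffy_final = final_ffy(intercept, betas, y_bins)
kept = informative_fraction(betas)
print(ffy_final, kept)
```

Bins with a coefficient of exactly zero do not contribute, which is how the lasso penalty acts as a soft mask over chromosome Y.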

| PREFACE software
The PREFACE software, written in R, is divided into two large components: one for training and one for predicting (Figure 1). It is available  (Figure 2A,B). Indeed, adding female samples (or in general, adding more samples) enables the PCA algorithm to explain a larger proportion of (nonrandom) variance in its most important PCs (Figure S2). Although NNs generally perform better, users can opt for an OLM instead, as these tend to be more reliable on smaller data sets (Figure S3). For sets of roughly 1100 male samples, a correlation of 0.9 is reached.

| Females
Since NIPT samples from female fetuses lack independent FF measurements, PREFACE values were compared with SeqFF predictions, an approach proven to be applicable to female cases. Two major conclusions could be drawn. First, for males, the correlation between FFY and SeqFF predictions is "only" 0.887, lower than the reported 0.932, presumably owing to experimental differences between the pretrained SeqFF model and FFY (Figure S4a). 21  SeqFF of 0.895 (Figure S4b). As expected, a similar yet inverse inconsistency with the identity relation is retrieved, validating PREFACE's applicability to female fetuses.

| Fetal fraction based on chromosome X
The relation between FFX and FFY seems trivial. Therefore, the PREFACE software solely fits a robust linear model (RLM) to the provided male fetuses without executing cross-validation. A weighted correlation as high as 0.971 supports this approach (Figure S5). 30 Extreme outliers are caused by (mosaic) (sub)chromosomal maternal rearrangements, illustrating the need for a robust model.

| There is a strong correlation between FF predictions and confirmed aneuploidies
Throughout the NIPT cohort, 14 fetuses were reported with con-   potential aneuploidy. This is shown by a compelling concordance between the mean log2 ratio of confirmed whole-chromosome duplications and predictions of r = 0.959, additionally indicating PREFACE's accuracy (Figure 2C). Where the amplitudes of fetal abnormalities are positioned according to expectation, defined as in Adalsteinsson et al, nonfetal observations are randomly scattered (Figures S6 and S7). 31 Here, the difference between the expected FF (based on confirmed aberrations) and predicted FF (according to PREFACE) is characterized by a standard deviation of 1.92%.
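The link between FF and the amplitude of a fetal whole-chromosome duplication can be sketched as follows. The example assumes a simple two-component mixture: maternal DNA carries two copies of the chromosome, the fetus carries three, and FF is expressed as a fraction between 0 and 1. Whether this matches the cited definition in every detail is not verified here; the function names are hypothetical.

```python
import math


def expected_log2_gain(ff):
    """Expected log2 ratio of a fetal trisomy at fetal fraction ff (0-1).

    Mean copy number is 2 * (1 - ff) + 3 * ff, relative to a diploid
    expectation of 2, which simplifies to a ratio of 1 + ff / 2.
    """
    return math.log2(1.0 + ff / 2.0)


def ff_from_log2_gain(log2_ratio):
    """Invert the relation above: the FF implied by an observed log2 ratio."""
    return 2.0 * (2.0 ** log2_ratio - 1.0)


ratio_at_10_percent = expected_log2_gain(0.10)
implied_ff = ff_from_log2_gain(ratio_at_10_percent)
print(round(ratio_at_10_percent, 4), round(implied_ff, 4))
```

Under this model, a fetal trisomy at 10% FF shifts the log2 ratio by only about 0.07, whereas an aberration of maternal origin would show an amplitude unrelated to FF.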

| PREFACE empowers gender prediction in multiple pregnancies
Besides single pregnancies, the NIPT cohort includes 177 twin pregnancies, established through ultrasonography. The ratio between FFY and true FF naturally provides information about the gender of each fetus: two males are theoretically characterized by a ratio of 1 and two females by a ratio of 0, whereas for mixed pregnancies a ratio close to 0.5 is expected.
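The three theoretical ratios can be turned into a naive nearest-value rule, sketched below in Python. This rule only illustrates the expected values of 0, 0.5, and 1; it is not the mixture-model approach applied to the cohort, and it ignores measurement noise and unequal fetal contributions. The names `EXPECTED_RATIOS` and `twin_sex_guess` and the example values are hypothetical.

```python
# Theoretical FFY / FF ratios for the three twin sex combinations.
EXPECTED_RATIOS = {"female-female": 0.0, "mixed": 0.5, "male-male": 1.0}


def twin_sex_guess(ffy, ff):
    """Return (ratio, label) using the closest theoretical FFY / FF ratio."""
    if ff <= 0:
        raise ValueError("ff must be positive")
    ratio = ffy / ff
    label = min(EXPECTED_RATIOS, key=lambda name: abs(EXPECTED_RATIOS[name] - ratio))
    return ratio, label


# Hypothetical twin pregnancy: FFY of 5.2% against a predicted FF of 10%.
ratio, label = twin_sex_guess(5.2, 10.0)
print(round(ratio, 2), label)
```

At low FF, small absolute errors in FFY or FF translate into large errors in the ratio, so a hard threshold of this kind becomes unreliable.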
Our cohort contains both confirmed (by birth) and unconfirmed twin genders. The density distribution of the ratio between FFY and FF intrinsically represents the ability to distinguish different combinations of genders. Using Gaussian mixture modeling, three distinct peaks are retrieved across twins lacking gender confirmation (Figure 3A). This suggests that female twins can be categorized with high accuracy, yet discriminating male-male from male-female twins remains difficult for pregnancies with low FF (Figure 3B). Finally, a similar visualization, holding validated genders, confirms the reliability of this technique (Figure 3C).
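For readers unfamiliar with Gaussian mixture modeling, the following Python sketch fits a three-component, one-dimensional mixture to FFY/FF ratios with a plain expectation-maximization loop. It is a minimal illustration under stated assumptions, not the implementation used in this study: the initial means (0, 0.5, 1), the starting variance, the fixed number of iterations, and the toy ratios are all hypothetical choices.

```python
import math


def fit_gmm_1d(values, initial_means, n_iter=50, initial_var=0.01, min_var=1e-6):
    """Fit a 1D Gaussian mixture by expectation-maximization.

    Returns (weights, means, variances), one entry per component.
    """
    k = len(initial_means)
    means = list(initial_means)
    variances = [initial_var] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each observation.
        resp = []
        for x in values:
            dens = [
                w * math.exp(-((x - m) ** 2) / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)
                for w, m, v in zip(weights, means, variances)
            ]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: update weights, means, and variances per component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(values)
            means[j] = sum(r[j] * x for r, x in zip(resp, values)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, values)) / nj
            variances[j] = max(var, min_var)  # floor avoids a collapsing component
    return weights, means, variances


# Toy FFY / FF ratios for twelve twin pregnancies (hypothetical values).
ratios = [-0.02, 0.01, 0.03, 0.0, 0.48, 0.52, 0.5, 0.47, 0.98, 1.02, 1.0, 0.97]
weights, means, variances = fit_gmm_1d(ratios, initial_means=[0.0, 0.5, 1.0])
print([round(m, 3) for m in means])
```

On real data, the components overlap more strongly at low FF, which is where separating male-male from mixed twins becomes difficult.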

| PREFACE indirectly hints towards potential sex aneuploidies
With PREFACE, FFY, and FFX, three methods have been presented to establish FF. A consequence of adopting these estimates-next to what has already been discussed-is the inherent information on sex aneuploidies they potentially reveal. Sex aneuploidies were until now not reported by our institution; therefore, none are confirmed, meaning this final section is purely indicative and further experimental validation is warranted.
A dual modeling strategy was developed. First, by simultaneously comparing both FFX and FFY to PREFACE predictions, the power to distinguish genders increases.
Eight FFX outliers (less than −40%; greater than 40%), caused by maternal aberrations, were removed prior to fitting Gaussian (mixture) models to analytically describe the density distributions, expecting three (males, females, and mixed twins) and one component(s), respectively (Figure 4A,B). Optimally, the results are presented in a three-dimensional all-inclusive figure, plotting FFY, FFX, and PREFACE values along its axes (File S1). Here, we opted to visualize the results in accordance with two preferred viewpoints (Figure 4C,D). It is notable that confirmed twins are highly enriched in the middle Gaussian component of Density 1: these are mixed twin pregnancies.
In total, 39 (0.71%) cases significantly deviate from the healthy FFY-FFX trend. The majority of these likely concern (mosaic) maternal events and a few suspected subchromosomal aberrations. However, four XXY, two XYY, one XXX, and no monosomy X fetuses seem to be present when evaluating the FFX-FFY outliers as a function of the PREFACE predictions (Figure S8). Notably, these numbers largely correspond to the reported incidence. 32

37,38 Instead of solely categorizing bins as informative and noninformative, we reasoned that the informative bins also differ in their "level of male specificity," thereby encouraging the idea of a bin-wise weighted contribution to FFY. Second, read count normalization was executed by WisecondorX, a sophisticated within-sample normalization procedure, which supposedly delivers superior profiles. 25 Last but not least, the nature of the modeling strategy maximizes training input by allowing unlabeled samples.
Gonosomal aberrations are theoretically exposed during NIPT in a similar way as any other aneuploidy. Nevertheless, the specificity is reported to be much lower in comparison with traditional screening of chromosomes 13, 18, and 21, especially for monosomy X. 39-41 Ethical issues on reporting these sometimes nonsevere abnormalities aside, the incorporation of FF in the statistical outcome (which is generally not done with, eg, the popular z-score approach) does improve performance. 42,43 Indeed, our study was concluded by revealing that 0.71% of all NIPT samples significantly differed from the healthy gonosomal trend; however, when evaluating these outliers in relation to predicted FF, only a few truly met the requirements to qualify as potentially sex aneuploid.
The convenience with which PREFACE could be implemented in existing NIPT pipelines seems undeniable: a copy number profile, the fundamental base of an assay, is the only required input. This paper extensively demonstrates the practical value of accurate FF estimations on real data collected over the course of nine months. We believe PREFACE and the elaborated FF methodologies could be useful to many NIPT laboratories, which motivated this work.

CONFLICT OF INTEREST
None declared.

ETHICS STATEMENT
This study was conducted according to the guidelines of the Ethics Committee at Ghent University Hospital (ID 2004/094).

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available on request from the corresponding author. These are not publicly available due to privacy and ethical restrictions.