# The influence of a first-order antedependence model and hyperparameters in BayesCπ for genomic prediction

## Article information

## Abstract

### Objective

The Bayesian first-order antedependence models, which specified single nucleotide polymorphisms (SNP) effects as being spatially correlated in the conventional BayesA/B, had more accurate genomic prediction than their corresponding classical counterparts. Given advantages of BayesCπ over BayesA/B, we have developed hyper-BayesCπ, ante-BayesCπ, and ante-hyper-BayesCπ to evaluate influences of the antedependence model and hyperparameters for v_{g} and

### Methods

Three public data (two simulated data and one mouse data) were used to validate our proposed methods. Genomic prediction performance of proposed methods was compared to traditional BayesCπ, ante-BayesA and ante-BayesB.

### Results

Through both simulation and real data analyses, we found that hyper-BayesCπ, ante-BayesCπ and ante-hyper-BayesCπ were comparable with BayesCπ, ante-BayesB, and ante-BayesA regarding the prediction accuracy and bias, except the situation in which ante-BayesB performed significantly worse when using a few SNPs and π = 0.95.

### Conclusion

Hyper-BayesCπ is recommended because it avoids pre-estimated total genetic variance of a trait compared with BayesCπ and shortens computing time compared with ante-BayesB. Although the antedependence model in BayesCπ did not show the advantages in our study, larger real data with high density chip may be used to validate it again in the future.

**Keywords:**BayesCπ; Antedependence Model; Hyperparameter

## INTRODUCTION

Methods for genomic prediction have been developed widely since Meuwissen et al [1] firstly proposed genomic selection methods. To date, there are mainly two classical types of methods applied to genomic prediction in livestock breeding. The first one is BLUP methods such as GBLUP [2], which assume that single nucleotide polymorphisms (SNP) effects independently and identically follow a normal distribution with an equal variance. The second one is Bayesian hierarchical models with different prior distributions on the variances of SNP effects. For example, BayesA assumes a scaled-inverted chi square distribution on SNP-specific variances [1].

Habier et al [3] firstly developed BayesCπ for genomic prediction to address the drawbacks of BayesA and BayesB with respect to influences of prior hyperparameters and the prior probability π that a SNP has zero effect. BayesCπ assumes that all SNPs have a common effect variance instead of locus-specific variances in BayesA or BayesB so that the influence of

So far, most of Bayesian methods, such as BayesA, BayesB, and BayesCπ, have a common feature that SNP effects are assumed to be independent. However, the existence of LD between SNPs makes the above assumption improper, and a close relationship should exist between SNPs. To account for LD between SNPs, Yang and Tempelman [4] brought first-order antedependence models which model SNP effects as correlated into the conventional BayesA and BayesB, and developed ante-BayesA and ante-BayesB. It is reported in their paper that the antedependence methods had significantly higher accuracies than the corresponding classical counterparts at higher LD levels (r^{2}>0.24) in the simulation study, and they were also more accurate in mice data and other benchmark data sets.

However, although there are advantages of BayesCπ over BayesA and BayesB aforementioned, first-order antedependence models are still not brought to BayesCπ. Thus, we will develop ante-BayesCπ, and then investigate whether ante-BayesCπ can improve the prediction accuracy and bias compared to BayesCπ. Besides, it has not been widely appreciated that v_{g} and

Therefore, the objective of this study was to investigate the impact of adding a first-order antedependence model and hyperparameters into BayesCπ for genomic prediction regarding the prediction accuracy and bias. We have demonstrated the predictive ability of BayesCπ with first-order antedependence models and hyperparameters in simulation data from the study of Jiang et al [6], the 15th quantitative trait loci-marker assisted selection (QTL-MAS) workshop data set and the heterogeneous stock mice, compared to the classical BayesCπ, ante-BayesA, and ante-BayesB.

## MATERIALS AND METHODS

### Statistical model

The general linear mixed model used for genomic prediction could be written as

Here, y was a vector of phenotypes, β was a vector of fixed effects, g was a vector of random SNP effects, and e was a residual vector. X was a known incidence matrix linking β to y. Z was a known matrix linking g to y in which SNP genotypes were coded as 0, 1, or 2 copies of one allele for each SNP (column) and animal (row). Furthermore, we specified **g** ~N (0,**G**), where

### BayesCπ

Habier et al [3] firstly developed BayesCπ, and the obvious difference between BayesCπ and other methods such as GBLUP, BayesA, and BayesB was the characterization of **G**.

Here, π represented the proportion of SNPs having no genetics effects on the trait of interest. Based on Habier et al [3], the prior of π was treated as an unknown with uniform (0,1). The v_{g} was set to 4.2, and

Where

### Hyper-BayesCπ

We labeled hyper-BayesCπ when we sampled v_{g} using a Metropolis Hastings (MH) update and
_{g} and
_{g}~p(v_{g}) ∝ (v_{g}+1)^{−2}, and
*α** _{s}* =1.0 and rate parameter

*β*

*=1.0. The full conditional density (FCD) of v*

_{s}_{g}did not have a recognizable form, and a random walk MH step other than a Gibbs step was used to sample v

_{g}from this FCD. The detailed procedure can be seen Yang et al [7]. The FCD of

### Ante-BayesCπ

Ante-BayesCπ was defined here as combining BayesCπ and the first-order antedependence model proposed from the study of Yang and Tempelman [4]. Likewise, the following nonstationary first-order antedependence correlation structure for **g** based on the relative physical location of SNPs along the chromosome was used:

Here, t_{j,j-1} was the marker interval-specific antedependence parameter of g_{j} on g_{j–1} in the specified order. δ_{j} was the residual SNP effect following

It should be noted that, here π represented the proportion of SNPs having no residual SNP effects, which was a little different from π of BayesCπ and hyper-BayesCπ.

### Ante-hyper-BayesCπ

We labeled the model as ante-hyper-BayesCπ in which the hyperparameters for v_{g} and
**g** were combined.

For ante-BayesCπ and ante-hyper-BayesCπ, we specified
*u*_{t}_{0},
*v** _{t}*, and

_{j,j-1}was set 0 between the last SNP of one linkage group or chromosome and the first SNP in the subsequent group. The remaining priors were also specified, such as π~ U(0,1), v

_{δ}= v

_{g}and

_{δ}~p(v

_{δ})∝(v

_{δ}+1)

^{−2}in ante-hyper-BayesCπ. For all methods, the noninformative prior

We used Fortran 90 to compile BayesCπ, extent forms of BayesCπ, ante-BayesA and ante-BayesB, and program codes can be seen in Supplementary Material (See e-version for Supplement). It should be pointed out that ante-BayesB with a pre-defined π in our paper was extended from BayesB of Meuwissen et al [1], while π was inferred from the data in ante-BayesB of Yang and Tempelman [4].

In order to evaluate the performance of our proposed methods, we used data sets from other reported studies, i.e., simulation data from the study of Jiang et al [6], the 15th QTL-MAS workshop data set and real data on the heterogeneous stock mice from studies of Yang and Tempelman [4] and Gao et al [10], to compare them with ante-BayesA and ante-BayesB with respect to the accuracy and bias of genomic prediction.

### Simulation data

We downloaded simulation data from the study of Jiang et al [6], and the detailed simulation process could be seen in their study. Briefly, the default scenario was as follows: a total of 30 QTLs with a minor allele frequency >0.05 was randomly selected, and their effects on two traits were drawn from a standard bivariate normal distribution with correlation 0.5. Normal error deviates were added to achieve heritablities of 0.5 for trait 1 and 0.1 for trait 2. The error covariance between two traits was set to 0.

Sampling every 10th and 25th SNPs from the full set of SNPs respectively was conducted to form two subsets of SNPs. Averaged over 30 replicates, the average number of SNPs were 4,119, 411, and 164 for the full set and two subsets, which resulted in three average LD levels (r^{2} = 0.33, 0.22, and 0.14) of adjacent SNPs corresponding to the full set of SNPs and other two subsets with SNP intervals of 10 and 25, respectively. In our study, we selected trait 1 with both phenotype and genotype for first 20 of 30 replicates with the full dataset and other two subsets of SNPs as test data to investigate the influence of LD between adjacent SNPs on the prediction accuracy and bias. Generation 5001 was considered as the reference population, and generation 5002 as the candidate population.

### Analysis of the 15th QTL-MAS workshop data set

A base population was a collection of 20 sires and 200 dams. Each sire was mated to 10 dams, and each dam was mated to only one sire. Within each family, one dam gave birth to 15 offspring. The 10 progenies were randomly assigned to the reference population with trait phenotypes and marker genotypes information, and other 5 belonged to the candidate population, only recorded for genotype marker information. Thus, 2000 individuals formed the reference population, and the candidate population included 1,000 individuals. A genome consisting of 9,990 SNPs on five chromosomes with 1 Morgan each were simulated without any missing data and genotyping error. Eight QTLs (1 quadri-allelic, 2 linked in phase, 2 linked in repulsion, 1 imprinted and 2 epistatic) were simulated. Random error was added to create an heritability of 0.30 for the trait analyzed [11]. In the following analyses, we removed all SNPs with minor allele frequency = 0, leaving 7,121 SNPs on five chromosomes for comparing performances of different Bayesian methods aforementioned. These SNPs had been sorted by the physical position.

### Application to heterogeneous stock mice data set

A population of heterogeneous stock mice was generated by the Wellcome Trust Centre for Human Genetics (WTCHG) (http://gscan.well.ox.ac.uk/). This population was formed by a crossing of eight inbred strains, followed by 50 generations of pseudorandom mating [12]. The extent of LD in this population is small, and average r^{2} among adjacent SNPs is 0.62 [13]. It is well known for the family structure and history of this population, and thus interpretation of results will be easy.

In order to compare performances of genomic prediction using different Bayesian methods, we selected four traits: body weight at 6 weeks (W6W), growth slope between 6 and 10 weeks of age (GSL), body mass index (BMI), and body length (BL) similar to the study of Legarra et al [13]. For computational simplicity, the pre-corrected phenotypes of these traits provided by Valdar et al [14] were used as pseudo phenotypes for the following analysis. Two sets of genotype information were used, which were directly from studies of Yang and Tempelman [4] and Gao et al [10]. Genotypic data processing of Yang and Tempelman [4] resulted in the data set involving records on 1,917 animals with 950 SNPs on the 19 chromosomes, and the average LD of r^{2} was 0.35 between adjacent markers. In the study of Gao et al [10], there were 1,940 animals and 9,266 SNPs on the 19 chromosomes after their quality control steps, which resulted in the average LD of r^{2} about 0.60.

Numbers of animals having both phenotypes and two sets of genotypes were 1,917, 1,901, 1,814, and 1,821 for W6W, GSL, BMI, and BL respectively. All animals for each trait were randomly divided into two nearly equal-sized partitions of reference and candidate data sets. This was replicated 20 times for comparison of genomic prediction among different Bayesian methods. SNPs were sorted based on their physical positions along the chromosome.

For each of the six Bayesian methods, ante-BayesA, ante-BayesB, BayesCπ, hyper-BayesCπ, ante-BayesCπ, and ante-hyper-BayesCπ, in both simulation data and real data on the heterogeneous stock mice, the Markov chains were run for 350,000 cycles of Gibbs sampling (for ant-BayesB, 100 additional cycles of MH sampling cycle), and the first 50,000 cycles were discarded as the burning period. After the burning period, every 10th cycles were subsequently saved for obtaining estimates of SNP effects. In ante-BayesB, we set π = 0.95 in both the simulation data of Jiang et al [10] and the heterogeneous stock mice application, and 0.99 in the 15th QTL-MAS workshop data set.

Direct genomic values (DGVs) for individuals with genotypes, but no phenotypes, were calculated as the sum of all SNP effects according to their SNP genotypes. The prediction accuracy was calculated as Pearson’s correlation between DGVs and true breeding values (TBVs) in simulation data (or pre-corrected phenotypes in mice data) for the candidate population, and the prediction bias was evaluated by the regression coefficient of DGVs on TBVs in simulation data (or pre-corrected phenotypes in mice data). For the simulation data of Jiang et al [10] and the heterogeneous stock mice, one-way analysis of variance was performed to determine the statistical significance of differences in the accuracy of genomic prediction among above six methods and in estimates of π among BayesCπ, hyper-BayesCπ, ante-BayesCπ and ante-hyper-BayesCπ. The stringent Bonferroni multiple test corrections were used.

## RESULTS

### Results from the simulations

The prediction accuracies and unbias under three different LD levels can be seen in Table 1. For the prediction accuracy, there are not the statistical significance of differences in the accuracy of genomic prediction among six methods for all scenarios (p>0.05). Hyper-BayesCπ performed a little better or similar among all six methods in three scenarios of different LD levels. In the scenario of using all SNPs, ante-hyper-BayesCπ performed as well as hyper-BayesCπ, which was 0.6% higher than BayesCπ and ante-BayesCπ, 0.5% for ante-BayesB, and 0.8% for ante-BayesA, respectively. In other two scenarios, ante-hyper-BayesCπ and ante-BayesCπ performed a little worse than the counterparts. The prediction accuracy for ante-BayesB was much lower than other methods when using the SNP data set with the interval of 25 SNPs which was close to the level of significant difference (p = 0.07).

Regarding the prediction bias, the regression coefficients for all methods were close to 1.0, which indicated good unbiased prediction. Additionally, although there were similar prediction accuracies among BayesCπ and extensions of BayesCπ, the estimated π value from BayesCπ were significantly lower than those from extensions of BayesCπ in the scenario of using all SNPs, ante-(hyper-) BayesCπ had significantly lower π value than (hyper-) BayesCπ in the scenario of the interval of 10 SNPs, and the estimates of π were comparable in the scenario of the interval of 25 SNPs. We found that the π value became smaller as the density of SNPs decreased (Table 2).

### Common data set of the 15th QTL-MAS workshop

Using the above six methods, we also analyzed the 15th QTL-MAS workshop data set. As shown in Table 3, ante-hyper-BayesCπ and ante-BayesCπ had the same prediction accuracy (0.942), and they were 0.1%, 0.3%, 0.6%, and 1.2% higher than hyper-BayesCπ, BayesCπ, ante-BayesB, and ante-BayesA, respectively. The regression coefficients ranged from 1.044 to 1.068 indicating the unbiased genomic prediction for all methods. The π values estimated from BayesCπ, hyper-BayesCπ, ante-BayesCπ, and ante-hyper-BayesCπ were very similar which were 0.996, 0.997, 0.996, and 0.997, respectively.

### Real heterogeneous stock mice data set

As shown in Table 4, the genomic prediction for all six methods was comparable for all traits except for W6W in the scenario of low density SNP data set. The ante-BayesB using low density SNP data performed worse than other methods for all traits (statistical significance for W6W). The hyper-BayesCπ and ante-hyper-BayesCπ sometimes performed a little better than BayesCπ and ante-BayesCπ, such as for body weight trait. Using SNP data set from low density to high density did not reflect the advantage of ante-hyper-BayesCπ (ante-BayesCπ) over hyper-BayesCπ (BayesCπ) on the prediction accuracy for these four trait analyzed in mice data. Additionally, all the methods gave unbiased genomic prediction for each trait, which can be seen in Table 5.

As shown in Table 6, the estimated π were very different for different traits due to different genetic architectures of different traits. The estimated π values ranged from 0.327±0.010 to 0.949±0.009. The estimated π for W6W were higher than other three traits. The estimates of π from BayesCπ, hyper-BayesCπ, ante-BayesCπ, and ante-hyper-BayesCπ were significantly different (p<0.05) for all scenarios except for BMI with high density SNP data.

## DISCUSSION

Our study is the first one to comprehensively investigate the influence of a first-order ante-dependence model and key hyperparameters on BayesCπ using simulation data sets and real mice data set. According to our results, the prediction accuracies for hyper-BayesCπ, ante-BayesCπ, and ante-hyper-BayesCπ were comparable with BayesCπ, ante-BayesB and ante-BayesA except the situation where ante-BayesB performed worse when using a few SNPs and π = 0.95. BayesCπ with an antedependence model and hyperparameters did not work better than BayesCπ regarding the prediction accuracy and bias. Meanwhile, ante-(hyper-) BayesCπ had a longer computing time than BayesCπ and hyper-BayesCπ, and there were similar computing times between BayesCπ and hyper-BayesCπ. When SNP density increased, computing time for ante-(hyper-) BayesCπ increased faster than (hyper-) BayesCπ.

BayesA, BayesB, BayesCπ and their extended forms assume that the prior distribution of variances of SNP non-zero effects is a scaled inverse chi-square distribution with degree of freedom v_{g} and a scale parameter
_{g} and
_{g} for the range (0,1) and a uniform distribution on
_{g}~p(v_{g})∝(v_{g}+1)^{−2} for both BayesA and BayesB. In our study, we have applied

v_{g}~p(v_{g})∝(v_{g}+1)^{−2} for v_{g} and a Gamma (1,1) prior distribution on

Compared with hyper-BayesCπ (BayesCπ), ante-hyper-BayesCπ (ante-BayesCπ) did not show advantages in improving the prediction accuracy and bias for simulation data with LD levels r^{2} ranging from 0.136 to 0.333 and mice data with two different LD levels. This phenomenon is very different from the performance of ante-BayesA (ante-BayesB) over BayesA (BayesB) in the study of Yang and Tempelman [4]. They reported that ante-BayesA and ante-BayeB had significantly higher accuracies than their corresponding classical counterparts at higher LD levels in simulation data and real small data. However, subsequent studies did not show that the antedependence model performed significantly better. Wang et al [16] evaluated the antedependence model performance in Danish pigs and found that ante-BayesA showed lower accuracy compared to other models. Jiang et al [6] also introduced the first-order antedependence model to multi-trait BayesA, and the analysis from simulation and mice data showed that multi-trait ante-BayesA had less than 1% higher accuracies than multi-trait BayesA. The results from these studies were similar to the performance of ante-hyper-BayesCπ (ante-BayesCπ) over hyper-BayesCπ (BayesCπ) from our study.

When using genotype data from low density to high density in mice data, the antedependence model was not more advantageous over the corresponding method for four traits with different heritabilities. This surprising phenomenon may result from the following reasons. On one hand, the antedependence model had higher number of effective parameters, which suggested that the accuracy estimating SNP effects may be poor in current data set (less than 1,000 individuals in reference population) due to model complexity [16]; On the other hand, the relationship for adjacent SNPs is more closed with increasing SNP density. The linear relationship may not be good to explain the relationship for adjacent SNPs, and non-linear relationship may be better. In the future, with sequence data widely being used, the performance of the antedependence model can be validated again in livestock with sequence data.

Habier et al [3] reported that estimates of π from BayesCπ were sensitive to training data size and SNP density. From our results, estimates of π not only were sensitive to SNP density but also were sensitive to the genetic architecture of a quantitative trait. Compared with ante-BayesA and ante-BayesB, BayesCπ and its extensions could provided more information about the genetic architecture of a trait of interest. Although different estimates of π from BayesCπ and its extensions led to little differences on the prediction accuracy, they may have different power for QTL mapping which is interesting to be further studied. Additionally, it should be noted that ante-BayesB in this study was extended from the classical BayesB proposed by Meuwissen et al [1]. Ante-BayesB performed significantly worse than other methods when using a few SNPs, which suggests that treating π as known with a high value may be a poor choice in some cases. This agrees with Daetwyler et al [17] who reported that GBLUP outperformed BayesC with a fixed π when the number of simulated QTL was large, which is also validated by Habier et al [3]. This may be the reason why the extent of classical BayesB in which π is estimated by using a prior distribution and data information was proposed, such as ante-BayesB developed by Yang and Tempelman [4].

Ante-BayesA, ante-BayesB, and ante-(hyper-)BayesCπ had a similar genomic prediction accuracy, and no one outperformed the other methods across all traits, which is consistent with the study of Habier et al [3] who reported the similar prediction accuracy between BayesA, BayesB, and BayesCπ. However, computing time is very different among these three ante-Bayesian methods. From our study, ante-BayesA is the fastest, ante-(hyper-) BayesCπ is next, and ante-BayesB is the slowest. Ante-BayesB had the longest computing time because the 100 cycles of MH step for sampling the locus-specific variances in the implementation of ante-BayesB is repeated in each iteration. Ante-BayesA and ante-(hyper-) BayesCπ had the advantage on computing time, which becomes more and more important as SNP density increases. Compared with ante-BayesA, ante-(hyper-) BayesCπ can shrink SNP effects and is more sensitive to the genetic architecture of a trait of interest, which results in gaining higher accuracies for traits with some large QTL effects, such as fat yield in Holstein populations [18,19].

## CONCLUSION

In conclusion, BayesCπ with prior distributions on v_{g} and

## Supplementary Data

## ACKNOWLEDGMENTS

This work was supported by the National Natural Science Foundation of China (31601009), China Postdoctoral Science Foundation (2016M600699), and China Agriculture Research System (CARS-36).

## Notes

**AUTHOR CONTRIBUTIONS**

LXJ performed the analysis and wrote the manuscript. LXH and CYS conceived the study, gave help in the analysis and contributed to the manuscript. All authors have read and approved the final manuscript.

**CONFLICT OF INTEREST**

We certify that there is no conflict of interest with any financial organization regarding the material discussed in the manuscript.