EVALUATION DE L’EFFICACITE DE LA SELECTION GENOMIQUE
Génotypage, Contrôles qualité et Assignation de parenté
Genotyping was performed at the Gentyane genotyping platform (INRAE, Clermont-Ferrand, France). From the challenged individuals and their parents, 1,152 individuals from VNN_A, VNN_B and 1,151 individuals from VIB were genotyped on the ThermoFisher AxiomTM 57k SNP DlabChip. In the PAS cohort, 1,026 individuals were genotyped on the ThermoFisher AxiomTM 60k SNP SaurChip. Genotyped individuals were sub-sampled from the challenged ones ensuring each sub-sample had the same average survival rate as the whole challenge batch. SNP calling was performed using the ThermoFisher AxiomAnalysisSuite software and quality controls with PLINK 1.9 (Purcell and Chang, 2015). First, individuals with a genotyping rate lower than 90% were discarded. Then, for the sea bass data sets (VNN_A, VNN_B and VIB), genotyping quality controls were performed on a global population composed of the three data sets. A common set of markers was subset by keeping only markers with a call rate higher than 95%, a minor allele frequency higher than 0.05 and a p-value for the departure from Hardy-Weinberg equilibrium test higher than 10-8 , resulting in 44,772 common markers to be used for the three cohorts. For the PAS data set, markers with a call rate higher than 95%, a minor allele frequency higher than 0.05 and a p-value for the departure from Hardy-Weinberg equilibrium test higher than 10-4 were retained, leaving 43,618 usable markers. Finally, all missing genotypes were imputed with FImpute (Sargolzaei et al., 2014). Parentage assignment was done using 1,000 randomly sampled markers, analysed with the R package APIS (Griot et al., 2020) with a positive assignment error rate set to 1%.
Estimation de l’héritabilité
For each data set, we estimated the heritability of disease resistance with either a threshold model using THRGIBBSF90 (Tsuruta and Misztal, 2006) or a linear model using AIREMLF90 (Misztal et al., 2002). Only individuals with a phenotype, a genotype and a pedigree were used in heritability estimates, thus the sample size was 1,027 in VNN_A and, 1042 in VNN_B, 1,049 in VIB and 916 in PAS. The following model was computed in each data set using both threshold and linear models: 𝒚 = 𝟏𝑏 + 𝒁𝒖 + 𝒆 With y the vector of the phenotypes measured as binary dead/survival trait, 1 the incidence (unity) vector of the intercept, b the estimate of the intercept effect, u the vector of breeding values and Z the corresponding incidence matrix. To compare pedigree-based and genomic-based heritability estimation, 𝑢 followed either a multivariate normal distribution N(0, 𝑨σ²g) with 𝑨 the pedigree-based relationship matrix or a multivariate normal distribution N(0, 𝑮σ²g) with 𝑮 the genomic relationship matrix proposed by VanRaden (2008). σ²g is the additive genetic variance and e is the vector of the random residual errors that follows a normal distribution N(0, Iσ²e) with σ²e the residual variance and I the identity matrix. With the threshold model, the variance components (σ²g andσ²e) were estimated using a Gibbs sampler with 500,000 iterations, 100,000 of burn-in and one sample was kept every 20 iterations for posterior analysis. The residual variance σ²e was set to a value of 1. The posterior distributions were analyzed with the R package boa (Smith, 2007).With the linear model, the same components were estimated using a restricted maximum likelihood algorithm, considering the observed binary phenotype as a continuous variable. The heritability for survival was estimated as: ℎ 2 = 𝜎² 𝑔 𝜎² 𝑔 + 𝜎²𝑒 Heritability on the observed scale (h²o) was estimated using the variance components from the linear model, while the heritability on the underlying liability scale (h²u) was computed using the variance components from the threshold model. 4.1.3.6. Création de puces virtuelles de basse densité From the SNP markers obtained after quality controls, four low-density (1K, 3K, 6K and 10K) virtual SNP chips were created. To do so, we used a marker pruning method based on the linkagedisequilibrium (LD) (Porto-Neto et al., 2013). In a user-defined sliding window, every pairwise LD between markers was estimated by the r² metrics. Then, SNPs were pruned until no pair had a r² greater than a given value, until we reached the desired number of SNPs. For sea bass data sets, we used an iterative method to create the low-density chips. For the creation of the sea bass 10K chip, we used an iterative method with a sliding window of 1 Mb and a r² value of 0.434. First, the SNP were pruned in the cohort VNN_A data set with the LD method explained above. Then, the remaining markers in cohort VNN_A were subset from cohort VNN_B and the same pruning method was applied on the remaining ones. The same process was done one last time on the cohort VIB data set to obtain the desired number of markers. Finally, the remaining markers in the VIB data set were subset in cohort VNN_A and cohort VNN_B data sets to obtain 10,020 markers in each cohort. 81 The same process was repeated to create the 1K, 3K and 6K chips by modifying the sliding window size of 100, 200 and 500 kb and the r² threshold value of 0.175, 0.268 and 0.344 for 1K, 3K and 6K respectively. The virtual chips for sea bass contained 1,007, 3,022, 6,010 and 10,020 markers for the 1K, 3K, 6K and 10K respectively. For the PAS data set, the pruning was done by applying the same protocol, but on one dataset only. The chips contained 1,007, 2,999, 6,011 and 10,010 markers for the 1K, 3K, 6K and 10K respectively and were done by using a sliding window of 100, 200, 500, 1,000 kb and a r² threshold of 0.044, 0.191, 0.318 and 0.434 for the 1K, 3K, 6K and 10K respectively. In each species, the size of the sliding window as well as the r² value for LD were empirically chosen to uniformly sample the desired number of markers.
Création de la population d’entrainement
We fixed the validation population size to 200 individuals randomly chosen from the whole population in each data set. The training population was composed of a minimum of 50 individuals up to 800 in sea bass data sets and 700 in sea bream data set. From 50 to 200, we added 50 individuals at each step and then, from 200, we added 100 individuals per step. The initial 50 individuals were randomly sampled from remaining individuals after the validation population had been chosen. Then, to increase the training population size, the added individuals were randomly chosen from the remaining individuals and added to the previous training population. This process was repeated until the training validation population size reached the maximum limit. To account for stochastic sampling effects, the entire process was repeated 100 times for each data set. 4.1.3.8. Effet de la densité de marqueurs et de la taille de la population d’entrainement sur la précision de prédiction For each replicate of training population size, we tested six densities of markers. We tested the lowdensity SNP chips (1K, 3K, 6K and 10K) as well as the full SNP chip (44K for sea bass data sets and 43K for sea bream data set) and a control case with only pedigree information and no genomic information. The phenotypes of the individuals in the validation population were masked and the breeding values were estimated with the same linear model as the one described in “Heritability estimation” using the genomic relationship matrix when genomic information was available or the pedigree-based relationship matrix otherwise. The estimation using the genomic relationship matrix was reported as GBLUP and the estimation with the pedigree relationship matrix as PBLUP. The accuracy was computed as: 𝑟 = 𝑐𝑜𝑟(𝐸𝐵𝑉, 𝑦) ℎ With cor(EBV, y) the correlation between the estimated breeding values (EBV) and the phenotypes y of the 200 individuals belonging to the validation population and h the square-root of the heritability estimated with a linear model, using the pedigree-based relationship matrix and the whole data set as in the section “Heritability estimation.