Potential of F2 pig crosses - Institut für Tierzucht und Tierhaltung
Publication series of the Institute of Animal Breeding and Husbandry of the
Christian-Albrechts-Universität zu Kiel, Issue 235, 2020
©2020 Self-published by the Institute of Animal Breeding and Husbandry
of the Christian-Albrechts-Universität zu Kiel
Olshausenstraße 40, 24098 Kiel
Editor: Prof. Dr. J. Krieter
ISSN: 0720-4272
Printed with the permission of the Dean of the Faculty of Agricultural and
Nutritional Sciences of the Christian-Albrechts-Universität zu Kiel
The Institute of Animal Breeding and Husbandry, Faculty of Agricultural and
Nutritional Sciences of the Christian-Albrechts-Universität Kiel
Potential of F2 pig crosses: perspectives
from population and quantitative genomics
Dissertation
submitted for the Doctoral Degree
awarded by the Faculty of Agricultural and Nutritional Sciences
of the Christian-Albrechts-Universität Kiel
by
M.Sc. Iulia Georgiana Blaj
born in Suceava, Romania
Kiel, 2019
Dean: Prof. Dr. Dr. Christian Henning
1. Examiner: Prof. Dr. Georg Thaller
2. Examiner: Prof. Dr. Jens Tetens
Day of oral examination: 26th of June 2019
The dissertation was supported by a grant from the German Research
Foundation (Deutsche Forschungsgemeinschaft, DFG).
Table of contents
General introduction
Chapter 1
Genome-wide association studies and meta-analysis uncovers new candidate genes for growth and carcass traits in pigs
Chapter 2
Non-additive effects in four diverse F2 pig crosses for growth, carcass and fat related traits
Chapter 3
GWAS for meat and carcass traits using imputed sequence level genotypes in pooled F2-designs in pigs
Chapter 4
Recombination landscape in multiple F2 pig crosses between genetically diverse founder breeds
Chapter 5
A systematic survey of sequence data variation in the founder individuals of four F2 pig crosses
General discussion
General summary
Allgemeine Zusammenfassung (General summary in German)
Appendix
General introduction
F2 resource populations are the basis of many valuable findings in biology, medicine, and
agriculture. The use of this experimental population structure began with the pioneering work
of Gregor Mendel in the mid-19th century, who derived the fundamental laws of inheritance
from visual inspection of plants (Mendel, 1866). Throughout the early and mid-20th century,
few resource populations were established, and those had limited success in detecting
genomic regions associated with a phenotype, i.e. quantitative trait loci (QTL). The major
constraint was the lack of segregating markers (Weller, 2009). In the last four
decades, however, several discoveries removed this constraint, offering unprecedented access to
deoxyribonucleic acid (DNA) level information. Noteworthy milestones are the detection of the
first DNA polymorphisms (namely the restriction fragment length polymorphism by Grodzicker
et al. in 1974 and microsatellites by Mullis et al. in 1986) and, in 1995, the arrival of the most
prevalent genetic marker: the single nucleotide polymorphism (SNP) (Brookes, 1999).
The pig (Sus scrofa) has been vitally important for human development for thousands of years,
since the domestication process started around 10,000 years ago (Larson et al., 2007). Human
interference played a great role in shaping the genome of this major livestock species by
influencing processes such as selection, demographic history and gene flow (Groenen, 2016).
Insights into the genotype-phenotype relationship can be obtained by means of mapping
populations (e.g. F2 intercrosses) that have been efficiently developed in the last decades and
used to map QTLs (Rothschild et al., 2007). The initial way of exploiting such population
structure was by establishing the relationship between the phenotypes of the F2 individuals and
their genetic markers (mostly microsatellites) via linkage mapping.
The first QTL study in pigs considered a Wild boar x Large white F2 experimental population
and aimed at identifying QTLs responsible for fat deposition and growth (Andersson et al.,
1994). A significant locus identified on chromosome 4 explained 20% of the phenotypic
variance for abdominal and back fat. Many other QTL studies followed and most of them were
the result of F2 experimental designs produced by crossing breeds from genetically divergent
lineages. The Pig QTL database is a comprehensive catalogue of curated results from QTL
mapping experiments of the last 25 years (Hu et al., 2018). Initially, finding
genetic markers associated with loci influencing traits of interest led to the transfer of this
information into breeding programs by means of marker-assisted selection (Fernando and
Grossman, 1989). This methodology provided the basis for a groundbreaking development
named genomic selection, in which high-density markers covering the entire genome are used
so that every QTL is in linkage disequilibrium with at least one marker (Meuwissen
et al., 2001). Mapping populations coupled with technological progress, specifically the
Illumina PorcineSNP60 (Ramos et al., 2009), were thus building blocks for the implementation
of genomic selection in pig breeding.
The present research investigates the potential of four existing F2 pig populations. Specifically,
it aims to demonstrate how the availability of tens of thousands of SNP markers and whole
genome sequence (WGS) data can raise and answer questions in the context of both population
and quantitative genomics. Firstly, a brief description of the F2 resource population is given.
The largest experimental design considered was established by Borchers et al. (2000) and
originates from Piétrain (P) boars and Large white x Landrace (Lw x L) crossbred or Large
white (Lw) sows. The remaining three populations developed by Geldermann et al. (1996) are
based on a European breed (Piétrain), an Asian breed (Meishan, M), and the European pig
ancestor, the Wild boar (W). Specifically, the populations are M x P, W x P and W x M. The
well-characterized F2 generation is the outcome of repeatedly crossing F1 boars with F1 sows
in order to obtain large full sib families. These populations, together with their phenotypic
(growth, carcass and fat related), genotypic (SNP array), and sequence data, provide the basis of
the statistical and exploratory analyses carried out in each chapter of this thesis, as shown in Figure 1.
Data type    F0             F1             F2
Phenotype    -              -              Growth, carcass and fat traits
SNP array    PorcineSNP60   PorcineSNP60   PorcineSNP60
Sequence*    ~20x           ~1x            Imputed to sequence

Figure 1. Scheme of the F2 crosses with phenotypic and genomic data available per
generation. Data usage in the thesis for chapters 1 to 5. *High coverage sequence data = 19x,
low coverage sequence data = 1x.
Genome-wide association studies (GWAS) are widely used in the genetic dissection of
complex or quantitative traits and, in contrast to linkage mapping, they rely on linkage disequilibrium (LD) and reflect
historical recombination events (Goddard and Hayes, 2009). The potential of a GWAS is highly
influenced by the sample size of the experiment. Therefore, an increase in sample size implies
a higher statistical power. Chapter 1 demonstrates that a collective investigation of data
(phenotype and SNP array genotypes) from three of the F2 resource populations, with the
Piétrain breed as a common founder, is advantageous, offering an increased mapping resolution. While this
chapter considers the additive effects of the genetic markers, the following work in Chapter 2
investigates dominance and imprinting effects based on SNP array data in the four F2 crosses,
each considered separately. The availability of sequence data provides further precision when
conducting a GWAS. High coverage sequenced founders F0 and low coverage sequenced F1s
facilitated the imputation step leading to F2 WGS individuals. Therefore, Chapter 3 conducts
an association study employing such sequence level information on the pooled imputed F2.
The resource populations comprise three generations and were established with the main
objective of discovering phenotype to genotype connections in the F2. Nevertheless, additional
layers of information, often disregarded, exist in the grandparental (F0) and parental
(F1) generations. Chapter 4 covers investigations on recombination events, a process involved in
maintaining genetic variability and the evolution of genomes. The recombination landscape is
characterized on the basis of the maps built in the F1 generation. Here crossover calls can be
appropriately inferred in each parent-child pair (F1-F2) due to the high numbers of full sibs
allocated to an F1 parent. Finally, the research in Chapter 5 is motivated by the fact that almost
all the genomic variation that exists at the F2 level (used normally for GWAS) is being
propagated from the F0 generation. Thus, a reverse genetics approach explores the WGS
information in the founders. Employing various bioinformatics tools, a pooled and a breed-
based analysis of the F0 individuals was conducted and finally both population and breed
specific (e.g. related to the olfactory receptor gene family) conclusions were derived.
A general discussion concludes the thesis. The topics covered mostly go beyond the
above-mentioned chapters but have implications for their outcomes. In particular, the following topics
are considered: the release of the latest pig reference genome, recombination rates and gene
density, pooling data from several F2 crosses for QTL mapping purposes and finally the shift
from SNP array to WGS level data that paves the way into the field of big data.
References
Andersson, L., Haley, C. S., Ellegren, H., Knott, S. A., Johansson, M., et al. (1994). Genetic
mapping of quantitative trait loci for growth and fatness in pigs. Science, 263(5154),
1771-1774.
Borchers, N., Reinsch, N., & Kalm, E. (2000). Familial cases of coat colour change in a
Piétrain cross. Journal of Animal Breeding and Genetics, 117(4), 285-287.
Brookes, A. J. (1999). The essence of SNPs. Gene, 234(2), 177-186.
Fernando, R. L., & Grossman, M. (1989). Marker assisted selection using best linear unbiased
prediction. Genetics Selection Evolution, 21(4), 467.
Geldermann, H., Müller, E., Beeckmann, P., Knorr, C., Yue, G., & Moser, G. (1996). Mapping
of quantitative trait loci by means of marker genes in F2 generations of Wild boar,
Piétrain and Meishan pigs. Journal of Animal Breeding and Genetics, 113(1-6), 381-
387.
Goddard, M. E., & Hayes, B. J. (2009). Mapping genes for complex traits in domestic animals
and their use in breeding programs. Nature Reviews Genetics, 10, 381–391.
Groenen, M. A. (2016). A decade of pig genome sequencing: a window on pig domestication
and evolution. Genetics Selection Evolution, 48(1), 23.
Grodzicker, T., Williams, J., Sharp, P., & Sambrook, J. (1974). Physical mapping of
temperature-sensitive mutations of adenoviruses. Cold Spring Harbor Symposia on
Quantitative Biology, 39, 439-446.
Hu, Z. L., Park, C. A., & Reecy, J. M. (2018). Building a livestock genetic and genomic
information knowledgebase through integrative developments of Animal QTLdb and
CorrDB. Nucleic Acids Research, 47(D1), D701-D710.
Larson, G., Albarella, U., Dobney, K., Rowley-Conwy, P., Schibler, J., et al. (2007). Ancient
DNA, pig domestication, and the spread of the Neolithic into Europe. Proceedings of
the National Academy of Sciences, 104(39), 15276-15281.
Mendel, G. (1866). Versuche über Pflanzenhybriden. Verhandlungen des naturforschenden
Vereines in Brünn, Bd. IV für das Jahr 1865, Abhandlungen, 3-47.
Meuwissen, T. H., Hayes, B. J., & Goddard, M. E. (2001). Prediction of total genetic value
using genome-wide dense marker maps. Genetics, 157(4), 1819-1829.
Mullis, K., Faloona, F., Scharf, S., Saiki, R. K., Horn, G. T., et al. (1986). Specific enzymatic
amplification of DNA in vitro: the polymerase chain reaction. Cold Spring Harbor
Symposia on Quantitative Biology, 51, 263-273.
Ramos, A. M., Crooijmans, R. P. M. A., Affara, N. A., Amaral, A. J., Archibald, A. L., et al.
(2009). Design of a high density SNP genotyping assay in the pig using SNPs identified
and characterized by next generation sequencing technology. PLoS ONE, 4(8), e6524.
Rothschild, M. F., Hu, Z. L., & Jiang, Z. (2007). Advances in QTL mapping in pigs.
International Journal of Biological Sciences, 3(3), 192.
Weller, J. I. (2009). Quantitative trait loci analysis in animals. CABI.
Chapter 1
Genome-wide association studies and meta-analysis uncovers
new candidate genes for growth and carcass traits in pigs
Iulia Blaj¹, Jens Tetens², Siegfried Preuß³, Jörn Bennewitz³ and Georg Thaller¹
¹ Institute of Animal Breeding and Husbandry, Kiel University, Kiel, Germany
² Functional Breeding Group, Department of Animal Sciences, Göttingen University,
Göttingen, Germany
³ Institute of Animal Husbandry and Breeding, University of Hohenheim, Stuttgart, Germany
Published in PLoS One
Abstract
Genome-wide association studies (GWAS) have been widely used in the genetic dissection of
complex traits. As more genomic data is being generated within different commercial or
resource pig populations, the challenge that arises is how to investigate the data collectively
in order to increase the sample size and, implicitly, the statistical power. This study
performs an individual population GWAS, a joint population GWAS and a meta-analysis in
three pig F2 populations. D1 is derived from European type breeds (Piétrain, Large White and
Landrace), D2 is obtained from an Asian breed (Meishan) and Piétrain, and D3 stems from a
European Wild Boar and Piétrain, which is the common founder breed. The traits investigated
are average daily gain, backfat thickness, meat to fat ratio and carcass length. The joint and the
meta-analysis did not identify additional genomic clusters besides the ones discovered via the
individual population GWAS. However, the benefit was an increased mapping resolution which
pinpointed narrower clusters harboring causative variants. The joint analysis identified a
higher number of clusters as compared to the meta-analysis; nevertheless, the significance
levels and the number of variants in the meta-analysis were generally higher. Both types of
analysis had similar outputs suggesting that the two strategies can complement each other and
that the meta-analysis approach can be a valuable tool whenever access to raw datasets is
limited. Overall, a total of 20 genomic clusters were identified on chromosomes 2, 7 and 17,
many confirming previously identified quantitative trait loci. Several new candidate genes are
being proposed and, among them, a strong candidate gene to be taken into account for
subsequent analysis is BMP2 (bone morphogenetic protein 2).
Background
In pig breeding, the search for quantitative trait loci (QTLs) and the underlying causative
mutations has been in progress for more than two decades. The onset was the landmark
publication on genetic mapping of QTL for growth and fatness by [1]. To date, according to
the latest release of the AnimalQTLdb (Release 35, 29th April, 2018), the Pig QTL database
stores 27,465 pig QTLs curated from 620 publications and representing a wide range of
economically important phenotypes (PigQTLdb;
https://www.animalgenome.org/cgi-bin/QTLdb/SS/summary).
In the beginning of pig QTL experiments, the mapping was carried out using crosses between
outbred lines and the statistical test employed was linkage analysis. This approach has proven
to be efficient for pinpointing numerous QTLs in pigs [2, 3]. However, those studies had
reduced mapping resolution and statistical power due to several factors, such as the limited number
of individuals and genetic markers and the use of linkage analysis, which considers only recent
recombination events.
The release of the Illumina PorcineSNP60 Beadchip [4] represents a key advance in
overcoming the above-mentioned impediments. The SNP array facilitates the implementation
of genome-wide association studies (GWAS) in which the historical recombination events are
taken into account when reflecting the associations between markers and phenotypes. To
surpass the limitation given by small sample size in some mapping experiments, analyzing
several F2 resource populations jointly has been proven to be a suitable approach [5]. Via
stochastic simulations, [5] demonstrated that pooling data from multiple F2 populations can
increase power and mapping precision compared to single association analysis under the
scenario in which the crosses share at least one common founder breed. Thus, one approach to
combine data from several populations is a joint population GWAS (JA) which considers a
single dataset comprising the merged information from the individual population level. This
approach requires access to the complete original dataset (i.e. genotypes and phenotypes).
Another strategy for combining information from multiple genetic mapping studies is the
meta-analysis (MA) of the GWAS summary statistics. This method can increase the detection power
and reduce false-positive findings [6] while also efficiently accounting for population
substructure and study-specific covariates [7]. The MA is widely used in human genetics,
where access to the original datasets is usually limited due to privacy protection policies. In the
last years, the meta-analysis has been employed for pig association studies in an effort to
maximize the use of available genomic information from commercial or experimental pig
populations [8-10].
The current study considers three pig F2 resource populations which share a common founder
breed [11, 12]. The connecting breed, Piétrain, is an extensively used sire line in pig breeding.
Given the constant demand for improving traits related to growth and carcass composition,
transferring knowledge from genome-wide association study results into practice is of utmost
importance. For this purpose, the classical G-BLUP (genomic best linear unbiased prediction)
statistical framework, used for predicting genomic estimated breeding values, was extended to
incorporate prior information on QTLs and related biological knowledge via the genomic
feature BLUP (GF-BLUP) model [13]. The proposed model is an extension of the linear mixed
model used in standard G-BLUP which includes additional genetic effects, previously
unraveled by association studies. According to [13] the GF-BLUP can contribute to the
prediction accuracy improvement in genomic selection schemes. Considering the
above-mentioned reasoning, the aim of the study was twofold. Firstly, to conduct a joint design
analysis (JA) and a meta-analysis (MA) of the three pig F2 resource populations and compare
the results yielded by these different approaches. Secondly, to identify candidate genes in
genomic regions associated with average daily gain, backfat thickness, meat to fat ratio and
carcass length.
Materials and methods
Description of resource populations
The three-generation experimental populations comprise a total of 2,380 animals. The designs
were established three decades ago by [14] and [15]. For this study, blood samples were
available from which DNA was extracted for genotyping purposes. [14] and [15] characterized
the populations in detail, so they are only described briefly here. The first resource population
(D1) considered for this study has 1,785 individuals. It was obtained from five purebred Piétrain
(P) boars (all of them homozygous stress resistant) and one Large White (LW) and six crossbred
sows Landrace (L) x Large White. Large F2 families were generated by repeatedly crossing
seven F1 boars to full sib F1 sows. The second population (D2) is composed of 304 pigs
stemming from mating one Meishan (M) boar with eight Piétrain sows. The third population
(D3) with 291 individuals had as founders a European Wild Boar (WB) crossed with nine
Piétrain sows. Three of the Piétrain sows were common among the latter two families. For both
the D2 and the D3, the F2 individuals were the result of two or three F1 boars mated with F1
sows. Generally, each sow had two litters from different boars. The Piétrain founder females
were homozygous stress susceptible and the Meishan and wild boar males were homozygous
stress resistant.
Phenotypic trait data
For the current study the following growth and carcass composition traits were considered:
average daily gain (ADG), back fat thickness (BFT), meat to fat ratio (MFR) and carcass length
(CRCL). The ADG [g] is the daily weight gain in the fattening period; the BFT [mm] is
calculated as the average of three measurements (shoulder fat depth, back fat depth and loin fat
depth); the MFR [ratio] is the fat area in relation to the meat area at the 13th/14th rib; and CRCL [cm]
is measured from the first cervical vertebra to the pubic symphysis. The methods of
measurement and the calculations employed for D1, D2 and D3 were in conformity with the
performance testing directive of the Central Association for German Pig Production [16, 17].
Table 1 contains a brief description of the contributing F2 designs and the summary of the
phenotypic data indicating mean and standard deviation (SD) for each trait. The animals were
slaughtered at 211.01 ± 22.3 days, 211.73 ± 6.92 days and 210.63 ± 3.22 days for D1, D2 and D3,
respectively.
Table 1. Description of the experimental F2 populations and investigated traits.

Design                    Males F0   Females F0
D1: P x (L x LW)/LW (a)   5          7
D2: M x P (b)             1          8
D3: WB x P (c)            1          9

Trait (e)     Design   Mean (SD)        Min - max         N (d)
ADG [g]       D1       675.9 (92.74)    311 - 1039        1769
              D2       590.1 (130.03)   174.0 - 951.0     304
              D3       527.4 (109.17)   125.0 - 790.0     291
BFT [mm]      D1       27.49 (3.84)     16 - 42.3         1766
              D2       27.89 (6.73)     8.70 - 46.00      304
              D3       22.8 (4.97)      10.3 - 40.0       291
MFR [ratio]   D1       0.38 (0.11)      0.14 - 0.85       1765
              D2       0.7248 (0.22)    0.28 - 1.39       304
              D3       0.516 (0.08)     0.19 - 1.07       289
CRCL [cm]     D1       100 (2.93)       91 - 111          1765
              D2       91.33 (6.12)     63.50 - 106.00    304
              D3       79.85 (5.20)     62.50 - 94.00     291

(a) Piétrain x (Landrace x Large White)/Large White; (b) Meishan x Piétrain; (c) Wild Boar x Piétrain; (d) Number of
individuals with phenotypic data; (e) Average daily gain, backfat thickness, meat to fat ratio, carcass length.
Genotyping and quality control
The F2 individuals were genotyped with Illumina PorcineSNP60 BeadChip (61,565 SNPs).
SNP chromosomal positions were based on the current pig genome assembly (Sus scrofa build
11.1, provided by the Swine Genome Sequencing Consortium on NCBI). Genotypes were filtered
with respect to the following quality control (QC) criteria: i) removing SNPs with a minor allele
frequency less than 5% and ii) excluding individuals and SNPs with call rates lower than 90%.
Quality control was carried out using Plink [18]. Only the autosomal chromosomes
were considered further. The final set consisted of 44,457 SNPs in D1 design, 40,738 SNPs in
D2 design, 37,145 in D3 design and 31,299 SNPs in the joint design (D1D2D3). The latter was
obtained by merging common SNPs from D1, D2 and D3 after the QC step. In addition, the
RYR1:g.1843C>T [19] mutation status was available for the individuals in D2 and D3.
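
The two QC criteria can be illustrated with a minimal sketch. It is not the Plink procedure used in the study: the function name `qc_filter`, the 0/1/2 genotype coding with `nan` for missing calls, and the choice to evaluate all criteria on the unfiltered matrix are assumptions made for illustration only.

```python
import numpy as np

def qc_filter(genotypes, maf_min=0.05, call_rate_min=0.90):
    """Return boolean masks (keep_individuals, keep_snps).

    genotypes: individuals x SNPs array coded 0/1/2, np.nan = missing call
    (hypothetical coding chosen for this sketch).
    """
    g = np.asarray(genotypes, dtype=float)
    called = ~np.isnan(g)
    ind_call_rate = called.mean(axis=1)   # per individual
    snp_call_rate = called.mean(axis=0)   # per SNP
    with np.errstate(invalid="ignore", divide="ignore"):
        # Frequency of the counted allele among non-missing genotypes.
        freq = np.nansum(g, axis=0) / (2.0 * called.sum(axis=0))
        maf = np.minimum(freq, 1.0 - freq)
        keep_snp = (snp_call_rate >= call_rate_min) & (maf >= maf_min)
    keep_ind = ind_call_rate >= call_rate_min
    return keep_ind, keep_snp

# Toy example: 4 individuals x 3 SNPs.
toy = np.array([
    [0.0, 0.0, 1.0],
    [1.0, 0.0, np.nan],
    [2.0, 0.0, np.nan],
    [1.0, 0.0, 1.0],
])
keep_ind, keep_snp = qc_filter(toy)
```

In the toy matrix the second SNP is monomorphic and the third has a call rate of 50%, so only the first SNP is retained; the two individuals with a missing call fall below the 90% call rate.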
Persistence of linkage disequilibrium phase
The extent of linkage disequilibrium (LD) in the D1, D2, D3 and D1D2D3 populations was
characterized in detail by [20]. Of further interest for the joint analysis was to examine how
consistent the LD phase is across the designs. The statistical parameter chosen for the LD
measurement was r² [21], the squared correlation coefficient between SNP pairs. A total of
31,299 SNPs common to the individual populations and the joint data set were used to
compute the r² values. Using Plink, r² was obtained for all SNP pairs located less than 5 Mb
apart. The average r² values of the SNP pairs were calculated in 100 Kb classes of inter-marker
distance, starting with the interval [0-100] Kb up to [4900-5000] Kb, and finally used to
compute the correlation of phase between two populations according to the formula [22]:
\[
R_{D_k,D_{k'}} = \frac{\sum_{(i,j)\in p}\left(r_{ij(D_k)} - \bar{r}_{(D_k)}\right)\left(r_{ij(D_{k'})} - \bar{r}_{(D_{k'})}\right)}{S_{(D_k)}\,S_{(D_{k'})}}
\]
where $R_{D_k,D_{k'}}$ is the correlation of phase between $r_{ij(D_k)}$ in population $D_k$ and $r_{ij(D_{k'})}$ in
population $D_{k'}$, $S_{(D_k)}$ and $S_{(D_{k'})}$ are the standard deviations of $r_{ij(D_k)}$ and $r_{ij(D_{k'})}$, respectively,
and $\bar{r}_{(D_k)}$ and $\bar{r}_{(D_{k'})}$ denote the average $r_{ij}$ across all SNPs $i$ and $j$ within interval $p$ for
$D_k$ and $D_{k'}$, respectively. The $R_{D_k,D_{k'}}$ estimate was evaluated for the six population pairs:
D1-D2, D1-D3, D1-D1D2D3, D2-D3, D2-D1D2D3 and D3-D1D2D3.
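
A minimal sketch of this computation is given below. It assumes that the signed LD values $r_{ij}$ of the same SNP pairs are already available for two populations and evaluates a Pearson correlation per distance class. The function name, the input layout and the half-open binning are illustrative assumptions; the formula as printed carries no explicit normalization by the number of SNP pairs, whereas `np.corrcoef` normalizes, so the sketch should be read as the conventional Pearson form rather than a literal transcription.

```python
import numpy as np

def correlation_of_phase(r_a, r_b, dist_kb, bin_kb=100, max_kb=5000):
    """Pearson correlation of signed LD (r) between two populations,
    computed separately within each inter-marker distance class.

    r_a, r_b : r values for the same SNP pairs in populations A and B
    dist_kb  : physical distance of each SNP pair in Kb
    Returns {(lower_kb, upper_kb): correlation}.
    """
    r_a = np.asarray(r_a, dtype=float)
    r_b = np.asarray(r_b, dtype=float)
    dist = np.asarray(dist_kb, dtype=float)
    result = {}
    for lower in range(0, max_kb, bin_kb):
        upper = lower + bin_kb
        in_bin = (dist >= lower) & (dist < upper)  # half-open class (assumption)
        if in_bin.sum() < 2:
            continue  # a correlation needs at least two SNP pairs
        a, b = r_a[in_bin], r_b[in_bin]
        if a.std() == 0 or b.std() == 0:
            continue  # undefined when one population shows no variation
        result[(lower, upper)] = float(np.corrcoef(a, b)[0, 1])
    return result

# Toy example: phase fully preserved in the first class, reversed in the second.
phase = correlation_of_phase(
    r_a=[0.1, 0.5, 0.9, 0.3, -0.3],
    r_b=[0.05, 0.25, 0.45, -0.2, 0.2],
    dist_kb=[10, 50, 90, 150, 160],
)
```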
Estimating genetic variance and genetic correlations
A general linear model was used to pre-adjust the animal phenotypes with the following fixed
effects: sex, stable and slaughter month class (i.e. 15 classes for D1, 6 classes for D2 and 8
classes for D3). The phenotype pre-adjustment analysis was carried out with R [23]. For D2
and D3 the RYR1 status was also incorporated as fixed effect for ADG, BFT and MFR traits.
Weight at slaughter was included as a covariate for all traits except for ADG. The residual
values of the phenotypes were subsequently used. The Genome-wide Complex Trait Analysis
tool (GCTA) [24] was utilized for estimating the variance components and the genetic
correlations. The genomic relationship matrix (GRM) was calculated between all pairs of
individuals using all the autosomal SNPs. Applying a univariate analysis, the variance of the
traits was partitioned using restricted maximum likelihood (REML) into an additive genetic
and a residual component. Following the classical definition of narrow-sense heritability, the
SNP-based heritability was obtained via $h^2_{SNP} = \sigma^2_{SNP}/\sigma^2_P$, representing the proportion of
phenotypic variance ($\sigma^2_P$) explained by the additive effects of the common SNPs on the chip
array and/or by the unknown causal variants correlated with the SNPs. A bivariate GREML
analysis was used to assess the genetic correlations between traits.
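
As an illustration of the two quantities involved, the sketch below builds a GRM from standardized genotypes and forms the heritability ratio from two variance components. The GRM definition (allele counts centred by $2p$ and scaled by $\sqrt{2p(1-p)}$, averaged over SNPs) is the commonly used one and is assumed here, since the text only states that GCTA was used. The function names are hypothetical, and the REML estimation itself is not reproduced.

```python
import numpy as np

def genomic_relationship_matrix(genotypes):
    """GRM from an individuals x SNPs matrix coded 0/1/2 without missing data.

    Assumes monomorphic SNPs were already removed (e.g. by a MAF filter),
    otherwise the scaling term would be zero.
    """
    x = np.asarray(genotypes, dtype=float)
    p = x.mean(axis=0) / 2.0                      # allele frequency per SNP
    z = (x - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p))
    return z @ z.T / x.shape[1]                   # average over SNPs

def snp_heritability(var_snp, var_residual):
    """h2_SNP = sigma2_SNP / sigma2_P with sigma2_P = sigma2_SNP + sigma2_e."""
    return var_snp / (var_snp + var_residual)

# Toy example: 3 individuals x 2 SNPs.
grm = genomic_relationship_matrix([[0, 2], [2, 0], [1, 1]])
h2 = snp_heritability(var_snp=0.36, var_residual=0.64)
```

In the toy example the first two individuals carry opposite homozygous genotypes at both SNPs and therefore obtain a strongly negative relationship coefficient, while the double heterozygote sits at the population mean.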
GWAS and meta-analysis
Single marker association tests were performed using the GCTA tool for the individual
populations (i.e. D1, D2 and D3). A mixed linear model analysis including the candidate SNP
(MLMi) was set up as follows: $y^* = xb + u + e$, where $y^*$ is the phenotype corrected for
systematic environmental effects and genetic effects (i.e. the RYR1 status for D2 and D3), $x$ is
the vector of marker genotypes, $b$ is the additive effect size (fixed effect) of the candidate SNP to
be tested for association, $u$ is the random polygenic effect, whose covariance structure is given by the
genomic relationship matrix (GRM), and $e$ is the residual effect.
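
For orientation, the distributional assumptions usually attached to a model of this form can be written out as below. This is the standard mixed linear model formulation and is not taken from the source; the symbols $G$ (the GRM), $\sigma^2_u$ (polygenic variance) and $\sigma^2_e$ (residual variance) are introduced here as assumptions.

```latex
\begin{aligned}
y^{*} &= x b + u + e, \\
u &\sim N\!\left(0,\; G\,\sigma^{2}_{u}\right), \qquad
e \sim N\!\left(0,\; I\,\sigma^{2}_{e}\right), \\
\operatorname{Var}(y^{*}) &= G\,\sigma^{2}_{u} + I\,\sigma^{2}_{e}.
\end{aligned}
```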
For the joint design (D1D2D3), the same MLMi was used. The 𝑦∗, in this case, consisted of
pooled pre-corrected phenotypes from D1, D2 and D3. From the unique genotype file,
constructed based on the merged common SNPs among the three populations, the GRM was
assembled and then the marker genotypes tested for association. Given the fact that D1D2D3
contains strong familial relatedness (due to full-sib families) and weak population stratification,
observed in a multidimensional scaling analysis by [20], the mixed linear model analysis should
be efficient in capturing sample structure via the GRM as the random effect included in the
model [25]. Nevertheless, a fixed effect with three classes representing the three individual
designs was added to the model to account for the diverse genetic backgrounds.
A meta-analysis aims to statistically combine the information from multiple independent
studies and thereby increase the power and reduce false-positive results [6]. Among the
several approaches to conducting a meta-analysis, the fixed effects meta-analysis is the most
powerful method and, within this group, the inverse variance based strategy is predominant [6].
This strategy was employed for synthesizing the association study summary statistics for the
common variants of the D1, D2 and D3 populations. Specifically, using the METAL software [7],
each study was weighted according to the inverse of its squared standard error, resulting in
newly derived effect size and standard error estimates that were further used for calculating an overall
Z score and finally the overall p values.
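
The weighting scheme described above can be sketched for a single SNP as follows. This is a generic inverse variance weighted fixed effects computation, not METAL's code; the function name and the example values are hypothetical, and a two-sided normal p value is assumed.

```python
import math

def inverse_variance_meta(betas, standard_errors):
    """Combine per-study effect estimates for one SNP.

    Each study is weighted by 1 / SE^2. Returns (beta, se, z, p).
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    z = beta / se
    # Two-sided p value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, z, p

# Toy example: two studies with identical estimates and precision.
beta, se, z, p = inverse_variance_meta(betas=[1.0, 1.0], standard_errors=[1.0, 1.0])
```

With two equally precise studies the combined effect equals the common estimate, while the standard error shrinks by a factor of $\sqrt{2}$, which is the source of the gain in power.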
Manhattan plots for the D1, D2, D3 and D1D2D3 GWAS as well as for the MA results were created
via the qqman R package [26]. Using Bonferroni correction, the genome-wide significance line
was set to $p_{genome-wide} \le 0.05$. Because Bonferroni correction is stringent, an
additional nominal significance level was used, with the threshold set to $p \le 5 \times 10^{-5}$.
The R package qvalue [27] facilitated the calculation of the false discovery rate (FDR)
$q$ value for each association test. The FDR $q$ value of the significant SNP with the largest $p$
value provided an assessment of the proportion of false positives among the significant SNPs.
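
To make the two significance lines concrete, the short sketch below converts them to the $-\log_{10}$ scale of a Manhattan plot. Using the 31,299 SNPs of the joint design as the number of tests is an assumption for illustration; each individual design has its own SNP count and therefore its own Bonferroni line.

```python
import math

def bonferroni_threshold(alpha, n_tests):
    """Per-test p value threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

n_snps = 31299                      # SNP count of the joint design (illustrative choice)
genome_wide = bonferroni_threshold(0.05, n_snps)
nominal = 5e-5

genome_wide_line = -math.log10(genome_wide)   # height of the genome-wide line
nominal_line = -math.log10(nominal)           # height of the nominal line
```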
Clusters incorporating strong evidence for trait-associated chromosomal regions were defined
based on the LD structure and the significant SNPs, similar to [20]. A cluster enclosed a
minimum of two genome-wide significant SNPs with a maximum distance of 2 Mb between
them. From the center point of the initially defined cluster, the upper and the lower boundaries
were assigned to the last nominally significant variant situated at a maximum of 1 Mb in both
directions. The jvenn tool [28] was used to draw Venn diagrams for all the SNPs surpassing the
nominal significance threshold to show all possible relations for each trait and between the five
different sets (D1, D2, D3, D1D2D3 and MA).
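
One possible reading of the cluster definition is sketched below for a single chromosome. Several details are assumptions, since the text leaves them open: the 2 Mb limit is applied to the gap between adjacent genome-wide significant SNPs, the center is the midpoint of the initial cluster, and the final boundaries never exclude the initial cluster. The function name and inputs are hypothetical.

```python
def define_clusters(gw_positions, nominal_positions,
                    max_gap=2_000_000, flank=1_000_000):
    """Return (lower, upper) boundaries in bp for clusters on one chromosome.

    gw_positions      : positions of genome-wide significant SNPs
    nominal_positions : positions of all nominally significant SNPs
    """
    # Step 1: chain genome-wide significant SNPs whose neighbours are <= max_gap apart.
    cores, current = [], []
    for pos in sorted(gw_positions):
        if current and pos - current[-1] > max_gap:
            if len(current) >= 2:          # a cluster needs at least two such SNPs
                cores.append(current)
            current = []
        current.append(pos)
    if len(current) >= 2:
        cores.append(current)

    # Step 2: extend each core to the outermost nominal SNP within `flank` of its center.
    clusters = []
    for core in cores:
        center = (core[0] + core[-1]) / 2
        nearby = [p for p in nominal_positions if abs(p - center) <= flank]
        lower = min(nearby + [core[0]])
        upper = max(nearby + [core[-1]])
        clusters.append((lower, upper))
    return clusters

# Toy example: one cluster around 1.0-1.5 Mb; the isolated SNP at 10 Mb forms none.
clusters = define_clusters(
    gw_positions=[1_000_000, 1_500_000, 10_000_000],
    nominal_positions=[400_000, 900_000, 1_000_000, 1_500_000,
                       2_100_000, 2_600_000, 10_000_000],
)
```

In the toy example the center lies at 1.25 Mb, so the nominal SNPs at 0.4 Mb and 2.1 Mb become the boundaries, while the one at 2.6 Mb lies more than 1 Mb from the center and is excluded.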
Exploratory analysis of clusters
The clusters identified were explored using the BioMart tool [29], the Ensembl Genes 91 database
[30] and Gene Ontology [31, 32]. The queries were carried out using the latest genome
reference, the Sscrofa 11.1 assembly (GCA_000003025.6), and the latest gene annotation
(Genebuild released in July 2017).
Results
Genotypic data and individuals qualified for the analysis
All genotyped animals passed the quality control procedure, i.e. 1,785 in D1, 304 in D2 and
291 in D3. The final autosomal number of SNPs was 44,457, 40,738 and 37,145 for D1, D2
and D3, respectively. Based on the reference genome assembly (Sscrofa 11.1) the average
physical spacing between adjacent markers was 50,872 bp in D1, 55,495 bp in D2 and 60,821
bp for D3. For both the joint design and the meta-analysis, 31,299 SNPs were used with an
average physical distance among adjacent markers of 72,161 bp.
LD and persistence of phase
The average genome-wide $r^2$ value for adjacent markers was 0.40, 0.44, 0.45 and 0.38 for D1, D2,
D3 and D1D2D3, respectively. The correlation of phase $R_{D1,D1D2D3}$ exhibited high concordance
at all marker intervals, starting with 0.95 for the interval [0:100] Kb and maintaining values
above 0.86 up to the maximum interval length considered for the analysis (Figure 1). The
remaining five design pairs had visibly lower correlation levels, ranging from 0.65 for the
D3-D1D2D3 pair to 0.31 for D2-D3 when considering the first interval [0:100] Kb. The D2 design
was the least correlated with the joint design. Among the individual design pairs (i.e. D1-D2,
D1-D3 and D2-D3), the highest persistence of phase was observed for D1-D3 and the lowest
for the D2-D3 pair.
Heritabilities and genetic correlations
The SNP-based heritabilities calculated were moderate to high, ranging from 0.31 for CRCL in
D3 to 0.74 for CRCL in D2 (Table 2). The $h^2_{SNP}$ estimates for ADG were highest in D3 and for
BFT and CRCL in D2. The values for MFR were similar in all three designs. The traits BFT
and MFR were strongly positively correlated, with a coefficient of $r_G = 0.77$ in D1, $r_G = 0.53$ in D2
and $r_G = 0.83$ in D3, while for BFT and CRCL a mild to strong negative correlation was
observed (i.e. $r_G = -0.29$ in D1, $r_G = -0.78$ in D2 and $r_G = -0.46$ in D3).
Figure 1. Correlation of phase between D1-D2, D1-D3, D1-D1D2D3, D2-D3, D2-D1D2D3
and D3-D1D2D3 populations for SNP pairs at varying distances.
Table 2. SNP-based heritabilities ($h^2_{SNP}$) on the diagonal and genetic correlations on the
lower triangle (with standard errors).

Design   Trait   ADG            BFT            MFR            CRCL
D1       ADG     0.36 (0.04)
         BFT     -0.10 (0.09)   0.45 (0.04)
         MFR     -0.12 (0.09)   0.77 (0.04)    0.52 (0.03)
         CRCL    -0.07 (0.08)   -0.29 (0.07)   -0.04 (0.07)   0.60 (0.03)
D2       ADG     0.33 (0.10)
         BFT     -0.35 (0.17)   0.73 (0.07)
         MFR     -0.41 (0.19)   0.53 (0.12)    0.47 (0.09)
         CRCL    0.42 (0.17)    -0.78 (0.06)   -0.20 (0.15)   0.74 (0.06)
D3       ADG     0.57 (0.09)
         BFT     -0.36 (0.17)   0.42 (0.09)
         MFR     -0.31 (0.16)   0.83 (0.08)    0.47 (0.09)
         CRCL    0.31 (0.20)    -0.46 (0.21)   -0.26 (0.21)   0.31 (0.10)
Variants identified in D1, D2, D3, D1D2D3 GWAS and MA
A GWAS was conducted on the D1, D2, D3 and D1D2D3 designs for four phenotypic traits
(i.e. ADG, BFT, MFR and CRCL). The summary statistics from the D1, D2 and D3 association
analyses were further used for the MA step. The global view of p values for all SNP markers of
each trait was visualized via Manhattan plots (Figures 2 and 3). Cumulated over the five
analyses (i.e. D1 GWAS, D2 GWAS, D3 GWAS, D1D2D3 GWAS and MA), for ADG a total
of 36 SNPs surpassed the nominal significance level, of which 14 SNPs were above the
genome-wide significance level. These variants were situated on Sus scrofa chromosomes
(SSC) 2, 7, 8, 13 and 16. For BFT, 299 nominally significant variants (148 of them
genome-wide significant) were identified on SSC 1, 2, 4, 5, 7, 10 and 13, and for MFR,
223 (137) SNPs were located on SSC 1, 2 and 12. Lastly, for
CRCL, 368 (193) SNPs on SSC 1, 7, 13, 16 and 17 were discovered via the GWAS and the
MA. The most significant variants in all five analyses are presented in Table 3
together with their -log10 p value and the associated q value. The genomic regions with
genome-wide significant and nominally significant SNPs (S1 Table) were assigned to a total of
20 clusters (Table 4).
The concordance of the nominally significant variants identified was assessed via Venn
diagrams (Figure 4). The meta-analysis revealed more unique SNPs associated with the traits
than the joint analysis. Specifically, five SNPs for ADG and CRCL, two SNPs for BFT
and seven SNPs for MFR were identified exclusively by the meta-analysis. One variant for
BFT, four variants for MFR and two variants for CRCL were common elements found only by
the joint and meta-analysis. While many variants from the individual population association
studies overlapped with variants from JA and MA, the individual populations showed a higher
number of unshared variants for all traits.
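The Venn-diagram comparison described above amounts to set operations on the lists of nominally significant SNPs. The following is a minimal sketch, assuming hypothetical SNP identifiers and helper names; it is not the jvenn tool cited in the study.

```python
def unique_to(target, hits):
    """SNPs significant in `target` and in no other analysis."""
    others = set().union(*(s for name, s in hits.items() if name != target))
    return hits[target] - others

def shared_only_by(group, hits):
    """SNPs significant in every analysis of `group` and in none outside it."""
    inside = set.intersection(*(hits[name] for name in group))
    outside = set().union(*(s for name, s in hits.items() if name not in group))
    return inside - outside

# Toy significant-SNP sets per analysis (identifiers are invented for illustration).
toy_hits = {
    "D1": {"snp1", "snp2", "snp3"},
    "D2": {"snp4"},
    "D3": {"snp5"},
    "JA": {"snp1", "snp2", "snp6"},
    "MA": {"snp1", "snp6", "snp7"},
}
ma_only = unique_to("MA", toy_hits)
ja_ma_only = shared_only_by(["JA", "MA"], toy_hits)
```

In this toy example `ma_only` corresponds to the variants "identified exclusively by the meta-analysis" and `ja_ma_only` to the "common elements found only by the joint and meta-analysis".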
Using Pearson's product-moment correlation, the agreement of the p values was estimated
on the common set of markers between the output from the association studies
conducted for D1, D2, D3, D1D2D3 and the meta-analysis results (S2 Table). The correlation
between JA and MA p values was 0.83 for ADG, 0.81 for BFT and 0.74 for both MFR and CRCL
(all significant with p < 0.05). The D1 p values were more highly correlated with the JA and the
MA p values than those of the D2 and D3 designs.
Figure 2. Manhattan plots of the -log10 p values for association of SNPs with ADG and
BFT in D1, D2, D3 and D1D2D3 GWAS and MA. The top horizontal line indicates the
genome-wide significance level $p_{genome-wide} \le 0.05$, and the bottom line indicates the
nominal level of significance $p_{nominal} \le 5 \times 10^{-5}$.
Figure 3. Manhattan plots of the -log10 p values for association of SNPs with MFR and
CRCL in D1, D2, D3 and D1D2D3 GWAS and MA. The top horizontal line indicates the
genome-wide significance level $p_{genome-wide} \le 0.05$, and the bottom line indicates the
nominal level of significance $p_{nominal} \le 5 \times 10^{-5}$.
Table 3. List of the highest significant SNP from all five analyses and for all four traits.
Trait  GWAS or MA  Top SNP      SSC^a  Location (bp)  -log10(p value)  q value   Total SNPs^b  Other SSC^c
ADG    D1          ALGA0123907  2      2,556,939      7.26             0.00206   16 (6)        7, 13, 16
ADG    D3          H3GA0024295  8      11,805,802     4.47             0.97386   1             -
ADG    D1D2D3      ALGA0123907  2      2,556,939      6.85             0.00224   8 (3)         7, 8
ADG    MA          ALGA0123907  2      2,556,939      8.16             0.00020   11 (5)        7
BFT    D1          ASGA0085597  2      1,083,343      10.71            3.29e-07  48 (22)       1, 10
BFT    D2          INRA0024524  7      26,069,284     13.56            1.12e-09  179 (93)      5, 13
BFT    D3          INRA0015172  4      79,915,989     5.47             0.09920   10            1, 2
BFT    D1D2D3      ASGA0085597  2      1,083,343      12.14            1.68e-08  34 (16)       1, 7
BFT    MA          ASGA0008415  2      3,895,569      14.01            2.74e-10  28 (16)       1, 7
MFR    D1          MARC0044928  2      2,494,326      21.28            1.93e-17  103 (76)      1
MFR    D2          ASGA0089068  2      3,237,229      5.35             0.16883   7             12
MFR    D3          ALGA0011643  2      7,557,050      5.70             0.04177   15            -
MFR    D1D2D3      ASGA0085597  2      1,083,343      20.84            2.89e-17  43 (22)       1
MFR    MA          ASGA0085597  2      1,083,343      24.04            1.63e-20  55 (39)       1
CRCL   D1          MARC0070553  17     15,827,832     28.68            9.25e-25  98 (47)       7, 16
CRCL   D2          DIAS0000554  7      34,166,932     14.04            3.68e-10  146 (93)      13
CRCL   D3          H3GA0004878  1      265,179,997    5.65             0.04440   11            -
CRCL   D1D2D3      ALGA0093478  17     16,919,581     17.88            4.08e-14  56 (29)       1, 7
CRCL   MA          ALGA0093478  17     16,919,581     19.08            2.59e-15  57 (26)       1, 7
^a Sus scrofa chromosome; ^b Total number of SNPs significant at a nominal level (total number of SNPs
significant at a genome-wide level); ^c Other chromosomes on which associations surpassing the nominal
significance level were detected.
Table 4. Number of genomic clusters, localization and number of significant SNPs.
Trait  GWAS or MA  Cluster number  SSC^a  Cluster boundaries (bp)  Length in Mb  Number of significant SNPs^b
ADG    D1          1               2      631,324-2,586,096        1.94          7 (5)
ADG    D1D2D3      2               2      2,556,939-2,586,096      0.03          3 (3)
ADG    MA          3               2      141,798-2,586,096        2.44          9 (5)
BFT    D1          4               2      236,179-5,189,397        4.95          31 (23)
BFT    D2          5               7      19,567,933-37,145,252    17.58         165 (92)
BFT    D1D2D3      6               2      141,798-3,895,569        3.75          13 (13)
BFT    D1D2D3      7               7      26,522,116-28,252,780    1.73          5 (2)
BFT    MA          6               2      141,798-3,895,569        3.75          17 (15)
MFR    D1          8               2      70,140-13,307,467        13.24         98 (76)
MFR    D1D2D3      6               2      141,798-3,895,569        3.75          18 (16)
MFR    D1D2D3      9               2      7,536,991-8,647,689      1.11          6 (2)
MFR    D1D2D3      10              2      12,168,412-13,294,789    1.13          11 (3)
MFR    MA          11              2      141,798-10,254,380       10.11         37 (30)
MFR    MA          12              2      12,275,048-13,294,789    1.02          9 (8)
CRCL   D1          13              7      97,195,350-99,887,568    2.69          15 (10)
CRCL   D1          14              17     12,361,530-19,474,175    7.11          42 (27)
CRCL   D1          15              17     21,832,087-23,773,788    1.94          11 (10)
CRCL   D2          16              7      19,567,933-36,795,710    17.23         135 (92)
CRCL   D1D2D3      17              7      23,659,424-26,522,116    2.86          8 (5)
CRCL   D1D2D3      18              7      97,147,161-99,424,987    2.28          9 (7)
CRCL   D1D2D3      19              17     13,692,477-19,474,175    5.78          21 (16)
CRCL   MA          20              7      97,147,161-99,491,117    2.34          10 (7)
CRCL   MA          19              17     13,692,477-19,474,175    5.78          27 (17)
^a Sus scrofa chromosome; ^b Total number of SNPs significant at a nominal level (total number of SNPs
significant at a genome-wide level).
Figure 4. Venn diagram displaying common variants identified for the four traits via the
five analyses (i.e. D1, D2, D3, D1D2D3 GWAS and MA).
Discussion
Linkage disequilibrium between markers and quantitative trait loci is fundamental for
conducting a successful genome-wide association study. In order to disentangle variants
associated with complex traits, the LD pattern in the populations under investigation must be
evaluated. This particular analysis was carried out by [20] for the D1, D2 and D3 populations
included in this study. The main findings were that LD decays faster in the cross of
European-type breeds (D1) than in the crosses of Asian or Wild Boar with European breeds (D2
and D3), while the fastest breakdown of LD is observed when the data are pooled. The latter
finding supports the view that the joint design (D1D2D3) could have a positive impact on the
mapping resolution. The results of [33] and [5], obtained via stochastic simulations of
populations with a phylogeny similar to that of D1, D2 and D3, are in accordance with this study.
The linkage phases between SNPs and the causative mutations that underlie the detected QTL
are not always identical across populations. The D1, D2 and D3 F2 populations are established
from genetically divergent breeds of Asian (Meishan) and European origin (Landrace, Large
White, Piétrain – the common founder breed) as well as the European Wild Boar ancestor.
Introgression of Asian pigs into European stocks during the 18th and 19th centuries has been
well documented and led to the existence of Asian haplotypes within the European
commercial breeds for traits such as backfat and litter size [34]. Therefore, the premise of
having shared QTLs for some of the investigated traits among the populations is supported.
Considering this, an additional way to assess the feasibility of conducting a joint analysis
is to examine how consistent the LD phase in the individual designs is with the LD
phase in the joint dataset. Across all population pairs (i.e. D1-D2, D1-D3, D1-D1D2D3,
D2-D3, D2-D1D2D3 and D3-D1D2D3), the phase correlation decreased with increasing marker
distance (Figure 1). In other words, the shorter the chromosomal segment, the greater
the chance that the LD phase is similar. As longer distances are considered, there is a higher
chance that recombination events have disrupted the LD present in the ancestral population
while new LD has formed within the derived subpopulations [21]. The phase agreement between
D1 and D1D2D3 had correlation values ranging from 0.86 to 0.96 because 75% of the
joint design is composed of D1 individuals, whose allele frequencies prevail when
the three designs are pooled. The second most correlated pair, D3-D1D2D3, contains only
individuals of European ancestry derived from the breeds Piétrain, Large White and Landrace
and from the European Wild Boar. The individual design least correlated with the joint design
was the D2 population, which stems from Meishan (Asian) and Piétrain. Nevertheless, for
inter-marker distance classes below 500 Kb the correlation of phase was higher than 0.44,
suggesting that over shorter chromosomal segments there are LD similarities among these
populations. Among the individual population pairs, the different genetic backgrounds were
responsible for the low levels of phase agreement in D1-D2 and D3-D2.
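The persistence of phase discussed above can be illustrated as the correlation, across SNP pairs within a distance class, of the signed LD coefficient r in two populations. The sketch below makes several assumptions that are not taken from the study: signed r is approximated by the Pearson correlation of 0/1/2 allele counts, genotypes are stored as one list per SNP, and the toy data are randomly generated.

```python
import math
import random
from itertools import combinations

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def phase_correlation(geno_a, geno_b, positions, lo, hi):
    """Correlation of signed r between two populations for SNP pairs whose
    distance d (bp) satisfies lo < d <= hi. geno_x[i] holds the allele counts
    of SNP i across the animals of population x."""
    pairs = [(i, j) for i, j in combinations(range(len(positions)), 2)
             if lo < abs(positions[j] - positions[i]) <= hi]
    r_a = [pearson(geno_a[i], geno_a[j]) for i, j in pairs]
    r_b = [pearson(geno_b[i], geno_b[j]) for i, j in pairs]
    return pearson(r_a, r_b)

# Toy data (assumption): 6 SNPs spaced 40 kb apart, 50 animals per population.
random.seed(1)
positions = [k * 40_000 for k in range(6)]
pop_a = [[random.randint(0, 2) for _ in range(50)] for _ in range(6)]
pop_b = [list(snp) for snp in pop_a]
pop_b[0] = [2 - g for g in pop_b[0]]  # count the other allele at the first SNP only

same = phase_correlation(pop_a, pop_a, positions, 0, 100_000)
flipped = phase_correlation(pop_a, pop_b, positions, 0, 100_000)
```

Comparing a population with itself gives a phase correlation of one, whereas reversing the counted allele at a single SNP changes the sign of r for every pair involving that SNP and lowers the correlation. This is why a consistent allele coding across designs matters before phases are compared.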
Meta-analysis of genome-wide association study results can increase the power to detect
association signals by increasing sample size. The use of this approach grew substantially in
the genomics field in the last decade as the scientific community recognized the value of
collaborating to combine genetic resources [6, 8-10]. The output of the inverse variance based
meta-analysis strategy depends on the standard error of each SNP in each study,
because the weight assigned to each variant is calculated as the inverse of the squared
standard error. Therefore, studies with higher standard errors will have a smaller weight in the
meta-analysis. Considering the individual populations' GWAS summary statistics, the D2
population has overall higher standard errors as a result of higher phenotypic variance (Table
1). This implies that this study has a smaller weight in the MA; however, this is
compensated to a certain degree by the large effect sizes of the associated variants, particularly
for the traits BFT and CRCL, where highly significant associations were
detected. Hence, the factors influencing the meta-analysis output are the standard error and the
effect size, which greatly depend on the genetic architecture of the trait under investigation.
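The weighting described above corresponds to the standard fixed-effect inverse-variance scheme, which can be sketched for a single SNP as follows. This is an illustration under stated assumptions, not the software used in the study; the function name and the example effect sizes and standard errors are invented.

```python
import math

def inverse_variance_meta(betas, ses):
    """Fixed-effect inverse-variance weighted meta-analysis for one SNP.

    betas, ses: per-study effect estimates and their standard errors.
    Returns the pooled effect, its standard error, the z score and the
    two-sided p value under a normal approximation."""
    weights = [1.0 / se ** 2 for se in ses]        # w_i = 1 / SE_i^2
    w_sum = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / w_sum
    se = math.sqrt(1.0 / w_sum)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))         # equals 2 * (1 - Phi(|z|))
    return beta, se, z, p

# Two hypothetical studies: the second has twice the standard error,
# so it receives a quarter of the weight of the first.
beta, se, z, p = inverse_variance_meta([0.2, 0.4], [0.1, 0.2])
```

In this example the pooled estimate lies closer to the first study's effect, which mirrors the point made above: a design with larger standard errors contributes less, unless its effect sizes are large enough to compensate.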
One of the main objectives of this study was to compare the meta-analysis with the joint
analysis, in which the common individual level genotypes are combined into a single dataset
before the association study. Therefore, the agreement among the p values was assessed. The
correlation between JA and MA p values was higher than 0.7 (p < 0.05) for all traits,
suggesting that the significance levels were similar in the two analyses. Moreover, the p values
from the individual population association studies were also compared with the results from the
joint and meta-analysis. It was observed that the D1 p values were the most correlated with the
JA (strong persistence of LD phase) and the MA p values, while D2 and D3 showed low levels
of correlation. A limitation when assessing the agreement of the individual designs with the JA
and MA is that the correlation only considers variants common to all five analyses. Some
variants which could be highly associated in the individual populations might be disregarded
due to being monomorphic in the others; however, the correlation value gives a valuable
overview at a genome-wide scale of the majority of the SNPs (i.e. 31,299).
From the joint analysis summary statistics a total of eight clusters were assigned and from the
meta-analysis output, six (Table 4). Clusters 7 and 17 were identified only by the JA. Many of
the significant regions overlapped (S1 Figure) or were identified via both analyses (i.e. Clusters
6 and 19). Except for CRCL, the MA had higher significance levels of the SNPs surpassing the
nominal threshold and the clusters were supported by a higher number of variants. There were
more unique variants identified via MA (Figure 4), yet none of them surpassed the
genome-wide significance level. The size of the clusters identified was generally smaller for the
joint and meta-analysis than for the individual population clusters. This suggests that both
these approaches have a positive impact on the mapping resolution, pointing to narrower
locations of causative variants.
Genetic variance and correlations
In the current study, growth (ADG) and carcass traits (related to fatness: BFT and to anatomy:
MFR and CRCL) were investigated. Previously reported heritabilities range from 0.03-0.49
for ADG, 0.12-0.74 for BFT and 0.55-0.60 for CRCL [35]. Resources for the MFR trait are
limited, with only seven QTL listed to date in the PigQTLdb. The SNP-based
heritabilities (Table 2) were overall moderate to high and mostly in accordance with the literature,
except for CRCL in D2 ($h^2_{SNP}$ = 0.74), ADG in D3 ($h^2_{SNP}$ = 0.57) and CRCL in D3
($h^2_{SNP}$ = 0.31).
The genetic correlations were high between BFT and MFR (0.77 in D1, 0.53 in D2 and 0.83 in
D3) as both traits have a genetic architecture composed of genes involved in fat metabolism.
Cluster identification and candidate genes
A total of 20 genomic clusters were identified (Table 4 and S1 Table). They were
located on SSC2, SSC7 and SSC17. Three clusters were found for ADG, four for BFT, six for
MFR (one overlapping with a BFT cluster) and eight for CRCL in the D1 GWAS, D2 GWAS,
D3 GWAS, JA and MA. The length of the segments varied substantially, from 0.03 Mb (Cluster
2, supported by 3 significant SNPs) to 17.58 Mb (Cluster 5, supported by 165 significant SNPs).
The large size of some clusters can be attributed to high levels of LD between
SNPs, which lead to positive signals of association over large genomic regions. The
joint and meta-analysis did not reveal any new clusters that did not overlap with those
identified via the single population association studies (S1 Figure). Nevertheless, the clusters
from the joint and meta-analysis span shorter genomic regions, pointing to more precise
locations in which to identify candidate genes.
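The cluster lengths and overlaps discussed above follow directly from the cluster boundaries in Table 4. The following is a minimal sketch using four of those boundaries; the tuple layout (chromosome, start, end) and the function names are assumptions made for illustration.

```python
def length_mb(cluster):
    """Cluster length in Mb from its (chromosome, start, end) boundaries in bp."""
    _, start, end = cluster
    return (end - start) / 1e6

def overlaps(a, b):
    """True if two clusters lie on the same chromosome and share at least one position."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]

# Boundaries copied from Table 4: cluster number -> (SSC, start bp, end bp).
clusters = {
    2: (2, 2_556_939, 2_586_096),     # ADG, D1D2D3
    5: (7, 19_567_933, 37_145_252),   # BFT, D2
    13: (7, 97_195_350, 99_887_568),  # CRCL, D1
    18: (7, 97_147_161, 99_424_987),  # CRCL, D1D2D3
}
shortest = round(length_mb(clusters[2]), 2)
longest = round(length_mb(clusters[5]), 2)
d1_vs_joint = overlaps(clusters[13], clusters[18])
```

With these boundaries the rounded lengths reproduce the 0.03 Mb and 17.58 Mb reported for Clusters 2 and 5, and the CRCL cluster from D1 (Cluster 13) overlaps the shorter joint-analysis cluster (Cluster 18) on SSC7.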
The traits in this study are influenced by several genes expressed during prenatal and
postnatal development. Carcass length is a trait mostly determined prenatally and proportional
to the length of the spine, as well as the individual length of the vertebrae [36]. Average daily
gain, backfat thickness and meat to fat ratio are primarily influenced by tissue growth, which
can be achieved through cell hyperplasia (i.e. cell proliferation) or hypertrophy (i.e. growth in
size) [37].
Several known genes with previously reported functions were associated with the traits.
One of the most prominent genes, IGF2, was not assembled in the Sscrofa 10.2 genome
version but has now been positioned on the Sscrofa 11.1 reference genome. IGF2 has been
described to have an effect on muscle mass and fat deposition [38]. The region where IGF2
resides is included in six of the clusters identified for ADG, BFT and MFR (1, 3, 4, 6, 8 and
11), which are partially overlapping (S1 Figure). The most significant SNP in the vicinity of
IGF2 was identified for MFR via meta-analysis: ASGA0085597 (with -log10 (p value) = 24.04
and q value = 1.63e-20). Moreover, the partially overlapping clusters located on SSC2 (Table
4) harbor genes with growth factor activity (GO: 0008083), namely FGF3, FGF4, FGF19 and
VEGFB, as well as genes responsible for the maintenance of gastrointestinal epithelium
(GO: 0030277), namely MUC6, MUC2, MUC5AC and MUC5B. Other genes highlighted for ADG,
BFT and MFR are HRAS (positive regulation of cell proliferation GO: 0008284) and DHCR7 (lipid
metabolic process GO: 0006629). Clusters 10 and 12, found via the joint and meta-analysis,
respectively, narrowed down a region specific to MFR (SSC2: 12-14 Mb) that contains
associations for several olfactory receptor genes, a gene family known to have expanded
significantly within the pig genome [39].
On SSC7 from 19 to 38 Mb, clusters 5 and 7 showed associations with BFT in the D2
population and in the joint design, respectively. Several other studies pinpointed QTLs related
to fat traits in the same region [40, 41]. The gene PPARD was found to be a good candidate gene
for fat deposition traits [42]. One of the significant SNPs in cluster 5 (H3GA0020846 with
-log10 (p value) = 6.11 and q value = 3.5e-4) is located in an intron of this gene. Furthermore,
the critical region on SSC7 harboring the two clusters for BFT overlaps with clusters 16 and 17,
which were assigned to CRCL in D2 and in the joint analysis, respectively. The common significant
variants within these clusters for BFT and CRCL also show discordant
directions of effect. This suggests a strong interplay between BFT- and CRCL-associated
variants, leading to pleiotropic consequences on the phenotypes. It is then reasonable to assume
that the horizontal growth of the animal is already largely determined in prenatal developmental
stages by the number of vertebrae and their length, while the same or other genes act in the
opposite direction postnatally, driving the vertical growth of the backfat layer.
For that reason, the following genes, which are located in a region highly associated
with both BFT and CRCL (SSC7: 24-26 Mb), were suggested: BMP5, part of the transforming
growth factor-beta (TGF-beta) signaling pathway; HMGCLL1 (ketone body biosynthetic
process GO: 0046951); GFRAL, a receptor required for GDF15 (growth differentiation factor
15) mediated reductions in food intake and body weight in mice with obesity [43]; and HCRTR2
(feeding behavior GO: 0007631).
Clusters 13, 18 and 20, which mostly overlapped, contained significant associations for carcass
length in the D1 design, the joint analysis and the meta-analysis. This genomic region contains
genes which have already been associated with carcass length: VRTN [44], LTBP2 [45] and
TGFB3 [46]. The three genes influence the development of vertebrae and ribs in mammalian
embryos and thus have a direct influence on carcass length. Moreover, an inspection of
SSC17, which contained regions (i.e. clusters 14, 15 and 19) highly associated with CRCL in
D1, the joint and the meta-analysis, revealed genes potentially influencing this trait: PLCB1
(positive regulation of developmental growth GO: 0048639), FLRT3 (embryonic
morphogenesis GO: 0048598) and FERMT1 (positive regulation of transforming growth factor
beta receptor signaling pathway GO: 0030511). Nevertheless, the most interesting finding was
situated close to the most significant SNP for CRCL in D1 (MARC0070553 with -log10 (p
value) = 28.68 and q value = 9.25e-25) and is represented by BMP2. The bone morphogenetic
protein 2 (BMP2) belongs to the same family as BMP5 and is involved in the transforming
growth factor-beta (TGF-beta) signaling pathway, playing a role in bone and cartilage
development.
Conclusion
A genome-wide association study was conducted for growth and carcass traits using SNP-chip
information from three F2 populations sharing a common founder breed (Piétrain). An individual
population GWAS was conducted and two strategies for combining the datasets were
employed: a joint population GWAS and a meta-analysis of the individual population GWAS
summary statistics. While the joint population GWAS and the meta-analysis did not identify
new associated regions besides the ones identified in the individual populations, both
approaches had a positive impact on the mapping resolution, which implies that causative
mutations can be located with higher accuracy. Depending on the access to the complete
original datasets, the strategies can complement or substitute each other. A total of 20 genomic
clusters were pinpointed and they contained genes previously associated with the traits (e.g.
IGF2, VRTN and TGFB3). Finally, among the additional candidate genes suggested,
BMP2 is proposed as a strong candidate gene for carcass length. The findings of this
study provide novel insights into approaches for dissecting the genetic basis of growth and
carcass traits and indicate directions for further research which will lead to the identification of
causal mutations affecting traits relevant in pig breeding programs.
References
1. Andersson L, Haley C, Ellegren H, Knott S, Johansson M, Andersson K, et al. Genetic
mapping of quantitative trait loci for growth and fatness in pigs. Science. 1994;
263:1771-4.
2. Rothschild MF, Hu Z-l, Jiang Z. Advances in QTL Mapping in Pigs. Int J Biol Sci.
2007;3(3):192-7. PubMed PMID: PMC1802014.
3. Ernst CW, Steibel JP. Molecular advances in QTL discovery and application in pig breeding.
Trends Genet. 2013;29(4):215-24. doi: https://doi.org/10.1016/j.tig.2013.02.002.
4. Ramos AM, Crooijmans RPMA, Affara NA, Amaral AJ, Archibald AL, Beever JE, et al.
Design of a High Density SNP Genotyping Assay in the Pig Using SNPs Identified and
Characterized by Next Generation Sequencing Technology. PLoS One.
2009;4(8):e6524. doi: 10.1371/journal.pone.0006524.
5. Schmid M, Wellmann R, Bennewitz J. Power and precision of QTL mapping in simulated
multiple porcine F2 crosses using whole-genome sequence information. BMC Genet.
2018;19:22. doi: 10.1186/s12863-018-0604-0. PubMed PMID: PMC5883302.
6. Evangelou E, Ioannidis JPA. Meta-analysis methods for genome-wide association studies
and beyond. Nat Rev Genet. 2013;14:379. doi: 10.1038/nrg3472.
7. Willer CJ, Li Y, Abecasis GR. METAL: fast and efficient meta-analysis of genomewide
association scans. Bioinformatics. 2010;26(17):2190-1. doi:
10.1093/bioinformatics/btq340.
8. Bernal Rubio YL, Gualdrón Duarte JL, Bates RO, Ernst CW, Nonneman D, Rohrer GA, et
al. Implementing meta-analysis from genome-wide association studies for pork quality
traits. J Anim Sci. 2015;93(12):5607-17. doi: 10.2527/jas.2015-9502.
9. Guo Y, Qiu H, Xiao S, Wu Z, Yang M, Yang J, et al. A genome-wide association study
identifies genomic loci associated with backfat thickness, carcass weight, and body
weight in two commercial pig populations. J Appl Genet. 2017;58(4):499-508. doi:
10.1007/s13353-017-0405-6. PubMed PMID: 28890999.
10. Le TH, Christensen OF, Nielsen B, Sahana G. Genome-wide association study for
conformation traits in three Danish pig breeds. Genet Sel Evol. 2017;49:12. doi:
10.1186/s12711-017-0289-2. PubMed PMID: PMC5259967.
11. Ruckert C, Bennewitz J. Joint QTL analysis of three connected F2-crosses in pigs. Genet
Sel Evol. 2010;42:40. Epub 2010/11/03. doi: 10.1186/1297-9686-42-40. PubMed
PMID: 21040563; PubMed Central PMCID: PMCPMC2988712.
12. Boysen TJ, Tetens J, Thaller G. Detection of a quantitative trait locus for ham weight with
polar overdominance near the ortholog of the callipyge locus in an experimental pig F2
population. J Anim Sci. 2010;88(10):3167-72. Epub 2010/06/29. doi: 10.2527/jas.2009-
2565. PubMed PMID: 20581286.
13. Edwards SM, Sørensen IF, Sarup P, Mackay TFC, Sørensen P. Genomic Prediction for
Quantitative Traits Is Improved by Mapping Variants to Gene Ontology Categories in
Drosophila melanogaster. Genetics. 2016;203(4):1871.
14. Borchers N, Reinsch N, Kalm E. Familial cases of coat colour-change in a Piétrain cross.
Journal of Animal Breeding and Genetics-Zeitschrift Fur Tierzuchtung Und
Zuchtungsbiologie. 2000;117(4):285-7. doi: 10.1111/j.1439-0388.2000.00255.x.
15. Geldermann H, Müller E, Beeckmann P, Knorr C, Yue G, Moser G. Mapping of
quantitative-trait loci by means of marker genes in F2 generations of Wild boar, Pietrain
and Meishan pigs. Journal of Animal Breeding and Genetics-Zeitschrift Fur
Tierzuchtung Und Zuchtungsbiologie. 1996;113(1-6):381-7.
doi: 10.1111/j.1439-0388.1996.tb00629.x.
16. Zentral Verband der Deutschen Schweinproduktion (ZDS). Richtlinie für die
Stationsprüfung auf Mastleistung, Schlachtkörperwert und Fleischbeschaffenheit beim
Schwein. 1981.
17. Zentral Verband der Deutschen Schweinproduktion (ZDS). Richtlinie für die
Stationsprüfung auf Mastleistung, Schlachtkörperwert und Fleischbeschaffenheit beim
Schwein. 1999.
18. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, et al. PLINK: A
Tool Set for Whole-Genome Association and Population-Based Linkage Analyses. Am
J Hum Genet. 2007;81(3):559-75. doi: https://doi.org/10.1086/519795.
19. Fujii J, Otsu K, Zorzato F, de Leon S, Khanna V, Weiler J, et al. Identification of a mutation
in porcine ryanodine receptor associated with malignant hyperthermia. Science.
1991;253(5018):448-51. doi: 10.1126/science.1862346.
20. Stratz P, Schmid M, Wellmann R, Preuss S, Blaj I, Tetens J, et al. Linkage disequilibrium
pattern and genome-wide association mapping for meat traits in multiple porcine F2-
crosses. Anim Genet. 2018; doi: 10.1111/age.12684. In press.
21. Hill WG, Robertson A. Linkage disequilibrium in finite populations. Theor Appl Genet.
1968;38(6):226-31. doi: 10.1007/bf01245622.
22. Badke YM, Bates RO, Ernst CW, Schwab C, Steibel JP. Estimation of linkage
disequilibrium in four US pig breeds. BMC Genomics. 2012;13(1):24. doi:
10.1186/1471-2164-13-24.
23. R Core Team. R: A Language and Environment for Statistical Computing 2014.
24. Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: A Tool for Genome-wide Complex
Trait Analysis. American Journal of Human Genetics. 2011;88(1):76-82. doi:
https://doi.org/10.1016/j.ajhg.2010.11.011.
25. Li G, Zhu H. Genetic Studies: The Linear Mixed Models in Genome-wide Association
Studies. The Open Bioinformatics Journal. 2013;7(1):27-33.
26. Turner SD. qqman: an R package for visualizing GWAS results using Q-Q and manhattan
plots. bioRxiv. 2014. doi: 10.1101/005165.
27. Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc Natl Acad
Sci USA. 2003;100(16):9440-5. doi: 10.1073/pnas.1530509100.
28. Bardou P, Mariette J, Escudié F, Djemiel C, Klopp C. jvenn: an interactive Venn diagram
viewer. BMC Bioinformatics. 2014;15(1):293. doi: 10.1186/1471-2105-15-293.
29. Durinck S, Moreau Y, Kasprzyk A, Davis S, De Moor B, Brazma A, et al. BioMart and
Bioconductor: a powerful link between biological databases and microarray data
analysis. Bioinformatics. 2005;21(16):3439-40. doi: 10.1093/bioinformatics/bti525.
30. Zerbino DR, Achuthan P, Akanni W, Amode M R, Barrell D, Bhai J, et al. Ensembl 2018.
Nucleic Acids Res. 2018;46(D1):D754-D61. doi: 10.1093/nar/gkx1098.
31. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene Ontology:
tool for the unification of biology. Nat Genet. 2000;25:25. doi: 10.1038/75556.
32. The Gene Ontology Consortium. Expansion of the Gene Ontology knowledgebase and
resources. Nucleic Acids Res. 2017;45(D1):D331-D8. doi: 10.1093/nar/gkw1108.
PubMed PMID: PMC5210579.
33. Bennewitz J, Wellmann R. Mapping Resolution in Single and Multiple F2 Populations
using Genome Sequence Marker Panels. Proceedings of the 10th World Congress on
Genetics Applied to Livestock Production. 2014.
34. Bosse M, Lopes MS, Madsen O, Megens H-J, Crooijmans RPMA, Frantz LAF, et al.
Artificial selection on introduced Asian haplotypes shaped the genetic architecture in
European commercial pigs. Proc Biol Sci. 2015;282(1821):20152019. doi:
10.1098/rspb.2015.2019. PubMed PMID: PMC4707752.
35. Rothschild MF, Ruvinsky A. The Genetics of the Pig. 2nd Edition ed. Oxfordshire, UK,
Cambridge, USA: CAB International; 2011. 526 p.
36. Borchers N, Reinsch N, Kalm E. The number of ribs and vertebrae in a Piétrain cross:
variation, heritability and effects on performance traits. J Anim Breed Genet.
2004;121(6):392-403. doi: 10.1111/j.1439-0388.2004.00482.x.
37. Lawrence TLJ, Fowler VR. Growth of farm animals. 2nd edition ed. Oxon, UK, New York,
USA CABI Pub.; 2002. 347 p.
38. Nezer C, Moreau L, Brouwers B, Coppieters W, Detilleux J, Hanset R, et al. An imprinted
QTL with major effect on muscle mass and fat deposition maps to the IGF2 locus in
pigs. Nat Genet. 1999;21:155. doi: 10.1038/5935.
39. Nguyen DT, Lee K, Choi H, Choi M-k, Le MT, Song N, et al. The complete swine olfactory
subgenome: expansion of the olfactory gene repertoire in the pig genome. BMC
Genomics. 2012;13(1):584. doi: 10.1186/1471-2164-13-584.
40. de Koning DJ, Janss LLG, Rattink AP, van Oers PAM, de Vries BJ, Groenen MAM, et al.
Detection of Quantitative Trait Loci for Backfat Thickness and Intramuscular Fat
Content in Pigs (Sus Scrofa). Genetics. 1999;152(4):1679-90.
41. Bidanel J-P, Milan D, Iannuccelli N, Amigues Y, Boscher M-Y, Bourgeois F, et al.
Detection of quantitative trait loci for growth and fatness in pigs. Genet Sel Evol.
2001;33(3):289. doi: 10.1186/1297-9686-33-3-289.
42. Zhang Y, Gao T, Hu S, Lin B, Yan D, Xu Z, et al. The Functional SNPs in the 5’ Regulatory
Region of the Porcine PPARD Gene Have Significant Association with Fat Deposition
Traits. PLoS One. 2015;10(11):e0143734. doi: 10.1371/journal.pone.0143734.
43. Yang L, Chang C-C, Sun Z, Madsen D, Zhu H, Padkjær SB, et al. GFRAL is the receptor
for GDF15 and is required for the anti-obesity effects of the ligand. Nat Med.
2017;23:1158. doi: 10.1038/nm.4394
44. Mikawa S, Sato S, Nii M, Morozumi T, Yoshioka G, Imaeda N, et al. Identification of a
second gene associated with variation in vertebral number in domestic pigs. BMC
Genet. 2011;12(1):5. doi: 10.1186/1471-2156-12-5.
45. Zhang L, Liu X, Liang J, Zhao K, Yan H, Li N, et al. Genome-wide Association Study for
Number of Vertebrae in an F2 Large White × Minzhu Population of Pigs. bioRxiv. 2015.
doi: 10.1101/016956.
46. Zhang L, Yue J-W, Pu L, Wang L-G, Liu X, Liang J, et al. Genome-wide study refines the
quantitative trait locus for number of ribs in a Large White × Minzhu intercross pig
population and reveals a new candidate gene. Mol Genet Genomics. 2016;291(5):1885-
90. doi: 10.1007/s00438-016-1220-1.
Chapter 2
Non-additive effects in four diverse F2 pig crosses for growth,
carcass and fat related traits
Iulia Blaj1, Jens Tetens2, 3, Siegfried Preuß4, Jörn Bennewitz4 and Georg Thaller1
1Institute of Animal Breeding and Husbandry, Kiel University, Kiel, Germany
2Functional Breeding Group, Department of Animal Sciences, Göttingen University,
Göttingen, Germany
3Center for Integrated Breeding Research, Göttingen University, Göttingen, Germany
4Institute of Animal Husbandry and Breeding, University of Hohenheim, Stuttgart, Germany
Manuscript in preparation
Abstract
Non-additive effects, such as dominance and imprinting, have received substantial attention in
the recent past. Advances were in general coupled with technological progress in
genomic data generation. In this study, we focus on four F2 resource populations for which prior
investigations on non-additive effects were mostly based on sparse genotyping and linkage
mapping. Thus, SNP array based genomic information complementary to phenotypic data is
used here to gain insights into mechanisms underlying non-additivity. Dominance and
imprinting are evaluated by means of variance component estimation and various
genome-wide association models (i.e. dominance, imprinting, maternal and paternal) for
growth, carcass and fat related traits. Variance attributed to imprinting varies from 0% to 19%
and is more prevalent in the larger F2 population, which stems from Piétrain boars crossed with
Large White × Landrace crossbred sows or Large White sows. Dominance is responsible for 0% to
34% of the phenotypic variance and is more pronounced in the crosses originating from more
distant founders (due to elevated heterozygosity levels). Disentangling additive, dominance and
imprinting variances revealed the confounding nature of these various genetic partitions. High
levels of significance for the imprinting and paternal models were detected around IGF2, a gene
known to be under epigenetic control. Attention is drawn to the fact that the statistical models
used for the non-additive effects can lead to spurious associations, as an artefact of the
population setting.
Background
The non-additive genetic effects field has recently generated considerable interest in the area of
livestock genomics (Varona et al., 2018). While dominance effects have been long included as
a component of the genetic variance (Falconer and Mackay, 1996), imprinting effects are a
younger area of research. Imprinting, an epigenetic mechanism, refers to a locus at which the
two identical alleles are not equivalent from a functional point of view leading to the preferential
expression from either the maternally or paternally inherited allele (Lawson et al., 2013). A
classic example for imprinting in pigs is the paternally expressed IGF2 gene (Jeon et al., 1999;
Nezer et al., 1999). While imprinting is best understood as a mechanism in the classical setting
where a complete parent-of-origin monoallelic expression (resembling dominance) is expected,
situations that deviate from complete imprinting, in which e.g. genes display tissue-specific
and/or time-dependent expression (Prickett and Oakey, 2012), are encountered as well. The
different variations of the imprinting and dominance (polar overdominance, polar
underdominance, bipolar dominance) and the interplay among these effects (O'Doherty et al.,
2015) contribute collectively to the difficulty of defining an appropriate statistical framework
which could account for the various scenarios. Nevertheless, complex mechanisms can be
unraveled such as the classic example of dominance (polar overdominance) interacting with
imprinting in the callipyge phenotype in sheep (Cockett et al., 1996).
The estimation of non-additive effects proved to be challenging in the past due to technical
limitations that were overcome with the availability of dense single nucleotide polymorphism
(SNP) arrays. In this context, estimation of dominance and imprinting effects became appealing
in the last decade and it has been under debate how much of the phenotypic variance can be
assigned to these effects and whether including such effects in the genetic evaluation models
could improve breeding value estimates (Nishio and Satoh, 2015; Jiang et al., 2017; Blunk et
al., 2019). From an applied perspective in livestock breeding, predicting breeding values, either
in a classical setting or by means of genomic selection (Meuwissen et al., 2001), focuses on the
additive genetic effects. Nevertheless, given the complex nature of the quantitative traits, a
deeper insight into the possible mechanisms that contribute in a non-additive manner to the
genetic variance can pave the way to employing such findings in practice.
For this study, we consider data from four F2 resource populations for which previous
investigations regarding dominance and imprinting were carried out using sparse genotyping
and linkage mapping. Three of the crosses were analyzed using 250 genetic markers (mostly
microsatellites) jointly and separately for additive, dominance and imprinting effects on fat and
metabolic traits (Rückert and Bennewitz, 2010). For the other cross, an interval mapping
approach using microsatellite data detected three quantitative trait loci (QTL) affecting carcass
traits comparable with the callipyge phenotype in sheep (Boysen et al., 2010). Using selected
genotypic subgroups in the IGF2 region in the same population and an interval mapping
approach, Boysen et al. (2011) reported additive and parent-of-origin evidence for additional
functional genetic variation within the IGF2 region affecting body composition traits.
The available phenotypic information from these resource populations coupled with the
information from the dense SNP arrays could facilitate a better understanding of the underlying
non-additive mechanisms. Thus, the aim of the study was twofold. Firstly, to decompose the
phenotypic variance for growth, carcass and fat related traits by means of genomic relationship
matrices accounting for an additive, dominance and imprinting partition. Secondly, to
investigate non-additive, i.e. dominance, imprinting, maternal and paternal SNP effects by
using different models for genome-wide association studies (GWAS).
Materials and methods
Resource populations
For this study, we used four F2 pig resource populations (Borchers et al., 2000; Geldermann et
al., 1996). The first design (D1) contained 1,785 F2 individuals originating from
Piétrain (P) boars and Landrace × Large White (LwxL) crossbred or Large White (Lw) sow
founder individuals. The remaining three populations (i.e. D2, D3 and D4) stemmed from a
Meishan (M) boar or Wild boar (W) crossed with either Piétrain or Meishan females. The F2
generation was the outcome of repeatedly crossing F1 boars with F1 sows in order to obtain
large full-sib families. D2 (MxP) had 304 F2 individuals, D3 (WxP) contained 291 and
D4 (WxM) comprised 302. Further details are available in Figure 1.
Figure 1. Description of the four F2 resource populations per generation.
Phenotypes and genotypes
The phenotypes available for all F2 individuals were average daily gain (ADG), average backfat
thickness (BFT), meat to fat ratio (MFR) and carcass length (CRCL). An additional set of
phenotypes was used for D1: fat thickness neck (SCF), fat thickness middle of the back (BFM),
fat thickness end of the back (BFTR), fat thickness at latissimus dorsi muscle (SCFLD), fat
thickness over the loin muscle (SCFLM) and belly fatness score (BFS). The phenotypes were
pre-corrected for systematic environmental effects and for the effect of RYR1 gene (Fujii et al.,
1991). All three generations have been genotyped with the Illumina PorcineSNP60 BeadChip
(Ramos et al., 2009). SNP chromosomal positions were based on the current pig genome
assembly (Sscrofa 11.1). Quality control was performed in PLINK (Purcell et al., 2007),
removing SNPs with a minor allele frequency (MAF) < 0.05 or a call rate < 0.90. The final SNP counts were
44,451, 40,733, 37,139 and 35,887 for D1, D2, D3 and D4, respectively.
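The two quality control filters can be illustrated with a minimal Python sketch. It is not the PLINK implementation; the function names, the 0/1/2 allele-count encoding with None for a missing call, and the toy genotypes are assumptions made purely for illustration.

```python
# Illustrative sketch of the two SNP filters (MAF and call rate).
# Genotypes are assumed to be allele counts 0/1/2; None marks a missing call.

def call_rate(genotypes):
    """Fraction of animals with a non-missing genotype call."""
    return sum(g is not None for g in genotypes) / len(genotypes)

def maf(genotypes):
    """Minor allele frequency computed from the non-missing calls."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    freq = sum(called) / (2 * len(called))
    return min(freq, 1 - freq)

def passes_qc(genotypes, min_maf=0.05, min_call_rate=0.90):
    """Keep a SNP only if it satisfies both thresholds."""
    return maf(genotypes) >= min_maf and call_rate(genotypes) >= min_call_rate

# Hypothetical toy SNPs (names and values are made up).
snps = {
    "snp_common": [0, 1, 2, 1, 0, 1, 2, 1, 0, 1],           # MAF 0.45, no missing calls
    "snp_rare": [0] * 19 + [1],                              # MAF 1/40 = 0.025
    "snp_missing": [0, 1, None, None, 2, 1, 0, 1, 2, 1],     # call rate 0.80
}
kept = [name for name, g in snps.items() if passes_qc(g)]
print(kept)
```

Only the first toy SNP survives: the second fails the MAF filter and the third fails the call rate filter.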
Variance component and association studies
To distinguish between the maternal and paternal origin of the alleles and implicitly to
discriminate between aA and Aa heterozygous genotypes, phasing was conducted using
SHAPEIT2 (O'Connell et al., 2014) and the flag --duoHMM which ensures that the haplotype
inference is consistent with the pedigree structure. For the phasing step, information from all
three generations (i.e. F0, F1 and F2) was used as input. Given the phased genotypes of the F2
individuals, aa, aA, Aa and AA were coded in the following manner: {1,0,0,-1} for additive,
{0,1,1,0} for dominance and {0,1,-1,0} for imprinting (Nishio and Satoh, 2015). The genomic
relationship matrices (i.e. gA, gD and gI) were constructed similarly to Laurin et al. (2018) and the
phenotypic variance was partitioned in a stepwise approach using three models:
i) $y = \mu + g_A + e$ (additive)
ii) $y = \mu + g_A + g_D + e$ (additive and dominance)
iii) $y = \mu + g_A + g_D + g_I + e$ (additive, dominance and imprinting)
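The genotype coding described above can be sketched in a few lines of Python. The three coding tables are taken from the text; everything else is an assumption for illustration: the function names, the toy genotypes, and in particular the naive cross-product normalisation Z Z'/m, which is not necessarily the scaling used by Laurin et al. (2018).

```python
# Coding of phased genotypes (aa, aA, Aa, AA) as given in the text.
CODING = {
    "additive":   {"aa": 1, "aA": 0, "Aa": 0,  "AA": -1},
    "dominance":  {"aa": 0, "aA": 1, "Aa": 1,  "AA": 0},
    "imprinting": {"aa": 0, "aA": 1, "Aa": -1, "AA": 0},
}

def code_genotypes(phased, effect):
    """Turn an animals x SNPs table of phased genotypes into a covariate matrix."""
    table = CODING[effect]
    return [[table[g] for g in animal] for animal in phased]

def relationship(Z):
    """Naive relationship matrix Z Z' / m (assumed normalisation, see lead-in)."""
    n, m = len(Z), len(Z[0])
    return [[sum(Z[i][k] * Z[j][k] for k in range(m)) / m for j in range(n)]
            for i in range(n)]

# Hypothetical toy data: 3 animals x 3 SNPs.
phased = [
    ["aa", "aA", "AA"],
    ["Aa", "aA", "aa"],
    ["AA", "Aa", "aA"],
]
Z_add = code_genotypes(phased, "additive")
Z_dom = code_genotypes(phased, "dominance")
Z_imp = code_genotypes(phased, "imprinting")
G_imp = relationship(Z_imp)
```

Note that the two heterozygotes aA and Aa share the same additive and dominance codes and differ only in the imprinting code, which is why phasing is a prerequisite for the imprinting partition.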
For the GWAS, five mixed linear models were used, all incorporating the following random terms: gA, gD and
gI. The models followed Mantey et al. (2005) and Mozaffari et al. (2019) and are shown
below. In the first three, the SNPs were tested for additivity, dominance and imprinting. The
latter two were a maternal and a paternal model where the phased alleles coming either from
the dam or from the sire were tested separately for association. The variance component
estimation and the mixed models were solved in R (R Core Team, 2018) using the package
‘sommer’ (Covarrubias-Pazaran, 2016).
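i) $y = \mu + x_{add} b_{add} + g_A + g_D + g_I + e$ (additive model)
ii) $y = \mu + x_{dom} b_{dom} + g_A + g_D + g_I + e$ (dominance model)
iii) $y = \mu + x_{imp} b_{imp} + g_A + g_D + g_I + e$ (imprinting model)
iv) $y = \mu + x_{mat} b_{mat} + g_A + g_D + g_I + e$ (maternal model)
v) $y = \mu + x_{pat} b_{pat} + g_A + g_D + g_I + e$ (paternal model)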
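The relation between the parental covariates and the additive and imprinting codes can be made explicit with a small sketch. The text does not state how x_mat and x_pat were coded, nor which allele of a phased genotype is written first; the sketch assumes that the first letter is the paternally inherited allele and that each parental covariate counts copies of allele a (0 or 1). Under these assumptions the additive code equals x_pat + x_mat - 1 and the imprinting code equals x_pat - x_mat.

```python
# Assumption: in a phased genotype such as "aA" the first letter is the
# paternally inherited allele and the second the maternally inherited one.

def parental_covariates(genotype):
    """Return (x_pat, x_mat): copies of allele 'a' inherited from sire and dam."""
    pat, mat = genotype[0], genotype[1]
    return int(pat == "a"), int(mat == "a")

table = {}
for g in ("aa", "aA", "Aa", "AA"):
    x_pat, x_mat = parental_covariates(g)
    table[g] = {
        "x_pat": x_pat,
        "x_mat": x_mat,
        "additive": x_pat + x_mat - 1,   # reproduces {1, 0, 0, -1}
        "imprinting": x_pat - x_mat,     # reproduces {0, 1, -1, 0}
    }
    print(g, table[g])
```

Because the additive and imprinting covariates are the sum and the difference of the two parental covariates, a strong paternal signal would be expected to show up, attenuated, in both the imprinting and the additive model.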
Manhattan plots for the GWAS were created with the ggplot2 R package (Wickham, 2009). Using
a Bonferroni correction, the genome-wide significance threshold was set to $p_{\text{genome-wide}} \le 0.05$,
i.e. a per-SNP threshold of $0.05/N_{\text{SNPs}}$. To balance the stringency of the Bonferroni
correction, an additional nominal significance level with a threshold of $p \le 5 \times 10^{-5}$ was applied.
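For orientation, the two thresholds can be expressed on the –log10 scale used in the Manhattan plots. The sketch below assumes that the number of tests in the Bonferroni correction equals the post-quality-control SNP count of each design; the variable and function names are illustrative only.

```python
import math

# Post-QC SNP counts per design, as reported in the text.
SNP_COUNTS = {"D1": 44451, "D2": 40733, "D3": 37139, "D4": 35887}

def bonferroni_threshold(n_tests, alpha=0.05):
    """Genome-wide threshold on the -log10(p) scale: -log10(alpha / n_tests)."""
    return -math.log10(alpha / n_tests)

thresholds = {design: bonferroni_threshold(n) for design, n in SNP_COUNTS.items()}
nominal = -math.log10(5e-5)  # nominal significance level on the same scale

for design, value in thresholds.items():
    print(f"{design}: genome-wide {value:.2f}, nominal {nominal:.2f}")
```

Under this assumption the genome-wide thresholds lie slightly below 6 on the –log10 scale for all four designs, whereas the nominal level lies at roughly 4.3.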
Results
Variance component estimation
First, we partitioned the phenotypic variance into an additive and an environmental
(residual) component using genomic relationship matrices computed from genotypes.
Subsequently, a component accounting for dominance and one accounting for imprinting were
added. Table 1 and Table 2 present how including an additional element influences the overall
phenotypic variance explained in all the four crosses for all ten traits considered. The
visualization of the percentage breakdown for the variance including all four components (i.e.
additive, dominance, imprinting and residual) is available in Figure SM1. The dominance
variance ranged from 0% to 34%, whereas the imprinting variance from 0% to 19%. In the D1
cross, the highest dominance variance was for ADG, contributing 6% to the total variance.
For the trait SCFLM in D1, the imprinting variance explained 19% of the phenotypic variance.
Considering the other three crosses, a 34% dominance contribution was recorded for ADG in
D2 and an 11% imprinting contribution for BFT in D3.
Table 1. Additive (VA), dominance (VD) and imprinting (VI) variance contribution for D1
cross (where h2=VA/VP is the narrow sense heritability) in: i) additive, ii) additive +
dominance and iii) additive + dominance + imprinting.

| Cross | Trait | VA / VP | (VA + VD) / VP | VA / VP + VD / VP | (VA + VD + VI) / VP | VA / VP + VD / VP + VI / VP |
|---|---|---|---|---|---|---|
| D1 | ADG | 0.35 (0.04) | 0.39 (0.04) | 0.32 (0.04) + 0.07 (0.02) | 0.41 (0.04) | 0.29 (0.04) + 0.06 (0.02) + 0.06 (0.04) |
| | BFT | 0.46 (0.04) | 0.46 (0.04) | 0.45 (0.04) + 0.01 (0.02) | 0.46 (0.04) | 0.45 (0.04) + 0.01 (0.02) + 0 (0.02) |
| | MFR | 0.51 (0.03) | 0.52 (0.03) | 0.52 (0.04) + 0 (0.01) | 0.59 (0.03) | 0.44 (0.04) + 0 (0.01) + 0.15 (0.04) |
| | CRCL | 0.60 (0.03) | 0.60 (0.03) | 0.60 (0.03) + 0 (0.01) | 0.60 (0.03) | 0.60 (0.03) + 0 (0.01) + 0 (0.02) |
| | SCF | 0.32 (0.04) | 0.34 (0.04) | 0.29 (0.04) + 0.04 (0.02) | 0.34 (0.04) | 0.29 (0.04) + 0.04 (0.02) + 0.01 (0.03) |
| | BFM | 0.31 (0.04) | 0.32 (0.04) | 0.30 (0.04) + 0.02 (0.02) | 0.32 (0.04) | 0.30 (0.04) + 0.02 (0.02) + 0 (0.02) |
| | BFTR | 0.42 (0.04) | 0.42 (0.04) | 0.42 (0.04) + 0 (0.02) | 0.42 (0.04) | 0.42 (0.04) + 0 (0.02) + 0 (0.02) |
| | SCFLD | 0.49 (0.03) | 0.50 (0.04) | 0.48 (0.04) + 0.02 (0.02) | 0.51 (0.04) | 0.46 (0.04) + 0.01 (0.02) + 0.03 (0.03) |
| | SCFLM | 0.46 (0.04) | 0.47 (0.04) | 0.45 (0.04) + 0.01 (0.02) | 0.56 (0.03) | 0.36 (0.04) + 0.01 (0.01) + 0.19 (0.04) |
| | BFS | 0.40 (0.04) | 0.41 (0.04) | 0.40 (0.04) + 0.01 (0.02) | 0.42 (0.04) | 0.39 (0.04) + 0.01 (0.02) + 0.02 (0.03) |
Table 2. Additive (VA), dominance (VD) and imprinting (VI) variance contribution for D2,
D3 and D4 cross (where h2=VA/VP is the narrow sense heritability) in: i) additive, ii)
additive + dominance and iii) additive + dominance + imprinting.

| Cross | Trait | VA / VP | (VA + VD) / VP | VA / VP + VD / VP | (VA + VD + VI) / VP | VA / VP + VD / VP + VI / VP |
|---|---|---|---|---|---|---|
| D2 | ADG | 0.32 (0.09) | 0.63 (0.11) | 0.29 (0.09) + 0.34 (0.12) | 0.63 (0.11) | 0.29 (0.09) + 0.34 (0.12) + 0 (0.07) |
| | BFT | 0.71 (0.07) | 0.79 (0.08) | 0.70 (0.07) + 0.08 (0.07) | 0.79 (0.08) | 0.69 (0.08) + 0.08 (0.07) + 0.02 (0.07) |
| | MFR | 0.46 (0.09) | 0.50 (0.11) | 0.46 (0.09) + 0.04 (0.09) | 0.52 (0.11) | 0.44 (0.10) + 0.03 (0.09) + 0.05 (0.08) |
| | CRCL | 0.73 (0.07) | 0.73 (0.08) | 0.73 (0.07) + 0 (0.06) | 0.73 (0.08) | 0.73 (0.08) + 0 (0.06) + 0 (0.06) |
| D3 | ADG | 0.54 (0.09) | 0.60 (0.12) | 0.53 (0.09) + 0.07 (0.12) | 0.60 (0.12) | 0.53 (0.12) + 0.07 (0.12) + 0 (0.11) |
| | BFT | 0.40 (0.10) | 0.41 (0.14) | 0.40 (0.10) + 0.01 (0.12) | 0.44 (0.14) | 0.30 (0.12) + 0.01 (0.12) + 0.11 (0.13) |
| | MFR | 0.46 (0.09) | 0.53 (0.13) | 0.43 (0.10) + 0.10 (0.13) | 0.53 (0.13) | 0.41 (0.12) + 0.09 (0.13) + 0.02 (0.11) |
| | CRCL | 0.30 (0.09) | 0.40 (0.14) | 0.24 (0.09) + 0.16 (0.15) | 0.41 (0.15) | 0.21 (0.11) + 0.14 (0.15) + 0.05 (0.12) |
| D4 | ADG | 0.40 (0.08) | 0.46 (0.10) | 0.39 (0.09) + 0.07 (0.08) | 0.50 (0.10) | 0.37 (0.09) + 0.07 (0.08) + 0.05 (0.06) |
| | BFT | 0.49 (0.08) | 0.49 (0.09) | 0.49 (0.08) + 0 (0.06) | 0.51 (0.10) | 0.48 (0.09) + 0 (0.06) + 0.03 (0.05) |
| | MFR | 0.59 (0.08) | 0.59 (0.09) | 0.59 (0.08) + 0 (0.05) | 0.59 (0.09) | 0.58 (0.08) + 0 (0.05) + 0.01 (0.05) |
| | CRCL | 0.56 (0.08) | 0.56 (0.09) | 0.56 (0.08) + 0 (0.05) | 0.59 (0.09) | 0.53 (0.08) + 0 (0.05) + 0.06 (0.06) |
Genome-wide association studies
The majority of the genome-wide significant SNPs across all linear mixed models used
were located on Sus scrofa chromosomes (SSC) 2, 7 and 17. Table 3 and Table 4 contain the
top SNPs from the regions above the genome-wide significance level for the non-additive
effects (dominance, imprinting, maternal and paternal). The SNPs exceeding the nominal
significance level (threshold of $p \le 5 \times 10^{-5}$) for all four populations and for all traits are
summarized in Table SM1. The most significant SNP for the dominance model was
H3GA0048042 on SSC17 for CRCL with –log10 p value = 14.94 and for the imprinting model
H3GA0054053 on SSC2 for MFR (–log10 p value = 26.45, Figure 2). The latter variant was also the
most significant SNP on SSC2 for MFR when using the paternal model (–log10 p value = 51.17)
and it often appeared at high levels of significance together with H3GA0005584 (SSC2:
4.37Mb) in the imprinting and paternal models for most of the traits in D1 (except CRCL).
Overall, the top maternal effect was found on SSC17 for CRCL in D1 for the variant
MARC0070553 (–log10 p value = 21.33).
Regional Manhattan plots for MFR in D1 (SSC2:0-14Mb) and for CRCL associations in D1
(SSC17: 0-30Mb) are depicted in Figure 3. A region around SSC7:30Mb had several significant
variants (either for dominance or maternal model) related to ADG, BFT and CRCL in D2 and
D4 and, exemplarily, this specific region for CRCL in D4 is shown in Figure 4. To investigate
the causality of the H3GA0054053 for MFR in D1 in the imprinting setting, a conditional
GWAS was conducted using the variant as a fixed effect (Figure 5).
Table 3. Top SNPs for traits with associations over the genome-wide significance level in
D1. Type GWAS: I = imprinting, D = dominance, M = maternal and P = paternal.

| Cross | Trait | SNP | SSC | Position (bp) | –log10 p value | Type GWAS |
|---|---|---|---|---|---|---|
| D1 | ADG | ALGA0123907 | 2 | 2556939 | 9.31 | P |
| | BFT | H3GA0005584 | 2 | 4378975 | 11.19 | I |
| | | H3GA0005584 | 2 | 4378975 | 19.75 | P |
| | MFR | ALGA0105438 | 2 | 631324 | 6.76 | D |
| | | H3GA0054053 | 2 | 1100354 | 26.45 | I |
| | | H3GA0054053 | 2 | 1100354 | 51.17 | P |
| | CRCL | H3GA0048042 | 17 | 19474175 | 14.94 | D |
| | | H3GA0047609 | 17 | 1616893 | 8.38 | I |
| | | MARC0070553 | 17 | 15827832 | 21.33 | M |
| | | MARC0027977 | 17 | 17667594 | 10.12 | P |
| | SCF | H3GA0005584 | 2 | 4378975 | 8.38 | I |
| | | H3GA0005584 | 2 | 4378975 | 14.60 | P |
| | BFM | H3GA0005584 | 2 | 4378975 | 6.26 | I |
| | | H3GA0005584 | 2 | 4378975 | 11.90 | P |
| | BFTR | H3GA0005584 | 2 | 4378975 | 9.89 | I |
| | | H3GA0054053 | 2 | 1100354 | 21.17 | P |
| | SCFLD | MARC0045154 | 2 | 4671904 | 8.24 | D |
| | | H3GA0054053 | 2 | 1100354 | 16.60 | I |
| | | H3GA0054053 | 2 | 1100354 | 36.47 | P |
| | SCFLM | MARC0045154 | 2 | 4671904 | 6.83 | D |
| | | H3GA0054053 | 2 | 1100354 | 20.41 | I |
| | | H3GA0005584 | 2 | 4378975 | 43.97 | P |
| | BFS | ALGA0105438 | 2 | 631324 | 6.41 | D |
| | | H3GA0054053 | 2 | 1100354 | 16.97 | I |
| | | H3GA0054053 | 2 | 1100354 | 33.72 | P |
Table 4. Top SNPs for traits with associations over the genome-wide significance level in
D2, D3 and D4. Type GWAS: I = imprinting, D = dominance, M = maternal and P = paternal.

| Cross | Trait | SNP | SSC | Position (bp) | –log10 p value | Type GWAS |
|---|---|---|---|---|---|---|
| D2 | BFT | H3GA0021185 | 7 | 36795710 | 6.28 | D |
| | | ALGA0104042 | 2 | 2036226 | 6.72 | P |
| | MFR | ALGA0104042 | 2 | 2036226 | 12.28 | I |
| | | MARC0008125 | 2 | 2984595 | 12.67 | P |
| | CRCL | MARC0014933 | 7 | 33216744 | 6.75 | M |
| D3 | BFT | ASGA0085784 | 2 | 236179 | 8.18 | I |
| | | H3GA0054053 | 2 | 1100354 | 9.53 | P |
| | MFR | ASGA0008415 | 2 | 3895569 | 7.05 | I |
| | | H3GA0054053 | 2 | 1100354 | 8.59 | P |
| D4 | ADG | INRA0024524 | 7 | 26069284 | 5.88 | M |
| | BFT | INRA0024835 | 7 | 31763707 | 6.39 | D |
| | | ASGA0032187 | 7 | 26662865 | 8.17 | M |
| | CRCL | ALGA0040423 | 7 | 32827483 | 6.49 | D |
| | | M1GA0009879 | 7 | 29268803 | 6.96 | M |
Figure 2. Box plots showing the distribution of the MFR residuals at the top imprinting
(–log10 p value = 26.45) and paternal (–log10 p value = 51.17) variant H3GA0054053 in D1
for: additive, imprinting and parental models (i.e. maternal and paternal).
Figure 3. Regional Manhattan plots for GWAS in D1. Left: Associations for all five models
for MFR on SSC2:0-14Mb (blue vertical line indicates the location of IGF2 gene). Right:
Associations for all five models for CRCL on SSC17:0-30Mb. Horizontal dashed line is the
genome-wide significance threshold (Bonferroni corrected) set at –log10 (0.05/NSNPs).
Figure 4. Regional Manhattan plots for GWAS in D4. Associations for all five models for
CRCL on SSC7:0-60Mb. Horizontal dashed line is the genome-wide significance threshold
(Bonferroni corrected) set at –log10 (0.05/NSNPs).
Figure 5. Regional Manhattan plots for conditional GWAS in D1. Associations for
imprinting model and imprinting + top SNP (H3GA0054053) as fixed effect for MFR on
SSC2:0-14Mb (blue vertical line indicates the location of IGF2 gene). Horizontal dashed line
is the genome-wide significance threshold (Bonferroni corrected) set at –log10 (0.05/NSNPs).
Discussion
In this study, we utilize SNP array based genomic information coupled with a set of phenotypic
data from four pig F2 resource populations for a better understanding of non-additive effects.
We investigate dominance and imprinting by means of variance components estimation and
usage of various GWAS models (i.e. dominance, imprinting, maternal and paternal) for growth,
carcass and fat related traits.
In the variance partitioning process, the contribution of the environmental term decreased with
the inclusion of the dominance and imprinting effects, with few exceptions (for CRCL and
BFTR in D1, CRCL in D2 and MFR in D4) suggesting that the term captured part of these non-
additive effects prior to them being accounted for in the model. Furthermore, a reduction of the
additive variance was also observed (e.g. for MFR and SCFLM in D1, BFT and CRCL in D4)
proportional to the contribution of the phenotypic variance due to imprinting and dominance.
This reduction demonstrates the confounding nature among additivity and non-additivity (Hill
et al., 2008) that is emphasized here presumably as a consequence of the F2 population structure,
its linkage disequilibrium patterns and allele frequencies.
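The two observations above can be retraced with simple arithmetic on the estimates in Table 1. The sketch below uses three D1 traits and treats the residual share as one minus the explained fraction; the dictionary and function names are illustrative, and the calculation is a reading aid rather than part of the original analysis.

```python
# Explained fraction of phenotypic variance for models i), ii) and iii)
# and the additive share under each model, copied from Table 1 (D1).
EXPLAINED = {
    "MFR":   (0.51, 0.52, 0.59),
    "SCFLM": (0.46, 0.47, 0.56),
    "CRCL":  (0.60, 0.60, 0.60),
}
ADDITIVE = {
    "MFR":   (0.51, 0.52, 0.44),
    "SCFLM": (0.46, 0.45, 0.36),
    "CRCL":  (0.60, 0.60, 0.60),
}

def residual_shares(explained):
    """Residual share per model, taken as 1 minus the explained fraction."""
    return tuple(round(1 - x, 2) for x in explained)

for trait in EXPLAINED:
    residual = residual_shares(EXPLAINED[trait])
    additive_drop = round(ADDITIVE[trait][0] - ADDITIVE[trait][2], 2)
    print(trait, "residual:", residual, "additive drop (i -> iii):", additive_drop)
```

For MFR and SCFLM both the residual share and the additive share shrink once imprinting is modelled, whereas CRCL is one of the exceptions in which neither changes.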
Looking in detail at the variance partitions, a higher percentage of dominance variance was
observed in D2, D3 and D4 (Table 2) as a response to the increased heterozygosity levels
obtained by crossing breeds from different genetic backgrounds. Imprinting variance was more
pronounced in the D1 population (Table 1), accounting for a maximum of 19% of the
phenotypic variation, encountered for SCFLM. The non-additive variances derived here were
compared to previous estimates in pigs available for backfat thickness and daily gain based on
SNP array information (Lopes et al., 2015; Guo et al., 2016). In the two studies mentioned, the
imprinting variance ranged from 1% to 2% and dominance from 4% to 16% for both traits. Our
ranges are broader with higher maxima (i.e. 11% imprinting share for BFT in D3 and 34%
dominance share for ADG in D2). That said, it is worth noting that the
underlying populations used to obtain these estimates differ (F2 versus breeding or
commercial populations), which likely contributes to the observed differences.
The SNP effects were coded for non-additivity, similar to Mantey et al. (2005) and Mozaffari
et al. (2019), and included into GWAS models to be tested for association. Some of the non-
additive effects detected for D2 and D4 are most probably an artefact of the F2 population
structure stemming from few founder individuals (Sandor and Georges, 2008; Lawson et al.,
2013). For example, the significant maternal effect for CRCL in D4 (Figure 4) is definitely
spurious and appears because the top SNP (M1GA0009879, –log10 p value = 6.96) is
not fixed in the founders. Therefore, these statistically detected effects have to
be interpreted with caution because they may have no relevance for the presumed
underlying biological mechanism. However, even in such population designs, imprinting and
parent-of-origin effects can be detected, as is the case for the D2 and D3 populations. More
concretely, for BFT and MFR the significant variants for the imprinting and paternal models
(Table 4) are likely to have biological meaning as the SNPs are located in the vicinity of the
IGF2 gene on SSC2. This gene (insulin-like growth factor 2) is the best-documented case
of an imprinted gene in pigs (Jeon et al., 1999; Nezer et al., 1999): the maternal allele
is imprinted, so that only the paternal allele is expressed. IGF2 influences muscle mass
and fat deposition, and significant associations (for imprinting and parent-of-origin) near IGF2
were detected for the following traits: in D1 for all traits (except CRCL) and in D2 and D3 for
BFT and MFR (Table 3 and Table 4). A closer investigation was carried out for the IGF2 region
in D1. The trait with the highest imprinting and paternal association was considered,
specifically MFR with its lead variant (H3GA0054053: –log10 p value (I) = 26.45 and –log10 p
value (P) = 51.17). The significant associations stretch over a large genomic region,
SSC2:0-14Mb (Figure 3). To check and/or to prove causality for the imprinting effects on
SSC2, a conditional imprinting GWAS was conducted, where H3GA0054053 (imprinting
coding) was included as a fixed effect (Figure 5). Only one variant exceeded the significance
threshold suggesting that the imprinting signal can be attributed to the lead SNP.
According to their significance levels, the top significant SNPs on SSC2 followed
the order paternal >> imprinting >> additive, emphasizing the paternal expression of
IGF2. In D1, the signals in the IGF2 region were also discovered via the additive model, but
with a lower intensity, whereas in D2 and D3 there were no significant additive variants in the respective
area. Thus, to some extent, the additive parametrization is efficient in capturing other types of
gene actions. This is in line with the theory (Hill et al., 2008; Huang and Mackay, 2016) that
has at its core additivity according to the fundamentals of quantitative genetics (Fisher, 1918).
On SSC17, for CRCL in D1, although additive effects were the most significant, underlying
non-additive effects can be noticed (Figure 3). For example, the top additive SNP
(MARC0070553, –log10 p value = 30.12) was also highly significant in the maternal
model (–log10 p value = 21.33), albeit at a lower level. This SNP is near the gene BMP2 (bone
morphogenetic protein 2), which was already proposed as a candidate gene for carcass length
(Blaj et al., 2018); thus, it would be of further interest to investigate whether the maternal origin
of this variant has a true physiological meaning.
Conclusions
The results of this study showed that both dominance and imprinting effects contributed to the
genetic variation encountered in the four investigated F2 populations for a set of growth, carcass
and fat traits. Caution must be taken when interpreting non-additive effects in cross designs, as the
findings can be purely statistical with no physiological basis. Highly significant variants
were found near the IGF2 gene when using the imprinting and the paternal association models
and were mostly also uncovered by the additive model, but at lower significance levels. This draws
attention to the confounding or overlapping nature of additive and non-additive effects. These findings
aim to contribute to a better understanding of non-additive effects in the context of F2 resource
populations.
References
Aliloo, H., Pryce, J. E., González-Recio, O., Cocks, B. G., Goddard, M. E., & Hayes, B. J.
(2017). Including nonadditive genetic effects in mating programs to maximize dairy
farm profitability. Journal of dairy science 100 (2), pp. 1203–1222. DOI:
10.3168/jds.2016-11261.
Blaj, I., Tetens, J., Preuß, S., Bennewitz, J., & Thaller, G. (2018). Genome-wide association
studies and meta-analysis uncovers new candidate genes for growth and carcass traits
in pigs. PLoS ONE, 13(10), e0205576.
Blunk, I., Mayer, M., Hamann, H., & Reinsch, N. (2019). Scanning the genomes of parents
for imprinted loci acting in their un-genotyped progeny. Scientific reports 9 (1), p. 654.
DOI: 10.1038/s41598-018-36939-3.
Borchers, N., Reinsch, N., & Kalm, E. (2000). Familial cases of coat colour‐change in a Piétrain
cross. Journal of Animal Breeding and Genetics, 117(4), 285-287.
Boysen, T. J., Tetens, J., & Thaller, G. (2010). Detection of a quantitative trait locus for ham
weight with polar overdominance near the ortholog of the callipyge locus in an
experimental pig F2 population. Journal of animal science 88 (10), pp. 3167–3172.
DOI: 10.2527/jas.2009-2565.
Boysen, T. J., Tetens, J., & Thaller, G. (2011). Evidence for additional functional genetic
variation within the porcine IGF2 gene affecting body composition traits in an
experimental Piétrain × Large White/Landrace cross. Animal: an international journal
of animal bioscience 5 (5), pp. 672–677. DOI: 10.1017/S1751731110002466.
Cockett, N. E., Jackson, S. P., Shay, T. L., Farnir, F., Berghmans, S., Snowder, G. D. et al.
(1996). Polar Overdominance at the Ovine callipyge Locus. Science 273 (5272), pp.
236–238. DOI: 10.1126/science.273.5272.236.
Covarrubias-Pazaran, G. (2016). Genome-Assisted Prediction of Quantitative Traits using the
R Package sommer. PLoS ONE 11 (6), e0156744. DOI:
10.1371/journal.pone.0156744.
Falconer, D. S. & Mackay, T. F. C. (1996). Introduction to Quantitative Genetics. 4th ed. Longmans
Green, Harlow, Essex, UK.
Fisher R.A. (1918). The Correlation between Relatives on the Supposition of Mendelian
Inheritance. Trans R Soc Edin 53, pp. 399–433.
Fujii, J., Otsu, K., Zorzato, F., Leon, S. de, Khanna, V., Weiler, J. et al. (1991). Identification
of a mutation in porcine ryanodine receptor associated with malignant hyperthermia.
Science 253 (5018), pp. 448–451. DOI: 10.1126/science.1862346.
Geldermann, H., Müller, E., Beeckmann, P., Knorr, C., Yue, G., & Moser, G. (1996). Mapping
of quantitative‐trait loci by means of marker genes in F2 generations of Wild boar,
Pietrain and Meishan pigs. Journal of Animal Breeding and Genetics, 113(1‐6), 381-
387.
Guo, X., Christensen, O. F., Ostersen, T., Wang, Y., Lund, M. S., & Su, G. (2016). Genomic
prediction using models with dominance and imprinting effects for backfat thickness
and average daily gain in Danish Duroc pigs. Genetics, selection, evolution: GSE 48
(1), p. 67. DOI: 10.1186/s12711-016-0245-6.
Hill, W. G., Goddard, M. E. & Visscher, P. M. (2008). Data and theory point to mainly additive
genetic variance for complex traits. PLoS genetics 4 (2), e1000008. DOI:
10.1371/journal.pgen.1000008.
Huang, W. & Mackay, T. F. C. (2016). The Genetic Architecture of Quantitative Traits
cannot be inferred from Variance Component analysis. PLoS genetics 12 (11),
e1006421. DOI: 10.1371/journal.pgen.1006421.
Jeon, J. T., Carlborg Ö., Törnsten A., Giuffra E., Amarger V., Chardon P., et al. (1999). A
paternally expressed QTL affecting skeletal and cardiac muscle mass in pigs maps to
the IGF2 locus. Nature genetics 21 (2), p. 157.
Jiang, J., Shen, B., O'Connell, J. R., VanRaden, P. M., Cole, J. B., & Ma, L. (2017). Dissection
of additive, dominance, and imprinting effects for production and reproduction traits in
Holstein cattle. BMC genomics 18 (1), p. 425. DOI: 10.1186/s12864-017-3821-4.
Laurin, C., Cuellar-Partida, G., Hemani, G., Smith, G. D., Yang, J., & Evans, D. M. (2018).
Partitioning Phenotypic Variance Due to Parent-of-Origin Effects Using Genomic
Relatedness Matrices. Behavior genetics 48 (1), pp. 67–79. DOI: 10.1007/s10519-017-
9880-0.
Lawson, H. A., Cheverud, J. M. & Wolf, J. B. (2013). Genomic imprinting and parent-of-origin
effects on complex traits. Nature reviews. Genetics 14 (9), pp. 609–617. DOI:
10.1038/nrg3543.
Lopes, M. S., Bastiaansen, J. W. M., Janss, L., Knol, E. F., & Bovenhuis, H. (2015). Estimation
of Additive, Dominance, and Imprinting Genetic Variance Using Genomic Data. G3
(Bethesda, Md.) 5 (12), pp. 2629–2637. DOI:10.1534/g3.115.019513.
Mantey, C., Brockmann, G. A., Kalm, E., & Reinsch, N. (2005). Mapping and exclusion
mapping of genomic imprinting effects in mouse F2 families. The Journal of heredity
96 (4), pp. 329–338. DOI: 10.1093/jhered/esi044.
Meuwissen, T. H. E., Hayes, B. J. & Goddard, M. E. (2001). Prediction of Total Genetic Value
Using Genome-Wide Dense Marker Maps. Genetics 157 (4), pp. 1819–1829.
Mozaffari, S. V., DeCara, J. M., Shah, S. J., Sidore, C., Fiorillo, E., Cucca, F. et al. (2019).
Parent-of-origin effects on quantitative phenotypes in a large Hutterite pedigree.
Communications biology 2, p. 28. DOI: 10.1038/s42003-018-0267-4.
Nezer C., Moreau L., Brouwers B., Coppieters W., Detilleux J. et al. (1999). An imprinted QTL
with major effect on muscle mass and fat deposition maps to IGF2 locus in pigs.
Nature genetics 21 (2), p. 155.
Nishio, M. & Satoh, M. (2015). Genomic best linear unbiased prediction method
including imprinting effects for genomic evaluation. Genetics, selection, evolution:
GSE 47, p. 32. DOI: 10.1186/s12711-015-0091-y.
O'Connell, J., Gurdasani, D., Delaneau, O., Pirastu, N., Ulivi, S., Cocca, M. et al. (2014). A
general approach for haplotype phasing across the full spectrum of relatedness. PLoS
genetics 10 (4), e1004234. DOI: 10.1371/journal.pgen.1004234.
O'Doherty, A. M., MacHugh, D. E., Spillane, C., & Magee, D. A. (2015). Genomic
imprinting effects on complex traits in domesticated animal species. Frontiers in
genetics 6, p. 156. DOI: 10.3389/fgene.2015.00156.
Prickett, A. R., & Oakey, R. J. (2012). A survey of tissue-specific genomic imprinting in
mammals. Molecular genetics and genomics: MGG 287 (8), pp. 621–630. DOI:
10.1007/s00438-012-0708-6.
Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A. R., Bender, D. et al. (2007).
PLINK: a tool set for whole-genome association and population-based linkage analyses.
American journal of human genetics 81 (3), pp. 559–575. DOI: 10.1086/519795.
R Core Team (2018). R: A language and environment for statistical computing. R Foundation
for Statistical Computing, Vienna, Austria.
Ramos, A. M., Crooijmans, R. P. M. A., Affara, N. A., Amaral, A. J., Archibald, A. L., Beever,
J. E. et al. (2009). Design of a high density SNP genotyping assay in the pig using SNPs
identified and characterized by next generation sequencing technology. PloS one 4 (8),
e6524. DOI: 10.1371/journal.pone.0006524.
Rückert, C. & Bennewitz, J. (2010). Joint QTL analysis of three connected F2-crosses in pigs.
Genetics, selection, evolution: GSE 42, p. 40. DOI: 10.1186/1297-9686-42-40.
Sandor, C. & Georges, M. (2008). On the detection of imprinted quantitative trait loci in line
crosses: effect of linkage disequilibrium. Genetics 180 (2), pp. 1167–1175. DOI:
10.1534/genetics.108.092551.
Varona, L., Legarra, A., Toro, M. A. & Vitezica, Z. G. (2018). Non-additive Effects in Genomic
Selection. Frontiers in genetics 9, p. 78. DOI:10.3389/fgene.2018.00078.
Wellmann, R. & Bennewitz, J. (2012). Bayesian models with dominance effects for
genomic evaluation of quantitative traits. Genetics research 94 (1), pp. 21–37. DOI:
10.1017/S0016672312000018.
Wickham, H. (2009). Ggplot2. Elegant graphics for data analysis.
Chapter 3
GWAS for meat and carcass traits using imputed sequence
level genotypes in pooled F2-designs in pigs
Clemens Falker-Gieske 1,*, Iulia Blaj 2,*, Siegfried Preuß3, Jörn Bennewitz3, Georg Thaller2
and Jens Tetens1,4
1Department of Animal Sciences, Georg-August-University, Göttingen, Germany
2Institute of Animal Breeding and Husbandry, Kiel University, Kiel, Germany
3Institute of Animal Husbandry and Breeding, University of Hohenheim, Stuttgart, Germany
4Center for Integrated Breeding Research, Georg-August-University, Göttingen, Germany
*contributed equally
Published in G3 Genes|Genomes|Genetics
Abstract
In order to gain insight into the genetic architecture of economically important traits in pigs and
to derive suitable genetic markers to improve these traits in breeding programs, many studies
have been conducted to map quantitative trait loci. Shortcomings of these studies were low
mapping resolution, large confidence intervals for quantitative trait loci-positions and large
linkage disequilibrium blocks. Here, we overcome these shortcomings by pooling four large F2
designs to produce smaller linkage disequilibrium blocks and by resequencing the founder
generation at high coverage and the F1 generation at low coverage for subsequent imputation
of the F2 generation to whole genome sequencing marker density. This led to the discovery of
more than 32 million variants, 8 million of which have not been previously reported. The
pooling of the four F2 designs enabled us to perform a joint genome-wide association study,
which led to the identification of numerous significantly associated variant clusters on
chromosomes 1, 2, 4, 7, 17 and 18 for the growth and carcass traits average daily gain, back fat
thickness, meat fat ratio, and carcass length. We not only confirmed previously reported
quantitative trait loci but also discovered new ones. As a result, several new candidate genes are
discussed, among them BMP2 (bone morphogenetic protein 2), which we recently discovered
in a related study. Variant effect prediction revealed that 15 high impact variants for the traits
back fat thickness, meat fat ratio and carcass length were among the statistically significantly
associated variants.
Introduction
Mapping experiments in livestock generally serve two purposes. The first is to understand the
genetic architecture of quantitative traits, and to derive and prove new hypotheses of trait
expression. The second is the identification of genetic markers that may be useful for livestock
breeding. Many quantitative trait loci (QTL) mapping experiments have been carried out
over the last decades (see the review by Rothschild et al., 2007), mainly in experimental
F2 crosses established from two outbred founder pig breeds. In early studies, genotyping was
mainly achieved using microsatellite markers and mapping was achieved through linkage
analysis (see the overview in Knott, 2005). These designs were set up to enable QTL detection
with high power, but they suffered from a low mapping resolution and large confidence
intervals for QTL positions. This was partly due to the limited number of meiosis cycles
exploited in these designs in conjunction with small population sizes of typically 300 to 500 F2
individuals. Furthermore, this approach assumes the divergent fixation of the QTL alleles in the
founder breeds, and highly different gene frequencies and variation within these breeds were
not considered (Nagamine et al., 2003). The breed Piétrain, for instance, has been selected for
growth and meat yield for many generations and still exhibits a large genetic variation for these
traits (Wellmann et al., 2013). More recent QTL-mapping experiments utilized genome-wide
association studies (GWAS), which, in contrast to linkage analyses, exploit historical meioses
and rely on linkage disequilibrium (LD), requiring high marker densities. The precision of
GWAS is then a function of LD block lengths and the number of individuals analyzed, which
in turn limits the usefulness of its application in F2 designs (Hayes and Goddard, 2001).
However, enormous efforts have been made in the establishment of these mapping populations,
usually including extensive phenotyping far beyond what would be available in field
populations. It would thus be desirable to revisit these resources using current genotyping and
sequencing technologies, which would require an increase in the number of individuals and a
decrease in the LD block lengths. A recent simulation study showed that this can be achieved by
pooling F2 designs, particularly when founder breeds are closely related and QTLs are
segregating in one founder breed (Schmid et al., 2018). This approach has already been
successfully applied based on medium density SNP chip data (Blaj et al., 2018; Stratz et al.,
2018).
With the aim to overcome the aforementioned limits in mapping resolution and to fully exploit
the potential of the resource populations, we pooled four well-characterized F2 designs (Table
1), three of them having the founder breed Piétrain in common. Twenty-four founder animals
were genotyped by high coverage whole genome sequencing (WGS) and 91 of the F1 animals
were sequenced at a low coverage for subsequent imputation to a high coverage WGS level. A
total of 2,657 F2 animals that were genotyped with the 62K Illumina PorcineSNP60 BeadChip
(Ramos et al., 2009) were imputed to WGS levels with pedigree information and analyzed in a
joint GWAS (see workflow in Figure 1). As a proof of concept, four relevant production traits
were analyzed: average daily gain (ADG), back fat thickness (BFT), meat to fat ratio (MFR),
and carcass length (CRCL).
Material and methods
Description of resource populations and phenotypes
Four well-characterized experimental populations were pooled for this study. Detailed
descriptions of the resource populations are given by Borchers et al. (2000) and Rückert and
Bennewitz (2010); hence, they are only described briefly here. The largest population was
obtained from five purebred Piétrain boars, one Large white sow and six crossbred Landrace x
Large white sows. The other three populations stemmed from a Meishan boar or Wild boar crossed
with either Piétrain or Meishan females. The Wild boar and three Piétrain females were common
founders in three of the crosses. The F2 generation was the result of repeatedly crossing F1
boars with F1 sows in order to obtain large full-sib families. In total, 2772 animals from the
crosses were used (24 F0-generation pigs, 91 F1-generation pigs, 2657 F2-generation pigs), and
genomic DNA for genotyping was extracted from their blood samples (Table 1). The F0 and F1
animals selected for sequencing were generally chosen according to the number of F2
individuals, i.e. we prioritized individuals from which large families were derived. Four
phenotypic traits were considered: ADG, BFT, MFR, and CRCL. The phenotypes were pre-corrected
for systematic effects (e.g. stable, slaughter month) and for the effect of the RYR1 gene
(Fujii et al., 1991) using a general linear model. Trait definitions, descriptive statistics
and information about the pre-adjustment and the fixed effects used per cross can be found in
Blaj et al. (2018).
Figure 1. Genotyping workflow. The 24 founder animals were sequenced at high coverage;
variants were called with GATK 4.0 and phased with Beagle 5.0. The 91 F1 animals were sequenced
at low coverage and variants were called with GATK 3.8 and BCFtools mpileup. The F1
dataset was imputed using Beagle 4.0 and pedigree information, with the phased founders as the
reference panel for haplotype structure. The imputed F1 dataset was then merged with the F0
variant call dataset and phased with Beagle 5.0. Finally, the 2657 chip genotyped F2 individuals
were imputed to WGS levels with Beagle 4.0 and pedigree information, with the merged and phased
founder/F1-imputed dataset as the reference panel.
Table 1. Per cross information of the sequenced individuals (F0 and F1) and SNP array
genotyped individuals (F2). F0 and F1 animals served as the reference panel for the imputation
of the F2 generation to sequence level for subsequent genome wide association analyses.
Cross/Generation F0* F1 F2
Piétrain x (Large white x Landrace)/Large white 13 55 1750
Meishan x Piétrain 8 19 304
Wild boar x Piétrain 6 17 291
Wild boar x Meishan 1 0 312
Total 24* 91 2657
*Four founders are common among crosses
Sequencing
A total of 24 founder animals were sequenced with an average coverage of 19x at the
sequencing facility of the University of Hohenheim. Out of 17 F1 families, 91 animals were
sequenced with an average coverage of 0.9x. All paired-end sequencing (read length 2 x 100 bp)
was done on an Illumina HiScan SQ using TruSeq SBS v3 Kits. For the library construction,
the DNA samples were fragmented on a Covaris S220 ultrasonicator. Parameters were adjusted
to yield 350 bp inserts. Fragment length was measured with High Sensitivity DNA Chips on an
Agilent Bioanalyzer. Sequencing adapters and indexes were ligated using Illumina’s TruSeq
DNA PCR-Free Library Prep Kits. Quantification of libraries was done by qPCR using KAPA
Library Quant Kits. Flow cells were prepared using an Illumina cBot and TruSeq PE v3 Cluster
kits. Raw sequencing data were demultiplexed and converted into FASTQ files using Illumina’s
CASAVA software.
Mapping and variant detection
Mapping and variant calling of the F0 generation was performed according to the GATK best
practice pipeline using GATK v. 4.0 (McKenna et al., 2010) and genome assembly Sus scrofa
11.1 (GCA_000003025.6 provided by Swine Genome Sequencing Consortium on NCBI). Base
quality score recalibration was performed with dbSNP build 150 as the knownSites dataset.
Truth datasets used for Variant Quality Score Recalibration (VQSR) were as follows. SNPs:
Illumina Infinium PorcineSNP60 v2 BeadChip and Affymetrix Axiom PorcineHD. INDELs:
High confidence fraction (filter settings: QD 15.0, FS 200.0, ReadPosRankSum 20.0) of the
PigVar database. Training dataset for SNP VQSR was also a high confidence fraction of the
PigVar database (filter settings: QD 21.5, FS 60.0, MQ 40.0, MQRankSum 12.5,
ReadPosRankSum 8.0, SOR 3.0) (Zhou et al., 2017). A truth sensitivity of 99.0 was chosen for
SNPs and INDELs. The known dataset for SNP and INDEL VQSR was dbSNP build 150. Since
SNPs were filtered with two truth datasets, a Ti/Tv-free recalibration according to the GATK
best practice guidelines was applied to the data. Low coverage sequencing reads of F1 animals
were processed according to the GATK best practice guidelines with the following deviations.
SNP calling was performed using both GATK HaplotypeCaller v. 3.8 in joint mode (settings
minPruning 1 and minDanglingBranchLength 1) and BCFtools mpileup v. 1.9 (Li et al.,
2009). INDELs in the F1 variant call dataset were disregarded due to low sequencing
depth. An intersection variant call set between HaplotypeCaller, mpileup and the founder SNPs
was created and stringently filtered with the following settings: QD 30.0, FS 60.0, MQ 40.0,
QUAL 300.0.
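The intersection and filtering step described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the record type `SnpCall`, the function names and the toy data are hypothetical, the annotation values are assumed to come from the HaplotypeCaller records, and the direction of each comparison (lower bounds for QD, MQ and QUAL, an upper bound for FS) is an assumption, since the text only lists the threshold values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnpCall:
    """Hypothetical minimal SNP record with the annotations used for filtering."""
    chrom: str
    pos: int
    ref: str
    alt: str
    qd: float
    fs: float
    mq: float
    qual: float

    @property
    def key(self):
        # A SNP is identified by position and alleles.
        return (self.chrom, self.pos, self.ref, self.alt)


# Threshold values from the text; the comparison directions are assumptions.
MIN_QD, MAX_FS, MIN_MQ, MIN_QUAL = 30.0, 60.0, 40.0, 300.0


def passes_filter(call):
    return (call.qd >= MIN_QD and call.fs <= MAX_FS
            and call.mq >= MIN_MQ and call.qual >= MIN_QUAL)


def intersect_and_filter(haplotypecaller_calls, mpileup_keys, founder_keys):
    """Keep SNPs found by both callers and in the founders, then filter."""
    shared = mpileup_keys & founder_keys
    return [c for c in haplotypecaller_calls
            if c.key in shared and passes_filter(c)]


# Toy data (hypothetical values).
hc_calls = [
    SnpCall("1", 100, "A", "G", 32.0, 10.0, 60.0, 500.0),  # kept
    SnpCall("1", 200, "C", "T", 12.0, 10.0, 60.0, 500.0),  # fails QD
    SnpCall("1", 300, "G", "A", 35.0, 5.0, 60.0, 900.0),   # absent in founders
]
mpileup = {c.key for c in hc_calls}
founders = {hc_calls[0].key, hc_calls[1].key}
kept = intersect_and_filter(hc_calls, mpileup, founders)
print([c.pos for c in kept])
```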
Haplotype construction and imputation
To make use of the most recent phasing algorithms, Beagle 5.0 was used for all phasing
operations (Browning et al., 2018). Beagle 4.0 was applied for genotype imputation since it is
the latest version that supports the usage of pedigree information (Browning and Browning,
2007). Haplotype phasing of the F0 generation variant call set was done using Beagle 5.0 and
subsequent imputation with pedigree information of the F1 low coverage SNPs was achieved
with Beagle 4.0. F0 and imputed F1 variants were merged with GATK CombineVariants and
phased with Beagle 5.0. F2 generation 60k SNP chip data was imputed with Beagle 4.0 and
pedigree information with merged and phased F0 and F1 WGS level variants as the reference
panel. Imputation accuracy was assessed by constructing 24 F0 reference panels, each
with one animal left out. The 60k SNP chip genotypes of each F0 individual were then
imputed with Beagle 4.0 using the reference panel from which that individual was missing.
The 24 individual datasets were merged and, together with the F0 reference dataset,
converted to additive coding with Plink 1.9 (Chang et al., 2015). The coefficient of
determination (R²) for each variant on QTL harboring chromosomes was calculated with an
in-house R script.
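As an illustration of the per-variant accuracy measure, the following Python sketch computes the squared Pearson correlation between sequenced and imputed genotypes in additive (0/1/2) coding. It is not the in-house R script mentioned above; the function name and the example vectors are hypothetical, and returning `None` for a variant without variation in either vector is an assumption about how variants "where calculation was possible" were selected.

```python
from statistics import mean


def r_squared(true_dosages, imputed_dosages):
    """Squared Pearson correlation between two additive genotype vectors.

    Returns None when either vector is constant, because the correlation
    is undefined in that case.
    """
    if len(true_dosages) != len(imputed_dosages) or len(true_dosages) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mx, my = mean(true_dosages), mean(imputed_dosages)
    sxx = sum((x - mx) ** 2 for x in true_dosages)
    syy = sum((y - my) ** 2 for y in imputed_dosages)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((x - mx) * (y - my)
              for x, y in zip(true_dosages, imputed_dosages))
    return (sxy * sxy) / (sxx * syy)


# Hypothetical genotypes of four left-out animals at one variant.
sequenced = [0, 1, 2, 1]
imputed = [0, 1, 1, 1]
print(r_squared(sequenced, imputed))
```

Averaging these values over all variants of a chromosome gives a per-chromosome summary of the kind reported in the Results.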
Genome wide association studies and cluster assignment
Single-trait association analyses were performed with GCTA v. 1.92.4 beta 3 on the F2
population only (Yang et al., 2011). In order to perform a “leave one chromosome out” (LOCO)
analysis, multiple genomic relationship matrices (GRMs) were created from the F2 60k SNP
chip data by excluding each chromosome once with a minor allele frequency (MAF) cutoff of
1 %. Mixed linear model association analyses (MLMAs) were performed with imputed F2
variants for each chromosome separately using the GRM where the respective chromosome
was left out and a MAF cutoff of 1 %. To account for the pooled population structure, covariates
representing the different crosses (4 classes) were included in the MLMA. For further
downstream analysis, the significance threshold was established by applying a Bonferroni
correction (i.e. 0.05/number of independent tests). Manhattan plots were created with R. Clusters
incorporating potential genomic regions of interest were defined using the Manhattan Harvester
(MH) tool (Haller et al., 2019). MH provides a quality assignment for each peak via a general
quality score (GQS), which can be used as the main parameter for peak assessment. The GQS
is generated based on a trained mixed-effects proportional odds model using 16 different
parameters (e.g. maximal slope, height to width ratio) and human peak identification data. For
this study, variants with a p-value below 1.0 x 10−7 (option -inlimit) were included, and only
clusters with a GQS > 3.5 (on a scale from 1 to 5) were taken into account. Conditional
association analyses were performed by including single highly associated variants as fixed
effects in a LOCO analysis.
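The Bonferroni threshold described above (0.05 divided by the number of independent tests) can be sketched in Python as follows. The number of tests and the p-values are hypothetical placeholders, since the number of independent tests used in this study is not stated here.

```python
import math

ALPHA = 0.05


def bonferroni_threshold(n_independent_tests, alpha=ALPHA):
    """Genome-wide significance threshold: alpha / number of tests."""
    if n_independent_tests <= 0:
        raise ValueError("number of tests must be positive")
    return alpha / n_independent_tests


def significant_variants(pvalues, n_independent_tests, alpha=ALPHA):
    """Return the variants whose p-value is below the Bonferroni threshold."""
    threshold = bonferroni_threshold(n_independent_tests, alpha)
    return {vid: p for vid, p in pvalues.items() if p < threshold}


# Hypothetical values for illustration only.
example_pvalues = {"var_a": 3.2e-12, "var_b": 4.0e-8, "var_c": 0.2}
n_tests = 1_000_000
threshold = bonferroni_threshold(n_tests)
hits = significant_variants(example_pvalues, n_tests)
print(threshold, -math.log10(threshold), sorted(hits))
```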
Variant effect prediction and gene enrichment analysis
To predict variant effects, the Ensembl Variant Effect Predictor (VEP) release 94 was utilized,
which is part of the Ensembl advanced programming interface (API) (McLaren et al., 2016).
The vep command using the clusters’ statistically significant variants was executed with the
following settings: --merged --force_overwrite --variant_class --symbol --nearest gene. To
provide further functional interpretation, the Database for Annotation, Visualization and
Integrated Discovery (DAVID) (Huang et al., 2009) was used for a systematic and integrative
analysis. The gene list from the VEP output was the input for DAVID and Sus scrofa genes
were considered as the background. Gene Ontology (GO) terms (i.e. cellular component,
molecular function, and biological process) from the functional annotation chart report which
were significantly overrepresented with an EASE Score (i.e. a modified Fisher Exact P-Value)
below 0.05 and with a gene count greater than or equal to 5 were retained.
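The retention rule for GO terms (EASE score below 0.05 and at least five genes) amounts to a simple filter. The sketch below assumes that the DAVID functional annotation chart has already been parsed into dictionaries; the key names `term`, `ease` and `count` as well as the example rows are hypothetical and do not reflect the actual DAVID column headers or results.

```python
def retain_terms(rows, max_ease=0.05, min_count=5):
    """Keep GO terms with EASE score < max_ease and gene count >= min_count."""
    return [row for row in rows
            if row["ease"] < max_ease and row["count"] >= min_count]


# Hypothetical chart rows for illustration only.
chart = [
    {"term": "GO:0000001 example term A", "ease": 0.010, "count": 8},
    {"term": "GO:0000002 example term B", "ease": 0.030, "count": 4},
    {"term": "GO:0000003 example term C", "ease": 0.200, "count": 12},
]
retained = retain_terms(chart)
print([row["term"] for row in retained])
```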
Statement on data and reagent availability
Sequencing data used to conduct this study will be made publicly available in the NCBI
Sequence Read Archive (SRA) upon publication of the article. Supplementary Tables
have been uploaded to GSA. Supplementary Table 1 contains the coefficients of determination
(R²) for each variant on QTL harboring chromosomes where calculation was possible.
Supplementary Table 2 contains the complete list of clusters identified in the GWAS with
additional supporting information for cluster assignment. GRMs from F2 60k genotypes
(File_S1.zip) were created by a "leave one chromosome out" approach using the program
"Genome-wide Complex Trait Analysis (GCTA) version 1.91.4 beta3". GWAS was conducted
with imputed sequence level F2 genotypes (Supplementary File_S2.zip) for each chromosome
using the GRM where the respective chromosome was left out (GRM command:
gcta64 --bfile SG_F2_chip_wo_chrNO --autosome --maf 0.01 --make-grm --out SG_F2_chip_wo_chrNO --thread-num 10 --autosome-num 18;
GWAS command:
gcta64 --mlma --covar Kiel_Hoh_cross.covar --bfile F2_beagle4.0_ped_ChrNO --grm SG_F2_chip_wo_chrNO --pheno TRAIT.pheno --out TRAIT_chrNO --maf 0.01 --thread-num 10;
replace NO with the respective chromosome number and TRAIT with the respective trait to be
analyzed). Phenotype files located in Supplementary File_S2.zip for the traits ADG, BFT, MFR,
and CRCL were used in the GWAS. Crosses were used as covariates in the GWAS and provided as a
gcta compatible covar file in Supplementary File_S4.zip. SNP locations can be found in the bim files
of the genotype data (Supplementary File_S1.zip and Supplementary File_S2.zip). Population
structure information is provided in form of a Beagle 4.0 compatible pedigree file
(Supplementary File_S3.zip). Raw sequencing data is accessible via the NCBI Sequence Read
Archive (SRA) under BioProject ID PRJNA553106. File_S5 contains the 60k chip genotype
data in variant call file (VCF) format. Genomic positions have been lifted to genome assembly
Sus scrofa 11.1 (GCA_000003025.6) and annotated with dbSNP build 150. gcta compatible
covar files for the conditional association analyses with top variants are provided in File_S6.
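To make the substitution of NO and TRAIT in the commands above explicit, the following Python sketch expands the two command templates for all 18 autosomes and the four traits. The templates are copied from the commands given above; the helper names are hypothetical, and the sketch only builds the command strings without running GCTA.

```python
# Command templates taken from the text; {chrom} replaces NO, {trait} replaces TRAIT.
GRM_TEMPLATE = (
    "gcta64 --bfile SG_F2_chip_wo_chr{chrom} --autosome --maf 0.01 --make-grm "
    "--out SG_F2_chip_wo_chr{chrom} --thread-num 10 --autosome-num 18"
)
GWAS_TEMPLATE = (
    "gcta64 --mlma --covar Kiel_Hoh_cross.covar "
    "--bfile F2_beagle4.0_ped_Chr{chrom} --grm SG_F2_chip_wo_chr{chrom} "
    "--pheno {trait}.pheno --out {trait}_chr{chrom} --maf 0.01 --thread-num 10"
)

AUTOSOMES = range(1, 19)  # pig autosomes SSC1 to SSC18
TRAITS = ("ADG", "BFT", "MFR", "CRCL")


def grm_commands():
    """One GRM command per left-out chromosome."""
    return [GRM_TEMPLATE.format(chrom=c) for c in AUTOSOMES]


def gwas_commands():
    """One association command per trait and chromosome."""
    return [GWAS_TEMPLATE.format(chrom=c, trait=t)
            for t in TRAITS for c in AUTOSOMES]


if __name__ == "__main__":
    for command in grm_commands() + gwas_commands():
        print(command)
```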
Results
Whole genome sequencing and variant calling
An average of 592,788,350 (SD = 38,623,216, MIN = 525,921,083, MAX = 649,442,924)
sequencing reads per sample were aligned to the reference genome in the F0 generation with an
average mapping efficiency of 99.37 %. In the F1 generation an average of 26,562,876 (SD =
6,619,568, MIN = 15,915,385, MAX = 59,300,856) reads were mapped to the reference
assembly with an average mapping efficiency of 99.33 %.
With respect to the number of SNPs detected in the founder population, 22,671,759 were
previously reported and 3,950,955 were novel. Furthermore, 1,482,139 of the INDELs were
previously reported and 4,335,345 were novel. Per chromosome, average distances among the
variants are summarized in Table 2. The Ti/Tv of SNPs in the founder population was 2.39,
whereas known SNPs had a Ti/Tv of 2.44 and novel SNPs had a Ti/Tv of 2.08. Due to the low
sequencing coverage (average = 0.96 x, min = 0.58 x, max = 2.14 x) only autosomal SNPs were
called in the F1 population. The raw output of HaplotypeCaller consisted of 20,055,697 known
and 3,529,441 novel SNPs and the raw output of mpileup contained 19,932,201 known and
3,291,758 novel SNPs. The intersection of the two datasets resulted in 19,264,662 known and
2,951,058 novel SNPs, whereas removing all SNPs that were not present in the founder variant
calling dataset led to a final number of 19,224,132 known and 2,911,780 novel raw SNPs.
After the application of a stringent filtering approach (see Material and Methods) 5,753,444
known and 741,155 novel SNPs remained in the variant calling dataset of the F1 population.
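The counts reported in this paragraph can be cross-checked with simple arithmetic. The Python sketch below sums the founder SNPs and INDELs and computes the fraction of F1 SNPs retained by the stringent filter; all numbers are taken from the text, while the variable names are chosen for illustration.

```python
# Founder (F0) variant counts as reported in the text.
known_snps, novel_snps = 22_671_759, 3_950_955
known_indels, novel_indels = 1_482_139, 4_335_345

total_variants = known_snps + novel_snps + known_indels + novel_indels
novel_variants = novel_snps + novel_indels

# F1 SNPs present in the founder call set, before and after filtering.
f1_raw = {"known": 19_224_132, "novel": 2_911_780}
f1_filtered = {"known": 5_753_444, "novel": 741_155}
retained_fraction = {k: f1_filtered[k] / f1_raw[k] for k in f1_raw}

print(total_variants, novel_variants)
print({k: round(v, 3) for k, v in retained_fraction.items()})
```

The two sums are 32,440,198 variants in total and 8,286,300 novel variants, in line with the "more than 32 million variants, 8 million of which have not been previously reported" stated in the Abstract.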
Table 2. Average distance between variants discovered in the founder population. A
number of 24 F0 animals were sequenced at high coverage and the average distances between
variants (SNPs and INDELs) were calculated per chromosome.
Chromosome Avg. distance (bp) SD
1 105.78729 196.8571
2 84.83889 201.4813
3 79.13779 183.627
4 80.16339 174.2767
5 73.37176 177.8913
6 85.34639 214.5176
7 79.12655 175.2067
8 78.90639 164.8448
9 79.61324 166.9157
10 56.5826 141.8648
11 67.6446 139.1412
12 65.41473 190.7484
13 102.70483 209.4908
14 83.58366 158.1296
15 93.02928 187.2321
16 71.18928 155.2467
17 65.91122 163.1281
18 76.95204 149.7542
Mean 82.21004 180.6988
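A summary such as Table 2 can be derived from the sorted variant positions of each chromosome. The following Python sketch shows one way to do this; it is not the procedure used in this study, the function name and example positions are hypothetical, and the use of the sample standard deviation (rather than the population standard deviation) is an assumption.

```python
from statistics import mean, stdev


def distance_stats(positions):
    """Mean and sample SD of the gaps (bp) between neighbouring variants."""
    ordered = sorted(set(positions))
    if len(ordered) < 3:
        raise ValueError("need at least three distinct positions")
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return mean(gaps), stdev(gaps)


# Hypothetical variant positions on one chromosome.
example_positions = [100, 150, 250, 260]
avg_gap, sd_gap = distance_stats(example_positions)
print(avg_gap, sd_gap)
```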
Identification of local drops in imputation accuracy
To detect local inaccuracies in the imputed data, we imputed chip data from each founder with
the remaining 23 founders as a reference panel. These data do not provide information about
the imputation accuracy of the actual experiment, since pedigree information could not be used. The
coefficient of determination for each variant located on a chromosome harboring relevant QTL
was determined where feasible. The average coefficients of determination for each chromosome
analyzed are summarized in Table 3 (complete analysis results in Supplementary Table 1).
Table 3. Identification of local imputation inaccuracies. Chip data from each of the 24
founders was imputed using the remaining 23 founder animals as the reference panel.
Coefficients of determination (R²) were calculated for each variant in order to calculate average
R² for SSC1, SSC2, SSC4, SSC7, SSC17, and SSC18.
Chromosome Average R² SD
1 0.28 0.32
2 0.22 0.29
4 0.25 0.30
7 0.25 0.31
17 0.18 0.25
18 0.29 0.32
GWAS results and clusters
From the genome-wide association study conducted in the pooled F2 population, the following
numbers of variants exceeded the genome-wide significance threshold: 448, 17,105, 6,635, and
27,641 for ADG, BFT, MFR, and CRCL, respectively. Manhattan plots of the GWAS for the
four phenotypic traits are shown in Figure 2. A total of 120 clusters were designated by the MH
tool (i.e., 4 for ADG, 33 for BFT, 22 for MFR and 61 for CRCL) and they were located on the
following Sus scrofa chromosomes (SSC): 1, 2, 4, 5, 7, 17, and 18. The complete cluster list
with additional supporting information for cluster assignment can be found in Supplementary
Table 2. From each of the defined clusters, the top five variants were retained. The genes
incorporating or lying nearby these highly significant associations are presented in Table 4. The
clusters associated with the traits overlapped on several chromosomes, specifically on SSC2,
SSC4, and SSC7. The location and the extent of the overlapping clusters are depicted in Figure
3. Particular chromosomes had exclusive clusters assigned, e.g., SSC17 for CRCL and SSC18
for MFR. To evaluate all possible relations among the variants exceeding the significance
threshold for each trait, a Venn diagram was used (Figure 4). The highest number of common
variants (i.e., 6,859) was between BFT and CRCL and the second highest was between BFT
and MFR (i.e., 2,380). To get an estimate of systematic bias, quantile-quantile plots were
generated for all p-values from each GWAS (Supplementary Figure 2). As a measure of
association between observed and expected p-values, lambda values were calculated for all four
traits: λADG = 1.282319, λBFT = 1.333425, λCRCL = 1.422044 and λMFR = 1.35587.
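How the lambda values were computed is not detailed in the text. A common definition of the genomic inflation factor is the median of the observed 1-df chi-square statistics divided by the expected median under the null hypothesis (approximately 0.4549). The Python sketch below follows this definition under that assumption; the function name and the example p-values are hypothetical.

```python
from statistics import NormalDist, median

# Median of the chi-square distribution with one degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation(pvalues):
    """Genomic inflation factor (lambda) from two-sided p-values in (0, 1]."""
    standard_normal = NormalDist()
    # For a 1-df test, chi2 = z**2 with z the normal quantile of p / 2.
    chi2 = [standard_normal.inv_cdf(p / 2.0) ** 2 for p in pvalues]
    return median(chi2) / CHI2_1DF_MEDIAN


# Hypothetical p-values: a uniform grid (no inflation) and an inflated set.
uniform_p = [(i + 0.5) / 1001 for i in range(1001)]
inflated_p = [p ** 2 for p in uniform_p]
lambda_uniform = genomic_inflation(uniform_p)
lambda_inflated = genomic_inflation(inflated_p)
print(round(lambda_uniform, 3), lambda_inflated > 1.0)
```

Values close to 1 indicate that the observed p-values follow the null expectation, whereas values above 1 indicate inflation of the test statistics.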
VEP and high impact variants
To predict functional consequences on genes, the Ensembl VEP tool was employed. Multiple
transcripts per gene resulted in larger numbers of annotations, which is reflected in the higher
number of predicted effects as compared to the actual number of identified variants per trait.
All inferred consequences for the variants that were significant after Bonferroni correction
and their percentage breakdown per trait are summarized in Table 5. The large majority (over
70%) of the consequences were
classified as intron variants. According to the severity of the variant consequence, intron
variants are assigned a modifier impact, which means that predictions are difficult to
make or there is no solid evidence of impact. Variants inferred to have a disruptive impact
on the protein, leading to protein truncation, loss of function or causing nonsense-mediated
decay were of further interest. These significant high impact variants (Table 6) were mostly
located on SSC7, with the exception of SSC2:rs1110687780 (splice donor variant) affecting
TCN1 for the trait MFR. For BFT, the most severe consequences were located in the genes
C6orf89, PI16, DST, and PRIM2, while for CRCL disruptive impact variants were found in
NEU1, four novel genes, ABCD4, DST, PRIM2, and LPCAT4. Notably, the same two splice
donor variants affect the common genes for BFT and CRCL: DST and PRIM2. Sorting
Intolerant From Tolerant (SIFT) scores were determined for all significant missense variants
and are summarized in Supplementary Table 3 (Ng and Henikoff, 2003).
Figure 2. Manhattan plots of the –log10 p-values for association of variants with the traits
(A) average daily gain (ADG), (B) back fat thickness (BFT), (C) meat to fat ratio (MFR),
and (D) carcass length (CRCL). Variants with p-values > 0.001 were excluded from the plots.
Figure 3. Cluster overlap for (A) SSC2, (B) SSC4 and (C) SSC7 for all traits (average daily
gain (ADG) – red, back fat thickness (BFT) – green, meat to fat ratio (MFR) – purple, and
carcass length (CRCL) – blue). The heights of the clusters are according to the top variant
(–log10 p-value) within each given cluster.
Figure 4. Variants concordance and discordance between the traits average daily gain
(ADG), back fat thickness (BFT), meat to fat ratio (MFR), and carcass length (CRCL).
The Venn diagram contains statistically significant variants. Intersections between traits include
the number of common variants. Numbers of variants that were exclusively found in the single
traits are outside of intersections.
Figure 5. Imputation accuracy on SSC2 between positions 1,250,000 and 2,000,000. IGF2
is located between bp 1,469,183 and 1,496,417.
Table 4. Top associated genes for average daily gain (ADG), back fat thickness (BFT),
meat to fat ratio (MFR), and carcass length (CRCL) identified in the GWAS. Genes
incorporating or nearby the top 5 variants in the clusters are listed with chromosome and cluster
numbers.
Trait; SSC; Cluster no./SSC; Genes
ADG; 2; 1; SHANK2
ADG; 4; 1; RPS20, LYN, PLAG1
ADG; 7; 2; HMGCLL1, TFEB
BFT; 1; 1; ZNF462, ENSSSCG00000005432
BFT; 2; 7; LOC102158414, PGA5, MRPL16, ENSSSCG00000013151, ZFP91, CTNND1, ENSSSCG00000024984, OR9Q2, OR10Q1
BFT; 4; 1; RPS20
BFT; 7; 24; SCGN, LRFN2, DAAM2, C7H6orf223, C7H6orf132, RIPOR2, CARMIL1, BMP5, ENSSSCG00000001500, KIFC1, C6orf106, PPARD, FKBP5, CPNE5, ENSSSCG00000001574, TMEM217, LRFN2, MRPS10, TRERF1, RUNX2, RCAN2, MEP1A, ADGRF5, PTCHD4, ENSSSCG00000001734, PGK2, IREB2, ABHD17C, GSTA2, CRABP1, CRISP3, PRSS16, TBC1D2B, ENSSSCG00000038708, BCL2A1, E2F3
MFR; 1; 3; LOC106507123, TMEM245, SCAI, ABL1, RAPGEF1, CFAP77, DDX31, MAPKAP1
MFR; 2; 8; LOC102158414, LOC110259166, LOC110259708, TMEM80, DEAF1, EHD1, MACROD1, ATL3, NAV2, DHCR7, ENSSSCG00000028537, CTTN, SHANK2, ENSSSCG00000036180 (KRTAP5-5-like), NELL1
MFR; 4; 3; PDE7A, SNTG1, RPS20
MFR; 5; 1; ENSSSCG00000034097
MFR; 7; 1; ENSSSCG00000001500
MFR; 18; 6; PPP1R3A, IMMP2L, LRRC4, EXOC4, SND1, ELMO1, MDFIC, TFEC
CRCL; 1; 1; FNBP1
CRCL; 7; 52; VEGFA, FLRT2, LRFN2, MCTP2, DAAM2, PGF, SV2B, MAX, COL21A1, KLHL25, NPAS3, LOC110261756, NHLRC1, TPMT, CDKAL1, GMNN, RIPOR2, MDC1, DDX39B, HMGCLL1, ENSSSCG00000001500, C6orf106, KCTD20, SRSF3, ENSSSCG00000001612, FOXP4, TFEB, RCAN2, ADGRF1, MUT, CRISP1, TFAP2D, PKHD1, BNC1, ENSSSCG00000001827, TMEM266, NKX2-1, PRKD1, LPCAT4, NR2F2, MCTP2, SLCO3A1, ENSSSCG00000002270, FUT8, ENSSSCG00000002317, DPF3, PTGR2, ZNF410, FAM161B, EIF2B2, MLH3, VIPAS39, SPTLC2, ENSSSCG00000010328, RF01299, RF00100, HMGN4, NRXN3, ID4, SYNJ2BP, ZFP36L1, RAD51B, AVEN, ANG, GCM1, FOXG1, ENSSSCG00000033840, ENSSSCG00000035274, RSL24D1, ENSSSCG00000036697, ENSSSCG00000037115, ENSSSCG00000038445, CEMIP, SLC25A21, SPTSSA, ENSSSCG00000039877, DIO2, ENSSSCG00000040930
CRCL; 17; 8; BMP2, JAG1, SPTLC3, TMX4
Table 5. Results of variant effect prediction for the production traits average daily gain
(ADG), back fat thickness (BFT), meat to fat ratio (MFR), and carcass length (CRCL).
Bonferroni-corrected variants were analyzed.
Predicted effect ADG ADG % BFT BFT % MFR MFR % CRCL CRCL %
Missense variant 2 0.1580 962* 0.6523* 58 0.1893 787* 0.4750*
Frameshift variant 0 0 0 0 0* 0* 6* 0.0036*
Start lost 0 0 1 0.0007 0 0 0 0
Stop gained 0 0 0 0 0 0 1 0.0006
Inframe deletion 0 0 1* 0.0007* 0 0 2 0.0012
Intron variant 936 73.9336 116815 79.2090 21556* 70.3525* 131590 79.4275
5 prime UTR variant 0 0 229* 0.1553* 89 0.2905 277* 0.1672*
3 prime UTR variant 8 0.6319 1160* 0.7866* 1242 4.0535 1543* 0.9314*
Upstream gene variant 50 3.9494 5680* 3.8514* 2195* 7.1638* 5300* 3.1991*
Downstream gene variant 44* 3.4755* 5791* 3.9267* 3442 11.2337 6893* 4.1606*
Frameshift variant, splice region variant 0 0 2 0.0014 0 0 0 0
Missense variant, splice region variant 0 0 41* 0.0278* 0 0 75* 0.0453*
Splice region variant, non coding transcript exon variant 0 0 2 0.0014 3* 0.0098* 5 0.0030
Splice region variant, 3 prime UTR variant 0 0 3* 0.0020* 3* 0.0098* 0 0
Splice region variant, intron variant, non coding transcript variant 0 0 2* 0.0014* 4 0.0131 20* 0.0121*
Splice region variant, intron variant 0 0 426* 0.2889* 41* 0.1338* 489* 0.2952*
Splice region variant, synonymous variant 0 0 21 0.0142 22* 0.0718* 28* 0.0169*
Splice donor variant 0 0 36 0.0244* 1* 0.0033 37 0.0223
Intergenic variant 109 8.6098 3318 2.2498 644 2.1018 9909 5.9811
Synonymous variant 0 0 2837 1.9237 214* 0.6984* 2751 1.6605
Intron variant, non coding transcript variant 117* 9.2417* 9636 6.5339 1060* 3.4595* 5759* 3.4761*
Non coding transcript exon variant 0 0 514* 0.3485* 66 0.2154 200* 0.1207*
Start lost, start retained variant, 5 prime UTR variant 0 0 0 0 0 0 1* 0.0006*
Total 1266 147477 30640 165673
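The percentage columns of Table 5 are the consequence counts divided by the column total. As a worked check, the Python sketch below recomputes the ADG column from its non-zero counts; the counts are taken from the table, while the function and variable names are chosen for illustration.

```python
# Non-zero ADG consequence counts from Table 5.
adg_counts = {
    "Missense variant": 2,
    "Intron variant": 936,
    "3 prime UTR variant": 8,
    "Upstream gene variant": 50,
    "Downstream gene variant": 44,
    "Intergenic variant": 109,
    "Intron variant, non coding transcript variant": 117,
}


def percentage_breakdown(counts):
    """Return the column total and each count as a percentage of it."""
    total = sum(counts.values())
    percentages = {effect: round(100.0 * n / total, 4)
                   for effect, n in counts.items()}
    return total, percentages


adg_total, adg_percent = percentage_breakdown(adg_counts)
print(adg_total)
for effect, value in adg_percent.items():
    print(f"{effect}: {value:.4f}")
```

The recomputed total (1266) and percentages (e.g. 73.9336 for intron variants) agree with the ADG columns of the table.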
Table 6. Statistically significant high impact variants that were discovered in the genome
wide association studies for the production traits average daily gain (ADG), back fat
thickness (BFT), meat to fat ratio (MFR), and carcass length (CRCL).
Trait; High impact consequence; Variant; Position (bp); Gene; Gene name
BFT; Start lost; SSC7:rs319855624; 32544657; C6orf89; chromosome 7 C6orf89 homolog
BFT; Frameshift variant, splice region variant; SSC7:._504514; 32606375; PI16; peptidase inhibitor 16
BFT; Frameshift variant, splice region variant; SSC7:._504513; 32606373; PI16; peptidase inhibitor 16
BFT; Splice donor variant; SSC7:rs80834233; 29157904; DST; dystonin
BFT; Splice donor variant; SSC7:rs327743463; 28571665; PRIM2; DNA primase subunit 2
MFR; Splice donor variant; SSC2:rs1110687780; 11630410; TCN1; transcobalamin 1
CRCL; Start lost, start retained variant, 5 prime UTR variant; SSC7:rs793752812; 23958518; NEU1; neuraminidase 1
CRCL; Stop gained; SSC7:rs334442580; 87783592; novel gene; -
CRCL; Frameshift variant; SSC7:._1165873; 97574140; ABCD4; ATP binding cassette subfamily D member 4
CRCL; Frameshift variant; SSC7:rs693811701; 48561663; novel gene; aurora kinase A-like
CRCL; Frameshift variant; SSC7:._1068730; 87783712; novel gene; -
CRCL; Frameshift variant; SSC7:._1068731; 87783718; novel gene; -
CRCL; Splice donor variant; SSC7:rs80834233; 29157904; DST; dystonin
CRCL; Splice donor variant; SSC7:rs327743463; 28571665; PRIM2; DNA primase subunit 2
CRCL; Splice donor variant; SSC7:rs331245426; 80150975; LPCAT4; lysophosphatidylcholine acyltransferase 4
Gene set analysis
GO functional enrichment analysis revealed eleven significantly overrepresented GO terms
including molecular functions (MF), biological processes (BP), and cellular components (CC).
A list containing the GO terms and the associated list of genes is presented in Table 7. For BFT
a GO-MF term was overrepresented and related to calcium ion binding (GO:0005509). Several
olfactory receptor genes were prevalent for the GO terms assigned to MFR (e.g. GO-BP
GO:0007186 G-protein coupled receptor signaling pathway, GO-MF GO:0005549 odorant
binding). The gene set for the CRCL trait was associated with two BP terms (GO:0001666
response to hypoxia and GO:0008283 cell proliferation) and two CC terms (GO:0045177 apical
part of the cell and GO:0031410 cytoplasmic vesicle).
Table 7. Most significant Gene Ontology (GO) terms from DAVID for the top associated
genes that were identified in genome wide association studies for the traits back fat
thickness (BFT), meat to fat ratio (MFR), and carcass length (CRCL).
Trait; Category; Term; Genes
BFT; MF; GO:0005509 calcium ion binding; DST, LOC100152993, SCGN, GUCA1B, ITPR3, CIB2, GUCA1A, RASGRP2
MFR; BP; GO:0007186 G-protein coupled receptor signaling pathway; OR5B3, LOC100623017, LOC106509349, LOC100512519, LOC100513457, OR9Q2, LOC100628183, LOC100511243, LOC100512154, LOC100514032, LOC100521066, LOC100519351, OR10Q1, LOC100511620, LOC106509346
MFR; CC; GO:0016021 integral component of membrane; ANO9, OR5B3, LOC100512519, LOC100519082, LOC100513457, LOC100628183, SIGIRR, LOC100512154, BET1L, LOC100521066, TMX2, OR10Q1, TMEM80, LOC100623017, LOC106509349, OR9Q2, LOC100511243, ZDHHC5, ATL3, LOC100514032, LRRC4, PPP1R3A, LOC100519351, LRRN3, LOC100511620, STX3, LOC100521938, CCDC136, LOC106509346, NRXN2
MFR; CC; GO:0005886 plasma membrane; OR5B3, EHD1, LOC100623017, LOC106509349, LOC100512519, OR9Q2, LOC100513457, LOC100628183, CTNND1, LOC100511243, ELMO1, LOC100512154, ZDHHC5, LOC100514032, LOC100521066, LOC100519351, STX3, LOC100511620, RABEPK, LOC106509346, RASGRP2
MFR; MF; GO:0004930 G-protein coupled receptor activity; OR5B3, LOC100623017, LOC106509349, LOC100512519, LOC100513457, OR9Q2, LOC100628183, LOC100511243, LOC100512154, LOC100514032, LOC100521066, LOC100519351, OR10Q1, GPR141, LOC100511620, LOC106509346
MFR; MF; GO:0004984 olfactory receptor activity; OR5B3, LOC100623017, LOC106509349, LOC100512519, LOC100513457, OR9Q2, LOC100628183, LOC100511243, LOC100512154, LOC100514032, LOC100521066, LOC100519351, OR10Q1, LOC100511620, LOC106509346
MFR; MF; GO:0005549 odorant binding; OR5B3, LOC100623017, LOC106509349, LOC100513457, OR9Q2, LOC100628183, LOC100512154, LOC100514032, LOC100521066, LOC100519351, OR10Q1, LOC100511620, LOC106509346
CRCL; BP; GO:0001666 response to hypoxia; ANG, TGFB3, PGF, PLAT, VEGFA
CRCL; BP; GO:0008283 cell proliferation; FURIN, FAM83B, ZFP36L1, MORF4L1, BYSL, RASGRF1
CRCL; CC; GO:0045177 apical part of cell; ADGRF5, VASH1, PLAT, HOMER2, BYSL
CRCL; CC; GO:0031410 cytoplasmic vesicle; ANG, ADGRF5, FES, NEU1, GRM4, RHGC
Discussion
Genotyping strategy
The genotyping strategy that we developed for this study is outlined in Figure 1. Briefly, 24 F0
pigs were subjected to high coverage Illumina short read sequencing. In addition, 91 F1
animals were sequenced at low coverage and imputed to high coverage WGS levels in order to
allow phasing. The 2657 F2 animals were chip genotyped and imputed using a merged dataset of F0 and
imputed F1 animals as the reference panel. All imputation steps involved pedigree information. In contrast to
a population-based strategy, this approach does not rely on a large reference panel but on the
relatedness of individuals. In general, the genotyping strategy can be considered reliable since
the majority of the QTLs identified had already been described for the four traits analyzed in this
study (cross-referenced with the Pig QTL database (Hu et al., 2019)). Nevertheless, we expected to
identify a variant associated with muscle mass and fat deposition in exon 2 of IGF2,
which has been extensively described to influence muscle development (Nezer et al., 1999).
The absence of IGF2 associated variants can be explained by a local drop in coefficients of
determination from an average of R² = 0.22 to R² = 0.03 in the genomic region where IGF2
resides (SSC2 1,469,183 – 1,496,417 bp, Figure 5). It must be pointed out that these coefficients
of determination cannot be used to draw conclusions about the actual accuracy of the
imputation. Since no pedigree information was included in the simulation, they can solely be used
to identify local inaccuracies, which were most likely due to assembly errors in the reference
genome.
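A local drop of this kind can be located by comparing windowed means of the coefficient of determination with the overall mean. The following is a minimal sketch of that idea; the function name, the window size and the threshold fraction are assumptions made for illustration and are not part of the pipeline used in this study.

```python
from statistics import mean

def flag_low_r2_windows(positions, r2_values, window_bp, fraction=0.5):
    """Return (window_start, mean_r2) for windows whose mean R² lies below
    `fraction` times the mean R² over all variants."""
    overall = mean(r2_values)
    windows = {}
    # Assign each variant to a fixed-size, non-overlapping window.
    for pos, r2 in zip(positions, r2_values):
        start = (pos // window_bp) * window_bp
        windows.setdefault(start, []).append(r2)
    return [
        (start, mean(vals))
        for start, vals in sorted(windows.items())
        if mean(vals) < fraction * overall
    ]
```

Windows returned by such a scan mark regions where association signals may be missed for technical rather than biological reasons.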
The genotyping approach presented in this study can be considered a reasonable strategy to
radically increase the marker density of large F2 populations to WGS levels. By sequencing the
founder individuals with high coverage and the F1 with low coverage, which are only a fraction
of the number of F2 animals, the approach provides an affordable opportunity to improve the
power and potential of otherwise obsolete datasets. Due to the relatedness of the animals, deep
sequencing of only a few animals is necessary, which renders the approach economically attractive.
Cluster identification and exploratory analysis
To fully exploit the potential of the four resource populations, the crosses were pooled and
further used for conducting GWAS. The increased sample size together with the increased
marker density ensures a high resolution that might allow the pinpointing of more specific
causative genes and mutations. Further experiments, e.g. Sanger sequencing of promising
regions, could clarify this. An F2 design implies long LD blocks,
a limitation that is counteracted to some extent by jointly analyzing the four designs. Lambda values
of 1.282319 to 1.422044 point to a moderate degree of p-value inflation in the GWAS, which
is most likely caused by the use of WGS data and a LOCO GWAS approach. We nevertheless
chose the LOCO approach in order to exploit the full depth and power of the dataset. To
better understand the closely linked association signals from GWAS, the following approach
was employed: i) clusters incorporating strong evidence for trait-associated chromosomal
regions were defined, ii) the effect of the significant variants was predicted and iii) a gene set
analysis was employed to identify sets of genes jointly associated with the traits of interest.
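For orientation, the genomic inflation factor lambda is commonly computed as the median of the observed 1-df chi-square statistics divided by the median expected under the null hypothesis (about 0.455). The sketch below follows that common definition; the function name is hypothetical, and it is not claimed to be the exact routine used to obtain the values reported here.

```python
from statistics import NormalDist, median

def genomic_inflation(p_values):
    """Lambda = median observed 1-df chi-square / expected median under the null.

    All p-values must lie in the interval (0, 1].
    """
    nd = NormalDist()
    # A two-sided p-value corresponds to z = Phi^-1(p / 2); chi2 = z ** 2.
    # Using the lower tail keeps very small p-values numerically stable.
    chi2 = [nd.inv_cdf(p / 2.0) ** 2 for p in p_values]
    expected_median = nd.inv_cdf(0.75) ** 2  # median of chi-square with 1 df
    return median(chi2) / expected_median
```

Values close to 1 indicate no inflation, whereas values clearly above 1 point to inflated test statistics.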
The quantitative traits considered for this study have been investigated in the past and are
mostly well represented in the Pig QTL database (Hu et al., 2019), except for MFR. The clusters
assigned to each trait were compared with the QTL regions from the database. For MFR,
additional fat-related traits (e.g. fat percentage in the carcass and fat-cuts percentage) were
considered in order to allow an adequate comparison given that the trait has few records in the
database and the trait definition can be country dependent. Most of the clusters overlapped or
were in the vicinity of the previously reported QTLs. This was expected as the database has
been recently updated and also includes our previous results (Blaj et al., 2018) using SNP chip
data and three out of the four pig populations which were taken into account here. Some of the
earlier reported QTLs in the database span large genomic regions (e.g. > 5 Mb). It is
assumed that many of these large QTL regions are in fact not due to a single mutation
but represent haplotype effects caused by several causative variants (Andersson, 2009). In
the current study, we were able to assign numerous clusters within these regions, which implies
that a higher genomic resolution was achieved and that it may be possible to disentangle distinct
quantitative trait nucleotides.
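The comparison of clusters with previously reported QTL regions amounts to an interval overlap query. A minimal sketch is given below; the tuple layout, the function name and the optional flanking distance used to capture QTLs "in the vicinity" are assumptions for illustration and do not describe the actual comparison procedure.

```python
def overlapping_qtls(cluster, qtls, flank_bp=0):
    """Return the ids of QTLs that overlap a cluster or lie within flank_bp of it.

    cluster: (chromosome, start_bp, end_bp)
    qtls:    iterable of (qtl_id, chromosome, start_bp, end_bp)
    """
    chrom, start, end = cluster
    hits = []
    for qtl_id, q_chrom, q_start, q_end in qtls:
        if q_chrom != chrom:
            continue
        # Two closed intervals intersect when each starts before the other ends.
        if q_start <= end + flank_bp and start - flank_bp <= q_end:
            hits.append(qtl_id)
    return hits

# Hypothetical records, for illustration only.
example_qtls = [("qtl_a", "SSC7", 20_000_000, 35_000_000),
                ("qtl_b", "SSC7", 90_000_000, 99_000_000),
                ("qtl_c", "SSC2", 1_000_000, 2_000_000)]
print(overlapping_qtls(("SSC7", 24_000_000, 24_500_000), example_qtls))
```

Several narrow clusters falling inside one broad database QTL is the pattern described above for regions larger than 5 Mb.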
Conditional association analyses, in which the top variant was included as a fixed effect in the MLMA,
were carried out in order to gather statistical evidence for putative causality (Cohen-Zinder et
al., 2005) and were specifically applied to CRCL and BFT on SSC7. This chromosome exhibits
the highest number of clusters (SM with Clusters) and the strongest association signals. After
including the top variant (rs81228492) for BFT, only one well-supported peak remained above the
significance threshold (Supplementary Figure 1), meaning that there is additional genetic
variation within this region. Similarly, for CRCL the two top variants (rs333021601 and
rs319044994) representing the two different significant genomic regions were included
alternately in the model. After fixing the effect of the latter variant, the surrounding significant
region disappeared, pointing to the possibility that there could be only one QTL responsible for
CRCL on SSC7 around the 99 Mb region. An alternative or additional explanation could be the
presence of long LD blocks, long-range LD and/or various epistatic interactions among the loci.
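The principle of conditioning on a top variant can be illustrated with an ordinary least-squares model. This sketch deliberately omits the random polygenic effect of the MLMA and is therefore a simplification, not the analysis performed in this study; all function names are hypothetical.

```python
from statistics import mean

def _residuals(values, covariate):
    """Residuals of a simple least-squares regression of values on covariate."""
    mv, mc = mean(values), mean(covariate)
    sxx = sum((c - mc) ** 2 for c in covariate)  # zero if covariate is monomorphic
    slope = sum((c - mc) * (v - mv) for c, v in zip(covariate, values)) / sxx
    return [(v - mv) - slope * (c - mc) for c, v in zip(covariate, values)]

def conditional_effect(phenotype, variant, top_variant):
    """Effect of `variant` on `phenotype` after adjusting for `top_variant`.

    By the Frisch-Waugh-Lovell theorem, regressing the residualised phenotype
    on the residualised variant gives the same coefficient as a regression
    that contains both variants and an intercept.
    """
    y_res = _residuals(phenotype, top_variant)
    x_res = _residuals(variant, top_variant)
    return sum(x * y for x, y in zip(x_res, y_res)) / sum(x * x for x in x_res)
```

If the conditional effect of a neighbouring variant shrinks towards zero, its original signal can be attributed to LD with the top variant; a persisting effect suggests additional genetic variation in the region.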
The overlap among the BFT and CRCL significant variants (see Figure 3 and Figure 4),
localized mostly in the genomic region 24-32 Mb, indicates the existence of pleiotropic loci for
the two traits. When the MLMA on CRCL was conditioned on the top BFT variant (rs81228492) as a
fixed effect, and the MLMA on BFT on the top CRCL variant (rs333021601), the initially
associated clusters and those nearby dropped in the intensity of the association signals
(Supplementary Figure 1), supporting the presence of pleiotropic loci. It is also noteworthy that
CRCL might be influenced by the number of thoracolumbar vertebrae (Rohrer et al., 2015).
The variant that has been associated with a higher number of vertebrae is a large indel in
intron 1 of the VRTN gene (Fan et al., 2013). We were not able to discover this variant because the
genotyping pipeline applied in this study covers only small indels.
In order to gain insight into the possible genetic mechanisms that control the traits, an
enrichment analysis of gene function was performed with DAVID, prioritizing the GO
terms. The GO-MF calcium ion binding term found for BFT supports the relationship between
the calcium ion, food intake and lipid metabolism previously described in the literature (Cui et
al., 2017). Furthermore, one of the genes in this group is DST, a strong candidate gene for which
high impact variants were found via VEP, which is discussed in detail below. A GO-BP term
related to cell proliferation comprised the FAM83B gene, which is the gene incorporating the
top variant found for CRCL. Interestingly, the majority of the genes included in the
overrepresented terms for MFR were olfactory receptors. This enrichment is a consequence of the
MFR-identified clusters overlapping regions that are rich in various olfactory receptor genes.
This particular gene family is known to have expanded substantially over time within the
Sus scrofa genome (Nguyen et al., 2012).
Variant effect prediction
ADG: A QTL for ADG found on SSC7 comprises 115 statistically significant intron variants
and 83 variants upstream (min. p-value 8.71 x 10^-14) of the HMGCLL1 gene, which was shown
by Comuzzie et al. to be associated with childhood obesity in the Hispanic population and to
influence creatinine levels. Another QTL on SSC2 contains 112 intron variants in SHANK2
(min. p-value 1.33 x 10^-12). SHANK2 was also shown to be associated with childhood obesity
in the same study and to have an influence on estradiol blood concentrations (Comuzzie et al.,
2012). A third QTL on SSC4 harbors 2 intron and 12 downstream variants (min. p-value 1.06 x
10^-13) affecting LYN, which encodes the LYN proto-oncogene and was also identified by
Comuzzie et al. as correlated with the amount of fat mass in obese children (Comuzzie et al.,
2012). Six additional variants in the QTL on SSC4 (min. p-value 2.44 x 10^-12) lie in an
intergenic region 13,463 – 14,460 bp downstream of RPS20, a gene which in interplay with
GNL1 is critical for cell growth (Krishnan et al., 2018). Another likely candidate SNP to
influence ADG is an intron variant in the PLAG1 transcription factor (p-value 1.32 x 10^-11),
which is a regulator of IGF2 expression (Zatkova et al., 2004).
BFT: A QTL for BFT with a very prominent peak was detected on SSC7. The SNP with the
lowest p-value (6.63 x 10^-54) is an intron variant in the gene C6orf106. C6orf106 is a target of the
human miRNA hsa-miR-192, which has been identified to have regulatory functions in type 2
diabetes mellitus (Cui et al., 2016). The second top scoring SNP is an intron variant in the
RIPOR2 gene (p-value 4.34 x 10^-50). RIPOR2 expression and protein levels are upregulated
during muscle cell differentiation in human fetal muscle cells (Yoon et al., 2007). Another gene
containing top scoring variants on SSC7 is KIFC1 (7 intron variants, min. p-value 3.12 x 10^-47).
Overexpression of KIFC1 promotes cell proliferation in non-small cell lung cancer (Liu et al.,
2016). A cluster on SSC7 also contains 21 intron and 8 downstream variants in BMP5 (min.
p-value 1.91 x 10^-29), which induces cartilage and bone formation (Wozney et al., 1988). Six
variants downstream of the aforementioned RPS20 (min. p-value 1.90 x 10^-15) were found in
the cluster on SSC4.
MFR: GWAS for the MFR trait revealed a strong QTL on SSC2 with variant rs81327136
upstream of KRTAP5-5-like being the most significant (p-value 1.59 x 10^-23). Of 72 variants, 6
were located in KRTAP5-5-like introns and 66 in the vicinity of the gene. KRTAP5-5
controls cytoskeletal function and cancer cell vascular invasion (Berens et al., 2017).
Other variants found in clusters on SSC2 are located in or adjacent
to DEAF1 (8 intron variants, min. p-value 3.47 x 10^-29), which is a transcription factor that
regulates proliferation of epithelial cells (Barker et al., 2008) and that forms a
dominant-negative splice isoform in type 1 diabetes, which correlates with disease severity (Yip et al.,
2015). Clusters on SSC2 also harbor variants associated with SHANK2 (1,714 intron variants and
three 5' UTR variants, min. p-value 2.18 x 10^-26) and CTTN (188 up- and downstream variants, min.
p-value 1.53 x 10^-25). CTTN's protein product Cortactin binds to and is indirectly
phosphorylated by obesity factor PTP1B (Stuible et al., 2008). A noteworthy intron variant is
located in the vitamin D pathway gene DHCR7 (p-value 3.06 x 10^-25), which has been
associated with obesity traits in humans (Vimaleswaran et al., 2013). A total of 14 DHCR7
intron variants were above the significance threshold. A less prominent QTL on SSC4 harbors
variants in or close to the aforementioned gene RPS20 (17 downstream variants, min. p-value
1.51 x 10^-25) and in SNTG1 (19 intron variants, min. p-value 1.48 x 10^-14), which has been
associated with type 2 diabetes (Ban et al., 2010). A third, rather minor QTL on SSC18 contains
21 variants downstream of MDFIC (min. p-value 1.97 x 10^-15), a gene which has been linked to
improved piglet birth weight (Zhang et al., 2014). In addition, 25 intron and 42 downstream variants were
found for the PPP1R3A gene (min. p-value 6.92 x 10^-15), which in a whole exome sequencing
study was found to be associated with type 2 diabetes in a Mayan population (Sánchez-Pozos
et al., 2018).
CRCL: In the GWAS for CRCL, 52 clusters were identified on SSC7. Although not located in
one of the clusters, the two lowest p-values (min. p-value 5.40 x 10^-49) were found in an intron
and in the coding region (silent mutation) of FAM83B (or C6orf143), respectively. A total of 62
significant variants in FAM83B were discovered, comprising 60 intron variants, 1 silent
mutation, and 1 missense mutation. Cipriano et al. demonstrated that overexpression or
mutation of FAM83B leads to EGFR hyperactivation by direct interaction and consequent
hyperactivation of the EGFR downstream effector phospholipase D1, which was previously
associated with BMI in humans (Davenport et al., 2015). An intron variant in the RIPOR2 gene
with a p-value of 5.08 x 10^-47 is the same SNP that was found in the GWAS for BFT. A total
of 85 mostly intronic RIPOR2 variants were found for the CRCL trait. A second, less prominent
QTL on SSC7 harbors 9 intron, 12 downstream and 317 upstream variants (min. p-value 3.62 x
10^-31), which have been assigned to the RSL24D1 gene. RSL24D1 has been identified as a
potential target in familial hypercholesterolemia (Li et al., 2015). One of the clusters identified
for CRCL on SSC17 contains 230 variants 122,416-126,520 bp downstream of BMP2 (min.
p-value 7.21 x 10^-38), a bone formation inducing factor (Wang et al., 2013). In addition, 18 intron
and 114 variants upstream of TMX4 were discovered. TMX4 was associated with feed
conversion ratios in chickens (Shah et al., 2016).
High impact variants: Various high impact variants were discovered by variant effect
prediction. A splice donor variant (rs80834233) in DST, the gene encoding Dystonin, is
associated with BFT (p-value 1.98 x 10^-19) and CRCL (p-value 1.25 x 10^-19). Knockout of DST
leads to intrinsic muscle weakness and instability of skeletal muscle cytoarchitecture in mice
(Dalpé et al., 1999). Variant rs793752812 leads to a probable start codon loss in NEU1 and is
associated with CRCL (p-value 1.49 x 10^-12). A deficiency of the NEU1 gene product
Neuraminidase 1 leads to vertebral deformities in humans (Sphranger et al., 1977), so an
association with CRCL is plausible, given that CRCL is largely determined by the number of
vertebrae. Furthermore, one frameshift variant in AURKA (rs693811701, p-value 2.95 x 10^-12)
and one splice donor variant in NUTM1 (rs331245426, p-value 1.38 x 10^-9), both oncogenes
(Umene et al., 2015; Schaefer et al., 2018), are associated with CRCL. The splice donor variant rs1110687780,
which affects the gene coding for placenta-specific protein 1-like, was detected in the GWAS
for MFR. In humans, PLAC1 has been found to be highly expressed in various types of tumors
(Koslowski et al., 2007).
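Per-gene summaries such as "115 intron and 83 upstream variants, min. p-value ..." can be obtained by tallying predicted consequences of the significant variants. The sketch below assumes a simplified record layout of (gene, consequence, p-value); it does not parse real VEP output, and the function name and example records are hypothetical.

```python
from collections import Counter, defaultdict

def summarise_by_gene(records):
    """records: iterable of (gene, consequence, p_value) for significant variants.

    Returns {gene: {"counts": Counter of consequences, "min_p": smallest p-value}}.
    """
    summary = defaultdict(lambda: {"counts": Counter(), "min_p": 1.0})
    for gene, consequence, p_value in records:
        entry = summary[gene]
        entry["counts"][consequence] += 1
        entry["min_p"] = min(entry["min_p"], p_value)
    return dict(summary)

# Hypothetical records, for illustration only.
example = [("GENE_A", "intron_variant", 3e-12),
           ("GENE_A", "upstream_gene_variant", 8e-14),
           ("GENE_B", "intron_variant", 1e-11)]
for gene, info in summarise_by_gene(example).items():
    print(gene, dict(info["counts"]), info["min_p"])
```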
Application of results in breeding programs and follow up studies
Functional validation studies based on the identified candidate genes and genetic variants will be
considered in follow-up studies. Besides improving the understanding of the molecular mechanisms
underlying ADG, BFT, MFR and CRCL, the results of GWAS can substantially increase the
reliability of genomic predictions in breeding programs. This concept was demonstrated in
several studies in cattle (Brøndum et al., 2015; Porto-Neto et al., 2015; van den Berg et al.,
2016) and in Drosophila melanogaster (Ober et al., 2015) by including pre-selected variants
from GWAS results in the prediction models. Even though implementing genomic selection is
becoming common practice, marker-assisted selection and genomic screening are
not obsolete, so the identification of relevant genetic markers via GWAS and post-GWAS
analyses is still of practical importance in pig breeding.
Conclusion
Putting the results of previous simulation studies to the test, we conducted GWAS in four pooled
F2 designs, which had been imputed to sequence level based on high coverage founder and
low coverage F1 sequencing. We found that by pooling the designs the sequence level marker
density can be exploited efficiently. QTLs for four well-characterized traits were identified in
agreement with previous mapping studies, and candidate genes and pathways were revealed
that should be subject to further studies. Thus, the approach applied herein is a feasible strategy
to efficiently utilize extremely well phenotyped experimental designs that have been established
in the past.
References
Andersson, L., 2009 Genome-wide association analysis in domestic animals: a powerful
approach for genetic dissection of trait loci. Genetica 136: 341–349.
https://doi.org/10.1007/s10709-008-9312-4
Ban, H.-J., J. Y. Heo, K.-S. Oh, and K.-J. Park, 2010 Identification of Type 2 Diabetes-
associated combination of SNPs using Support Vector Machine. BMC Genet. 11: 26.
https://doi.org/10.1186/1471-2156-11-26
Barker, H. E., G. K. Smyth, J. Wettenhall, T. A. Ward, M. L. Bath et al., 2008 Deaf-1 regulates
epithelial cell proliferation and side-branching in the mammary gland. BMC Dev. Biol.
8: 94. https://doi.org/10.1186/1471-213X-8-94
Berens, E. B., G. M. Sharif, M. O. Schmidt, G. Yan, A. Wellstein et al., 2017 Keratin-
associated protein 5-5 controls cytoskeletal function and cancer cell vascular invasion.
Oncogene 36: 593–605. https://doi.org/10.1038/onc.2016.234
Blaj, I., J. Tetens, S. Preuß, J. Bennewitz, and G. Thaller, 2018 Genome-wide association
studies and meta-analysis uncovers new candidate genes for growth and carcass traits
in pigs. PLoS One 13: e0205576. https://doi.org/10.1371/journal.pone.0205576
Borchers, N., N. Reinsch, and E. Kalm, 2000 Familial cases of coat colour-change in a Piétrain
cross. J. Anim. Breed. Genet. 117: 285–287. https://doi.org/10.1046/j.1439-0388.2000.00255.x
Brøndum, R. F., G. Su, L. Janss, G. Sahana, B. Guldbrandtsen et al., 2015 Quantitative trait
loci markers derived from whole genome sequence data increases the reliability of
genomic prediction. J. Dairy Sci. 98: 4107–4116. https://doi.org/10.3168/jds.2014-9005
Browning, B. L., Y. Zhou, and S. R. Browning, 2018 A One-Penny Imputed Genome from
Next-Generation Reference Panels. Am. J. Hum. Genet. 103: 338–348.
https://doi.org/10.1016/j.ajhg.2018.07.015
Browning, S. R., and B. L. Browning, 2007 Rapid and accurate haplotype phasing and
missing-data inference for whole-genome association studies by use of localized
haplotype clustering. Am. J. Hum. Genet. 81: 1084–1097.
https://doi.org/10.1086/521987
Chang, C. C., C. C. Chow, L. C. Tellier, S. Vattikuti, S. M. Purcell et al., 2015 Second-
generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4:
7. https://doi.org/10.1186/s13742-015-0047-8
Cohen-Zinder, M., E. Seroussi, D. M. Larkin, J. J. Loor, A. E. Wind et al., 2005 Identification
of a missense mutation in the bovine ABCG2 gene with a major effect on the QTL on
chromosome 6 affecting milk yield and composition in Holstein cattle. Genome Res.
15: 936–944. https://doi.org/10.1101/gr.3806705
Comuzzie, A. G., S. A. Cole, S. L. Laston, V. S. Voruganti, K. Haack et al., 2012 Novel
genetic loci identified for the pathophysiology of childhood obesity in the Hispanic
population. PLoS One 7: e51954. https://doi.org/10.1371/journal.pone.0051954
Cui, H., S. Yang, M. Zheng, R. Liu, G. Zhao et al., 2017 High-salt intake negatively regulates
fat deposition in mouse. Sci. Rep. 7: 2053. https://doi.org/10.1038/s41598-017-01560-3
Cui, Y., W. Chen, J. Chi, and L. Wang, 2016 Comparison of Transcriptome between Type 2
Diabetes Mellitus and Impaired Fasting Glucose. Med. Sci. Monit. 22: 4699–4706.
https://doi.org/10.12659/MSM.896772
Dalpé, G., M. Mathieu, A. Comtois, E. Zhu, S. Wasiak et al., 1999 Dystonin-deficient mice
exhibit an intrinsic muscle weakness and an instability of skeletal muscle
cytoarchitecture. Dev. Biol. 210: 367–380. https://doi.org/10.1006/dbio.1999.9263
Davenport, E. R., D. A. Cusanovich, K. Michelini, L. B. Barreiro, C. Ober et al., 2015
Genome-Wide Association Studies of the Human Gut Microbiota. PLoS One 10:
e0140301. https://doi.org/10.1371/journal.pone.0140301
Fan, Y., Y. Xing, Z. Zhang, H. Ai, Z. Ouyang et al., 2013 A further look at porcine
chromosome 7 reveals VRTN variants associated with vertebral number in Chinese and
Western pigs. PLoS One 8: e62534. https://doi.org/10.1371/journal.pone.0062534
Fujii, J., K. Otsu, F. Zorzato, S. de Leon, V. Khanna et al., 1991 Identification of a mutation
in porcine ryanodine receptor associated with malignant hyperthermia. Science 253:
448–451. https://doi.org/10.1126/science.1862346
Haller, T., T. Tasa, and A. Metspalu, 2019 Manhattan Harvester and Cropper: a system for
GWAS peak detection. BMC Bioinformatics 20: 22. https://doi.org/10.1186/s12859-019-2600-4
Hayes, B., and M. E. Goddard, 2001 The distribution of the effects of genes affecting
quantitative traits in livestock. Genetics, Selection, Evolution GSE 33: 209–229.
Hu, Z., C. A. Park, and J. M. Reecy, 2019 Building a livestock genetic and genomic
information knowledgebase through integrative developments of Animal QTLdb and
CorrDB. Nucleic Acids Res. 47: D701–D710. https://doi.org/10.1093/nar/gky1084
Huang da, W., B. T. Sherman, and R. A. Lempicki, 2009 Systematic and integrative analysis
of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 4: 44–57.
https://doi.org/10.1038/nprot.2008.211
Huang, D. W., B. T. Sherman, Q. Tan, J. R. Collins, W. G. Alvord et al., 2007 The DAVID
Gene Functional Classification Tool: a novel biological module-centric algorithm to
functionally analyze large gene lists. Genome Biol. 8: R183.
https://doi.org/10.1186/gb-2007-8-9-r183
Knott, S. A., 2005 Regression-based quantitative trait loci mapping: robust, efficient and
effective. Philos. Trans. R. Soc. Lond. B Biol. Sci. 360: 1435–1442.
https://doi.org/10.1098/rstb.2005.1671
Koslowski, M., U. Sahin, R. Mitnacht-Kraus, G. Seitz, C. Huber et al., 2007 A placenta-
specific gene ectopically activated in many human cancers is essentially involved in
malignant cell processes. Cancer Res. 67: 9528–9534.
https://doi.org/10.1158/0008-5472.CAN-07-1350
Krishnan, R., N. Boddapati, and S. Mahalingam, 2018 Interplay between human nucleolar
GNL1 and RPS20 is critical to modulate cell proliferation. Sci. Rep. 8: 11421.
https://doi.org/10.1038/s41598-018-29802-y
Li, G., X.-J. Wu, X.-Q. Kong, L. Wang, and X. Jin, 2015 Cytochrome c oxidase subunit VIIb
as a potential target in familial hypercholesterolemia by bioinformatical analysis. Eur.
Rev. Med. Pharmacol. Sci. 19: 4139–4145.
Li, H., B. Handsaker, A. Wysoker, T. Fennell, J. Ruan et al., 2009 The Sequence
Alignment/Map format and SAMtools. Bioinformatics 25: 2078–2079.
https://doi.org/10.1093/bioinformatics/btp352
Liu, Y., P. Zhan, Z. Zhou, Z. Xing, S. Zhu et al., 2016 The overexpression of KIFC1 was
associated with the proliferation and prognosis of non-small cell lung cancer. J. Thorac.
Dis. 8: 2911–2923. https://doi.org/10.21037/jtd.2016.10.67
McKenna, A., M. Hanna, E. Banks, A. Sivachenko, K. Cibulskis et al., 2010 The Genome
Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA
sequencing data. Genome Res. 20: 1297–1303. https://doi.org/10.1101/gr.107524.110
McLaren, W., L. Gil, S. E. Hunt, H. S. Riat, G. R. S. Ritchie et al., 2016 The Ensembl Variant
Effect Predictor. Genome Biol. 17: 122. https://doi.org/10.1186/s13059-016-0974-4
Nagamine, Y., C. S. Haley, A. Sewalem, and P. M. Visscher, 2003 Quantitative trait loci
variation for growth and obesity between and within lines of pigs (Sus scrofa). Genetics
164: 629–635.
Nezer, C., L. Moreau, B. Brouwers, W. Coppieters, J. Detilleux et al., 1999 An imprinted QTL
with major effect on muscle mass and fat deposition maps to the IGF2 locus in pigs.
Nat. Genet. 21: 155–156. https://doi.org/10.1038/5935
Ng, P. C., and S. Henikoff, 2003 SIFT: Predicting amino acid changes that affect protein
function. Nucleic Acids Res. 31: 3812–3814. https://doi.org/10.1093/nar/gkg509
Nguyen, D. T., K. Lee, H. Choi, M.-K. Choi, M. T. Le et al., 2012 The complete
swine olfactory subgenome: expansion of the olfactory gene repertoire in the pig
genome. BMC Genomics 13: 584. https://doi.org/10.1186/1471-2164-13-584
Ober, U., W. Huang, M. Magwire, M. Schlather, H. Simianer et al., 2015 Accounting for
genetic architecture improves sequence based genomic prediction for a Drosophila
fitness trait. PLoS One 10: e0126880. https://doi.org/10.1371/journal.pone.0126880
Porto-Neto, L. R., W. Barendse, J. M. Henshall, S. M. McWilliam, S. A. Lehnert et al., 2015
Genomic correlation: harnessing the benefit of combining two unrelated populations for
genomic selection. Genetics, Selection, Evolution GSE 47: 84.
Ramos, A. M., R. P. M. A. Crooijmans, N. A. Affara, A. J. Amaral, A. L. Archibald et al.,
2009 Design of a high density SNP genotyping assay in the pig using SNPs identified
and characterized by next generation sequencing technology. PLoS One 4: e6524.
https://doi.org/10.1371/journal.pone.0006524
Rohrer, G. A., D. J. Nonneman, R. T. Wiedmann, and J. F. Schneider, 2015 A study of vertebra
number in pigs confirms the association of vertnin and reveals additional QTL. BMC
Genet. 16: 129. https://doi.org/10.1186/s12863-015-0286-9
Rothschild, M. F., Z. Hu, and Z. Jiang, 2007 Advances in QTL mapping in pigs. Int. J. Biol.
Sci. 3: 192–197. https://doi.org/10.7150/ijbs.3.192
Rückert, C., and J. Bennewitz, 2010 Joint QTL analysis of three connected F2-crosses in pigs.
Genetics, Selection, Evolution GSE 42: 40.
Sánchez-Pozos, K., M. G. Ortíz-López, B. I. Peña-Espinoza, M. de Los Ángeles
Granados-Silvestre, V. Jiménez-Jacinto et al., 2018 Whole-exome sequencing in Maya indigenous
families: variant in PPP1R3A is associated with type 2 diabetes. Molecular Genetics and
Genomics MGG 293: 1205–1216. https://doi.org/10.1007/s00438-018-1453-2
Schaefer, I.-M., P. Dal Cin, L. M. Landry, C. D. M. Fletcher, G. J. Hanna et al., 2018 CIC-
NUTM1 fusion: A case which expands the spectrum of NUT-rearranged epithelioid
malignancies. Genes Chromosomes Cancer 57: 446–451. https://doi.org/10.1002/gcc.3
Schmid, M., R. Wellmann, and J. Bennewitz, 2018 Power and precision of QTL mapping in
simulated multiple porcine F2 crosses using whole-genome sequence information. BMC
Genet. 19: 22. https://doi.org/10.1186/s12863-018-0604-0
Shah, T. M., N. V. Patel, A. B. Patel, M. R. Upadhyay, A. Mohapatra et al., 2016 A
genome-wide approach to screen for genetic variants in broilers (Gallus gallus) with divergent
feed conversion ratio. Molecular Genetics and Genomics MGG 291: 1715–1725.
Sphranger, J., J. Gehler, and M. Cantz, 1977 Mucolipidosis I–a sialidosis. Am. J. Med. Genet.
1: 21–29. https://doi.org/10.1002/ajmg.1320010104
Stratz, P., M. Schmid, R. Wellmann, S. Preuß, I. Blaj et al., 2018 Linkage disequilibrium
pattern and genome-wide association mapping for meat traits in multiple porcine F2
crosses. Anim. Genet. 49: 403–412. https://doi.org/10.1111/age.12684
Stuible, M., N. Dubé, and M. L. Tremblay, 2008 PTP1B Regulates Cortactin Tyrosine
Phosphorylation by Targeting Tyr446. J. Biol. Chem. 283: 15740–15746.
https://doi.org/10.1074/jbc.M710534200
Umene, K., M. Yanokura, K. Banno, H. Irie, M. Adachi et al., 2015 Aurora kinase A has
a significant role as a therapeutic target and clinical biomarker in endometrial cancer.
Int. J. Oncol. 46: 1498–1506. https://doi.org/10.3892/ijo.2015.2842
van den Berg, I., D. Boichard, and M. S. Lund, 2016 Sequence variants selected from a multi-
breed GWAS can improve the reliability of genomic predictions in dairy cattle.
Genetics, Selection, Evolution GSE 48: 83.
Vimaleswaran, K. S., A. Cavadino, D. J. Berry, J. C. Whittaker, C. Power et al., 2013 Genetic
association analysis of vitamin D pathway with obesity traits. Int. J. Obes. 37: 1399–
1406. https://doi.org/10.1038/ijo.2013.6
Wang, L., P. Park, F. La Marca, K. Than, S. Rahman et al., 2013 Bone formation induced by
BMP-2 in human osteosarcoma cells. Int. J. Oncol. 43: 1095–1102.
https://doi.org/10.3892/ijo.2013.2030
Wellmann, R., S. Preuß, E. Tholen, J. Heinkel, K. Wimmers et al., 2013 Genomic selection
using low density marker panels with application to a sire line in pigs. Genetics,
Selection, Evolution GSE 45: 28.
Wozney, J. M., V. Rosen, A. J. Celeste, L. M. Mitsock, M. J. Whitters et al., 1988 Novel
regulators of bone formation: molecular clones and activities. Science 242: 1528–1534.
https://doi.org/10.1126/science.3201241
Yang, J., S. H. Lee, M. E. Goddard, and P. M. Visscher, 2011 GCTA: a tool for genome-wide
complex trait analysis. Am. J. Hum. Genet. 88: 76–82.
https://doi.org/10.1016/j.ajhg.2010.11.011
Yip, L., R. Fuhlbrigge, C. Taylor, R. J. Creusot, T. Nishikawa-Matsumura et al., 2015
Inflammation and hyperglycemia mediate Deaf1 splicing in the pancreatic lymph nodes
via distinct pathways during type 1 diabetes. Diabetes 64: 604–617.
https://doi.org/10.2337/db14-0803
Yoon, S., M. J. Molloy, M. P. Wu, D. B. Cowan, and E. Gussoni, 2007 C6ORF32 is
upregulated during muscle cell differentiation and induces the formation of cellular
filopodia. Dev. Biol. 301: 70–81. https://doi.org/10.1016/j.ydbio.2006.11.002
Zatkova, A., J.-M. Rouillard, W. Hartmann, B. J. Lamb, R. Kuick et al., 2004 Amplification
and overexpression of the IGF2 regulator PLAG1 in hepatoblastoma. Genes
Chromosomes Cancer 39: 126–137. https://doi.org/10.1002/gcc.10307
Zhang, L., X. Zhou, J. J. Michal, B. Ding, R. Li et al., 2014 Genome Wide Screening of
Candidate Genes for Improving Piglet Birth Weight Using High and Low Estimated
Breeding Value Populations. Int. J. Biol. Sci. 10: 236–244.
https://doi.org/10.7150/ijbs.7744
Zhou, Z.-Y., A. Li, N. O. Otecko, Y.-H. Liu, D. M. Irwin et al., 2017 PigVar: a database of pig
variations and positive selection signatures. Database: the journal of biological databases
and curation.
Chapter 4
Recombination landscape in multiple F2 pig crosses between
genetically diverse founder breeds
Iulia Blaj1, Jens Tetens2, Robin Wellmann3, Siegfried Preuß3, Jörn Bennewitz3 and Georg
Thaller1
1Institute of Animal Breeding and Husbandry, Kiel University, Kiel, Germany
2Functional Breeding Group, Department of Animal Sciences, Göttingen University,
Göttingen, Germany
3Institute of Animal Husbandry and Breeding, University of Hohenheim, Stuttgart, Germany
Published in Proceedings of the 11th World Congress of
Genetics Applied to Livestock Production
Summary
In the present study, the recombination landscape of multiple pig F2 pedigrees is evaluated in
detail. Three of the pedigrees under investigation were generated from distantly related founder
breeds: Wild boar, Piétrain and Meishan and the fourth pedigree originated from closely related
founder breeds: Piétrain and Large white or crossbred sows Large white × Landrace.
Recombination rates and genetic maps were estimated from SNP chip data using marker
positions according to the current pig reference genome. The level of recombination events
varies within crosses, among breeds and individuals as well as across chromosomes or regions
within chromosomes. Although we observed a substantial heterogeneity in the pedigrees,
certain patterns specific to crosses, sex or chromosomes were identified. These patterns depend
on the extent of conservation of the local rate of recombination over time, on the levels of
diversity, efficiency or direction of selection, and the genome composition. The current findings
are intended to have practical consequences for the genetic mapping of traits in pigs.
Keywords: recombination landscape, domestication, crossover inference, pig F2 cross
Introduction
Recombination shapes the genomic architecture of organisms by producing new genetic
combinations every generation. This process of shuffling is the major source of genetic
variability upon which selection can operate in a natural or artificial manner. Among the various
domesticated pig breeds, of either European or Asian origin, selection acted mostly on very
different traits, thus specifically altering the genome landscape. In the present study, we
investigate aspects related to recombination rate (RR) and genetic maps in four pig populations
stemming from the following European and Asian founder breeds: Piétrain (P), Large white
(Lw), Landrace (L) and Meishan (M), as well as their wild ancestor, the Wild boar (W). The
aim of the investigation was threefold:
1. Estimate a high-density recombination map of the pig based on the new reference genome;
2. Evaluate cross, sex and chromosome specific differences;
3. Assess male specific rates and genetic map lengths.
Material and methods
Experimental populations
Four pedigrees were included in the analysis (Table 1). The population PxLwL/Lw was
generated from closely related founders (Boysen et al., 2010) and the other three originate from
distantly related founder breeds (Rückert and Bennewitz, 2010). The F2 individuals and the
respective F1 and F0 ancestors were genotyped with the Illumina PorcineSNP60 BeadChip. SNP
chromosomal positions were based on the current pig genome assembly (Sscrofa 11.1).
Genotype filtering was done using the Illumina GenomeStudio software and PLINK (Purcell et al.,
2007). Only autosomal SNPs were used in the further analyses; their number was on average 43K,
except for the WxM data set, which contained 37K SNPs. The statistical analysis was conducted in R
(R Core Team, 2014).
Table 1. Description of the study designs.
Design/Generation  F0 males      F0 females  F1 males  F1 females  F2
PxLwL/Lw           5 Piétrain    8 LwL/Lw¹   8         88          1785
MxP                1 Meishan     8 Piétrain  3         19          304
WxP                1 Wild boar²  9 Piétrain  2         26          291
WxM                1 Wild boar²  4 Meishan   2         21          312
¹ Large white x Landrace/Large white; ² the same Wild boar founder
Haplotype reconstruction and inference of crossover events
The autosomal recombination events were inferred using LinkPhase3 software (Druet and
Georges, 2015) which performs phasing based on Mendelian segregation rules. The crossover
events (CO) are further identified as phase switches observed in the gametes. The output
consists of the crossover calls in each parent-child pair (F1-F2) and the genomic interval for
which the inference is made. Double COs occurring within a 1 Mb window, CO intervals larger
than half of the chromosome length and recombination fractions larger than 0.05 per 1 Mb
window were ignored. Moreover, only chromosomes with a maximum of four CO events were
further considered. Recombination fractions were estimated for every non-overlapping 1 Mb
window and converted into centiMorgans (cM) using the Haldane mapping function. For each
experimental population we calculated a sex-averaged, a female and a male recombination rate
and map. Additionally, individual genetic maps were constructed for 14 F1 males with more than
one hundred meioses. Sex differences in the recombination rate distribution were evaluated
chromosome-wise using the Kolmogorov-Smirnov test (KS test). Correlations between cross
specific, sex specific and male specific recombination rates were tested using Pearson’s
correlation coefficient at a chromosomal level as well as at a genome-wide level.
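The conversion step described above can be illustrated with a minimal Python sketch. It estimates a recombination fraction for a single 1 Mb window and converts it into centiMorgans with the Haldane mapping function, d = -50 ln(1 - 2r). The function names and the example counts are hypothetical and are not taken from the analysis pipeline used in the study.

```python
import math


def recombination_fraction(n_crossovers: int, n_meioses: int) -> float:
    """Crossovers observed in a window divided by the number of meioses."""
    if n_meioses <= 0:
        raise ValueError("n_meioses must be positive")
    return n_crossovers / n_meioses


def haldane_cm(r: float) -> float:
    """Haldane map distance in cM for a recombination fraction 0 <= r < 0.5."""
    if not 0.0 <= r < 0.5:
        raise ValueError("r must lie in [0, 0.5)")
    return -50.0 * math.log(1.0 - 2.0 * r)


# Hypothetical window: 6 crossovers among 600 informative meioses.
r = recombination_fraction(6, 600)
print(f"r = {r:.4f}, distance = {haldane_cm(r):.3f} cM")
```

For small recombination fractions, as expected in 1 Mb windows, the Haldane distance is close to 100 × r, so the correction mainly matters for windows with elevated recombination.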
Results and discussion
Recombination rates and maps
The crossover events inference for the F1 individuals revealed an average of 1.03 CO per
chromosome in PxLwL/Lw, 1.13 in MxP, 1.12 in WxP and 1.08 in WxM. A higher number of
events was identified in the females than in the males, which led to sex specific
differences in the recombination rates and, as a consequence, in the linkage maps (Table 2). The
longest genetic maps found were 1953 cM for the MxP and 1920 cM for the WxP cross,
respectively. In general, the sex-averaged, female and male genetic maps were shorter than
those previously published (Tortereau et al., 2012; Guo et al., 2009; Rohrer et al., 1996).
Furthermore, the estimated recombination rates were higher than those reported by Tortereau et al.
(2012), who used similar pedigrees and Sscrofa 10.2 as reference. Therefore, we also ran our analysis
pipeline for the four study designs using the SNP positions according to Sscrofa 10.2. With these
positions, CO filtering was more stringent and, together with the overestimation of the physical
length in the previous assembly, led to lower RR and longer linkage maps. Thus, the higher RR
estimates and shorter maps can be mainly attributed to the fact that we used the latest reference genome.
Table 2. Characteristics of sex-averaged, female and male linkage maps.
           Sex-averaged              Female                    Male
Design     Linkage map (cM)  cM/Mb   Linkage map (cM)  cM/Mb   Linkage map (cM)  cM/Mb
PxLwL/Lw   1740              0.87    1952              1       1553              0.75
MxP        1953              0.96    2068              1.05    1643              0.79
WxP        1920              0.95    1905              0.97    1731              0.82
WxM        1864              0.92    1977              1.01    1591              0.75
The longest chromosome was SSC6 for PxLwL/Lw (120.39 cM), for MxP (149.84 cM) and for
WxM (137.49 cM). However, in the WxP cross, the longest was SSC1 with 146.64 cM. With
respect to map size and recombination rates, in general, female rates were higher. Nonetheless,
there are exceptions, as previously described in other studies, specifically on SSC1 and SSC13,
for which male maps and rates surpass the female ones. Additionally, we identified a
similar behaviour for other chromosomes in WxP (SSC14 and SSC15) and in WxM
(SSC15), i.e. in the crosses stemming from the Wild boar.
Sex differences in the recombination rate distribution were assessed per chromosome
via the KS test. For SSC1, SSC3, SSC13 and SSC18 the male and female
distributions differed consistently in all four pedigrees, whereas for SSC5 and SSC11 the
male and female RR came from the same distribution. Several factors, such as chromosome length,
number of CO events identified in each sex, centromere position and particularities of the genomic
regions, may account for the observed chromosomal sex differences.
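To make the chromosome-wise comparison concrete, the following Python sketch computes the two-sample Kolmogorov-Smirnov statistic D, i.e. the largest absolute difference between the empirical distribution functions of two samples. It is only a sketch: the window values are invented, the function name is hypothetical, and the p-value calculation that a full KS test (for instance in R) would add is omitted.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(sample_a: Sequence[float], sample_b: Sequence[float]) -> float:
    """Two-sample KS statistic: max |ECDF_a(x) - ECDF_b(x)| over all observed x."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a_sorted = sorted(sample_a)
    b_sorted = sorted(sample_b)
    d_max = 0.0
    for x in sorted(set(a_sorted) | set(b_sorted)):
        ecdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        ecdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        d_max = max(d_max, abs(ecdf_a - ecdf_b))
    return d_max


# Hypothetical per-window recombination rates (cM/Mb) on one chromosome.
female_rr = [0.4, 0.9, 1.3, 1.1, 0.7, 1.6]
male_rr = [0.2, 0.3, 0.8, 0.5, 1.9, 0.4]
print(f"D = {ks_statistic(female_rr, male_rr):.3f}")
```

A large D indicates that the male and female recombination rates are distributed differently along the chromosome, even when the overall map lengths are similar.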
Male specific differences
Different genetic lengths and recombination rates were observed for females and males in the
four pedigrees. In each of the crosses the male chromosomal genetic maps exhibited similar
lengths, except for SSC1, SSC2 and SSC14 (Figure 1). The observed differences are due to
sequence variation and different informative markers within the males used.
Figure 1. Male-averaged chromosomal genetic maps for the four pedigrees.
Individual recombination rates and maps were estimated for the F1 boars for which we had more
than one hundred meioses available. The correlation of the RR at a genome level among the
eight boars in PxLwL/Lw varied between 0.51 and 0.62. For the two boars in WxP we
calculated a 0.55 correlation coefficient while between the males in MxP and WxP the
correlation was 0.42 for both. The WxP and WxM crosses stemmed from the same Wild boar
founder male; therefore, we also assessed the similarity among the four F1 boars. We found an
average correlation of 0.41, suggesting that the crossing with the female Piétrain and Meishan
founders reshaped the recombination landscape at a genome level. Regarding the overall
genetic map length, the shortest map was observed in one male from the European breed cross
(1461 cM) and the longest map was recorded in one of the F1 boars in the WxP cross (1719 cM),
implying that individual male differences can be substantial.
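The pairwise comparison of individual recombination profiles can be sketched as follows. The example computes Pearson's correlation coefficient between the per-window recombination rates of two boars; the values and the function name are hypothetical, and the sketch assumes that both profiles are defined on the same set of windows.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equally long profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    if s_xx == 0.0 or s_yy == 0.0:
        raise ValueError("a constant profile has no defined correlation")
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical recombination rates (cM/Mb) of two F1 boars in the same windows.
boar_a = [0.2, 0.5, 1.1, 0.9, 0.4, 1.5]
boar_b = [0.3, 0.4, 0.9, 1.2, 0.5, 1.1]
print(f"r = {pearson_r(boar_a, boar_b):.2f}")
```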
Conclusion
We report in this study the first, to our knowledge, recombination map of the porcine genome
based on the newest reference genome assembly, with more precise localization of crossover
events and a broad coverage of segregating variation due to the various founder breeds. The
study aims to contribute to the recombination picture in the Sus scrofa population, to
understand how domestication and breed development impacted the recombination landscape
and, last but not least, to assist in the genetic mapping of relevant traits.
Acknowledgements
The study was funded by the Deutsche Forschungsgemeinschaft, DFG.
References
Boysen T.J., Tetens J. & Thaller G. (2010). Detection of a quantitative trait locus for ham
weight with polar overdominance near the ortholog of the callipyge locus in an
experimental pig F2 population. Journal of animal science 88, 3167-3172.
Druet, T., & Georges, M. (2015). LINKPHASE3: an improved pedigree-based phasing
algorithm robust to genotyping and map errors. Bioinformatics, 31(10), 1677-1679.
Guo, Y., Mao, H., Ren, J., Yan, X., Duan, Y., Yang, G., ... & Brenig, B. (2009). A linkage map
of the porcine genome from a large‐scale White Duroc×Erhualian resource population
and evaluation of factors affecting recombination rates. Animal genetics, 40(1), 47-52.
Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A., Bender, D., ... & Sham, P.
C. (2007). PLINK: a tool set for whole-genome association and population-based
linkage analyses. The American Journal of Human Genetics, 81(3), 559-575.
Rückert C. & Bennewitz J. (2010). Joint QTL analysis of three connected F2-crosses in pigs.
Genetics, selection, evolution: GSE 42, 40.
Rohrer, G. A., Alexander, L. J., Hu, Z., Smith, T. P., Keele, J. W., & Beattie, C. W. (1996). A
comprehensive map of the porcine genome. Genome research, 6(5), 371-391.
R Core Team (2014). R: A language and environment for statistical computing.
Vienna, Austria: R Foundation for Statistical Computing.
Tortereau, F., Servin, B., Frantz, L., Megens, H. J., Milan, D., Rohrer, G., ... & Groenen, M. A.
(2012). A high-density recombination map of the pig reveals a correlation between sex-
specific recombination and GC content. BMC genomics, 13(1), 586.
Chapter 5
A systematic survey of sequence data variation in the founder
individuals of four F2 pig crosses
Iulia Blaj and Georg Thaller
Institute of Animal Breeding and Husbandry, Kiel University, Kiel, Germany
Manuscript in preparation
Abstract
We propose a reverse genetics approach for exploiting information in F2 resource populations
by surveying the sequence data variation that exists in the F0 generation. The rationale
is that almost all the variation that exists in F2 individuals (normally used for genome wide
association studies) is propagated from the founder generation. The panel of animals
consists of 14 Piétrain, 7 crossbred Large white x Landrace sows, 1 Large white sow, 1 Meishan
boar and 1 Wild boar. We explore two approaches: the first considers the pooled variation,
while the second considers the breed specific variation. The Meishan individual sets itself apart from the
rest of the European individuals with a high number of unique variants involved in various vital
biological processes. The two European pig groups (Piétrain group and Large white x Landrace
and Large white group) display breed specific variation in their olfactory receptor gene family.
Finally, a gene-based survey found that variation of high impact effect occurred in genes
relevant to domestication and breeding (e.g. KIT, IGF2, PRKAG3, and PRLR) which can be of
further interest.
Introduction
Associating phenotypic variation to genotypic variation has been an ongoing objective in pig
research from both a breeding and a biomedical perspective. One approach to establish this
association is by analyzing experimental populations. Common population structures are F2
crosses usually obtained by mating genetically divergent lineages (Geldermann et al., 1996),
but also from closely related founders (Borchers et al., 2000). This classical setting was
explored in the past using sparse genomic information (e.g. microsatellites) and linkage
mapping (Ernst and Steibel, 2013). Advancing into the genomics era, the now common usage
of the SNP chip array (Ramos et al. 2009) was a pivotal moment with major implications for
increasing resolution in quantitative trait loci (QTL) mapping experiments and the successful
implementation of genomic selection (Meuwissen et al., 2001). The resolution can be further
increased with the availability of whole genome sequence (WGS) data, information that opens
many opportunities to find causative variants influencing traits of interest.
QTL mapping was successfully employed in the past in the four F2 crosses considered in this
study and a number of QTLs and candidate genes were pinpointed for various traits (Boysen
et al., 2011; Rückert and Bennewitz, 2010; Stratz et al., 2018; Blaj et al., 2018). This forward
genetics approach usually implies coupling the F2 generation genotypes with the phenotypes
via linkage mapping or genome wide association studies. This study, in contrast, explores the
feasibility of employing a reverse genetics approach, a scenario in which we survey the genomic
variation from WGS of the founder (F0) individuals. This variation from the diverse panel of
F0 individuals (European Wild boar, European and Asian breeds) is assessed systematically
using bioinformatics tools via a pooled and a breed based approach.
Material and methods
Founder individuals
Whole genome sequence data was available for 24 founder individuals from four F2 resource
populations (Falker-Gieske et al., 2019). The designs are described in detail by Geldermann et
al. (1996) and Borchers et al. (2000). A brief description of the crosses and founders is shown
in Table 1. The panel of animals consists of 14 Piétrain (5 males from D1 and 9 females used
in D2 and D3), 7 crossbred Large white x Landrace sows (D1), 1 Large white sow (D1), 1
Meishan boar (D2) and 1 Wild boar (D3 and D4).
Table 1. Description of the resource populations.
Cross | WGS founders* | Sample IDs | ∑ F2
D1 P x LwL/Lw | 13 (13) | P: 10345, 17118, 17123, 17161, 17165; LwL: 662, 690, 693, 735, 750, 756, 771; Lw: 728 | 1785
D2 M x P | 8 (9) | M: M199; P: P102, P107, P108, P113, P119, P130, P244 | 312
D3 W x P | 6 (10) | W: P181; P: P102, P108, P113, P115, P128 | 300
D4 W x M | 1 (5) | M: M199 | 304
*Four founders are in common among the crosses; ∑ founders in brackets. P = Piétrain, Lw
= Large white, L = Landrace, M = Meishan, W = Wild boar.
Population analysis and heterozygosity levels
With the aim to infer population structure, we performed principal component analysis (PCA)
in PLINK (Purcell et al., 2007) using the multisample VCF file containing variants for all 24
individuals. PC1 was plotted against PC2 using the R package ggplot2 (R Core Team, 2008;
Wickham, 2009). Observed (Ho) and expected heterozygosity (He) were estimated using the
PLINK option --het, from which the inbreeding coefficient of each individual was calculated.
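As an illustration of how an individual inbreeding coefficient follows from such counts, the sketch below applies the method-of-moments formula F = (O - E) / (N - E), where O is the observed number of homozygous genotypes, E the number expected under Hardy-Weinberg proportions and N the number of non-missing genotypes. These are the quantities reported by PLINK's --het output; the function names and the numbers used here are hypothetical.

```python
def observed_heterozygosity(observed_hom: float, n_sites: int) -> float:
    """Share of non-missing genotypes that are heterozygous."""
    if n_sites <= 0:
        raise ValueError("n_sites must be positive")
    return (n_sites - observed_hom) / n_sites


def inbreeding_coefficient(observed_hom: float, expected_hom: float, n_sites: int) -> float:
    """Method-of-moments F = (O - E) / (N - E)."""
    denominator = n_sites - expected_hom
    if denominator <= 0:
        raise ValueError("expected homozygous count must be below n_sites")
    return (observed_hom - expected_hom) / denominator


# Hypothetical individuals: one with an excess and one with a deficit of homozygotes.
print(inbreeding_coefficient(700, 600, 1000))   # excess of homozygotes, F > 0
print(inbreeding_coefficient(400, 600, 1000))   # excess of heterozygotes, F < 0
print(observed_heterozygosity(400, 1000))
```

A negative F, as reported for some individuals in this study, simply means that more heterozygous genotypes were observed than expected from the allele frequencies in the sample.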
Functional effect prediction and gene set analysis
To predict the coding effects of genetic variation (i.e. SNPs, indels) on genes, transcripts,
protein sequence and regulatory elements, we employed the SnpEff tool (Cingolani et al. 2012).
The database containing the genomic annotations for the latest reference Sscrofa 11.1
(GCA_000003025.6 provided by the Swine Genome Sequencing Consortium on NCBI) was built
and utilized for effect prediction. We used two approaches: i) pooled (using collectively
the variants from all 24 animals) and ii) breed based. For the latter, we prioritized the private
doubletons (i.e. variants where the minor allele occurs only in a single individual and that
individual is homozygous for that allele), retained using the --singletons option of vcftools
(Danecek et al., 2011).
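The definition of a private doubleton can be made explicit with a short Python sketch. It classifies one variant site from the number of minor-allele copies carried by each sample (0, 1 or 2). The function name and the genotype values are hypothetical, and the sketch assumes that the counted allele is the minor allele; it does not reproduce the vcftools implementation.

```python
from typing import Dict, Optional, Tuple


def classify_site(minor_allele_counts: Dict[str, int]) -> Optional[Tuple[str, str]]:
    """Return (class, sample) if the minor allele is private to one sample."""
    carriers = [(s, c) for s, c in minor_allele_counts.items() if c > 0]
    if len(carriers) != 1:
        return None  # allele absent or shared by several individuals
    sample, copies = carriers[0]
    if copies == 1:
        return ("singleton", sample)          # one heterozygous carrier
    if copies == 2:
        return ("private_doubleton", sample)  # one homozygous carrier
    return None


# Hypothetical genotypes at three sites for three founders.
print(classify_site({"M199": 2, "P181": 0, "P102": 0}))
print(classify_site({"M199": 1, "P181": 0, "P102": 0}))
print(classify_site({"M199": 1, "P181": 1, "P102": 0}))
```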
In the pooled approach, we focused on high, moderate, and low impact variants (a detailed
explanation is given in the Results section, Table 3). The genes affected by these variants were
used for a gene set analysis using the ShinyGO Gene Ontology Enrichment Analysis tool (Ge
and Jung, 2018) according to biological processes (BP). The same tool provides graphical
visualization of the relationships among the enriched BP terms via hierarchical clustering tree
view. In general, for all the gene set analyses, we used a false discovery rate (FDR) cutoff of
0.05.
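To show what an FDR cutoff of 0.05 implies in practice, the following sketch adjusts a list of enrichment p-values with the Benjamini-Hochberg procedure and keeps the terms whose adjusted value does not exceed the cutoff. This is an assumption made for illustration: the p-values are invented, and the exact adjustment applied internally by the enrichment tool is not described here.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value to enforce monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical enrichment p-values for five GO terms.
p_values = [0.001, 0.008, 0.039, 0.041, 0.6]
fdr = benjamini_hochberg(p_values)
significant = [i for i, q in enumerate(fdr) if q <= 0.05]
print(fdr)
print(significant)
```

In this example the third and fourth terms have raw p-values below 0.05 but are not retained after adjustment, which is the intended effect of controlling the FDR across many tested terms.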
The breed based approach relied on selecting exclusive doubletons from each individual. These
variants, as well as their associated genes and effects (according to SnpEff tool) were grouped
into four breed categories: Piétrain, Large white x Landrace or Large white, Meishan and Wild
boar. For each dataset, the genes affected by high impact variants were considered for a gene set
analysis with ShinyGO as described above.
A comprehensive list incorporating genes associated with domestication, carcass composition,
reproduction, meat and fat quality traits was retrieved from the literature (Rothschild and
Ruvinsky, 2011; Groenen, 2016). We further evaluated the pooled approach SnpEff output for
the type of variation that exists within these genes, considered the severity of their predicted
variant effects and ultimately focused on high impact SNPs and indels.
Results
A total of ca. 33 million (M) genomic variants were cumulatively available for the 24 samples,
of which ca. 27M were SNPs and ca. 6M indels. The percentage breakdown per animal
according to the variant type is shown in Figure 1. Individual genomic inbreeding coefficients
derived from homozygous and heterozygous variants ranged from -0.71 for the Meishan to
0.27 for the Wild boar (Table SM1). We used PCA to identify the main axes of variance within
the data set (Figure 2). PC1 explained almost a quarter of the variance (22.12%) while PC2
captured another 10.01%.
Figure 1. Individual genome variant composition. nRefHom = percentage of variants that
are reference homozygous; nNonRefHom = percentage of variants that are non-reference
homozygous; nHets = percentage of heterozygous variants; nIndels = percentage of indels.
[Figure 1 plot: y-axis 0% to 100%; x-axis samples 662, 690, 693, 728, 735, 750, 756, 771, P102,
P107, P108, P113, P115, P119, P128, P130, P244, 10345, 17118, 17123, 17161, 17165, M199, P181;
legend: nRefHom, nNonRefHom, nHets, nIndels]
Figure 2. Principal component analysis. PC1 22.12% versus PC2 10.01% (Lw = Large white,
LwL = Large white x Landrace, M = Meishan, P = Piétrain, W = Wild boar).
Variant analysis
We evaluated SNPs and indels from all samples in the pooled approach. From this large scale
analysis, according to variant effect, the majority were located in intronic (56.5%) and
intergenic regions (31.7%). A detailed summary containing all effect types, count and
percentage is presented in Table 2. A further classification was made based on the severity of
the variants’ impact at the transcript and protein level. While more than 99% had a modifier
impact, 0.03%, 0.22% and 0.48% had a high, moderate or low impact, respectively (Table 3).
Interpretation of large gene lists was accomplished through enrichment analysis. In the pooled
strategy, the genes containing high, moderate and low impact variants were the input for
ShinyGO. This list contained 75% of the genes annotated on the current reference genome
Sscrofa 11.1 (total number of genes is 25,880, Ensembl Genes (Zerbino et al., 2018)). Due to
the large dataset used for gene enrichment analysis, we prioritized the Top 30 GO terms
(of more than 500 enriched GO terms) related to biological processes (Figure 3 and Table
SM2). The most significant GO terms were Localization (GO: 0051179), Cellular component
organization or biogenesis (GO: 0071840), Cellular component organization (GO: 0016043),
Positive regulation of biological process (GO: 0048518) and Developmental process (GO:
0032502).
The breed specific analysis yielded four data sets based on the exclusive doubletons found in
each animal. The genes in which the high, moderate or low impact doubletons resided were used
as gene sets for enrichment analysis. For Meishan, 2,283 genes were identified, which were
enriched for 240 GO terms. From the Top 30 terms (Table SM3), a hierarchical tree was obtained
(Figure 4). For the Piétrain group, 16 GO terms were significant from an input of 330 genes
(Figure 5 and Table SM4). For the Large white x Landrace or Large white group, with a gene list
of 272, 12 GO terms were identified (Table SM5; tree not shown due to high similarity to the
Piétrain group). For the latter two groups, many of the genes were involved in biological
processes related to Response to chemical (GO: 0042221) and Sensory perception (GO:
0007600). No enriched terms were found for the Wild boar gene list.
The list of preselected genes explored yielded eleven genes in which high impact variants are
predicted to have a disruptive impact on the protein (Table 4). The genes were related to the
following traits: two coat color genes (KIT and MC1R), two growth related genes (IGF2 and PRKAG3),
one early domestication gene (SMOC2), one gene related to pH and meat color (LDHA), one
reproduction associated gene (PRLR) and one gene related to the vertebrae number (VRTN). The
types of effects were frameshift, splice acceptor and splice donor variants, mostly
resulting from indel type genomic variation.
Table 2. Number of effects by type from SnpEff (for pooled data).
Type Count Percent
3_prime_UTR_variant 580,514 0.888
5_prime_UTR_premature_start_codon_gain_variant 13,918 0.021
5_prime_UTR_truncation 1 0
5_prime_UTR_variant 108,000 0.165
bidirectional_gene_fusion 102 0
conservative_inframe_deletion 926 0.001
conservative_inframe_insertion 1,183 0.002
disruptive_inframe_deletion 1,648 0.003
disruptive_inframe_insertion 1,285 0.002
downstream_gene_variant 3,281,085 5.02
exon_loss_variant 9 0
frameshift_variant 12,132 0.019
gene_fusion 30 0
initiator_codon_variant 25 0
intergenic_region 20,747,527 31.745
intragenic_variant 3 0
intron_variant 36,905,738 56.469
missense_variant 140,085 0.214
non_canonical_start_codon 1 0
non_coding_transcript_exon_variant 18,757 0.029
non_coding_transcript_variant 317 0
splice_acceptor_variant 1,766 0.003
splice_donor_variant 2,433 0.004
splice_region_variant 64,794 0.099
start_lost 370 0.001
stop_gained 1,622 0.002
stop_lost 309 0
stop_retained_variant 130 0
synonymous_variant 247,209 0.378
transcript_ablation 2 0
upstream_gene_variant 3,224,049 4.933
Table 3. Number of effects by impact from SnpEff (for pooled data).
Type | Definition* | Count | Percent
HIGH | The variant is assumed to have high (disruptive) impact in the protein, probably causing protein truncation, loss of function or triggering nonsense mediated decay. | 17,746 | 0.027
MODERATE | A non-disruptive variant that might change protein effectiveness. | 145,006 | 0.222
LOW | Assumed to be mostly harmless or unlikely to change protein behavior. | 315,545 | 0.483
MODIFIER | Usually non-coding variants or variants affecting non-coding genes, where predictions are difficult or there is no evidence of impact. | 64,810,011 | 99.267
*According to SnpEff tool documentation (Cingolani et al. 2012)
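The percentages in Table 3 follow directly from the effect counts, as the short Python check below illustrates. The counts are those listed in the table; the variable names are chosen for this sketch only.

```python
# Effect counts by impact class, as listed in Table 3.
impact_counts = {
    "HIGH": 17_746,
    "MODERATE": 145_006,
    "LOW": 315_545,
    "MODIFIER": 64_810_011,
}

total_effects = sum(impact_counts.values())

# Share of each impact class among all predicted effects, in percent.
impact_percent = {
    impact: round(100.0 * count / total_effects, 3)
    for impact, count in impact_counts.items()
}

print(f"total effects: {total_effects:,}")
for impact, percent in impact_percent.items():
    print(f"{impact:<9}{percent:>8.3f} %")
```

The total exceeds the number of variants because a single variant can receive several predicted effects, for example one per overlapping transcript.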
Figure 3. Pooled analysis: Hierarchical clustering of Top 30 GO Biological Process terms
from the enrichment analysis (genes containing high, moderate and low impact variants).
Figure 4. Meishan breed specific analysis: Hierarchical clustering of Top 30 GO Biological
Process terms from the enrichment analysis (genes containing high, moderate and low impact
doubleton variants).
Figure 5. Piétrain breed specific analysis: Hierarchical clustering of GO Biological Process
terms from the enrichment analysis (genes containing high, moderate and low impact doubleton
variants).
Table 4. Domestication or breeding related genes retrieved from the literature with high
impact effect variants in the founders.
Gene | Transcript ID | High impact variants | Type of effects | Type of variants | Trait*
KIT | ENSSSCT00000009679, ENSSSCT00000062378 | 4 | frameshift variant, splice acceptor variant, splice donor variant | 1 SNP and 2 indels | Coat color
MC1R | ENSSSCT00000022534 | 1 | frameshift variant | 1 indel | Coat color
IGF2 | ENSSSCT00000039341, ENSSSCT00000044712, ENSSSCT00000049151 | 21 | frameshift variant, splice donor variant | 1 SNP and 4 indels | Growth and fat deposition
PRKAG3 | ENSSSCT00000017641, ENSSSCT00000036402 | 3 | frameshift variant, splice acceptor variant | 2 SNPs and 3 indels | Lean growth
SMOC2 | ENSSSCT00000004437 | 1 | frameshift variant | 1 indel | Initial stages of domestication
LDHA | ENSSSCT00000046190 | 1 | frameshift variant | 1 indel | pH and meat colour
PRLR | ENSSSCT00000018325, ENSSSCT00000036206 | 4 | splice acceptor variant, splice donor variant | 2 indels | Reproduction
VRTN | ENSSSCT00000002625 | 1 | frameshift variant | 1 indel | Vertebrae number
*According to literature review (Groenen, 2016; Rothschild and Ruvinsky, 2011)
Discussion
In this paper, we evaluated the genomic variants from WGS data generated for the founder
individuals of four F2 crosses. The panel of 24 animals consists of diverse pig breeds (European
and Asian) as well as the European ancestor (the European Wild boar).
The amount of genomic variation encountered depended on the origin of the individuals.
Having representatives from both European and Asian lineages, which have been
geographically separated for more than one million years (Frantz et al., 2013), leads to specific
individual levels of variation. The Meishan individual is the most genetically distinct (Figure
2) and has the highest number of variants (Figure 1). On the one hand, this is a result of the
breed evolving from the Asian Wild boar, which is known to be fixed for the alternative allele at
over one million locations as compared to its European counterpart (Groenen et al., 2012). On
the other hand, this is also a consequence of using the reference genome assembly Sscrofa 11.1,
which is from a Duroc individual, thus a European breed. Interestingly, we observed a slightly
higher amount of variation in Piétrain, Large white and the crossbred sows (expected due to
the increased heterozygosity levels in these seven crossbred individuals) as compared to the
Wild boar. Two types of events, occurring in the last centuries, explain the observation of
higher genetic diversity in European breeds than in their ancestor: i) domestic pigs
hybridized with local wild populations due to farming practices (Zeder et al., 2006) and ii)
human driven introgression of Asian domestic pigs into the European stocks (Bosse et al.,
2014).
Given the high genomic variation usually encountered in WGS data, prioritizing relevant
variants can at times be challenging. In the pooled analysis, more
than 60 million effects were predicted, of which less than 1% had a disruptive effect at the
protein level; nevertheless, the count of these effects was still high, with 17,746 effects residing
in 5,026 genes. At the top of the list, with more than 20 high impact effect variants, were
ENSSSCG00000001229 (patr class I histocompatibility antigen, A-126 alpha chain-like, a
member of the major histocompatibility complex MHC), ENSSSCG000000031998 (olfactory
guanylyl cyclase GC-D-like) and ENSSSCG00000038461 (taste receptor type 2 member 20-like).
Genes related to immunity, from the MHC complex, and genes related to sensory
perception are families known to be actively evolving and expanding in pigs (Groenen et al.,
2012), thus harboring a high amount of variation. Gene set analysis (which translates gene lists
into enriched functions) used in the pooled strategy proved rather unspecific in its output
because 75% of all annotated Sus scrofa genes were included. Nevertheless, the most significantly
enriched GO terms, represented by more than 4,000 genes, were related to fundamental or basic
biological processes such as localization (GO: 0051179) and development (GO: 0032502).
We observed an increase in the level of specificity with the breed based approach, where the
focus is based on groups of animals representing the European commercial breeds, the
European Wild boar, or the Meishan breed. The genes containing the exclusive variation found
in the Meishan individual clustered in GO terms related to cell adhesion, localization,
developmental processes, anatomical structure morphogenesis, nervous system development,
animal organ development, and others (Figure 4 and Table SM3). This further supports the idea
that the Asian pigs have many fixed variants that are part of vital biological functions.
The private variation for the Piétrain and Large white x Landrace or Large white groups was
mostly related to olfactory receptor genes (Figure 5, Table SM4 and SM5). The swine olfactory
subgenome is the largest gene superfamily and it includes 1,113 functional olfactory receptor
genes and 188 pseudogenes based on Sscrofa 10.2 (Dinh Truong Nguyen et al., 2012). We
compared the exclusive variation, its associated genes, and the enriched biological processes in
the two European breed groups and, while a handful were common, the majority of the
olfactory receptor genes were breed specific.
Finally, we explored the variation from specific genes previously associated with
domestication, production and reproduction traits (Rothschild and Ruvinsky, 2011; Groenen,
2016). Four mutations had a high impact effect on KIT, a gene related to coat color and known to
display extensive genetic heterogeneity (Fontanesi et al., 2010). For IGF2 (Jeon et al., 1999;
Nezer et al., 1999), we detected 21 high impact variants, which could be of further interest as
different phenotypes have been associated with this gene in these crosses (Boysen et al., 2011;
Blaj et al., 2018). In addition, it is worth noting that many of the variants with high impact listed
in Table 4 are actually indels, which are commonly disregarded in genome wide association
studies, even though they have the ability to cause a disruption at the protein level.
Conclusion
The reverse genetics approach used in this study relied on an exploratory analysis conducted
using sequence data for a panel of founder individuals of F2 populations. Effect prediction
tools and gene enrichment analysis are powerful instruments in the genomics area. Such tools
help to dissect the functional implications of genomic variation and can assist in providing
further directions of research.
References
Blaj, I., Tetens, J., Preuß, S., Bennewitz, J., & Thaller, G. (2018). Genome-wide association
studies and meta-analysis uncovers new candidate genes for growth and carcass traits
in pigs. PLoS ONE, 13(10), e0205576.
Borchers, N., Reinsch, N., & Kalm, E. (2000). Familial cases of coat colour‐change in a Piétrain
cross. Journal of Animal Breeding and Genetics, 117(4).
Bosse, M., Megens, H. J., Madsen, O., Frantz, L. A., Paudel, Y., et al. (2014). Untangling the
hybrid nature of modern pig genomes: a mosaic derived from biogeographically distinct
and highly divergent Sus scrofa populations. Molecular Ecology, 23(16), 4089–4102.
Boysen, T. J., Tetens, J., & Thaller, G. (2011). Evidence for additional functional genetic
variation within the porcine IGF2 gene affecting body composition traits in an
experimental Piétrain × Large White/Landrace cross. Animal: An International Journal
of Animal Bioscience, 5(5), 672–677.
Cingolani, P., Platts, A., Le Wang, L., Coon, M., Nguyen, T., et al. (2012). A program for
annotating and predicting the effects of single nucleotide polymorphisms, SnpEff:
SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly, 6(2),
80–92.
Danecek, P., Auton, A., Abecasis, G., Albers, C. A., Banks, E., et al. (2011). The variant call
format and VCFtools. Bioinformatics, 27(15), 2156–2158.
Dinh T. N., Kyooyeol L., Hojun C., Min-kyeung C., Minh T. L., et al. (2012). The complete
swine olfactory subgenome: expansion of the olfactory gene repertoire in the pig
genome. BMC Genomics, 13, 584.
Ernst, C. W., & Steibel, J. P. (2013). Molecular advances in QTL discovery and application in
pig breeding. Trends in Genetics, 29(4), 215-224.
Fontanesi, L., D'Alessandro, E., Scotti, E., Liotta, L., Crovetti, A., et al. (2010). Genetic
heterogeneity and selection signature at the KIT gene in pigs showing different coat
colours and patterns. Animal Genetics, 41(5), 478–492.
Frantz, L. A. F., Schraiber, J. G., Madsen, O., Megens, H.-J., Bosse, M., et al. (2013). Genome
sequencing reveals fine scale diversification and reticulation history during speciation
in Sus. Genome Biology, 14, R107.
Ge, S., & Jung, D. (2018). ShinyGO: a graphical enrichment tool for animals and plants.
bioRxiv, doi: 10.1101/315150.
Geldermann, H., Müller, E., Beeckmann, P., Knorr, C., Yue, G., et al. (1996). Mapping of
quantitative‐trait loci by means of marker genes in F2 generations of Wild boar, Pietrain
and Meishan pigs. Journal of Animal Breeding and Genetics, 113(1‐6), 381-387.
Falker-Gieske C., Blaj I., Preuß S., Bennewitz J., Thaller G., et al. (2019). GWAS for meat and
carcass traits using imputed sequence level genotypes in pooled F2-designs in pigs. G3:
Genes, Genomes, Genetics, 9(9), 2823-2834.
Groenen, M. A. M. (2016). A decade of pig genome sequencing: a window on pig
domestication and evolution. Genetics Selection Evolution, 48, 23.
Groenen, M. A. M., Archibald, A. L., Uenishi, H., Tuggle, C. K., Takeuchi, Y., et al. (2012).
Analyses of pig genomes provide insight into porcine demography and evolution.
Nature, 491(7424), 393–398.
Jeon, J.-T., Carlborg, O., Törnsten, A., Giuffra, E., Amarger, V., et al. (1999). A paternally
expressed QTL affecting skeletal and cardiac muscle mass in pigs maps to the IGF2
locus. Nature Genetics, 21(2), 157–158.
Meuwissen, T. H., Hayes, B. J., & Goddard, M. E. (2001). Prediction of Total Genetic Value
Using Genome-Wide Dense Marker Maps. Genetics, 157(4), 1819–1829.
Nezer, C., Moreau, L., Brouwers, B., Coppieters, W., Detilleux, J., et al. (1999). An imprinted
QTL with major effect on muscle mass and fat deposition maps to the IGF2 locus in
pigs. Nature Genetics, 21, 155–156.
Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A. R., et al. (2007). PLINK:
a tool set for whole-genome association and population-based linkage analyses.
American Journal of Human Genetics, 81(3), 559–575.
Ramos, A. M., Crooijmans, R. P. M. A., Affara, N. A., Amaral, A. J., Archibald, A. L., et al.
(2009). Design of a high density SNP genotyping assay in the pig using SNPs identified
and characterized by next generation sequencing technology. PLoS ONE, 4(8), e6524.
Rothschild, M. F., & Ruvinsky, A. (2011). The Genetics of the Pig. 2nd Edition. Oxfordshire,
UK; Cambridge, USA: CAB International.
Rückert, C., & Bennewitz, J. (2010). Joint QTL analysis of three connected F2-crosses in pigs.
Genetics, Selection, Evolution, 42, 40.
Stratz, P., Schmid, M., Wellmann, R., Preuß, S., Blaj, I., et al. (2018). Linkage disequilibrium
pattern and genome-wide association mapping for meat traits in multiple porcine F2
crosses. Animal Genetics, 49(5), 403–412.
R Core Team (2018). R: A language and environment for statistical computing. R Foundation
for Statistical Computing, Vienna, Austria.
Wickham, H. (2009). ggplot2: Elegant graphics for data analysis. New York: Springer.
Zeder, M. A., Emshwiller, E., Smith, B. D., & Bradley, D. G. (2006). Documenting
domestication: the intersection of genetics and archaeology. Trends in Genetics, 22(3),
139–155.
Zerbino, D. R., Achuthan, P., Akanni, W., Amode, M. R., Barrell, D., et al. (2018). Ensembl
2018. Nucleic Acids Research, 46(D1), D754-D761.
General discussion
The present study investigates the potential of four existing F2 pig resource populations and
aims to demonstrate how the availability of tens of thousands of single nucleotide
polymorphisms (SNPs) (i.e. SNP arrays) and whole genome sequence (WGS) data can raise
and answer questions in the context of both population and quantitative genomics. A substantial
proportion of the phenotypic variation in pig populations, namely in health, production and
reproduction related traits, has a complex or quantitative genetic basis. To decipher the
convoluted nature of quantitative traits, the processes leading to their genetic architecture
have to be addressed. The forces that shape genomes are various genetic mechanisms and
fundamental population processes (e.g. mutation, selection, genetic drift, gene flow, and
recombination) and, in the case of livestock, human interference with these forces as a
result of domestication and selective breeding.
In the following, the main findings of the thesis are discussed along with additional
investigations and results that were not included in the chapters but have direct implications.
Four main topics are covered: the release of the latest pig reference genome, recombination
rates and gene density, pooling data from several F2 crosses for quantitative trait loci (QTL)
mapping purposes and the shift from SNP array to whole genome sequence level data.
Reference genome Sscrofa 11.1
The quality of the reference genome assembly is critical for a successful analysis of
genomic data. At the beginning of the data analyses included in this thesis, only the Sscrofa 10.2
reference genome was available and was therefore used. Genetic marker positions were updated
after the release of the new reference genome Sscrofa 11.1 in 2017 by the Swine Genome
Sequencing Consortium (Schook et al., 2005), and an evaluation of the potential biases and
implications for downstream analysis followed. Figure 1 compares the number of SNPs per
Sus scrofa chromosome (SSC) between the old and the new assembly (i.e. 1 to 18 autosomal, 19
unplaced SNPs and 20 X SNPs). Attention is drawn to SSC19, which in the 10.2 version
comprises more than 4,000 variants, while in the 11.1 version the number decreased
substantially as the SNPs were assigned to the autosomal chromosomes. These findings are
evidence of a denser and more accurate genome-wide SNP coverage as a result of using a higher
quality reference genome.
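The per-chromosome comparison summarized in Figure 1 can be illustrated with a minimal Python sketch. It assumes that the marker map of each assembly is available as a dictionary from SNP name to (chromosome, position); the SNP names, positions and function names are hypothetical and are not taken from the thesis.

```python
from collections import Counter


def snps_per_chromosome(marker_map):
    """Count markers per chromosome code.

    marker_map maps a SNP name to a (chromosome, position) tuple.
    """
    return Counter(chrom for chrom, _pos in marker_map.values())


def compare_assemblies(old_map, new_map, chromosomes=range(1, 21)):
    """Return {chromosome: (n_old, n_new, n_new - n_old)} for each code."""
    old_counts = snps_per_chromosome(old_map)
    new_counts = snps_per_chromosome(new_map)
    return {
        c: (old_counts[c], new_counts[c], new_counts[c] - old_counts[c])
        for c in chromosomes
    }


# Toy data (hypothetical): codes 1-18 autosomes, 19 = unplaced, 20 = X.
old_map = {"snp_a": (1, 1000), "snp_b": (19, 0), "snp_c": (19, 0), "snp_d": (20, 500)}
new_map = {"snp_a": (1, 1200), "snp_b": (2, 3400), "snp_c": (19, 0), "snp_d": (20, 650)}

table = compare_assemblies(old_map, new_map)
for chrom, (n_old, n_new, diff) in table.items():
    if n_old or n_new:
        print(chrom, n_old, n_new, diff)
```

In this toy example one formerly unplaced marker moves to an autosome, which is the kind of shift described for SSC19.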
Figure 1. Number of SNP markers per chromosome: Sscrofa 10.2 in blue versus Sscrofa
11.1 in red. Chromosomes are 1 to 18 autosomal, 19 unplaced and 20 X SNPs.
To demonstrate the benefits of minimizing bias in downstream analyses, the impact on
genome-wide association studies (GWAS) was evaluated with emphasis on the genomic region
harboring the IGF2 gene on SSC2. The gene has been reported to have an effect on muscle
mass and fat deposition (Nezer et al., 1999). The association studies considering additive,
imprinting and paternal effects (in Chapter 1, Chapter 2, and Chapter 3) confirmed it as
a strong candidate gene. IGF2 was not assembled in the Sscrofa 10.2 genome version, but
has now been positioned on the Sscrofa 11.1 reference genome (a comparison of GWAS results
is shown for one exemplary trait in Figure 2). While the statistical significance of the SNPs that
are in linkage disequilibrium (LD) with the causative mutation was not influenced, the
ambiguous genomic context of these SNPs hindered the interpretation of the results. If further
exploratory analysis aimed to define clusters with strong evidence for trait-associated
chromosomal regions (following the methodology in Chapters 1 and 3), some clusters
would be erroneous (e.g. Figure 2, end of SSC2 in the upper Manhattan plot for 10.2). On the
other hand, having the statistically significant variants in the correct genomic context ensures a
higher precision when defining regions of interest (Figure 2, lower Manhattan plot for 11.1).
However, a minor inconvenience was observed when shifting to an updated reference genome.
The release of a new assembly triggers many changes in publicly available online resources
such as the Pig QTL database (Hu et al., 2018), Ensembl variation resources (Hunt et al., 2018),
and DAVID (Dennis et al., 2003). Until they are updated accordingly, interpreting results and
mining data with various bioinformatics tools tends to be cumbersome.
Figure 2. Genome-wide association study for the meat to fat ratio trait in the European breeds
F2 cross. Upper panel: Sscrofa 10.2; lower panel: Sscrofa 11.1.
Recombination and gene density
Recombination is a major driver in shaping the genetic diversity of populations, creating new
allelic combinations each generation. The four F2 crosses constitute highly informative resource
populations for estimating recombination rates, because crossover events can be accurately
inferred in the F1 individuals. Findings related to the recombination landscape are presented in
Chapter 4. Yet again, the implications of using the latest reference genome for deriving the
marker positions are highlighted. Using Sscrofa 11.1 led to higher estimated recombination rates
than using 10.2, because of the overall improved quality of the assembly and the resulting more
accurate downstream analysis (details in the paper).
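To illustrate how crossover events translate into a recombination rate, the following minimal Python sketch assumes that, for each F2 offspring, the grandparental origin (0 or 1) of the chromosome transmitted by one F1 parent has already been determined at ordered informative markers. This input format and the function names are assumptions for illustration only; it is not the pipeline used in Chapter 4, and it ignores genotyping errors that a real analysis would have to filter.

```python
def count_crossovers(origins):
    """Count switches of grandparental origin along one transmitted chromosome.

    origins: list of 0/1 values ordered by marker position; None marks an
    uninformative marker and is skipped.
    """
    informative = [o for o in origins if o is not None]
    return sum(1 for a, b in zip(informative, informative[1:]) if a != b)


def recombination_rate_cm_per_mb(n_crossovers, n_meioses, length_bp):
    """Convert a crossover count into cM/Mb.

    Uses the definition that one expected crossover per meiosis equals
    1 Morgan (100 cM).
    """
    centimorgans = 100.0 * n_crossovers / n_meioses
    return centimorgans / (length_bp / 1e6)


# Toy data (hypothetical): three F2 offspring of the same F1 parent.
gametes = [
    [0, 0, None, 1, 1, 0],  # two switches
    [1, 1, 1, 1],           # no switch
    [0, None, 0, 1],        # one switch
]
total = sum(count_crossovers(g) for g in gametes)
rate = recombination_rate_cm_per_mb(total, n_meioses=len(gametes), length_bp=100_000_000)
print(total, rate)
```

Because the switches are counted between neighbouring markers, a marker placed wrongly by the assembly would create spurious switches, which is one way the choice of reference genome can affect the estimates.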
The nature and spatial pattern of recombination depend on certain chromosomal and sequence
features (e.g. GC content, Tortereau et al., 2012). A genomic feature closely related to GC
content was considered here, namely gene density. In several species, including mice (Paigen
et al., 2008) and humans (Freudenberg et al., 2009), a positive correlation between gene density
and recombination rates has been reported. Interestingly, a statistically significant, weak positive
correlation (p < 0.05) was observed only for the crosses that included Meishan as a founder
breed (Table 1). The Asian breed and the European breeds diverged around 1 million
years ago (Groenen et al., 2012), which could explain, in general terms, the different distribution
of recombination rates across the genome. In Chapter 4, other breed-specific
characteristics are described, with emphasis on patterns encountered in the individuals stemming
from the Wild boar founder.
Table 1. Genome-wide correlation between gene density and recombination rates in 1 Mb
bins (P = Piétrain, Lw = Large white, L = Landrace, M = Meishan and W = Wild boar).
Statistically significant correlations (p < 0.05) are marked with an asterisk.
Cross             Correlation coefficient    p value
P x (LwxL)/Lw     0.25 (0.13)                0.057
M x P             0.31 (0.13)                0.018 *
W x P             0.15 (0.14)                0.287
W x M             0.32 (0.13)                0.014 *
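A correlation of the kind reported in Table 1 can be sketched as follows. The example computes a Pearson correlation between gene density and recombination rate per bin and obtains a p value by permutation. Both the choice of Pearson correlation and the permutation test are assumptions made for illustration, since the test used for Table 1 is not stated here, and the bin values are invented toy numbers.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def permutation_p_value(x, y, n_perm=2000, seed=1):
    """Two-sided permutation p value for the correlation between x and y."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Toy data (hypothetical): genes per 1 Mb bin and recombination rate in cM/Mb.
gene_density = [2, 5, 9, 12, 20, 25, 31, 40]
rec_rate = [0.3, 0.5, 0.4, 0.8, 0.9, 1.1, 1.0, 1.6]

r = pearson_r(gene_density, rec_rate)
p = permutation_p_value(gene_density, rec_rate)
print(round(r, 3), p)
```

A permutation test avoids distributional assumptions, but neighbouring bins along a chromosome are not independent, so a real analysis may need a block-wise scheme.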
For a coarse inspection of the landscape at the chromosome level, estimates of the
recombination rates for all four crosses together with the gene density are shown in Figure 3.
In general, higher rates tended to be located in distal regions of the chromosomes, which has
been described previously in various other mammalian species (Jensen-Seaman et al., 2004), but
also in plants (Jordan et al., 2018). This consistency across taxa suggests highly conserved
mechanisms involving, for example, chromosomal features (e.g. centromere location) and their
influence during meiosis. The gene density followed a similar distribution to the
recombination estimates for some of the chromosomes (e.g. SSC10, SSC18); however, a
potential mechanism is more difficult to propose in this case. An investigation at a higher
resolution would be required to understand the major drivers of the correlation
between recombination rates and various chromosomal or sequence features.
In the recent past, there has been growing interest in investigating whether the distribution of
recombination events along the chromosomes is genetically controlled. An association
study in cattle (Sandor et al., 2012), for instance, provided evidence for genetic variants in
REC8, RNF212 and PRDM9 influencing genome-wide recombination rates and hotspot usage.
Along the same lines, a preliminary analysis on the four F2 crosses was conducted.
Two limiting factors were identified: i) the small sample size (fewer than 200 F1 individuals) and
ii) the difficulty of defining the phenotype (lack of a standard definition). Nevertheless, as two of
the chapters (i.e. 1 and 3) in the thesis show, the sample size can be increased by pooling data
available from other informative populations in which recombination rates can be accurately
estimated.
Figure 3. Kernel density estimates of the recombination rates for all four crosses
compared to gene density. Autosomal chromosomes from 1 to 18.
Pooling data in F2 crosses
Genome-wide association studies are a powerful tool to identify phenotype-associated variants.
The outcome of a GWAS is determined by several parameters such as marker density, mapping
population, sample size as well as the genetic architecture of the quantitative trait. With respect
to increasing the sample size, Chapters 1 and 3 address and use two suitable approaches to
investigate data collectively: a joint analysis, in which the phenotypic and genotypic data of
multiple datasets are combined directly, and a meta-analysis. The latter method does not require
access to the original datasets as it relies on combining information from the summary statistics
of a GWAS, specifically the effects and the p values (Evangelou and Ioannidis, 2013). The
meta-analysis can increase the detection power and reduce false-positive findings while
efficiently accounting for population substructure and for study-specific covariates
(Willer et al., 2010).
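How summary statistics can be combined without access to the raw data can be illustrated with a sample-size weighted Z-score scheme, one of the approaches commonly used for GWAS meta-analysis. The sketch below is a minimal illustration for a single SNP, assuming that every study reports the effect for the same reference allele; the effect sizes, p values and sample sizes are invented, and the scheme is not claimed to be the exact configuration used in Chapter 1.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def signed_z(p_value, effect):
    """Signed Z-score from a two-sided p value and the sign of the effect."""
    z = -_STD_NORMAL.inv_cdf(p_value / 2.0)  # non-negative magnitude
    return math.copysign(z, effect)


def meta_analysis(studies):
    """Combine per-study results for one SNP.

    studies: list of (effect, p_value, sample_size) tuples.
    Returns the combined Z-score and its two-sided p value.
    """
    weights = [math.sqrt(n) for _effect, _p, n in studies]
    z_scores = [signed_z(p, effect) for effect, p, _n in studies]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    z_meta = numerator / math.sqrt(sum(w * w for w in weights))
    p_meta = 2.0 * _STD_NORMAL.cdf(-abs(z_meta))
    return z_meta, p_meta


# Toy data (hypothetical): three crosses with consistent effect direction.
consistent = [(0.40, 0.01, 300), (0.35, 0.01, 300), (0.50, 0.01, 300)]
z_meta, p_meta = meta_analysis(consistent)
print(z_meta, p_meta)

# Opposite effect directions cancel out.
conflicting = [(0.40, 0.01, 300), (-0.40, 0.01, 300)]
print(meta_analysis(conflicting))
```

The example shows why pooling helps: three moderately significant results with the same direction yield a much smaller combined p value, whereas results with opposite directions do not.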
The joint analysis was proven to be efficient for increasing the mapping power when pooling
data from several F2 designs (Rückert and Bennewitz, 2010; Bennewitz and Wellmann, 2014;
Stratz et al., 2018). To complement the work of the previous studies, Chapter 1 compares the
results from a joint and a meta-analysis for growth and carcass traits and concludes that the
outputs are similar and that the latter approach can be a valuable tool whenever access to raw
datasets is limited. For the F2 crosses here, the joint analysis is the more straightforward
approach (and was thus used in Chapter 3) because the data can be combined in one dataset and
analyzed at once. In contrast, the meta-analysis approach requires one GWAS per population,
followed by an additional statistical step to derive the final output, and could thus be more
time-consuming. A GWAS for cattle stature (Bouwman et al., 2018) considered 17 populations
with meta-analysis as the method of choice, suggesting that the benefit of employing a
meta-analysis becomes apparent when combining data from a larger number of populations.
Moreover, when sequence level data are available for the populations (as was the case for the
cattle populations mentioned above), the meta-analysis is more feasible.
From SNP array to whole genome sequence data
The thesis transitions throughout its chapters from SNP array based to sequence data based
analyses, demonstrating the advantages of increasing the genome-wide marker density. In
livestock species, the usefulness of the SNP array is undeniable, as shown by the continued
success of genomic selection, association and genetic diversity studies.
The design of the chip (Ramos et al., 2009) aims to be suitable and informative for various pig
breeds. The preselected variants are based on a validation population mostly composed of
European and US commercial breeds, whilst Asian pigs are underrepresented (Meishan
individuals account for ca. 5% of 554 animals). This ascertainment bias led to a smaller
number of polymorphic SNPs in the two F2 populations derived from Meishan and an increased
number of variants in the high minor allele frequency spectrum (Figure 4, left hand side). In
contrast, sequence data do not imply any preselection, and their minor allele frequency
distribution (Figure 4, right hand side) is shifted towards lower values. These variants
can be of interest for GWAS purposes, as a significant proportion of the phenotypic variance
of a quantitative trait may depend on rare variants (Visscher et al., 2017). Additionally, given
the high density of variants in WGS data, the genetic variation is captured in a more
complete manner than with common genotyping arrays.
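The minor allele frequency comparison can be made concrete with a short Python sketch. It assumes genotypes are coded as the number of copies of the alternate allele (0, 1 or 2) with None for missing calls; the coding, the bin count and the toy genotypes are illustrative assumptions and do not come from the thesis data.

```python
def minor_allele_frequency(genotypes):
    """MAF of one biallelic variant.

    genotypes: iterable of 0/1/2 alternate-allele counts, None for missing.
    Returns None if no genotype was called.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1.0 - alt_freq)


def maf_histogram(mafs, n_bins=5):
    """Count MAF values in equal-width bins covering [0, 0.5]."""
    counts = [0] * n_bins
    width = 0.5 / n_bins
    for maf in mafs:
        # A MAF of exactly 0.5 is assigned to the last bin.
        index = min(int(maf / width), n_bins - 1)
        counts[index] += 1
    return counts


# Toy data (hypothetical): three variants genotyped in a few animals.
variants = [
    [0, 1, 2, 1, None],  # MAF 0.5
    [0, 0, 0, 1],        # MAF 0.125
    [2, 2, 2, 2],        # monomorphic, MAF 0.0
]
mafs = [minor_allele_frequency(v) for v in variants]
print(mafs, maf_histogram(mafs))
```

Applied per data source, such a histogram would show the pattern described in the text: array SNPs concentrated in the higher MAF bins and sequence variants in the lower ones.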
Figure 4. Minor allele frequency (MAF) distribution. Upper row is Piétrain x (Large white
x Landrace)/Large white, lower row is Meishan x Piétrain, left column is MAF from SNP array
and right column is MAF from WGS.
GWAS are usually the first step, based on statistical models, in elucidating the molecular
mechanisms underlying the phenotypes of interest. The post-GWAS analysis
relies rather on bioinformatics tools to prioritize variants for further functional validation
(hypothesis-based experiments) (Lappalainen, 2015). Following these lines, the association
analysis in Chapter 1 (based on SNP array data) was successful in pinpointing genomic regions
associated with growth and carcass traits and, as a result of an exploratory analysis, candidate
genes were nominated. Chapter 3, based on imputed sequence data, adds to the resolution of the
study. Using additional tools such as the Variant Effect Predictor (McLaren et al., 2016) and
various databases, several putative causative mutations were identified as potentially affecting
the traits of interest (average daily gain, backfat thickness, meat to fat ratio and carcass length).
These findings pave the way for designing experiments (e.g. by means of eQTL studies; Nica and
Dermitzakis, 2013) to prove the causality of the variants and will be considered in follow-up
studies.
The final chapter is further proof of the power that sequence data hold. One noteworthy
aspect is that, in the context of WGS, the Meishan is no longer underrepresented with respect
to the number of variants, mainly due to the reference genome, i.e. TJ Tabasco, which belongs
to the Duroc European breed (Groenen et al., 2012). Based on exclusive variants, part of the
olfactory receptor gene family was found to be breed specific when comparing the two
European breed groups (the Piétrain group and the Large white x Landrace and Large white
group). The survey provides valuable insight into how the Asian and European lineages diverged
and how breed formation shaped the pig genome.
Concluding remarks
F2 crosses provide a powerful tool for genetic research. The multitude of analyses designed to
exploit such populations demonstrates their remarkable potential for investigating additive,
dominance and imprinting effects and for proposing putative causative mutations for further
functional validation. They also make it possible to estimate recombination rates across the
genome and to determine the respective patterns at population, breed and individual level. The
directions for future applications of the four F2 crosses are manifold. The populations are well
characterized for additional phenotypes; therefore, a GWAS based on imputed sequence data
addressing these traits is planned. Moreover, the recombination rates and the drivers shaping the
genome-wide recombination landscape will require an in-depth investigation for a better
understanding of the underlying mechanisms.
This thesis emphasizes the strengths of genomic data with substantial implications for
research and practical breeding. Genotyping of animals at birth and technology-aided
phenotype collection will likely become routine at the farm level. Data-driven precision
breeding will enter the scene, helping to ensure sustainable farming for the future. However, this
will only be possible if there are methods to analyze big data (e.g. machine learning) as well as
the ability to interpret the large-scale data sets that are generated by various emerging
technologies.
References
Bennewitz, J., & Wellmann, R. (2014). Mapping Resolution in Single and Multiple F2
Populations using Genome Sequence Marker Panels. Proceedings of the 10th World
Congress on Genetics Applied to Livestock Production, 17-22.
Dennis, G., Sherman, B. T., Hosack, D. A., Yang, J., Gao, W., et al. (2003). DAVID: Database
for annotation, visualization, and integrated discovery. Genome Biology, 4(9), R60.
Evangelou, E., & Ioannidis, J. P. A. (2013). Meta-analysis methods for genome-wide
association studies and beyond. Nature Reviews Genetics, 14, 379.
Freudenberg, J., Wang, M., Yang, Y., & Li, W. (2009). Partial correlation analysis indicates
causal relationships between GC-content, exon density and recombination rate in the
human genome. BMC Bioinformatics, 10(1), S66.
Groenen, M. A., Archibald, A. L., Uenishi, H., Tuggle, C. K., Takeuchi, Y., et al. (2012).
Analyses of pig genomes provide insight into porcine demography and evolution.
Nature, 491(7424), 393.
Hu, Z. L., Park, C. A., & Reecy, J. M. (2018). Building a livestock genetic and genomic
information knowledgebase through integrative developments of Animal QTLdb and
CorrDB. Nucleic Acids Research, 47(D1), D701-D710.
Jensen-Seaman, M. I., Furey, T. S., Payseur, B. A., Lu, Y., Roskin, K. M., et al. (2004).
Comparative recombination rates in the rat, mouse, and human genomes. Genome
Research, 14(4), 528-538.
Jordan, K. W., Wang, S., He, F., Chao, S., Lun, Y., et al. (2018). The genetic architecture of
genome‐wide recombination rate variation in allopolyploid wheat revealed by nested
association mapping. The Plant Journal, 95(6), 1039-1054.
Nezer, C., Moreau, L., Brouwers, B., Coppieters, W., Detilleux, J., et al. (1999). An imprinted
QTL with major effect on muscle mass and fat deposition maps to the IGF2 locus in
pigs. Nature Genetics, 21, 155.
Lappalainen, T. (2015). Functional genomics bridges the gap between quantitative genetics and
molecular biology. Genome Research, 25(10), 1427-1431.
McLaren, W., Gil, L., Hunt, S. E., Riat, H. S., Ritchie, G. R. S., et al. (2016). The Ensembl
Variant Effect Predictor. Genome Biology, 17(1), 122.
Nica, A. C., & Dermitzakis, E. T. (2013). Expression quantitative trait loci: present and future.
Philosophical Transactions of the Royal Society B, 368(1620), 20120362.
Paigen, K., Szatkiewicz, J. P., Sawyer, K., Leahy, N., Parvanov, E. D., et al. (2008). The
recombinational anatomy of a mouse chromosome. PLoS Genetics, 4(7), e1000119.
Ramos, A. M., Crooijmans, R. P., Affara, N. A., Amaral, A. J., Archibald, A. L., et al. (2009).
Design of a high density SNP genotyping assay in the pig using SNPs identified and
characterized by next generation sequencing technology. PLoS ONE, 4(8), e6524.
Rückert, C., & Bennewitz, J. (2010). Joint QTL analysis of three connected F2-crosses in pigs.
Genetics Selection Evolution, 42(1), 40.
Hunt, S. E., McLaren, W., Gil, L., Thormann, A., Schuilenburg, H., et al. (2018). Ensembl
variation resources. Database, Volume 2018, doi:10.1093/database/bay119.
Sandor, C., Li, W., Coppieters, W., Druet, T., Charlier, C., et al. (2012). Genetic variants in
REC8, RNF212, and PRDM9 influence male recombination in cattle. PLoS Genetics,
8(7), e1002854.
Schook, L. B., Beever, J. E., Rogers, J., Humphray, S., Archibald, A., et al. (2005). Swine
genome sequencing consortium (SGSC): A strategic roadmap for sequencing the pig
genome. Comparative and Functional Genomics, 6, 251-255.
Stratz, P., Schmid, M., Wellmann, R., Preuß, S., Blaj, I., et al. (2018). Linkage disequilibrium
pattern and genome‐wide association mapping for meat traits in multiple porcine F2
crosses. Animal Genetics, 49(5), 403-412.
Tortereau, F., Servin, B., Frantz, L., Megens, H. J., Milan, D., et al. (2012). A high density
recombination map of the pig reveals a correlation between sex specific recombination
and GC content. BMC Genomics, 13(1), 586.
Visscher, P. M., Wray, N. R., Zhang, Q., Sklar, P., McCarthy, M. I., et al. (2017). 10 years of
GWAS discovery: biology, function, and translation. The American Journal of Human
Genetics, 101(1), 5-22.
Willer, C. J., Li, Y., & Abecasis, G. R. (2010). METAL: fast and efficient meta-analysis of
genomewide association scans. Bioinformatics, 26(17), 2190-2191.
General summary
Advances in pig genomics, for both research and practical breeding, now rely greatly on the
availability of genome-wide dense single nucleotide polymorphism (SNP) panels and next
generation sequencing. Mapping populations, such as F2 crosses, together with technological
progress were building blocks for the development of the genomic selection concept. This type
of selection is fundamentally a genome-wide extension of marker-assisted
selection, which was facilitated by various quantitative trait loci (QTL) mapping experiments.
The present thesis investigates the potential of four existing F2 pig resource populations and
aims to demonstrate how the availability of tens of thousands of SNP markers (i.e. SNP arrays)
and whole genome sequence (WGS) data can raise and answer questions in the context of both
population and quantitative genomics. Every generation level (F0, F1, and F2) conveys different
types of information that are accessible by using phenotypes coupled with genotypes or
phenotypes coupled with sequence data, but also by considering genotypic or sequence data
alone.
The main purpose of an F2 cross is to map QTLs and to locate putative causative mutations
associated with phenotypes of interest, which in this thesis are related to growth, carcass
and fat deposition. Briefly, the F2 resource populations included here comprise one cross
stemming from European type breeds (Piétrain, Landrace, and Large white), while the other
three crosses originate from distantly related founder breeds (Piétrain, the Asian breed Meishan,
and the European ancestor, the Wild boar). The method of choice to relate phenotypic to
genotypic information is the genome-wide association study (GWAS).
Chapter 1 explores how a collective investigation of data from several F2
resource populations is advantageous. Based on SNP array data, this study performs an
individual population GWAS, a joint population GWAS and a meta-analysis in three of the pig
F2 designs (with Piétrain as a common founder) for growth and carcass traits. The benefit of
pooling the data is an increased mapping resolution that narrowed the genomic regions
harboring causative variants. Many genes previously associated with the traits were confirmed
and further new candidate genes were suggested (e.g. BMP2, bone morphogenetic protein 2, for
carcass length). An extension of this work goes beyond the additive genetic effects and looks
into dominance and imprinting effects in all four F2 crosses (taken separately) by means of
variance component estimation and various GWAS models (Chapter 2). The contribution of
the imprinting effects to the total phenotypic variance ranges from zero to 19%, while the
dominance effects account for up to 34%. Significant associations from the imprinting and
paternal GWAS exist in the IGF2 (insulin-like growth factor 2) region for the traits related to
growth and fat deposition. To further dissect the genetic architecture of quantitative traits at an
even higher resolution, Chapter 3 presents a GWAS in the four pooled F2 resource populations
that have been imputed to WGS data based on high coverage founder and low coverage F1
sequencing. Besides providing directions for further research by uncovering information on
putative causative mutations in candidate genes as well as pathways, this research demonstrates
a convenient approach to efficiently exploit well-characterized experimental designs
established in the past.
The last two chapters of the thesis consider levels of information often overlooked in an F2
population, specifically at the F1 and the founder (F0) generation level. Chapter 4 covers
investigations on recombination events, a process involved in maintaining genetic variability
and in the evolution of genomes. The basis of the constructed recombination maps is the F1
generation because crossover events can be inferred in each parent-child pair (F1-F2), owing to
the population structure in which many full sibs stem from one F1 parent. The level of
recombination events varies within crosses, among breeds and individuals as well as across
chromosomes or regions within chromosomes. Although substantial heterogeneity is observed in
the designs, certain patterns specific to crosses, sex or chromosomes do exist and are influenced
by the extent of conservation of the local rate of recombination over time, by the levels of
diversity, the efficiency or direction of selection, and the genome composition. Finally, in
Chapter 5 a reverse genetics approach explores information about the F2 resource populations by
surveying the sequence data variation within the founder generation. The rationale is that almost
all the variation which exists in the F2 generation (used for GWAS purposes) has been propagated
from the founder generation. This exploratory analysis indicates how large-scale genomic data
can offer insights into the founder population structure and breed-specific variation and how
appropriate bioinformatics tools and databases can lead to a knowledge-driven variant selection
for further functional validation.
General summary (German version, translated)
Advances in pig genomics, for both research and practical breeding, depend on the
availability of genome-wide SNP panels and next generation sequencing. Mapping populations,
such as F2 crosses, and technological progress were the building blocks for the development of
genomic selection. This type of selection is an extension of marker-assisted selection to the
genome-wide level, which was facilitated by various QTL mapping experiments.
The present research investigates the potential of using SNP arrays and
whole genome sequencing (WGS) in four existing F2 pig resource populations. This approach
raises questions relating to population and quantitative genetics and can answer them.
Every generation level (F0, F1, F2) conveys different types of information, which are accessible
by coupling genotypes with phenotypes as well as by using genetic information alone.
Der Hauptzweck einer F2-Kreuzung besteht darin, QTLs zu kartieren und vermeintliche kausale
Mutationen mit Phänotypen zu assoziieren. In der vorliegenden Arbeit wurden Phänotypen
betrachtet, die mit dem Wachstum, der Schlachtkörperqualität und der Fettverteilung assoziiert
sind. Eine Kreuzung der F2-Ressourcenpopulation besteht aus europäischen Rassen (Piétrain,
Landrasse und Deutsches Edelschwein), während die anderen drei Kreuzungen von entfernt
verwandten Gründerrassen stammen (Piétrain, die asiatische Rasse Meishan und der
europäische Vorfahre, das Wildschwein). Um phänotypische und genotypische Informationen
zu verbinden, wird die genomweite Assoziationsstudie (GWAS) verwendet. In Kapitel 1 wird
eruiert, wie die Beobachtungen der verschiedenen F2-Populationen kombiniert und gemeinsam
analysiert werden können. Dafür wurde für jede Population eine separate GWAS, sowie eine
gemeinsame GWAS aller Populationen durchgeführt. Weiterhin wurde eine Meta-Analyse für
drei der beschriebenen F2-Designs (mit Piétrain als gemeinsame Gründerrasse) für Wachstums-
und Schlachtkörpermerkmale durchgeführt. Der Vorteil der zusammengefassten Daten ist eine
erhöhte Kartierungsauflösung, die auf engere Genomregionen mit kausalen Varianten hinweist.
Viele Gene, die zuvor mit den Merkmalen assoziiert wurden, werden bestätigt und weitere neue
Kandidatengene werden vorgeschlagen (z.B. BMP2 bone morphogenetic protein 2 für die
Schlachtkörperlänge). Eine Erweiterung dieser Arbeit geht über die additiven genetischen
Effekte hinaus und untersucht Dominanz- und Imprinting-Effekte in allen vier F2-Kreuzungen
mit Hilfe der Varianzkomponentenschätzung und verschiedenen GWAS-Modellen (Kapitel 2).
Der Anteil der Imprinting-Effekte an der gesamten phänotypischen Varianz liegt zwischen Null
and 19%, whereas dominance effects explain up to 34% of the phenotypic variance. The
results show significant associations in the IGF2 region (insulin-like growth factor 2)
for the traits related to growth and fat distribution. Chapter 3 presents a GWAS of the
four pooled F2 resource populations. For this purpose, the available genotypes were
imputed to sequence level, taking into account the available genomic information from
founder animals and the F1 generation. Candidate genes and putative causal mutations were
identified that can guide further research. The work also demonstrates a practicable
approach to the efficient use of already established experimental designs.
The last two chapters of this thesis deal with information from the F1 and the founder
(F0) generations, which is often overlooked. Chapter 4 investigates recombination, a
process that maintains genetic variability and drives the evolution of genomes. The
recombination maps were constructed on the basis of the F1 generation, since crossovers
occur in every parent-offspring pair (F1-F2). The number of recombination events varied
within the crosses, between breeds and individuals, between chromosomes and within genomic
regions. Although considerable heterogeneity is observed across the designs, certain
patterns exist for the crosses, sexes or chromosomes. The identified patterns were
influenced by the recombination rate over time, by diversity, by the efficiency or
direction of selection and by genome composition. Chapter 5 examines the sequence
variation in the founder generation using a reverse genetics approach in order to explore
the information contained in the F2 resource populations. The rationale is that most of
the genetic variation in the F2 population was passed on from the founder population. This
exploratory analysis shows how genome-wide data can be used to draw conclusions about the
population structure of founder breeds, and how suitable bioinformatic methods and
databases can lead to a knowledge-based selection of variants for further functional
validation.
Appendix
Supplementary Material for Chapter 2 is available upon request in a digital format.
Supplementary Material for Chapter 5:
Table SM1. Heterozygosity levels and genomic inbreeding coefficient (F). P = Piétrain, Lw
= Large white, L = Landrace, M = Meishan, W = Wild boar.
Sample ID Breed O(HOM) E(HOM) N(NM) F
10345 P 25619605 2.50E+07 32492844 0.08659
17118 P 26334896 2.50E+07 32515253 0.1794
17123 P 26114274 2.50E+07 32498664 0.1518
17161 P 25770702 2.50E+07 32507073 0.1052
17165 P 25508230 2.50E+07 32497963 0.07124
662 Lw x L 24740919 2.50E+07 32461390 -0.02733
690 Lw x L 24832741 2.50E+07 32464041 -0.01523
693 Lw x L 24715282 2.49E+07 32453620 -0.02998
728 Lw 25731335 2.49E+07 32451402 0.1057
735 Lw x L 24759779 2.49E+07 32456585 -0.02433
750 Lw x L 24777990 2.50E+07 32462484 -0.02254
756 Lw x L 24714906 2.49E+07 32454557 -0.03012
771 Lw x L 24698781 2.50E+07 32463989 -0.03328
M199 M 19403359 2.46E+07 32045079 -0.7073
P102 P 26146023 2.50E+07 32510064 0.1549
P107 P 25866702 2.50E+07 32519648 0.1166
P108 P 26182812 2.50E+07 32511118 0.1596
P113 P 25538474 2.50E+07 32495620 0.07541
P115 P 25910374 2.50E+07 32482869 0.1264
P119 P 25980850 2.50E+07 32496978 0.1343
P128 P 26652961 2.50E+07 32491478 0.2242
P130 P 25822097 2.50E+07 32505442 0.1122
P181 W 26962335 2.49E+07 32450507 0.2698
P244 P 25513125 2.50E+07 32500177 0.07163
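The F column can be related to the three count columns with a short calculation. The sketch below assumes that F follows the common method-of-moments definition, F = (O(HOM) - E(HOM)) / (N(NM) - E(HOM)), with O(HOM) the observed and E(HOM) the expected number of homozygous genotypes and N(NM) the number of non-missing genotypes. That definition is not stated in the table itself, and the function name is chosen here for illustration only. Because E(HOM) is printed rounded to three significant figures, recomputed values only approximate the tabulated F.

```python
def inbreeding_coefficient(observed_hom, expected_hom, n_nonmissing):
    """Method-of-moments F = (O - E) / (N - E); assumed definition, see text."""
    return (observed_hom - expected_hom) / (n_nonmissing - expected_hom)


# Values copied from Table SM1; E(HOM) is the rounded figure printed there.
f_10345 = inbreeding_coefficient(25_619_605, 2.50e7, 32_492_844)  # Piétrain
f_m199 = inbreeding_coefficient(19_403_359, 2.46e7, 32_045_079)   # Meishan

print(f"10345: F = {f_10345:.4f} (table: 0.08659)")
print(f"M199:  F = {f_m199:.4f} (table: -0.7073)")
```

The recomputed values land close to the tabulated ones (about 0.083 against 0.08659, and about -0.70 against -0.7073), which is the size of deviation expected from the rounding of E(HOM).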
Table SM2. Pooled analysis: Top 30 GO Biological Process enriched terms.
Enrichment FDR  Genes in list  Total genes  Functional Category
1.00E-20 4410 4834 Localization
7.20E-20 3899 4264 Developmental process
5.80E-19 4128 4527 Positive regulation of biological process
2.50E-18 3678 4024 Anatomical structure development
1.60E-16 3235 3535 Transport
2.50E-16 3657 4011 Positive regulation of cellular process
5.70E-16 3337 3653 Establishment of localization
4.10E-15 3269 3581 Multicellular organism development
5.80E-15 4347 4798 Cellular component organization or biogenesis
3.10E-14 4221 4660 Cellular component organization
4.80E-14 2703 2951 Cellular developmental process
9.70E-14 3014 3302 System development
1.90E-13 2554 2787 Cell differentiation
6.70E-13 1647 1776 Organic substance transport
1.80E-11 1781 1931 Regulation of localization
2.00E-11 1940 2109 Macromolecule localization
3.00E-11 2624 2877 Regulation of biological quality
5.70E-11 1777 1929 Cell surface receptor signaling pathway
6.20E-11 1826 1984 Anatomical structure morphogenesis
6.40E-11 2687 2950 Regulation of response to stimulus
7.90E-11 3527 3898 Negative regulation of biological process
8.90E-11 2218 2424 Animal organ development
1.30E-09 1897 2070 Cellular response to chemical stimulus
1.30E-09 2021 2209 Regulation of multicellular organismal process
1.40E-09 702 742 Organic acid metabolic process
1.60E-09 2236 2451 Response to stress
1.80E-09 1782 1942 Response to organic substance
2.50E-09 684 723 Oxoacid metabolic process
3.40E-09 649 685 Carboxylic acid metabolic process
3.70E-09 2506 2757 Positive regulation of metabolic process
Table SM3. Meishan breed specific analysis: Top 30 GO Biological Process enriched terms.
Enrichment FDR  Genes in list  Total genes  Functional Category
2.0E-11 169 913 Cell adhesion
2.0E-11 642 4834 Localization
2.5E-11 169 919 Biological adhesion
4.2E-10 568 4264 Developmental process
2.1E-09 536 4024 Anatomical structure development
4.8E-09 483 3581 Multicellular organism development
6.0E-09 450 3302 System development
9.8E-08 288 1984 Anatomical structure morphogenesis
9.8E-08 595 4660 Cellular component organization
1.2E-06 602 4798 Cellular component organization or biogenesis
4.5E-06 391 2951 Cellular developmental process
5.5E-06 227 1557 Nervous system development
7.0E-06 271 1931 Regulation of localization
8.5E-06 328 2424 Animal organ development
2.6E-05 213 1474 Movement of cell or subcellular component
2.7E-05 366 2787 Cell differentiation
2.8E-05 282 2060 Intracellular signal transduction
4.1E-05 226 1595 Cell development
6.3E-05 271 1987 Cellular localization
1.0E-04 32 122 Homophilic cell adhesion via plasma membrane adhesion molecules
1.1E-04 116 720 Regulation of cellular component movement
1.1E-04 154 1023 Plasma membrane bounded cell projection organization
1.2E-04 444 3535 Transport
1.2E-04 124 786 Circulatory system development
1.2E-04 96 569 Cell-cell adhesion
1.3E-04 41 180 Cell-cell adhesion via plasma-membrane adhesion molecules
1.3E-04 160 1079 Cytoskeleton organization
1.3E-04 155 1038 Cell projection organization
1.8E-04 281 2109 Macromolecule localization
1.8E-04 455 3653 Establishment of localization
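As a reading aid for the two count columns, the share of a term's annotated genes that appears in the gene list can be computed directly. The sketch below uses three rows copied from Table SM3. The ratio is purely descriptive and is not the enrichment statistic behind the FDR column, which also depends on the size of the gene list and of the background set, neither of which is given in the table. The names `rows` and `list_share` are illustrative.

```python
# (functional category, genes in list, total genes), copied from Table SM3
rows = [
    ("Cell adhesion", 169, 913),
    ("Localization", 642, 4834),
    ("Homophilic cell adhesion via plasma membrane adhesion molecules", 32, 122),
]


def list_share(genes_in_list, total_genes):
    """Fraction of a term's annotated genes that occur in the gene list."""
    return genes_in_list / total_genes


shares = {term: list_share(k, n) for term, k, n in rows}

# Print terms from the largest to the smallest share.
for term, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{share:6.1%}  {term}")
```

Broad terms such as Localization contribute many genes in absolute numbers but a smaller share of their annotation than narrow terms such as the adhesion categories.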
Table SM4. Piétrain breed specific analysis: GO Biological Process enriched terms.
Enrichment FDR  Genes in list  Total genes  Functional Category
1.50E-15 75 1713 Sensory perception of smell
1.50E-15 75 1716 Detection of chemical stimulus involved in sensory perception
1.50E-15 75 1689 Detection of chemical stimulus involved in sensory perception of smell
1.50E-15 76 1758 Sensory perception of chemical stimulus
1.50E-15 76 1737 Detection of chemical stimulus
1.50E-15 83 2027 Sensory perception
4.10E-15 75 1753 Detection of stimulus involved in sensory perception
8.50E-15 76 1822 Detection of stimulus
1.20E-13 86 2353 Nervous system process
8.70E-13 85 2394 G protein-coupled receptor signaling pathway
1.50E-12 93 2796 System process
1.30E-05 105 4429 Response to chemical
7.50E-03 9 122 Homophilic cell adhesion via plasma membrane adhesion molecules
2.90E-02 10 180 Cell-cell adhesion via plasma-membrane adhesion molecules
4.60E-02 2 3 Negative regulation of vascular smooth muscle contraction
4.60E-02 2 3 Protein localization to nuclear inner membrane
Table SM5. Large white x Landrace/Large white breed specific analysis: GO Biological
Process enriched terms.
Enrichment FDR  Genes in list  Total genes  Functional Category
1.70E-12 62 1689 Detection of chemical stimulus involved in sensory perception of smell
1.70E-12 62 1713 Sensory perception of smell
1.70E-12 62 1716 Detection of chemical stimulus involved in sensory perception
2.20E-12 62 1737 Detection of chemical stimulus
2.50E-12 62 1753 Detection of stimulus involved in sensory perception
2.50E-12 62 1758 Sensory perception of chemical stimulus
1.10E-11 62 1822 Detection of stimulus
3.10E-11 65 2027 Sensory perception
2.40E-10 77 2796 System process
1.70E-09 68 2394 G protein-coupled receptor signaling pathway
2.00E-09 67 2353 Nervous system process
1.40E-03 83 4429 Response to chemical
Acknowledgements
This dissertation would not have been possible without the support of many people, both at a
professional and personal level.
To start, I would like to thank Prof. Dr. Georg Thaller for his constant guidance, for the
productive discussions, valuable suggestions and encouragement throughout my doctoral studies.
Additional thanks go to Prof. Dr. Jens Tetens who was of great support in data analysis and lab
work but also in shaping many of the research ideas. Further, for the lab work, I greatly
appreciate the support from Gabi and Fabian who eased the workload.
I also want to thank my colleagues for the pleasant work environment and for the fun times at
conferences and other institute activities. Mitze, Katharina and Sowah, thanks for putting up
with me as your office mate. A big “gracias” goes to Edson for all the constructive
conversations, for his friendship and the last minute support for the thesis. Laura and Christin,
many thanks for the German translations.
Kiel was home in the last years due to having many warmhearted people around. I believe
people make the places (unless it is Iceland) therefore special thanks go to Asli, Luca and Utku
(in no particular order, it is alphabetic), for their true friendship and unconditional love. I would
also like to mention the Ola Mensch crew for the companionship, for all the lunches, concerts
and life stories. For the soundtrack of my life, thank you Ane Brun.
Last, but not least, I would like to thank my family for their love, support and encouragement.
We grew a lot together in the last years and I am extremely grateful for having you in my life.
Vă iubesc.
Curriculum Vitae
Iulia Georgiana Blaj
Birth date, place: 23/04/86, Suceava, Romania
Nationality: Romanian
Employment
11/2014 – present
Research Assistant at the Institute for Animal Breeding and Husbandry, Faculty of
Agricultural and Nutritional Sciences, Kiel University, Germany
Education
10/2012 – 11/2014
AgriGenomics Masters, Kiel University, Germany (M.Sc.)
Thesis: Inferring Demography from Whole-Genome Sequence Data in Herens and
Tyrolean Grey Cattle Breeds
10/2005 – 07/2011
Veterinary Medicine, University of Agronomic Sciences and Veterinary Medicine,
Bucharest, Romania (DVM)
Thesis: Delayed puberty in gilts
09/2001 – 07/2005
National College Petru Rares, Suceava, Romania (Baccalaureate)
Focus: Mathematics – Informatics, Romanian – English bilingual major