Potential of F2 pig crosses - Institut für Tierzucht und Tierhaltung


Publication series of the Institute of Animal Breeding and Husbandry, Christian-Albrechts-Universität zu Kiel, Issue 235, 2020

©2020 Self-published by the Institute of Animal Breeding and Husbandry, Christian-Albrechts-Universität zu Kiel

Olshausenstraße 40, 24098 Kiel

Editor: Prof. Dr. J. Krieter

ISSN: 0720-4272

Printed with the approval of the Dean of the Faculty of Agricultural and Nutritional Sciences of the Christian-Albrechts-Universität zu Kiel

The Institute of Animal Breeding and Husbandry, Faculty of Agricultural and

Nutritional Sciences of the Christian-Albrechts-Universität Kiel

Potential of F2 pig crosses: perspectives

from population and quantitative genomics

Dissertation

submitted for the Doctoral Degree

awarded by the Faculty of Agricultural and Nutritional Sciences

of the Christian-Albrechts-Universität Kiel

by

M.Sc. Iulia Georgiana Blaj

born in Suceava, Romania

Kiel, 2019

Dean: Prof. Dr. Dr. Christian Henning

1. Examiner: Prof. Dr. Georg Thaller

2. Examiner: Prof. Dr. Jens Tetens

Day of oral examination: 26th of June 2019

The dissertation was supported by a grant from the German Research

Foundation (Deutsche Forschungsgemeinschaft, DFG).

To my family

Table of contents

General introduction

Chapter 1

Genome-wide association studies and meta-analysis uncovers new candidate genes for growth and carcass traits in pigs

Chapter 2

Non-additive effects in four diverse F2 pig crosses for growth, carcass and fat related traits

Chapter 3

GWAS for meat and carcass traits using imputed sequence level genotypes in pooled F2-designs in pigs

Chapter 4

Recombination landscape in multiple F2 pig crosses between genetically diverse founder breeds

Chapter 5

A systematic survey of sequence data variation in the founder individuals of four F2 pig crosses

General discussion

General summary

Allgemeine Zusammenfassung (general summary in German)

Appendix


General introduction

F2 resource populations are the basis of many valuable findings in biology, medicine, and

agriculture. The use of this experimental population structure dates back to the pioneering work of Gregor Mendel in the mid-19th century, who derived the fundamental laws of inheritance from visual inspection of plants (Mendel, 1866). Throughout the beginning and middle of the 20th century, few resource populations were established, and with limited success in detecting genomic regions associated with a phenotype, i.e. quantitative trait loci (QTL). The major constraint was the lack of segregating markers (Weller, 2009). In the last four decades, however, several discoveries removed this constraint and offered unprecedented access to deoxyribonucleic acid (DNA) level information. Noteworthy milestones are the detection of the first DNA polymorphisms (namely the restriction fragment length polymorphism by Grodzicker et al. in 1974 and microsatellites by Mullis et al. in 1986) and, in 1995, the arrival of the most prevalent genetic marker: the single nucleotide polymorphism (SNP) (Brookes, 1999).

The pig (Sus scrofa) has been vitally important for human development for thousands of years,

since the domestication process started around 10,000 years ago (Larson et al., 2007). Human

interference played a great role in shaping the genome of this major livestock species by

influencing processes such as selection, demographic history and gene flow (Groenen, 2016).

Insights into the genotype-phenotype relationship can be obtained by means of mapping

populations (e.g. F2 intercrosses) that have been efficiently developed in the last decades and

used to map QTLs (Rothschild et al., 2007). The initial way of exploiting such population

structure was by establishing the relationship between the phenotypes of the F2 individuals and

their genetic markers (mostly microsatellites) via linkage mapping.

The first QTL study in pigs considered a Wild boar x Large White F2 experimental population

and aimed at identifying QTLs responsible for fat deposition and growth (Andersson et al.,

1994). A significant locus identified on chromosome 4 explained 20% of the phenotypic

variance for abdominal and back fat. Many other QTL studies followed and most of them were

the result of F2 experimental designs produced by crossing breeds from genetically divergent

lineages. The Pig QTL database is a comprehensive catalogue of curated results from QTL

mapping experiments of the last 25 years (Hu et al., 2018). The discovery of

genetic markers associated with loci influencing traits of interest initially led to the transfer of this

information into breeding programs by means of marker-assisted selection (Fernando and

Grossman, 1989). This methodology provided the basis for a groundbreaking development

named genomic selection in which high density markers covering the entire genome are used

in order to enable all QTLs to be in linkage disequilibrium with at least one marker (Meuwissen


et al., 2001). Mapping populations coupled with technological progress, specifically the

Illumina PorcineSNP60 (Ramos et al., 2009), were thus building blocks for the implementation

of genomic selection in pig breeding.

The present research investigates the potential of four existing F2 pig populations. Specifically,

it aims to demonstrate how the availability of tens of thousands of SNP markers and whole

genome sequence (WGS) data can raise and answer questions in the context of both population

and quantitative genomics. Firstly, a brief description of the F2 resource populations is given.

The largest experimental design considered was established by Borchers et al. (2000) and

originates from Piétrain (P) boars and Large White x Landrace (Lw x L) crossbred or Large

White (Lw) sows. The remaining three populations, developed by Geldermann et al. (1996), are

based on a European breed (Piétrain), an Asian breed (Meishan, M), and the European pig

ancestor, the Wild boar (W). Specifically, the populations are M x P, W x P and W x M. The

well-characterized F2 generation is the outcome of repeatedly crossing F1 boars with F1 sows

in order to obtain large full sib families. These populations and their phenotypic (growth, carcass

and fat related), genotypic (SNP array) and sequence data provide the basis of the statistical

and exploratory analyses carried out in each chapter of this thesis, as shown in Figure 1.

| Data | F0 | F1 | F2 |
|---|---|---|---|
| Phenotype | - | - | Growth, carcass and fat traits |
| SNP array | PorcineSNP60 | PorcineSNP60 | PorcineSNP60 |
| Sequence* | ~20x | ~1x | Imputed to sequence |

Figure 1. Scheme of the F2 crosses with phenotypic and genomic data available per

generation. Data usage in the thesis for chapters 1 to 5. *high coverage sequence data = 19x,

low coverage sequence data = 1x.

Genome-wide association studies (GWAS) are widely used in the genetic dissection of

complex or quantitative traits and, compared to linkage mapping, they rely on linkage disequilibrium (LD) and reflect


historical recombination events (Goddard and Hayes, 2009). The potential of a GWAS is highly

influenced by the sample size of the experiment. Therefore, an increase in sample size implies

a higher statistical power. Chapter 1 demonstrates how a collective investigation of data

(phenotype and SNP array genotypes) from three of the F2 resource populations, with the

Piétrain breed as founder, is advantageous offering an increased mapping resolution. While this

chapter considers the additive effects of the genetic markers, the following work in Chapter 2

investigates dominance and imprinting effects based on SNP array data in the four F2 crosses,

each considered separately. The availability of sequence data provides further precision when

conducting a GWAS. Founders (F0) sequenced at high coverage and F1 animals sequenced at low coverage made it possible to impute the F2 individuals to WGS level. Chapter 3 therefore conducts an association study employing this sequence level information on the pooled, imputed F2 individuals.

The resource populations comprise three generations and were established with the main

objective of discovering phenotype to genotype connections in the F2. Nevertheless, additional

layers of information, often disregarded, exist in the grandparental (F0) and parental generation

(F1). Chapter 4 covers investigations on recombination events, a process involved in

maintaining genetic variability and the evolution of genomes. The basis of characterizing the

recombination landscape are the maps built in the F1 generation. Here crossover calls can be

appropriately inferred in each parent-child pair (F1-F2) due to the high numbers of full sibs

allocated to an F1 parent. Finally, the research in Chapter 5 is motivated by the fact that almost

all the genomic variation that exists at the F2 level (used normally for GWAS) is being

propagated from the F0 generation. Thus, a reverse genetics approach explores the WGS

information in the founders. Employing various bioinformatics tools, a pooled and a breed-

based analysis of the F0 individuals was conducted and finally both population and breed

specific (e.g. related to the olfactory receptor gene family) conclusions were derived.

A general discussion concludes the thesis. The topics covered mostly go beyond the above-mentioned chapters but have implications for their outcomes. In particular, the following topics

are considered: the release of the latest pig reference genome, recombination rates and gene

density, pooling data from several F2 crosses for QTL mapping purposes and finally the shift

from SNP array to WGS level data that paves the way into the field of big data.


References

Andersson, L., Haley, C. S., Ellegren, H., Knott, S. A., Johansson, M., et al. (1994). Genetic

mapping of quantitative trait loci for growth and fatness in pigs. Science, 263(5154),

1771-1774.

Borchers, N., Reinsch, N., & Kalm, E. (2000). Familial cases of coat colour change in a

Piétrain cross. Journal of Animal Breeding and Genetics, 117(4), 285-287.

Brookes, A. J. (1999). The essence of SNPs. Gene, 234(2), 177-186.

Fernando, R. L., & Grossman, M. (1989). Marker assisted selection using best linear unbiased

prediction. Genetics Selection Evolution, 21(4), 467.

Geldermann, H., Müller, E., Beeckmann, P., Knorr, C., Yue, G., & Moser, G. (1996). Mapping

of quantitative trait loci by means of marker genes in F2 generations of Wild boar,

Piétrain and Meishan pigs. Journal of Animal Breeding and Genetics, 113(1-6), 381-

387.

Goddard, M. E., & Hayes, B. J. (2009). Mapping genes for complex traits in domestic animals

and their use in breeding programs. Nature Reviews Genetics, 10, 381-391.

Groenen, M. A. (2016). A decade of pig genome sequencing: a window on pig domestication

and evolution. Genetics Selection Evolution, 48(1), 23.

Grodzicker, T., Williams, J., Sharp, P., & Sambrook, J. (1974). Physical mapping of

temperature-sensitive mutations of adenoviruses. Cold Spring Harbor Symposia on

Quantitative Biology, 39, 439-446.

Hu, Z. L., Park, C. A., & Reecy, J. M. (2018). Building a livestock genetic and genomic

information knowledgebase through integrative developments of Animal QTLdb and

CorrDB. Nucleic Acids Research, 47(D1), D701-D710.

Larson, G., Albarella, U., Dobney, K., Rowley-Conwy, P., Schibler, J., et al. (2007). Ancient

DNA, pig domestication, and the spread of the Neolithic into Europe. Proceedings of

the National Academy of Sciences, 104(39), 15276-15281.

Mendel, G. (1866). Versuche über Pflanzenhybriden. Verhandlungen des naturforschenden

Vereines in Brünn, Bd. IV für das Jahr 1865, Abhandlungen, 3-47.

Meuwissen, T. H., Hayes, B. J., & Goddard, M. E. (2001). Prediction of total genetic value

using genome-wide dense marker maps. Genetics, 157(4), 1819-1829.

Mullis, K., Faloona, F., Scharf, S., Saiki, R. K., Horn, G. T., et al. (1986). Specific enzymatic

amplification of DNA in vitro: the polymerase chain reaction. Cold Spring Harbor

Symposia on Quantitative Biology, 51, 263-273.


Ramos, A. M., Crooijmans, R. P. M. A., Affara, N. A., Amaral, A. J., Archibald, A. L., et al.

(2009). Design of a high density SNP genotyping assay in the pig using SNPs identified

and characterized by next generation sequencing technology. PLoS ONE, 4(8), e6524.

Rothschild, M. F., Hu, Z. L., & Jiang, Z. (2007). Advances in QTL mapping in pigs.

International Journal of Biological Sciences, 3(3), 192.

Weller, J. I. (2009). Quantitative trait loci analysis in animals. CABI.


Chapter 1

Genome-wide association studies and meta-analysis uncovers

new candidate genes for growth and carcass traits in pigs

Iulia Blaj1, Jens Tetens2, Siegfried Preuß3, Jörn Bennewitz3 and Georg Thaller1

1Institute of Animal Breeding and Husbandry, Kiel University, Kiel, Germany

2Functional Breeding Group, Department of Animal Sciences, Göttingen University,

Göttingen, Germany

3Institute of Animal Husbandry and Breeding, University of Hohenheim, Stuttgart, Germany

Published in PLoS One


Abstract

Genome-wide association studies (GWAS) have been widely used in the genetic dissection of

complex traits. As more genomic data is being generated within different commercial or

resource pig populations, the challenge which arises is how to collectively investigate the data

with the purpose to increase sample size and implicitly the statistical power. This study

performs an individual population GWAS, a joint population GWAS and a meta-analysis in

three pig F2 populations. D1 is derived from European type breeds (Piétrain, Large White and

Landrace), D2 is obtained from an Asian breed (Meishan) and Piétrain, and D3 stems from a

European Wild Boar and Piétrain, which is the common founder breed. The traits investigated

are average daily gain, backfat thickness, meat to fat ratio and carcass length. The joint and the

meta-analysis did not identify additional genomic clusters besides the ones discovered via the

individual population GWAS. However, the benefit was an increased mapping resolution, which

pinpointed narrower clusters harboring causative variants. The joint analysis identified a

higher number of clusters as compared to the meta-analysis; nevertheless, the significance

levels and the number of variants in the meta-analysis were generally higher. Both types of

analysis had similar outputs suggesting that the two strategies can complement each other and

that the meta-analysis approach can be a valuable tool whenever access to raw datasets is

limited. Overall, a total of 20 genomic clusters were identified on chromosomes 2, 7 and 17,

many confirming previously identified quantitative trait loci. Several new candidate genes are

being proposed and, among them, a strong candidate gene to be taken into account for

subsequent analysis is BMP2 (bone morphogenetic protein 2).

Background

In pig breeding, the search for quantitative trait loci (QTLs) and the underlying causative

mutations has been in progress for more than two decades. The onset was the landmark

publication on genetic mapping of QTL for growth and fatness by [1]. To date, according to

the latest release of the AnimalQTLdb (Release 35, 29th April, 2018), the Pig QTL database

stores 27,465 pig QTLs curated from 620 publications and representing a wide range of

economically important phenotypes (PigQTLdb; https://www.animalgenome.org/cgi-bin/QTLdb/SS/summary).

In the beginning of pig QTL experiments, the mapping was carried out using crosses between

outbred lines and the statistical test employed was linkage analysis. This approach has proven

to be efficient for pinpointing numerous QTLs in pigs [2, 3]. However, those studies had

reduced mapping resolution and statistical power due to several factors, such as the limited number of individuals and genetic markers and the use of linkage analysis, which considers only recent recombination events.

The release of the Illumina PorcineSNP60 Beadchip [4] represents a key advance in

overcoming the above-mentioned impediments. The SNP array facilitates the implementation

of genome-wide association studies (GWAS) in which the historical recombination events are

taken into account when reflecting the associations between markers and phenotypes. To

surpass the limitation given by small sample size in some mapping experiments, analyzing

several F2 resource populations jointly has been proven to be a suitable approach [5]. Via

stochastic simulations, [5] demonstrated that pooling data from multiple F2 populations can

increase power and mapping precision compared to single association analysis under the

scenario in which the crosses share at least one common founder breed. Thus, one approach to

combine data from several populations is a joint population GWAS (JA) which considers a

single dataset comprising the merged information from the individual population level. This

approach requires access to the complete original dataset (i.e. genotypes and phenotypes).

Another strategy for combining information from multiple genetic mapping studies is the meta-

analysis (MA) of the GWAS summary statistics. This method can increase the detection power

and reduce false-positive findings [6] while also efficiently accounting for population

substructure and for study specific covariates [7]. The MA is widely used in human genetics,

where access to the original datasets is usually limited due to privacy protection policies. In

recent years, meta-analysis has been employed for pig association studies in an effort to

maximize the use of available genomic information from commercial or experimental pig

populations [8-10].

The current study considers three pig F2 resource populations which share a common founder

breed [11, 12]. The connecting breed, Piétrain, is an extensively used sire line in pig breeding.

Having a constant demand for improving traits related to growth and carcass composition,

transferring the results of genome-wide association studies into practice is of utmost

importance. For this purpose, the classical G-BLUP (genomic best linear unbiased prediction)

statistical framework, used for predicting genomic estimated breeding values, was extended to

incorporate prior information on QTLs and related biological knowledge via the genomic

feature BLUP (GF-BLUP) model [13]. The proposed model is an extension of the linear mixed

model used in standard G-BLUP which includes additional genetic effects, previously

unraveled by association studies. According to [13] the GF-BLUP can contribute to the

prediction accuracy improvement in genomic selection schemes. Considering the above

mentioned reasoning, the aim of the study was twofold. Firstly, to conduct a joint design


analysis (JA) and a meta-analysis (MA) of the three pig F2 resource populations and compare

the results yielded by these different approaches. Secondly, to identify candidate genes in

genomic regions associated with average daily gain, backfat thickness, meat to fat ratio and

carcass length.

Materials and methods

Description of resource populations

The three-generation experimental populations comprise a total of 2,380 animals. The designs

were established three decades ago by [14] and [15] and for this study, blood samples were

available from which DNA was extracted for genotyping purposes. The populations were characterized in detail by [14] and [15] and are only described briefly here. The first resource population

(D1) considered for this study has 1,785 individuals. It was obtained from five purebred Piétrain

(P) boars (all of them homozygous stress resistant) and one Large White (LW) and six crossbred

sows Landrace (L) x Large White. Large F2 families were generated by repeatedly crossing

seven F1 boars to full sib F1 sows. The second population (D2) is composed of 304 pigs

stemming from mating one Meishan (M) boar with eight Piétrain sows. The third population

(D3) with 291 individuals had as founders a European Wild Boar (WB) crossed with nine

Piétrain sows. Three of the Piétrain sows were common among the latter two families. For both

the D2 and the D3, the F2 individuals were the result of two or three F1 boars mated with F1

sows. Generally, each sow had two litters from different boars. The Piétrain founder females

were homozygous stress susceptible and the Meishan and wild boar males were homozygous

stress resistant.

Phenotypic trait data

For the current study the following growth and carcass composition traits were considered:

average daily gain (ADG), back fat thickness (BFT), meat to fat ratio (MFR) and carcass length

(CRCL). The ADG [g] is the daily weight gain in the fattening period, the BFT [mm] is

calculated as the average of three measurements: shoulder fat depth, back fat depth and loin fat

depth, the MFR [ratio] is the fat area in relation to the meat area at 13th/14th rib and CRCL [cm]

is measured from the first cervical vertebra to the pubic symphysis. The methods of

measurement and the calculations employed for D1, D2 and D3 were in conformity with the

performance testing directive of the Central Association for German Pig Production [16, 17].

Table 1 contains a brief description of the contributing F2 designs and the summary of the

phenotypic data indicating mean and standard deviation (SD) for each trait. The animals were


slaughtered at 211.01 ± 22.3 days, 211.73 ± 6.92 days and 210.63 ± 3.22 days for D1, D2 and D3,

respectively.

Table 1. Description of the experimental F2 populations and investigated traits.

| Design | Cross | Males F0 | Females F0 |
|---|---|---|---|
| D1 | P x (L x LW)/LW (a) | 5 | 7 |
| D2 | M x P (b) | 1 | 8 |
| D3 | WB x P (c) | 1 | 9 |

| Trait (e) | Design | Mean (SD) | Range | N (d) |
|---|---|---|---|---|
| ADG [g] | D1 | 675.9 (92.74) | 311 - 1039 | 1769 |
| ADG [g] | D2 | 590.1 (130.03) | 174.0 - 951.0 | 304 |
| ADG [g] | D3 | 527.4 (109.17) | 125.0 - 790.0 | 291 |
| BFT [mm] | D1 | 27.49 (3.84) | 16 - 42.3 | 1766 |
| BFT [mm] | D2 | 27.89 (6.73) | 8.70 - 46.00 | 304 |
| BFT [mm] | D3 | 22.8 (4.97) | 10.3 - 40.0 | 291 |
| MFR [ratio] | D1 | 0.38 (0.11) | 0.14 - 0.85 | 1765 |
| MFR [ratio] | D2 | 0.7248 (0.22) | 0.28 - 1.39 | 304 |
| MFR [ratio] | D3 | 0.516 (0.08) | 0.19 - 1.07 | 289 |
| CRCL [cm] | D1 | 100 (2.93) | 91 - 111 | 1765 |
| CRCL [cm] | D2 | 91.33 (6.12) | 63.50 - 106.00 | 304 |
| CRCL [cm] | D3 | 79.85 (5.20) | 62.50 - 94.00 | 291 |

(a) Piétrain x (Landrace x Large White)/Large White; (b) Meishan x Piétrain; (c) Wild Boar x Piétrain; (d) Number of individuals with phenotypic data; (e) Average daily gain, backfat thickness, meat to fat ratio, carcass length.

Genotyping and quality control

The F2 individuals were genotyped with Illumina PorcineSNP60 BeadChip (61,565 SNPs).

SNP chromosomal positions were based on the current pig genome assembly (Sus scrofa build

11.1 provided by Swine Genome Sequencing Consortium on NCBI). Genotypes were filtered

with respect to the following quality control (QC) criteria: i) removing SNPs with a minor allele

frequency less than 5% and ii) excluding individuals and SNPs with call rates lower than 90%.

The process of quality control was carried out using Plink [18]. Only the autosomes

were considered further. The final set consisted of 44,457 SNPs in the D1 design, 40,738 SNPs in the

D2 design, 37,145 SNPs in the D3 design and 31,299 SNPs in the joint design (D1D2D3). The latter was

obtained by merging common SNPs from D1, D2 and D3 after the QC step. In addition, the

RYR1:g.1843C>T [19] mutation status was available for the individuals in D2 and D3.
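The two quality control criteria can be illustrated with a minimal NumPy sketch. It is not Plink's implementation: the function name `qc_filter`, the 0/1/2 genotype coding with NaN for missing calls, and the order in which the filters are applied (individual call rate first, then SNP call rate and minor allele frequency) are assumptions made for illustration only.

```python
import numpy as np

def qc_filter(genotypes, maf_min=0.05, call_rate_min=0.90):
    """Filter a genotype matrix (individuals x SNPs, coded 0/1/2, NaN = missing).

    Returns two boolean masks: (keep_individuals, keep_snps).
    Hypothetical helper; thresholds follow the criteria named in the text.
    """
    g = np.asarray(genotypes, dtype=float)
    called = ~np.isnan(g)

    # Call rate per individual across all SNPs.
    keep_ind = called.mean(axis=1) >= call_rate_min

    # Call rate per SNP among the retained individuals.
    g_kept = g[keep_ind]
    called_kept = called[keep_ind]
    snp_call_rate = called_kept.mean(axis=0)

    # Frequency of the counted allele, ignoring missing calls.
    n_called = called_kept.sum(axis=0)
    allele_sum = np.nansum(g_kept, axis=0)
    with np.errstate(invalid="ignore", divide="ignore"):
        freq = allele_sum / (2.0 * n_called)
    maf = np.minimum(freq, 1.0 - freq)

    keep_snp = (snp_call_rate >= call_rate_min) & (maf >= maf_min)
    return keep_ind, keep_snp
```

Applying both masks, `genotypes[keep_ind][:, keep_snp]`, gives the filtered matrix; intersecting the retained SNP sets of several designs would then yield a common marker set such as the one used for the joint design.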

Persistence of linkage disequilibrium phase

The extent of linkage disequilibrium (LD) in the D1, D2, D3 and D1D2D3 populations was

characterized in detail by [20]. Of further interest for the joint analysis was how consistent the LD phase is across the designs. The statistical parameter chosen for the LD measurement was $r^2$ [21], the squared correlation coefficient between the alleles at two SNPs. A total of 31,299 SNPs common to the individual populations and the joint data set were used to compute the $r^2$ values. Using Plink, $r^2$ was obtained for all SNP pairs located less than 5 Mb apart. The average $r^2$ values of the SNP pairs were calculated in classes of 100 Kb inter-marker distance, from [0-100] Kb up to [4900-5000] Kb, and finally used to compute the correlation of phase between two populations according to the formula [22]:

$$R_{D_k,D_{k'}} = \frac{\sum_{(i,j)\in p}\left(r_{ij(D_k)} - \bar{r}_{(D_k)}\right)\left(r_{ij(D_{k'})} - \bar{r}_{(D_{k'})}\right)}{S_{(D_k)}\,S_{(D_{k'})}}$$

where $R_{D_k,D_{k'}}$ is the correlation of phase between $r_{ij(D_k)}$ in population $D_k$ and $r_{ij(D_{k'})}$ in population $D_{k'}$, $S_{(D_k)}$ and $S_{(D_{k'})}$ are the standard deviations of $r_{ij(D_k)}$ and $r_{ij(D_{k'})}$, respectively, and $\bar{r}_{(D_k)}$ and $\bar{r}_{(D_{k'})}$ denote the average $r_{ij}$ across all SNPs $i$ and $j$ within interval $p$ for $D_k$ and $D_{k'}$, respectively. The $R_{D_k,D_{k'}}$ estimate was evaluated for the six population pairs: D1-D2, D1-D3, D1-D1D2D3, D2-D3, D2-D1D2D3 and D3-D1D2D3.
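As a rough illustration of the correlation of phase, the sketch below computes, for each inter-marker distance class, the correlation between the pairwise LD values of the same SNP pairs in two populations. The function name and arguments are hypothetical. The denominator is written so that the result is a Pearson correlation, which assumes that the pair-count factor is absorbed into the standard deviations of the formula. Which LD statistic is passed in (for example a signed $r$) is left to the caller.

```python
import numpy as np

def phase_correlation_by_class(dist_bp, r_pop_a, r_pop_b,
                               bin_width_bp=100_000, max_dist_bp=5_000_000):
    """Correlate pairwise LD values of two populations within distance classes.

    dist_bp : distance in bp between the two SNPs of each pair
    r_pop_a : LD value of each pair in the first population
    r_pop_b : LD value of the same pairs in the second population
    Returns a dict {(lower_bp, upper_bp): correlation}.
    """
    dist_bp = np.asarray(dist_bp, dtype=float)
    r_a = np.asarray(r_pop_a, dtype=float)
    r_b = np.asarray(r_pop_b, dtype=float)

    result = {}
    for lower in range(0, max_dist_bp, bin_width_bp):
        upper = lower + bin_width_bp
        in_class = (dist_bp > lower) & (dist_bp <= upper)
        if in_class.sum() < 3:
            continue  # too few pairs for a meaningful correlation
        # Centre the LD values of this class in each population.
        a = r_a[in_class] - r_a[in_class].mean()
        b = r_b[in_class] - r_b[in_class].mean()
        denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
        if denom == 0:
            continue  # no variation in at least one population
        result[(lower, upper)] = float((a * b).sum() / denom)
    return result
```

Values close to 1 in a class indicate that SNP pairs at that distance show the same LD pattern in both populations; classes with fewer than three pairs are skipped in this sketch, which is an arbitrary choice.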

Estimating genetic variance and genetic correlations

A general linear model was used to pre-adjust the animal phenotypes with the following fixed

effects: sex, stable and slaughter month class (i.e. 15 classes for D1, 6 classes for D2 and 8

classes for D3). The phenotype pre-adjustment analysis was carried out with R [23]. For D2

and D3 the RYR1 status was also incorporated as fixed effect for ADG, BFT and MFR traits.

Weight at slaughter was included as a covariate for all traits except for ADG. The residual

values of the phenotypes were subsequently used. The Genome-wide Complex Trait Analysis

tool (GCTA) [24] was utilized for estimating the variance components and the genetic

correlations. The genomic relationship matrix (GRM) was calculated between all pairs of

individuals using all the autosomal SNPs. Applying a univariate analysis, the variance of the

traits was partitioned using restricted maximum likelihood (REML) into an additive genetic

and a residual component. Following the classical definition of narrow-sense heritability, the

SNP-based heritability was obtained via $h^2_{SNP} = \sigma^2_{SNP} / \sigma^2_P$, representing the proportion of

phenotypic variance ($\sigma^2_P$) explained by the additive effects of the common SNPs on the chip

array and/or by the unknown causal variants correlated with the SNPs. A bivariate GREML

analysis led to the assessment of the genetic correlations between traits.
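The two ingredients of this step that are easy to show in code are the GRM and the heritability ratio. The sketch below assumes the per-SNP standardisation commonly described for GCTA, $A_{jk} = \frac{1}{N}\sum_i \frac{(x_{ij} - 2p_i)(x_{ik} - 2p_i)}{2p_i(1 - p_i)}$, complete genotypes without missing calls, and hypothetical function names. The REML estimation itself is not reproduced here.

```python
import numpy as np

def genomic_relationship_matrix(genotypes):
    """GRM from a complete genotype matrix (individuals x SNPs, coded 0/1/2)."""
    x = np.asarray(genotypes, dtype=float)
    p = x.mean(axis=0) / 2.0                      # allele frequency per SNP
    polymorphic = (p > 0.0) & (p < 1.0)           # monomorphic SNPs carry no information
    x, p = x[:, polymorphic], p[polymorphic]
    # Standardise each SNP by its expected mean and standard deviation.
    z = (x - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p))
    return z @ z.T / x.shape[1]

def snp_heritability(var_snp, var_residual):
    """h2_SNP when the phenotypic variance is the sum of both components."""
    return var_snp / (var_snp + var_residual)
```

For example, variance components of 0.45 (additive) and 0.55 (residual) would give a SNP-based heritability of 0.45; these numbers are illustrative and not taken from the study.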

GWAS and meta-analysis

Single marker association tests were performed using the GCTA tool for the individual

populations (i.e. D1, D2 and D3). A mixed linear model analysis including the candidate SNP (MLMi) was set up as follows: $y^* = xb + u + e$, where $y^*$ is the phenotype corrected for systematic environmental effects and genetic effects (i.e. the RYR1 status for D2 and D3), $x$ is the vector of marker genotypes, $b$ is the additive effect size (fixed effect) of the candidate SNP to be tested for association, $u$ is the random polygenic effect, whose covariance structure is given by the genomic relationship matrix (GRM), and $e$ is the residual effect.
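The pre-corrected phenotype used as the response in this model can be sketched as the residual of an ordinary least squares fit on the fixed effects. The study carried out this step in R; the NumPy version below is only an analogue, and the function name, the dummy coding with the first level as reference and the argument layout are assumptions for illustration.

```python
import numpy as np

def pre_adjust(phenotype, factors, covariates=None):
    """Return residuals of `phenotype` after fitting fixed effects by least squares.

    factors    : list of 1-D label arrays (e.g. sex, stable, slaughter month class)
    covariates : optional list of 1-D numeric arrays (e.g. weight at slaughter)
    """
    y = np.asarray(phenotype, dtype=float)
    columns = [np.ones_like(y)]                       # intercept
    for labels in factors:
        labels = np.asarray(labels)
        for level in np.unique(labels)[1:]:           # first level is the reference
            columns.append((labels == level).astype(float))
    for cov in covariates or []:
        columns.append(np.asarray(cov, dtype=float))
    design = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ coef
```

The returned residuals play the role of the corrected phenotype; the mixed model then only has to estimate the marker effect and the polygenic and residual variances.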

For the joint design (D1D2D3), the same MLMi was used. The 𝑦∗, in this case, consisted of

pooled pre-corrected phenotypes from D1, D2 and D3. From the unique genotype file,

constructed based on the merged common SNPs among the three populations, the GRM was

assembled and then the marker genotypes tested for association. Given the fact that D1D2D3

contains strong familial relatedness (due to full-sib families) and weak population stratification,

observed in a multidimensional scaling analysis by [20], the mixed linear model analysis should

be efficient in capturing sample structure via the GRM as the random effect included in the

model [25]. Nevertheless, a fixed effect with three classes representing the three individual

designs was added to the model to account for the diverse genetic backgrounds.

A meta-analysis aims to statistically combine the information from multiple independent

studies and therefore to increase the power and reduce the false-positive results [6]. From the

several approaches to conduct a meta-analysis, the fixed effects meta-analysis is the most

powerful method and within this group, the inverse variance based strategy is predominant [6].

This strategy was employed for synthesizing the association studies summary statistics for the

common variants of D1, D2 and D3 populations. Specifically, using the METAL software [7], each study was weighted according to the inverse of its squared standard error, resulting in newly derived effect size and standard error estimates, which were further used for calculating an overall Z score and finally the overall p values.
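The inverse variance weighting described above can be written down for a single SNP in a few lines. This is a generic fixed effects calculation and not METAL's code; the function name is hypothetical and the two-sided p value assumes a standard normal distribution of the Z score.

```python
import math

def inverse_variance_meta(betas, std_errors):
    """Fixed effects meta-analysis of one SNP across studies.

    betas      : effect size estimate per study
    std_errors : standard error per study
    Returns (pooled_beta, pooled_se, z_score, p_value).
    """
    weights = [1.0 / se ** 2 for se in std_errors]    # inverse of the squared standard error
    total_weight = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))            # two-sided normal p value
    return beta, se, z, p
```

Studies with small standard errors dominate the pooled estimate, so a large design contributes more than a small one. With illustrative inputs `betas=[0.3, 0.1]` and `std_errors=[0.1, 0.2]`, the weights are 100 and 25 and the pooled effect is 0.26.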

Manhattan plots for D1, D2, D3 and D1D2D3 GWAS as well as for the MA results were created

via the qqman R package [26]. Using Bonferroni correction, the genome-wide significance line was set to $p_{genome-wide} \le 0.05$. Because the Bonferroni correction is stringent, an additional nominal significance level was used, with the threshold set to $p \le 5 \times 10^{-5}$. The R package qvalue [27] facilitated the calculation of the false discovery rate (FDR) $q$ value for each association test. The FDR $q$ value of the significant SNP with the largest $p$ value provided an assessment of the proportion of false positives among the significant SNPs.
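The two significance levels can be turned into per-SNP cut-offs as sketched below. The text does not state the number of tests used in the Bonferroni correction, so taking it to be the number of SNPs tested in the respective analysis is an assumption, as are the function names.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test p value cut-off for a genome-wide error rate of `alpha`."""
    return alpha / n_tests

def classify(p_value, n_tests, alpha=0.05, nominal=5e-5):
    """Label a single association test by the two thresholds used in the text."""
    if p_value <= bonferroni_threshold(alpha, n_tests):
        return "genome-wide"
    if p_value <= nominal:
        return "nominal"
    return "not significant"

# Assumption: one test per SNP in the joint design (31,299 SNPs).
joint_cutoff = bonferroni_threshold(0.05, 31_299)
```

Under this assumption the per-SNP cut-off for the joint design is about 1.6 x 10^-6, roughly thirty times stricter than the nominal level.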

Clusters incorporating strong evidence for trait-associated chromosomal regions were defined

based on the LD structure and the significant SNPs similar to [20]. A cluster enclosed a

minimum of two genome wide significant SNPs with a maximum distance of 2 Mb between

them. From the center point of the initially defined cluster, the upper and the lower boundaries

were assigned to the last nominally significant variant situated at a maximum of 1 Mb in both


directions. The jvenn tool [28] was used to draw Venn diagrams for all the SNPs surpassing the

nominal significance threshold to show all possible relations for each trait and between the five

different sets (D1, D2, D3, D1D2D3 and MA).
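One possible reading of the cluster definition given above is sketched here for a single chromosome. Several points are assumptions rather than the procedure actually used: genome-wide significant SNPs are grouped greedily so that a core never spans more than 2 Mb, the centre is the midpoint of the core, and the LD structure that also informed the original definition is not modelled. All names are hypothetical.

```python
def define_clusters(gw_positions, nominal_positions,
                    max_span_bp=2_000_000, flank_bp=1_000_000):
    """Group significant SNPs on one chromosome into clusters.

    gw_positions      : bp positions of genome-wide significant SNPs
    nominal_positions : bp positions of SNPs passing the nominal threshold
    Returns a list of (lower_bp, upper_bp) cluster boundaries.
    """
    gw = sorted(gw_positions)
    nominal = sorted(nominal_positions)
    clusters = []
    i = 0
    while i < len(gw):
        # Greedily collect genome-wide significant SNPs spanning at most max_span_bp.
        j = i
        while j + 1 < len(gw) and gw[j + 1] - gw[i] <= max_span_bp:
            j += 1
        if j > i:  # a cluster needs at least two genome-wide significant SNPs
            centre = (gw[i] + gw[j]) / 2
            # Boundaries: outermost significant SNPs within flank_bp of the centre.
            candidates = nominal + gw[i:j + 1]
            nearby = [pos for pos in candidates if abs(pos - centre) <= flank_bp]
            clusters.append((min(nearby), max(nearby)))
        i = j + 1
    return clusters
```

Isolated genome-wide significant SNPs do not form a cluster in this sketch, and nominally significant SNPs further than 1 Mb from the centre are ignored when setting the boundaries.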

Exploratory analysis of clusters

The clusters identified were explored using BioMart tool [29], the Ensembl Genes 91 database

[30] and Gene Ontology [31, 32]. The interrogations were carried out using the latest genome

reference, the Sscrofa 11.1 assembly (GCA_000003025.6) and the latest gene annotation

(Genebuild released in July 2017).

Results

Genotypic data and individuals qualified for the analysis

All genotyped animals passed the quality control procedure, i.e. 1,785 in D1, 304 in D2 and

291 in D3. The final autosomal number of SNPs was 44,457, 40,738 and 37,145 for D1, D2

and D3, respectively. Based on the reference genome assembly (Sscrofa 11.1) the average

physical spacing between adjacent markers was 50,872 bp in D1, 55,495 bp in D2 and 60,821

bp for D3. For both the joint design and the meta-analysis, 31,299 SNPs were used with an

average physical distance among adjacent markers of 72,161 bp.

LD and persistence of phase

The average genome-wide $r^2$ value for adjacent markers was 0.40, 0.44, 0.45 and 0.38 for D1, D2, D3 and D1D2D3, respectively. The correlation of phase $R_{D1,D1D2D3}$ exhibited high concordance at all marker intervals, starting with 0.95 for the interval [0:100] Kb and maintaining values above 0.86 up to the maximum interval length considered for the analysis (Figure 1). The remaining five design pairs had visibly lower correlation levels, ranging from 0.65 for the D3-D1D2D3 pair to 0.31 for D2-D3 when considering the first interval [0:100] Kb. The D2 design

was the least correlated with the joint design. Among the individual design pairs (i.e. D1-D2,

D1-D3 and D2-D3), the highest persistence of phase was observed for D1-D3 and the lowest

for D2-D3 pair.

Heritabilities and genetic correlations

The SNP-based heritabilities calculated were moderate to high, ranging from 0.31 for CRCL in

D3 to 0.74 for CRCL in D2 (Table 2). The $h^2_{SNP}$ estimates for ADG were highest in D3 and for BFT and CRCL in D2. The values for MFR were similar in all three designs. The traits BFT and MFR were strongly positively correlated, with a coefficient of $r_G = 0.77$ in D1, $r_G = 0.53$ in D2 and $r_G = 0.83$ in D3, while for BFT and CRCL a mild to strong negative correlation was observed ($r_G = -0.29$ in D1, $r_G = -0.78$ in D2 and $r_G = -0.46$ in D3).

Figure 1. Correlation of phase between D1-D2, D1-D3, D1-D1D2D3, D2-D3, D2-D1D2D3

and D3-D1D2D3 populations for SNP pairs at varying distances.

Table 2. SNP-based heritabilities ($h^2_{SNP}$) on the diagonal and genetic correlations on the lower triangle (with standard errors).

Design  Trait  ADG            BFT            MFR            CRCL
D1      ADG     0.36 (0.04)
        BFT    -0.10 (0.09)    0.45 (0.04)
        MFR    -0.12 (0.09)    0.77 (0.04)    0.52 (0.03)
        CRCL   -0.07 (0.08)   -0.29 (0.07)   -0.04 (0.07)   0.60 (0.03)
D2      ADG     0.33 (0.10)
        BFT    -0.35 (0.17)    0.73 (0.07)
        MFR    -0.41 (0.19)    0.53 (0.12)    0.47 (0.09)
        CRCL    0.42 (0.17)   -0.78 (0.06)   -0.20 (0.15)   0.74 (0.06)
D3      ADG     0.57 (0.09)
        BFT    -0.36 (0.17)    0.42 (0.09)
        MFR    -0.31 (0.16)    0.83 (0.08)    0.47 (0.09)
        CRCL    0.31 (0.20)   -0.46 (0.21)   -0.26 (0.21)   0.31 (0.10)


Variants identified in D1, D2, D3, D1D2D3 GWAS and MA

A GWAS was conducted on the D1, D2, D3 and D1D2D3 designs for four phenotypic traits (i.e. ADG, BFT, MFR and CRCL). The summary statistics from the D1, D2 and D3 association analyses were further used for the MA step. The global view of 𝑝 values for all SNP markers of each trait was visualized via Manhattan plots (Figures 2 and 3). Cumulated over the five analyses (i.e. D1 GWAS, D2 GWAS, D3 GWAS, D1D2D3 GWAS and MA), a total of 36 SNPs surpassed the nominal significance level for ADG, of which 14 SNPs were above the genome-wide significance level. These variants were situated on Sus scrofa chromosomes (SSC) 2, 7, 8, 13 and 16. For BFT, 299 variants (148 of them genome-wide significant) were identified on SSC 1, 2, 4, 5, 7, 10 and 13, and for MFR, 223 (137) SNPs were located on SSC 1, 2 and 12. Lastly, for CRCL, 368 (193) SNPs on SSC 1, 7, 13, 16 and 17 were discovered via the GWAS and the MA. The most significant variants in each of the five analyses are presented in Table 3, together with their -log10 𝑝 value and the associated q value. The genomic regions with genome-wide significant and nominally significant SNPs (S1 Table) were assigned to a total of 20 clusters (Table 4).

The concordance of the nominally significant variants identified was assessed via Venn diagrams (Figure 4). The meta-analysis (MA) revealed more unique SNPs associated with the traits than the joint analysis (JA). Specifically, five SNPs for ADG and CRCL, two SNPs for BFT and seven SNPs for MFR were identified exclusively by the meta-analysis. One variant for BFT, four variants for MFR and two variants for CRCL were found only by the joint analysis and the meta-analysis. While many variants from the individual population association studies overlapped with variants from the JA and MA, the individual populations showed a higher number of unshared variants for all traits.
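The counts behind such a Venn diagram reduce to set operations on the lists of nominally significant SNPs per analysis. The following minimal sketch shows the two quantities discussed above, variants exclusive to one analysis and variants shared only by a given group of analyses. The function names and the dictionary layout are hypothetical and serve only as an illustration.

```python
def exclusive_to(target, significant):
    """SNPs significant in `target` and in no other analysis.

    significant : dict mapping analysis name -> iterable of SNP identifiers
    """
    others = set().union(
        *(snps for name, snps in significant.items() if name != target)
    )
    return set(significant[target]) - others


def shared_only_by(names, significant):
    """SNPs significant in every analysis in `names` and in none of the others."""
    inside = set.intersection(*(set(significant[n]) for n in names))
    outside = set().union(
        *(snps for name, snps in significant.items() if name not in names)
    )
    return inside - outside
```

For example, `exclusive_to("MA", significant)` would return the variants found only by the meta-analysis, and `shared_only_by(["JA", "MA"], significant)` those found by the joint and meta-analysis but by none of the individual designs.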

The agreement of the 𝑝 values was estimated with Pearson's product-moment correlation, using the set of markers common to the output of the association studies conducted for D1, D2, D3 and D1D2D3 and the meta-analysis results (S2 Table). The correlation between the JA and MA 𝑝 values was 0.83 for ADG, 0.81 for BFT and 0.74 for both MFR and CRCL (all significant with 𝑝 < 0.05). The D1 𝑝 values were more highly correlated with the JA and the MA 𝑝 values than those of the D2 and D3 designs.
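The comparison described above can be sketched as follows: the 𝑝 values of two analyses are restricted to their common markers and then correlated. This is only a minimal illustration. The function names and the representation of the results as dictionaries keyed by SNP identifier are assumptions, and the significance test of the correlation itself is omitted.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation (needs non-zero variance in x and y)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def pvalue_correlation(p_a, p_b):
    """Correlate the p values of two analyses over their common markers.

    p_a, p_b : dicts mapping SNP identifier -> p value
    Returns (correlation, number of common markers).
    """
    common = sorted(set(p_a) & set(p_b))  # markers present in both outputs
    x = [p_a[snp] for snp in common]
    y = [p_b[snp] for snp in common]
    return pearson(x, y), len(common)
```

Markers that are absent from either analysis, for instance because they were monomorphic and removed in one design, do not contribute to the correlation.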


Figure 2. Manhattan plots of the −log10 𝑝 values for association of SNPs with ADG and BFT in D1, D2, D3 and D1D2D3 GWAS and MA. The top horizontal line indicates the genome-wide significance level $p_{\text{genome-wide}} \le 0.05$, and the bottom line indicates the nominal level of significance $p_{\text{nominal}} \le 5 \times 10^{-5}$.


Figure 3. Manhattan plots of the −log10 𝑝 values for association of SNPs with MFR and CRCL in D1, D2, D3 and D1D2D3 GWAS and MA. The top horizontal line indicates the genome-wide significance level $p_{\text{genome-wide}} \le 0.05$, and the bottom line indicates the nominal level of significance $p_{\text{nominal}} \le 5 \times 10^{-5}$.


Table 3. List of the most significant SNP from each of the five analyses for all four traits.

Trait  GWAS or MA  Top SNP      SSC^a  Location (bp)  -log10(p value)  q value   Total SNPs^b  Other SSC^c
ADG    D1          ALGA0123907  2      2,556,939       7.26            0.00206   16 (6)        7, 13, 16
       D3          H3GA0024295  8      11,805,802      4.47            0.97386   1             -
       D1D2D3      ALGA0123907  2      2,556,939       6.85            0.00224   8 (3)         7, 8
       MA          ALGA0123907  2      2,556,939       8.16            0.00020   11 (5)        7
BFT    D1          ASGA0085597  2      1,083,343      10.71            3.29e-07  48 (22)       1, 10
       D2          INRA0024524  7      26,069,284     13.56            1.12e-09  179 (93)      5, 13
       D3          INRA0015172  4      79,915,989      5.47            0.09920   10            1, 2
       D1D2D3      ASGA0085597  2      1,083,343      12.14            1.68e-08  34 (16)       1, 7
       MA          ASGA0008415  2      3,895,569      14.01            2.74e-10  28 (16)       1, 7
MFR    D1          MARC0044928  2      2,494,326      21.28            1.93e-17  103 (76)      1
       D2          ASGA0089068  2      3,237,229       5.35            0.16883   7             12
       D3          ALGA0011643  2      7,557,050       5.70            0.04177   15            -
       D1D2D3      ASGA0085597  2      1,083,343      20.84            2.89e-17  43 (22)       1
       MA          ASGA0085597  2      1,083,343      24.04            1.63e-20  55 (39)       1
CRCL   D1          MARC0070553  17     15,827,832     28.68            9.25e-25  98 (47)       7, 16
       D2          DIAS0000554  7      34,166,932     14.04            3.68e-10  146 (93)      13
       D3          H3GA0004878  1      265,179,997     5.65            0.04440   11            -
       D1D2D3      ALGA0093478  17     16,919,581     17.88            4.08e-14  56 (29)       1, 7
       MA          ALGA0093478  17     16,919,581     19.08            2.59e-15  57 (26)       1, 7

a Sus scrofa chromosome; b Total number of SNPs significant at a nominal level (total number of SNPs significant at a genome-wide level); c Other chromosomes on which associations surpassing the nominal significance level were detected.


Table 4. Number of genomic clusters, localization and number of significant SNPs.

Trait  Analysis (GWAS or MA)  Cluster number  SSC^a  Cluster boundaries (bp)  Length in Mb  Number of significant SNPs^b
ADG    D1                     1               2      631,324-2,586,096         1.94         7 (5)
       D1D2D3                 2               2      2,556,939-2,586,096       0.03         3 (3)
       MA                     3               2      141,798-2,586,096         2.44         9 (5)
BFT    D1                     4               2      236,179-5,189,397         4.95         31 (23)
       D2                     5               7      19,567,933-37,145,252    17.58         165 (92)
       D1D2D3                 6               2      141,798-3,895,569         3.75         13 (13)
                              7               7      26,522,116-28,252,780     1.73         5 (2)
       MA                     6               2      141,798-3,895,569         3.75         17 (15)
MFR    D1                     8               2      70,140-13,307,467        13.24         98 (76)
       D1D2D3                 6               2      141,798-3,895,569         3.75         18 (16)
                              9               2      7,536,991-8,647,689       1.11         6 (2)
                              10              2      12,168,412-13,294,789     1.13         11 (3)
       MA                     11              2      141,798-10,254,380       10.11         37 (30)
                              12              2      12,275,048-13,294,789     1.02         9 (8)
CRCL   D1                     13              7      97,195,350-99,887,568     2.69         15 (10)
                              14              17     12,361,530-19,474,175     7.11         42 (27)
                              15              17     21,832,087-23,773,788     1.94         11 (10)
       D2                     16              7      19,567,933-36,795,710    17.23         135 (92)
       D1D2D3                 17              7      23,659,424-26,522,116     2.86         8 (5)
                              18              7      97,147,161-99,424,987     2.28         9 (7)
                              19              17     13,692,477-19,474,175     5.78         21 (16)
       MA                     20              7      97,147,161-99,491,117     2.34         10 (7)
                              19              17     13,692,477-19,474,175     5.78         27 (17)

a Sus scrofa chromosome; b Total number of SNPs significant at a nominal level (total number of SNPs significant at a genome-wide level).


Figure 4. Venn diagram displaying common variants identified for the four traits via the

five analyses (i.e. D1, D2, D3, D1D2D3 GWAS and MA).

Discussion

Linkage disequilibrium between markers and quantitative trait loci is fundamental for conducting a successful genome-wide association study. In order to disentangle variants associated with complex traits, the LD pattern in the populations under investigation must be evaluated. This analysis was carried out by [20] for the D1, D2 and D3 populations included in this study. The main findings were that LD decays faster in the cross between European-type breeds (D1) than in the crosses between Asian or Wild Boar and European breeds (D2 and D3), while the fastest breakdown of LD is observed when the data are pooled. The latter finding suggests that the joint design (D1D2D3) could have a positive impact on the mapping resolution. Also in accordance with this were the results of [33] and [5], obtained via stochastic simulations of populations with a phylogeny similar to that of D1, D2 and D3.

The linkage phases between SNPs and the causative mutations that underlie the detected QTL are not always identical across populations. The D1, D2 and D3 F2 populations were established from genetically divergent breeds of Asian (Meishan) and European origin (Landrace, Large White and Piétrain, the common founder breed) as well as the European Wild Boar ancestor. Introgression of Asian pigs into the European stocks during the 18th and 19th century has been well documented and led to the existence of Asian haplotypes within the European commercial breeds for traits such as backfat and litter size [34]. Therefore, the premise of having shared QTL for some of the investigated traits among the populations is supported.

Considering this, an additional way to assess the feasibility of conducting a joint analysis is to examine how consistent the LD phase in the individual designs is with the LD phase in the joint dataset. Across all population pairs (i.e. D1-D2, D1-D3, D1-D1D2D3, D2-D3, D2-D1D2D3 and D3-D1D2D3), the phase correlation decreased with increasing marker distance (Figure 1). In other words, the shorter the chromosomal segment, the greater the chance that the LD phase is similar. As longer distances are considered, there is a higher chance that recombination events have disrupted the LD which was present in the ancestral population, while new LD is formed within the derived subpopulations [21]. The phase agreement between D1 and D1D2D3 had correlation values ranging from 0.86 to 0.96 because 75% of the joint design is composed of D1 individuals, whose allele frequencies prevail when the three designs are pooled. The second most correlated pair, D3-D1D2D3, contains only individuals of European ancestry derived from the breeds Piétrain, Large White, Landrace and the European Wild Boar. The individual design least correlated with the joint design was the D2 population, which stems from Meishan (Asian) and Piétrain. Nevertheless, for inter-marker distance classes below 500 Kb the correlation of phase was higher than 0.44, suggesting that for shorter chromosome segments there are LD similarities among these populations. When considering the individual population pairs, the different genetic backgrounds were responsible for the low levels of phase agreement in D1-D2 and D2-D3.

Meta-analysis of genome-wide association study results can increase the power to detect association signals by increasing the sample size. The use of this approach has grown substantially in the genomics field in the last decade as the scientific community recognized the value of collaborating to combine genetic resources [6, 8-10]. The output of the inverse variance based meta-analysis strategy depends on the standard error of each SNP in each study, because the weight assigned to each variant is calculated as the inverse of the squared standard error. Therefore, studies with higher standard errors have a smaller weight in the meta-analysis. Considering the GWAS summary statistics of the individual populations, the D2 population has overall higher standard errors as a result of higher phenotypic variance (Table 1). This implies that this study has a smaller weight in the MA; however, this is compensated to a certain degree by the large effect sizes of the associated variants, particularly for the traits BFT and CRCL, where highly significant associations were detected. Hence, the factors influencing the meta-analysis output are the standard error and the effect size, which greatly depend on the genetic architecture of the trait under investigation.
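The weighting described above can be illustrated for a single SNP with the standard fixed-effects inverse variance estimator. The sketch below is a generic version of that estimator, not the code used in this study. The function name and the toy values in the usage note are hypothetical, and the two-sided 𝑝 value assumes a normally distributed test statistic.

```python
import math


def inverse_variance_meta(betas, standard_errors):
    """Fixed-effects inverse variance meta-analysis for one SNP.

    betas           : effect estimates of the SNP in each study
    standard_errors : corresponding standard errors
    Returns (pooled effect, pooled standard error, z score, two-sided p value, weights).
    """
    # Each study is weighted by the inverse of its squared standard error.
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    z = beta / se
    # Two-sided p value under a standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, z, p, weights
```

With toy inputs `betas=[0.2, 0.4]` and `standard_errors=[0.1, 0.2]`, the second study receives a quarter of the weight of the first (25 versus 100), so the pooled effect is pulled towards the more precisely estimated value. This mirrors the situation of a design with higher standard errors contributing less to the MA unless its effect sizes are large.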

One of the main objectives of this study was to compare the meta-analysis with the joint analysis, in which the common individual-level genotypes are combined into a single dataset before the association study. Therefore, the agreement among the 𝑝 values was assessed. The correlation between the JA and MA 𝑝 values was higher than 0.7 (𝑝 < 0.05) for all traits, suggesting that the significance levels were similar in the two analyses. Moreover, the 𝑝 values from the individual population association studies were also compared with the results from the joint and meta-analysis. The D1 𝑝 values were the most correlated with the JA 𝑝 values (strong persistence of LD phase) and the MA 𝑝 values, while D2 and D3 showed low levels of correlation. A limitation when assessing the agreement of the individual designs with the JA and MA is that the correlation only considers variants common to all five analyses. Some variants which could be highly associated in an individual population might be disregarded because they are monomorphic in the others; however, the correlation value gives a valuable genome-wide overview of the majority of the SNPs (i.e. 31,299).

From the joint analysis summary statistics a total of eight clusters were assigned, and from the meta-analysis output, six (Table 4). Clusters 7 and 17 were identified only by the JA. Many of the significant regions overlapped (S1 Figure) or were identified via both analyses (i.e. clusters 6 and 19). Except for CRCL, the SNPs surpassing the nominal threshold had higher significance levels in the MA, and the clusters were supported by a higher number of variants. More unique variants were identified via the MA (Figure 4), yet none of them surpassed the genome-wide significance level. The clusters identified were generally smaller for the joint and meta-analysis than for the individual populations. This suggests that both approaches have a positive impact on the mapping resolution, pointing to narrower locations of causative variants.

Genetic variance and correlations

In the current study, growth (ADG) and carcass traits (related to fatness: BFT, and to anatomy: MFR and CRCL) were investigated. Previously reported heritabilities range from 0.03 to 0.49 for ADG, 0.12 to 0.74 for BFT and 0.55 to 0.60 for CRCL [35]. Resources for the MFR trait are limited, with only seven QTL listed to date in the PigQTLdb. The SNP-based heritabilities (Table 2) were overall moderate to high and mostly in accordance with the literature, except for CRCL in D2 ($h^2_{SNP}$ = 0.74), ADG in D3 ($h^2_{SNP}$ = 0.57) and CRCL in D3 ($h^2_{SNP}$ = 0.31). The genetic correlations were high between BFT and MFR (0.77 in D1, 0.53 in D2 and 0.83 in D3), as both traits have a genetic architecture composed of genes involved in fat metabolism.

Cluster identification and candidate genes

A total of 20 genomic clusters were identified (Table 4 and S1 Table). They were located on SSC2, SSC7 and SSC17. Three clusters were found for ADG, four for BFT, six for MFR (one shared with BFT) and eight for CRCL across the D1 GWAS, D2 GWAS, D3 GWAS, JA and MA. The length of the segments varied substantially, from 0.03 Mb (Cluster 2, supported by 3 significant SNPs) to 17.58 Mb (Cluster 5, supported by 165 significant SNPs). The large size of some clusters can be attributed to high levels of LD between SNPs, which lead to positive association signals over large genomic regions. The joint and meta-analysis did not reveal any new clusters that did not overlap with those identified via the single population association studies (S1 Figure). Nevertheless, the clusters from the joint and meta-analysis span shorter genomic regions, pointing to more precise locations in which to identify candidate genes.
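Whether two clusters overlap, and by how much, follows directly from their boundaries. The sketch below uses the boundaries of clusters 13, 14 and 18 as listed in Table 4. The tuple layout and the function name are hypothetical, and the cluster assignment procedure itself is not reproduced here.

```python
def overlap_bp(a, b):
    """Length in bp shared by two clusters given as (chromosome, start, end)."""
    chrom_a, start_a, end_a = a
    chrom_b, start_b, end_b = b
    if chrom_a != chrom_b:
        return 0  # clusters on different chromosomes cannot overlap
    return max(0, min(end_a, end_b) - max(start_a, start_b))


# Cluster boundaries taken from Table 4.
CLUSTER_13 = ("SSC7", 97_195_350, 99_887_568)   # CRCL, D1
CLUSTER_18 = ("SSC7", 97_147_161, 99_424_987)   # CRCL, D1D2D3
CLUSTER_14 = ("SSC17", 12_361_530, 19_474_175)  # CRCL, D1

print(overlap_bp(CLUSTER_13, CLUSTER_18))  # shared segment on SSC7, in bp
print(overlap_bp(CLUSTER_13, CLUSTER_14))  # different chromosomes
```

Applied to all pairs of clusters, such a check makes it explicit which joint or meta-analysis clusters fall within regions already detected in the individual designs.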

The traits in this study are influenced by several genes expressed during the prenatal and

postnatal development. Carcass length is a trait mostly determined prenatally and proportional

to the length of the spine, as well as the individual length of the vertebrae [36]. Average daily

gain, backfat thickness and meat to fat ratio are primarily influenced by tissue growth which

can be obtained through cell hyperplasia (e.g. cell proliferation) or hypertrophy (i.e. growth in

size) [37].

Several known genes with previously reported functions were associated with the traits. One of the most prominent genes, IGF2, was not assembled in the Sscrofa 10.2 genome version but has now been positioned on the Sscrofa 11.1 reference genome. IGF2 has been described to have an effect on muscle mass and fat deposition [38]. The region where IGF2 resides is included in six of the clusters identified for ADG, BFT and MFR (1, 3, 4, 6, 8 and 11), which partially overlap (S1 Figure). The most significant SNP in the vicinity of IGF2 was identified for MFR via meta-analysis: ASGA0085597 (with -log10 (𝑝 value) = 24.04 and q value = 1.63e-20). Moreover, the partially overlapping clusters located on SSC2 (Table 4) harbor genes with growth factor activity (GO: 0008083): FGF3, FGF4, FGF19 and VEGFB, as well as genes responsible for the maintenance of gastrointestinal epithelium (GO: 0030277): MUC6, MUC2, MUC5AC and MUC5B. Other genes highlighted for ADG, BFT and MFR are HRAS (positive regulation of cell proliferation, GO: 0008284) and DHCR7 (lipid metabolic process, GO: 0006629). Clusters 10 and 12, found via the joint and meta-analysis, narrowed down a region specific for MFR which contains associations for several olfactory receptors residing in the genomic region SSC2: 12-14 Mb, a gene family which is known to have expanded significantly over time within the pig genome [39].

On SSC7 from 19 to 38 Mb, clusters 5 and 7 showed associations with BFT in the D2 population and in the joint design. Several other studies pinpointed QTL related to fat traits in the same region [40, 41]. The gene PPARD has been reported as a good candidate gene for fat deposition traits [42]. One of the significant SNPs in cluster 5 (H3GA0020846, with -log10 (𝑝 value) = 6.11 and q value = 3.5e-4) is located in an intron of this gene. Furthermore, the critical region on SSC7 harboring the two clusters for BFT overlaps with clusters 16 and 17, which were assigned to CRCL in D2 and in the joint analysis, respectively. The significant variants shared by these clusters for BFT and CRCL also show opposite directions of effect. This suggests a strong interplay between BFT- and CRCL-associated variants, leading to pleiotropic consequences on the phenotypes. It is then reasonable to assume that the horizontal growth of the animal is already largely determined in the prenatal developmental stages by the number of vertebrae and their length, while the same or other genes act in the opposite direction during postnatal life on the vertical growth of the backfat layer. For that reason, the following genes, which are located in a region highly associated with both BFT and CRCL (SSC7: 24-26 Mb), were suggested: BMP5, part of the transforming growth factor-beta (TGF-beta) signaling pathway; HMGCLL1 (ketone body biosynthetic process, GO: 0046951); GFRAL, a receptor required for GDF15 (growth differentiation factor 15) mediated reductions in food intake and body weight in mice with obesity [43]; and HCRTR2 (feeding behavior, GO: 0007631).

Clusters 13, 18 and 20, which mostly overlapped, contained significant associations for carcass length in the D1 design, the joint analysis and the meta-analysis. This genomic region contains genes which have already been associated with carcass length: VRTN [44], LTBP2 [45] and TGFB3 [46]. The three genes influence the development of vertebrae and ribs in mammalian embryos and thus have a direct influence on carcass length. Moreover, an inspection of the SSC17 regions (i.e. clusters 14, 15 and 19) highly associated with CRCL in the D1, joint and meta-analysis revealed genes potentially influencing this trait: PLCB1 (positive regulation of developmental growth, GO: 0048639), FLRT3 (embryonic morphogenesis, GO: 0048598) and FERMT1 (positive regulation of transforming growth factor beta receptor signaling pathway, GO: 0030511). Nevertheless, the most interesting finding was BMP2, situated close to the most significant SNP for CRCL in D1 (MARC0070553, with -log10 (𝑝 value) = 28.68 and q value = 9.25e-25). The bone morphogenetic protein 2 (BMP2) belongs to the same family as BMP5 and is involved in the transforming growth factor-beta (TGF-beta) signaling pathway, playing a role in bone and cartilage development.

Conclusion

A genome-wide association study was conducted for growth and carcass traits using SNP-chip

information from two populations sharing a common founder breed (Piétrain). An individual

population GWAS was conducted and two strategies for combining the datasets were

employed: a joint population GWAS and a meta-analysis of the individual population GWAS

summary statistics. While the joint population GWAS and the meta-analysis did not identify

new associated regions besides the ones identified in the individual populations, both

approaches had a positive impact on the mapping resolution which implies that causative

mutations can be identified with higher accuracy. Depending on the access to the complete

original datasets, the strategies can complement or substitute each other. A total of 20 genomic

clusters were pinpointed and they contained genes previously associated with the traits (e.g.

IGF2, VRTN and TGFB3). Finally, among the additional candidate genes being suggested,

BMP2 is being proposed as a strong candidate gene for carcass length. The findings of this

study provide novel insights into approaches of dissecting the genetic basis of growth and

carcass traits and indicate directions of further research which will lead to the identification of

causal mutations affecting traits relevant in pig breeding programs.


References

1. Andersson L, Haley C, Ellegren H, Knott S, Johansson M, Andersson K, et al. Genetic

mapping of quantitative trait loci for growth and fatness in pigs. Science. 1994;

263:1771-4.

2. Rothschild MF, Hu Z-l, Jiang Z. Advances in QTL Mapping in Pigs. Int J Biol Sci.
2007;3(3):192-7. PubMed PMID: PMC1802014.

3. Ernst CW, Steibel JP. Molecular advances in QTL discovery and application in pig breeding.

Trends Genet. 2013;29(4):215-24. doi: 10.1016/j.tig.2013.02.002.

4. Ramos AM, Crooijmans RPMA, Affara NA, Amaral AJ, Archibald AL, Beever JE, et al.

Design of a High Density SNP Genotyping Assay in the Pig Using SNPs Identified and

Characterized by Next Generation Sequencing Technology. PLoS One.

2009;4(8):e6524. doi: 10.1371/journal.pone.0006524.

5. Schmid M, Wellmann R, Bennewitz J. Power and precision of QTL mapping in simulated

multiple porcine F2 crosses using whole-genome sequence information. BMC Genet.

2018;19:22. doi: 10.1186/s12863-018-0604-0. PubMed PMID: PMC5883302.

6. Evangelou E, Ioannidis JPA. Meta-analysis methods for genome-wide association studies

and beyond. Nat Rev Genet. 2013;14:379. doi: 10.1038/nrg3472.

7. Willer CJ, Li Y, Abecasis GR. METAL: fast and efficient meta-analysis of genomewide

association scans. Bioinformatics. 2010;26(17):2190-1. doi:

10.1093/bioinformatics/btq340.

8. Bernal Rubio YL, Gualdrón Duarte JL, Bates RO, Ernst CW, Nonneman D, Rohrer GA, et

al. Implementing meta-analysis from genome-wide association studies for pork quality

traits. J Anim Sci. 2015;93(12):5607-17. doi: 10.2527/jas.2015-9502.

9. Guo Y, Qiu H, Xiao S, Wu Z, Yang M, Yang J, et al. A genome-wide association study

identifies genomic loci associated with backfat thickness, carcass weight, and body

weight in two commercial pig populations. J Appl Genet. 2017;58(4):499-508. doi:

10.1007/s13353-017-0405-6. PubMed PMID: 28890999.

10. Le TH, Christensen OF, Nielsen B, Sahana G. Genome-wide association study for

conformation traits in three Danish pig breeds. Genet Sel Evol. 2017;49:12. doi:

10.1186/s12711-017-0289-2. PubMed PMID: PMC5259967.

11. Ruckert C, Bennewitz J. Joint QTL analysis of three connected F2-crosses in pigs. Genet

Sel Evol. 2010;42:40. Epub 2010/11/03. doi: 10.1186/1297-9686-42-40. PubMed

PMID: 21040563; PubMed Central PMCID: PMC2988712.


12. Boysen TJ, Tetens J, Thaller G. Detection of a quantitative trait locus for ham weight with

polar overdominance near the ortholog of the callipyge locus in an experimental pig F2

population. J Anim Sci. 2010;88(10):3167-72. Epub 2010/06/29. doi: 10.2527/jas.2009-2565.
PubMed PMID: 20581286.

13. Edwards SM, Sørensen IF, Sarup P, Mackay TFC, Sørensen P. Genomic Prediction for

Quantitative Traits Is Improved by Mapping Variants to Gene Ontology Categories in

Drosophila melanogaster. Genetics. 2016;203(4):1871.

14. Borchers N, Reinsch N, Kalm E. Familial cases of coat colour‐change in a Piétrain cross.

J Anim Breed Genet. 2000;117(4):285-7. doi: 10.1111/j.1439-0388.2000.00255.x.

15. Geldermann H, Müller E, Beeckmann P, Knorr C, Yue G, Moser G. Mapping of

quantitative‐trait loci by means of marker genes in F2 generations of Wild boar, Pietrain

and Meishan pigs. J Anim Breed Genet. 1996;113(1-6):381-7. doi: 10.1111/j.1439-0388.1996.tb00629.x.

16. Zentralverband der Deutschen Schweineproduktion (ZDS). Richtlinie für die

Stationsprüfung auf Mastleistung, Schlachtkörperwert und Fleischbeschaffenheit beim

Schwein. 1981.

17. Zentralverband der Deutschen Schweineproduktion (ZDS). Richtlinie für die

Stationsprüfung auf Mastleistung, Schlachtkörperwert und Fleischbeschaffenheit beim

Schwein. 1999.

18. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, et al. PLINK: A

Tool Set for Whole-Genome Association and Population-Based Linkage Analyses. Am

J Hum Genet. 2007;81(3):559-75. doi: 10.1086/519795.

19. Fujii J, Otsu K, Zorzato F, de Leon S, Khanna V, Weiler J, et al. Identification of a mutation

in porcine ryanodine receptor associated with malignant hyperthermia. Science.

1991;253(5018):448-51. doi: 10.1126/science.1862346.

20. Stratz P, Schmid M, Wellmann R, Preuss S, Blaj I, Tetens J, et al. Linkage disequilibrium

pattern and genome-wide association mapping for meat traits in multiple porcine F2-

crosses. Anim Genet. 2018; doi: 10.1111/age.12684. In press.

21. Hill WG, Robertson A. Linkage disequilibrium in finite populations. Theor Appl Genet.

1968;38(6):226-31. doi: 10.1007/bf01245622.


22. Badke YM, Bates RO, Ernst CW, Schwab C, Steibel JP. Estimation of linkage

disequilibrium in four US pig breeds. BMC Genomics. 2012;13(1):24. doi:

10.1186/1471-2164-13-24.

23. R Core Team. R: A Language and Environment for Statistical Computing. 2014.

24. Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: A Tool for Genome-wide Complex

Trait Analysis. Am J Hum Genet. 2011;88(1):76-82. doi: 10.1016/j.ajhg.2010.11.011.

25. Li G, Zhu H. Genetic Studies: The Linear Mixed Models in Genome-wide Association

Studies. The Open Bioinformatics Journal. 2013;7(1):27-33.

26. Turner SD. qqman: an R package for visualizing GWAS results using Q-Q and manhattan

plots. bioRxiv. 2014. doi: 10.1101/005165.

27. Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc Natl Acad

Sci USA. 2003;100(16):9440-5. doi: 10.1073/pnas.1530509100.

28. Bardou P, Mariette J, Escudié F, Djemiel C, Klopp C. jvenn: an interactive Venn diagram

viewer. BMC Bioinformatics. 2014;15(1):293. doi: 10.1186/1471-2105-15-293.

29. Durinck S, Moreau Y, Kasprzyk A, Davis S, De Moor B, Brazma A, et al. BioMart and

Bioconductor: a powerful link between biological databases and microarray data

analysis. Bioinformatics. 2005;21(16):3439-40. doi: 10.1093/bioinformatics/bti525.

30. Zerbino DR, Achuthan P, Akanni W, Amode M R, Barrell D, Bhai J, et al. Ensembl 2018.

Nucleic Acids Res. 2018;46(D1):D754-D61. doi: 10.1093/nar/gkx1098.

31. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene Ontology:

tool for the unification of biology. Nat Genet. 2000;25:25. doi: 10.1038/75556.

32. The Gene Ontology Consortium. Expansion of the Gene Ontology knowledgebase and

resources. Nucleic Acids Res. 2017;45(D1):D331-D8. doi: 10.1093/nar/gkw1108.

PubMed PMID: PMC5210579.

33. Bennewitz J, Wellmann R. Mapping Resolution in Single and Multiple F2 Populations

using Genome Sequence Marker Panels. Proceedings of the 10th World Congress on

Genetics Applied to Livestock Production. 2014.

34. Bosse M, Lopes MS, Madsen O, Megens H-J, Crooijmans RPMA, Frantz LAF, et al.

Artificial selection on introduced Asian haplotypes shaped the genetic architecture in

European commercial pigs. Proc Biol Sci. 2015;282(1821):20152019. doi:

10.1098/rspb.2015.2019. PubMed PMID: PMC4707752.

35. Rothschild MF, Ruvinsky A. The Genetics of the Pig. 2nd ed. Oxfordshire, UK,
Cambridge, USA: CAB International; 2011. 526 p.


36. Borchers N, Reinsch N, Kalm E. The number of ribs and vertebrae in a Piétrain cross:

variation, heritability and effects on performance traits. J Anim Breed Genet.

2004;121(6):392-403. doi: 10.1111/j.1439-0388.2004.00482.x.

37. Lawrence TLJ, Fowler VR. Growth of farm animals. 2nd ed. Oxon, UK, New York,
USA: CABI Pub.; 2002. 347 p.

38. Nezer C, Moreau L, Brouwers B, Coppieters W, Detilleux J, Hanset R, et al. An imprinted

QTL with major effect on muscle mass and fat deposition maps to the IGF2 locus in

pigs. Nat Genet. 1999;21:155. doi: 10.1038/5935.

39. Nguyen DT, Lee K, Choi H, Choi M-k, Le MT, Song N, et al. The complete swine olfactory

subgenome: expansion of the olfactory gene repertoire in the pig genome. BMC

Genomics. 2012;13(1):584. doi: 10.1186/1471-2164-13-584.

40. de Koning DJ, Janss LLG, Rattink AP, van Oers PAM, de Vries BJ, Groenen MAM, et al.

Detection of Quantitative Trait Loci for Backfat Thickness and Intramuscular Fat

Content in Pigs (Sus Scrofa). Genetics. 1999;152(4):1679-90.

41. Bidanel J-P, Milan D, Iannuccelli N, Amigues Y, Boscher M-Y, Bourgeois F, et al.

Detection of quantitative trait loci for growth and fatness in pigs. Genet Sel Evol.

2001;33(3):289. doi: 10.1186/1297-9686-33-3-289.

42. Zhang Y, Gao T, Hu S, Lin B, Yan D, Xu Z, et al. The Functional SNPs in the 5’ Regulatory

Region of the Porcine PPARD Gene Have Significant Association with Fat Deposition

Traits. PLoS One. 2015;10(11):e0143734. doi: 10.1371/journal.pone.0143734.

43. Yang L, Chang C-C, Sun Z, Madsen D, Zhu H, Padkjær SB, et al. GFRAL is the receptor

for GDF15 and is required for the anti-obesity effects of the ligand. Nat Med.

2017;23:1158. doi: 10.1038/nm.4394.

44. Mikawa S, Sato S, Nii M, Morozumi T, Yoshioka G, Imaeda N, et al. Identification of a

second gene associated with variation in vertebral number in domestic pigs. BMC

Genet. 2011;12(1):5. doi: 10.1186/1471-2156-12-5.

45. Zhang L, Liu X, Liang J, Zhao K, Yan H, Li N, et al. Genome-wide Association Study for

Number of Vertebrae in an F2 Large White × Minzhu Population of Pigs. bioRxiv. 2015.

doi: 10.1101/016956.

46. Zhang L, Yue J-W, Pu L, Wang L-G, Liu X, Liang J, et al. Genome-wide study refines the

quantitative trait locus for number of ribs in a Large White × Minzhu intercross pig

population and reveals a new candidate gene. Mol Genet Genomics. 2016;291(5):1885-90.
doi: 10.1007/s00438-016-1220-1.


Chapter 2

Non-additive effects in four diverse F2 pig crosses for growth,

carcass and fat related traits

Iulia Blaj1, Jens Tetens2, 3, Siegfried Preuß4, Jörn Bennewitz4 and Georg Thaller1

1Institute of Animal Breeding and Husbandry, Kiel University, Kiel, Germany

2Functional Breeding Group, Department of Animal Sciences, Göttingen University,

Göttingen, Germany

3Center for Integrated Breeding Research, Göttingen University, Göttingen, Germany

4Institute of Animal Husbandry and Breeding, University of Hohenheim, Stuttgart, Germany

Manuscript in preparation

Abstract

Non-additive effects, such as dominance and imprinting, have received substantial attention in

the recent past. Advances were in general coupled with technological progress regarding

genomic data generation. In this study, we focus on four F2 resource populations for which prior

investigations on non-additive effects were mostly done based on sparse genotyping and linkage

mapping. Here, SNP array based genomic information, complementing the phenotypic data, is

used to gain insights into mechanisms underlying non-additivity. Dominance and

imprinting are evaluated by means of variance components estimation and usage of various

genome-wide association models (i.e. dominance, imprinting, maternal and paternal) for

growth, carcass and fat related traits. Variance attributed to imprinting varies from 0% to 19%

and is more prevalent in the larger F2 population stemming from Piétrain boars crossed with Large White ×

Landrace crossbred or Large White sows. Dominance is responsible for 0% to 34% of the

phenotypic variance and is more pronounced in the crosses originating from more distant

founders (due to elevated heterozygosity levels). Disentangling additive, dominance and

imprinting variances revealed the confounding nature of these various genetic partitions. Highly

significant associations in the imprinting and paternal models were detected around IGF2, a gene known to

be under epigenetic control. Attention is drawn to the fact that the statistical models used for

the non-additive effects can lead to spurious associations, as an artefact of the population

setting.

Background

Non-additive genetic effects have recently generated considerable interest in the area of

livestock genomics (Varona et al., 2018). While dominance effects have long been included as

a component of the genetic variance (Falconer and Mackay, 1996), imprinting effects are a

younger area of research. Imprinting, an epigenetic mechanism, refers to a locus at which the

two identical alleles are not equivalent from a functional point of view leading to the preferential

expression from either the maternally or paternally inherited allele (Lawson et al., 2013). A

classic example for imprinting in pigs is the paternally expressed IGF2 gene (Jeon et al., 1999;

Nezer et al., 1999). While imprinting is best understood as a mechanism in the classical setting

where a complete parent-of-origin monoallelic expression (resembling dominance) is expected,

situations that deviate from complete imprinting, in which e.g. genes display tissue-specific

and/or time-dependent expression (Prickett and Oakey, 2012), are encountered as well. The

different variations of imprinting and dominance (polar overdominance, polar

underdominance, bipolar dominance) and the interplay among these effects (O'Doherty et al.,

2015) contribute collectively to the difficulty of defining an appropriate statistical framework

that could account for the various scenarios. Nevertheless, complex mechanisms can be

unraveled such as the classic example of dominance (polar overdominance) interacting with

imprinting in the callipyge phenotype in sheep (Cockett et al., 1996).

The estimation of non-additive effects proved to be challenging in the past due to technical

limitations that were overcome with the availability of dense single nucleotide polymorphism

(SNP) arrays. In this context, estimation of dominance and imprinting effects became appealing

in the last decade and it has been under debate how much of the phenotypic variance can be

assigned to these effects and whether including such effects in the genetic evaluation models

could improve breeding value estimates (Nishio and Satoh, 2015; Jiang et al., 2017; Blunk et

al., 2019). From an applied perspective in livestock breeding, predicting breeding values, either

in a classical setting or by means of genomic selection (Meuwissen et al., 2001), focuses on the

additive genetic effects. Nevertheless, given the complex nature of the quantitative traits, a

deeper insight into the possible mechanisms that contribute in a non-additive manner to the

genetic variance can pave the way to employing such findings in practice.

For this study, we consider data from four F2 resource populations for which previous

investigations regarding dominance and imprinting were carried out using sparse genotyping

and linkage mapping. Three of the crosses were analyzed using 250 genetic markers (mostly

microsatellites) jointly and separately for additive, dominance and imprinting effects on fat and

metabolic traits (Rückert and Bennewitz, 2010). For the other cross, an interval mapping

approach using microsatellite data detected three quantitative trait loci (QTL) affecting carcass

traits comparable with the callipyge phenotype in sheep (Boysen et al., 2010). Using selected

genotypic subgroups in the IGF2 region in the same population and an interval mapping

approach, Boysen et al. (2011) reported additive and parent-of-origin evidence for additional

functional genetic variation within the IGF2 region affecting body composition traits.

The available phenotypic information from these resource populations coupled with the

information from the dense SNP arrays could facilitate a better understanding of the underlying

non-additive mechanisms. Thus, the aim of the study was twofold: first, to decompose the

phenotypic variance for growth, carcass and fat related traits by means of genomic relationship

matrices accounting for additive, dominance and imprinting partitions; second, to

investigate non-additive (i.e. dominance, imprinting, maternal and paternal) SNP effects by

using different models for genome-wide association studies (GWAS).

Materials and methods

Resource populations

For this study, we used four F2 pig resource populations (Borchers et al., 2000; Geldermann et

al., 1996). The first design (D1) contained 1,785 F2 individuals originating from

Piétrain (P) boars and Large White × Landrace (LwxL) crossbred or Large White (Lw)

founder sows. The remaining three populations (i.e. D2, D3 and D4) stemmed from a

Meishan (M) boar or Wild boar (W) crossed with either Piétrain or Meishan females. The F2

generation was the outcome of repeatedly crossing F1 boars with F1 sows in order to obtain

large full-sib families. D2 (MxP) had 304 F2 individuals, D3 (WxP) contained 291 and

D4 (WxM) comprised 302. Further details are available in Figure 1.

Figure 1. Description of the four F2 resource populations per generation.

Phenotypes and genotypes

The phenotypes available for all F2 individuals were average daily gain (ADG), average backfat

thickness (BFT), meat to fat ratio (MFR) and carcass length (CRCL). An additional set of

phenotypes was used for D1: fat thickness neck (SCF), fat thickness middle of the back (BFM),

fat thickness end of the back (BFTR), fat thickness at latissimus dorsi muscle (SCFLD), fat

thickness over the loin muscle (SCFLM) and belly fatness score (BFS). The phenotypes were

pre-corrected for systematic environmental effects and for the effect of the RYR1 gene (Fujii et al.,

1991). All three generations were genotyped with the Illumina PorcineSNP60 BeadChip

(Ramos et al., 2009). SNP chromosomal positions were based on the current pig genome

assembly (Sscrofa 11.1). Quality control was performed in PLINK (Purcell et al., 2007) and

removed SNPs with a minor allele frequency (MAF) < 0.05 or a call rate < 0.90. The final SNP counts were

44,451, 40,733, 37,139 and 35,887 for D1, D2, D3 and D4, respectively.
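The two quality control filters can be illustrated with a minimal sketch. It is not the PLINK procedure used in the study; the function names, the 0/1/2 genotype coding and the toy genotypes are assumptions made only for illustration.

```python
from typing import List, Optional

# Genotype = number of copies of a reference allele (0, 1 or 2); None = missing call.
Genotype = Optional[int]


def call_rate(genotypes: List[Genotype]) -> float:
    """Fraction of individuals with a non-missing genotype at one SNP."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def minor_allele_frequency(genotypes: List[Genotype]) -> float:
    """Minor allele frequency, computed from the called genotypes only."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    freq = sum(called) / (2 * len(called))
    return min(freq, 1.0 - freq)


def passes_qc(genotypes: List[Genotype], min_maf: float = 0.05,
              min_call_rate: float = 0.90) -> bool:
    """Keep a SNP unless MAF < min_maf or call rate < min_call_rate."""
    return (call_rate(genotypes) >= min_call_rate
            and minor_allele_frequency(genotypes) >= min_maf)


# Hypothetical toy genotypes, one list per SNP.
toy_snps = {
    "snp_common": [0, 1, 2, 1, 0, 1, 2, 1, 0, 1],
    "snp_rare": [0] * 19 + [1],
    "snp_poorly_called": [0, 1, None, None, 2, 1, 0, 1, 2, 1],
}
kept = [name for name, g in toy_snps.items() if passes_qc(g)]
print(kept)
```

In this toy set only the first SNP is retained: the second fails the MAF filter and the third fails the call rate filter.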

Variance component and association studies

To distinguish between the maternal and paternal origin of the alleles and implicitly to

discriminate between aA and Aa heterozygous genotypes, phasing was conducted using

SHAPEIT2 (O'Connell et al., 2014) with the flag --duoHMM, which ensures that the haplotype

inference is consistent with the pedigree structure. For the phasing step, information from all

three generations (i.e. F0, F1 and F2) was used as input. With the phased genotypes of the F2

individuals available, aa, aA, Aa and AA were coded in the following manner: {1,0,0,-1} for additive,

{0,1,1,0} for dominance and {0,1,-1,0} for imprinting (Nishio and Satoh, 2015). The genomic

relationship matrices (for the terms gA, gD and gI) were constructed similarly to Laurin et al. (2018) and the

phenotypic variance was partitioned in a stepwise approach using three models:

i) $y = \mu + g_A + e$ (additive)

ii) $y = \mu + g_A + g_D + e$ (additive and dominance)

iii) $y = \mu + g_A + g_D + g_I + e$ (additive, dominance and imprinting)
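The genotype coding described above, and the step from a coded matrix to a relationship matrix, can be sketched as follows. This is an illustration only: the assumption that the first letter of a phased genotype is the paternally inherited allele, the centring of columns and the scaling to a mean diagonal of 1 are choices made for this sketch and are not necessarily those of Laurin et al. (2018) or of the study. The toy genotypes are hypothetical.

```python
from typing import Dict, List

# Phased genotype classes in the order used in the text: aa, aA, Aa, AA.
# Assumption: the first letter is the paternally inherited allele.
CODES: Dict[str, Dict[str, int]] = {
    "additive":   {"aa": 1, "aA": 0, "Aa": 0, "AA": -1},
    "dominance":  {"aa": 0, "aA": 1, "Aa": 1, "AA": 0},
    "imprinting": {"aa": 0, "aA": 1, "Aa": -1, "AA": 0},
}


def code_matrix(genotypes: List[List[str]], effect: str) -> List[List[float]]:
    """Individuals x SNPs matrix of covariates for one effect type."""
    table = CODES[effect]
    return [[float(table[g]) for g in row] for row in genotypes]


def relationship_matrix(z: List[List[float]]) -> List[List[float]]:
    """Centre the columns of Z, then return Z Z' scaled to a mean diagonal of 1."""
    n, m = len(z), len(z[0])
    means = [sum(row[j] for row in z) / n for j in range(m)]
    zc = [[row[j] - means[j] for j in range(m)] for row in z]
    g = [[sum(zc[i][k] * zc[j][k] for k in range(m)) for j in range(n)]
         for i in range(n)]
    mean_diag = sum(g[i][i] for i in range(n)) / n
    return [[value / mean_diag for value in row] for row in g]


# Hypothetical phased genotypes: 4 individuals (rows) x 3 SNPs (columns).
toy_genotypes = [
    ["aa", "aA", "AA"],
    ["aA", "Aa", "aa"],
    ["Aa", "AA", "aA"],
    ["AA", "aa", "Aa"],
]

matrices = {effect: relationship_matrix(code_matrix(toy_genotypes, effect))
            for effect in CODES}
for effect, g in matrices.items():
    print(effect, [round(v, 2) for v in g[0]])
```

The three matrices obtained this way would play the role of the covariance structures of gA, gD and gI in models i) to iii).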

For the GWAS, five mixed linear models were used, all incorporating the random terms gA, gD and

gI. The models followed Mantey et al. (2005) and Mozaffari et al. (2019) and are shown

below. In the first three, the SNPs were tested for additivity, dominance and imprinting. The

latter two were a maternal and a paternal model, in which the phased alleles coming either from

the dam or from the sire were tested separately for association. The variance component

estimation and the mixed models were solved in R (R Core Team, 2018) using the package

‘sommer’ (Covarrubias-Pazaran, 2016).

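i) $y = \mu + x_{add} b_{add} + g_A + g_D + g_I + e$ (additive model)

ii) $y = \mu + x_{dom} b_{dom} + g_A + g_D + g_I + e$ (dominance model)

iii) $y = \mu + x_{imp} b_{imp} + g_A + g_D + g_I + e$ (imprinting model)

iv) $y = \mu + x_{mat} b_{mat} + g_A + g_D + g_I + e$ (maternal model)

v) $y = \mu + x_{pat} b_{pat} + g_A + g_D + g_I + e$ (paternal model)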
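The SNP covariates entering the five models can be sketched for a single marker as below. The sketch fits a plain least-squares slope and deliberately omits the random terms gA, gD and gI, so it is not the mixed model solved with sommer. The 0/1 indicator coding of the maternal and paternal covariates, the function names and the toy phenotype (a purely paternal effect of 2 units) are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def model_covariates(paternal: str, maternal: str) -> Dict[str, int]:
    """Covariates for one SNP given the phased alleles ('a' or 'A')."""
    n_big = int(paternal == "A") + int(maternal == "A")
    return {
        "add": 1 - n_big,                                    # aa=1, aA=Aa=0, AA=-1
        "dom": int(paternal != maternal),                    # 1 for both heterozygotes
        "imp": int(maternal == "A") - int(paternal == "A"),  # aA=1, Aa=-1
        "mat": int(maternal == "A"),                         # assumed indicator coding
        "pat": int(paternal == "A"),                         # assumed indicator coding
    }


def ols_slope(x: List[float], y: List[float]) -> float:
    """Least-squares slope of y on x with an intercept."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


# Hypothetical F2 sample: each phased class (paternal, maternal) appears twice.
pairs: List[Tuple[str, str]] = [("a", "a"), ("a", "A"), ("A", "a"), ("A", "A")] * 2
# Hypothetical phenotype driven only by the paternally inherited allele.
phenotype = [10.0 + 2.0 * (p == "A") for p, _ in pairs]

slopes = {
    name: ols_slope([model_covariates(p, m)[name] for p, m in pairs], phenotype)
    for name in ("add", "dom", "imp", "mat", "pat")
}
print(slopes)
```

In this balanced toy example the paternal covariate recovers the full effect and the maternal and dominance covariates show none, while the additive and imprinting covariates each pick up part of the paternal signal.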

Manhattan plots for the GWAS were created with the ggplot2 R package (Wickham, 2009). Using a

Bonferroni correction, the genome-wide significance threshold was set to $p_{\text{genome-wide}} \le 0.05$,

i.e. $0.05/N_{SNPs}$ per SNP. To balance the stringency of the Bonferroni correction, an additional nominal significance

level with a threshold of $p \le 5 \times 10^{-5}$ was applied.
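As a worked example, the Bonferroni-corrected threshold on the –log10 scale, –log10(0.05/NSNPs) as given in the figure captions, can be computed from the SNP counts reported after quality control. The variable and function names are illustrative.

```python
import math

# SNP counts after quality control, as reported for the four designs.
snp_counts = {"D1": 44451, "D2": 40733, "D3": 37139, "D4": 35887}


def bonferroni_neg_log10(alpha: float, n_tests: int) -> float:
    """Genome-wide threshold on the -log10(p) scale for n_tests tests."""
    return -math.log10(alpha / n_tests)


genome_wide = {design: bonferroni_neg_log10(0.05, n)
               for design, n in snp_counts.items()}
nominal = -math.log10(5e-5)  # nominal threshold p <= 5 x 10^-5

for design, threshold in genome_wide.items():
    print(f"{design}: genome-wide {threshold:.2f}, nominal {nominal:.2f}")
```

Because the designs differ in the number of retained SNPs, the genome-wide threshold differs slightly between them (roughly 5.86 to 5.95), whereas the nominal threshold is about 4.30 in all designs.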

Results

Variance component estimation

First, we partitioned the phenotypic variance into an additive and an environmental

(residual) component using genomic relationship matrices computed from genotypes.

Subsequently, a component accounting for dominance and one accounting for imprinting were

added. Table 1 and Table 2 present how including an additional component influences the overall

phenotypic variance explained in all four crosses for all ten traits considered. The

visualization of the percentage breakdown for the variance including all four components (i.e.

additive, dominance, imprinting and residual) is available in Figure SM1. The dominance

variance ranged from 0% to 34%, whereas the imprinting variance ranged from 0% to 19%. In the D1

cross, the highest dominance variance was for ADG, contributing 6% to the total variance.

For the trait SCFLM in D1, the imprinting variance explained 19% of the phenotypic variance.

Considering the other three crosses, a 34% dominance contribution was recorded for ADG in

D2 and an 11% imprinting contribution for BFT in D3.
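The effect of adding components on the residual share can be followed with a small calculation on a few of the ratios reported in Table 1 and Table 2. The sketch assumes that the residual share is simply one minus the explained share of the phenotypic variance; the names are illustrative.

```python
# Explained share of the phenotypic variance as reported in Table 1 and Table 2
# for models i) additive, ii) additive + dominance and
# iii) additive + dominance + imprinting.
reported_totals = {
    ("D1", "SCFLM"): (0.46, 0.47, 0.56),
    ("D1", "CRCL"): (0.60, 0.60, 0.60),
    ("D2", "ADG"): (0.32, 0.63, 0.63),
}


def residual_shares(totals):
    """Assumption: residual share = 1 - explained share, rounded to 2 decimals."""
    return tuple(round(1.0 - t, 2) for t in totals)


residuals = {key: residual_shares(totals) for key, totals in reported_totals.items()}
for (cross, trait), shares in residuals.items():
    print(cross, trait, shares)
```

For SCFLM in D1 the residual share drops once imprinting is included, for ADG in D2 it drops once dominance is included, and for CRCL in D1 it stays unchanged across the three models.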

Table 1. Additive (VA), dominance (VD) and imprinting (VI) variance contribution for D1

cross (where h2=VA/VP is the narrow sense heritability) in: i) additive, ii) additive +

dominance and iii) additive + dominance + imprinting.

| Cross | Trait | VA / VP | (VA + VD) / VP | VA / VP + VD / VP | (VA + VD + VI) / VP | VA / VP + VD / VP + VI / VP |
|---|---|---|---|---|---|---|
| D1 | ADG | 0.35 (0.04) | 0.39 (0.04) | 0.32 (0.04) + 0.07 (0.02) | 0.41 (0.04) | 0.29 (0.04) + 0.06 (0.02) + 0.06 (0.04) |
| | BFT | 0.46 (0.04) | 0.46 (0.04) | 0.45 (0.04) + 0.01 (0.02) | 0.46 (0.04) | 0.45 (0.04) + 0.01 (0.02) + 0 (0.02) |
| | MFR | 0.51 (0.03) | 0.52 (0.03) | 0.52 (0.04) + 0 (0.01) | 0.59 (0.03) | 0.44 (0.04) + 0 (0.01) + 0.15 (0.04) |
| | CRCL | 0.60 (0.03) | 0.60 (0.03) | 0.60 (0.03) + 0 (0.01) | 0.60 (0.03) | 0.60 (0.03) + 0 (0.01) + 0 (0.02) |
| | SCF | 0.32 (0.04) | 0.34 (0.04) | 0.29 (0.04) + 0.04 (0.02) | 0.34 (0.04) | 0.29 (0.04) + 0.04 (0.02) + 0.01 (0.03) |
| | BFM | 0.31 (0.04) | 0.32 (0.04) | 0.30 (0.04) + 0.02 (0.02) | 0.32 (0.04) | 0.30 (0.04) + 0.02 (0.02) + 0 (0.02) |
| | BFTR | 0.42 (0.04) | 0.42 (0.04) | 0.42 (0.04) + 0 (0.02) | 0.42 (0.04) | 0.42 (0.04) + 0 (0.02) + 0 (0.02) |
| | SCFLD | 0.49 (0.03) | 0.50 (0.04) | 0.48 (0.04) + 0.02 (0.02) | 0.51 (0.04) | 0.46 (0.04) + 0.01 (0.02) + 0.03 (0.03) |
| | SCFLM | 0.46 (0.04) | 0.47 (0.04) | 0.45 (0.04) + 0.01 (0.02) | 0.56 (0.03) | 0.36 (0.04) + 0.01 (0.01) + 0.19 (0.04) |
| | BFS | 0.40 (0.04) | 0.41 (0.04) | 0.40 (0.04) + 0.01 (0.02) | 0.42 (0.04) | 0.39 (0.04) + 0.01 (0.02) + 0.02 (0.03) |

Table 2. Additive (VA), dominance (VD) and imprinting (VI) variance contribution for D2,

D3 and D4 cross (where h2=VA/Vp is the narrow sense heritability) in: i) additive, ii)

additive + dominance and iii) additive + dominance + imprinting.

| Cross | Trait | VA / VP | (VA + VD) / VP | VA / VP + VD / VP | (VA + VD + VI) / VP | VA / VP + VD / VP + VI / VP |
|---|---|---|---|---|---|---|
| D2 | ADG | 0.32 (0.09) | 0.63 (0.11) | 0.29 (0.09) + 0.34 (0.12) | 0.63 (0.11) | 0.29 (0.09) + 0.34 (0.12) + 0 (0.07) |
| | BFT | 0.71 (0.07) | 0.79 (0.08) | 0.70 (0.07) + 0.08 (0.07) | 0.79 (0.08) | 0.69 (0.08) + 0.08 (0.07) + 0.02 (0.07) |
| | MFR | 0.46 (0.09) | 0.50 (0.11) | 0.46 (0.09) + 0.04 (0.09) | 0.52 (0.11) | 0.44 (0.10) + 0.03 (0.09) + 0.05 (0.08) |
| | CRCL | 0.73 (0.07) | 0.73 (0.08) | 0.73 (0.07) + 0 (0.06) | 0.73 (0.08) | 0.73 (0.08) + 0 (0.06) + 0 (0.06) |
| D3 | ADG | 0.54 (0.09) | 0.60 (0.12) | 0.53 (0.09) + 0.07 (0.12) | 0.60 (0.12) | 0.53 (0.12) + 0.07 (0.12) + 0 (0.11) |
| | BFT | 0.40 (0.10) | 0.41 (0.14) | 0.40 (0.10) + 0.01 (0.12) | 0.44 (0.14) | 0.30 (0.12) + 0.01 (0.12) + 0.11 (0.13) |
| | MFR | 0.46 (0.09) | 0.53 (0.13) | 0.43 (0.10) + 0.10 (0.13) | 0.53 (0.13) | 0.41 (0.12) + 0.09 (0.13) + 0.02 (0.11) |
| | CRCL | 0.30 (0.09) | 0.40 (0.14) | 0.24 (0.09) + 0.16 (0.15) | 0.41 (0.15) | 0.21 (0.11) + 0.14 (0.15) + 0.05 (0.12) |
| D4 | ADG | 0.40 (0.08) | 0.46 (0.10) | 0.39 (0.09) + 0.07 (0.08) | 0.50 (0.10) | 0.37 (0.09) + 0.07 (0.08) + 0.05 (0.06) |
| | BFT | 0.49 (0.08) | 0.49 (0.09) | 0.49 (0.08) + 0 (0.06) | 0.51 (0.10) | 0.48 (0.09) + 0 (0.06) + 0.03 (0.05) |
| | MFR | 0.59 (0.08) | 0.59 (0.09) | 0.59 (0.08) + 0 (0.05) | 0.59 (0.09) | 0.58 (0.08) + 0 (0.05) + 0.01 (0.05) |
| | CRCL | 0.56 (0.08) | 0.56 (0.09) | 0.56 (0.08) + 0 (0.05) | 0.59 (0.09) | 0.53 (0.08) + 0 (0.05) + 0.06 (0.06) |

Genome-wide association studies

The majority of the genome-wide statistically significant SNPs for all linear mixed models used

were located on Sus scrofa chromosomes (SSC) 2, 7 and 17. Table 3 and Table 4 contain the

top SNPs from the regions above the genome-wide significance level for the non-additive

effects (dominance, imprinting, maternal and paternal). The SNPs exceeding the nominal

significance level (threshold of $p \le 5 \times 10^{-5}$) for all four populations and for all traits are

summarized in Table SM1. The most significant SNP for the dominance model was

H3GA0048042 on SSC17 for CRCL (–log10 p value = 14.94) and for the imprinting model

H3GA0054053 on SSC2 for MFR (–log10 p value = 26.45, Figure 2). The latter variant was also the

most significant SNP on SSC2 for MFR when using the paternal model (–log10 p value = 51.17)

and it often appeared at high levels of significance together with H3GA0005584 (SSC2:

4.37Mb) in the imprinting and paternal models for most of the traits in D1 (except CRCL).

Overall, the top maternal effect was found on SSC17 for CRCL in D1 for the variant

MARC0070553 (–log10 p value = 21.33).

Regional Manhattan plots for MFR in D1 (SSC2:0-14Mb) and for CRCL associations in D1

(SSC17: 0-30Mb) are depicted in Figure 3. A region around SSC7:30Mb had several significant

variants (either for the dominance or the maternal model) related to ADG, BFT and CRCL in D2 and

D4; as an example, this region is shown for CRCL in D4 in Figure 4. To investigate

whether H3GA0054053 accounts for the imprinting signal for MFR in D1, a conditional

GWAS was conducted using the variant as a fixed effect (Figure 5).
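The logic of a conditional analysis can be illustrated with ordinary least squares: a candidate SNP is tested after both the phenotype and the candidate covariate have been adjusted for the lead SNP. The sketch below omits the random terms of the mixed model and uses hypothetical covariates in which the phenotype depends on the lead SNP only; all names and values are assumptions for illustration.

```python
from typing import List


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


def residualize(v: List[float], z: List[float]) -> List[float]:
    """Residuals of v after regressing it on z (with an intercept)."""
    mv, mz = mean(v), mean(z)
    szz = sum((zi - mz) ** 2 for zi in z)
    slope = sum((zi - mz) * (vi - mv) for zi, vi in zip(z, v)) / szz
    return [(vi - mv) - slope * (zi - mz) for zi, vi in zip(z, v)]


def marginal_slope(y: List[float], x: List[float]) -> float:
    """Slope of y on x, ignoring the lead SNP."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sum((xi - mx) ** 2 for xi in x)


def conditional_slope(y: List[float], x: List[float], lead: List[float]) -> float:
    """Slope of y on x with the lead SNP included as a fixed covariate."""
    ry, rx = residualize(y, lead), residualize(x, lead)
    return sum(a * b for a, b in zip(rx, ry)) / sum(a * a for a in rx)


# Hypothetical imprinting covariates of a lead SNP and a linked candidate SNP.
lead = [1.0, 1.0, -1.0, -1.0, 0.0, 0.0, 1.0, -1.0]
candidate = [1.0, 0.0, -1.0, -1.0, 0.0, 1.0, 1.0, 0.0]
# Hypothetical phenotype that depends on the lead SNP only.
phenotype = [10.0 + 3.0 * value for value in lead]

print("marginal:", round(marginal_slope(phenotype, candidate), 3))
print("conditional:", round(conditional_slope(phenotype, candidate, lead), 3))
```

The candidate shows a clear marginal effect because it is correlated with the lead SNP, but its effect vanishes once the lead SNP is fitted, which is the pattern expected when a single variant accounts for the regional signal.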

Table 3. Top SNPs for traits with associations over the genome wide significance level in

D1. Type GWAS: I = imprinting, D = dominance, M = maternal and P = paternal.

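| Cross | Trait | SNP | SSC | Position | –log10 p value | Type GWAS |
|---|---|---|---|---|---|---|
| D1 | ADG | ALGA0123907 | 2 | 2556939 | 9.31 | P |
| | BFT | H3GA0005584 | 2 | 4378975 | 11.19 | I |
| | | H3GA0005584 | 2 | 4378975 | 19.75 | P |
| | MFR | ALGA0105438 | 2 | 631324 | 6.76 | D |
| | | H3GA0054053 | 2 | 1100354 | 26.45 | I |
| | | H3GA0054053 | 2 | 1100354 | 51.17 | P |
| | CRCL | H3GA0048042 | 17 | 19474175 | 14.94 | D |
| | | H3GA0047609 | 17 | 1616893 | 8.38 | I |
| | | MARC0070553 | 17 | 15827832 | 21.33 | M |
| | | MARC0027977 | 17 | 17667594 | 10.12 | P |
| | SCF | H3GA0005584 | 2 | 4378975 | 8.38 | I |
| | | H3GA0005584 | 2 | 4378975 | 14.60 | P |
| | BFM | H3GA0005584 | 2 | 4378975 | 6.26 | I |
| | | H3GA0005584 | 2 | 4378975 | 11.90 | P |
| | BFTR | H3GA0005584 | 2 | 4378975 | 9.89 | I |
| | | H3GA0054053 | 2 | 1100354 | 21.17 | P |
| | SCFLD | MARC0045154 | 2 | 4671904 | 8.24 | D |
| | | H3GA0054053 | 2 | 1100354 | 16.60 | I |
| | | H3GA0054053 | 2 | 1100354 | 36.47 | P |
| | SCFLM | MARC0045154 | 2 | 4671904 | 6.83 | D |
| | | H3GA0054053 | 2 | 1100354 | 20.41 | I |
| | | H3GA0005584 | 2 | 4378975 | 43.97 | P |
| | BFS | ALGA0105438 | 2 | 631324 | 6.41 | D |
| | | H3GA0054053 | 2 | 1100354 | 16.97 | I |
| | | H3GA0054053 | 2 | 1100354 | 33.72 | P |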

Table 4. Top SNPs for traits with associations over the genome wide significance level in

D2, D3 and D4. Type GWAS: I = imprinting, D = dominance, M = maternal and P = paternal.

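| Cross | Trait | SNP | SSC | Position | –log10 p value | Type GWAS |
|---|---|---|---|---|---|---|
| D2 | BFT | H3GA0021185 | 7 | 36795710 | 6.28 | D |
| | | ALGA0104042 | 2 | 2036226 | 6.72 | P |
| | MFR | ALGA0104042 | 2 | 2036226 | 12.28 | I |
| | | MARC0008125 | 2 | 2984595 | 12.67 | P |
| | CRCL | MARC0014933 | 7 | 33216744 | 6.75 | M |
| D3 | BFT | ASGA0085784 | 2 | 236179 | 8.18 | I |
| | | H3GA0054053 | 2 | 1100354 | 9.53 | P |
| | MFR | ASGA0008415 | 2 | 3895569 | 7.05 | I |
| | | H3GA0054053 | 2 | 1100354 | 8.59 | P |
| D4 | ADG | INRA0024524 | 7 | 26069284 | 5.88 | M |
| | BFT | INRA0024835 | 7 | 31763707 | 6.39 | D |
| | | ASGA0032187 | 7 | 26662865 | 8.17 | M |
| | CRCL | ALGA0040423 | 7 | 32827483 | 6.49 | D |
| | | M1GA0009879 | 7 | 29268803 | 6.96 | M |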

Figure 2. Box plots showing the distribution of the MFR residuals at the top imprinting

(–log10 p value = 26.45) and paternal (–log10 p value = 51.17) variant H3GA0054053 in D1

for: additive, imprinting and parental models (i.e. maternal and paternal).

Figure 3. Regional Manhattan plots for GWAS in D1. Left: Associations for all five models

for MFR on SSC2:0-14Mb (blue vertical line indicates the location of the IGF2 gene). Right:

Associations for all five models for CRCL on SSC17:0-30Mb. Horizontal dashed line is the

genome-wide significance threshold (Bonferroni corrected) set at –log10 (0.05/NSNPs).

Figure 4. Regional Manhattan plots for GWAS in D4. Associations for all five models for

CRCL on SSC7:0-60Mb. Horizontal dashed line is the genome-wide significance threshold

(Bonferroni corrected) set at –log10 (0.05/NSNPs).

Figure 5. Regional Manhattan plots for conditional GWAS in D1. Associations for the

imprinting model and the imprinting model with the top SNP (H3GA0054053) as fixed effect for MFR on

SSC2:0-14Mb (blue vertical line indicates the location of the IGF2 gene). Horizontal dashed line

is the genome-wide significance threshold (Bonferroni corrected) set at –log10 (0.05/NSNPs).

Discussion

In this study, we utilize SNP array based genomic information coupled with a set of phenotypic

data from four pig F2 resource populations for a better understanding of non-additive effects.

We investigate dominance and imprinting by means of variance components estimation and

usage of various GWAS models (i.e. dominance, imprinting, maternal and paternal) for growth,

carcass and fat related traits.

In the variance partitioning process, the contribution of the environmental term decreased with

the inclusion of the dominance and imprinting effects, with few exceptions (for CRCL and

BFTR in D1, CRCL in D2 and MFR in D4) suggesting that the term captured part of these non-

additive effects prior to them being accounted for in the model. Furthermore, a reduction of the

additive variance was also observed (e.g. for MFR and SCFLM in D1, BFT and CRCL in D4)

proportional to the contribution of the phenotypic variance due to imprinting and dominance.

This reduction demonstrates the confounding nature among additivity and non-additivity (Hill

et al., 2008) that is emphasized here presumably as a consequence of the F2 population structure,

its linkage disequilibrium patterns and allele frequencies.

Looking at the variance partitions in detail, a higher percentage of dominance variance was

observed in D2, D3 and D4 (Table 2), presumably as a consequence of the increased heterozygosity levels

obtained by crossing breeds from different genetic backgrounds. Imprinting variance was more

pronounced in the D1 population (Table 1), accounting for a maximum of 19% of the

phenotypic variation, encountered for SCFLM (15% for MFR). The non-additive variances derived here were

compared to previous estimates in pigs available for backfat thickness and daily gain based on

SNP array information (Lopes et al., 2015; Guo et al., 2016). In the two studies mentioned, the

imprinting variance ranged from 1% to 2% and dominance from 4% to 16% for both traits. Our

ranges are broader with higher maximums (i.e. 11% imprinting share for BFT in D3 and 34%

dominance share for ADG in D2). That said, it is worth noting that the

underlying populations used to obtain these estimates are different (F2 versus breeding or

commercial populations), which likely contributes to the observed differences.

The SNP effects were coded for non-additivity, similar to Mantey et al. (2005) and Mozaffari

et al. (2019), and included into GWAS models to be tested for association. Some of the non-

additive effects detected for D2 and D4 are most probably an artefact of the F2 population

structure stemming from few founder individuals (Sandor and Georges, 2008; Lawson et al.,

2013). For example, the significant maternal effect for CRCL in D4 (Figure 4) is definitely

spurious and arises because the top SNP (M1GA0009879, –log10 p value = 6.96) is

not fixed in the founders. Therefore, these statistically detected effects have to

be interpreted with caution because they may have no relevance for the presumed

underlying biological mechanism. However, even in such population designs, imprinting and

parent-of-origin effects can be detected, as is the case for the D2 and D3 populations. More

concretely, for BFT and MFR the significant variants for the imprinting and paternal models

(Table 4) are likely to have biological meaning as the SNPs are located in the vicinity of the

IGF2 gene on SSC2. This gene (insulin-like growth factor 2) is the best documented case

of an imprinted gene in pigs (Jeon et al., 1999; Nezer et al., 1999), for which the maternal allele

is imprinted, so that only the paternal allele is expressed. IGF2 influences muscle mass

and fat deposition, and significant associations (for imprinting and parent-of-origin) near IGF2

were detected for the following traits: in D1 for all traits (except CRCL) and in D2 and D3 for

BFT and MFR (Table 3 and Table 4). A closer investigation was carried out for the IGF2 region

in D1. The trait with the highest imprinting and paternal association was considered,

specifically MFR with its lead variant (H3GA0054053: –log10 p value (I) = 26.45 and –log10 p

value (P) = 51.17). The significant associations stretch over a large genomic region,

SSC2:0-14Mb (Figure 3). To examine causality for the imprinting effects on

SSC2, a conditional imprinting GWAS was conducted, in which H3GA0054053 (imprinting

coding) was included as a fixed effect (Figure 5). Only one variant exceeded the significance

threshold, suggesting that the imprinting signal can be attributed to the lead SNP.

Ranked by significance level, the top SNPs on SSC2 followed

the order paternal >> imprinting >> additive, emphasizing the paternal expression of

IGF2. While in D1 the signals in the IGF2 region were also detected by the additive model, albeit

with a lower intensity, in D2 and D3 there were no significant additive variants in the respective

area. Thus, to some extent, the additive parametrization is efficient in capturing other types of

gene actions. This is in line with the theory (Hill et al., 2008; Huang and Mackay, 2016) that

has at its core additivity according to the fundamentals of quantitative genetics (Fisher, 1918).
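Why an additive parametrization can partly capture a parent-of-origin effect can be seen from the correlations among the covariate codings. The sketch below assumes an idealized F2 locus at which the four phased classes aa, aA, Aa and AA are equally frequent, that the first letter denotes the paternally inherited allele, and that the paternal covariate is a 0/1 indicator; these are assumptions for illustration, not properties estimated from the data.

```python
import math
from typing import List

# Covariates for the phased classes aa, aA, Aa, AA (first letter paternal).
additive = [1.0, 0.0, 0.0, -1.0]
dominance = [0.0, 1.0, 1.0, 0.0]
imprinting = [0.0, 1.0, -1.0, 0.0]
paternal = [0.0, 0.0, 1.0, 1.0]  # assumed indicator of a paternal A allele


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation with every class given equal weight."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


correlations = {
    "additive vs paternal": pearson(additive, paternal),
    "imprinting vs paternal": pearson(imprinting, paternal),
    "dominance vs paternal": pearson(dominance, paternal),
    "additive vs imprinting": pearson(additive, imprinting),
}
for name, value in correlations.items():
    print(f"{name}: {value:+.3f}")
```

Under these assumptions the additive and imprinting codings are uncorrelated with each other, yet each correlates with the paternal coding at an absolute value of about 0.71, which is consistent with a paternally expressed locus producing a weaker signal in the additive and imprinting models than in the paternal model.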

On SSC17, for CRCL in D1, although additive effects were the most significant, underlying

non-additive effects can be noticed (Figure 3). For example, the top additive SNP

(MARC0070553, –log10 p value = 30.12) was also highly significant in the maternal

model (–log10 p value = 21.33), although at a lower level. This SNP is near the gene BMP2 (bone

morphogenetic protein 2), which has already been proposed as a candidate gene for carcass length

(Blaj et al., 2018); it would therefore be of interest to investigate whether the maternal origin

of this variant has a true physiological meaning.

Conclusions

The results of this study showed that both dominance and imprinting effects contributed to the

genetic variation encountered in the four investigated F2 populations for a set of growth, carcass

and fat traits. Caution must be taken when interpreting non-additive effects in cross designs, as the

findings can be purely statistical with no physiological basis. Highly significant variants

were found near the IGF2 gene when using the imprinting and the paternal association models,

and most were also uncovered by the additive model, but at lower significance levels. This draws

attention to the confounding or overlapping nature of additive and non-additive effects. These findings

aim to contribute to a better understanding of non-additive effects in the context of F2 resource

populations.

References

Aliloo, H., Pryce, J. E., González-Recio, O., Cocks, B. G., Goddard, M. E., & Hayes, B. J.

(2017). Including nonadditive genetic effects in mating programs to maximize dairy

farm profitability. Journal of dairy science 100 (2), pp. 1203–1222. DOI:

10.3168/jds.2016-11261.

Blaj, I., Tetens, J., Preuß, S., Bennewitz, J., & Thaller, G. (2018). Genome-wide association

studies and meta-analysis uncovers new candidate genes for growth and carcass traits

in pigs. PLoS ONE, 13(10), e0205576.

Blunk, I., Mayer, M., Hamann, H., & Reinsch, N. (2019). Scanning the genomes of parents

for imprinted loci acting in their un-genotyped progeny. Scientific reports 9 (1), p. 654.

DOI: 10.1038/s41598-018-36939-3.

Borchers, N., Reinsch, N., & Kalm, E. (2000). Familial cases of coat colour‐change in a Piétrain

cross. Journal of Animal Breeding and Genetics, 117(4), 285-287.

Boysen, T. J., Tetens, J., & Thaller, G. (2010). Detection of a quantitative trait locus for ham

weight with polar overdominance near the ortholog of the callipyge locus in an

experimental pig F2 population. Journal of animal science 88 (10), pp. 3167–3172.

DOI: 10.2527/jas.2009-2565.

Boysen, T. J., Tetens, J., & Thaller, G. (2011). Evidence for additional functional genetic

variation within the porcine IGF2 gene affecting body composition traits in an

experimental Piétrain × Large White/Landrace cross. Animal: an international journal

of animal bioscience 5 (5), pp. 672–677. DOI: 10.1017/S1751731110002466.

Cockett, N. E., Jackson, S. P., Shay, T. L., Farnir, F., Berghmans, S., Snowder, G. D. et al.

(1996). Polar Overdominance at the Ovine callipyge Locus. Science 273 (5272), pp.

236–238. DOI: 10.1126/science.273.5272.236.

Covarrubias-Pazaran, G. (2016). Genome-Assisted Prediction of Quantitative Traits using the

R Package sommer. PLoS ONE 11 (6), e0156744. DOI:

10.1371/journal.pone.0156744.

Falconer, D. S. & Mackay, T. F. C. (1996). Introduction to Quantitative Genetics. 4th ed. Longmans

Green, Harlow, Essex, UK.

Fisher R.A. (1918). The Correlation between Relatives on the Supposition of Mendelian

Inheritance. Trans R Soc Edin 53, pp. 399–433.

Fujii, J., Otsu, K., Zorzato, F., Leon, S. de, Khanna, V., Weiler, J. et al. (1991). Identification

of a mutation in porcine ryanodine receptor associated with malignant hyperthermia.

Science 253 (5018), pp. 448–451. DOI: 10.1126/science.1862346.

Geldermann, H., Müller, E., Beeckmann, P., Knorr, C., Yue, G., & Moser, G. (1996). Mapping

of quantitative‐trait loci by means of marker genes in F2 generations of Wild boar,

Pietrain and Meishan pigs. Journal of Animal Breeding and Genetics, 113(1‐6), 381-

387.

Guo, X., Christensen, O. F., Ostersen, T., Wang, Y., Lund, M. S., & Su, G. (2016). Genomic

prediction using models with dominance and imprinting effects for backfat thickness

and average daily gain in Danish Duroc pigs. Genetics, selection, evolution: GSE 48

(1), p. 67. DOI: 10.1186/s12711-016-0245-6.

Hill, W. G., Goddard, M. E. & Visscher, P. M. (2008). Data and theory point to mainly additive

genetic variance for complex traits. PLoS genetics 4 (2), e1000008. DOI:

10.1371/journal.pgen.1000008.

Huang, W. & Mackay, T. F. C. (2016). The Genetic Architecture of Quantitative Traits

cannot be inferred from Variance Component analysis. PLoS genetics 12 (11),

e1006421. DOI: 10.1371/journal.pgen.1006421.

Jeon, J. T., Carlborg Ö., Törnsten A., Giuffra E., Amarger V., Chardon P., et al. (1999). A

paternally expressed QTL affecting skeletal and cardiac muscle mass in pigs maps to

the IGF2 locus. Nature genetics 21 (2), p. 157.

Jiang, J., Shen, B., O'Connell, J. R., VanRaden, P. M., Cole, J. B., & Ma, L. (2017). Dissection

of additive, dominance, and imprinting effects for production and reproduction traits in

Holstein cattle. BMC genomics 18 (1), p. 425. DOI: 10.1186/s12864-017-3821-4.

Laurin, C., Cuellar-Partida, G., Hemani, G., Smith, G. D., Yang, J., & Evans, D. M. (2018).

Partitioning Phenotypic Variance Due to Parent-of-Origin Effects Using Genomic

Relatedness Matrices. Behavior genetics 48 (1), pp. 67–79. DOI: 10.1007/s10519-017-

9880-0.

Lawson, H. A., Cheverud, J. M. & Wolf, J. B. (2013). Genomic imprinting and parent-of-origin

effects on complex traits. Nature reviews. Genetics 14 (9), pp. 609–617. DOI:

10.1038/nrg3543.

Lopes, M. S., Bastiaansen, J. W. M., Janss, L., Knol, E. F., & Bovenhuis, H. (2015). Estimation

of Additive, Dominance, and Imprinting Genetic Variance Using Genomic Data. G3

(Bethesda, Md.) 5 (12), pp. 2629–2637. DOI:10.1534/g3.115.019513.

Mantey, C., Brockmann, G. A., Kalm, E., & Reinsch, N. (2005). Mapping and exclusion

mapping of genomic imprinting effects in mouse F2 families. The Journal of heredity

96 (4), pp. 329–338. DOI: 10.1093/jhered/esi044.

Meuwissen, T. H. E., Hayes, B. J. & Goddard, M. E. (2001). Prediction of Total Genetic Value

Using Genome-Wide Dense Marker Maps. Genetics 157 (4), pp. 1819–1829.

Mozaffari, S. V., DeCara, J. M., Shah, S. J., Sidore, C., Fiorillo, E., Cucca, F. et al. (2019).

Parent-of-origin effects on quantitative phenotypes in a large Hutterite pedigree.

Communications biology 2, p. 28. DOI: 10.1038/s42003-018-0267-4.

Nezer C., Moreau L., Brouwers B., Coppieters W., Detilleux J. et al. (1999). An imprinted QTL

with major effect on muscle mass and fat deposition maps to IGF2 locus in pigs.

Nature genetics 21 (2), p. 155.

Nishio, M. & Satoh, M. (2015). Genomic best linear unbiased prediction method

including imprinting effects for genomic evaluation. Genetics, selection, evolution:

GSE 47, p. 32. DOI: 10.1186/s12711-015-0091-y.

O'Connell, J., Gurdasani, D., Delaneau, O., Pirastu, N., Ulivi, S., Cocca, M. et al. (2014). A

general approach for haplotype phasing across the full spectrum of relatedness. PLoS

genetics 10 (4), e1004234. DOI: 10.1371/journal.pgen.1004234.

O'Doherty, A. M., MacHugh, D. E., Spillane, C., & Magee, D. A. (2015). Genomic

imprinting effects on complex traits in domesticated animal species. Frontiers in

genetics 6, p. 156. DOI: 10.3389/fgene.2015.00156.

Prickett, A. R., & Oakey, R. J. (2012). A survey of tissue-specific genomic imprinting in

mammals. Molecular genetics and genomics: MGG 287 (8), pp. 621–630. DOI:

10.1007/s00438-012-0708-6.

Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A. R., Bender, D. et al. (2007).

PLINK: a tool set for whole-genome association and population-based linkage analyses.

American journal of human genetics 81 (3), pp. 559–575. DOI: 10.1086/519795.

R Core Team (2018). R: A language and environment for statistical computing. R Foundation

for Statistical Computing, Vienna, Austria.

Ramos, A. M., Crooijmans, R. P. M. A., Affara, N. A., Amaral, A. J., Archibald, A. L., Beever,

J. E. et al. (2009). Design of a high density SNP genotyping assay in the pig using SNPs

identified and characterized by next generation sequencing technology. PloS one 4 (8),

e6524. DOI: 10.1371/journal.pone.0006524.

Rückert, C. & Bennewitz, J. (2010). Joint QTL analysis of three connected F2-crosses in pigs.

Genetics, selection, evolution: GSE 42, p. 40. DOI: 10.1186/1297-9686-42-40.

Sandor, C. & Georges, M. (2008). On the detection of imprinted quantitative trait loci in line

crosses: effect of linkage disequilibrium. Genetics 180 (2), pp. 1167–1175. DOI:

10.1534/genetics.108.092551.

Varona, L., Legarra, A., Toro, M. A. & Vitezica, Z. G. (2018). Non-additive Effects in Genomic

Selection. Frontiers in genetics 9, p. 78. DOI: 10.3389/fgene.2018.00078.

Wellmann, R. & Bennewitz, J. (2012). Bayesian models with dominance effects for

genomic evaluation of quantitative traits. Genetics research 94 (1), pp. 21–37. DOI:

10.1017/S0016672312000018.

Wickham, H. (2009). Ggplot2. Elegant graphics for data analysis.

Chapter 3

GWAS for meat and carcass traits using imputed sequence

level genotypes in pooled F2-designs in pigs

Clemens Falker-Gieske¹,*, Iulia Blaj²,*, Siegfried Preuß³, Jörn Bennewitz³, Georg Thaller²

and Jens Tetens¹,⁴

¹ Department of Animal Sciences, Georg-August-University, Göttingen, Germany

² Institute of Animal Breeding and Husbandry, Kiel University, Kiel, Germany

³ Institute of Animal Husbandry and Breeding, University of Hohenheim, Stuttgart, Germany

⁴ Center for Integrated Breeding Research, Georg-August-University, Göttingen, Germany

*contributed equally

Published in G3 Genes|Genomes|Genetics

Abstract

In order to gain insight into the genetic architecture of economically important traits in pigs and

to derive suitable genetic markers to improve these traits in breeding programs, many studies

have been conducted to map quantitative trait loci. Shortcomings of these studies were low

mapping resolution, large confidence intervals for quantitative trait loci-positions and large

linkage disequilibrium blocks. Here, we overcome these shortcomings by pooling four large F2

designs to produce smaller linkage disequilibrium blocks and by resequencing the founder

generation at high coverage and the F1 generation at low coverage for subsequent imputation

of the F2 generation to whole genome sequencing marker density. This led to the discovery of

more than 32 million variants, 8 million of which have not been previously reported. The

pooling of the four F2 designs enabled us to perform a joint genome-wide association study,

which led to the identification of numerous significantly associated variant clusters on

chromosomes 1, 2, 4, 7, 17 and 18 for the growth and carcass traits average daily gain, back fat

thickness, meat to fat ratio, and carcass length. We not only confirmed previously reported

quantitative trait loci but also discovered new ones. As a result, several new candidate genes are

discussed, among them BMP2 (bone morphogenetic protein 2), which we recently discovered

in a related study. Variant effect prediction revealed that 15 high impact variants for the traits

back fat thickness, meat to fat ratio and carcass length were among the statistically significantly

associated variants.

Introduction

Mapping experiments in livestock generally serve two purposes: The first is to understand the

genetic architecture of quantitative traits, and to derive and prove new hypotheses of trait

expression. The second is the identification of genetic markers that may be useful for livestock

breeding. Many quantitative trait loci (QTL) mapping experiments have been carried out

over the last decades (see the review by Rothschild et al., 2007), mainly in experimental

F2 crosses established from two outbred founder pig breeds. In early studies, genotyping was

mainly done with microsatellite markers and mapping was achieved through linkage

analysis (see the overview in Knott, 2005). These designs were set up to enable QTL detection

with high power, but they suffered from a low mapping resolution and large confidence

intervals for QTL positions. This was partly due to the limited number of meioses

exploited in these designs in conjunction with typically small numbers of 300 to 500 F2

individuals. Furthermore, this approach assumes the divergent fixation of the QTL alleles in the

founder breeds, and highly different gene frequencies and variation within these breeds were

not considered (Nagamine et al., 2003). The breed Piétrain, for instance, has been selected for

growth and meat yield for many generations and still exhibits a large genetic variation for these

traits (Wellmann et al., 2013). More recent QTL-mapping experiments utilized genome-wide

association studies (GWAS), which in contrast to linkage analyses, exploit historical meiosis

and rely on linkage disequilibrium (LD) requiring high marker densities. The precision of

GWAS is then a function of LD block lengths and the number of individuals analyzed, which

in turn limits the usefulness of its application in F2 designs (Hayes and Goddard, 2001).

However, enormous efforts have been made in the establishment of these mapping populations,

usually including extensive phenotyping far beyond what would be available in field

populations. It would thus be desirable to revisit these resources using current genotyping and

sequencing technologies, which would require an increase in the number of individuals and a

decrease in the LD block lengths. A recent simulation study showed that this is possible by

pooling F2 designs, particularly when founder breeds are closely related and QTLs are

segregating in one founder breed (Schmid et al., 2018). This approach has already been

successfully applied based on medium density SNP chip data (Blaj et al., 2018; Stratz et al.,

2018).

With the aim to overcome the aforementioned limits in mapping resolution and to fully exploit

the potential of the resource populations, we pooled four well-characterized F2 designs (Table

1), three of them having the founder breed Piétrain in common. Twenty-four founder animals

were genotyped by high coverage whole genome sequencing (WGS) and 91 of the F1 animals

were sequenced at a low coverage for subsequent imputation to a high coverage WGS level. A

total of 2,657 F2 animals that were genotyped with the 62K Illumina PorcineSNP60 BeadChip

(Ramos et al., 2009) were imputed to WGS levels with pedigree information and analyzed in a

joint GWAS (see workflow in Figure 1). As a proof of concept, four relevant production traits

were analyzed: average daily gain (ADG), back fat thickness (BFT), meat to fat ratio (MFR),

and carcass length (CRCL).

Figure 1. Genotyping workflow. 24 founder animals were sequenced with high coverage,

variants were called with GATK 4.0 and phased with Beagle 5.0. 91 F1 animals were sequenced

with low coverage and variants were called with GATK 3.8 and BCFtools mpileup. The F1

dataset was imputed using Beagle 4.0 and pedigree information with phased founders as a

reference panel for haplotype structure. The imputed F1 was then merged with the F0 variant

call data set and phased with Beagle 5.0. Finally, the 2657 chip genotyped F2 individuals were

imputed to WGS levels with Beagle 4.0 and pedigree information with the merged and phased

founder/F1-imputed dataset as the reference panel.

Material and methods

Description of resource populations and phenotypes

Four well characterized experimental populations were pooled for this study. Detailed

descriptions of the resource populations are given by Borchers et al. (2000) and Rückert and

Bennewitz (2010); hence they are described only briefly here.

The largest population was obtained from five purebred Piétrain boars and one Large white and

six crossbred sows Landrace x Large white. The other three populations stemmed from a

Meishan boar or Wild boar crossed with either Piétrain or Meishan females. The Wild boar and

three Piétrain females were common founders in three of the crosses. The F2 generation was

the result of repeatedly crossing F1 boars with F1 sows in order to obtain large full-sib families.

Blood samples from a total of 2772 animals from the crosses (24 F0-generation pigs, 91 F1-generation pigs,

2657 F2-generation pigs) were used to extract genomic DNA for genotyping

purposes (Table 1). The F0 and F1 animals selected for sequencing were generally chosen

according to the number of F2 individuals, i.e. we prioritized individuals from which large

families were derived. Four phenotypic traits were considered: ADG, BFT, MFR, and CRCL.

The phenotypes were pre-corrected for systematic effects (e.g. stable, slaughter month) and for

the effect of the RYR1 gene (Fujii et al., 1991) using a general linear model. Trait definition,

descriptive statistics and information about the pre-adjustment and the fixed effects used per

cross can be found in Blaj et al. (2018).

Table 1. Per cross information of the sequenced individuals (F0 and F1) and SNP array

genotyped individuals (F2). F0 and F1 animals served as the reference panel for the imputation

of the F2 generation to sequence level for subsequent genome wide association analyses.

| Cross/Generation | F0* | F1 | F2 |
|---|---|---|---|
| Piétrain x (Large white x Landrace)/Large white | 13 | 55 | 1750 |
| Meishan x Piétrain | 8 | 19 | 304 |
| Wild boar x Piétrain | 6 | 17 | 291 |
| Wild boar x Meishan | 1 | 0 | 312 |
| Total | 24* | 91 | 2657 |

*Four founders are common among crosses

Sequencing

A total of 24 founder animals were sequenced at an average coverage of 19x

at the sequencing facility of the University of Hohenheim. Out of 17 F1 families, 91 animals were

sequenced with an average 0.9x coverage. All paired-end sequencing (read length 2 x 100 bp)

was done on an Illumina HiScan SQ using TruSeq SBS v3 Kits. For the library construction,

the DNA samples were fragmented on a Covaris S220 ultrasonicator. Parameters were adjusted

to yield 350 bp inserts. Fragment length was measured with High Sensitivity DNA Chips on an

Agilent Bioanalyzer. Sequencing adapters and indexes were ligated using Illumina’s TruSeq

DNA PCR-Free Library Prep Kits. Quantification of libraries was done by qPCR using KAPA

Library Quant Kits. Flow cells were prepared using an Illumina cBot and TruSeq PE v3 Cluster

kits. Raw sequencing data were demultiplexed and converted into FASTQ files using Illumina’s

CASAVA software.

Mapping and variant detection

Mapping and variant calling of the F0 generation was performed according to the GATK best

practice pipeline using GATK v. 4.0 (McKenna et al., 2010) and genome assembly Sus scrofa

11.1 (GCA_000003025.6 provided by Swine Genome Sequencing Consortium on NCBI). Base

quality score recalibration was performed with dbSNP build 150 as the knownSites dataset.

Truth datasets used for Variant Quality Score Recalibration (VQSR) were as follows. SNPs:

Illumina Infinium PorcineSNP60 v2 BeadChip and Affymetrix Axiom PorcineHD. INDELs:

High confidence fraction (filter settings: QD 15.0, FS 200.0, ReadPosRankSum 20.0) of the

PigVar database. Training dataset for SNP VQSR was also a high confidence fraction of the

PigVar database (filter settings: QD 21.5, FS 60.0, MQ 40.0, MQRankSum 12.5,

ReadPosRankSum 8.0, SOR 3.0) (Zhou et al., 2017). A truth sensitivity of 99.0 was chosen for

SNPs and INDELs. The known dataset for SNP and INDEL VQSR was dbSNP build 150. Since

SNPs were filtered with two truth datasets, a Ti/Tv free recalibration according to the GATK

best practice guidelines was applied to the data. Low coverage sequencing reads of F1 animals

were processed according to the GATK best practice guidelines with the following deviations.

SNP calling was performed both with GATK HaplotypeCaller v. 3.8 in joint mode (settings

minPruning 1 and minDanglingBranchLength 1) and with BCFtools mpileup v. 1.9 (Li et al.,

2009). INDELs in the F1 variant call dataset were disregarded due to low sequencing

depth. An intersection variant call set between HaplotypeCaller, mpileup and the founder SNPs

was created and stringently filtered with the following settings: QD 30.0, FS 60.0, MQ 40.0,

QUAL 300.0.

Haplotype construction and imputation

To make use of the most recent phasing algorithms, Beagle 5.0 was used for all phasing

operations (Browning et al., 2018). Beagle 4.0 was applied for genotype imputation since it is

the latest version that supports the usage of pedigree information (Browning and Browning,

2007). Haplotype phasing of the F0 generation variant call set was done using Beagle 5.0 and

subsequent imputation with pedigree information of the F1 low coverage SNPs was achieved

with Beagle 4.0. F0 and imputed F1 variants were merged with GATK CombineVariants and

phased with Beagle 5.0. F2 generation 60k SNP chip data was imputed with Beagle 4.0 and

pedigree information with merged and phased F0 and F1 WGS level variants as the reference

panel. Imputation accuracy was determined by the construction of 24 F0 reference-panels

with one animal left out. Genotype data acquired with the 60k SNP chip from each F0 individual

was imputed with a reference-panel where the respective individual was missing utilizing

Beagle 4.0. The 24 individual datasets were merged and together with the F0 reference dataset

converted to additive coding with Plink 1.9 (Chang et al., 2015). Correlation (coefficient of

determination, R²) for each variant on QTL harboring chromosomes was calculated with an in

house R script.
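The per-variant calculation described above can be sketched as follows. This is a minimal Python illustration, not the authors' in-house R script; the function name `variant_r2` and the list-based input layout are assumptions made for this example. It takes the additive (0/1/2) codes of one variant across the founders, once as observed in the reference dataset and once as imputed in the leave-one-out runs, and returns the squared Pearson correlation, or NaN where no correlation can be computed (for example, when the variant is monomorphic in one of the two datasets).

```python
from math import nan


def variant_r2(observed, imputed):
    """Squared Pearson correlation between observed and imputed allele counts.

    Both arguments hold one additive genotype code (0, 1 or 2) per animal,
    in the same animal order. Returns NaN if either vector has no variance.
    """
    if len(observed) != len(imputed) or len(observed) < 2:
        raise ValueError("need two equally long vectors with at least 2 animals")
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_imp = sum(imputed) / n
    s_xy = sum((o - mean_obs) * (i - mean_imp) for o, i in zip(observed, imputed))
    s_xx = sum((o - mean_obs) ** 2 for o in observed)
    s_yy = sum((i - mean_imp) ** 2 for i in imputed)
    if s_xx == 0 or s_yy == 0:
        return nan  # monomorphic in one dataset: correlation undefined
    return (s_xy * s_xy) / (s_xx * s_yy)
```

Averaging the non-NaN values per chromosome would give chromosome-level summaries of the kind reported in the Results.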

Genome wide association studies and cluster assignment

Single-trait association analyses were performed with GCTA v. 1.92.4 beta 3 on the F2

population only (Yang et al., 2011). In order to perform a “leave one chromosome out” (LOCO)

analysis, multiple genomic relationship matrices (GRMs) were created from the F2 60k SNP

chip data by excluding each chromosome once with a minor allele frequency (MAF) cutoff of

1 %. Mixed linear model association analyses (MLMAs) were performed with imputed F2

variants for each chromosome separately using the GRM where the respective chromosome

was left out and a MAF cutoff of 1 %. To account for the pooled population structure, covariates

representing the different crosses (4 classes) were included in the MLMA. For further

downstream analysis, the significance threshold was established by applying Bonferroni correction

(i.e. 0.05/number of independent tests). Manhattan plots were created with R. Clusters

incorporating potential genomic regions of interest were defined using the Manhattan Harvester

(MH) tool (Haller et al., 2019). MH provides quality assignment for each peak via a general

quality score (GQS) which can be used as the main parameter for peak assessment. The GQS

is generated based on a trained mixed-effects proportional odds model using 16 various

parameters (e.g. maximal slope, height to width ratio) and human peak identification data. For

this study, the variants with a p-value below 1.0 × 10⁻⁷ (option -inlimit) were included, and only

the clusters with a GQS > 3.5 (1 is min and 5 is max) were considered further. Conditional

association analyses were performed by including single highly associated variants as fixed

effects in a LOCO analysis.
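The two thresholds used in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the number of independent tests is not given in the text, so `n_tests` is left as a parameter, and the cluster records with a `"gqs"` key are a hypothetical layout, not the output format of Manhattan Harvester.

```python
def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test p-value threshold: alpha divided by the number of independent tests."""
    if n_tests < 1:
        raise ValueError("n_tests must be a positive integer")
    return alpha / n_tests


def retained_clusters(clusters, min_gqs=3.5):
    """Keep clusters whose general quality score (scale 1 to 5) exceeds min_gqs.

    `clusters` is a list of dicts with a "gqs" entry (hypothetical layout).
    """
    return [cluster for cluster in clusters if cluster["gqs"] > min_gqs]
```

For instance, with one million independent tests `bonferroni_threshold(1_000_000)` gives a threshold of about 5 × 10⁻⁸; a cluster with a GQS of exactly 3.5 is not retained because the comparison is strict.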

Variant effect prediction and gene enrichment analysis

To predict variant effects the Ensembl Variant Effect Predictor (VEP) release 94 was utilized,

which is part of the Ensembl advanced programming interface (API) (McLaren et al., 2016).

The vep command using the clusters’ statistically significant variants was executed with the

following settings: --merged --force_overwrite --variant_class --symbol --nearest gene. To

provide further functional interpretation, the Database for Annotation, Visualization and

Integrated Discovery (DAVID) (Huang et al., 2009) was used for a systematic and integrative

analysis. The gene list from the VEP output was the input for DAVID and Sus scrofa genes

were considered as the background. Gene Ontology (GO) terms (i.e. cellular component,

molecular function, and biological process) from the functional annotation chart report which

were significantly overrepresented with an EASE Score (i.e. a modified Fisher Exact P-Value)

below 0.05 and with a gene count higher or equal to 5 were retained.
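The retention rule for GO terms stated above amounts to a simple filter. The sketch below assumes that the DAVID functional annotation chart has already been read into a list of dicts; the keys `"ease"` and `"count"` are hypothetical names chosen for this example and do not reflect DAVID's column headers.

```python
def retained_go_terms(chart_rows, max_ease=0.05, min_genes=5):
    """Keep GO terms with an EASE score below max_ease and at least min_genes genes."""
    return [
        row for row in chart_rows
        if row["ease"] < max_ease and row["count"] >= min_genes
    ]
```

A term with an EASE score of 0.01 but only four genes would therefore be dropped, as would a term with ten genes and an EASE score of 0.05 or more.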

Statement on data and reagent availability

Sequencing data, which were used to conduct this study will be made publicly available upon

publication of the article in the NCBI Sequence Read Archive (SRA). Supplementary Tables

have been uploaded to GSA. Supplementary Table 1 contains the coefficients of determination

(R²) for each variant on QTL harboring chromosomes where calculation was possible.

Supplementary Table 2 contains the complete list of clusters identified in the GWAS with

additional supporting information for cluster assignment. GRMs from F2 60k genotypes

(File_S1.zip) were created by a "leave one chromosome out" approach using the program

"Genome-wide Complex Trait Analysis (GCTA) version 1.91.4 beta3". GWAS was conducted

with imputed sequence level F2 genotypes (Supplementary File_S2.zip) for each chromosome

using the GRM where the respective chromosome was left out (GRM command: gcta64 --bfile

SG_F2_chip_wo_chrNO --autosome --maf 0.01 --make-grm --out SG_F2_chip_wo_chrNO

--thread-num 10 --autosome-num 18 ; GWAS command: gcta64 --mlma --covar

Kiel_Hoh_cross.covar --bfile F2_beagle4.0_ped_ChrNO --grm SG_F2_chip_wo_chrNO

--pheno TRAIT.pheno --out TRAIT_chrNO --maf 0.01 --thread-num 10; replace NO with the

respective chromosome number and TRAIT with the respective trait to be analyzed). Phenotype

files located in Supplementary File_S2.zip for the traits ADG, BFT, MFR, and CRCL were

used in the GWAS. Crosses were used as covariates in the GWAS and provided as a gcta

compatible covar file in Supplementary File_S4.zip. SNP locations can be found in the bim files

of the genotype data (Supplementary File_S1.zip and Supplementary File_S2.zip). Population

structure information is provided in form of a Beagle 4.0 compatible pedigree file

(Supplementary File_S3.zip). Raw sequencing data is accessible via the NCBI Sequence Read

Archive (SRA) under BioProject ID PRJNA553106. File_S5 contains the 60k chip genotype

data in variant call file (VCF) format. Genomic positions have been lifted to genome assembly

Sus scrofa 11.1 (GCA_000003025.6) and annotated with dbSNP build 150. gcta compatible

covar files for the conditional association analyses with top variants are provided in File_S6.
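The GRM and GWAS command templates quoted above have to be expanded once per chromosome and, for the GWAS, once per trait. A minimal Python sketch of that expansion is given below. It only builds the command strings and does not run GCTA; all flags and file names are copied from the templates in the text, while the helper names and the assumption that NO is replaced by a plain integer from 1 to 18 are ours.

```python
import shlex

TRAITS = ("ADG", "BFT", "MFR", "CRCL")
AUTOSOMES = range(1, 19)  # pig autosomes SSC1 to SSC18


def grm_command(chrom):
    """GRM command for the chip data set named 'without chromosome <chrom>'."""
    prefix = f"SG_F2_chip_wo_chr{chrom}"
    return ["gcta64", "--bfile", prefix, "--autosome", "--maf", "0.01",
            "--make-grm", "--out", prefix, "--thread-num", "10",
            "--autosome-num", "18"]


def gwas_command(trait, chrom):
    """MLMA command for one trait and one chromosome, using the matching LOCO GRM."""
    return ["gcta64", "--mlma", "--covar", "Kiel_Hoh_cross.covar",
            "--bfile", f"F2_beagle4.0_ped_Chr{chrom}",
            "--grm", f"SG_F2_chip_wo_chr{chrom}",
            "--pheno", f"{trait}.pheno",
            "--out", f"{trait}_chr{chrom}",
            "--maf", "0.01", "--thread-num", "10"]


def all_commands():
    """All 18 GRM commands followed by the 4 x 18 GWAS commands, as shell strings."""
    commands = [grm_command(chrom) for chrom in AUTOSOMES]
    commands += [gwas_command(trait, chrom) for trait in TRAITS for chrom in AUTOSOMES]
    return [shlex.join(command) for command in commands]
```

Writing `all_commands()` to a file, one command per line, would yield a job list of 90 entries that can be inspected before anything is executed.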

Results

Whole genome sequencing and variant calling

An average of 592,788,350 (SD = 38,623,216, MIN = 525,921,083, MAX = 649,442,924)

sequencing reads per sample were aligned to the reference genome in the F0 generation with an

average mapping efficiency of 99.37 %. In the F1 generation an average of 26,562,876 (SD =

6,619,568, MIN = 15,915,385, MAX = 59,300,856) reads were mapped to the reference

assembly with an average mapping efficiency of 99.33 %.

With respect to the number of SNPs detected in the founder population, 22,671,759 were

previously reported and 3,950,955 were novel. Furthermore, 1,482,139 of the INDELs were

previously reported and 4,335,345 were novel. Per chromosome, average distances among the

variants are summarized in Table 2. The Ti/Tv of SNPs in the founder population was 2.39,

whereas known SNPs had a Ti/Tv of 2.44 and novel SNPs had a Ti/Tv of 2.08. Due to the low

sequencing coverage (average = 0.96 x, min = 0.58 x, max = 2.14 x) only autosomal SNPs were

called in the F1 population. The raw output of HaplotypeCaller consisted of 20,055,697 known

and 3,529,441 novel SNPs and the raw output of mpileup contained 19,932,201 known and

3,291,758 novel SNPs. The intersection of the two datasets resulted in 19,264,662 known and

2,951,058 novel SNPs, whereas removing all SNPs that were not present in the founder variant

calling dataset led to a final number of 19,224,132 known and 2,911,780 novel raw SNPs.

After the application of a stringent filtering approach (see Material and Methods) 5,753,444

known and 741,155 novel SNPs remained in the variant calling dataset of the F1 population.
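The transition/transversion ratio (Ti/Tv) reported above is the number of purine-to-purine or pyrimidine-to-pyrimidine substitutions divided by the number of purine-to-pyrimidine substitutions (and vice versa). The following Python sketch shows that definition for a list of biallelic SNPs; it is an illustration only, since the text does not state which tool produced the reported ratios, and the function names are assumptions.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def is_transition(ref, alt):
    """True for A<->G or C<->T substitutions, False for transversions."""
    pair = frozenset((ref.upper(), alt.upper()))
    if len(pair) != 2 or not pair <= (PURINES | PYRIMIDINES):
        raise ValueError(f"not a biallelic SNP: {ref}>{alt}")
    return pair == PURINES or pair == PYRIMIDINES


def titv_ratio(snps):
    """Ratio of transitions to transversions for an iterable of (ref, alt) pairs."""
    transitions = transversions = 0
    for ref, alt in snps:
        if is_transition(ref, alt):
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        raise ValueError("no transversions observed; ratio undefined")
    return transitions / transversions
```

As a usage example, `titv_ratio([("A", "G"), ("C", "T"), ("A", "C")])` counts two transitions and one transversion.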

Table 2. Average distance between variants discovered in the founder population. A

number of 24 F0 animals were sequenced at high coverage and the average distances between

variants (SNPs and INDELs) were calculated per chromosome.

| Chromosome | Avg. distance (bp) | SD |
|---|---|---|
| 1 | 105.78729 | 196.8571 |
| 2 | 84.83889 | 201.4813 |
| 3 | 79.13779 | 183.627 |
| 4 | 80.16339 | 174.2767 |
| 5 | 73.37176 | 177.8913 |
| 6 | 85.34639 | 214.5176 |
| 7 | 79.12655 | 175.2067 |
| 8 | 78.90639 | 164.8448 |
| 9 | 79.61324 | 166.9157 |
| 10 | 56.5826 | 141.8648 |
| 11 | 67.6446 | 139.1412 |
| 12 | 65.41473 | 190.7484 |
| 13 | 102.70483 | 209.4908 |
| 14 | 83.58366 | 158.1296 |
| 15 | 93.02928 | 187.2321 |
| 16 | 71.18928 | 155.2467 |
| 17 | 65.91122 | 163.1281 |
| 18 | 76.95204 | 149.7542 |
| Mean | 82.21004 | 180.6988 |

Identification of local drops in imputation accuracy

To detect local inaccuracies in the imputed data, we imputed chip data from each founder with

the remaining 23 founders as a reference panel. The data does not provide information about

the imputation accuracy of the experiment since pedigree information could not be used. The

coefficient of determination for each variant located on a chromosome harboring relevant QTL

was determined where feasible. The average coefficients of determination for each chromosome

analyzed are summarized in Table 3 (complete analysis results in Supplementary Table 1).

Table 3. Identification of local imputation inaccuracies. Chip data from each of the 24

founders was imputed using the remaining 23 founder animals as the reference panel.

Coefficients of determination (R²) were calculated for each variant in order to calculate average

R² for SSC1, SSC2, SSC4, SSC7, SSC17, and SSC18.

| Chromosome | Average R² | SD |
|---|---|---|
| 1 | 0.28 | 0.32 |
| 2 | 0.22 | 0.29 |
| 4 | 0.25 | 0.30 |
| 7 | 0.25 | 0.31 |
| 17 | 0.18 | 0.25 |
| 18 | 0.29 | 0.32 |

GWAS results and clusters

From the genome-wide association study conducted in the pooled F2 population, the following

numbers of variants exceeded the genome-wide significance threshold: 448, 17,105, 6,635, and

27,641 for ADG, BFT, MFR, and CRCL, respectively. Manhattan plots of the GWAS for the

four phenotypic traits are shown in Figure 2. A total of 120 clusters were designated by the MH

tool (i.e., 4 for ADG, 33 for BFT, 22 for MFR and 61 for CRCL) and they were located on the

following Sus scrofa chromosomes (SSC): 1, 2, 4, 5, 7, 17, and 18. The complete cluster list

with additional supporting information for cluster assignment can be found in Supplementary

Table 2. From each of the defined clusters, the top 5 variants were retained. The genes

containing or lying near these highly significant associations are presented in Table 4. The

clusters associated with the traits overlapped on several chromosomes, specifically on SSC2,

SSC4, and SSC7. The location and the extent of the overlapping clusters are depicted in Figure

3. Particular chromosomes had exclusive clusters assigned, e.g., SSC17 for CRCL and SSC18

for MFR. To evaluate all possible relations among the variants exceeding the significance

threshold for each trait, a Venn diagram was used (Figure 4). The highest number of common

variants (i.e., 6,859) was between BFT and CRCL and the second highest was between BFT

and MFR (i.e., 2,380). To get an estimate of systematic bias, quantile-quantile plots were

generated for all p-values from each GWAS (Supplementary Figure 2). As a measure of

association between observed and expected p-values, lambda values were calculated for all four

traits: λADG = 1.282319, λBFT = 1.333425, λCRCL = 1.422044 and λMFR = 1.35587.
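The text does not state how the lambda values were computed. A common definition of the genomic inflation factor, assumed here, is the median of the observed 1-df chi-square statistics divided by the median of the chi-square distribution with one degree of freedom (about 0.455). The Python sketch below follows that definition using only the standard library; the function name is hypothetical.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()
# Median of a chi-square distribution with 1 degree of freedom (about 0.455).
CHI2_1DF_MEDIAN = _STD_NORMAL.inv_cdf(0.75) ** 2


def genomic_inflation(p_values):
    """Median observed 1-df chi-square statistic divided by its expected median."""
    chi2 = []
    for p in p_values:
        if not 0.0 < p <= 1.0:
            raise ValueError("p-values must lie in (0, 1]")
        z = _STD_NORMAL.inv_cdf(p / 2.0)  # two-sided p-value to normal quantile
        chi2.append(z * z)
    return median(chi2) / CHI2_1DF_MEDIAN
```

Under this definition, p-values that follow the uniform null distribution give a value close to 1, and values above 1 indicate that the observed test statistics are larger than expected.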

VEP and high impact variants

To predict functional consequences on genes, the Ensembl VEP tool was employed. Multiple

transcripts per gene resulted in larger numbers of annotations that are reflected in the higher

number of predicted effects as compared to the actual number of identified variants per trait.

All inferred consequences for Bonferroni corrected variants per trait and their percentage

breakdown are summarized in Table 5. The large majority (over 70%) of the consequences were

classified as intron variants. According to the severity of the variant consequence, intron

variants are assigned a modifier impact, which means that predictions are difficult to

make or there is no solid evidence of impact. Variants inferred to have a disruptive impact

on the protein, leading to protein truncation, loss of function or causing nonsense-mediated

decay were of further interest. These significant high impact variants (Table 6) were mostly

located on SSC7, with the exception of SSC2:rs1110687780 (splice donor variant) affecting

TCN1 for the trait MFR. For BFT, the most severe consequences were located in the genes

C6orf89, PI16, DST, and PRIM2, while for CRCL disruptive impact variants were found in

NEU1, four novel genes, ABCD4, DST, PRIM2, and LPCAT4. Notably, the same two splice

donor variants, located in DST and PRIM2, were significant for both BFT and CRCL. Sorting

Intolerant From Tolerant (SIFT) scores were determined for all significant missense variants

and are summarized in Supplementary Table 3 (Ng and Henikoff, 2003).

Figure 2. Manhattan plots of the –log10 p-values for association of variants with the traits

(A) average daily gain (ADG), (B) back fat thickness (BFT), (C) meat to fat ratio (MFR),

and (D) carcass length (CRCL). P-values < 0.001 were excluded from the plots.

Figure 3. Cluster overlap for (A) SSC2, (B) SSC4 and (C) SSC7 for all traits (average daily

gain (ADG) – red, back fat thickness (BFT) – green, meat to fat ratio (MFR) – purple, and

carcass length (CRCL) – blue). The heights of the clusters are according to the top variant

(–log10 p-value) within each given cluster.

Figure 4. Variants concordance and discordance between the traits average daily gain

(ADG), back fat thickness (BFT), meat to fat ratio (MFR), and carcass length (CRCL).

The Venn diagram contains statistically significant variants. Intersections between traits include

the number of common variants. Numbers of variants that were exclusively found in the single

traits are outside of intersections.

Figure 5. Imputation accuracy on SSC2 between positions 1,250,000 and 2,000,000. IGF2

is located between bp 1,469,183 and 1,496,417.

Table 4. Top associated genes for average daily gain (ADG), back fat thickness (BFT),

meat to fat ratio (MFR), and carcass length (CRCL) identified in the GWAS. Genes

incorporating or nearby the top 5 variants in the clusters are listed with chromosome and cluster

numbers.

| Trait | SSC | Cluster no./SSC | Genes |
|---|---|---|---|
| ADG | 2 | 1 | SHANK2 |
| ADG | 4 | 1 | RPS20, LYN, PLAG1 |
| ADG | 7 | 2 | HMGCLL1, TFEB |
| BFT | 1 | 1 | ZNF462, ENSSSCG00000005432 |
| BFT | 2 | 7 | LOC102158414, PGA5, MRPL16, ENSSSCG00000013151, ZFP91, CTNND1, ENSSSCG00000024984, OR9Q2, OR10Q1 |
| BFT | 4 | 1 | RPS20 |
| BFT | 7 | 24 | SCGN, LRFN2, DAAM2, C7H6orf223, C7H6orf132, RIPOR2, CARMIL1, BMP5, ENSSSCG00000001500, KIFC1, C6orf106, PPARD, FKBP5, CPNE5, ENSSSCG00000001574, TMEM217, LRFN2, MRPS10, TRERF1, RUNX2, RCAN2, MEP1A, ADGRF5, PTCHD4, ENSSSCG00000001734, PGK2, IREB2, ABHD17C, GSTA2, CRABP1, CRISP3, PRSS16, TBC1D2B, ENSSSCG00000038708, BCL2A1, E2F3 |
| MFR | 1 | 3 | LOC106507123, TMEM245, SCAI, ABL1, RAPGEF1, CFAP77, DDX31, MAPKAP1 |
| MFR | 2 | 8 | LOC102158414, LOC110259166, LOC110259708, TMEM80, DEAF1, EHD1, MACROD1, ATL3, NAV2, DHCR7, ENSSSCG00000028537, CTTN, SHANK2, ENSSSCG00000036180 (KRTAP5-5-like), NELL1 |
| MFR | 4 | 3 | PDE7A, SNTG1, RPS20 |
| MFR | 5 | 1 | ENSSSCG00000034097 |
| MFR | 7 | 1 | ENSSSCG00000001500 |
| MFR | 18 | 6 | PPP1R3A, IMMP2L, LRRC4, EXOC4, SND1, ELMO1, MDFIC, TFEC |
| CRCL | 1 | 1 | FNBP1 |
| CRCL | 7 | 52 | VEGFA, FLRT2, LRFN2, MCTP2, DAAM2, PGF, SV2B, MAX, COL21A1, KLHL25, NPAS3, LOC110261756, NHLRC1, TPMT, CDKAL1, GMNN, RIPOR2, MDC1, DDX39B, HMGCLL1, ENSSSCG00000001500, C6orf106, KCTD20, SRSF3, ENSSSCG00000001612, FOXP4, TFEB, RCAN2, ADGRF1, MUT, CRISP1, TFAP2D, PKHD1, BNC1, ENSSSCG00000001827, TMEM266, NKX2-1, PRKD1, LPCAT4, NR2F2, MCTP2, SLCO3A1, ENSSSCG00000002270, FUT8, ENSSSCG00000002317, DPF3, PTGR2, ZNF410, FAM161B, EIF2B2, MLH3, VIPAS39, SPTLC2, ENSSSCG00000010328, RF01299, RF00100, HMGN4, NRXN3, ID4, SYNJ2BP, ZFP36L1, RAD51B, AVEN, ANG, GCM1, FOXG1, ENSSSCG00000033840, ENSSSCG00000035274, RSL24D1, ENSSSCG00000036697, ENSSSCG00000037115, ENSSSCG00000038445, CEMIP, SLC25A21, SPTSSA, ENSSSCG00000039877, DIO2, ENSSSCG00000040930 |
| CRCL | 17 | 8 | BMP2, JAG1, SPTLC3, TMX4 |

Table 5. Results of variant effect prediction for the production traits average daily gain

(ADG), back fat thickness (BFT), meat to fat ratio (MFR), and carcass length (CRCL).

Bonferroni-corrected variants were analyzed.

| Predicted effect | ADG | ADG % | BFT | BFT % | MFR | MFR % | CRCL | CRCL % |
|---|---|---|---|---|---|---|---|---|
| Missense variant | 2 | 0.1580 | 962* | 0.6523* | 58 | 0.1893 | 787* | 0.4750* |
| Frameshift variant | 0 | 0 | 0 | 0 | 0* | 0* | 6* | 0.0036* |
| Start lost | 0 | 0 | 1 | 0.0007 | 0 | 0 | 0 | 0 |
| Stop gained | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.0006 |
| Inframe deletion | 0 | 0 | 1* | 0.0007* | 0 | 0 | 2 | 0.0012 |
| Intron variant | 936 | 73.9336 | 116815 | 79.2090 | 21556* | 70.3525* | 131590 | 79.4275 |
| 5 prime UTR variant | 0 | 0 | 229* | 0.1553* | 89 | 0.2905 | 277* | 0.1672* |
| 3 prime UTR variant | 8 | 0.6319 | 1160* | 0.7866* | 1242 | 4.0535 | 1543* | 0.9314* |
| Upstream gene variant | 50 | 3.9494 | 5680* | 3.8514* | 2195* | 7.1638* | 5300* | 3.1991* |
| Downstream gene variant | 44* | 3.4755* | 5791* | 3.9267* | 3442 | 11.2337 | 6893* | 4.1606* |
| Frameshift variant, splice region variant | 0 | 0 | 2 | 0.0014 | 0 | 0 | 0 | 0 |
| Missense variant, splice region variant | 0 | 0 | 41* | 0.0278* | 0 | 0 | 75* | 0.0453* |
| Splice region variant, non coding transcript exon variant | 0 | 0 | 2 | 0.0014 | 3* | 0.0098* | 5 | 0.0030 |
| Splice region variant, 3 prime UTR variant | 0 | 0 | 3* | 0.0020* | 3* | 0.0098* | 0 | 0 |
| Splice region variant, intron variant, non coding transcript variant | 0 | 0 | 2* | 0.0014* | 4 | 0.0131 | 20* | 0.0121* |
| Splice region variant, intron variant | 0 | 0 | 426* | 0.2889* | 41* | 0.1338* | 489* | 0.2952* |
| Splice region variant, synonymous variant | 0 | 0 | 21 | 0.0142 | 22* | 0.0718* | 28* | 0.0169* |
| Splice donor variant | 0 | 0 | 36 | 0.0244* | 1* | 0.0033 | 37 | 0.0223 |
| Intergenic variant | 109 | 8.6098 | 3318 | 2.2498 | 644 | 2.1018 | 9909 | 5.9811 |
| Synonymous variant | 0 | 0 | 2837 | 1.9237 | 214* | 0.6984* | 2751 | 1.6605 |
| Intron variant, non coding transcript variant | 117* | 9.2417* | 9636 | 6.5339 | 1060* | 3.4595* | 5759* | 3.4761* |
| Non coding transcript exon variant | 0 | 0 | 514* | 0.3485* | 66 | 0.2154 | 200* | 0.1207* |
| Start lost, start retained variant, 5 prime UTR variant | 0 | 0 | 0 | 0 | 0 | 0 | 1* | 0.0006* |
| Total | 1266 | | 147477 | | 30640 | | 165673 | |

Table 6. Statistically significant high impact variants that were discovered in the genome

wide association studies for the production traits average daily gain (ADG), back fat

thickness (BFT), meat to fat ratio (MFR), and carcass length (CRCL).

| Trait | High impact consequence | Variant | Position (bp) | Gene | Gene name |
|---|---|---|---|---|---|
| BFT | Start lost | SSC7:rs319855624 | 32544657 | C6orf89 | chromosome 7 C6orf89 homolog |
| BFT | Frameshift variant, splice region variant | SSC7:._504514 | 32606375 | PI16 | peptidase inhibitor 16 |
| BFT | Frameshift variant, splice region variant | SSC7:._504513 | 32606373 | PI16 | peptidase inhibitor 16 |
| BFT | Splice donor variant | SSC7:rs80834233 | 29157904 | DST | dystonin |
| BFT | Splice donor variant | SSC7:rs327743463 | 28571665 | PRIM2 | DNA primase subunit 2 |
| MFR | Splice donor variant | SSC2:rs1110687780 | 11630410 | TCN1 | transcobalamin 1 |
| CRCL | Start lost, start retained variant, 5 prime UTR variant | SSC7:rs793752812 | 23958518 | NEU1 | neuraminidase 1 |
| CRCL | Stop gained | SSC7:rs334442580 | 87783592 | novel gene | |
| CRCL | Frameshift variant | SSC7:._1165873 | 97574140 | ABCD4 | ATP binding cassette subfamily D member 4 |
| CRCL | Frameshift variant | SSC7:rs693811701 | 48561663 | novel gene | aurora kinase A-like |
| CRCL | Frameshift variant | SSC7:._1068730 | 87783712 | novel gene | |
| CRCL | Frameshift variant | SSC7:._1068731 | 87783718 | novel gene | |
| CRCL | Splice donor variant | SSC7:rs80834233 | 29157904 | DST | dystonin |
| CRCL | Splice donor variant | SSC7:rs327743463 | 28571665 | PRIM2 | DNA primase subunit 2 |
| CRCL | Splice donor variant | SSC7:rs331245426 | 80150975 | LPCAT4 | lysophosphatidylcholine acyltransferase 4 |

Gene set analysis

GO functional enrichment analysis revealed eleven significantly overrepresented GO terms

including molecular functions (MF), biological processes (BP), and cellular components (CC).

A list containing the GO terms and the associated list of genes is presented in Table 7. For BFT

a GO-MF term was overrepresented and related to calcium ion binding (GO:0005509). Several

olfactory receptor genes were prevalent for the GO terms assigned to MFR (e.g. GO-BP

GO:0007186 G-protein coupled receptor signaling pathway, GO-MF GO:0005549 odorant

binding). The gene set for the CRCL trait was associated with two BP terms (GO:0001666

response to hypoxia and GO:0008283 cell proliferation) and two CC terms (GO:0045177 apical

part of the cell and GO:0031410 cytoplasmic vesicle).
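As a rough illustration of what "overrepresented" means for a single GO term, the following minimal sketch computes a one-sided hypergeometric p-value. It is not DAVID's own procedure (DAVID reports a modified Fisher exact score), and the function name and all counts in the usage lines are hypothetical.

```python
from math import comb


def go_term_enrichment_p(hits, list_size, term_size, background_size):
    """One-sided hypergeometric p-value P(X >= hits).

    hits            -- genes in the candidate list annotated with the term
    list_size       -- number of genes in the candidate list
    term_size       -- genes in the background annotated with the term
    background_size -- number of annotated genes in the background
    """
    total = comb(background_size, list_size)
    upper = min(list_size, term_size)
    tail = sum(
        comb(term_size, i) * comb(background_size - term_size, list_size - i)
        for i in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical counts, not taken from the study: 8 of 60 candidate genes carry
# a term that annotates 300 of 20,000 background genes.
p_value = go_term_enrichment_p(8, 60, 300, 20000)
```

With roughly one annotated gene expected by chance in such a list, eight hits give a very small p-value, which is the pattern an enrichment tool flags as overrepresentation.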

Table 7. Most significant Gene Ontology (GO) terms from DAVID for the top associated genes that were identified in genome wide association studies for the traits back fat thickness (BFT), meat to fat ratio (MFR), and carcass length (CRCL).

Trait | Category | Term | Genes
BFT | MF | GO:0005509 calcium ion binding | DST, LOC100152993, SCGN, GUCA1B, ITPR3, CIB2, GUCA1A, RASGRP2
MFR | BP | GO:0007186 G-protein coupled receptor signaling pathway | OR5B3, LOC100623017, LOC106509349, LOC100512519, LOC100513457, OR9Q2, LOC100628183, LOC100511243, LOC100512154, LOC100514032, LOC100521066, LOC100519351, OR10Q1, LOC100511620, LOC106509346
MFR | CC | GO:0016021 integral component of membrane | ANO9, OR5B3, LOC100512519, LOC100519082, LOC100513457, LOC100628183, SIGIRR, LOC100512154, BET1L, LOC100521066, TMX2, OR10Q1, TMEM80, LOC100623017, LOC106509349, OR9Q2, LOC100511243, ZDHHC5, ATL3, LOC100514032, LRRC4, PPP1R3A, LOC100519351, LRRN3, LOC100511620, STX3, LOC100521938, CCDC136, LOC106509346, NRXN2
MFR | CC | GO:0005886 plasma membrane | OR5B3, EHD1, LOC100623017, LOC106509349, LOC100512519, OR9Q2, LOC100513457, LOC100628183, CTNND1, LOC100511243, ELMO1, LOC100512154, ZDHHC5, LOC100514032, LOC100521066, LOC100519351, STX3, LOC100511620, RABEPK, LOC106509346, RASGRP2
MFR | MF | GO:0004930 G-protein coupled receptor activity | OR5B3, LOC100623017, LOC106509349, LOC100512519, LOC100513457, OR9Q2, LOC100628183, LOC100511243, LOC100512154, LOC100514032, LOC100521066, LOC100519351, OR10Q1, GPR141, LOC100511620, LOC106509346
MFR | MF | GO:0004984 olfactory receptor activity | OR5B3, LOC100623017, LOC106509349, LOC100512519, LOC100513457, OR9Q2, LOC100628183, LOC100511243, LOC100512154, LOC100514032, LOC100521066, LOC100519351, OR10Q1, LOC100511620, LOC106509346
MFR | MF | GO:0005549 odorant binding | OR5B3, LOC100623017, LOC106509349, LOC100513457, OR9Q2, LOC100628183, LOC100512154, LOC100514032, LOC100521066, LOC100519351, OR10Q1, LOC100511620, LOC106509346
CRCL | BP | GO:0001666 response to hypoxia | ANG, TGFB3, PGF, PLAT, VEGFA
CRCL | BP | GO:0008283 cell proliferation | FURIN, FAM83B, ZFP36L1, MORF4L1, BYSL, RASGRF1
CRCL | CC | GO:0045177 apical part of cell | ADGRF5, VASH1, PLAT, HOMER2, BYSL
CRCL | CC | GO:0031410 cytoplasmic vesicle | ANG, ADGRF5, FES, NEU1, GRM4, RHGC

Discussion

Genotyping strategy

The genotyping strategy that we developed for this study is outlined in Figure 1. Briefly, 24 F0 pigs were subjected to high-coverage Illumina short-read sequencing; in addition, 91 F1 animals were sequenced at low coverage and imputed to high-coverage WGS level in order to allow phasing. The 2657 F2 animals were chip genotyped and imputed using a merged dataset of F0 and imputed F1 animals as the reference panel. All imputation steps involved pedigree information. In contrast to a population-based strategy, this approach does not rely on a large reference panel but on the relatedness of individuals. In general, the genotyping strategy can be considered reliable, since the majority of the QTLs identified had already been described for the four traits analyzed in this study (cross-referenced with the Pig QTL database (Hu et al., 2019)). Nevertheless, we expected to identify a variant associated with muscle mass and fat deposition in exon 2 of IGF2, which has been extensively described to influence muscle development (Nezer et al., 1999). The absence of IGF2-associated variants can be explained by a local drop in the coefficient of determination from an average of R² = 0.22 to R² = 0.03 in the genomic region where IGF2 resides (SSC2 1,469,183 – 1,496,417 bp, Figure 5). It must be pointed out that these coefficients of determination cannot be used to draw conclusions about the actual accuracy of the imputation. Since no pedigree information was included in the simulation, they can solely be used to identify local inaccuracies, which were most likely due to assembly errors in the reference genome.
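One common way to quantify local imputation quality is the squared Pearson correlation between true and imputed allele dosages at each variant. The text does not state which R² estimator was used in the simulation, so the sketch below is an assumption made for illustration only; the function name and the example dosages are hypothetical.

```python
def dosage_r2(true_dosages, imputed_dosages):
    """Squared Pearson correlation between true and imputed allele dosages."""
    n = len(true_dosages)
    if n == 0 or n != len(imputed_dosages):
        raise ValueError("dosage lists must be non-empty and of equal length")
    mean_t = sum(true_dosages) / n
    mean_i = sum(imputed_dosages) / n
    cov = sum((t - mean_t) * (i - mean_i)
              for t, i in zip(true_dosages, imputed_dosages))
    var_t = sum((t - mean_t) ** 2 for t in true_dosages)
    var_i = sum((i - mean_i) ** 2 for i in imputed_dosages)
    if var_t == 0 or var_i == 0:
        return 0.0  # monomorphic variant: correlation is undefined, report 0
    return cov * cov / (var_t * var_i)


# Hypothetical dosages (0, 1 or 2 copies of the alternative allele).
well_imputed = dosage_r2([0, 1, 2, 1, 0, 2], [0, 1, 2, 1, 0, 2])
poorly_imputed = dosage_r2([0, 1, 2, 1, 0, 2], [1, 1, 1, 0, 2, 1])
```

Averaging such per-variant values in windows along a chromosome would expose a local drop of the kind described for the IGF2 region.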

The genotyping approach presented in this study can be considered a reasonable strategy to radically increase the marker density of large F2 populations to WGS level. By sequencing the founder individuals at high coverage and the F1 animals at low coverage, both of which represent only a fraction of the number of F2 animals, the approach provides an affordable opportunity to improve the power and potential of otherwise obsolete datasets. Due to the relatedness of the animals, deep sequencing of only a few animals is necessary, rendering the approach economically attractive.

Cluster identification and exploratory analysis

To fully exploit the potential of the four resource populations, the crosses were pooled and further used for conducting GWAS. The increased sample size together with the increased marker density ensures a high resolution that might allow the pinpointing of more specific causative genes and mutations. Further experiments, e.g. Sanger sequencing of promising regions, could elaborate on that. The F2 design implies that LD blocks are longer, a fact that is counteracted to some extent by jointly analyzing the four designs. Lambda values of 1.282319 to 1.422044 point to a moderate degree of p-value inflation in the GWAS, which is most likely caused by the usage of WGS data and a LOCO GWAS approach. Nevertheless, we chose the LOCO approach in order to exploit the whole depth and power of the dataset. To further comprehend the closely linked association signals from GWAS, the following approach was employed: i) clusters incorporating strong evidence for trait-associated chromosomal regions were defined, ii) the effect of the significant variants was predicted and iii) a gene set analysis was employed to identify sets of genes jointly associated with the traits of interest.
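The genomic inflation factor lambda mentioned above is conventionally computed as the median of the observed association test statistics divided by the median expected under the null hypothesis. The sketch below assumes chi-square statistics with 1 degree of freedom recovered from two-sided p-values; the function name and the artificial p-values are illustrative, and the text does not state which software produced the reported values.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()
# Median of a chi-square distribution with 1 degree of freedom (about 0.455).
CHI2_1DF_MEDIAN = _STD_NORMAL.inv_cdf(0.75) ** 2


def genomic_inflation(p_values):
    """Return lambda = median observed chi-square / median expected chi-square."""
    # A two-sided p-value p corresponds to chi-square = z**2 with z = Phi^-1(p / 2).
    chi2 = [_STD_NORMAL.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(chi2) / CHI2_1DF_MEDIAN


# Evenly spaced p-values mimic a GWAS without inflation (lambda close to 1);
# squaring them shifts the distribution towards small p-values (lambda > 1).
null_p = [(i + 0.5) / 1000 for i in range(1000)]
lambda_null = genomic_inflation(null_p)
lambda_inflated = genomic_inflation([p * p for p in null_p])
```

Values clearly above 1, such as those reported here, indicate that the bulk of the test statistics is larger than expected under the null hypothesis.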

The quantitative traits considered for this study have been investigated in the past and are

mostly well represented in the Pig QTL database (Hu et al., 2019), except for MFR. The clusters

assigned to each trait were compared with the QTL regions from the database. For MFR,

additional fat-related traits (e.g. fat percentage in the carcass and fat-cuts percentage) were

considered in order to allow an adequate comparison given that the trait has few records in the

database and the trait definition can be country dependent. Most of the clusters overlapped or

were in the vicinity of the previously reported QTLs. This was expected as the database has

been recently updated and also includes our previous results (Blaj et al., 2018) using SNP chip

data and three out of the four pig populations which were taken into account here. Some of the

earlier reported QTLs in the database spread over large genomic regions (e.g. > 5 Mb). It is

assumed that many of these large QTL regions might in fact not be due to a single mutation,

thus representing haplotype effects caused by several causative variants (Andersson, 2009). In

the current study, we were able to assign numerous clusters within these regions, which implies

that a higher genomic resolution was achieved and that it may be possible to disentangle distinct

quantitative trait nucleotides.
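The comparison of clusters with database QTL regions can be sketched as a simple interval test. The 1 Mb vicinity window, the function name and the coordinates below are hypothetical choices for illustration, since the text does not define how "in the vicinity" was measured.

```python
def overlaps_or_nearby(cluster, qtl, max_gap_bp=0):
    """Return True if two (chromosome, start, end) regions overlap or lie
    within max_gap_bp of each other on the same chromosome."""
    c_chr, c_start, c_end = cluster
    q_chr, q_start, q_end = qtl
    if c_chr != q_chr:
        return False
    return c_start <= q_end + max_gap_bp and q_start <= c_end + max_gap_bp


# Hypothetical regions in base pairs (not taken from the study).
cluster = ("SSC7", 29_000_000, 29_400_000)
reported_qtl = ("SSC7", 24_000_000, 28_500_000)

strict_overlap = overlaps_or_nearby(cluster, reported_qtl)
within_1mb = overlaps_or_nearby(cluster, reported_qtl, max_gap_bp=1_000_000)
```

In this example the cluster does not overlap the reported QTL, but it is counted as lying in its vicinity once a 1 Mb gap is tolerated.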

Conditional association analyses, in which the top variant was included as a fixed effect in the MLMA, were carried out in order to gather statistical evidence for putative causality (Cohen-Zinder et al., 2005) and were specifically applied to CRCL and BFT on SSC7. This chromosome exhibits the highest number of clusters (SM with Clusters) and the highest association signals. After including the top variant (rs81228492) for BFT, only one well-supported peak remained above the significance threshold (Supplementary Figure 1), meaning that there is additional genetic variation within this region. Similarly, for CRCL the two top variants (rs333021601 and rs319044994), representing the two different significant genomic regions, were included alternately in the model. After fixing the effect of the latter variant, the surrounding significant region disappeared, pointing to the possibility that there could be only one QTL responsible for CRCL on SSC7 around the 99 Mb region. An alternative or additional explanation could be the presence of long LD blocks, long-range LD and/or various epistatic interactions among the loci.
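The logic of conditioning on a top variant can be sketched with ordinary least squares: the phenotype and the tested variant are both adjusted for the top variant, and the remaining association is then estimated. This is a deliberate simplification that omits the random polygenic effect of the MLMA; the function names and genotypes are hypothetical.

```python
def _adjust_for(values, covariate):
    """Residuals of a simple linear regression of values on covariate."""
    n = len(values)
    mean_v = sum(values) / n
    mean_c = sum(covariate) / n
    s_cc = sum((c - mean_c) ** 2 for c in covariate)
    s_vc = sum((v - mean_v) * (c - mean_c) for v, c in zip(values, covariate))
    slope = s_vc / s_cc if s_cc > 0 else 0.0
    return [(v - mean_v) - slope * (c - mean_c)
            for v, c in zip(values, covariate)]


def conditional_effect(phenotype, variant, top_variant):
    """Effect of variant on phenotype with top_variant fitted as a fixed covariate.

    Simplified sketch: no relationship matrix or polygenic term is modelled.
    """
    y_res = _adjust_for(phenotype, top_variant)
    x_res = _adjust_for(variant, top_variant)
    s_xx = sum(x * x for x in x_res)
    if s_xx == 0:
        return 0.0  # variant carries no information beyond the top variant
    return sum(x * y for x, y in zip(x_res, y_res)) / s_xx


# Hypothetical genotypes coded as allele counts.
top = [0, 1, 2, 0, 1, 2, 0, 1]
linked = [0, 1, 2, 0, 1, 1, 0, 2]        # correlated with the top variant
independent = [1, 0, 1, 2, 0, 2, 1, 0]   # carries a second signal

# Phenotype driven by the top variant only: the linked variant's effect
# vanishes once the top variant is fitted.
effect_linked = conditional_effect([2.0 * g for g in top], linked, top)

# Phenotype with a second signal of size 3.0 that survives the conditioning.
pheno = [t + 3.0 * g for t, g in zip(top, independent)]
effect_independent = conditional_effect(pheno, independent, top)
```

A signal that disappears after conditioning is consistent with a single underlying QTL, whereas a signal that persists points to additional variation in the region.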

The overlap among the BFT and CRCL significant variants (see Figure 3 and Figure 4), localized mostly in the genomic region 24-32 Mb, indicates the existence of pleiotropic loci for the two traits. When the top BFT variant (rs81228492) was fitted as a fixed effect in the MLMA on CRCL, and the top CRCL variant (rs333021601) in the MLMA on BFT, the initially associated clusters and those nearby dropped in the intensity of the association signals (Supplementary Figure 1), supporting the presence of pleiotropic loci. It is also noteworthy that CRCL might be influenced by the number of thoracolumbar vertebrae (Rohrer et al., 2015). The variant that has been associated with a higher number of vertebrae is a large indel in intron 1 of the VRTN gene (Fan et al., 2013); we were not able to discover it because the genotyping pipeline applied in this study covers only small indels.

In order to gain insight into the possible genetic mechanisms that control the traits, an enrichment analysis of gene function was performed with DAVID, prioritizing the GO terms. The GO-MF calcium ion binding term found for BFT supports the relationship between the calcium ion, food intake and lipid metabolism previously described in the literature (Cui et al., 2017). Furthermore, one of the genes in this group is DST, a strong candidate gene for which high impact variants were found via VEP, which is discussed in detail below. A GO-BP term related to cell proliferation comprised the FAM83B gene, which is the gene incorporating the top variant found for CRCL. Interestingly, the majority of the genes included in the overrepresented terms for MFR were olfactory receptors. This enrichment is a consequence of the MFR-identified clusters overlapping regions that are rich in various olfactory receptor genes. This particular gene family is known to have expanded considerably over time within the Sus scrofa genome (Nguyen et al., 2012).

Variant effect prediction

ADG: A QTL for ADG found on SSC7 comprises 115 statistically significant intron variants and 83 variants upstream (min. p-value 8.71 × 10⁻¹⁴) of the HMGCLL1 gene, which was shown by Comuzzie et al. (2012) to be associated with childhood obesity in the Hispanic population and to influence creatinine levels. Another QTL on SSC2 contains 112 intron variants in SHANK2 (min. p-value 1.33 × 10⁻¹²). SHANK2 was also shown to be associated with childhood obesity in the same study and to have an influence on estradiol blood concentrations (Comuzzie et al., 2012). A third QTL on SSC4 harbors 2 intron and 12 downstream variants (min. p-value 1.06 × 10⁻¹³) affecting LYN, which encodes the LYN proto-oncogene and was also identified by Comuzzie et al. (2012) as correlated with the amount of fat mass in obese children. Six additional variants in the QTL on SSC4 (min. p-value 2.44 × 10⁻¹²) lie in an intergenic region 13,463 – 14,460 bp downstream of RPS20, a gene which in interplay with GNL1 is critical for cell growth (Krishnan et al., 2018). Another likely candidate SNP to influence ADG is an intron variant in the PLAG1 transcription factor (p-value 1.32 × 10⁻¹¹), which is a regulator of IGF2 expression (Zatkova et al., 2004).

BFT: A QTL for BFT with a very prominent peak was detected on SSC7. The SNP with the lowest p-value (6.63 × 10⁻⁵⁴) is an intron variant in the gene C6orf106. C6orf106 is a target of the human miRNA hsa-miR-192, which has been identified to have regulatory functions in type 2 diabetes mellitus (Cui et al., 2016). The second top-scoring SNP is an intron variant in the RIPOR2 gene (p-value 4.34 × 10⁻⁵⁰). RIPOR2 expression and protein levels are upregulated during muscle cell differentiation in human fetal muscle cells (Yoon et al., 2007). Another gene containing top-scoring variants on SSC7 is KIFC1 (7 intron variants, min. p-value 3.12 × 10⁻⁴⁷). Overexpression of KIFC1 promotes cell proliferation in non-small cell lung cancer (Liu et al., 2016). A cluster on SSC7 also contains 21 intron and 8 downstream variants in BMP5 (min. p-value 1.91 × 10⁻²⁹), which induces cartilage and bone formation (Wozney et al., 1988). Six variants downstream of the aforementioned RPS20 (min. p-value 1.90 × 10⁻¹⁵) were found in the cluster on SSC4.

MFR: GWAS for the MFR trait revealed a strong QTL on SSC2 with variant rs81327136 upstream of KRTAP5-5-like being the most significant (p-value 1.59 × 10⁻²³). Of 72 variants, 6 were located in KRTAP5-5-like introns and 66 in the vicinity of the gene. KRTAP5-5 has been reported to control cytoskeletal function and cancer cell vascular invasion (Berens et al., 2017). Other variants found in clusters on SSC2 are located in or adjacent to DEAF1 (8 intron variants, min. p-value 3.47 × 10⁻²⁹), which is a transcription factor that regulates proliferation of epithelial cells (Barker et al., 2008) and that forms a dominant-negative splice isoform in type 1 diabetes, which correlates with disease severity (Yip et al., 2015). Clusters on SSC2 also harbor variants associated with SHANK2 (1,714 intron variants, three 5' UTR variants, min. p-value 2.18 × 10⁻²⁶) and CTTN (188 up- and downstream variants, min. p-value 1.53 × 10⁻²⁵). CTTN's protein product cortactin binds to and is indirectly phosphorylated by obesity factor PTP1B (Stuible et al., 2008). A noteworthy intron variant is located in the vitamin D pathway gene DHCR7 (p-value 3.06 × 10⁻²⁵), which has been associated with obesity traits in humans (Vimaleswaran et al., 2013). A total of 14 DHCR7 intron variants were above the significance threshold. A less prominent QTL on SSC4 harbors variants in or close to the aforementioned gene RPS20 (17 downstream variants, min. p-value 1.51 × 10⁻²⁵) and in SNTG1 (19 intron variants, min. p-value 1.48 × 10⁻¹⁴), which has been associated with type 2 diabetes (Ban et al., 2010). A third, rather minor QTL on SSC18 contains 21 variants downstream of MDFIC (min. p-value 1.97 × 10⁻¹⁵), a gene which has been linked to improved piglet birth weight (Zhang et al., 2014). In addition, 25 intron and 42 downstream variants were found for the PPP1R3A gene (min. p-value 6.92 × 10⁻¹⁵), which in a whole-exome sequencing study was found to be associated with type 2 diabetes in a Mayan population (Sánchez-Pozos et al., 2018).

CRCL: In the GWAS for CRCL, 52 clusters were identified on SSC7. Although not located in one of the clusters, the two lowest p-values (min. p-value 5.40 × 10⁻⁴⁹) were found in the intron and coding region (silent mutation) of FAM83B (or C6orf143), respectively. A total of 62 significant variants in FAM83B were discovered, comprising 60 intron variants, 1 silent mutation, and 1 missense mutation. Cipriano et al. demonstrated that overexpression or mutation of FAM83B leads to EGFR hyperactivation by direct interaction and consequent hyperactivation of the EGFR downstream effector phospholipase D1, which was previously associated with BMI in humans (Davenport et al., 2015). An intron variant in the RIPOR2 gene with a p-value of 5.08 × 10⁻⁴⁷ is the same SNP that was found in the GWAS for BFT. A total of 85 mostly intronic RIPOR2 variants were found for the CRCL trait. A second, less prominent QTL on SSC7 harbors 9 intron, 12 downstream and 317 upstream variants (min. p-value 3.62 × 10⁻³¹), which have been assigned to the RSL24D1 gene. RSL24D1 has been identified as a potential target in familial hypercholesterolemia (Li et al., 2015). One of the clusters identified for CRCL on SSC17 contains 230 variants 122,416-126,520 bp downstream of BMP2 (min. p-value 7.21 × 10⁻³⁸), a bone formation inducing factor (Wang et al., 2013). In addition, 18 intron and 114 variants upstream of TMX4 were discovered. TMX4 was associated with feed conversion ratios in chickens (Shah et al., 2016).

High impact variants: Various high impact variants were discovered by variant effect prediction. A splice donor variant (rs80834233) in DST, the gene encoding dystonin, is associated with BFT (p-value 1.98 × 10⁻¹⁹) and CRCL (p-value 1.25 × 10⁻¹⁹). Knockout of DST leads to intrinsic muscle weakness and instability of skeletal muscle cytoarchitecture in mice (Dalpé et al., 1999). Variant rs793752812 leads to a probable start codon loss in NEU1 and is associated with CRCL (p-value 1.49 × 10⁻¹²). A deficiency of the NEU1 gene product neuraminidase 1 leads to vertebral deformities in humans (Spranger et al., 1977), which is plausible considering that CRCL is largely determined by the number of vertebrae. Furthermore, one frameshift variant in AURKA (rs693811701, p-value 2.95 × 10⁻¹²) and one splice donor variant in NUTM1 (rs331245426, p-value 1.38 × 10⁻⁹), both oncogenes (Umene et al., 2015; Schaefer et al., 2018), are associated with CRCL. The splice donor variant rs1110687780, which affects the gene coding for placenta-specific protein 1-like, was detected in the GWAS for MFR. In humans, PLAC1 has been found to be highly expressed in various types of tumors (Koslowski et al., 2007).

Application of results in breeding programs and follow up studies

Functional validation studies based on the identified candidate genes and genetic variants will be considered in follow-up studies. Besides improving the understanding of the molecular mechanisms underlying ADG, BFT, MFR and CRCL, the results of GWAS can substantially increase the reliability of genomic predictions in breeding programs. This concept was demonstrated in several studies in cattle (Brøndum et al., 2015; Porto-Neto et al., 2015; van den Berg et al., 2016) and in Drosophila melanogaster (Ober et al., 2015) by including pre-selected variants from GWAS results in the prediction models. Even though implementing genomic selection is becoming common practice, marker-assisted selection and genomic screening are not obsolete, so the identification of relevant genetic markers via GWAS and post-GWAS analyses is still of practical importance in pig breeding.

Conclusion

Putting the results of previous simulation studies to the test, we conducted GWAS in four pooled F2 designs, which had been imputed to sequence level based on high-coverage founder and low-coverage F1 sequencing. We found that by pooling the designs the sequence-level marker density can be exploited efficiently. QTLs for four well-characterized traits were identified in agreement with previous mapping studies, and candidate genes and pathways were uncovered that should be subject to further studies. Thus, the approach applied herein is a feasible strategy to efficiently utilize extremely well phenotyped experimental designs that have been established in the past.

References

Andersson, L., 2009 Genome-wide association analysis in domestic animals: a powerful

approach for genetic dissection of trait loci. Genetica 136: 341–349.

https://doi.org/10.1007/s10709-008-9312-4

Ban, H.-J., J. Y. Heo, K.-S. Oh, and K.-J. Park, 2010 Identification of Type 2 Diabetes-

associated combination of SNPs using Support Vector Machine. BMC Genet. 11: 26.

https://doi.org/10.1186/1471-2156-11-26

Barker, H. E., G. K. Smyth, J. Wettenhall, T. A. Ward, M. L. Bath et al., 2008 Deaf-1 regulates

epithelial cell proliferation and side-branching in the mammary gland. BMC Dev. Biol.

8: 94. https://doi.org/10.1186/1471-213X-8-94

Berens, E. B., G. M. Sharif, M. O. Schmidt, G. Yan, A. Wellstein et al., 2017 Keratin-

associated protein 5-5 controls cytoskeletal function and cancer cell vascular invasion.

Oncogene 36: 593–605. https://doi.org/10.1038/onc.2016.234

Blaj, I., J. Tetens, S. Preuß, J. Bennewitz, and G. Thaller, 2018 Genome-wide association

studies and meta-analysis uncovers new candidate genes for growth and carcass traits

in pigs. PLoS One 13: e0205576. https://doi.org/10.1371/journal.pone.0205576

Borchers, N., N. Reinsch, and E. Kalm, 2000 Familial cases of coat colour-change in a Piétrain

cross. J. Anim. Breed. Genet. 117: 285–287. https://doi.org/10.1046/j.1439-0388.2000.00255.x

Brøndum, R. F., G. Su, L. Janss, G. Sahana, B. Guldbrandtsen et al., 2015 Quantitative trait

loci markers derived from whole genome sequence data increases the reliability of

genomic prediction. J. Dairy Sci. 98: 4107–4116. https://doi.org/10.3168/jds.2014-9005

Browning, B. L., Y. Zhou, and S. R. Browning, 2018 A One-Penny Imputed Genome from

Next-Generation Reference Panels. Am. J. Hum. Genet. 103: 338–348.

https://doi.org/10.1016/j.ajhg.2018.07.015

Browning, S. R., and B. L. Browning, 2007 Rapid and accurate haplotype phasing and

missing-data inference for whole-genome association studies by use of localized

haplotype clustering. Am. J. Hum. Genet. 81: 1084–1097.

https://doi.org/10.1086/521987

Chang, C. C., C. C. Chow, L. C. Tellier, S. Vattikuti, S. M. Purcell et al., 2015 Second-

generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4:

7. https://doi.org/10.1186/s13742-015-0047-8

Cohen-Zinder, M., E. Seroussi, D. M. Larkin, J. J. Loor, A. E. Wind et al., 2005 Identification

of a missense mutation in the bovine ABCG2 gene with a major effect on the QTL on

chromosome 6 affecting milk yield and composition in Holstein cattle. Genome Res.

15: 936–944. https://doi.org/10.1101/gr.3806705

Comuzzie, A. G., S. A. Cole, S. L. Laston, V. S. Voruganti, K. Haack et al., 2012 Novel

genetic loci identified for the pathophysiology of childhood obesity in the Hispanic

population. PLoS One 7: e51954. https://doi.org/10.1371/journal.pone.0051954

Cui, H., S. Yang, M. Zheng, R. Liu, G. Zhao et al., 2017 High-salt intake negatively regulates

fat deposition in mouse. Sci. Rep. 7: 2053. https://doi.org/10.1038/s41598-017-01560-3

Cui, Y., W. Chen, J. Chi, and L. Wang, 2016 Comparison of Transcriptome between Type 2

Diabetes Mellitus and Impaired Fasting Glucose. Med. Sci. Monit. 22: 4699–4706.

https://doi.org/10.12659/MSM.896772

Dalpé, G., M. Mathieu, A. Comtois, E. Zhu, S. Wasiak et al., 1999 Dystonin-deficient mice

exhibit an intrinsic muscle weakness and an instability of skeletal muscle

cytoarchitecture. Dev. Biol. 210: 367–380. https://doi.org/10.1006/dbio.1999.9263

Davenport, E. R., D. A. Cusanovich, K. Michelini, L. B. Barreiro, C. Ober et al., 2015

Genome-Wide Association Studies of the Human Gut Microbiota. PLoS One 10:

e0140301. https://doi.org/10.1371/journal.pone.0140301

Fan, Y., Y. Xing, Z. Zhang, H. Ai, Z. Ouyang et al., 2013 A further look at porcine

chromosome 7 reveals VRTN variants associated with vertebral number in Chinese and

Western pigs. PLoS One 8: e62534. https://doi.org/10.1371/journal.pone.0062534

Fujii, J., K. Otsu, F. Zorzato, S. de Leon, V. Khanna et al., 1991 Identification of a mutation

in porcine ryanodine receptor associated with malignant hyperthermia. Science 253:

448–451. https://doi.org/10.1126/science.1862346

Haller, T., T. Tasa, and A. Metspalu, 2019 Manhattan Harvester and Cropper: a system for

GWAS peak detection. BMC Bioinformatics 20: 22. https://doi.org/10.1186/s12859-019-2600-4

Hayes, B., and M. E. Goddard, 2001 The distribution of the effects of genes affecting

quantitative traits in livestock. Genetics, Selection, Evolution GSE 33: 209–229.

Hu, Z., C. A. Park, and J. M. Reecy, 2019 Building a livestock genetic and genomic

information knowledgebase through integrative developments of Animal QTLdb and

CorrDB. Nucleic Acids Res. 47: D701–D710. https://doi.org/10.1093/nar/gky1084

Huang da, W., B. T. Sherman, and R. A. Lempicki, 2009 Systematic and integrative analysis

of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 4: 44–57.

https://doi.org/10.1038/nprot.2008.211

Huang, D. W., B. T. Sherman, Q. Tan, J. R. Collins, W. G. Alvord et al., 2007 The DAVID

Gene Functional Classification Tool: a novel biological module-centric algorithm to

functionally analyze large gene lists. Genome Biol. 8: R183. https://doi.org/10.1186/gb-2007-8-9-r183

Knott, S. A., 2005 Regression-based quantitative trait loci mapping: robust, efficient and

effective. Philos. Trans. R. Soc. Lond. B Biol. Sci. 360: 1435–1442.

https://doi.org/10.1098/rstb.2005.1671

Koslowski, M., U. Sahin, R. Mitnacht-Kraus, G. Seitz, C. Huber et al., 2007 A placenta-

specific gene ectopically activated in many human cancers is essentially involved in

malignant cell processes. Cancer Res. 67: 9528–9534. https://doi.org/10.1158/0008-5472.CAN-07-1350

Krishnan, R., N. Boddapati, and S. Mahalingam, 2018 Interplay between human nucleolar

GNL1 and RPS20 is critical to modulate cell proliferation. Sci. Rep. 8: 11421.

https://doi.org/10.1038/s41598-018-29802-y

Li, G., X.-J. Wu, X.-Q. Kong, L. Wang, and X. Jin, 2015 Cytochrome c oxidase subunit VIIb

as a potential target in familial hypercholesterolemia by bioinformatical analysis. Eur.

Rev. Med. Pharmacol. Sci. 19: 4139–4145.

Li, H., B. Handsaker, A. Wysoker, T. Fennell, J. Ruan et al., 2009 The Sequence

Alignment/Map format and SAMtools. Bioinformatics 25: 2078–2079.

https://doi.org/10.1093/bioinformatics/btp352

Liu, Y., P. Zhan, Z. Zhou, Z. Xing, S. Zhu et al., 2016 The overexpression of KIFC1 was

associated with the proliferation and prognosis of non-small cell lung cancer. J. Thorac.

Dis. 8: 2911–2923. https://doi.org/10.21037/jtd.2016.10.67

McKenna, A., M. Hanna, E. Banks, A. Sivachenko, K. Cibulskis et al., 2010 The Genome

Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA

sequencing data. Genome Res. 20: 1297–1303. https://doi.org/10.1101/gr.107524.110

McLaren, W., L. Gil, S. E. Hunt, H. S. Riat, G. R. S. Ritchie et al., 2016 The Ensembl Variant

Effect Predictor. Genome Biol. 17: 122. https://doi.org/10.1186/s13059-016-0974-4

Nagamine, Y., C. S. Haley, A. Sewalem, and P. M. Visscher, 2003 Quantitative trait loci

variation for growth and obesity between and within lines of pigs (Sus scrofa). Genetics

164: 629–635.

Nezer, C., L. Moreau, B. Brouwers, W. Coppieters, J. Detilleux et al., 1999 An imprinted QTL

with major effect on muscle mass and fat deposition maps to the IGF2 locus in pigs.

Nat. Genet. 21: 155–156. https://doi.org/10.1038/5935

Ng, P. C., and S. Henikoff, 2003 SIFT: Predicting amino acid changes that affect protein

function. Nucleic Acids Res. 31: 3812–3814. https://doi.org/10.1093/nar/gkg509

Nguyen, D. T., K. Lee, H. Choi, M.-K. Choi, M. T. Le et al., 2012 The complete

swine olfactory subgenome: expansion of the olfactory gene repertoire in the pig

genome. BMC Genomics 13: 584. https://doi.org/10.1186/1471-2164-13-584

Ober, U., W. Huang, M. Magwire, M. Schlather, H. Simianer et al., 2015 Accounting for

genetic architecture improves sequence based genomic prediction for a Drosophila

fitness trait. PLoS One 10: e0126880. https://doi.org/10.1371/journal.pone.0126880

Porto-Neto, L. R., W. Barendse, J. M. Henshall, S. M. McWilliam, S. A. Lehnert et al., 2015

Genomic correlation: harnessing the benefit of combining two unrelated populations for

genomic selection. Genetics, Selection, Evolution GSE 47: 84.

Ramos, A. M., R. P. M. A. Crooijmans, N. A. Affara, A. J. Amaral, A. L. Archibald et al.,

2009 Design of a high density SNP genotyping assay in the pig using SNPs identified

and characterized by next generation sequencing technology. PLoS One 4: e6524.

https://doi.org/10.1371/journal.pone.0006524

Rohrer, G. A., D. J. Nonneman, R. T. Wiedmann, and J. F. Schneider, 2015 A study of vertebra

number in pigs confirms the association of vertnin and reveals additional QTL. BMC

Genet. 16: 129. https://doi.org/10.1186/s12863-015-0286-9

Rothschild, M. F., Z. Hu, and Z. Jiang, 2007 Advances in QTL mapping in pigs. Int. J. Biol.

Sci. 3: 192–197. https://doi.org/10.7150/ijbs.3.192

Rückert, C., and J. Bennewitz, 2010 Joint QTL analysis of three connected F2-crosses in pigs.

Genetics, Selection, Evolution GSE 42: 40.

Sánchez-Pozos, K., M. G. Ortíz-López, B. I. Peña-Espinoza, M. de Los Ángeles Granados-Silvestre, V. Jiménez-Jacinto et al., 2018 Whole-exome sequencing in Maya indigenous

families: variant in PPP1R3A is associated with type 2 diabetes. Molecular Genetics and

Genomics MGG 293: 1205–1216. https://doi.org/10.1007/s00438-018-1453-2

Schaefer, I.-M., P. Dal Cin, L. M. Landry, C. D. M. Fletcher, G. J. Hanna et al., 2018 CIC-

NUTM1 fusion: A case which expands the spectrum of NUT-rearranged epithelioid

malignancies. Genes Chromosomes Cancer 57: 446–451. https://doi.org/10.1002/gcc.3

Schmid, M., R. Wellmann, and J. Bennewitz, 2018 Power and precision of QTL mapping in

simulated multiple porcine F2 crosses using whole-genome sequence information. BMC

Genet. 19: 22. https://doi.org/10.1186/s12863-018-0604-0

Shah, T. M., N. V. Patel, A. B. Patel, M. R. Upadhyay, A. Mohapatra et al., 2016 A genome-wide

approach to screen for genetic variants in broilers (Gallus gallus) with divergent

feed conversion ratio. Molecular Genetics and Genomics MGG 291: 1715–1725.

Spranger, J., J. Gehler, and M. Cantz, 1977 Mucolipidosis I–a sialidosis. Am. J. Med. Genet.

1: 21–29. https://doi.org/10.1002/ajmg.1320010104

Stratz, P., M. Schmid, R. Wellmann, S. Preuß, I. Blaj et al., 2018 Linkage disequilibrium

pattern and genome-wide association mapping for meat traits in multiple porcine F2

crosses. Anim. Genet. 49: 403–412. https://doi.org/10.1111/age.12684

Stuible, M., N. Dubé, and M. L. Tremblay, 2008 PTP1B Regulates Cortactin Tyrosine

Phosphorylation by Targeting Tyr446. J. Biol. Chem. 283: 15740–15746.

https://doi.org/10.1074/jbc.M710534200

Umene, K., M. Yanokura, K. Banno, H. Irie, M. Adachi et al., 2015 Aurora kinase A has

a significant role as a therapeutic target and clinical biomarker in endometrial cancer.

Int. J. Oncol. 46: 1498–1506. https://doi.org/10.3892/ijo.2015.2842

van den Berg, I., D. Boichard, and M. S. Lund, 2016 Sequence variants selected from a multi-

breed GWAS can improve the reliability of genomic predictions in dairy cattle.

Genetics, Selection, Evolution GSE 48: 83.

Vimaleswaran, K. S., A. Cavadino, D. J. Berry, J. C. Whittaker, C. Power et al., 2013 Genetic

association analysis of vitamin D pathway with obesity traits. Int. J. Obes. 37: 1399–

1406. https://doi.org/10.1038/ijo.2013.6

Wang, L., P. Park, F. La Marca, K. Than, S. Rahman et al., 2013 Bone formation induced by

BMP-2 in human osteosarcoma cells. Int. J. Oncol. 43: 1095–1102.

https://doi.org/10.3892/ijo.2013.2030

Wellmann, R., S. Preuß, E. Tholen, J. Heinkel, K. Wimmers et al., 2013 Genomic selection

using low density marker panels with application to a sire line in pigs. Genetics,

Selection, Evolution GSE 45: 28.

Wozney, J. M., V. Rosen, A. J. Celeste, L. M. Mitsock, M. J. Whitters et al., 1988 Novel

regulators of bone formation: molecular clones and activities. Science 242: 1528–1534.

https://doi.org/10.1126/science.3201241

Yang, J., S. H. Lee, M. E. Goddard, and P. M. Visscher, 2011 GCTA: a tool for genome-wide

complex trait analysis. Am. J. Hum. Genet. 88: 76–82.

https://doi.org/10.1016/j.ajhg.2010.11.011

Yip, L., R. Fuhlbrigge, C. Taylor, R. J. Creusot, T. Nishikawa-Matsumura et al., 2015

Inflammation and hyperglycemia mediate Deaf1 splicing in the pancreatic lymph nodes

via distinct pathways during type 1 diabetes. Diabetes 64: 604–617.

https://doi.org/10.2337/db14-0803

Yoon, S., M. J. Molloy, M. P. Wu, D. B. Cowan, and E. Gussoni, 2007 C6ORF32 is

upregulated during muscle cell differentiation and induces the formation of cellular

filopodia. Dev. Biol. 301: 70–81. https://doi.org/10.1016/j.ydbio.2006.11.002

Zatkova, A., J.-M. Rouillard, W. Hartmann, B. J. Lamb, R. Kuick et al., 2004 Amplification

and overexpression of the IGF2 regulator PLAG1 in hepatoblastoma. Genes

Chromosomes Cancer 39: 126–137. https://doi.org/10.1002/gcc.10307

Zhang, L., X. Zhou, J. J. Michal, B. Ding, R. Li et al., 2014 Genome Wide Screening of

Candidate Genes for Improving Piglet Birth Weight Using High and Low Estimated

Breeding Value Populations. Int. J. Biol. Sci. 10: 236–244.

https://doi.org/10.7150/ijbs.7744

Zhou, Z.-Y., A. Li, N. O. Otecko, Y.-H. Liu, D. M. Irwin et al., 2017 PigVar: a database of pig

variations and positive selection signatures. Database: the journal of biological databases

and curation 2017.

Chapter 4

Recombination landscape in multiple F2 pig crosses between

genetically diverse founder breeds

Iulia Blaj¹, Jens Tetens², Robin Wellmann³, Siegfried Preuß³, Jörn Bennewitz³ and Georg

Thaller¹

¹ Institute of Animal Breeding and Husbandry, Kiel University, Kiel, Germany

² Functional Breeding Group, Department of Animal Sciences, Göttingen University,

Göttingen, Germany

³ Institute of Animal Husbandry and Breeding, University of Hohenheim, Stuttgart, Germany

Published in Proceedings of the 11th World Congress of

Genetics Applied to Livestock Production

Summary

In the present study, the recombination landscape of multiple pig F2 pedigrees is evaluated in

detail. Three of the pedigrees under investigation were generated from distantly related founder

breeds (Wild boar, Piétrain and Meishan), and the fourth pedigree originated from closely related

founder breeds (Piétrain and Large white or crossbred Large white × Landrace sows).

Recombination rates and genetic maps were estimated from SNP chip data using marker

positions according to the current pig reference genome. The level of recombination events

varies within crosses, among breeds and individuals as well as across chromosomes or regions

within chromosomes. Although we observed a substantial heterogeneity in the pedigrees,

certain patterns specific to crosses, sex or chromosomes were identified. These patterns depend

on the extent of conservation of the local rate of recombination over time, on the levels of

diversity, efficiency or direction of selection, and the genome composition. The current findings

are expected to have practical consequences for the genetic mapping of traits in pigs.

Keywords: recombination landscape, domestication, crossover inference, pig F2 cross

Introduction

Recombination shapes the genomic architecture of organisms by producing new genetic

combinations every generation. This process of shuffling is the major source of genetic

variability upon which selection can operate in a natural or artificial manner. Among the various

domesticated pig breeds, of either European or Asian origin, selection has acted mostly on very

different traits, thus specifically altering the genome landscape. In the present study, we

investigate aspects related to recombination rate (RR) and genetic maps in four pig populations

stemming from the following European and Asian founder breeds: Piétrain (P), Large white

(Lw), Landrace (L) and Meishan (M), as well as their wild ancestor: the Wild boar (W). The

aim of the investigation was threefold:

1. Estimate a high-density recombination map of the pig based on the new reference genome;

2. Evaluate cross, sex and chromosome specific differences;

3. Assess male specific rates and genetic map lengths.

Material and methods

Experimental populations

Four pedigrees were included in the analysis (Table 1). The population PxLwL/Lw was

generated from closely related founders (Boysen et al., 2010) and the other three originate from

distantly related founder breeds (Rückert and Bennewitz, 2010). The F2 individuals and the

respective F1 and F0 ancestors were genotyped with the Illumina PorcineSNP60 BeadChip. SNP

chromosomal positions were based on the current pig genome assembly (Sscrofa 11.1).

Genotype filtering was done using Illumina GenomeStudio software and Plink (Purcell et al.,

2007). Autosomal SNPs were further used and their number was on average 43K, except for

the WxM data set which contained 37K SNPs. The statistical analysis was conducted in R

(R Core Team, 2014).

Table 1. Description of the study designs.

Design      F0 males        F0 females     F1 males   F1 females   F2
PxLwL/Lw    5 Piétrain      8 LwL/Lw¹      8          88           1785
MxP         1 Meishan       8 Piétrain     3          19           304
WxP         1 Wild boar²    9 Piétrain     2          26           291
WxM         1 Wild boar²    4 Meishan      2          21           312

¹ Large white x Landrace/Large white; ² the same Wild boar founder

Haplotype reconstruction and inference of crossover events

The autosomal recombination events were inferred using LinkPhase3 software (Druet and

Georges, 2015) which performs phasing based on Mendelian segregation rules. The crossover

events (CO) are further identified as phase switches observed in the gametes. The output

consists of the crossover calls in each parent-child pair (F1-F2) and the genomic interval for

which the inference is made. Double CO occurring in windows of 1 Mb, CO intervals bigger

than half of the chromosome length and recombination fractions larger than 0.05 in a 1 Mb

window were ignored. Moreover, only chromosomes with a maximum of 4 CO events were

further considered. Recombination fractions were estimated for every non-overlapping 1 Mb

window and converted into centiMorgans (cM) using the Haldane mapping function. For each

experimental population we calculated a sex-averaged, a female and a male recombination rate

and map. Additionally, individual genetic maps for 14 F1 males with more than one hundred

meioses were constructed. Sex differences in the recombination rate distribution were evaluated

chromosome-wise using the Kolmogorov-Smirnov test (KS test). Correlations between cross

specific, sex specific and male specific recombination rates were tested using Pearson’s

correlation coefficient at a chromosomal level as well as at a genome-wide level.
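
The conversion step described above can be sketched as follows. This is a minimal illustration, assuming the standard Haldane mapping function d = -50 ln(1 - 2r) (d in cM, r the recombination fraction) applied to each non-overlapping 1 Mb window. The function names and the example window values are hypothetical and are not taken from the analysis pipeline.

```python
import math

# Windows with a recombination fraction above this value were ignored in the text.
MAX_FRACTION = 0.05


def haldane_cm(r):
    """Convert a recombination fraction r (0 <= r < 0.5) into centiMorgans."""
    if not 0.0 <= r < 0.5:
        raise ValueError("recombination fraction must lie in [0, 0.5)")
    return -50.0 * math.log(1.0 - 2.0 * r)


def map_length_cm(window_fractions):
    """Sum Haldane distances over non-overlapping windows, skipping filtered ones."""
    kept = [r for r in window_fractions if r <= MAX_FRACTION]
    return sum(haldane_cm(r) for r in kept)


# Hypothetical recombination fractions for five 1 Mb windows (illustrative only).
windows = [0.004, 0.010, 0.0, 0.060, 0.008]
print(round(map_length_cm(windows), 3))
```

For small fractions the Haldane distance is close to 100·r, so a window with r = 0.01 contributes roughly 1 cM.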

Results and discussion

Recombination rates and maps

The crossover events inference for the F1 individuals revealed an average of 1.03 CO per

chromosome in PxLwL/Lw, 1.13 in MxP, 1.12 in WxP and 1.08 in WxM. A higher number of

events was identified in the females as compared to the males which led to sex specific

differences in the recombination rates and as a consequence in the linkage map (Table 2). The

longest genetic maps found were 1953 cM for the MxP and 1920 cM for the WxP cross,

respectively. In general, the sex-averaged, female and male genetic maps were shorter than

those previously published (Tortereau et al., 2012; Guo et al., 2009; Rohrer et al., 1996).

Furthermore, the estimated recombination rates were higher than reported by Tortereau et al.

while using similar pedigrees and Sscrofa 10.2 as reference. Therefore, we ran our analysis

pipeline for the four study designs using the SNP positions according to Sscrofa 10.2. CO

filtering acted more stringently and, together with the overestimation of the physical length of the

previous assembly led to lower RR and longer linkage maps. Thus, higher RR estimates and

shorter maps can be mainly attributed to the fact that we used the latest reference genome.

Table 2. Characteristics of sex-averaged, female and male linkage maps.

            Sex-averaged                Female                      Male
Design      Linkage map (cM)   cM/Mb    Linkage map (cM)   cM/Mb    Linkage map (cM)   cM/Mb
PxLwL/Lw    1740               0.87     1952               1        1553               0.75
MxP         1953               0.96     2068               1.05     1643               0.79
WxP         1920               0.95     1905               0.97     1731               0.82
WxM         1864               0.92     1977               1.01     1591               0.75

The longest chromosome was SSC6 for PxLwL/Lw (120.39 cM), for MxP (149.84 cM) and for

WxM (137.49 cM). However, in the WxP cross, the longest was SSC1 with 146.64 cM. With

respect to map size and recombination rates, in general, female rates were higher. Nonetheless,

there are exceptions as previously described in other studies, specifically on SSC1 and SSC13

for which male maps and rates exceed the female ones. Additionally, we identified a

similar behaviour for other chromosomes in WxP (for SSC14 and SSC15) and in WxM (for

SSC15), i.e. in the crosses stemming from the Wild boar.

Recombination rate distributions were compared between males and

females, stratified by chromosome, using the KS test. For SSC1, SSC3, SSC13 and SSC18 the

distributions were consistently different in all four pedigrees, whereas for SSC5 and SSC11 the

RR came from the same distribution. Several factors such as chromosome length, number of

CO events identified in each sex, centromere position and particularities of the genomic regions

may account for the observed chromosomal sex differences.
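
To illustrate what the KS comparison measures, the following minimal sketch computes the two-sample KS statistic, i.e. the largest vertical gap between the empirical distribution functions of two samples. It does not compute a p-value; in practice a library routine such as R's ks.test would be used for that. The function name and the window values are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # The gap can only change at observed values, so checking those is sufficient.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Hypothetical cM/Mb values per 1 Mb window on one chromosome (illustrative only).
female_rr = [0.2, 0.9, 1.1, 1.4, 2.0, 2.3]
male_rr = [0.1, 0.3, 0.4, 0.8, 1.0, 2.5]
print(ks_statistic(female_rr, male_rr))
```

A statistic near 0 indicates nearly identical distributions, whereas a value of 1 means the two samples do not overlap at all.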

Male specific differences

Different genetic lengths and recombination rates were observed for females and males in the

four pedigrees. In each of the crosses the male chromosomal genetic maps exhibited similar

lengths, except for SSC1, SSC2 and SSC14 (Figure 1). The observed differences are due to

sequence variation and different informative markers within the males used.

Figure 1. Male-averaged chromosomal genetic maps for the four pedigrees.

Individual recombination rates and maps were estimated for the F1 boars for which we had more

than one hundred meioses available. The correlation of the RR at a genome level among the

eight boars in PxLwL/Lw varied between 0.51 and 0.62. For the two boars in WxP we

calculated a 0.55 correlation coefficient while between the males in MxP and WxP the

correlation was 0.42 for both. The WxP and WxM crosses stemmed from the same Wild boar

founder male therefore we also assessed the similarity among the four F1 boars. We found an

average correlation of 0.41 suggesting that the crossing with the female Piétrain and Meishan

founders reshaped the recombination landscape at a genome level. Regarding the overall

genetic map length, the shortest map was observed in one male from the European breed cross

(1461 cM) and the longest map was recorded in one of the F1 boars in WxP cross (1719 cM)

implying the individual male differences can be substantial.
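
The pairwise comparisons above rest on Pearson's correlation coefficient between window-wise recombination rates of two individuals. A minimal sketch, assuming both boars are described by rates over the same ordered set of 1 Mb windows; the function name and the values are hypothetical:

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long sequences of window rates."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical cM/Mb values for two F1 boars over the same six windows.
boar_1 = [0.2, 0.8, 1.5, 0.4, 2.1, 0.9]
boar_2 = [0.3, 0.6, 1.1, 0.7, 1.8, 1.2]
print(round(pearson_r(boar_1, boar_2), 2))
```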

Conclusion

We report in this study the first, to our knowledge, recombination map of the porcine genome

based on the newest reference genome assembly, with more precise localization of crossover

events and a broad coverage of segregating variation due to the various founder breeds. The

study aims to contribute to the recombination picture in the Sus scrofa population, to

understand how domestication and breed development impacted the recombination landscape

and, last but not least, to assist in the genetic mapping of relevant traits.

Acknowledgements

The study was funded by the Deutsche Forschungsgemeinschaft, DFG.

References

Boysen, T. J., Tetens, J., & Thaller, G. (2010). Detection of a quantitative trait locus for ham

weight with polar overdominance near the ortholog of the callipyge locus in an

experimental pig F2 population. Journal of animal science 88, 3167-3172.

Druet, T., & Georges, M. (2015). LINKPHASE3: an improved pedigree-based phasing

algorithm robust to genotyping and map errors. Bioinformatics, 31(10), 1677-1679.

Guo, Y., Mao, H., Ren, J., Yan, X., Duan, Y., Yang, G., ... & Brenig, B. (2009). A linkage map

of the porcine genome from a large‐scale White Duroc×Erhualian resource population

and evaluation of factors affecting recombination rates. Animal genetics, 40(1), 47-52.

Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A., Bender, D., ... & Sham, P.

C. (2007). PLINK: a tool set for whole-genome association and population-based

linkage analyses. The American Journal of Human Genetics, 81(3), 559-575.

Rückert, C., & Bennewitz, J. (2010). Joint QTL analysis of three connected F2-crosses in pigs.

Genetics Selection Evolution, 42, 40.

Rohrer, G. A., Alexander, L. J., Hu, Z., Smith, T. P., Keele, J. W., & Beattie, C. W. (1996). A

comprehensive map of the porcine genome. Genome research, 6(5), 371-391.

R Core Team (2014). R: A language and environment for statistical computing.

Vienna, Austria: R Foundation for Statistical Computing.

Tortereau, F., Servin, B., Frantz, L., Megens, H. J., Milan, D., Rohrer, G., ... & Groenen, M. A.

(2012). A high-density recombination map of the pig reveals a correlation between sex-

specific recombination and GC content. BMC genomics, 13(1), 586.

Chapter 5

A systematic survey of sequence data variation in the founder

individuals of four F2 pig crosses

Iulia Blaj and Georg Thaller

Institute of Animal Breeding and Husbandry, Kiel University, Kiel, Germany

Manuscript in preparation

Abstract

We propose a reverse genetics approach for exploiting information in F2 resource populations

by surveying the sequence data variation that exists in the F0 generation. The rationale

is that almost all the variation that exists in F2 individuals (used normally for genome wide

association studies) is propagated from the founder generation. The panel of animals

consists of 14 Piétrain, 7 crossbred Large white x Landrace sows, 1 Large white sow, 1 Meishan

boar and 1 Wild boar. We explore two approaches. The first one considers the pooled variation

while the second considers the breed specific variation. The Meishan individual sets itself apart from the

rest of the European individuals with a high number of unique variants involved in various vital

biological processes. The two European pig groups (Piétrain group and Large white x Landrace

and Large white group) display breed specific variation in their olfactory receptor gene family.

Finally, a gene-based survey found that variation of high impact effect occurred in genes

relevant to domestication and breeding (e.g. KIT, IGF2, PRKAG3, and PRLR) which can be of

further interest.

Introduction

Associating phenotypic variation to genotypic variation has been an ongoing objective in pig

research from both a breeding and a biomedical perspective. One approach to establish this

association is by analyzing experimental populations. Common population structures are F2

crosses usually obtained by mating genetically divergent lineages (Geldermann et al., 1996),

but also from closely related founders (Borchers et al., 2000). This classical setting was

explored in the past using sparse genomic information (e.g. microsatellites) and linkage

mapping (Ernst and Steibel, 2013). Advancing into the genomics era, the now common usage

of the SNP chip array (Ramos et al. 2009) was a pivotal moment with major implications for

increasing resolution in quantitative trait loci (QTL) mapping experiments and the successful

implementation of genomic selection (Meuwissen et al., 2001). The resolution can be further

increased with the availability of whole genome sequence (WGS) data, information that opens

many opportunities to find causative variants influencing traits of interest.

QTL mapping was successfully employed in the past in the four F2 crosses considered in this

study and a number of QTLs and candidate genes were pinpointed for various traits (Boysen

et al., 2011; Rückert and Bennewitz, 2010; Stratz et al., 2018; Blaj et al., 2018). This forward

genetics approach usually implies coupling the F2 generation genotypes with the phenotypes

via linkage mapping or genome wide association studies. This study, in contrast, explores the

feasibility of employing a reverse genetics approach, a scenario in which we survey the genomic

variation from WGS of the founder (F0) individuals. This variation from the diverse panel of

F0 individuals (European Wild boar, European and Asian breeds) is assessed systematically

using bioinformatics tools via a pooled and a breed based approach.

Material and methods

Founder individuals

Whole genome sequence data was available for 24 founder individuals from four F2 resource

populations (Falker-Gieske et al., 2019). The designs are described in detail by Geldermann et

al. (1996) and Borchers et al. (2000). A brief description of the crosses and founders is shown

in Table 1. The panel of animals consists of 14 Piétrain (5 males from D1 and 9 females used

in D2 and D3), 7 crossbred Large white x Landrace sows (D1), 1 Large white sow (D1), 1

Meishan boar (D2) and 1 Wild boar (D3 and D4).

Table 1. Description of the resource populations.

Cross             WGS founders*   Sample IDs                                     ∑ F2
D1  P x LwL/Lw    13 (13)         P: 10345, 17118, 17123, 17161, 17165           1785
                                  LwL: 662, 690, 693, 735, 750, 756, 771
                                  Lw: 728
D2  M x P         8 (9)           M: M199                                        312
                                  P: P102, P107, P108, P113, P119, P130, P244
D3  W x P         6 (10)          W: P181                                        300
                                  P: P102, P108, P113, P115, P128
D4  W x M         1 (5)           M: M199                                        304

*Four founders are shared among the crosses; the total number of founders (∑) is given in

brackets. P = Piétrain, Lw = Large white, L = Landrace, M = Meishan, W = Wild boar.

Population analysis and heterozygosity levels

With the aim to infer population structure, we performed principal component analysis (PCA)

in PLINK (Purcell et al., 2007) using the multisample vcf file containing variants for all 24

individuals. PC1 was plotted against PC2 using the R package ggplot2 (R Core Team, 2008;

Wickham, 2009). Observed (Ho) and expected heterozygosity (He) were estimated using the

PLINK option --het, from which the inbreeding coefficient of each individual was calculated.
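
As a minimal sketch of the last step: the PLINK heterozygosity report lists, per individual, the observed and expected numbers of homozygous genotypes and the number of non-missing sites, and the coefficient is F = (O(HOM) - E(HOM)) / (N - E(HOM)). The function name and the example counts below are hypothetical.

```python
def inbreeding_f(observed_hom, expected_hom, n_sites):
    """Method-of-moments inbreeding coefficient from homozygote counts.

    observed_hom: observed number of homozygous genotypes, O(HOM)
    expected_hom: number expected from allele frequencies, E(HOM)
    n_sites:      number of non-missing genotypes, N
    """
    denominator = n_sites - expected_hom
    if denominator <= 0:
        raise ValueError("expected homozygotes must be fewer than sites")
    return (observed_hom - expected_hom) / denominator


# Hypothetical counts: fewer homozygotes than expected gives a negative F.
print(round(inbreeding_f(600, 700, 1000), 3))
```

A negative coefficient therefore indicates an excess of heterozygous genotypes relative to the expectation in the analysed sample, and a positive one an excess of homozygous genotypes.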

Functional effect prediction and gene set analysis

To predict the coding effects of genetic variation (i.e. SNPs, indels) on genes, transcripts,

protein sequence and regulatory elements, we employed the SnpEff tool (Cingolani et al. 2012).

The database containing the genomic annotations for the latest reference Sscrofa 11.1

(GCA_000003025.6 provided by Swine Genome Sequencing Consortium on NCBI) was built

and utilized for effect prediction. We used two approaches: i) pooled based (using collectively

the variants from all 24 animals) and ii) breed based. For the latter, we prioritized the private

doubletons (i.e. variants where the minor allele only occurs in a single individual and that

individual is homozygous for that allele) retrieved using the --singletons option of vcftools

(Danecek et al., 2011).
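
The definition of a private doubleton can be made concrete with a short sketch. It assumes a biallelic site with no missing genotypes, encoded as two alleles per sample (0 = reference, 1 = alternative). The function name is hypothetical and the genotypes are invented for illustration; this is not the vcftools implementation.

```python
def classify_rare_variant(genotypes):
    """Classify one biallelic site from a {sample: (allele, allele)} mapping.

    Returns ("singleton", sample), ("private_doubleton", sample) or None.
    """
    counts = {0: 0, 1: 0}
    for alleles in genotypes.values():
        for allele in alleles:
            counts[allele] += 1
    # The minor allele is the one with fewer copies across all samples.
    minor = 0 if counts[0] < counts[1] else 1
    carriers = [s for s, alleles in genotypes.items() if minor in alleles]
    if counts[minor] == 1:
        return ("singleton", carriers[0])
    if counts[minor] == 2 and len(carriers) == 1:
        # Both copies sit in one homozygous individual.
        return ("private_doubleton", carriers[0])
    return None


# Invented genotypes for three samples at one site (illustrative only).
site = {"M199": (1, 1), "P102": (0, 0), "P181": (0, 0)}
print(classify_rare_variant(site))
```

Two copies of the minor allele carried by two different heterozygous individuals do not qualify, because the allele is then no longer private to one animal.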

In the pooled approach, we focused on high, moderate, and low impact variants (a detailed

explanation is given in the Results section, Table 3). The genes affected by these variants were

used for a gene set analysis using the ShinyGO Gene Ontology Enrichment Analysis tool (Ge

and Jung, 2018) according to biological processes (BP). The same tool provides graphical

visualization of the relationships among the enriched BP terms via hierarchical clustering tree

view. For all gene set analyses, we used a false discovery rate (FDR) cutoff of

0.05.
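
The text does not spell out how the enrichment tool derives its FDR values. The sketch below shows one common scheme, assumed here purely for illustration: a hypergeometric over-representation test per GO term, followed by Benjamini-Hochberg adjustment of the p-values. All function names and numbers are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k): overlap of at least k when a list of n genes is drawn
    from N background genes, K of which are annotated to the term."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return hits / total


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical example: four GO terms tested against a list of 50 genes.
raw = [hypergeom_upper_tail(k, 50, K, 2000) for k, K in [(8, 40), (3, 60), (2, 30), (1, 100)]]
significant = [q <= 0.05 for q in benjamini_hochberg(raw)]
print(significant)
```

A term is reported as enriched when its adjusted value does not exceed the chosen cutoff.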

The breed based approach relied on selecting exclusive doubletons from each individual. These

variants, as well as their associated genes and effects (according to SnpEff tool) were grouped

into four breed categories: Piétrain, Large white x Landrace or Large white, Meishan and Wild

boar. Within each dataset, the genes affected by high impact variants were considered for a gene set

analysis with ShinyGO as described above.

A comprehensive list incorporating genes associated with domestication, carcass composition,

reproduction, meat and fat quality traits was retrieved from the literature (Rothschild and

Ruvinsky, 2011; Groenen, 2016). We further evaluated the pooled approach SnpEff output for

the type of variation that exists within these genes, considered the severity of their predicted

variant effects and ultimately focused on high impact SNPs and indels.

Results

A total of ca. 33 Million (M) genomic variants were cumulatively available for the 24 samples

from which ca. 27M were SNPs and ca. 6M indels. The percentage breakdown per animal

according to the variant type is shown in Figure 1. Individual genomic inbreeding coefficients

derived from homozygous and heterozygous variants ranged from -0.71 for the Meishan to

0.27 for the Wild boar (Table SM1). We used PCA to identify the main axes of variance within

the data set (Figure 2). PC1 explained almost a quarter of the variance (22.12%) while PC2

captured another 10.01%.

Figure 1. Individual genome variant composition. nRefHom = percentage of variants that

are reference homozygous; nNonRefHom = percentage of variants that are non-reference

homozygous; nHets = percentage of heterozygous variants; nIndels = percentage of indels.

Figure 2. Principal component analysis. PC1 22.12% versus PC2 10.01% (Lw = Large white,

LwL = Large white x Landrace, M = Meishan, P = Piétrain, W = Wild boar).

[Figure 1 plot: stacked bars (0% to 100%) showing nRefHom, nNonRefHom, nHets and nIndels for each sample: 662, 690, 693, 728, 735, 750, 756, 771, P102, P107, P108, P113, P115, P119, P128, P130, P244, 10345, 17118, 17123, 17161, 17165, M199, P181.]

Variant analysis

We evaluated SNPs and indels from all samples in the pooled approach. From this large scale

analysis, according to variant effect, the majority were located in intronic (56.5%) and

intergenic regions (31.7%). A detailed summary containing all effect types, count and

percentage is presented in Table 2. A further classification was made based on the severity of

the variants’ impact at the transcript and protein level. While more than 99% had a modifier

impact, 0.03%, 0.22% and 0.48% had a high, moderate or low impact, respectively (Table 3).

Interpretation of large gene lists was accomplished through enrichment analysis. In the pooled

strategy, the genes containing high, moderate and low impact variants were the input for

ShinyGO. This list contained 75% of the genes annotated on the current reference genome

Sscrofa 11.1 (total number of genes is 25,880, Ensembl Genes (Zerbino et al., 2018)). Due to

the large dataset used for gene enrichment analysis, we prioritized the top 30 GO terms

(more than 500 GO terms enriched) related to the biological processes (Figure 3 and Table

SM2). The most significant GO terms were Localization (GO: 0051179), Cellular component

organization or biogenesis (GO: 0071840), Cellular component organization (GO: 0016043),

Positive regulation of biological process (GO: 0048518) and Developmental process (GO:

0032502).

The breed specific analysis yielded four data sets based on the exclusive doubletons found in

each animal. The genes in which the high, moderate or low impact doubleton resided were used

as gene sets for enrichment analysis. For Meishan, 2,283 genes were identified which were

enriched at 240 GO terms. From the Top 30 terms (Table SM3), a hierarchical tree was obtained

(Figure 4). For the Piétrain group, 16 GO terms were significant from an input of 330 genes

(Figure 5 and Table SM4). For Large white x Landrace or Large white group with a gene list

of 272, 12 GO terms were identified (Table SM5, tree not shown due to high similarity to the

Piétrain group). For the latter two groups, many of the genes were involved in biological

processes related to Response to chemical (GO: 0042221) and Sensory perception (GO:

0007600). No enriched terms were found for the Wild boar gene list.

The list of preselected genes explored yielded eight genes on which high impact variants have

a disruptive impact on protein formation (Table 4). The genes were related to the following

traits: two coat color genes (KIT and MC1R), two growth related genes (IGF2 and PRKAG3),

one early domestication gene (SMOC2), one gene for pH and meat color (LDHA), one reproduction

associated gene (PRLR) and one gene related to the vertebrae number (VRTN). The types of

effects were frameshift, splice acceptor and splice donor variants, mostly

resulting from indel type genomic variation.

Table 2. Number of effects by type from SnpEff (for pooled data).

Type Count Percent

3_prime_UTR_variant 580,514 0.888

5_prime_UTR_premature_start_codon_gain_variant 13,918 0.021

5_prime_UTR_truncation 1 0

5_prime_UTR_variant 108,000 0.165

bidirectional_gene_fusion 102 0

conservative_inframe_deletion 926 0.001

conservative_inframe_insertion 1,183 0.002

disruptive_inframe_deletion 1,648 0.003

disruptive_inframe_insertion 1,285 0.002

downstream_gene_variant 3,281,085 5.02

exon_loss_variant 9 0

frameshift_variant 12,132 0.019

gene_fusion 30 0

initiator_codon_variant 25 0

intergenic_region 20,747,527 31.745

intragenic_variant 3 0

intron_variant 36,905,738 56.469

missense_variant 140,085 0.214

non_canonical_start_codon 1 0

non_coding_transcript_exon_variant 18,757 0.029

non_coding_transcript_variant 317 0

splice_acceptor_variant 1,766 0.003

splice_donor_variant 2,433 0.004

splice_region_variant 64,794 0.099

start_lost 370 0.001

stop_gained 1,622 0.002

stop_lost 309 0

stop_retained_variant 130 0

synonymous_variant 247,209 0.378

transcript_ablation 2 0

upstream_gene_variant 3,224,049 4.933

Table 3. Number of effects by impact from SnpEff (for pooled data).

Impact     Definition*                                                     Count        Percent
HIGH       The variant is assumed to have high (disruptive) impact in      17,746       0.027
           the protein, probably causing protein truncation, loss of
           function or triggering nonsense mediated decay.
MODERATE   A non-disruptive variant that might change protein              145,006      0.222
           effectiveness.
LOW        Assumed to be mostly harmless or unlikely to change protein     315,545      0.483
           behavior.
MODIFIER   Usually non-coding variants or variants affecting non-coding    64,810,011   99.267
           genes, where predictions are difficult or there is no
           evidence of impact.

*According to SnpEff tool documentation (Cingolani et al. 2012)
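
The percentages in Table 3 follow directly from the effect counts. As a quick consistency check, the sketch below recomputes them from the four counts reported in the table; the variable names are illustrative.

```python
# Effect counts by impact category, as reported in Table 3.
impact_counts = {
    "HIGH": 17_746,
    "MODERATE": 145_006,
    "LOW": 315_545,
    "MODIFIER": 64_810_011,
}

total_effects = sum(impact_counts.values())

# Share of each category in percent, rounded to three decimals as in the table.
impact_percent = {
    impact: round(100.0 * count / total_effects, 3)
    for impact, count in impact_counts.items()
}

for impact, pct in impact_percent.items():
    print(f"{impact:<9}{impact_counts[impact]:>12,}{pct:>9.3f}")
```

The four counts sum to 65,288,308 predicted effects, which matches the statement that more than 60 million effects were predicted.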

Figure 3. Pooled analysis: Hierarchical clustering of Top 30 GO Biological Process terms

from the enrichment analysis (genes containing high, moderate and low impact variants).

Figure 4. Meishan breed specific analysis: Hierarchical clustering of Top 30 GO Biological

Process terms from the enrichment analysis (genes containing high, moderate and low impact

doubleton variants).

Figure 5. Piétrain breed specific analysis: Hierarchical clustering of GO Biological Process

terms from the enrichment analysis (genes containing high, moderate and low impact doubleton

variants).

Table 4. Domestication or breeding related genes retrieved from the literature with high

impact effects variants in the founders.

Gene     Transcript ID        High impact   Type of effects            Type of variants     Trait*
                              variants
KIT      ENSSSCT00000009679   4             frameshift variant,        1 SNP and 2 indels   Coat color
         ENSSSCT00000062378                 splice acceptor variant,
                                            splice donor variant
MC1R     ENSSSCT00000022534   1             frameshift variant         1 indel              Coat color
IGF2     ENSSSCT00000039341   21            frameshift variant,        1 SNP and 4 indels   Growth and fat
         ENSSSCT00000044712                 splice donor variant                            deposition
         ENSSSCT00000049151
PRKAG3   ENSSSCT00000017641   3             frameshift variant,        2 SNPs and 3 indels  Lean growth
         ENSSSCT00000036402                 splice acceptor variant
SMOC2    ENSSSCT00000004437   1             frameshift variant         1 indel              Initial stages of
                                                                                            domestication
LDHA     ENSSSCT00000046190   1             frameshift variant         1 indel              pH and meat colour
PRLR     ENSSSCT00000018325   4             splice acceptor variant,   2 indels             Reproduction
         ENSSSCT00000036206                 splice donor variant
VRTN     ENSSSCT00000002625   1             frameshift variant         1 indel              Vertebrae number

*According to literature review (Groenen, 2016; Rothschild and Ruvinsky, 2011)

Discussion

In this paper, we evaluated the genomic variants from WGS data generated for the founder

individuals of four F2 crosses. The panel of 24 animals consists of diverse pig breeds (European

and Asian) as well as the European ancestor (the European Wild boar).

The amount of genomic variation encountered depended on the origin of the individuals.

Having representatives from both European and Asian lineages, which have been

geographically separated for more than one million years (Frantz et al., 2013), leads to specific

individual levels of variation. The Meishan individual is the most genetically distinct (Figure

2) and has the highest number of variants (Figure 1). On the one hand, this is a result of the

breed evolving from the Asian Wild boar that is known to be fixed for the alternative allele at

over one million locations compared with its European counterpart (Groenen et al., 2012). On

the other hand, this is also a consequence of using the reference genome assembly Sscrofa 11.1,

which is from a Duroc individual, thus a European breed. Interestingly, we observe a slightly

higher amount of variation in Piétrain, Large white and the crossbred sows (expected due to

the increased heterozygosity levels in these seven crossbred individuals) than in the

Wild boar. Two types of events, occurring in the last centuries, explain the observation of

higher genetic diversity in European breeds than in their ancestor: i) domestic pigs

hybridized with local wild populations due to farming practices (Zeder et al., 2006) and ii)

human driven introgression of Asian domestic pigs into the European stocks (Bosse et al.,

2014).

Given the high genomic variation usually encountered in WGS data,

prioritizing relevant variants can be challenging. In the pooled based analysis, more

than 60 million effects were predicted, of which less than 1% had a disruptive effect at the

protein level. Nevertheless, the count of these effects was still high, with 17,746 effects residing

in 5,026 genes. At the top of the list, with more than 20 high impact variants each, were

ENSSSCG00000001229 (patr class I histocompatibility antigen, A-126 alpha chain-like, a

member of the major histocompatibility complex MHC), ENSSSCG00000031998 (olfactory

guanylyl cyclase GC-D-like) and ENSSSCG00000038461 (taste receptor type 2 member 20-

like). Genes related to immunity, from the MHC complex, and genes related to sensory

perception are families known to be actively evolving and expanding in pigs (Groenen et al.,

2012), thus harboring a high amount of variation. Gene set analysis (which translates gene lists

into enriched functions) used in the pooled strategy proved rather unspecific in its output

because 75% of the entire Sus scrofa genes were included. Nevertheless, the most significantly

enriched GO terms, represented by more than 4,000 genes, were related to fundamental or basic

biological processes such as localization (GO: 0051179) and development (GO: 0032502).

We observed an increase in the level of specificity with the breed based approach, where the

focus is based on groups of animals representing the European commercial breeds, the

European Wild boar, or the Meishan breed. The genes containing the exclusive variation found

in the Meishan individual clustered in GO terms related to cell adhesion, localization,

developmental processes, anatomical structure morphogenesis, nervous system development,

animal organ development, and others (Figure 4 and Table SM3). This supports further the idea

that the Asian pigs have many fixed variants that are part of vital biological functions.

The private variation for the Piétrain and Large white x Landrace or Large white groups was

mostly related to olfactory receptor genes (Figure 5, Table SM4 and SM5). The swine olfactory

98

subgenome is the largest gene superfamily and it includes 1,113 functional olfactory receptor

genes and 188 pseudogenes based on Sscrofa 10.2 (Dinh Truong Nguyen et al., 2012). We

compare the exclusive variation, its associated genes, and the enriched biological processes in

the two European Breed groups and, while a hand full were common, the majority of the

olfactory receptor genes were breed specific.

Finally, we explored the variation from specific genes previously associated with

domestication, production and reproduction traits (Rothschild and Ruvinsky, 2011; Groenen,

2016). Four mutations had a high-impact effect on KIT, a gene related to coat color and known to

display extensive genetic heterogeneity (Fontanesi et al., 2010). For IGF2 (Jeon et al., 1999;

Nezer et al., 1999), we detected 21 high-impact variants, which could be of further interest as

different phenotypes have been associated with this gene in these crosses (Boysen et al., 2011;

Blaj et al., 2018). In addition, it is worth noting that many of the high-impact variants listed

in Table 4 are actually indels, which are commonly disregarded in genome-wide association

studies, even though they can cause a disruption at the protein level.

Conclusion

The reverse genetics approach used in this study relied on an exploratory analysis conducted

using sequence data for a panel of founder individuals of F2 populations. The effect prediction

tools and gene enrichment analysis are powerful instruments in the genomics area. The

complexity of deciphering the functional implications of genomic variations can be dissected

with such tools and can assist in providing further directions of research.

References

Blaj, I., Tetens, J., Preuß, S., Bennewitz, J., & Thaller, G. (2018). Genome-wide association

studies and meta-analysis uncovers new candidate genes for growth and carcass traits

in pigs. PLoS ONE, 13(10), e0205576.

Borchers, N., Reinsch, N., & Kalm, E. (2000). Familial cases of coat colour‐change in a Piétrain

cross. Journal of Animal Breeding and Genetics, 117(4).

Bosse, M., Megens, H. J., Madsen, O., Frantz, L. A., Paudel, Y., et al. (2014). Untangling the

hybrid nature of modern pig genomes: a mosaic derived from biogeographically distinct

and highly divergent Sus scrofa populations. Molecular Ecology, 23(16), 4089–4102.

Boysen, T. J., Tetens, J., & Thaller, G. (2011). Evidence for additional functional genetic

variation within the porcine IGF2 gene affecting body composition traits in an

experimental Piétrain × Large White/Landrace cross. Animal: An International Journal

of Animal Bioscience, 5(5), 672–677.

Cingolani, P., Platts, A., Le Wang, L., Coon, M., Nguyen, T., et al. (2012). A program for

annotating and predicting the effects of single nucleotide polymorphisms, SnpEff:

SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly, 6(2),

80–92.

Danecek, P., Auton, A., Abecasis, G., Albers, C. A., Banks, E., et al. (2011). The variant call

format and VCFtools. Bioinformatics, 27(15), 2156–2158.

Dinh T. N., Kyooyeol L., Hojun C., Min-kyeung C., Minh T. L., et al. (2012). The complete

swine olfactory subgenome: expansion of the olfactory gene repertoire in the pig

genome. BMC Genomics, 13, 584.

Ernst, C. W., & Steibel, J. P. (2013). Molecular advances in QTL discovery and application in

pig breeding. Trends in Genetics, 29(4), 215-224.

Fontanesi, L., D'Alessandro, E., Scotti, E., Liotta, L., Crovetti, A., et al. (2010). Genetic

heterogeneity and selection signature at the KIT gene in pigs showing different coat

colours and patterns. Animal Genetics, 41(5), 478–492.

Frantz, L. A. F., Schraiber, J. G., Madsen, O., Megens, H.-J., Bosse, M., et al. (2013). Genome

sequencing reveals fine scale diversification and reticulation history during speciation

in Sus. Genome Biology, 14, R107.

Ge, S., & Jung, D. (2018). ShinyGO: a graphical enrichment tool for animals and plants.

bioRxiv, doi: 10.1101/315150.

Geldermann, H., Müller, E., Beeckmann, P., Knorr, C., Yue, G., et al. (1996). Mapping of

quantitative‐trait loci by means of marker genes in F2 generations of Wild boar, Pietrain

and Meishan pigs. Journal of Animal Breeding and Genetics, 113(1‐6), 381-387.

Falker-Gieske C., Blaj I., Preuß S., Bennewitz J., Thaller G., et al. (2019). GWAS for meat and

carcass traits using imputed sequence level genotypes in pooled F2-designs in pigs. G3:

Genes, Genomes, Genetics, 9(9), 2823-2834.

Groenen, M. A. M. (2016). A decade of pig genome sequencing: a window on pig

domestication and evolution. Genetics Selection Evolution, 48, 23.

Groenen, M. A. M., Archibald, A. L., Uenishi, H., Tuggle, C. K., Takeuchi, Y., et al. (2012).

Analyses of pig genomes provide insight into porcine demography and evolution.

Nature, 491(7424), 393–398.

Jeon, J.-T., Carlborg, O., Törnsten, A., Giuffra, E., Amarger, V., et al. (1999). A paternally

expressed QTL affecting skeletal and cardiac muscle mass in pigs maps to the IGF2

locus. Nature Genetics, 21(2), 157–158.

Meuwissen, T. H., Hayes, B. J., & Goddard, M. E. (2001). Prediction of Total Genetic Value

Using Genome-Wide Dense Marker Maps. Genetics, 157(4), 1819–1829.

Nezer, C., Moreau, L., Brouwers, B., Coppieters, W., Detilleux, J., et al. (1999). An imprinted

QTL with major effect on muscle mass and fat deposition maps to the IGF2 locus in

pigs. Nature Genetics, 21, 155–156.

Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A. R., et al. (2007). PLINK:

a tool set for whole-genome association and population-based linkage analyses.

American Journal of Human Genetics, 81(3), 559–575.

Ramos, A. M., Crooijmans, R. P. M. A., Affara, N. A., Amaral, A. J., Archibald, A. L., et al.

(2009). Design of a high density SNP genotyping assay in the pig using SNPs identified

and characterized by next generation sequencing technology. PLoS ONE, 4(8), e6524.

Rothschild, M. F., & Ruvinsky, A. (2011). The Genetics of the Pig. 2nd Edition. Oxfordshire,

UK, and Cambridge, USA: CAB International.

Rückert, C., & Bennewitz, J. (2010). Joint QTL analysis of three connected F2-crosses in pigs.

Genetics, Selection, Evolution, 42, 40.

Stratz, P., Schmid, M., Wellmann, R., Preuß, S., Blaj, I., et al. (2018). Linkage disequilibrium

pattern and genome-wide association mapping for meat traits in multiple porcine F2

crosses. Animal Genetics, 49(5), 403–412.

R Core Team (2018). R: A language and environment for statistical computing. R Foundation

for Statistical Computing, Vienna, Austria.

Wickham, H. (2009). Ggplot2. Elegant graphics for data analysis.

Zeder, M. A., Emshwiller, E., Smith, B. D., & Bradley, D. G. (2006). Documenting

domestication: the intersection of genetics and archaeology. Trends in Genetics, 22(3),

139-55.

Zerbino, D. R., Achuthan, P., Akanni, W., Amode, M. R., Barrell, D., et al. (2018). Ensembl

2018. Nucleic Acids Research, 46(D1), D754-D761.

General discussion

The present study investigates the potential of four existing F2 pig resource populations and it

aims to demonstrate how the availability of tens of thousands of single nucleotide

polymorphisms (SNPs) (i.e. SNP arrays) and whole genome sequence data (WGS) can raise

and answer questions in the context of both population and quantitative genomics. A substantial

proportion of the phenotypic variation in pig populations, namely the health, production and

reproduction related traits, holds a genetic basis that is complex or quantitative. To decipher the

convoluted nature of the quantitative traits, the processes leading to their genetic architecture

have to be addressed. The forces that shape the genomes are various genetic mechanisms and

fundamental population processes (e.g. mutation, selection, genetic drift, gene flow, and

recombination) and, in the case of livestock, human interference with these forces as a

result of the domestication process and selective breeding.

In the following, the main findings of the thesis are discussed along with additional

investigations and results that were not included in the chapters but have direct implications.

Four main topics are covered: the release of the latest pig reference genome, recombination

rates and gene density, pooling data from several F2 crosses for quantitative trait loci (QTL)

mapping purposes and the shift from SNP array to whole genome sequence level data.

Reference genome Sscrofa 11.1

The quality of the reference genome assemblies is a critical aspect for a successful analysis of

genomic data. At the beginning of the data analyses included in this thesis, only the Sscrofa 10.2

reference genome was available and used, but genetic marker positions were updated with the

release of the new reference genome Sscrofa 11.1 in 2017 by the Swine Genome Sequencing

Consortium (Schook et al., 2005). An evaluation of the potential biases and implications for

downstream analysis followed. Figure 1 shows a comparison between the old and the new

assembly SNPs numbers for the Sus scrofa chromosomes (SSC) (i.e. 1 to 18 autosomal, 19

unplaced SNPs and 20 X SNPs). Attention is drawn to SSC19 which in the 10.2 version

comprises more than 4,000 variants, while in the 11.1 version the number decreased

substantially as the SNPs were assigned to the autosomal chromosomes. These findings are

evidence of a denser and more accurate genome wide SNP coverage as a result of using a higher

quality reference genome.
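The per-chromosome comparison described above can be sketched as a simple tabulation of marker counts under the two assemblies. This is an illustrative sketch only: the marker names and chromosome assignments are hypothetical, and only the chromosome coding (1 to 18 autosomes, 19 unplaced, 20 X) follows the text.

```python
from collections import Counter

UNPLACED = 19  # chromosome code used for unplaced SNPs in the text


def snps_per_chromosome(marker_to_chrom):
    """Count markers per chromosome code for one assembly."""
    return Counter(marker_to_chrom.values())


def rescued_markers(map_old, map_new, unplaced=UNPLACED):
    """Markers that were unplaced in the old map but placed in the new one."""
    return sorted(
        marker
        for marker, chrom in map_old.items()
        if chrom == unplaced and map_new.get(marker, unplaced) != unplaced
    )


# Hypothetical marker maps (marker name -> chromosome code).
map_10_2 = {"snp_a": 1, "snp_b": 19, "snp_c": 19, "snp_d": 2, "snp_e": 19}
map_11_1 = {"snp_a": 1, "snp_b": 2, "snp_c": 7, "snp_d": 2, "snp_e": 19}

counts_old = snps_per_chromosome(map_10_2)
counts_new = snps_per_chromosome(map_11_1)
rescued = rescued_markers(map_10_2, map_11_1)
```

In this toy example, two of the three previously unplaced markers receive a chromosome assignment in the newer map, mirroring the decrease observed for SSC19.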

Figure 1. Number of SNP markers per chromosome: Sscrofa 10.2 in blue versus Sscrofa

11.1 in red. Chromosomes are 1 to 18 autosomal, 19 unplaced and 20 X SNPs.

To demonstrate the benefits of minimizing the downstream analyses bias, the impact on the

genome wide association studies (GWAS) was evaluated with emphasis on the genomic region

harboring the IGF2 gene on SSC2. The gene has been reported to have an effect on muscle

mass and fat deposition (Nezer et al., 1999). The association studies considering additive,

imprinting and paternal effects (in Chapter 1, Chapter 2, and Chapter 3) confirmed this gene as

a strong candidate gene. IGF2 was not assembled in the Sscrofa 10.2 genome version, but

has now been positioned on the Sscrofa 11.1 reference genome (comparison of GWAS results

shown as an example for one trait in Figure 2). While the statistical significance of the SNPs that

are in linkage disequilibrium (LD) with the causative mutation was not influenced, the

ambiguous genomic context of these SNPs hindered the interpretation of the results. If further

exploratory analysis aimed to define clusters incorporating strong evidence for trait-

associated chromosomal regions (following the methodology in Chapters 1 and 3), some clusters

would be erroneous (e.g. Figure 2, end of SSC2 upper Manhattan plot for 10.2). On the other

hand, having the statistically significant variants in the correct genomic context ensures a higher

precision when defining regions of interest (Figure 2, lower Manhattan plot for 11.1).

However, a minor inconvenience was observed when shifting to an updated reference genome.

The release of a new assembly triggers many changes in the available public online resources

such as the Pig QTL database (Hu et al., 2018), Ensembl variation resources (Hunt et al., 2018), and

DAVID (Dennis et al., 2003). Until they are updated accordingly, results interpretation and data

mining tend to be cumbersome when using various bioinformatics tools.

Figure 2. Genome wide association study for meat to fat ratio trait in the European breeds

F2 cross. Sscrofa 10.2 (upper panel) versus Sscrofa 11.1 (lower panel).

Recombination and gene density

Recombination is a major driver in shaping the genetic diversity of populations, creating new

allelic combinations each generation. The four F2 crosses constitute highly informative resource

populations for estimating recombination rates, because crossover events can be accurately

inferred in the F1 individuals. Findings related to the recombination landscape are presented in

Chapter 4. Yet again, the implications of utilizing the latest reference genome for deriving the

marker positions are highlighted. The usage of 11.1 led to higher recombination rates as

compared to 10.2 because of a more accurate downstream analysis and overall improved quality

of the assembly (details in Chapter 4).
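How crossovers translate into a recombination rate can be sketched as follows. The example assumes that, for each F2 offspring, the grandparental origin of the allele transmitted by an F1 parent is already known at ordered informative markers (coded 0 or 1, with `None` for uninformative markers). This input format and the function names are hypothetical, and the sketch ignores phasing and genotyping errors; it is not the procedure used in Chapter 4.

```python
def count_crossovers(origins):
    """Count switches of grandparental origin along one transmitted chromosome."""
    informative = [o for o in origins if o is not None]
    return sum(1 for a, b in zip(informative, informative[1:]) if a != b)


def recombination_rate_cm_per_mb(n_crossovers, n_meioses, length_mb):
    """Average rate in cM/Mb; one crossover per meiosis corresponds to 100 cM."""
    if n_meioses <= 0 or length_mb <= 0:
        raise ValueError("n_meioses and length_mb must be positive")
    return 100.0 * n_crossovers / (n_meioses * length_mb)


# Two hypothetical offspring of the same F1 parent, one chromosome of 150 Mb.
gametes = [
    [0, 0, None, 1, 1, 0],  # two switches
    [1, 1, 1, None, 0, 0],  # one switch
]
total_crossovers = sum(count_crossovers(g) for g in gametes)
rate = recombination_rate_cm_per_mb(total_crossovers, len(gametes), 150.0)
```

Because the counts depend on marker order and position, markers misplaced in an assembly can create or hide apparent switches, which is one reason the choice of reference genome matters for these estimates.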

The nature and spatial pattern of recombination depend on certain chromosomal and sequence

features (e.g. GC content, Tortereau et al., 2012). A genomic feature closely related to GC

content was considered, namely gene density. In several species, including mice (Paigen et al.,

2008) and humans (Freudenberg et al., 2009), a positive correlation between gene density and

recombination rates has been reported. Interestingly, a statistically significant weak positive

correlation (p < 0.05) was only observed for the crosses that included Meishan as a founder

breed (Table 1). The Asian and European breeds diverged around 1 million

years ago (Groenen et al., 2012), which could explain, in general terms, the different distribution

of recombination rates across the genome. In Chapter 4, other breed-dependent

characteristics are described, with emphasis on patterns encountered in the individuals stemming

from the Wild boar founder.

Table 1. Genome wide correlation between gene density and recombination rates in 1 Mb

bins (P = Piétrain, Lw = Large white, L = Landrace, M = Meishan and W = Wild boar).

Statistically significant correlations (p < 0.05) are marked with *.

Cross            Correlation coefficient   p value
P x (LwxL)/Lw    0.25 (0.13)               0.057
M x P            0.31 (0.13)               0.018 *
W x P            0.15 (0.14)               0.287
W x M            0.32 (0.13)               0.014 *
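A correlation of this kind can be sketched by counting genes per 1 Mb bin and correlating the counts with per-bin recombination rates. The sketch below assumes a Pearson coefficient, which the text does not specify, and uses hypothetical gene positions and rates; the function names are illustrative.

```python
from math import sqrt

BIN_SIZE = 1_000_000  # 1 Mb bins, as in Table 1


def count_per_bin(positions, n_bins, bin_size=BIN_SIZE):
    """Number of features (e.g. gene start positions) falling in each bin."""
    counts = [0] * n_bins
    for pos in positions:
        idx = pos // bin_size
        if 0 <= idx < n_bins:
            counts[idx] += 1
    return counts


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two sequences of equal length >= 3")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for H0: r = 0, with n - 2 degrees of freedom (requires |r| < 1)."""
    return r * sqrt((n - 2) / (1.0 - r * r))


# Hypothetical data for a 4 Mb region.
gene_starts = [100_000, 250_000, 1_500_000, 3_200_000, 3_400_000, 3_900_000]
rec_rate = [0.8, 0.5, 0.3, 1.1]  # hypothetical cM/Mb per bin
gene_density = count_per_bin(gene_starts, n_bins=4)
r = pearson_r(gene_density, rec_rate)
```

The p value would then be obtained by comparing the t statistic with a t distribution, a step left out here to keep the sketch free of external dependencies.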

For a coarse inspection of the landscape at the chromosome level, estimates of the

recombination rates for all four crosses together with the gene density are shown in Figure 3.

In general, higher rates tended to be located in distal regions of the chromosomes, which has

been described previously in various other mammal species (Jensen-Seaman et al., 2004), but

also in plants (Jordan et al., 2018). The evolutionary significance suggests highly conserved

mechanisms involving, for example, chromosomal features (e.g. centromere location) and their

influence during the meiosis process. The gene density followed a similar distribution to the

recombination estimates for some of the chromosomes (e.g. SSC10, SSC18), however a

potential mechanism is more difficult to devise in this case. An investigation at a higher

resolution would be required to understand the major drivers of the correlation

between recombination rates and various chromosomal or sequence features.

In the recent past, there has been a growing interest in investigating whether the distribution of

the recombination events along the chromosomes is genetically controlled. An association

study in cattle (Sandor et al., 2012), for instance, provided evidence for genetic markers

influencing genome-wide recombination rates and hotspot usage (REC8, RNF212 and

PRDM9). Along the same lines, a preliminary analysis on the four F2 crosses was conducted.

Two limiting factors were identified: i) the small sample size (less than 200 F1 individuals) and

ii) the difficulty of defining the phenotype (lack of a standard definition). Nevertheless, as two of

the chapters (i.e. 1 and 3) in the thesis prove, the sample size can be increased by pooling data

available from other informative populations on which recombination rates can be accurately

estimated.

Figure 3. Kernel density estimates of the recombination rates for all four crosses

compared to gene density. Autosomal chromosomes from 1 to 18.

Pooling data in F2 crosses

Genome-wide association studies are a powerful tool to identify phenotype-associated variants.

The outcome of a GWAS is determined by several parameters such as marker density, mapping

population, sample size as well as the genetic architecture of the quantitative trait. With respect

to increasing the sample size, Chapters 1 and 3 address and use two suitable approaches to

collectively investigate data: a joint analysis, in which the phenotypic and genotypic data of

multiple datasets are combined directly, and a meta-analysis. The latter method does not require

access to the original datasets as it relies on combining information from the summary statistics

of a GWAS, specifically the effects and the p values (Evangelou and Ioannidis, 2013). The

meta-analysis can increase the detection power and reduce false-positive findings while

efficiently accounting for population substructure and for study specific covariates

(Willer et al., 2010).
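How summary statistics from separate GWAS can be combined for a single SNP is sketched below with an inverse-variance weighted fixed-effect model, one common scheme that is also offered by METAL as its standard-error based option. Whether this or a sample-size weighted scheme was used in Chapter 1 is not assumed here, and the effect estimates and standard errors are hypothetical.

```python
from math import erfc, sqrt


def inverse_variance_meta(betas, ses):
    """Fixed-effect meta-analysis of per-study effect estimates for one SNP.

    betas: effect estimates from each study (same allele coding in all studies)
    ses:   corresponding standard errors
    Returns (pooled_beta, pooled_se, z, two_sided_p).
    """
    if not betas or len(betas) != len(ses):
        raise ValueError("betas and ses must be non-empty and of equal length")
    weights = [1.0 / (se * se) for se in ses]
    w_sum = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / w_sum
    pooled_se = sqrt(1.0 / w_sum)
    z = pooled_beta / pooled_se
    p = erfc(abs(z) / sqrt(2.0))  # two-sided p value under a standard normal
    return pooled_beta, pooled_se, z, p


# Hypothetical estimates for one SNP in three crosses.
beta, se, z, p = inverse_variance_meta([0.30, 0.22, 0.27], [0.12, 0.10, 0.15])
```

Studies with smaller standard errors receive larger weights, and the pooled standard error is smaller than that of any single study, which is where the gain in detection power comes from.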

The joint analysis was proven to be efficient for increasing the mapping power while pooling

data for several F2 designs (Rückert and Bennewitz, 2010; Bennewitz and Wellmann, 2014;

Stratz et al., 2018). To complement the work of the previous studies, Chapter 1 compares the

results from a joint and a meta-analysis for growth and carcass traits and concludes that the

outputs are similar and that the latter approach can be a valuable tool whenever access to raw

datasets is limited. For the F2 crosses here, the joint analysis is a more straightforward approach

(thus used in Chapter 3) because the data can be combined in one dataset and analyzed at once.

In contrast, in the meta-analysis approach several GWAS (depending on the number of

populations) need to be conducted. An additional statistical step to derive the final output is

needed, thus this approach could be more time-consuming. A GWAS for cattle stature

(Bouwman et al., 2018) considered 17 populations and used the meta-analysis as the method of choice,

suggesting that the benefit of employing the meta-analysis is apparent when combining data from a higher

number of populations. Moreover, when populations have sequence level data

available (also the case of the cattle populations mentioned above), the meta-analysis is the

more feasible option.

From SNP array to whole genome sequence data

The thesis transitions throughout its chapters from using SNP array to sequence data based

analyses as it demonstrates the advantages of increasing the genome wide marker density. In

livestock species, the usefulness of the SNP array is undeniable as proved by the constant

success of genomic selection, association and genetic diversity studies.

The design of the chip (Ramos et al., 2009) aims to be suitable and informative for various pig

breeds. The preselected variants are based on a validation population mostly composed of

European and US commercial breeds whilst the Asian pigs are underrepresented (Meishan

individuals account for ca. 5% of 554 animals). This ascertainment bias led to a smaller

number of polymorphic SNPs in the two F2 populations derived from Meishan and an increased

number of variants in the high minor allele frequency spectrum (Figure 4, left hand side). In

contrast, sequence data does not imply any preselection and has a minor allele frequency

distribution (Figure 4, right hand side) that is skewed towards lower values. The relevance of these variants

can be of interest for GWAS purposes, as a significant proportion of the phenotypic variance

of a quantitative trait may depend on rare variants (Visscher et al., 2017). Additionally, given

the high density of variants in WGS data, the genetic variation is thus captured in a more

complete manner than with common genotyping arrays.
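The minor allele frequency spectra compared above can be computed along the following lines. This is a minimal sketch assuming genotypes coded as counts of one allele (0, 1 or 2, with `None` for missing calls); the coding, the function names and the example genotypes are hypothetical.

```python
def minor_allele_frequency(genotypes):
    """MAF from genotypes coded as 0/1/2 copies of one allele (None = missing)."""
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes")
    freq = sum(called) / (2 * len(called))
    return min(freq, 1.0 - freq)


def maf_histogram(mafs, n_bins=10):
    """Counts of variants in equally wide MAF bins over the range [0, 0.5]."""
    width = 0.5 / n_bins
    counts = [0] * n_bins
    for maf in mafs:
        idx = min(int(maf / width), n_bins - 1)  # MAF of 0.5 goes into the last bin
        counts[idx] += 1
    return counts


# Hypothetical genotypes for three variants in four animals.
variants = [
    [0, 1, 2, None],  # MAF 0.5
    [2, 2, 2, 1],     # MAF 0.125
    [0, 0, 0, 0],     # monomorphic, MAF 0.0
]
mafs = [minor_allele_frequency(v) for v in variants]
spectrum = maf_histogram(mafs, n_bins=5)
```

An array affected by ascertainment bias would place relatively more variants in the upper bins, whereas sequence data would be expected to fill the lower bins.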

Figure 4. Minor allele frequency (MAF) distribution. Upper row is Piétrain x (Large white

x Landrace)/Large white, lower row is Meishan x Piétrain, left column is MAF from SNP array

and right column is MAF from WGS.

GWAS are usually the first step, based on statistical models, in elucidating the underlying

responsible molecular mechanism for the phenotypes of interest. The post GWAS analysis

relies rather on bioinformatics tools to prioritize variants for further functional validation

(hypothesis-based experiment) (Lappalainen, 2015). Following these lines, the association

analysis in Chapter 1 (based on SNP array) was successful in pinpointing genomic regions

associated with growth and carcass traits and, as a result of an exploratory analysis, candidate genes

were nominated. Chapter 3, based on imputed sequence data, adds to the resolution of the study.

Using additional tools such as the Variant Effect Predictor (McLaren et al., 2016) and various

databases, several putative causative mutations are proposed to have an effect on the traits of interest

(average daily gain, backfat thickness, meat to fat ratio and carcass length). These findings

pave the way to design experiments (e.g. by means of eQTL studies; Nica and Dermitzakis,

2013) to prove the causality of the variants and will be considered in follow-up studies.

The final chapter is further proof of the power that sequence data holds. One noteworthy

aspect is that, in the context of WGS, the Meishan is no longer underrepresented with respect

to the number of variants, mainly due to the reference genome, i.e. TJ Tabasco, which belongs

to the Duroc European breed (Groenen et al., 2012). Based on exclusive variants, part of the

olfactory receptor gene family was found to be breed specific when comparing the two

European breed groups (Piétrain group and Large white x Landrace and Large white group).

The survey provides valuable insight into how the Asian and European lineages diverged and

how breed formation shaped the pigs’ genome.

Concluding remarks

F2 crosses provide a powerful tool for genetic research. The multitude of analyses designed to

exploit such populations demonstrates the remarkable potential for investigating additive,

dominance and imprinting effects and for proposing putative causative mutations for further

functional validation. They also allow recombination rates to be estimated across the genome and

the respective patterns to be determined at population, breed and individual level. The directions for

future applications of the four F2 crosses are manifold. The populations are well characterized

for additional phenotypes. Therefore, an imputed sequence data GWAS addressing these traits

is further planned. Moreover, the recombination rates and the drivers shaping the genome wide

recombination landscape will require an in-depth investigation for a better understanding of the

underlying mechanisms.

This thesis emphasizes the strengths of genomic data with substantial implications for

research and practical breeding. Genotyping of the animals at birth and technology-aided

phenotypic collection will likely become routine at the farm level. Data-driven precision

breeding will come into the scene ensuring sustainable farming for the future. However, this

will only be possible if there are methods to analyze big data (e.g. machine learning) as well as

the ability to interpret high scale data sets that are generated by various emerging technologies.

References

Bennewitz, J., & Wellmann, R. (2014). Mapping Resolution in Single and Multiple F2

Populations using Genome Sequence Marker Panels. Proceedings of the 10th World

Congress on Genetics Applied to Livestock Production, 17-22.

Dennis, G., Sherman, B. T., Hosack, D. A., Yang, J., Gao, W., et al. (2003). DAVID: Database

for annotation, visualization, and integrated discovery. Genome Biology, 4(9), R60.

Evangelou, E., & Ioannidis, J. P. A. (2013). Meta-analysis methods for genome-wide

association studies and beyond. Nature Reviews Genetics, 14, 379.

Freudenberg, J., Wang, M., Yang, Y., & Li, W. (2009). Partial correlation analysis indicates

causal relationships between GC-content, exon density and recombination rate in the

human genome. BMC Bioinformatics, 10(1), S66.

Groenen, M. A., Archibald, A. L., Uenishi, H., Tuggle, C. K., Takeuchi, Y., et al. (2012).

Analyses of pig genomes provide insight into porcine demography and evolution.

Nature, 491(7424), 393.

Hu, Z. L., Park, C. A., & Reecy, J. M. (2018). Building a livestock genetic and genomic

information knowledgebase through integrative developments of Animal QTLdb and

CorrDB. Nucleic Acids Research, 47(D1), D701-D710.

Jensen-Seaman, M. I., Furey, T. S., Payseur, B. A., Lu, Y., Roskin, K. M., et al. (2004).

Comparative recombination rates in the rat, mouse, and human genomes. Genome

Research, 14(4), 528-538.

Jordan, K. W., Wang, S., He, F., Chao, S., Lun, Y., et al. (2018). The genetic architecture of

genome‐wide recombination rate variation in allopolyploid wheat revealed by nested

association mapping. The Plant Journal, 95(6), 1039-1054.

Nezer, C., Moreau, L., Brouwers, B., Coppieters, W., Detilleux, J., et al. (1999). An imprinted

QTL with major effect on muscle mass and fat deposition maps to the IGF2 locus in

pigs. Nature Genetics, 21, 155.

Lappalainen, T. (2015). Functional genomics bridges the gap between quantitative genetics and

molecular biology. Genome Research, 25(10), 1427-1431.

McLaren, W., Gil, L., Hunt, S. E., Riat, H. S., Ritchie, G. R. S., et al. (2016). The Ensembl Variant Effect

Predictor. Genome Biology, 17(1), 122.

Nica, A. C., & Dermitzakis, E. T. (2013). Expression quantitative trait loci: present and future.

Philosophical Transactions of the Royal Society B, 368(1620), 20120362.

Paigen, K., Szatkiewicz, J. P., Sawyer, K., Leahy, N., Parvanov, E. D., et al. (2008). The

recombinational anatomy of a mouse chromosome. PLoS Genetics, 4(7), e1000119.

Ramos, A. M., Crooijmans, R. P., Affara, N. A., Amaral, A. J., Archibald, A. L., et al. (2009).

Design of a high density SNP genotyping assay in the pig using SNPs identified and

characterized by next generation sequencing technology. PLoS ONE, 4(8), e6524.

Rückert, C., & Bennewitz, J. (2010). Joint QTL analysis of three connected F2-crosses in pigs.

Genetics Selection Evolution, 42(1), 40.

Hunt, S. E., McLaren, W., Gil, L., Thormann, A., Schuilenburg, H., et al. (2018). Ensembl

variation resources Database, Volume 2018, doi:10.1093/database/bay119.

Sandor, C., Li, W., Coppieters, W., Druet, T., Charlier, C., et al. (2012). Genetic variants in

REC8, RNF212, and PRDM9 influence male recombination in cattle. PLoS Genetics,

8(7), e1002854.

Schook, L. B., Beever, J. E., Rogers, J., Humphray, S., Archibald, A., et al. (2005). Swine

genome sequencing consortium (SGSC): A strategic roadmap for sequencing the pig

genome. Comparative and Functional Genomics, 6, 251-255.

Stratz, P., Schmid, M., Wellmann, R., Preuß, S., Blaj, I., et al. (2018). Linkage disequilibrium

pattern and genome‐wide association mapping for meat traits in multiple porcine F2

crosses. Animal Genetics, 49(5), 403-412.

Tortereau, F., Servin, B., Frantz, L., Megens, H. J., Milan, D., et al. (2012). A high density

recombination map of the pig reveals a correlation between sex specific recombination

and GC content. BMC Genomics, 13(1), 586.

Visscher, P. M., Wray, N. R., Zhang, Q., Sklar, P., McCarthy, M. I., et al. (2017). 10 years of

GWAS discovery: biology, function, and translation. The American Journal of Human

Genetics, 101(1), 5-22.

Willer, C. J., Li, Y., & Abecasis, G. R. (2010). METAL: fast and efficient meta-analysis of

genomewide association scans. Bioinformatics, 26(17), 2190-2191.

General summary

Advances in pig genomics, for both research and practical breeding, rely now greatly on the

availability of genome-wide dense single nucleotide polymorphism (SNP) panels and next

generation sequencing. Mapping populations, such as F2 crosses, together with technological

progress were building blocks for the development of the genomic selection concept. This type

of selection is fundamentally an extension at a genome-wide level of the marker-assisted

selection which was facilitated by various quantitative trait loci (QTL) mapping experiments.

The present thesis investigates the potential of four existing F2 pig resource populations and it

aims to demonstrate how the availability of tens of thousands of SNP markers (i.e. SNP arrays)

and whole genome sequence data (WGS) can raise and answer questions in the context of both

population and quantitative genomics. Every generation level (F0, F1, and F2) conveys different

types of information that is accessible by using phenotypes coupled with genotypes or

phenotypes coupled with sequence data but also by considering genotypic or sequence data

alone.

The main purpose of an F2 cross is to map QTLs and to locate putative causative mutations

associated with phenotypes of interest, which in this thesis relate to growth, carcass

and fat deposition. Briefly, the F2 resource populations included here comprise one cross

stemming from European type breeds (Piétrain, Landrace, and Large white) while the other

three crosses originate from distantly related founder breeds (Piétrain, the Asian breed Meishan,

and the European ancestor, the Wild boar). The method of choice to relate phenotypic to

genotypic information is the genome-wide association study (GWAS).

Chapter 1 explores aspects related to how a collective investigation of data from several F2

resource populations is advantageous. This study performs, based on SNP array data, an

individual population GWAS, a joint population GWAS and a meta-analysis in three of the pig

F2 designs (with Piétrain as a common founder) for growth and carcass traits. The benefit of

pooling the data is an increased mapping resolution that narrowed the genomic regions

harboring causative variants. Many genes previously associated with the traits were confirmed

and further new candidate genes were suggested (e.g. BMP2 bone morphogenetic protein 2 for

carcass length). An extension of this work goes beyond the additive genetic effects and looks

into dominance and imprinting effects in all four F2 crosses (taken separately) by means of

variance component estimation and various GWAS models (Chapter 2). The contribution of

the imprinting effects to the total phenotypic variance ranges from zero to 19% while the

dominance effects account for up to 34%. Significant associations from the imprinting and

paternal GWAS exist in the IGF2 (insulin-like growth factor 2) region for the traits related to

growth and fat deposition. To further dissect the genetic architecture of quantitative traits at an

even higher resolution, Chapter 3 presents a GWAS in the four pooled F2 resource populations

that have been imputed to WGS data based on high coverage founder and low coverage F1

sequencing. Besides providing directions of further research by uncovering information on

putative causative mutations in candidate genes as well as pathways, this research demonstrates

a convenient approach to efficiently exploit well-characterized experimental designs

established in the past.

The last two chapters of the thesis consider levels of information often overlooked in an F2

population, specifically at the F1 and the founder (F0) generation level. Chapter 4 covers

investigations on recombination events, a process involved in maintaining genetic variability

and the evolution of genomes. The basis of the constructed recombination maps is the F1

generation because crossover calls can be inferred in each parent-child pair (F1-F2), owing to the

population structure in which many full sibs stem from one F1 parent. The level of recombination

events varies within crosses, among breeds and individuals as well as across chromosomes or

regions within chromosomes. Although substantial heterogeneity is observed in the designs,

certain patterns specific to crosses, sex or chromosomes do exist and are influenced by the

extent of conservation of the local rate of recombination over time, by the levels of diversity,

efficiency or direction of selection, and the genome composition. Finally, in Chapter 5 a

reverse genetics approach explores information about the F2 resource populations by surveying

the sequence data variation within the founder generation. The rationale is that almost

all the variation which exists in the F2 (used for GWAS purposes) is propagated from the

founder generation. This exploratory analysis indicates how large-scale genomic data can offer

insights into the founder population structure and breed-specific variation and how appropriate

bioinformatics tools and databases can lead to a knowledge-driven variant selection for further

functional validation.

General summary (Allgemeine Zusammenfassung)

Advances in pig genomics, for both research and practical breeding, depend on the availability of genome-wide SNP panels and next-generation sequencing. Mapping populations, such as F2 crosses, and technological progress were the building blocks for the development of genomic selection. This type of selection is an extension of marker-assisted selection to the genome-wide level, which was facilitated by various QTL mapping experiments. The present research examines the potential of applying SNP arrays and whole-genome sequencing (WGS) to four existing F2 pig resource populations. This approach raises questions relating to population and quantitative genetics and can answer them. Each generation level (F0, F1, F2) conveys different kinds of information, which are accessible by linking genotypes with phenotypes as well as by using genetic information alone.

The main purpose of an F2 cross is to map QTLs and to associate putative causal mutations with phenotypes. In the present work, phenotypes associated with growth, carcass quality and fat distribution were considered. One cross of the F2 resource population consists of European breeds (Piétrain, Landrace and Large White), while the other three crosses descend from distantly related founder breeds (Piétrain, the Asian breed Meishan and the European ancestor, the wild boar). Genome-wide association studies (GWAS) are used to link phenotypic and genotypic information. Chapter 1 examines how the observations from the different F2 populations can be combined and analysed jointly. To this end, a separate GWAS was carried out for each population, along with a joint GWAS of all populations. In addition, a meta-analysis was performed for three of the described F2 designs (with Piétrain as the common founder breed) for growth and carcass traits. The advantage of the pooled data is an increased mapping resolution, which points to narrower genomic regions harbouring causal variants. Many genes previously associated with the traits are confirmed, and further new candidate genes are proposed (e.g. BMP2, bone morphogenetic protein 2, for carcass length). An extension of this work goes beyond additive genetic effects and investigates dominance and imprinting effects in all four F2 crosses by means of variance component estimation and various GWAS models (Chapter 2). The contribution of imprinting effects to the total phenotypic variance ranges from zero to 19%, while dominance effects explain up to 34% of the phenotypic variance. The results show significant associations in the IGF2 (insulin-like growth factor 2) region for the traits related to growth and fat distribution. Chapter 3 presents a GWAS of the four pooled F2 resource populations. For this purpose, the available genotypes were imputed to sequence level using the available genomic information from founder animals and the F1 generation. Candidate genes and putative causal mutations were identified that can give direction to further research. Furthermore, the work demonstrates a practicable approach for the efficient use of already established experimental designs.

The last two chapters of this thesis deal with information from the F1 and the founder (F0) generations, which is often overlooked. Chapter 4 covers investigations of recombination, a process that maintains genetic variability and contributes to the evolution of genomes. The recombination maps were constructed on the basis of the F1 generation, since crossovers occur in every parent-child pair (F1-F2). The number of recombination events varied within the crosses, among breeds and individuals, as well as among chromosomes and within genomic regions. Although considerable heterogeneity is observed in the designs, certain patterns exist for the crosses, sexes or chromosomes. These patterns were influenced by the recombination rate over time, by diversity, by the efficiency or direction of selection, and by genome composition. Chapter 5 examines the variation in sequence data within the founder generation using a reverse genetics approach in order to explore the information in the F2 resource populations. The rationale is that most of the genetic variation in the F2 population was passed on from the founder population. This exploratory analysis shows how genome-wide data can be used to draw conclusions about the population structure of founder breeds, and how suitable bioinformatics methods and databases can lead to a knowledge-based selection of variants for further functional validation.

Appendix

Supplementary Material for Chapter 2 is available upon request in a digital format.

Supplementary Material for Chapter 5:

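Table SM1. Heterozygosity levels and genomic inbreeding coefficient (F). P = Piétrain, Lw = Large White, L = Landrace, M = Meishan, W = Wild boar. O(HOM) = observed number of homozygous genotypes, E(HOM) = expected number of homozygous genotypes, N(NM) = number of non-missing genotypes.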

Sample ID  Breed   O(HOM)    E(HOM)    N(NM)     F
10345      P       25619605  2.50E+07  32492844   0.08659
17118      P       26334896  2.50E+07  32515253   0.1794
17123      P       26114274  2.50E+07  32498664   0.1518
17161      P       25770702  2.50E+07  32507073   0.1052
17165      P       25508230  2.50E+07  32497963   0.07124
662        Lw x L  24740919  2.50E+07  32461390  -0.02733
690        Lw x L  24832741  2.50E+07  32464041  -0.01523
693        Lw x L  24715282  2.49E+07  32453620  -0.02998
728        Lw      25731335  2.49E+07  32451402   0.1057
735        Lw x L  24759779  2.49E+07  32456585  -0.02433
750        Lw x L  24777990  2.50E+07  32462484  -0.02254
756        Lw x L  24714906  2.49E+07  32454557  -0.03012
771        Lw x L  24698781  2.50E+07  32463989  -0.03328
M199       M       19403359  2.46E+07  32045079  -0.7073
P102       P       26146023  2.50E+07  32510064   0.1549
P107       P       25866702  2.50E+07  32519648   0.1166
P108       P       26182812  2.50E+07  32511118   0.1596
P113       P       25538474  2.50E+07  32495620   0.07541
P115       P       25910374  2.50E+07  32482869   0.1264
P119       P       25980850  2.50E+07  32496978   0.1343
P128       P       26652961  2.50E+07  32491478   0.2242
P130       P       25822097  2.50E+07  32505442   0.1122
P181       W       26962335  2.49E+07  32450507   0.2698
P244       P       25513125  2.50E+07  32500177   0.07163
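The columns of Table SM1 match the layout of a method-of-moments inbreeding estimate, in which F = (O(HOM) - E(HOM)) / (N(NM) - E(HOM)). The appendix does not state which software or estimator produced the values, so the following is only a minimal sketch under that assumption; the function name `inbreeding_coefficient` is hypothetical. It recomputes F for three rows copied from the table.

```python
def inbreeding_coefficient(o_hom, e_hom, n_nm):
    """Method-of-moments estimate F = (O(HOM) - E(HOM)) / (N(NM) - E(HOM)).

    Assumption: this is the estimator behind Table SM1; the source does not say so.
    """
    denominator = n_nm - e_hom
    if denominator == 0:
        raise ValueError("N(NM) equals E(HOM); F is undefined")
    return (o_hom - e_hom) / denominator


# Three rows copied from Table SM1:
# (sample ID, O(HOM), E(HOM), N(NM), reported F)
rows = [
    ("17118", 26334896, 2.50e7, 32515253, 0.1794),
    ("662", 24740919, 2.50e7, 32461390, -0.02733),
    ("M199", 19403359, 2.46e7, 32045079, -0.7073),
]

for sample_id, o_hom, e_hom, n_nm, reported in rows:
    recomputed = inbreeding_coefficient(o_hom, e_hom, n_nm)
    print(f"{sample_id}: recomputed F = {recomputed:.4f}, reported F = {reported}")
```

Because E(HOM) is printed with only three significant digits, the recomputed values agree with the reported F only approximately. A negative F means that fewer homozygous genotypes were observed than expected.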

Table SM2. Pooled analysis: Top 30 GO Biological Process enriched terms.

Enrichment FDR  Genes in list  Total genes  Functional Category
1.00E-20        4410           4834         Localization
7.20E-20        3899           4264         Developmental process
5.80E-19        4128           4527         Positive regulation of biological process
2.50E-18        3678           4024         Anatomical structure development
1.60E-16        3235           3535         Transport
2.50E-16        3657           4011         Positive regulation of cellular process
5.70E-16        3337           3653         Establishment of localization
4.10E-15        3269           3581         Multicellular organism development
5.80E-15        4347           4798         Cellular component organization or biogenesis
3.10E-14        4221           4660         Cellular component organization
4.80E-14        2703           2951         Cellular developmental process
9.70E-14        3014           3302         System development
1.90E-13        2554           2787         Cell differentiation
6.70E-13        1647           1776         Organic substance transport
1.80E-11        1781           1931         Regulation of localization
2.00E-11        1940           2109         Macromolecule localization
3.00E-11        2624           2877         Regulation of biological quality
5.70E-11        1777           1929         Cell surface receptor signaling pathway
6.20E-11        1826           1984         Anatomical structure morphogenesis
6.40E-11        2687           2950         Regulation of response to stimulus
7.90E-11        3527           3898         Negative regulation of biological process
8.90E-11        2218           2424         Animal organ development
1.30E-09        1897           2070         Cellular response to chemical stimulus
1.30E-09        2021           2209         Regulation of multicellular organismal process
1.40E-09        702            742          Organic acid metabolic process
1.60E-09        2236           2451         Response to stress
1.80E-09        1782           1942         Response to organic substance
2.50E-09        684            723          Oxoacid metabolic process
3.40E-09        649            685          Carboxylic acid metabolic process
3.70E-09        2506           2757         Positive regulation of metabolic process

Table SM3. Meishan breed specific analysis: Top 30 GO Biological Process enriched terms.

Enrichment FDR  Genes in list  Total genes  Functional Category
2.0E-11         169            913          Cell adhesion
2.0E-11         642            4834         Localization
2.5E-11         169            919          Biological adhesion
4.2E-10         568            4264         Developmental process
2.1E-09         536            4024         Anatomical structure development
4.8E-09         483            3581         Multicellular organism development
6.0E-09         450            3302         System development
9.8E-08         288            1984         Anatomical structure morphogenesis
9.8E-08         595            4660         Cellular component organization
1.2E-06         602            4798         Cellular component organization or biogenesis
4.5E-06         391            2951         Cellular developmental process
5.5E-06         227            1557         Nervous system development
7.0E-06         271            1931         Regulation of localization
8.5E-06         328            2424         Animal organ development
2.6E-05         213            1474         Movement of cell or subcellular component
2.7E-05         366            2787         Cell differentiation
2.8E-05         282            2060         Intracellular signal transduction
4.1E-05         226            1595         Cell development
6.3E-05         271            1987         Cellular localization
1.0E-04         32             122          Homophilic cell adhesion via plasma membrane adhesion molecules
1.1E-04         116            720          Regulation of cellular component movement
1.1E-04         154            1023         Plasma membrane bounded cell projection organization
1.2E-04         444            3535         Transport
1.2E-04         124            786          Circulatory system development
1.2E-04         96             569          Cell-cell adhesion
1.3E-04         41             180          Cell-cell adhesion via plasma-membrane adhesion molecules
1.3E-04         160            1079         Cytoskeleton organization
1.3E-04         155            1038         Cell projection organization
1.8E-04         281            2109         Macromolecule localization
1.8E-04         455            3653         Establishment of localization

Table SM4. Piétrain breed specific analysis: GO Biological Process enriched terms.

Enrichment FDR  Genes in list  Total genes  Functional Category
1.50E-15        75             1713         Sensory perception of smell
1.50E-15        75             1716         Detection of chemical stimulus involved in sensory perception
1.50E-15        75             1689         Detection of chemical stimulus involved in sensory perception of smell
1.50E-15        76             1758         Sensory perception of chemical stimulus
1.50E-15        76             1737         Detection of chemical stimulus
1.50E-15        83             2027         Sensory perception
4.10E-15        75             1753         Detection of stimulus involved in sensory perception
8.50E-15        76             1822         Detection of stimulus
1.20E-13        86             2353         Nervous system process
8.70E-13        85             2394         G protein-coupled receptor signaling pathway
1.50E-12        93             2796         System process
1.30E-05        105            4429         Response to chemical
7.50E-03        9              122          Homophilic cell adhesion via plasma membrane adhesion molecules
2.90E-02        10             180          Cell-cell adhesion via plasma-membrane adhesion molecules
4.60E-02        2              3            Negative regulation of vascular smooth muscle contraction
4.60E-02        2              3            Protein localization to nuclear inner membrane

Table SM5. Large white x Landrace/Large white breed specific analysis: GO Biological Process enriched terms.

Enrichment FDR  Genes in list  Total genes  Functional Category
1.70E-12        62             1689         Detection of chemical stimulus involved in sensory perception of smell
1.70E-12        62             1713         Sensory perception of smell
1.70E-12        62             1716         Detection of chemical stimulus involved in sensory perception
2.20E-12        62             1737         Detection of chemical stimulus
2.50E-12        62             1753         Detection of stimulus involved in sensory perception
2.50E-12        62             1758         Sensory perception of chemical stimulus
1.10E-11        62             1822         Detection of stimulus
3.10E-11        65             2027         Sensory perception
2.40E-10        77             2796         System process
1.70E-09        68             2394         G protein-coupled receptor signaling pathway
2.00E-09        67             2353         Nervous system process
1.40E-03        83             4429         Response to chemical
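Tables SM2 to SM5 report, for each GO term, the number of genes from the query list that carry the term, the total number of genes annotated with the term, and an enrichment FDR. The appendix does not state which test or tool produced these FDR values. A common way to obtain figures of this kind is a one-sided hypergeometric test per term followed by a Benjamini-Hochberg correction, and the sketch below illustrates that approach under this assumption. The function names and all counts in the example are hypothetical and are not taken from the tables.

```python
from math import comb


def hypergeom_tail(k, term_size, list_size, background_size):
    """P(X >= k) when list_size genes are drawn without replacement from
    background_size genes, of which term_size are annotated with the term."""
    upper = min(term_size, list_size)
    total = comb(background_size, list_size)
    hits = sum(
        comb(term_size, i) * comb(background_size - term_size, list_size - i)
        for i in range(k, upper + 1)
    )
    return hits / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that monotonicity can be enforced.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical example: 1000 background genes, 50 genes in the query list.
background_size, list_size = 1000, 50
# (term name, genes in list, total genes annotated with the term)
terms = [("term A", 10, 40), ("term B", 3, 60), ("term C", 2, 3)]

pvalues = [hypergeom_tail(k, size, list_size, background_size) for _, k, size in terms]
for (name, k, size), p, fdr in zip(terms, pvalues, benjamini_hochberg(pvalues)):
    print(f"{name}: {k}/{size} genes, p = {p:.3g}, FDR = {fdr:.3g}")
```

With real data, the query list size and the background gene set must come from the actual analysis. Both strongly affect the resulting p-values, so this sketch is not expected to reproduce the FDR values listed above.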

Acknowledgements

This dissertation would not have been possible without the support of many people, both at a

professional and personal level.

To start I would like to thank Prof. Dr. Georg Thaller for the constant guidance, for the

productive discussions, valuable suggestions and encouragement throughout the doctoral time.

Additional thanks go to Prof. Dr. Jens Tetens who was of great support in data analysis and lab

work but also in shaping many of the research ideas. Further, for the lab work, I greatly

appreciate the support from Gabi and Fabian who eased the workload.

I also want to thank my colleagues for the pleasant work environment and for the fun times at

conferences and other institute activities. Mitze, Katharina and Sowah, thanks for putting up

with me as your office mate. A big “gracias” goes to Edson for all the constructive

conversations, for his friendship and the last minute support for the thesis. Laura and Christin,

many thanks for the German translations.

Kiel was home in the last years due to having many warmhearted people around. I believe

people make the places (unless it is Iceland) therefore special thanks go to Asli, Luca and Utku

(in no particular order, it is alphabetic), for their true friendship and unconditional love. I would

also like to mention the Ola Mensch crew for the companionship, for all the lunches, concerts

and life stories. For the soundtrack of my life, thank you Ane Brun.

Last, but not least, I would like to thank my family for their love, support and encouragement.

We grew a lot together in the last years and I am extremely grateful for having you in my life.

Vă iubesc.

Curriculum Vitae

Iulia Georgiana Blaj

Birth date, place: 23/04/86, Suceava, Romania

Nationality: Romanian

Employment

11/2014 – present

Research Assistant at the Institute for Animal Breeding and Husbandry, Faculty of

Agricultural and Nutritional Sciences, Kiel University, Germany

Education

10/2012 – 11/2014

AgriGenomics Masters, Kiel University, Germany (M.Sc.)

Thesis: Inferring Demography from Whole-Genome Sequence Data in Herens and

Tyrolean Grey Cattle Breeds

10/2005 – 07/2011

Veterinary Medicine, University of Agronomic Sciences and Veterinary Medicine,

Bucharest, Romania (DVM)

Thesis: Delayed puberty in gilts

09/2001 – 07/2005

National College Petru Rares, Suceava, Romania (Baccalaureate)

Focus: Mathematics – Informatics, Romanian – English bilingual major