A genomic and transcriptomic study

of lineage-specific variation in Mycobacterium

tuberculosis

Graham David Rose

Thesis submitted for the degree of

Doctor of Philosophy

2013

MRC National Institute for Medical Research

Declaration

I, Graham David Rose, confirm that the work presented in this thesis is my own. Where information has been derived from other sources, I confirm that this has been indicated in the thesis.

Signed………………………………………….Date……………………………………..

The thesis work was conducted from September 2009 to March 2013 at the MRC National Institute for Medical Research (NIMR), London, UK, under the supervision of Douglas Young (NIMR, London) and Sebastien Gagneux (Swiss Tropical and Public Health Institute, Switzerland).

Abstract

Human tuberculosis (TB) is caused by several closely related species of bacteria collectively known as the Mycobacterium tuberculosis complex (MTBC). In this thesis, the identification and effect of lineage-specific genetic variation within the phylogenetic lineages of the MTBC were investigated using a combination of computational methods and high-throughput sequencing technology.

Genome sequencing has now identified an extensive repertoire of single nucleotide polymorphisms (SNPs) amongst clinical isolates of the MTBC. Comparative analysis focused on the detection of all lineage-specific SNPs, providing the first glimpse of the total SNP diversity that separates the main phylogenetic lineages from each other. Bioinformatic analysis then focused on SNPs more likely to contribute to functional diversity; it predicted nearly half of all SNPs in the MTBC to have functional consequences, with SNPs within regulatory proteins over-represented. To determine whether these and other lineage-specific SNPs lead to phenotypic diversity, genome datasets were integrated with RNA-sequencing to assess their impact on the comparative transcriptome profiles of strains belonging to two MTBC lineages. Analysing the transcriptomes in the light of the underlying genetic variation revealed clear correlations between genotype and transcriptional phenotype. These arose by three mechanisms. First, lineage-specific changes in the amino acid sequence of transcriptional regulators were associated with alterations in their ability to control gene expression. Second, changes in nucleotide sequence were associated with alteration of promoter activity and generation of novel transcriptional start sites in intergenic regions and within coding sequences. Finally, genes showing lineage-specific patterns of differential expression not linked directly to primary mutations were characterised by a striking over-representation of toxin-antitoxin pairs.

Acknowledgements

This thesis would not have been possible without the efforts of my colleagues and friends. Firstly, I would like to thank my PhD supervisors Sebastien Gagneux and Douglas Young for their support and guidance throughout my project, and for providing me with their invaluable depth of knowledge and resources. Of special note were the annual Gagneux group retreats in Charmey and Les Diablerets, which always provided a healthy mix of stimulating scientific discussions about my projects and great food, including of course the meringue et la crème double. I am grateful to my three thesis supervisors, Delmiro Fernandez-Reyes, Roger Buxton and Seb, who were a great help in contextualising my ideas and providing a focus.

My thesis relied heavily on sequence data, and as such I thank Abdul Sesay and the rest of the High Throughput Sequencing group at NIMR for performing the Illumina sequencing. Next, I would like to thank Iñaki Comas, who was always happy to answer my questions on evolutionary theory and phylogenomics, and to provide more general daily support on all things computational. I also thank the other original member of the Gagneux group at NIMR, Sonia Borrell, particularly for her help in getting me up and running in the lab at the start, and the current members of Douglas Young’s group, including Kristine Arnvig, for her guidance on the RNA side of my project, and Steve Coade, who was my Biosafety Containment Level 3 trainer for the first six months of my PhD. My time at NIMR would not have been as enjoyable without my colleagues and friends Christina Kahramanoglou and Teresa Cortés Méndez. Teresa, I am indebted to you for your support in keeping me focused and all things in perspective during the final few months. I apologise that, despite your efforts and the past efforts of the Spanish contingent of the group, my vocabulary in your language is still quite limited. One day!

Of course I am grateful to my parents, who provided me with their untiring support to undertake my studies throughout the years, and to my brother Phil for his advice and the countless Sunday lunches in Balham. Finally, I am grateful to the Medical Research Council (MRC), whose funding supported not only my university costs and living expenses for the last three and a half years, but also the research of many of my colleagues. Thank you.

Contents

Declaration
Abstract
Acknowledgements
List of Figures
List of Tables
Glossary

Chapter 1 Introduction
1.1 The genus Mycobacterium
1.1.1 Taxonomy
1.1.2 The Mycobacterium tuberculosis complex (MTBC)
1.1.3 TB disease in humans
1.1.4 Disease diversity
1.2 Genetic diversity in the MTBC
1.2.1 General features of the M. tuberculosis genome
1.2.2 Typing the MTBC
1.2.3 The phylogenetic lineages of the MTBC
1.2.4 Origin of the MTBC
1.2.5 Selective pressures acting within the MTBC
1.3 Phenotypic diversity
1.3.1 Laboratory strains
1.3.2 Clinical strain phenotype
1.4 Linking genotype to phenotype
1.4.1 In silico prediction of functional SNPs
1.4.2 Gene expression diversity
1.4.3 High throughput DNA sequencing technology
1.5 Thesis Outline

Chapter 2 Materials and Methods
2.1 General microbiological methods
2.1.1 Containment 3 laboratory
2.1.2 General chemicals and reagents
2.1.3 Bacterial culture and storage
2.1.4 Growth curves
2.2 Molecular biology techniques
2.2.1 Genomic DNA extraction
2.2.2 RNA isolation and handling
2.2.3 Quantification of DNA and RNA by Nanodrop
2.2.4 Determination of DNA and RNA integrity by microfluidics
2.2.5 Removal of DNA contamination from RNA samples
2.2.6 Polymerase chain reaction (PCR)
2.3 Materials
2.3.1 Mycobacterium tuberculosis strains
2.4 DNA-seq
2.5 RNA-seq
2.5.1 Strand specific RNA-seq libraries
2.5.2 TSS 5’ enriched RNA-seq libraries
2.6 Illumina sequencing DNA (genome) and cDNA (RNA-seq) libraries
2.7 Quantitative RT-PCR
2.7.1 Primer sequences
2.8 MTBC annotation datasets
2.8.1 Coding sequence annotations
2.8.2 Functional Categories
2.8.3 Essential M. tuberculosis genes
2.9 Bioinformatics software
2.9.1 Artemis
2.9.2 Quality control of raw RNA-sequencing data
2.9.3 Transcriptome mapping software
2.9.4 Calculation of mapped read frequencies per feature region
2.9.5 R
2.9.6 Perl scripts
2.9.7 GraphPad Prism 5.0

Chapter 3 Lineage-specific SNPs
3.1 Introduction
3.1.1 Aims
3.2 Materials and Methods
3.2.1 Genome collection used in study
3.2.2 Genome sequencing
3.2.3 Mapping genome sequences
3.2.4 Phylogenetic analysis
3.2.5 Categorising SNPs
3.2.6 dN/dS calculation
3.3 Results
3.3.1 A globally representative 28-genome human-adapted MTBC phylogeny
3.3.2 Identification of all lineage-specific SNPs
3.3.3 Distribution of SNPs
3.3.4 Monomorphic population structure and homoplasic SNPs
3.3.5 Creation of pseudogenes
3.3.6 SNPs within genes associated with antibiotic resistance
3.3.7 Conservation and removal of lineage-specific nonsynonymous SNPs
3.4 Discussion
3.4.1 Strengths and limitations of this study
3.4.2 General characteristics of lineage-specific diversity
3.4.3 Insights into the evolution of M. tuberculosis lineages

Chapter 4 In silico prediction of functional Single Nucleotide Polymorphisms
4.1 Introduction
4.1.1 Aims
4.2 Materials and Methods
4.2.1 SIFT
4.2.2 Indels
4.2.3 Homology modelling
4.2.4 Change in protein stability
4.3 Results
4.3.1 Predicting functional SNPs within control set
4.3.2 Predicted functional nonsynonymous SNPs
4.3.3 Impact of nonsynonymous SNPs outside of the human adapted MTBC
4.3.4 Clustering of functional SNPs
4.3.5 Functional category analysis of functional SNPs
4.3.6 Functional impairment of Lineage 1 and 2 regulatory proteins
4.4 Discussion
4.4.1 Strengths and limitations of the study
4.4.2 Validation of the SIFT method
4.4.3 Half of lineage-specific SNPs are predicted to have functional consequences

Chapter 5 Screening the effect of lineage-specific variation by sequence-based transcriptional profiling
5.1 Introduction
5.1.1 Aims
5.2 Methods
5.2.1 Clinical isolates in study
5.2.2 Cluster analysis
5.2.3 Differential expression analysis
5.2.4 Transcriptional Start Site (TSS) calling
5.3 Results
5.3.1 Growth rate in vitro
5.3.2 RNA isolation and Illumina ready libraries
5.3.3 Transcriptome sequencing
5.3.4 Mapping reads to the H37Rv genome
5.3.5 Identifying strain specific gene deletions
5.3.6 Clustering of strains at the total sample level
5.3.7 Clustering of strains by antisense expression
5.3.8 Testing for differential expression in RNA-seq data
5.3.9 Lineage-specific gene expression
5.3.10 Enrichment of toxin-antitoxins
5.4 Discussion
5.4.1 Strengths and limitations of the study
5.4.2 Lineage-specific expression
5.4.3 Linking genotype to phenotype at the transcriptional level

Chapter 6 Final discussion

References

Appendices A-G
Appendix A. genomeDeletions.pl
Appendix B. Lineage-specific SNPs
Appendix C. Lineage-specific SNPs within drug resistance associated genes
Appendix D. Nonsynonymous/synonymous SNP ratio
Appendix E. RNA-seq differential expression
Appendix F. Functional categories
Appendix G. Publications

List of Figures

Figure 1.1. Phylogenetic structure of the genus Mycobacterium.
Figure 1.2. The most complete phylogeny of the human adapted MTBC.
Figure 1.3. Distribution of the MTBC lineages globally.
Figure 1.4. The number of MTBC genome sequences in the Short Read Archive.
Figure 3.1. Neighbour-joining phylogeny for 28 human-adapted MTBC genomes.
Figure 3.2. Within-lineage SNP diversity.
Figure 3.3. Isolating lineage-specific SNPs from the phylogeny.
Figure 3.4. Distribution of the lineage-specific SNPs across the genome.
Figure 3.5. The average number of non-coding and coding lineage-specific SNPs.
Figure 3.6. Distribution of lineage SNPs per gene.
Figure 3.7. Homoplasic lineage SNPs.
Figure 3.8. Change in protein length due to nonsense SNPs.
Figure 3.9. Gene creation by nonsense SNPs.
Figure 3.10. Lineage-specific SNPs within genes associated with drug resistance.
Figure 3.11. The rate of nonsynonymous SNP accumulation by functional category.
Figure 4.1. SIFT database phylogeny.
Figure 4.2. SIFT predictions.
Figure 4.3. Distribution of predicted functional SNPs per gene.
Figure 4.4. Frequency distribution of predicted functional SNPs across genome.
Figure 4.5. Functional category representation.
Figure 4.6. Predicted loss of function of virS transcriptional regulator in Lineage 1.
Figure 4.7. Spectrum of functional SNPs.
Figure 5.1. Strains sequenced in RNA-seq study.
Figure 5.2. In vitro growth curves.
Figure 5.3. Quality control of RNA-seq samples by Bioanalyser.
Figure 5.4. Distribution of quality scores for strain N0145.
Figure 5.5. Circular plot of mapped RNA-seq data.
Figure 5.6. Representation of transcriptome plot based on Artemis.
Figure 5.7. Distribution of gene deletions in the six RNA-seq study strains.
Figure 5.8. Distribution of gene deletions grouped by gene function category.
Figure 5.9. Unsupervised hierarchical clustering of total gene expression.
Figure 5.10. Relationship of genotypic to transcriptomic diversity.
Figure 5.11. Correlation of SNP distance to gene expression.
Figure 5.12. Unsupervised hierarchical clustering of total antisense expression.
Figure 5.13. Venn diagram comparing differential expression methods.
Figure 5.14. Heatmap of 112 differentially expressed genes.
Figure 5.15. Differential expression of divergently regulated genes.
Figure 5.16. Heat map of dosR regulon.
Figure 5.17. Duplication of dosR region.
Figure 5.18. DosR regulon and SNP-associated TSS.
Figure 5.19. SNP-associated TSS leading to differential gene expression.
Figure 5.20. SNP-associated TSS leading to differential antisense expression.
Figure 5.21. Over-representation of differentially expressed toxin-antitoxins.
Figure 5.22. Validation of select RNA-seq differentially expressed toxin-antitoxins.
Figure 5.23. Rates of the types of nucleotide mutations across.

List of Tables

Table 2.1. Primer sequences used in the qRT-PCR study.
Table 3.1. Twenty eight strains used in this study.
Table 3.2. Estimates of evolutionary divergence between strains.
Table 3.3. Summary of lineage-specific SNPs.
Table 3.4. Homoplasic nucleotide positions within the lineage branches.
Table 3.5. Variable genomic positions within the lineages.
Table 3.6. Nonsense SNPs.
Table 3.7. Nonsense SNPs by lineage.
Table 3.8. Nonsense SNPs grouped by functional category.
Table 3.9. Mutations found in drug resistance studies associated with drug resistance.
Table 3.10. The rate of nonsynonymous SNP accumulation across the lineages.
Table 3.11. The rate of nonsynonymous SNP accumulation by functional category.
Table 4.1. SIFT database of non-MTBC species.
Table 4.2. Predicted tolerated and functional SNPs using SIFT.
Table 4.3. Functional category representation.
Table 4.4. Transcriptional regulators with predicted functional mutations.
Table 4.5. Regulatory proteins with predicted functional mutations in Lineage 1 and 2.
Table 5.1. Lineage 1 and 2 strains used in the RNA-seq study.
Table 5.2. Additional strains used in growth curve experiment.
Table 5.3. Additional strains used in qRT-PCR confirmation.
Table 5.4. In vitro growth rates.
Table 5.5. Details of exponential phase transcriptomes used in differential expression analysis.
Table 5.6. Transcriptomes used in TSS mapping.
Table 5.7. Differential expression associated with lineage-specific amino acid mutations SNPs.
Table 5.8. Ten differentially expressed genes associated with a change in promoter sequences.
Table 5.9. Nine differentially expressed antisense associated with introduction of SNP-associated TSS.
Table 5.10. Ten differentially expressed toxin-antitoxins (TA).

Glossary

∆∆G    change in Gibbs free energy
-10    Pribnow box
CCAL    creative commons attribution license
cDNA    complementary DNA
dt    doubling time
DNA    deoxyribonucleic acid
DNA-seq    DNA-sequencing
g    gram
GA    Genome Analyser
Gb    gigabase
HS    HiSeq2000
HTH    helix-turn-helix
indel    insertion/deletion
LSP    large sequence polymorphism
Mb    megabase
mg    milligram
ml    millilitre
MLSA    multilocus sequence analysis
mRNA    messenger RNA
MTBC    Mycobacterium tuberculosis complex
nt    nucleotide
OD    optical density
PCR    polymerase chain reaction
PDB    protein data bank
PE    proline-glutamic acid
PPE    proline-proline-glutamic acid
PGRS    polymorphic glycine rich sequence
qRT-PCR    quantitative real-time PCR
RD    region of difference
RNA    ribonucleic acid
RNA-seq    RNA-sequencing
RPKM    reads per kilobase per million mapped reads
rRNA    ribosomal RNA
sd    standard deviation
SNP    single nucleotide polymorphism
SEM    standard error of the mean
sRNA    small RNA
TA    toxin-antitoxin
TSS    transcriptional start site
µg    microgram
µl    microlitre
UTR    untranslated region
VST    variance stabilising transformation
HGT    horizontal gene transfer
TbD1    M. tuberculosis specific deletion 1
HMM    Hidden Markov model
VCF    variant call format
GTF    gene transfer format
χ²    chi-square test

Chapter 1 Introduction

Tuberculosis (TB) is caused by several closely related species of bacteria collectively known as the Mycobacterium tuberculosis complex (MTBC) (Cole et al., 1998). The most infamous member of the MTBC is the human-adapted pathogen Mycobacterium tuberculosis, the etiologic agent of human TB along with Mycobacterium africanum, a phylogenetic variant limited to West Africa (de Jong et al., 2010). Together these species are regarded as the human-adapted MTBC members. Today, TB is second only to HIV/AIDS as a cause of adult death from a single infectious disease, and is itself the greatest cause of mortality among those infected with HIV (WHO, 2012). It is estimated that nine million new TB cases and over one million deaths from TB currently occur each year (WHO, 2012). In addition to active cases of TB, two billion people have a latent infection, effectively acting as a reservoir for new active TB cases for several decades to come (Barry et al., 2009).

Historically, TB is an ancient disease (Donoghue et al., 2004). Early cultural references date back to classical Greek times (Daniel, 1997), when Hippocrates used the term “phthisis” to describe active TB in individuals (Coar, 1982). Ancient M. tuberculosis DNA has been isolated from mummies found in Egypt (Nerlich et al., 1997) and South America (Salo et al., 1994). More recently, molecular genetics and the advent of sequencing technologies have facilitated more rigorous dating of M. tuberculosis and other MTBC members; low estimates range from 15,000 to 20,000 years (Sreevatsan et al., 1997a), but more recently an age of 70,000 years or more has been suggested (Hershberg et al., 2008). TB has therefore been a burden on humans for a long time, possibly since the migration of modern humans out of Africa (Hershberg et al., 2008). Recent analyses of MTBC evolution, largely driven by the advances in sequencing technology (Loman et al., 2012), have revealed a global picture of human MTBC strain variation, consisting of six major phylogenetic lineages that display strong geographic structure (Gagneux & Small, 2007; Hershberg et al., 2008) and a rare seventh lineage recently discovered in the Horn of Africa (Firdessa et al., 2013). This has called into question prior assumptions that variation in the MTBC was negligible and of no clinical significance (Musser et al., 2000; Sreevatsan et al., 1997a), whilst bringing to the forefront the identification and potential effects of genetic variation, and the future trajectory of the disease (Comas & Gagneux, 2009; Hershberg et al., 2008; Homolka et al., 2010). New opportunities now exist to study how the evolution of the MTBC has resulted in functional consequences in the lineages of the MTBC at the definitive resolution: the level of DNA and RNA. It is these opportunities that shall be explored in this thesis.

1.1 The genus Mycobacterium

A genus of Actinobacteria, Mycobacteria are distinctive rod-shaped bacteria that are characterised by high GC content and complex lipid-rich cell walls (Madigan et al., 2003). This physical property of the cell wall was exploited in 1882 by Koch, who stained M. tuberculosis with alkaline methylene blue and used a Bismarck brown stain for the surrounding tissue (Ellis & Zabrowarny, 1993). In the same year the Ziehl-Neelsen stain was developed, which used a similar process to identify acid-fast bacteria, and is still used today to identify mycobacteria (Parish & Stoker, 2001).

1.1.1 Taxonomy

A working taxonomy for Mycobacteria was established 50 years ago, with original

classifications based on growth rate, pigmentation and clinical significance (Stahl &

Urbance, 1990). A fundamental division can be made based on growth rate, splitting

Mycobacteria into two major groups, fast and slow growers. The fast growers include

mainly opportunistic or non-pathogenic mycobacteria, such as Mycobacterium

smegmatis, which can be cultured from dilute inocula within a week. In contrast, the

slow growing species can take several weeks for visible growth from dilute inocula. This

group includes M. tuberculosis, Mycobacterium bovis and Mycobacterium leprae, the

causative agents of human TB, bovine TB and leprosy, respectively. Modern molecular

biology techniques based on 16S rRNA have revealed the macro population structure of

mycobacteria (Gutierrez et al., 2005; Stahl & Urbance, 1990). The phylogenetic

structure of mycobacteria based on this method is shown in Figure 1.1, and of note is the

position of the MTBC together with the smooth tubercle bacilli, which include

Mycobacterium canettii; it is hypothesised that the MTBC originated from an ancestral

pool of smooth tubercle-like bacilli (Gutierrez et al., 2005; Supply et

al., 2013).

Figure 1.1. Phylogenetic structure of the genus Mycobacterium. The neighbor-

joining tree is based on 16S sequences from seventeen smooth mycobacterial and

MTBC strains. The blue triangle indicates the MTBC. Bootstrap support higher than

90% shown on nodes. Scale bar is pairwise distances after Jukes-Cantor correction.

Adapted from Gutierrez et al. (2005). Image reproduced under the Creative Commons

Attribution License (CCAL).
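
The Jukes-Cantor correction used for the scale bar converts the observed proportion of differing sites, p, into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3). The following Python sketch illustrates the calculation for a pair of aligned sequences. The function name and the treatment of gaps and ambiguous bases are assumptions made for illustration and do not reproduce the analysis of Gutierrez et al. (2005).

```python
import math


def jukes_cantor_distance(seq_a: str, seq_b: str) -> float:
    """Jukes-Cantor corrected distance between two aligned DNA sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Assumption: columns with gaps or ambiguous bases are simply skipped.
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    # Observed proportion of sites that differ between the two sequences.
    p = sum(a != b for a, b in pairs) / len(pairs)
    if p == 0:
        return 0.0
    if p >= 0.75:
        raise ValueError("p >= 0.75: distance is undefined under Jukes-Cantor")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)
```

For closely related sequences the correction is small: at p = 0.001 the corrected distance is only marginally greater than p itself.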

1.1.2 The Mycobacterium tuberculosis complex (MTBC)

The MTBC is used as an umbrella term to group the closely related mycobacteria that

cause TB (Cole et al., 1998). Early sequencing of mycobacteria from the MTBC showed

that they share more than 99.9% sequence identity (Sreevatsan et al., 1997a), as

demonstrated by the collapsed branches in Figure 1.1 for the MTBC members.

However, despite this close relatedness, members of the MTBC display different

phenotypic characteristics and mammalian host ranges; as described above, MTBC

members M. tuberculosis and M. africanum are the primary cause of TB in humans.

The MTBC includes several other species and sub-species that are adapted to various

hosts, including both wild and domestic animal species; these bacterial variants have

been referred to as “ecotypes” (Smith et al., 2006b). Here an ecotype is used as the

definition of a set of strains using the same or similar ecological resources (Cohan,

2002). The main host of M. bovis is cattle, which is of considerable agricultural

significance due to the associated cost of bovine TB, estimated globally at $3 billion per

year (Garnier et al., 2003). M. bovis can also cause TB in humans through the

consumption of unpasteurised milk (de la Rua-Domenech, 2006; Grange, 2001).

Fortunately, modern food practices have effectively stopped this transmission route, and

person-to-person transmission of M. bovis is rare (Evans et al., 2007; Grange, 2001).

Other animal adapted pathogens include Mycobacterium microti (infects voles),

Mycobacterium caprae (infects sheep and goats) and Mycobacterium pinnipedii (infects

seals and sea lions). An MTBC pathogen of Dassies, or Rock Hyrax, has been isolated in

South Africa and named the Dassie bacillus (Parsons et al., 2008), whilst more recently

an MTBC pathogen of banded mongooses has been identified in Botswana and named

Mycobacterium mungi (Alexander et al., 2010). It is anticipated that MTBC members of

other ecotypes will likely be identified in future studies.

A special member of the MTBC is M. canettii, a rare tubercle bacillus with an unusual

smooth colony phenotype, unlike the classical rough appearance of other MTBC

members (van Soolingen et al., 1997). M. canettii and the other smooth TB bacilli harbor

greater genetic diversity compared with the rest of the MTBC, and are more distantly

related to the remaining MTBC than any two other MTBC strains are to each other

(Gutierrez et al., 2005). M. canettii is consequently a common choice as an outgroup in

phylogenetic analysis (Bentley et al., 2012; Comas et al., 2010). Horizontal

recombination events are another feature of the M. canettii genome (Supply et al., 2013),

which is in stark contrast to the rest of the MTBC where no significant signs of

recombination are seen (Hirsh et al., 2004; Supply et al., 2003).

1.1.3 TB disease in humans

M. tuberculosis and M. africanum, which together make up the human adapted members

of the MTBC, are the etiological agents of TB in humans. TB infection in humans

broadly follows an established pattern of events. Briefly, infectious bacilli are spread

through droplet nuclei that can remain aerosolised for several hours. Following

inhalation of the droplets the bacteria are phagocytosed by the host’s alveolar

macrophages, which are then thought to invade the subtending epithelial layer of the

lung (Russell et al., 2010); the infectious dose is estimated to be as low as a single

bacterium. A primary site of infection is established, known as the Ghon focus, whereby

a localised inflammatory response leads to recruitment of mononuclear cells from the

neighboring blood vessels, which acts to provide fresh cells for the bacterial infection.

The subsequent lesion, or granuloma, is a defining pathogenic feature of TB disease.

Initially consisting of a mass of macrophages, neutrophils and monocytes, the

granulomas eventually become stratified with recruitment of lymphocytes and develop a

centre that is rich in lipids. At this stage an equilibrium with the host immune system is

established in most individuals, which can persist from weeks to decades and is known

as latent TB infection. In this latent state the host is asymptomatic and noninfectious. It

is estimated that 95% of human-adapted MTBC infection follows this route into latency,

which is based on evidence of immunological sensitisation by mycobacterial proteins in

the absence of clinical signs and symptoms of active TB (Barry et al., 2009). In

individuals with active TB, either from disease progression, which occurs in about 5%

of cases, or from the reactivation of a latent infection estimated to occur in 10% over a

lifetime in HIV-negative individuals, the granuloma centre fills with caseous debris

including necrotic macrophages. This ultimately ruptures and releases thousands of

infectious bacilli into the lungs and respiratory airways (Kaplan et al., 2003). A

persistent productive cough develops, effectively aerosolising and spreading the bacilli

to new hosts, and it is this late stage of active TB that contributes to tissue damage and

pathogenesis. Bacilli can also escape into other tissues via the lymphatic system or bloodstream,

and this is known as miliary or extrapulmonary TB. Rapid progression to active TB

from an initial infection is more common in infants and immunocompromised persons, whilst

reactivation of latent TB can be triggered by immunosuppression, of which the greatest identified cause

is HIV infection (Ho et al., 1995).

1.1.4 Disease diversity

Although TB is clinically defined into active and latent TB forms, it is likely that this is

a gross oversimplification, with TB infection following a continuous spectrum, ranging

from sterilising immunity, through subclinical active disease, to active disease (Barry et al.,

2009). Development of active disease is likely determined by multiple factors, including

the host genotype, environmental factors, and bacterial genetics. On the human genetics

side, SNPs have been identified that determine susceptibility of an individual to TB

using genome-wide linkage analysis (Bellamy et al., 2000). In addition to environmental

influences, strain variation in the MTBC is now also thought to play a role in the

outcome of TB infection and disease (Coscolla & Gagneux, 2010). The ability of the

MTBC strain to elicit an immune response was explored by Portevin et al. recently

using a monocyte-derived macrophage model to study the innate immune response to

twenty-eight diverse clinical MTBC strains (Portevin et al., 2011). It was shown that

macrophages infected with different strains differed in the levels of cytokines and

chemokines produced; infections by a group of strains that belong to the modern

phylogenetic lineages produced lower levels of pro-inflammatory cytokines than strains

from the ancient lineages (classification of modern and ancient lineages is discussed in

detail below in section 1.2.3). Moving into a clinical setting, it has been shown that over

the course of two years household contacts exposed to strains from the modern lineages

were more likely to develop active disease than contacts exposed to strains from the ancient lineages

(de Jong et al., 2008). Taking these findings together, Gagneux hypothesised that modern strains have

developed an evolutionary strategy of increased virulence and shorter latency, possibly

through adaptation to expanding human population sizes over the past few hundred

years which have provided more hosts for the MTBC pathogen (Gagneux, 2012). In

summary, it is likely that multiple factors play an important role in disease, with a

complex interaction between the host, pathogen and environment (Comas & Gagneux,

2009). This study focuses on the pathogen side, and the following section introduces the

genetic diversity and lineages of the MTBC.

1.2 Genetic diversity in the MTBC

1.2.1 General features of the M. tuberculosis genome

A seminal moment in mycobacterial research was the genome sequencing of the first

strain of M. tuberculosis in 1998 (Cole et al., 1998). A canonical strain of TB research,

M. tuberculosis H37Rv was chosen in 1993 to be the first MTBC strain sequenced, and

the genome was closed and finished over the next five years. It was shown that the

single circular chromosome is 4,411,532 bp in length and contains just over 4,000

protein coding genes. The annotated genome opened new insights into the biology and

metabolism of the pathogen, with identification of large protein families related to fatty

acid and polyketide biosynthesis, regulation, drug efflux pumps and transporters, and

PE_PGRS proteins. PE_PGRS are a large duplicated family unique to the MTBC.

The genome is rich in repetitive DNA, such as IS6110 insertion sequences, and in

multigene families and duplicated housekeeping genes (Cole et al., 1998). Sixteen

copies of the IS6110 sequence and six copies of the more stable element IS1081 were

found to reside within the genome of H37Rv. Due to the variable number of IS6110

elements in strains these were utilised in a DNA fingerprinting protocol which quickly

evolved into the first international gold standard for genotyping of MTBC (van Embden

et al., 1993). Typing of the MTBC in the context of strain diversity is discussed in the

following section.

1.2.2 Typing the MTBC

Members of the MTBC are considered genetically monomorphic with a high level of

genomic sequence similarity and negligible horizontal gene transfer (Hirsh et al., 2004;

Liu et al., 2006). As such, the MTBC displays a classic clonal population structure and

evolves by descent (Achtman, 2008), which leads to the situation whereby mutations in

the parental strain become defining markers for the rest of the progeny. Together, this

creates a situation where many genotyping tools useful in other species do not transfer to

the MTBC effectively (Achtman, 2008; Comas et al., 2009). Development of tools to

measure genetic variation in the MTBC was the start of generating a robust framework

needed firstly to measure the amount of genetic variation in strains, before secondary

questions, such as the effect of strain variation in TB disease, could be asked. Before

discussing the lineages of the MTBC it is first necessary to introduce a brief history of

typing the MTBC and the evolution of such tools to measure genetic diversity in a

robust and definitive manner.

As introduced above, the early 1990s saw the establishment of IS6110 restriction

fragment length polymorphism (RFLP) typing as the gold standard for MTBC typing

(van Embden et al., 1993). The method is based on strain differences in the IS6110 copy

numbers, ranging from 0 to about 25, as well as the variability in the chromosomal

positions of the insertion sequences. Large collections were subsequently typed and the

first families of strains with a common genotype were uncovered in the MTBC (Van

Soolingen, 2001). It was found that some strains were at a higher frequency and across a

wider geographic area, suggesting differential success rates in terms of infection and

geographical spread (Van Soolingen, 2001). Although non-sequence-based tools,

including the above RFLP technique and other methods such as Pulsed-Field Gel

Electrophoresis (PFGE), are useful for typing of monomorphic bacteria at the fine scale,

they have many drawbacks, including problems of reproducibility between laboratories

(Achtman, 2008).

Sequence-based tools such as spoligotyping and MIRU-VNTR have since been developed and have

largely replaced RFLP typing, and are currently the official gold standards for

epidemiological typing of the MTBC (Supply et al., 2001). Spoligotyping is the

mycobacterial name given to the clustered regularly interspaced short palindromic

repeats (CRISPR) typing method, which is based on counting unique spacer regions

between a series of direct repeats in the M. tuberculosis genome (Grissa et al., 2008).

The second method, MIRU-VNTR or mycobacterial interspersed repetitive units

variable number tandem repeats, classifies strains by comparison of strain-specific

numbers of repeats of short DNA sequences at various genomic positions (Lindstedt,

2005). Databases have been built around the results of typing tens of thousands of

patient isolates with these methods, such as SpolDB4 (Brudey et al., 2006) and MIRU-

VNTR plus (Weniger et al., 2010). Although spoligotyping and MIRU-VNTR have

been invaluable from an epidemiological view, the application of such tools to study

evolutionary questions is not ideal as they are susceptible to convergent evolution.

Convergent evolution describes the identification of the same genotype in two strains

that is not due to descent, and this impacts the robustness of derived phylogenies

(Comas et al., 2009). This scenario arises due to the limited number of loci that the

methods are based on. In a study by Comas et al. it was found that phylogenies built

using either method had low discriminatory power and were incongruent compared to

those based on a recent SNP based typing method (Comas et al., 2009). It was therefore

argued that for evolutionary studies the MTBC should be typed using robust SNP or

large sequence polymorphism (LSP) markers (Comas et al., 2009).
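
The convergence problem described above can be illustrated with a toy comparison of MIRU-VNTR repeat-number profiles. This is a minimal Python sketch only: the locus names, repeat numbers and strain labels are invented for illustration and are not real typing data, and the distance used (a simple count of differing loci) is an assumption rather than the measure used in any cited study.

```python
from typing import Dict

Profile = Dict[str, int]  # hypothetical locus name -> number of tandem repeats


def differing_loci(a: Profile, b: Profile) -> int:
    """Count the loci typed in both profiles at which the repeat numbers differ."""
    shared = a.keys() & b.keys()
    return sum(a[locus] != b[locus] for locus in shared)


# Invented profiles over four made-up loci.
strain_x = {"locus1": 2, "locus2": 3, "locus3": 5, "locus4": 2}
strain_y = {"locus1": 2, "locus2": 3, "locus3": 5, "locus4": 2}
strain_z = {"locus1": 2, "locus2": 4, "locus3": 5, "locus4": 1}

# A distance of zero is consistent with shared descent, but with so few loci
# the same profile can also arise independently in unrelated strains.
print(differing_loci(strain_x, strain_y))
print(differing_loci(strain_x, strain_z))
```

Because the comparison sees only the repeat numbers, it cannot distinguish identity by descent from identity by convergence, which is the weakness noted by Comas et al. (2009).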

Typing the MTBC by LSPs, or genomic deletions, exploits the absence of horizontal gene

transfer in the MTBC, which makes each deletion event unique and therefore a robust, informative

phylogenetic marker. Whilst LSPs have been used to resolve the main lineages of the

MTBC (Gagneux et al., 2006a; Reed et al., 2009), deletions are less abundant than SNPs,

and LSP analyses were largely based on deletions detected relative to the reference strain H37Rv, making

SNPs the best choice for sampling MTBC diversity. To date numerous studies have

utilised SNP markers to classify strains and explore the evolutionary history of the

MTBC (Baker et al., 2004; Comas et al., 2010; Gagneux & Small, 2007; Hershberg et

al., 2008). However, SNP analyses can also suffer from the same problems as previous

studies based on LSPs, such as using SNPs based on prior information, which can

introduce a discovery bias, or through simply using a non-representative set of strains. In

2008, Hershberg et al. used de novo sequencing of multiple genes from 108 global

MTBC strains to identify novel SNPs and constructed the most complete phylogenetic

tree of the MTBC (Hershberg et al., 2008). Subsequent whole genome sequencing of a

smaller set of strains in 2010 has defined the MTBC lineages at the highest possible

resolution, the single nucleotide level (Comas et al., 2010).
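
Because the MTBC is clonal, a mutation that arises in the ancestor of a lineage is inherited by all of its descendants, so lineage-defining SNPs can in principle be read directly from an alignment. The Python sketch below is a minimal illustration of that idea under simplifying assumptions: the strain names, lineage labels and sequences are invented, missing data are not handled, and the function is not the pipeline used in the studies cited above.

```python
from typing import Dict, List


def lineage_specific_snps(alleles: Dict[str, str],
                          lineage_of: Dict[str, str],
                          lineage: str) -> List[int]:
    """Return 0-based alignment positions at which every strain of `lineage`
    carries the same allele and no strain outside the lineage carries it."""
    inside = [s for s in alleles if lineage_of[s] == lineage]
    outside = [s for s in alleles if lineage_of[s] != lineage]
    if not inside or not outside:
        return []
    hits = []
    for pos in range(len(alleles[inside[0]])):
        in_alleles = {alleles[s][pos] for s in inside}
        out_alleles = {alleles[s][pos] for s in outside}
        # Fixed within the lineage and absent from all other strains.
        if len(in_alleles) == 1 and in_alleles.isdisjoint(out_alleles):
            hits.append(pos)
    return hits


# Invented toy alignment: two strains in each of two hypothetical lineages.
toy_alleles = {
    "strainA": "ACGTA",
    "strainB": "ACGTA",
    "strainC": "ATGTC",
    "strainD": "ATGTA",
}
toy_lineages = {"strainA": "L2", "strainB": "L2", "strainC": "L4", "strainD": "L4"}
print(lineage_specific_snps(toy_alleles, toy_lineages, "L2"))
```

In the toy data only the second column separates the two groups cleanly; the last column varies within one group and is therefore not reported as lineage-specific.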

1.2.3 The phylogenetic lineages of the MTBC

The global population structure of the MTBC is defined by six main phylogenetic

lineages, named Lineage 1 to 6 (Comas et al., 2010), although these have also been

described by their geographic distribution and other naming schemes in previous studies

(Filliol et al., 2003; Gagneux et al., 2006a; Hershberg et al., 2008). The largest

phylogeny of global MTBC diversity is shown in Figure 1.2. Lineages are coloured

based on previous deletion analysis in a global set of strains (Gagneux et al., 2006a), and

the same colouring scheme is continued throughout this thesis. The phylogeny is based

on a multilocus sequence analysis (MLSA) of SNPs identified from the sequencing of

89 genes in 108 MTBC strains (Hershberg et al., 2008). The MLSA also included seven

animal-adapted strains, which were shown to all cluster within one of the M. africanum

lineages (Lineage 6). Of special note is the Beijing sub-lineage of Lineage 2, which is of

interest in the context of association with multidrug resistance and recent expansion

(Borrell & Gagneux, 2009); this is discussed further in section 1.3.2. In addition to

strains clustering into six main lineages, two major groupings were observed, the

“ancient” and “modern” lineages (Figure 1.2). Lineage 1 and the two M. africanum

lineages are referred to as ancient as they branched off from a common ancestor at an

early stage of evolution, whilst the remaining three modern lineages diverged at a later

time point (Lineages 2, 3 and 4). Previously, studies have classified MTBC strains into

two groups based on the presence of a single genomic deletion known as TbD1 (Brosch

et al., 2002), but the MLSA demonstrated that this separation involves more than a single deletion

(Hershberg et al., 2008). The TbD1 deletion lies on the relatively long branch that precedes the separation of

Lineages 2, 3 and 4 in Figure 1.2, and the length of this branch indicates more genetic variation

between the ancient and modern lineages than TbD1 alone had suggested. As

mentioned previously, a rare seventh MTBC lineage was recently identified, and this has

a phylogenetic location that is between the ancient and modern lineages in Figure 1.2,

although the Lineage 7 branch point is before TbD1 (Firdessa et al., 2013); Lineage 7

was published in March 2013 and therefore is not discussed further in this thesis.

Strains used in the MLSA study were derived from a global collection of 875 strains

from 80 countries that were previously characterised by genome wide deletion analysis

(Gagneux et al., 2006a), and represent the broadest sample of genetic and geographic

MTBC diversity to date. In the study by Gagneux et al. and following analyses, it was

found that the MTBC diversity is highly geographically structured (Gagneux et al.,

2006a; Hershberg et al., 2008). This is shown in Figure 1.3, where for example Lineage

4 is the dominant lineage in terms of geographical spread across the continents of

Europe, America and Africa, whilst Lineage 2 is predominantly found in East Asia.

Figure 1.2. The most complete phylogeny of the human adapted MTBC. Maximum

Parsimony phylogeny of MTBC built using 89 concatenated gene sequences in 108

strains. The branches are colored according to the main lineages defined previously

based on LSP deletion analysis (Gagneux et al., 2006a). Although not part of this study,

the animal strains were part of the previous MLSA study and shown here for reference.

Adapted from Hershberg et al. (2008). Image reproduced under the Creative Commons

Attribution License (CCAL).

[Figure 1.2 labels. Lineages: Lineage 1, Lineage 2, Lineage 3, Lineage 4, Lineage 5, Lineage 6. Geographic and strain labels: The Philippines; Rim of Indian Ocean; M. africanum (West Africa 1); M. africanum (West Africa 2); India, East Africa; Beijing; East Asia; Europe, America, Africa. Groupings: ancient lineages; modern lineages.]

Figure 1.3. Distribution of the MTBC lineages globally. The six lineages display a

strong geographic structure, with each dot representing the dominant lineage in each of

the 80 countries represented in the strain collection. Adapted from Gagneux et al.

(2006a) and Hershberg et al. (2008). Image reproduced under the CCAL.

1.2.4 Origin of the MTBC

Early estimates dated the MTBC to 15,000-20,000 years ago, and it was

hypothesised that animal domestication during the Neolithic transition was the

source of TB in humans (Sreevatsan et al., 1997a). However, more recent estimates place the

MTBC at 70,000 or more years old, linked with early human migrations out of Africa

(Hershberg et al., 2008). It is interesting that the continent that harbours the greatest

MTBC genetic diversity is Africa, with all six lineages represented (Figure 1.3). Based

on the MLSA data by Hershberg et al., it was postulated that the MTBC originated in

Africa and accompanied the Out-of-Africa migrations of modern humans approximately

70,000 years ago (Hershberg et al., 2008). In this evolutionary model it is suggested that

the two ancient M. africanum lineages (Lineage 5 and 6) remained in Africa, whilst the

other lineages spread with human migrations into Eurasia, with the three modern MTBC

lineages seeding Europe, India and China. Recent expansions in human population over

the last few centuries led to the rapid expansion of these modern lineages (Gagneux,

2012). In 2010, Comas et al. generated the first whole-genome global phylogeny of

human adapted MTBC (Comas et al., 2010). This phylogeny resolved the lineages at

much greater resolution than previous analyses, and demonstrated that the two M.

africanum lineages are the most basal. These two lineages are exclusively found in West

Africa (de Jong et al., 2010), and whilst the reason for this is unknown, this evidence

further supports the model that the MTBC originated in Africa (Gagneux, 2012;

Hershberg et al., 2008).

1.2.5 Selective pressures acting within the MTBC

Genetic diversity is introduced and fixed into populations by the four primary

evolutionary forces – mutation, natural selection, genetic drift and gene flow (Robinson

et al., 2010a). Mutation is a stochastic process affecting DNA regardless of function, but

only those mutations that ‘survive’ the processes of genetic drift and selection will be

detected in the genome. Genetic drift is a change in allele frequency over time due to

random sampling over the course of multiple generations. Importantly, it is dependent

on effective population size; smaller populations are more strongly affected by genetic drift

than larger populations. In contrast, natural selection is a non-random process

determined by the differential survival of genetic variants within a population (Robinson

et al., 2010a). Finally, gene flow in the form of horizontal gene transfer (HGT) or

recombination can shuffle mutations and introduce new genetic information into

populations. Importantly, while other mycobacterial species display gene flow, it has not been

detectable in the MTBC (Hirsh et al., 2004; Supply et al., 2003), thus leaving the three

former evolutionary forces acting within the MTBC. Mutation, selection and drift are

intrinsically interdependent, and Hershberg et al. used the MLSA dataset to explore the

evolutionary forces that might have shaped the MTBC genetic diversity (Hershberg et

al., 2008). Comparison of nonsynonymous SNPs (which cause an amino acid change) to

synonymous SNPs (no amino acid change) can provide a measure of the selective

pressures acting within a sequence. This is expressed as the dN/dS ratio, whereby the

ratio of nonsynonymous SNPs to potential nonsynonymous sites (dN) is divided by the

corresponding synonymous ratio (dS); a dN/dS ratio near unity indicates the absence of

selection, whilst the ratio increases under positive selection, and decreases under

purifying selection (Rocha et al., 2006). Positive selection describes the process of

certain alleles increasing in frequency due to a greater fitness than others, whilst

purifying selection purges deleterious alleles, likely generated by nonsynonymous SNPs,

from the population. Applied to the MLSA it was found that 62% of the SNPs were

nonsynonymous and 38% synonymous, corresponding to a dN/dS ratio of 0.57. To put

this in context, the dN/dS ratio for M. canettii, the outlying member of the MTBC, was

0.18, and in two sequenced Mycobacterium avium strains the dN/dS was 0.17 (see

phylogeny in Figure 1.1). Similar ratios were observed across all other Actinobacteria,

hence the dN/dS seen in the MTBC is markedly high compared to other mycobacteria. It

was concluded that in the MTBC purifying selection is strongly reduced.
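
It may seem surprising that a dataset in which most SNPs are nonsynonymous yields a dN/dS ratio below one. The reason is that each count is normalised by the number of sites at which that type of change is possible, and there are more opportunities for nonsynonymous than for synonymous changes. The Python sketch below illustrates this. The percentages from the text stand in for SNP counts, whereas the split of sites (74 nonsynonymous to 26 synonymous) is a hypothetical value chosen only to make the arithmetic concrete and is not a figure reported by Hershberg et al. (2008). The function applies no correction for multiple substitutions at a site.

```python
def dn_ds(nonsyn_snps: float, syn_snps: float,
          nonsyn_sites: float, syn_sites: float) -> float:
    """Simple dN/dS: nonsynonymous SNPs per nonsynonymous site divided by
    synonymous SNPs per synonymous site (no multiple-hit correction)."""
    if min(syn_snps, nonsyn_sites, syn_sites) <= 0:
        raise ValueError("synonymous SNPs and both site counts must be positive")
    dn = nonsyn_snps / nonsyn_sites
    ds = syn_snps / syn_sites
    return dn / ds


# 62 and 38 are the percentages quoted in the text, used here as counts.
# The 74:26 split of sites is an assumption made for this example only.
ratio = dn_ds(nonsyn_snps=62, syn_snps=38, nonsyn_sites=74, syn_sites=26)
print(round(ratio, 2))
```

Under these assumed site counts the ratio rounds to 0.57, showing how a 62:38 excess of nonsynonymous SNPs is compatible with a ratio well below unity.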

The consequence of reduced purifying selection in the MTBC was examined at the level

of conservation of amino acid positions in the 89 genes sequenced across the MTBC

strains. Orthologs were found for 62 genes in mycobacteria distantly related to the

MTBC strains, and using a multiple sequence alignment of these genes the amino acids

were divided into either conserved or variable positions. This classified 64% of the

amino acid positions in mycobacteria as conserved and 36% as variable.

Mutations within conserved positions are more likely to have a functional effect than at

variable positions. Nonsynonymous changes in M. canettii predominantly fell into

variable positions (72%), but the majority (58%) of amino acid mutations in MTBC fell

into the conserved positions. This percentage was not dissimilar from that expected if

purifying selection in MTBC was no longer making a distinction among mutations in

these two classes of sites (Hershberg et al., 2008).
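
The comparison in this paragraph can be made explicit with a short calculation. If amino acid changes were spread evenly over all positions, the fraction falling at conserved positions would simply equal the share of positions that are conserved (64%). The Python sketch below divides each observed fraction by that share. The even-spread expectation is a simplifying assumption for illustration and may differ from the exact expectation calculated by Hershberg et al. (2008); the function and variable names are likewise illustrative.

```python
CONSERVED_SHARE = 0.64  # fraction of aligned amino acid positions classed as conserved


def relative_to_even_spread(observed_conserved_fraction: float,
                            conserved_share: float = CONSERVED_SHARE) -> float:
    """Observed fraction of changes at conserved positions, divided by the
    fraction expected if changes were spread evenly over all positions."""
    return observed_conserved_fraction / conserved_share


# 58% of MTBC changes fell at conserved positions; for the smooth tubercle
# bacillus, 72% fell at variable positions, leaving 28% at conserved ones.
mtbc = relative_to_even_spread(0.58)
smooth_bacillus = relative_to_even_spread(0.28)
print(round(mtbc, 2), round(smooth_bacillus, 2))
```

A value close to one, as for the MTBC, indicates that changes at conserved positions are being removed only weakly, whereas a value well below one indicates that they are being purged.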

1.3 Phenotypic diversity

Whilst the outcome of human tuberculosis infection and resulting disease is highly

variable and has been attributed to many factors including host and environmental

variables, the impact of bacterial strain variation on the clinical outcome of human

infection by MTBC remains an open question. At the level of phenotypic diversity, a

number of studies have explored the phenotypic differences between specific strains.

Many of the earlier studies were based on a small set of canonical laboratory reference

strains, whilst later studies moved into the use of clinical strains, increasingly informed

by the phylogenetic structure of the MTBC. The former studies shall be discussed first

in the next subsection, followed by a discussion of clinical strain phenotypes.

1.3.1 Laboratory strains

As introduced above, many early studies were based on a few characterised reference

strains, namely the laboratory strains H37Rv, H37Ra, Erdman and the vaccine strain M.

bovis BCG (reviewed in Coscolla & Gagneux, 2010). In addition to these strains, two

reference clinical strains, CDC1551 and HN878, isolated from TB outbreaks

in Tennessee and Texas respectively, have also been used (Jones et al., 1999; Valway et

al., 1998). In a phylogenetic context these strains are not representative of MTBC

diversity, with H37Rv, H37Ra, Erdman and CDC1551 all from Lineage 4, whilst

HN878 is part of the Beijing subgroup of Lineage 2 (Figure 1.2).

One of the clearest phenotypic differences among the above laboratory and clinical

reference strains is shown by strain HN878 during infection. HN878 is consistently

associated with a low inflammatory response and increased virulence in both in vitro

macrophage studies and in vivo animal models compared to the other laboratory strains

(Manca et al., 1999; Manca et al., 2001; Manca et al., 2005). In a mouse challenge study

using several clinical strains, it was found that HN878 was hypervirulent, causing

unusually early death of infected immune-competent mice (Manca et al., 2001).

Hypervirulence of HN878 was suggested to be due to the failure of this strain to stimulate

Th1 type immunity for control of M. tuberculosis infection (Manca et al., 2001).

All studies that utilise laboratory strains suffer from the same issue of strain adaptation

to laboratory conditions. This mechanism was exploited to create the laboratory strain

H37Ra, an avirulent M. tuberculosis strain that was generated by culturing H37, the

parental strain of H37Rv, on solid egg medium and selecting for resistance to lysis

(Steenken, 1935). This phenomenon can also affect clinical strains but can be managed

through minimal handling and passaging of cells, thereby limiting the number of

generations and potential for mutation. Adaptation can lead to changes in the virulence

of the strain, such as the loss of phthiocerol dimycocerosate (PDIM) from strain H37Rv

grown in vitro. PDIM is a wax-like compound and an important cell wall lipid

associated with mycobacterial virulence (Domenech & Reed, 2009). The other

laboratory strain, H37Ra, does not synthesise a number of cell surface antigens,

including sulfolipid-1, trehalose mycolates and PDIM (Chesne-Seck et al., 2008).

As H37Rv and other laboratory strains have been passaged for many decades outside of

the human host (Ioerger et al., 2010), their relevance in studies of infection and

virulence is debatable. This is further underscored by the genomic diversity seen in

strains of H37Rv, which has been grown in numerous laboratories throughout the world,

effectively as an unintentional in vitro evolution experiment, resulting in their separation

by multiple SNPs and frameshift insertions and deletions (indels) (Ioerger et al., 2010).

1.3.2 Clinical strain phenotype

Whilst there is currently little evidence of common phenotypic differences at the lineage

level, multiple phenotypes have been identified in nearly forty studies investigating the

virulence and immunological characteristics of clinical strains (Coscolla & Gagneux,

2010). One consistent phenotype is the lower induction of proinflammatory cytokines by

the Beijing sub-lineage of Lineage 2 (Figure 1.2) compared to H37Rv and other strains.

This group of strains is so described as they are endemic in many parts of East Asia, and

account for the majority of cases of TB in these regions (Qian et al., 1999); they have

also been described as the W-Beijing family of strains (Glynn et al., 2002). The Beijing

group has subsequently become the focus of numerous studies owing to its recent spread

in human populations (Cowley et al., 2008), and association with multidrug resistance

(Borrell & Gagneux, 2009). Whilst the characteristics that predispose this family of

strains to such clinical outcomes have not been fully resolved, Reed et al. (2007) showed

that Beijing strains accumulate large quantities of triglycerides in in vitro aerobic

culture, and that this was linked to the constitutive over-expression of genes that are

members of the DosR-controlled regulon. DosR is induced under conditions that are

likely to occur during latent infection, such as exposure to nitric oxide and low oxygen tension,

and is thought to contribute to bacterial persistence (Kumar et al., 2007). One

consequence of this constitutive expression is the observed accumulation of large

quantities of triglycerides during in vitro aerobic culture conditions in contrast to non-

Beijing strains. The authors hypothesise that the triglycerides provide an adaptive

advantage to the Beijing strain family by acting as an energy source during infection

(Reed et al., 2007), which would represent the first example of an in vitro phenotypic

characteristic shared at the MTBC strain sub-lineage level (Nicol & Wilkinson, 2008).

From a clinical perspective, early studies of MTBC strain variation found that strains

from South India were less virulent and had increased susceptibility to oxidative stress

compared to strains from Great Britain (Mitchison et al., 1960; Mitchison et al., 1963).

Although these strains were not genotyped at the time, it can be speculated using the

current knowledge of MTBC phylogeography that this represents a divide between Lineage

1 (Indo-Oceanic) and Lineage 4 strains (Coscolla & Gagneux, 2010). Another example

of differences between MTBC strains detected at the clinical level is Lineage 2, which

has been associated with extrapulmonary (Kong et al., 2007) and meningeal TB (Caws et

al., 2008) compared to strains from other lineages. Several studies have also associated

Lineage 2 with HIV coinfection (Caws et al., 2006), but the experimental phenotype is

not clear and has been contested in other studies which found no significant associations

(de Jong et al., 2009). In summary, the extent to which clinical MTBC phenotypes are

shared by strains belonging to broader phylogenetic lineages is largely unknown, but

this may reflect the previous paucity of research in this area (Nicol & Wilkinson, 2008).

In the context of increasing evidence that the amount of sequence variation in MTBC

has been underestimated, genetic diversity may have important phenotypic

consequences, including an impact on areas such as drug and vaccine design (Gagneux

& Small, 2007).

1.4 Linking genotype to phenotype

The first step towards understanding the influence of genetic diversity in the MTBC on

TB infection is to understand the molecular mechanisms that link strain diversity to

phenotype. This is a challenging area of research and there are few examples of such

studies for the MTBC. The previously described study by Reed et al. linked the

accumulation of triacylglycerides to the constitutive over-expression of the DosR

regulon (Reed et al., 2007). This has recently been partially associated with a 350 kb

genomic duplication that is present in some Lineage 2 strains (Domenech et al.,

2010). A second example is a link between the hypervirulence of some Lineage 2 strains

to the production of the immune modulatory phenolic glycolipid (PGL). It was found

that the laboratory strain H37Rv and other members of Lineage 4 do not produce PGL

due to a seven base pair frameshift deletion in the pks1/15 gene cluster; this encodes a

polyketide synthase involved in the production of PGL (Constant et al., 2002). If

pks1/15 is disrupted in the Lineage 2 laboratory strain HN878, then the

hypoinflammatory and hypervirulent phenotype is lost (Reed et al., 2004). However,

this phenotype is more complex than simply the presence of an intact pks1/15. Insertion

of an intact pks1/15 into the Lineage 4 H37Rv laboratory strain did not result in increased

virulence (Sinsimer et al., 2008), thus demonstrating the importance of taking into

account the lineage genetic background of the strain in question.

With advances in sequencing technology, the number of MTBC strains

sequenced and the associated number of SNPs identified are rapidly increasing (Stucki &

Gagneux, 2012). Figure 1.4 shows the number of MTBC genome sequences within

the NCBI Short Read Archive (SRA), a repository for next-generation

sequencing data; this currently stands at 4,913 MTBC genome sequences. SNPs

are the most common form of genetic variation in MTBC, followed by insertions and

deletions (indels), and a total of 9,037 SNPs were discovered by sequencing twenty-one

clinical strains of MTBC (Comas et al., 2010). Whilst this presents an opportunity to

understand the impact of such SNPs, there are also considerable challenges due to the

sheer number of SNPs identified, which will only grow with the associated

increase in comparative genome sequencing studies.

Figure 1.4. The number of MTBC genome sequences in the NCBI Short Read

Archive (SRA). The database was queried on 21-02-2013 using the search term

Mycobacterium tuberculosis complex. The count for 2013 is incomplete and

represents only approximately the first two months of the year.

[Figure 1.4: bar chart. x-axis: Year (2008 to 2013); y-axis: Number of genomes in NCBI SRA (0 to 5000). Bar labels recoverable from the extraction include 1355, 1799, 4675 and 4913.]

1.4.1 In silico prediction of functional SNPs

Whilst identifying SNPs in bacterial genomics studies is becoming relatively simple

through whole genome sequencing using one of the second-generation technologies

(Loman et al., 2012), understanding the effects of sequence variations has become a

major effort in mutation research (Thusberg & Vihinen, 2009). Experimental study of

the molecular effects of all MTBC SNPs identified in recent studies, such as those found

in the above twenty-one genome study, is unfeasible. The development of computational

methods to distinguish SNPs likely to have a functional effect from those that are neutral

has therefore been a highly active field within bioinformatics, and a number of

computational tools have been created for this purpose (Bao & Cui, 2005; Cingolani et

al., 2012; Ng & Henikoff, 2006). From here on, the term functional SNP is used to refer

to those SNPs that are expected to alter gene expression or function, and are therefore

associated with a phenotype. Use of such methods to predict functional SNPs can help

prioritise additional research on those SNPs more likely to affect protein function.

Methods that predict whether a SNP has a functional effect use either sequence or

structural information, or a combination of both to form the prediction. Such methods

rely on the evidence that mutations which affect protein function tend to occur at

evolutionarily conserved positions, or are buried in the interior of the protein structure

(Ng & Henikoff, 2006). Predictions based on sequence information typically follow a

common procedure, as implemented by Ng & Henikoff in their SIFT prediction

algorithm (Ng & Henikoff, 2003). First, an input sequence is used in a database search

for homologous sequences. These are used to create a multiple sequence alignment,

which identifies the evolutionarily conserved positions, and these are inferred to be

important for function. A score based on the frequency of each amino acid at

each position, and on the severity of an amino acid change, is then calculated for each position in

the input sequence. A substitution to an amino acid that is not observed at a

given position can still be classified as neutral rather than functional, because

predictions also use the physicochemical properties of the amino acids already present in

the alignment. For example, if a position in an alignment contains the hydrophobic

amino acids isoleucine, leucine and valine, then this position can likely only contain

hydrophobic amino acids, and changes to other hydrophobic amino acids, such as

methionine, will likely not have a functional effect (Ng & Henikoff, 2003; Ng &

Henikoff, 2006).
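The hydrophobic example above can be sketched in a few lines of Python. This is a deliberately simplified toy and not the SIFT algorithm: the three residue classes, the function names and the rule that a substitution is tolerated when the new residue shares a class with any residue already observed in the column are assumptions made for illustration, whereas SIFT itself computes normalised probabilities for each possible substitution from the alignment.

```python
# Simplified physicochemical classes (an illustrative grouping, not SIFT's).
HYDROPHOBIC = set("AVILMFWC")
POLAR = set("STNQYGPH")
CHARGED = set("DEKR")
CLASSES = {"hydrophobic": HYDROPHOBIC, "polar": POLAR, "charged": CHARGED}


def residue_class(residue):
    """Return the name of the class that a one-letter amino acid belongs to."""
    for name, members in CLASSES.items():
        if residue in members:
            return name
    raise ValueError(f"unknown residue: {residue!r}")


def predict_substitution(column, new_residue):
    """Classify a substitution at one alignment column.

    column: residues observed at this position across the homologues.
    Returns 'tolerated' if the new residue is already observed, or belongs
    to a class that is already represented in the column; otherwise
    returns 'functional'.
    """
    observed = set(column)
    if new_residue in observed:
        return "tolerated"
    observed_classes = {residue_class(r) for r in observed}
    if residue_class(new_residue) in observed_classes:
        return "tolerated"
    return "functional"


# Column containing only isoleucine, leucine and valine, as in the text.
column = "ILVIL"
print(predict_substitution(column, "M"))  # methionine is hydrophobic: tolerated
print(predict_substitution(column, "D"))  # aspartate is charged: functional
```

A real predictor would also weight each residue by how often it appears and by how diverse the aligned sequences are, which this sketch ignores.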

1.4.2 Gene expression diversity

With the genome sequence of the first M. tuberculosis strain published in 1998

(Cole et al., 1998), and the extent of genetic diversity beginning to be uncovered

(Comas et al., 2010; Hershberg et al., 2008), the next logical step in understanding the

consequences of such genetic diversity is to build upwards from the genomic

information layer. Uncovering the complexity of phenotypic differences in the MTBC

likely requires the integration of multiple layers of biological information (Comas &

Gagneux, 2009), and moving from the DNA to RNA level to explore MTBC

transcriptional diversity is discussed in the following section.

In the first systematic survey of variation in mRNA expression, Gao et al. compared the

gene expression of ten clinical isolates of M. tuberculosis in addition to the reference

strains H37Rv and H37Ra (Gao et al., 2005). All isolates were grown in vitro and under

exponential growth conditions. The authors found that 527 (15%) of the genes tested

were variable amongst the isolates, highlighting for the first time strain-to-strain

variability in expression under identical growth conditions. Combined with gene

function information, it was found that genes involved in lipid metabolism were statistically

over-represented among the variable genes; it was speculated that this could have

implications for virulence, as lipids and lipid metabolism are thought to have an important

role in host-pathogen interactions (Barry, 2001; Forrellad et al., 2012; Reed et al., 2004).

A further 16% of genes were consistently expressed, and as might be

expected this class was enriched for genes in the

information pathways category, which consists of genes associated with replication,

transcription and translation (Lew et al., 2011) that are consequently highly expressed

in actively growing bacteria. Approximately two-thirds of the remaining genes in the

study were equally split between low or undetectable and unexpressed classes. Many of

these genes were classed as unknown hypotheticals, and so could

represent incorrect annotation of coding regions, or alternatively discovery bias through

the use of only one culture condition (Gao et al., 2005). Overall the study identified

transcriptional variation amongst a set of clinical isolates, with implications for the

choice of targets for drug and vaccine development and of diagnostic markers. The study

predates the robust classification of the phylogenetic lineages of the MTBC (Gagneux et

al., 2006a), which limits the use of the results in a phylogenetic context.

More recently, a study of transcriptional variation amongst clinical isolates of the

MTBC has been undertaken within a phylogenetic framework using microarray

technology (Homolka et al., 2010). The authors included fifteen clinical strains from

four MTBC lineages (Lineages 1, 2, 4 and 6), plus the reference strains H37Rv and

CDC1551, which are part of Lineage 4. Under in vitro exponential growth conditions

the authors identified 364 genes (9.1% of all annotated genes) differentially expressed

between strains of different lineages in at least one pairwise comparison. Several

genotypic signals were identified, such as the dysregulation of the virS-mymA operon in

Lineage 1, thought to be involved in maintenance of the cell wall structure (Singh et al.,

2003), and over-expression of the dosR two-component regulator in the Beijing strains,

which controls the DosR regulon and is described in section 1.3.2. Analyses were extended

to the transcriptional response of intracellular bacilli before and after infection of resting

and activated murine macrophages. Apart from identifying the core universal induction

or repression of 280 genes (7.0%) in all strains regardless of state compared to in vitro

expression, a proportion of genes (293 genes; 7.3%) displayed significant genotypic

patterns in response to the intracellular conditions in the macrophage (Homolka et al.,

2010). This study currently represents the most comprehensive survey of transcriptional

diversity in the human-adapted MTBC. The presence of genotypic

signals implicates the underlying genotypic diversity, driven by large

deletions, indels, and coding and noncoding SNPs, although this was not explored in the

study.
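The idea of calling a gene differentially expressed when it differs between lineages "in at least one pairwise comparison" can be sketched as follows. This is a minimal illustration only: the gene names, expression values, lineage labels and the two-fold threshold are hypothetical, and a simple fold-change cut-off stands in for the statistical testing on replicate arrays that a microarray study would normally use.

```python
from itertools import combinations
from math import log2


def differentially_expressed(mean_expression, min_fold=2.0):
    """Return genes that differ by at least `min_fold` between any two lineages.

    mean_expression maps gene -> {lineage: mean expression}, with values on
    a linear scale and greater than zero. The result maps each qualifying
    gene to a list of (lineage_a, lineage_b, log2 fold change) tuples.
    """
    threshold = log2(min_fold)
    hits = {}
    for gene, by_lineage in mean_expression.items():
        # Every unordered pair of lineages is compared exactly once.
        for a, b in combinations(sorted(by_lineage), 2):
            log_fold = log2(by_lineage[a] / by_lineage[b])
            if abs(log_fold) >= threshold:
                hits.setdefault(gene, []).append((a, b, round(log_fold, 2)))
    return hits


# Hypothetical mean expression values for two genes in three lineages.
toy_expression = {
    "geneA": {"L1": 100.0, "L2": 110.0, "L4": 95.0},
    "geneB": {"L1": 50.0, "L2": 400.0, "L4": 60.0},
}

hits = differentially_expressed(toy_expression)
for gene, comparisons in hits.items():
    print(gene, comparisons)
```

In this toy data set only geneB is reported, because its Lineage 2 value differs from both other lineages by more than the chosen threshold.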

In 2007, the global transcriptional differences between a strain of M. bovis and the

reference strain H37Rv were investigated by microarray (Golby et al., 2007). This study

provides a useful comparison between a human-adapted strain and M.

bovis, which, whilst it can be sustained in humans, is regarded primarily as a pathogen of

wild and domesticated animals (as discussed in section 1.1.2). Under nutrient-limited

conditions and in steady state growth, it was found that 92 genes (2.3%) had 3-fold

differential expression. Genes showing higher expression were equally split between the

two strains. Focusing again on the major gene functional categories, a large proportion

of differentially expressed genes encoded proteins involved in the cell wall, lipid

metabolism, gene regulators, the PE/PPE protein family, and toxin–antitoxin (TA) gene

pairs.

The growing understanding that regulatory processes are often mediated by RNA

molecules beyond the classical view of protein based regulation was combined with

advances in sequencing technology to uncover the total transcriptome of M. tuberculosis

by RNA-sequencing (RNA-seq) (Arnvig et al., 2011). The RNA-seq method is

discussed in the following section (1.4.3). All RNA molecules from in vitro exponential

and stationary phase cultures of M. tuberculosis strain H37Rv were sequenced, and it

was found that more than a quarter of all sequence reads mapped to intergenic regions;

this excluded the highly expressed ribosomal RNAs involved in protein synthesis.

Accounting for the size of the intergenic regions based on the H37Rv genome size, this

represented a 2-fold higher density of noncoding RNA expression compared to gene

expression (mRNA transcription). The noncoding RNA included 5’ and 3’

untranslated regions (UTRs), antisense transcripts, and intergenic small RNA (sRNA)

molecules. Although based on the reference strain H37Rv, the work provides an

important benchmark for future studies of transcriptional diversity in MTBC strains,

demonstrating the significant quantity of RNA expression that had not been detectable in

previous microarray based studies.

1.4.3 High-throughput DNA sequencing technology

Our awareness of greater levels of genetic diversity in the MTBC has been largely

driven by technology changes in sequencing, and next-generation high-throughput DNA

sequencing is likely to play an important role in improving our understanding of TB

(Loman et al., 2012); whilst this technology is often described as next-generation

sequencing, this term is likely to become less useful as the technology advances by

further generations. As introduced earlier, in 2010 Comas et al. sequenced twenty-one

representative clinical MTBC strains, and this was performed using Illumina sequencing

by synthesis technology (Comas et al., 2010; Loman et al., 2012). This genome set has

since become an ideal basis on which to perform later phylogenetic studies employing

ever increasing numbers of MTBC strains (Bentley et al., 2012). This section briefly

introduces the technology, focusing specifically on the methods used in this thesis,

namely genome and RNA-sequencing using the Illumina sequencing platform.

Recent advances in DNA sequencing technologies have enabled the determination of

nucleotide sequence at greater data throughput, in a shorter amount of time and at lower

cost than was previously possible using capillary-based Sanger sequencing (Shendure &

Ji, 2008). Several novel approaches have been developed including 454

(pyrosequencing) and Illumina sequencing, previously known as Solexa sequencing.

The Illumina system was established at NIMR in 2010, initially by an Illumina Genome

Analyser IIx sequencer (GA), and later on by the Illumina HiSeq2000 (HS); the HS

sequencer was the result of technical developments and has five times greater data

output than the older GA sequencer (Loman et al., 2012). The Illumina method involves

sequencing millions of short reads, initially 36bp but more recently ~100bp, using a

flowcell-based system for capturing DNA. The sequencing reactions take place in the

flowcell, which is divided into eight lanes, so that up to eight different

samples can be added. This limitation on sample number has been removed by recent

multiplexing technology, which utilises sequence tags to track each sample and therefore

increases the number of individual samples added to each flowcell lane (Meyer &

Kircher, 2010).
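The principle of tracking samples by sequence tag can be illustrated with a short Python sketch. It assumes, purely for simplicity, that the tag sits at the start of each read and must match exactly; many protocols instead read the tag in a separate index read and allow for sequencing errors. The barcode sequences, sample names and function name are hypothetical.

```python
def demultiplex(reads, barcodes):
    """Assign each read to a sample using its leading sequence tag.

    reads: iterable of read sequences whose first bases are the tag.
    barcodes: dict mapping tag sequence -> sample name (tags of equal length).
    Returns a dict mapping sample name -> list of reads with the tag trimmed;
    reads without an exact tag match are kept whole under 'unassigned'.
    """
    tag_lengths = {len(tag) for tag in barcodes}
    if len(tag_lengths) != 1:
        raise ValueError("all tags must have the same length")
    tag_length = tag_lengths.pop()

    by_sample = {sample: [] for sample in barcodes.values()}
    by_sample["unassigned"] = []
    for read in reads:
        tag, insert = read[:tag_length], read[tag_length:]
        sample = barcodes.get(tag)
        if sample is None:
            by_sample["unassigned"].append(read)
        else:
            by_sample[sample].append(insert)
    return by_sample


# Hypothetical tags for two strains pooled in a single flowcell lane.
lane_barcodes = {"ACGT": "strain_1", "TGCA": "strain_2"}
lane_reads = ["ACGTGGGCCC", "TGCAAAATTT", "NNNNCCCGGG"]

assigned = demultiplex(lane_reads, lane_barcodes)
print(assigned)
```

The number of samples that can share a lane is then limited by the number of distinguishable tags and by the read depth required per sample, rather than by the eight physical lanes.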

Briefly, there are three broad stages in the generation of sequence data: library

preparation, amplification and sequencing. Libraries are initially constructed by one of

several methods that generate a mixture of adaptor-ligated DNA fragments

up to several hundred bp in length. These are amplified using PCR primers

attached to a flowcell, resulting in the physical clustering of the DNA templates across

the flowcell, creating a lawn of sequence fragments (Shendure & Ji, 2008). This is

followed by sequencing, consisting of multiple cycles of single base extensions using

fluorescently labeled reversible terminator nucleotides and imaging to detect which base

has been incorporated, thereby determining the base in the sequence (Bentley et al.,

2008). At the end of each cycle the labeled nucleotide is cleaved and another round of

terminators is added; the number of cycles therefore determines the length of the reads

generated.

The Illumina sequencing platform generates considerable quantities of data per run, with

each flowcell producing up to 6 billion reads, which translates into 600 gigabases (Gb) of

sequence data. Apart from creating demands on storage capacity, with image data from

each flowcell requiring 32 terabytes of temporary storage, a robust informatics pipeline

is required to handle the downstream analysis (Bentley, 2010). There are two main

analytical approaches to using the sequence data: one involves aligning to a reference

sequence, also known as a mapped assembly, and the other is reference-free and

therefore a de novo assembly. The short-read data generated by the Illumina sequencers

are most applicable to the former method, and are very useful in the discovery of SNPs and

phylogenetics.
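The relationship between read number, read length and sequence yield quoted above is simple arithmetic, and the same calculation gives the expected depth of coverage for a mapped assembly. In the Python sketch below, the read length of 100bp follows the text, whereas the H37Rv genome size of approximately 4.4 Mb and the figure of ten million reads per sample are values brought in for illustration rather than taken from this section.

```python
def sequence_yield_bases(n_reads, read_length_bp):
    """Total number of bases produced by n_reads reads of a fixed length."""
    return n_reads * read_length_bp


def mean_coverage(yield_bases, genome_size_bp):
    """Average number of times each genome position is sequenced."""
    return yield_bases / genome_size_bp


READ_LENGTH_BP = 100

# Yield per flowcell as quoted in the text: 6 billion reads of ~100bp.
total_bases = sequence_yield_bases(6_000_000_000, READ_LENGTH_BP)
print(total_bases / 1e9, "Gb")  # 600.0 Gb

# Assumed values for one multiplexed sample (not from the text).
H37RV_GENOME_BP = 4_411_532       # approximate size of the H37Rv genome
reads_per_sample = 10_000_000     # hypothetical share of a lane

coverage = mean_coverage(
    sequence_yield_bases(reads_per_sample, READ_LENGTH_BP), H37RV_GENOME_BP
)
print(f"approximately {coverage:.0f}-fold coverage")
```

Because the genome is small relative to the output of a single lane, even a modest share of a lane gives the deep coverage that makes reference-based SNP discovery reliable.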

High-throughput sequencing has translated into numerous publications that provide new

insight into the evolution and genomic diversity of bacteria (Comas et al., 2010; Holt et

al., 2008; Qi et al., 2009). This technology is being applied to other disciplines, such as

transcriptomics, where genome-wide sequencing of RNA transcripts (RNA-seq) is

creating a powerful new approach to characterisation of the bacterial transcriptome

(Perkins et al., 2009). For over ten years, microarray technology has allowed the

simultaneous monitoring of expression levels of all annotated genes in cell populations

(Schena et al., 1998). Whilst microarrays have been instrumental in our understanding

of transcription, generating a wealth of publications and data based on this technology,

limitations in its applicability have begun to be reached (Mortazavi et al., 2008).

Inherent issues such as the limited dynamic range for the detection of transcript levels,

cross hybridisation and the need for normalisation provide some explanation for the

explosion in use of second generation technologies in the analysis of transcriptomes

(Marguerat & Bähler, 2010). As well as surveying the total transcriptional landscape,

adaptation of the library making process can facilitate Transcriptional Start Site (TSS)

mapping, whereby the precise position of transcription initiation can be determined in a

genome-wide manner (Filiatrault et al., 2011; Sharma et al., 2010b). This can provide

greater understanding of the transcriptional output, and in the human pathogen

Helicobacter pylori revealed a complex structure of TSS within operons and opposite to

annotated genes (Sharma et al., 2010b).

1.5 Thesis Outline

In this thesis, the identification and effect of lineage-specific genetic variation within the

phylogenetic lineages is investigated using computational methods and high-throughput

sequencing technology. This is driven by the overarching hypothesis that fixation of

mutations at evolutionarily conserved positions in the lineages of M. tuberculosis, either

due to a relaxed selective constraint or positive selection, has resulted in functional

consequences that separate the MTBC lineages. Chapter 3 begins with the construction

of a representative 28-genome phylogeny using Illumina sequencing data. Comparative

analysis focuses on the detection of all lineage-specific single nucleotide polymorphisms

(SNPs), providing the first glimpse of the total SNP diversity that separates the main

phylogenetic lineages from each other. The lineage-specific coding SNPs are used to

investigate the evolutionary pressures acting within the lineages using population

genetics measures and gene function categories. Chapter 4 applies in silico tools to the

lineage-specific SNPs to predict those likely to have a functional effect. The focus is

on the largest group of genetic variants, the nonsynonymous SNPs, and a significant

overrepresentation of transcriptional regulators with predicted functional SNPs was

detected. Chapter 5 moves from the DNA to RNA level using a transcriptomic approach.

RNA-sequencing of multiple strains from two lineages was performed, and differential

expression analysis used to define lineage-specific transcriptomes. Along with the

differential expression of genes between the lineages, the experimental method used

allowed novel noncoding and antisense transcription to be detected. In the context of

previously identified lineage-specific SNPs, significant associations were found between

the genomic and transcriptomic data, which were found to arise by three main

mechanisms. These have the potential to alter the response of isolates to differing

microenvironments and to modulate expression of ligands involved in innate immune

recognition.

Chapter 2 Materials and Methods

This chapter details all protocols used in this thesis, beginning with the basic laboratory

methods used in the culture of Mycobacterium tuberculosis and the strains used.

Genome and RNA sequencing are outlined next, alongside the bioinformatics analysis

tools used to interpret these data. Details of MTBC strains and specific bioinformatics

analyses are given in results Chapters 3 to 5.

2.1 General microbiological methods

2.1.1 Containment 3 laboratory

All culturing of M. tuberculosis strains was performed in a Biosafety Level 3 laboratory,

and work was undertaken within a Class II flow cabinet at a negative pressure of at least

160kPa.

2.1.2 General chemicals and reagents

Unless otherwise stated all laboratory chemicals were purchased from Sigma-Aldrich.

Buffers were prepared as aqueous solutions using distilled water, and solutions were

sterilised either by autoclaving or filtration (Millipore, 0.22μm) depending on the

volume.

2.1.3 Bacterial culture and storage

Growth of M. tuberculosis strains used in this study was performed in liquid

Middlebrook 7H9 growth media (Difco, Becton Dickinson). The 7H9 media was

supplemented with 0.5% glycerol (Fisher Scientific), 10% Middlebrook ADC (Albumin,

Dextrose, Catalase), and to help prevent clumping of the cells during growth, 0.05%

Tween-80. This is a standard nutrient-rich medium for culturing M. tuberculosis (Atlas &

Snyder, 2006). Cultures were grown in one litre roller bottles (Nalgene) in a rolling

incubator at 37°C. For long-term storage all isolates were stored at -20°C in 2ml cryo

tubes (Sigma-Aldrich), and supplemented with 10% glycerol to increase viable cell

number during storage.

2.1.4 Growth curves

Growth curves of the bacterial strains used in this study were generated to determine the

previously unknown growth rates of the clinical isolates, which are critical for the

extraction of RNA from the correct growth phase in subsequent experiments. This

also provides important phenotypic data on potential differences in in vitro growth

rates between the lineages.

Pre-cultures were inoculated in 50ml conical screw cap falcon tubes (Fisher Scientific) containing 10ml 7H9

medium two days prior to the start of the growth curve experiment. On

starting the experiment a roller bottle with 100ml 7H9 was inoculated with the

pre-culture so that the starting OD was 0.01 (the lower limit of detection by the

spectrophotometer). Samples of 1ml were taken every 24 hrs and the OD measured in

a 1ml cuvette.
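The volume of pre-culture needed to reach a chosen starting OD follows from the standard dilution relationship C1V1 = C2V2. The short Python sketch below is an illustration only: the pre-culture OD of 0.5 is a hypothetical reading, the function name is invented, and the calculation assumes that OD is proportional to cell density and that the 100ml refers to the final volume.

```python
def inoculum_volume_ml(preculture_od, target_od, final_volume_ml):
    """Volume of pre-culture giving `target_od` in `final_volume_ml`.

    Uses C1 * V1 = C2 * V2, treating OD as proportional to cell density.
    """
    if preculture_od <= target_od:
        raise ValueError("the pre-culture must be denser than the target OD")
    return target_od * final_volume_ml / preculture_od


# Hypothetical pre-culture at OD 0.5, diluted to a starting OD of 0.01 in 100ml.
volume = inoculum_volume_ml(preculture_od=0.5, target_od=0.01, final_volume_ml=100)
print(f"{volume:.1f} ml of pre-culture")
```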

2.1.4.1 Optical density (OD) measurements

The optical density (OD) method was used to measure the growth of mycobacterial

cultures in the above protocol. This is a rapid method that employs a spectrophotometer

to measure the difference in light transmission at a certain wavelength before and after

passing through a culture sample of fixed path length in a cuvette. Here an Amersham

Bioscience spectrophotometer was used for all OD measurements. All readings were

taken at a wavelength of 600nm (OD600), and sterile 7H9 was used as a reference. Saturation

of absorbance occurs > 1 OD, therefore any readings above this were taken from a

diluted sample and multiplied by the dilution factor afterwards (typically 1:10).
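The dilution correction, and the way a growth rate can be derived from the resulting readings, can be sketched in Python as below. The OD values and time points are hypothetical, the function names are invented for illustration, and the doubling time is estimated by a least-squares fit of ln(OD) against time, which is one common approach and not necessarily the calculation used in this thesis. The fit is only meaningful for readings taken during exponential growth.

```python
from math import log


def corrected_od(reading, dilution_factor=1):
    """Scale a reading from a diluted sample back to the undiluted culture."""
    return reading * dilution_factor


def doubling_time_hours(times_h, ods):
    """Doubling time from a least-squares fit of ln(OD) against time."""
    n = len(times_h)
    if n < 2 or n != len(ods):
        raise ValueError("need at least two paired time and OD values")
    log_ods = [log(od) for od in ods]
    mean_t = sum(times_h) / n
    mean_y = sum(log_ods) / n
    sxx = sum((t - mean_t) ** 2 for t in times_h)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times_h, log_ods))
    growth_rate = sxy / sxx          # per hour
    return log(2) / growth_rate


# Hypothetical daily readings as (measured OD, dilution factor); the last
# sample exceeded OD 1 and was therefore measured at a 1:10 dilution.
times = [0, 24, 48, 72]
readings = [(0.15, 1), (0.30, 1), (0.60, 1), (0.12, 10)]

ods = [corrected_od(value, factor) for value, factor in readings]
doubling = doubling_time_hours(times, ods)
print(f"doubling time of about {doubling:.1f} hours")
```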

2.2 Molecular biology techniques

2.2.1 Genomic DNA extraction

Genomic DNA was extracted using the CTAB method described previously (van

Soolingen et al., 1991). 20ml of culture with an OD of ~0.5 was transferred into a

sterile 50ml conical tube and centrifuged at 3000xg for 10mins to pellet the

bacteria. The supernatant was decanted and the pellet resuspended in 1ml lysis buffer.

The suspension was transferred into a 2ml screw cap tube, and placed into a water bath

at 90°C for 1hr. Following this step the crude cells and lysate were transferred to a

containment 2 laboratory. The cells were pelleted at 13000xg, the supernatant discarded, and the pellet

resuspended in 400µl lysis buffer and 100µl of 10mg/ml lysozyme, gently mixed, and

incubated at 37°C for 2 hrs.

The cell lysis step consisted of the addition of 50µl 20% SDS and 25µl Proteinase K to

the cell mix. The sample was incubated at 55°C for 40mins, and 250µl of 4M NaCl was

added and gently mixed. 160µl of preheated CTAB was added and the sample incubated for 10

minutes. To separate the DNA from protein contamination, 900µl chloroform-isoamyl

alcohol (24:1) was added and the biphasic suspension vortexed, then centrifuged for

10mins at 13000xg at 4°C to separate the phases. The upper phase containing the DNA

mix was transferred to a clean 2ml Eppendorf tube. DNA was precipitated with 700µl cold

isopropanol and mixed by gently inverting the tube. Following a 2hr or overnight

precipitation, the sample was centrifuged at 13,000xg for 10mins at 4°C. The

supernatant was decanted and the pellet air dried. 1xTE buffer was added to dissolve the

DNA, which was then stored at 4°C.

2.2.2 RNA Isolation and handling

Inoculation of 10mls 7H9 media in falcon tubes from previously frozen bacterial stock

was performed per experiment to enable the rapid growth of pre-cultures before scaling

up to larger growth volumes. After approximately two days and before the OD reached

0.8, this culture was used to inoculate a roller bottle containing up to 180mls 7H9 liquid

media.

As determined by growth curve experiments (section 5.3.1), exponential phase cultures

were harvested at an OD of between 0.4 and 0.8, whilst stationary phase cultures were

harvested one week after the OD had reached 1.0. When ready, cultures were cooled

rapidly by addition of ice directly into the culture, and centrifuged at 12,000xg for 15

mins at 4°C, after which the supernatant was decanted. RNA was isolated from the pellet

using the FastRNA Pro Blue kit (QBiogene/MP Bio) following the manufacturer’s

instructions.

Briefly, 1ml of RNApro solution was added to the pellet and the cells resuspended by

pipetting, and 1ml transferred to a blue-cap tube containing Lysing Matrix B. The cell

mix in the tube was homogenised in a FastPrep Ribolyser (QBiogene/MP Bio) for

40secs at a setting of 6.0, and centrifuged at 12,000xg for 5mins at 4°C. The upper phase

was transferred to a fresh microcentrifuge tube and incubated for 5mins, then 300µl chloroform

was added and the sample vortexed for 10secs and centrifuged again at 12,000xg for 5mins. Following

transfer of the upper phase to a fresh microcentrifuge tube, 500µl of cold ethanol was

added and the tube inverted five times.

Following this step the RNA suspension was transferred to a containment level 2

laboratory and precipitated for at least 2hrs or overnight. After

precipitation, the sample was centrifuged at 12,000xg for 15mins at 4°C, the supernatant

removed and the pellet washed in 500μl of cold 75% ethanol (made with DEPC-H2O). The

ethanol was aspirated and the pellet air-dried at room temperature for 5mins, then the

RNA resuspended in 100 μl of DEPC-H2O.

2.2.3 Quantification of DNA and RNA by Nanodrop

A Nanodrop spectrophotometer (model ND-1000) was used to measure the quantity of

DNA and RNA following the above protocols. This requires 1μl of sample to be placed

on to the Nanodrop pedestal. The Nanodrop then measures the absorbance of the sample

across a range of wavelengths (230-350nm). Absorbance at 260nm is proportional to the concentration of nucleic acid

present, given in ng/μl. The Nanodrop also provides a measure of the quality of the DNA or

RNA extraction. Nucleic acids and proteins have absorbance maxima at 260 and 280nm,

respectively. A 260/280 absorbance ratio of ~1.8 is generally accepted as high quality for DNA, and a ratio of

~2.0 is generally accepted as high quality for RNA. If DNA or RNA extractions were

appreciably lower than these ratios, a further round of purification was performed to

remove potential protein or other contamination that may be present in the sample.
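The purity check described above can be sketched as a small shell helper. This is an illustrative sketch only: the function name purity_check and the tolerance of 0.1 below the target ratio are assumptions, since the text specifies only that ratios "appreciably lower" than ~1.8 (DNA) or ~2.0 (RNA) triggered re-purification.

```shell
# Sketch: compare an A260/A280 ratio with the accepted value for DNA (~1.8) or RNA (~2.0).
# purity_check and the 0.1 tolerance are hypothetical, not taken from the protocol.
purity_check() {
    # $1 = absorbance at 260nm, $2 = absorbance at 280nm, $3 = DNA or RNA
    awk -v a260="$1" -v a280="$2" -v kind="$3" 'BEGIN {
        target = (kind == "RNA") ? 2.0 : 1.8   # accepted ratio for the sample type
        tol = 0.1                               # assumed tolerance below the target
        ratio = a260 / a280
        verdict = (ratio >= target - tol) ? "pass" : "repurify"
        printf "%.2f %s\n", ratio, verdict
    }'
}

purity_check 1.85 1.00 DNA
purity_check 1.50 1.00 RNA
```

The first call reports a ratio within the assumed tolerance for DNA, whereas the second flags an RNA sample for a further round of purification.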

2.2.4 Determination of DNA and RNA integrity by microfluidics

RNA and DNA concentrations were first measured using the Nanodrop, and then

assessed for quality using the Agilent 2100 Bioanalyser. The Bioanalyser is a

chip-based capillary electrophoresis machine for sizing, quantification and quality

control of DNA, RNA, as well as proteins and cells. Depending on the sample type, the

nucleic acid was measured using the Agilent DNA 1000 chip or Agilent RNA 6000 nano

chip following the manufacturer’s instructions.

2.2.5 Removal of DNA contamination from RNA samples

Rigorous DNase treatment of all RNA samples was performed using the TURBO

DNA-free kit (Applied Biosystems). This procedure can remove > 200µg DNA per ml. Up to

5µg total RNA was treated in volumes of 50µl according to the manufacturer’s

instructions. Briefly, 0.1 volume of 10X TURBO buffer and 1µl (2U) TURBO DNase

were added to the 50µl total RNA aliquot and mixed well. This was incubated at 37°C for

20mins, followed by addition of a further 1µl (2U) TURBO DNase and a second 20min incubation. To

terminate the reaction 0.2 volumes DNase Inactivation Reagent was added and

incubated for 5mins at room temperature. The sample was then centrifuged at 13,000xg

for 2mins and the supernatant, containing the DNase-free RNA, transferred to a fresh

microcentrifuge tube and stored at -20°C.

2.2.6 Polymerase chain reaction (PCR)

PCR was used to amplify specific regions of DNA. For general PCR amplification of

template DNA, Supermix (Invitrogen) was used. Specific protocols, including DNA-seq,

RNA-seq and qRT-PCR, used the manufacturers’ recommended reagents and are

described in the following sections. All PCR reactions were performed in 0.2ml RNase- and

DNase-free thin wall PCR tubes (Ambion) using an Applied Biosystems Veriti Thermal

Cycler. As a negative control the same reaction was conducted in the absence of a DNA

template.

2.3 Materials

2.3.1 Mycobacterium tuberculosis strains

At the start of this project, strain stocks were generated for the entire duration of the

project. Stocks were taken from a strain collection at NIMR derived from a global

collection isolated in San Francisco (Gagneux et al., 2006a). Handling of stocks was

kept to a minimum to minimise the effect of laboratory adaptation; strains were cultured

for one week at NIMR to obtain sufficient stocks for this thesis. Stocks were frozen at

OD 0.4-0.8 to prepare stocks for subsequent exponential phase transcriptome sequencing

experiments.

The specific MTBC strains used in this thesis are described in the respective results

chapters (Chapters 3 and 4).

2.4 DNA-seq

Following extraction and quality control of DNA described in the above method, the

Epicentre Nextera DNA kit was used to generate Illumina sequencing ready DNA

libraries. Briefly, the Nextera method employs in vitro transposition to simultaneously

fragment and tag DNA in a single-tube reaction, thereby facilitating the rapid generation

of DNA libraries; accounting for all quality control procedures, libraries can be prepared in less

than two days. The manufacturer’s instructions were followed, and the High-Molecular-

Weight (HMW) Buffer used, which generates fragments of 175-700bp and is

recommended for paired-end sequencing. A limited PCR step was performed, consisting

of a 72°C 3min extension step to denature the templates, followed by nine cycles of

95°C for 10secs, 62°C for 30secs and 72°C for 3mins. The amplified DNA fragments

were subsequently purified using the Zymo column DNA Clean & Concentrator-5 kit.

Libraries for additional MTBC strains that were not part of this study were also generated using the

above method at the same time, and therefore the Nextera barcoded adapters were used

in the above PCR step. These can be used to add up to twelve unique barcodes to the

Nextera library, enabling multiplexing of the libraries to reduce the sequencing cost.

2.5 RNA-seq

Following trialling of several methods to generate cDNA libraries ready for sequencing

from the RNA extractions, two methods were chosen for the generation of

transcriptomes in this thesis. The two methods are described below; one generates

transcriptomes for differential expression analysis (2.5.1), whilst the other was used for

transcriptional start site (TSS) mapping analysis (2.5.2).

2.5.1 Strand-specific RNA-seq libraries

The strand-specific protocol for transcriptome sequencing is largely based on the small

RNA sample preparation protocol from Illumina (part # 1001375), but with exclusion of

polyA-tail and size selection methods in order to capture all RNA species. Total RNA

from the above DNase treated RNA extraction was randomly fragmented, and specific 5’

and 3’ adapters were attached to the respective ends of the RNA; the adapters are complementary to

oligonucleotides immobilised on the glass surface of the Illumina flowcell. The protocol

consists of six main steps: fragmentation, phosphatase treatment, PNK treatment,

ligation of the adapters, reverse transcription and PCR amplification. These are followed

by purification steps using Solid Phase Reversible Immobilisation (SPRI) beads.

Fragmentation: Initially between 3-5µg of DNase treated RNA was fragmented

following the described Illumina protocol with the 10X fragmentation reagent. The reaction was

stopped with the stop solution and placed on ice, the volume increased to 100µl with RNase

free water and the RNA precipitated by adding 3 volumes of 100% ethanol, 0.1 volumes of

sodium acetate (3M) (Ambion Cat # AM9740) and 0.05 volumes of glycogen. The sample was

precipitated for at least 30 minutes at -20°C. The pellet was washed with 500µl of 70%

ethanol, air dried on ice and resuspended in 16µl of RNase free water in a 200µl

PCR tube.

Phosphatase treatment: The sample was treated with 2µl Antarctic phosphatase with 10X

Phosphatase buffer (NEB Cat # M0289S) and incubated for 30mins at 37°C, 5mins at

65°C and held at 4°C.

PNK treatment: To the previous PCR tube 2µl T4 Polynucleotide

Kinase (PNK) (NEB Cat # M0201S), 17µl water, 5µl 10X PNK buffer, 5µl ATP

(10mM) (Epicentre Cat # R109AT) and 1µl RNaseOUT (Invitrogen, part # 10777-019)

were added, and the mix incubated for 60mins at 37°C and held at 4°C.

Phenol purification: The sample was transferred to a new 1.5ml microcentrifuge tube

and the volume increased to 200µl by addition of RNase free water (Ambion, Cat #

AM9920). 200µl acid phenol (Ambion Cat # AM9720) was added and the sample vortexed;

after centrifuging for 15mins at room temperature the upper phase was transferred to a

new microcentrifuge tube. 3 volumes of cold 100% ethanol, 0.1 volumes of sodium

acetate and 0.05 volumes of glycogen were added and the sample precipitated for 30mins or

overnight. Following precipitation the sample was centrifuged for 25mins at 4°C, the

pellet washed in 70% ethanol and air dried on ice. Once dry 5µl RNase free water was

added to the pellet.

Ligation of the adapters: Adapters were from the Illumina small RNA sample preparation kit

with the v1.5 sRNA 3’ Adaptor (Illumina cat # FC-102-1009). Following the

manufacturer’s instructions, the 3’ sRNA adaptor v1.5 and then the SRA 5’ Adapter were

ligated to 5µl RNA from the previous step.

Reverse transcription: 4µl of the 5’ and 3’ ligated RNA was mixed with 1µl

diluted (1:5) SRA RT primer from the Illumina small RNA kit and heated at 70°C for

2mins. The standard SuperScript II Reverse transcriptase kit with 100mM DTT and 5X

first strand buffer (Invitrogen, part # 18064-014) was used to reverse transcribe the

ligated RNA following the manufacturer’s instructions.

PCR Amplification: Using the Phusion DNA Polymerase kit (NEB part # M0530S)

following the manufacturer’s instructions, 10µl of the product from the reverse

transcription reaction was amplified in a thermal cycler using the following conditions:

30secs at 98°C, 17 cycles of: 10secs at 98°C, 30secs at 60°C, 30secs at 72°C, followed

by 10mins at 72°C and then holding at 4°C.

Purification of libraries: The SPRI bead purification system (Agencourt AMPure from

Beckman Coulter Genomics) was used to remove residual reagents from the previous

steps to leave a purified DNA sample. The standard manufacturer’s instructions were

used for two rounds of SPRI bead purification. The final supernatant was transferred to a

fresh labelled RNase free tube, along with another 4µl aliquot for assessing library

concentration and purity (using a Bioanalyser), and stored at -20°C.

2.5.2 TSS 5’ enriched RNA-seq libraries

Terminator-5’-phosphate-dependent exonuclease (Epicentre Biotechnologies) was used

to deplete processed RNAs from the RNA samples used in TSS mapping analysis. Total RNA

was sent to Vertis Biotechnologie AG (Freising, Germany) and Illumina ready libraries

were constructed using the same protocol as above, but with the addition of the

Terminator-5’-phosphate-dependent exonuclease step to remove RNA transcripts

lacking a 5’ triphosphate end. This step removes degraded mRNAs and processed rRNAs, thereby

biasing sequencing towards the 5’ ends of primary transcripts and facilitating the

mapping of transcriptional start sites (TSS).

2.6 Illumina sequencing DNA (genome) and cDNA (RNA-seq) libraries

The library sequencing stage was performed by the high-throughput sequencing (HTS)

group at NIMR under the supervision of Abdul Sesay. Generated libraries were quality

checked by Agilent DNA 1000 chip and quantified by Qubit (Invitrogen).

Briefly, sequencing libraries were denatured with sodium hydroxide and a dilution of

2nM of the library loaded onto a single lane of an Illumina Genome Analyzer IIx (GA)

or HiSeq 2000 (HS) flowcell. Cluster formation, primer hybridisation and single or

paired-end sequencing were performed using proprietary reagents according to

manufacturer’s recommended protocol (Illumina).

2.7 Quantitative RT-PCR

To confirm differential expression identified by RNA-seq, qRT-PCR was carried out on

a 7500 Fast Real-Time PCR System (Applied Biosystems) using Fast SYBR Green

Master Mix (Applied Biosystems). To minimise normalisation problems arising

across plates, each 96-well plate followed a closed experimental design, with all

clinical strain samples included. RNA without RT (RT-) was analysed alongside cDNA

(RT+). Standard curves were performed for each gene analysed, and the quantities of

cDNA within the samples were calculated from cycle threshold values. Three biological

replicates were tested, consisting of three qRT-PCR plates per gene tested. Data were

averaged, adjusted for chromosomal DNA contamination (RT+ minus RT-) and

normalised to the corresponding 16S rRNA values.
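The quantification steps above can be sketched in shell as follows. This is a minimal illustration under stated assumptions: the standard curve is taken to be linear in log10(quantity), i.e. Ct = slope × log10(quantity) + intercept; the 16S rRNA reference is assumed to be RT- adjusted in the same way as the gene of interest; and the function names and all numerical values are hypothetical.

```shell
# Sketch: convert a cycle threshold (Ct) to a quantity via a standard curve,
# then subtract the RT- signal and normalise to 16S rRNA.
# Function names and example values are hypothetical.
ct_to_quantity() {
    # $1 = Ct, $2 = slope, $3 = intercept of Ct = slope * log10(quantity) + intercept
    awk -v ct="$1" -v slope="$2" -v icpt="$3" 'BEGIN {
        printf "%.0f\n", exp(log(10) * (ct - icpt) / slope)
    }'
}

normalised_expression() {
    # $1 = gene RT+ quantity, $2 = gene RT- quantity,
    # $3 = 16S RT+ quantity,  $4 = 16S RT- quantity
    awk -v gp="$1" -v gm="$2" -v rp="$3" -v rm="$4" 'BEGIN {
        printf "%.4f\n", (gp - gm) / (rp - rm)
    }'
}

ct_to_quantity 20 -3.32 36.6
normalised_expression 1050 50 20000 0
```

With a slope of -3.32 (close to 100% amplification efficiency), each additional 3.32 cycles corresponds to a ten-fold lower starting quantity.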

cDNA for quantitative RT-PCR was made with random primers and Superscript III

according to manufacturer's instructions (Invitrogen). 2µg of DNase treated total RNA

from each respective strain was used as the starting material. Three biological replicates

per strain were used in this study.

2.7.1 Primer sequences

Primers were designed using the Primer 3 software (Rozen & Skaletsky, 2000), and

ordered from Sigma at 100µM concentration in 100µl aliquots, and stored at -20°C.

Primers used in the RNA-seq study in Chapter 5 are shown in Table 2.1.

Table 2.1. Primer sequences used in the qRT-PCR study. Seven toxin-antitoxin

genes were measured by qRT-PCR, and the 16S rRNA sequence was used in

normalisation. In the primer column the suffix denotes the forward (F) and reverse (R)

primers.

Gene qRT-PCR primer Sequence (5’ - 3’)

Rv2063 mazE7_F TCCACGACGATTAGGGTTTC

Rv2063 mazE7_R ACATCGAGATTCCCCGTTC

Rv2274A mazE8_F CGAACCAGAAACCCTTCCT

Rv2274A mazE8_R GACGACTCTGCTCCCAACTC

Rv2830c vapB22_F GATCGAGATCACCAAACACG

Rv2830c vapB22_R GGTGGTGAAGAGTTCGTCGT

Rv2758c vapB21_F GTATGCTCTCCGGGTGTGAC

Rv2758c vapB21_R TGTCGTGGTACCCAGTTCCT

Rv1398c vapB10_F GGACCTGCAGGCTATAAACG

Rv1398c vapB10_R GCAAGGTGCTGTTCACGAC

Rv1397c vapC10_F TGGACTTGGCGACTATCTGA

Rv1397c vapC10_R GGAAATGCCACACGTTGAG

Rv2527 vapC17_F CGATATCGGCGAACTTGAAT

Rv2527 vapC17_R CAGTGACGTTTGTTGGCTGT

16S 16S_F AAGAAGCACCGGCCAACTAC

16S 16S_R TCGCTCCTCAGCGTCAGTTA

2.8 MTBC annotation datasets

2.8.1 Coding sequence annotations

All gene annotations were based on the reference H37Rv genome sequence (Cole et al.,

1998) and using the most recent annotations from the Tuberculist database, release 24

(December 2011) (Lew et al., 2011). In total there are 4,015 protein coding gene

sequences, 13 pseudogenes, 45 tRNAs and 3 rRNAs.

2.8.2 Functional Categories

The genes can be classified based on the function of the encoded proteins. Using the

Tuberculist database annotations there are ten functional categories, listed below (Lew et

al., 2011):

1. virulence, detoxification and adaptation

2. lipid metabolism

3. information pathways

4. cell wall and cell processes

5. intermediary metabolism and respiration

6. unknown

7. regulatory proteins

8. conserved hypotheticals

9. insertion sequences and phages

10. PE/PPE

2.8.3 Essential M. tuberculosis genes

Definition of gene essentiality was based on experiments using transposon mutagenesis

to generate single gene knockouts, followed by transposon site hybridization after

growth on 7H11 agar or in mice (Sassetti et al., 2003; Sassetti & Rubin, 2003). On the

basis of these studies a total of 760 genes fell into the category of essential genes and the

remaining genes were classed as nonessential. This follows the same convention as

Comas et al. (Comas et al., 2010).

2.9 Bioinformatics software

2.9.1 Artemis

The genome browsing and annotation tool, Artemis (Carver et al., 2008), from the

Wellcome Trust Sanger Institute was used extensively throughout this work.

Importantly, this tool enables new features to be overlaid onto published annotations,

and the user plot function allows transcription data to be plotted against the genome.

2.9.2 Quality control of raw RNA-sequencing data

2.9.2.1 FastQC

Raw reads were first filtered to discard low quality reads, which improves mapping

by reducing run time and increasing the number of mapped reads. Raw fastq files

output by the Illumina machine were inspected using FastQC version 0.9.3

(downloaded 20-6-11, Babraham Bioinformatics). FastQC provides a modular set of

analyses in a GUI environment and is written in Java. The distribution of Phred quality

scores across the read length (displayed as a box-and-whisker plot), the per-base N content and

over-represented Illumina primer sequences were used to determine whether a run had passed

QC.

2.9.2.2 SolexaQA

Fastq files passing the initial QC were filtered using SolexaQA version 1.7 (Cox et al.,

2010) (downloaded April 2011). SolexaQA is a Perl-based software package for quality

analysis of Illumina data. The DynamicTrim.pl script within this package was used to

remove poor quality bases from reads. Specifically, bases with Phred scores < 13 (which

corresponds to an error probability of p > 0.05) were trimmed from the 5’ and 3’ ends of reads until all bases

were above this threshold. The Perl scripts were run on a Linux server.

Trimming of reads was performed with the command:

$ DynamicTrim.pl [in.fastq] -h 13
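The correspondence between the Phred cut-off and the error probability follows from the definition p = 10^(-Q/10). The helper below is an illustrative sketch only; the function name phred_to_prob is hypothetical and is not part of SolexaQA.

```shell
# Sketch: convert a Phred quality score Q to its base-call error probability,
# p = 10^(-Q/10). phred_to_prob is a hypothetical helper name.
phred_to_prob() {
    # $1 = Phred quality score
    awk -v q="$1" 'BEGIN { printf "%.4f\n", exp(-q / 10 * log(10)) }'
}

phred_to_prob 13
phred_to_prob 30
```

A score of 13 gives an error probability of approximately 0.05, which is why bases scoring below 13 are treated as unreliable at the 5% level.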

The resulting trimmed fastq file was then processed with the LengthSort.pl script. This

removes reads that were of poor quality over a high percentage of their length and are therefore

too short for mapping after trimming. The default parameter was used, removing reads

< 25 bases:

$ perl LengthSort.pl [in.fastq] > [out.fastq]
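The effect of the length filter can be illustrated with a short awk sketch. This is not the LengthSort.pl script itself: it assumes single-end reads stored as strict four-line fastq records, and the function name filter_short_reads is hypothetical.

```shell
# Sketch: keep only fastq records whose sequence is at least a minimum length.
# Assumes four lines per record (header, sequence, "+", qualities).
# filter_short_reads is a hypothetical helper, not part of SolexaQA.
filter_short_reads() {
    # $1 = minimum read length; fastq on stdin, filtered fastq on stdout
    awk -v min="$1" '
        NR % 4 == 1 { header = $0 }
        NR % 4 == 2 { seq = $0 }
        NR % 4 == 3 { plus = $0 }
        NR % 4 == 0 { if (length(seq) >= min) print header "\n" seq "\n" plus "\n" $0 }
    '
}
```

For example, `filter_short_reads 25 < trimmed.fastq > filtered.fastq` would retain reads of 25 bases or longer, where both file names are placeholders.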

2.9.3 Transcriptome mapping software

An analysis pipeline was created to manage the high throughput sequencing datasets

generated by this study. Each file can contain about 150 million reads consisting of 10

Gigabases of sequence data. A reference based assembly was used for this study, and

mapping was performed against the reference genome H37Rv using BWA (version

0.5.9) (Li & Durbin, 2009). The raw sequence data file in the fastq format (Cock et al.,

2009) was mapped to the reference genome in fasta format using the following

commands:

The reference sequence was first indexed using bwa index:

$ bwa-0.5.9 index [in.fasta]

The reads in the fastq file were mapped to the indexed fasta using the following

commands:

$ bwa-0.5.9 aln -I [in.reference.fasta] [in.fastq] > [out.fastq.sai]

$ bwa-0.5.9 samse [in.reference.fasta] [in.fastq.sai] [in.fastq] > [align.sam]

For later processing and storage the mapped file in sam format is converted to the binary

format, BAM, using SAMtools (Li et al., 2009).

$ samtools view -bS [in.align.sam] > [out.align.bam]

The bam file is sorted to further reduce storage size and indexed for viewing the BAM

file in Artemis.

$ samtools sort [in.align.bam] [out.align.sorted]

$ samtools index [in.align.sorted.bam]

Basic mapping statistics after this stage were viewed using the SAMtools idxstats

command.

$ samtools idxstats [in.align.bam]
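The idxstats output is tab-delimited, with one line per reference sequence giving its name, length, number of mapped reads and number of unmapped reads. A summary of the proportion of mapped reads could be derived from it as sketched below; the function name summarise_idxstats and the example read counts are hypothetical.

```shell
# Sketch: summarise "samtools idxstats" output
# (columns: reference name, length, mapped reads, unmapped reads).
# summarise_idxstats is a hypothetical helper name.
summarise_idxstats() {
    awk -F '\t' '
        { mapped += $3; unmapped += $4 }
        END {
            total = mapped + unmapped
            printf "mapped=%d total=%d percent=%.1f\n", mapped, total, (total ? 100 * mapped / total : 0)
        }
    '
}

# Example with made-up counts for a single reference sequence
printf 'NC_000962\t4411532\t9500000\t0\n*\t0\t0\t500000\n' | summarise_idxstats
```

In practice the function would be used as `samtools idxstats [in.align.bam] | summarise_idxstats`.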

Artemis plots were produced using the unix paste command:

$ paste [genomeCoverageBed reverse strand.out] [genomeCoverageBed forward strand.out] > artemis.plot.out

2.9.4 Calculation of mapped read frequencies per feature region

The coverage of reads mapping to sense and antisense gene annotations and sRNAs

was calculated using the BEDtools package (Quinlan & Hall, 2010). Specifically, the

coverageBed and genomeCoverageBed utilities were used for extraction of gene regions

and whole genome coverage plots respectively. BEDtools supports four file formats

widely used for HTS data: BED, GFF, VCF and SAM/BAM. Gene and intergenic

annotations based on H37Rv were parsed into the BED (Browser Extensible Data)

format using standard Linux command line tools. The BED format consists of one line per

feature, each line containing a minimum of three fields of tab-delimited information:

chr (chromosome name), chr start (start position), chr end (end position). Two of the

optional fields were used in this study: name (feature e.g. gene name), strand (either

forward or reverse strand). These optional fields enable the calculation of the number

of reads that map to either the coding (sense) or non-coding (antisense) strand of the

gene in question.
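As an illustration of the parsing step, the sketch below converts a simple gene table into six-column BED. It assumes an input of name, start, end and strand with 1-based inclusive coordinates; BED starts are 0-based and ends are exclusive, so only the start is shifted. Because strand is the sixth BED field, a placeholder score of 0 is written in the fifth. The function name genes_to_bed and the input layout are hypothetical, and the example row is illustrative.

```shell
# Sketch: convert "name<TAB>start<TAB>end<TAB>strand" (1-based, inclusive)
# into BED6: chrom, start (0-based), end (exclusive), name, score, strand.
# genes_to_bed and the input layout are hypothetical.
genes_to_bed() {
    # $1 = chromosome name to write in the first BED column
    awk -F '\t' -v OFS='\t' -v chr="$1" '{ print chr, $2 - 1, $3, $1, 0, $4 }'
}

# Illustrative example row
printf 'Rv0001\t1\t1524\t+\n' | genes_to_bed NC_000962
```

The chromosome name must match the reference name used in the BAM file for downstream BEDtools comparisons to find any overlaps.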

As described above, the coverageBed utility was used to identify the number of reads

mapping to each annotated feature, such as a gene. The following command was used to count

reads mapping to a specific strand in the BAM file, in this case the forward strand.

$ bamToBed -i [in.align.bam] | grep -w + | coverageBed -a stdin -b [annotations.bed] > [plus.strand.out]
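The logic of separating sense from antisense reads can be sketched without BEDtools on small inputs. The example below assumes that both files are in six-column BED format (as produced by bamToBed for the reads) and that, for a strand-specific library, a read on the same strand as the gene is sense; this orientation is an assumption that depends on the library protocol. The function name count_sense_antisense is hypothetical, and the nested loop is only suitable for small test data.

```shell
# Sketch: count reads overlapping each gene on the same (sense) or opposite
# (antisense) strand. Inputs are BED6 files; intervals are half-open.
# count_sense_antisense is a hypothetical helper, not a BEDtools utility.
count_sense_antisense() {
    # $1 = annotations BED6, $2 = reads BED6
    awk -F '\t' '
        NR == FNR {                      # first file: store gene coordinates
            gs[$4] = $2 + 0; ge[$4] = $3 + 0; gstrand[$4] = $6
            order[++n] = $4
            next
        }
        {                                # second file: test each read against each gene
            for (i = 1; i <= n; i++) {
                g = order[i]
                if ($2 + 0 < ge[g] && $3 + 0 > gs[g]) {
                    if ($6 == gstrand[g]) sense[g]++; else anti[g]++
                }
            }
        }
        END {
            for (i = 1; i <= n; i++) {
                g = order[i]
                printf "%s\t%d\t%d\n", g, sense[g], anti[g]
            }
        }
    ' "$1" "$2"
}
```

The output has one line per gene, giving the gene name followed by the sense and antisense read counts.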

The genomeCoverageBed utility provides a useful base-by-base output of read depth that can

be imported into Artemis, and was also used in deletion analysis. The following command was

used to identify all read depths on the forward strand:

$ genomeCoverageBed -strand + -d -ibam [in.align.bam] -g [genome_length.bed] > [plus.strand.out]
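One way in which per-base depth output can support deletion analysis is by reporting runs of consecutive positions with zero coverage. The sketch below illustrates this idea only; it is not the genomicDeletions.pl script referred to in this thesis, and the function name find_zero_runs and the minimum run length are hypothetical. The input is assumed to be the three-column output of genomeCoverageBed -d (chromosome, 1-based position, depth).

```shell
# Sketch: report runs of zero-depth positions of at least a minimum length.
# Input: chrom<TAB>position<TAB>depth, sorted by position.
# find_zero_runs is a hypothetical helper, not the thesis script.
find_zero_runs() {
    # $1 = minimum run length in bases
    awk -F '\t' -v min="$1" '
        function report() {
            if (start && prev - start + 1 >= min) printf "%s\t%d\t%d\n", chr, start, prev
            start = 0
        }
        {
            if ($3 == 0 && start && $1 == chr && $2 == prev + 1) {
                prev = $2                          # extend the current run
            } else {
                report()                           # close any open run
                if ($3 == 0) { chr = $1; start = $2; prev = $2 }
            }
        }
        END { report() }
    '
}
```

Each output line gives the chromosome and the first and last position of a candidate region, which would still need to be checked against repetitive regions where reads cannot be mapped uniquely.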

2.9.5 R

R is an open source statistical programming environment (R Development Core Team, 2008).

The Bioconductor project, a collection of packages written in R, was used as it provides tools for the

analysis and comprehension of high-throughput genomic data. Specific packages used

are described in the Methods of Chapter 5 in relation to RNA-seq analysis.

2.9.6 Perl scripts

Ad hoc Perl scripts were written to aid in the parsing of flat file formats for use in programs such as

Artemis and R. In addition to these, the Perl script genomicDeletions.pl was written to

identify genomic deletions in genome sequencing data (Appendix A).

2.9.7 GraphPad Prism 5.0

GraphPad Prism 5.0c for OS X was used for the plotting and analysis of data.

The software contains comprehensive statistical analysis and presentation

tools.

Chapter 3 Lineage-specific SNPs

3.1 Introduction

Genetic variation within the M. tuberculosis complex (MTBC) is higher than previously

recognised. From studies of Large Sequence Polymorphisms (LSPs), to targeted multi

locus sequence analysis (MLSA), and finally whole genome sequencing (WGS), each

method has provided a greater resolution of the genetic variation that exists between

clinical isolates (Comas et al., 2010; Gagneux & Small, 2007; Hershberg et al., 2008).

The most comprehensive set of phylogenetically representative strains sequenced using

new high throughput sequencing (HTS) technology was published recently (Comas et

al., 2010). For the first time all branches within the MTBC phylogenetic tree could be

resolved, encompassing the six major MTBC phylogenetic lineages. Genome sequences

of the twenty-one clinical strains sequenced in the previous study are publicly available,

making this an ideal reference phylogeny on which to base further analyses. The

genomes were sequenced at high depth (40 to 90-fold coverage) using the Illumina

sequencing platform, making it possible to capture the most complete picture yet of

MTBC nucleotide diversity.

Single Nucleotide Polymorphisms (SNPs) are the most common form of genetic

variation in the MTBC and, driven by advances in sequencing technology, an extensive

and ever growing catalogue of SNPs amongst clinical isolates of M. tuberculosis has

been identified (Comas et al., 2010; Stucki & Gagneux, 2012). As described in Chapter

1, analysis of SNPs in 89 genes from 99 human MTBC isolates provided strong

evidence that human MTBC originated in Africa and accompanied the Out-of-Africa

migrations of modern humans approximately 70,000 years ago (Hershberg et al., 2008).

The six human MTBC lineages exhibit a strong global population structure (Gagneux et

al., 2006a) and phenotypic diversity has been associated with the different MTBC

lineages. This includes the ability to elicit an immune response in vivo (Portevin et al.,

2011), and clinical associations with extrapulmonary tuberculosis (Kong et al., 2005;

Kong et al., 2007). However, the role that MTBC genomic diversity plays in TB

disease remains an open question, albeit one that can now be explored using a rational data

driven approach (Coscolla & Gagneux, 2010).

Using available MTBC genome datasets, it is now possible to identify all SNPs that

contribute to the background genetic variation of the six lineages. Due to the clonal

population structure of MTBC (Supply et al., 2003), the majority of this variation is

expected to be exclusive to the lineage in question, and therefore private from all other

lineage strains. This presents an opportunity to understand the nature of this lineage-

specific variation, and is expected to provide insight into how the hypothesised reduced

purifying selection in the MTBC has shaped the lineages (Hershberg et al., 2008).

3.1.1 Aims

The aim of the work presented in this chapter was to characterise whole genome

variation within the MTBC at the lineage-specific level using M. tuberculosis and M.

africanum clinical isolates. As the identification of the lineage-specific SNPs is reliant

on a representative phylogeny, the initial aim was to generate a robust phylogeny

comprising strains sequenced using second-generation sequencing technology.

Following generation of a robust phylogeny, specific aims of the analysis were to:

• identify lineage-specific SNPs from the six main lineages. These SNPs make up

the basal branch of each lineage.

• gain insights into the evolution of the MTBC, focusing on the type and

frequency of genetic changes within and across the phylogenetic lineages.

• measure the selective pressures on different gene function categories across the

lineages.

3.2 Materials and Methods

3.2.1 Genome collection used in study

In total twenty-eight phylogenetically representative strains were used in this study.

Twenty-seven were collected from previously published resources, either through

deposited data in public databases or published studies (Comas et al., 2010). Accession

numbers are as follows: SRP001137, SRA009341, SRA009367, SRA008875,

SRA009637. An additional strain was sequenced as part of this study (Lineage 2 strain

N0031). Data have been deposited in the EBI SRA under the accession number

ERX192819. Details of the strains, country of isolation, and metrics from the mapping

performed for this study are shown in Table 3.1.

3.2.2 Genome sequencing

Genomic DNA for N0031 was extracted using the CTAB method described previously

(section 2.2.1), and 2µg DNA used for sequencing on the Illumina HiSeq platform.

Sequencing libraries were constructed using the Epicentre Nextera DNA kit according to

manufacturer’s instructions. Paired-end 75 base read sequencing was performed in a

single Illumina flowcell lane as part of a multiplexed run. In total 10.6 million reads

were generated, corresponding to an average sequencing depth of approximately 180-fold.
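The quoted depth can be checked with the usual approximation, depth = number of reads × read length / genome length. The sketch below assumes a genome length of 4,411,532 bp for H37Rv (NC_000962), which is not stated in this section; the function name mean_depth is hypothetical.

```shell
# Sketch: expected average depth = reads * read length / genome length.
# The genome length passed below (4,411,532 bp for H37Rv) is an assumption.
mean_depth() {
    # $1 = number of reads, $2 = read length in bases, $3 = genome length in bases
    awk -v n="$1" -v len="$2" -v g="$3" 'BEGIN { printf "%.1f\n", n * len / g }'
}

mean_depth 10600000 75 4411532
```

This gives a theoretical maximum; the mapped depth will be lower once unmapped, low quality and non-unique reads are excluded.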

3.2.3 Mapping genome sequences

MAQ (Li et al., 2008) was used to map the reads produced by the Illumina sequencer to

the reference genome. The most recent common ancestor of MTBC was used as the

reference sequence as described previously (Comas et al., 2010). This sequence is based

on the H37Rv genome (NC_000962) but substituting H37Rv alleles with those of the

reconstructed common ancestor of the strains. Standard MAQ parameters were used,

removing SNPs with a Phred score <30, read depth of <5, and non-unique matches. A

non-redundant list of variable positions called with high confidence in at least one strain

was constructed and used to recover the base call in all other strains. SNPs and indels

called within repetitive regions (genes annotated as PE/PPE/insertions/phages) were

removed.
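The quality and depth thresholds described above amount to a simple row filter. The sketch below assumes a hypothetical tab-delimited SNP table with columns position, reference allele, alternative allele, Phred quality and read depth; this layout is for illustration only and is not the native MAQ output format. The function name filter_snps is likewise hypothetical.

```shell
# Sketch: keep SNP calls with Phred quality >= 30 and read depth >= 5.
# Assumed columns: position, ref allele, alt allele, quality, depth
# (a hypothetical layout, not the MAQ output format).
filter_snps() {
    # $1 = minimum quality (default 30), $2 = minimum depth (default 5)
    awk -F '\t' -v minq="${1:-30}" -v mind="${2:-5}" '$4 + 0 >= minq && $5 + 0 >= mind'
}
```

Removal of non-unique matches and of calls within repetitive regions would require additional mapping and annotation information and is not covered by this sketch.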

3.2.4 Phylogenetic analysis

Phylogenetic analysis was based on filtered SNPs detected when each strain was

compared against the most recent common ancestor of the sequences, as explained

above (section 3.2.3). Concatenated SNPs from 13,086 variable genomic positions were

used to infer the phylogenetic relationships between strains using the neighbour-joining

method. Both coding and noncoding positions were included. The resulting tree was generated

with MEGA (Tamura et al., 2011), using 1000 bootstrap replications for clade support,

and the observed number of substitutions as the measure of genetic distance. In cases

where SNP calls were missing from individual strains, pairwise-deletion was performed

and missing data in the specific comparison ignored. As an outgroup, the distantly

related M. canettii (strain K116) was used to root the tree. For presentation purposes the

branch length of the M. canettii outgroup was reduced by only including SNP positions

shared by the MTBC and M. canettii. Trees in Newick tree format were imported into

FigTree v1.3.1, a graphical viewer of phylogenetic trees and a program for producing

publication-ready figures. FigTree was downloaded from:

http://tree.bio.ed.ac.uk/software/figtree/.
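The distance measure described above, the observed number of substitutions with pairwise deletion of missing calls, can be illustrated for a single pair of strains. This sketch is not the MEGA implementation; it assumes two equal-length strings of concatenated SNP alleles in which "N" or "-" marks a missing call, and the function name pairwise_differences is hypothetical.

```shell
# Sketch: count observed differences between two concatenated SNP strings,
# skipping positions where either strain has a missing call (pairwise deletion).
# pairwise_differences is a hypothetical helper, not part of MEGA.
pairwise_differences() {
    # $1, $2 = allele strings of equal length; "N" or "-" denotes missing data
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = length(a)
        for (i = 1; i <= n; i++) {
            x = substr(a, i, 1); y = substr(b, i, 1)
            if (x ~ /[N-]/ || y ~ /[N-]/) continue   # ignore this position for this pair only
            if (x != y) d++
        }
        print d + 0
    }'
}

pairwise_differences ACGTN ATGAA
```

In the example the fifth position is ignored because one strain lacks a call there, leaving two observed differences.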

3.2.5 Categorising SNPs

SNPs were categorised as nonsynonymous (an amino acid change) or synonymous (no

change) using snpEff (Cingolani et al., 2012). Source code was downloaded from:

https://snpeff.svn.sourceforge.net/svnroot/snpeff/SnpEffect/trunk, and run as a local

installation. As an input snpEff takes two files: a database for the reference genome, and

a SNP file in the Variant Call Format (VCF). It was necessary to generate a custom

reference database based on the ancestral genome sequence of the MTBC. The database

was built within snpEff using the package’s command line modules, and the ancestral

sequence in fasta format was parsed into the Gene Transfer Format version 2.2 (GTF

2.2), using the Tuberculist database gene annotations, version 22 (May 2011), to

define regions encoding genes. Annotation of SNPs by functional category was based on

the Tuberculist database. Genes are grouped into ten functional categories as described

previously (section 2.8.2).

3.2.6 dN/dS calculation

dN/dS was calculated as the ratio of the two rates dN and dS. dN is calculated by

dividing the sum of nonsynonymous SNPs by the total number of potential

nonsynonymous sites in coding sequences, and dS is the sum of synonymous SNPs

divided by the total number of synonymous sites in coding sequences. Due to the low

number of SNPs in the MTBC, instead of calculating the dN/dS per gene, gene

concatenates were generated based on different classifications: firstly, genes defined as

essential and nonessential on the basis of transposon screens (Sassetti et al., 2003;

Sassetti & Rubin, 2003), and secondly the Tuberculist gene functional categories.

For each concatenate, the Nei-Gojobori method as implemented in SNAP was used to define

synonymous and nonsynonymous substitutions by pairwise comparison against the

inferred ancestral genome (Korber, 2000).
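The ratio defined at the start of this section can be written out as a short calculation. The sketch below uses only the uncorrected proportions described in the text (SNP counts divided by the numbers of potential sites) and does not include any correction for multiple substitutions that dedicated software may apply. The function name dn_ds and all counts in the example are hypothetical.

```shell
# Sketch: dN = nonsynonymous SNPs / nonsynonymous sites,
#         dS = synonymous SNPs / synonymous sites, then dN/dS.
# Uncorrected proportions only; dn_ds and the example counts are hypothetical.
dn_ds() {
    # $1 = nonsynonymous SNPs, $2 = nonsynonymous sites,
    # $3 = synonymous SNPs,    $4 = synonymous sites
    awk -v nd="$1" -v n="$2" -v sd="$3" -v s="$4" 'BEGIN {
        dn = nd / n; ds = sd / s
        printf "dN=%.6f dS=%.6f dN/dS=%.3f\n", dn, ds, dn / ds
    }'
}

dn_ds 60 3000000 40 1000000
```

Because roughly three quarters of coding positions are nonsynonymous sites, a higher raw count of nonsynonymous SNPs can still yield a dN/dS below 1, as in this made-up example.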

Table 3.1. Twenty eight strains used in this study. Patient place of birth and strain isolation given. Depth of coverage and number of filtered SNPs relative to the reference H37Rv is shown. 1 Alternative strain name as used in previous MLSA and genome study is included to preserve the link to a new systematic naming convention used in the strain collection (Comas et al., 2010; Hershberg et al., 2008). 2 Based on H37Rv reference genome.

Strain name  Alternative name 1  Lineage    Patient place of birth  Country of isolation  Average mapped depth  Number of reads  Percent genome coverage 2  Filtered SNPs  Study source of genome
MTB_95_0545  N0032               Lineage 1  Laos                    San Francisco         77.37                 7,621,946        99.75                      1,834          Comas et al., (2010)
MTB_T17      N0121               Lineage 1  The Philippines         San Francisco         72.59                 7,130,412        99.36                      1,867          Comas et al., (2010)
N0157        MTB_T92             Lineage 1  The Philippines         San Francisco         46.01                 5,068,053        98.85                      1,883          Comas et al., (2010)
MTB_K21      -                   Lineage 1  Zimbabwe                San Francisco         77.99                 7,112,888        99.29                      1,937          Comas et al., (2010)
MTB_K67      -                   Lineage 1  Comoro Islands          San Francisco         78.29                 7,097,284        98.95                      1,910          Comas et al., (2010)
MTB_K93      -                   Lineage 1  Tanzania                San Francisco         65.52                 6,017,391        99.22                      1,883          Comas et al., (2010)
N0070        EAS050              Lineage 1  Indonesia               San Francisco         55.07                 3,421,436        99.04                      1,290          Unpublished (Comas et al., 2013)
N0072        EAS053              Lineage 1  India                   San Francisco         59.49                 3,696,378        98.89                      1,934          Unpublished (Comas et al., 2013)
N0153        MTB_T83             Lineage 1  Vietnam                 San Francisco         61.56                 3,573,058        97.18                      1,854          Broad Institute (SRA009341)

MTB

_00_

1695

N00

01Li

neag

e 2

Japa

nSa

n Fr

anci

sco

77.9

27,

394,

236

99.0

21,

280

Com

as e

t al.,

(201

0)N

0031

MTB

_94_

M42

41A

Line

age

2C

hina

San

Fran

cisc

o17

9.69

21,1

38,7

2899

.23

1229

This

stud

y (S

RA

TB

C)

N00

52M

TB_9

8_18

33Li

neag

e 2

Chi

naSa

n Fr

anci

sco

64.4

96,

395,

114

99.1

01,

279

Com

as e

t al.,

(201

0)M

TB_M

4100

AN

0110

Line

age

2So

uth

Kor

eaSa

n Fr

anci

sco

40.4

74,

022,

290

98.9

41,

276

Com

as e

t al.,

(201

0)N

0145

MTB

_T67

Line

age

2C

hina

San

Fran

cisc

o78

.77

7,61

6,60

398

.73

1,29

3C

omas

et a

l., (2

010)

MTB

_T85

N01

55Li

neag

e 2

Chi

naSa

n Fr

anci

sco

61.6

56,

159,

284

99.0

41,

305

Com

as e

t al.,

(201

0)M

TB_9

1_00

79N

0022

Line

age

3Et

hiop

iaSa

n Fr

anci

sco

74.0

37,

228,

038

99.1

41,

271

Com

as e

t al.,

(201

0)M

TB_S

G1

N01

14Li

neag

e 3

Indi

aSa

n Fr

anci

sco

66.3

43,

850,

822

99.2

513

30B

road

Inst

itute

(SR

A00

9637

)M

TB_K

49-

Line

age

3Ta

nzan

iaSa

n Fr

anci

sco

75.5

26,

845,

266

99.2

51,

263

Com

as e

t al.,

(201

0)H

37R

v-

Line

age

4U

SA-

Ref

eren

ce-

--

-M

TB_4

783_

04-

Line

age

4Si

erra

-Leo

neSa

n Fr

anci

sco

78.1

27,

466,

814

98.7

873

3C

omas

et a

l., (2

010)

MTB

_GM

_150

3-

Line

age

4Th

e G

ambi

aSa

n Fr

anci

sco

82.2

67,

891,

933

99.0

879

1C

omas

et a

l., (2

010)

MTB

_K37

-Li

neag

e 4

Uga

nda

San

Fran

cisc

o59

.86

5,48

0,45

198

.85

661

Com

as e

t al.,

(201

0)M

TB_E

rdm

an-

Line

age

4-

-32

.69

4,33

3,18

498

.22

862

Bro

ad In

stitu

te (S

RA

0088

75)

MTB

_KZN

_K60

5-

Line

age

4So

uth

Afr

ica

Sout

h A

fric

a93

.51

11,4

58,6

4399

.52

771

Bro

ad In

stitu

te (S

RA

0096

37)

MA

F_11

821_

03-

Line

age

5Si

erra

-Leo

neSa

n Fr

anci

sco

78.2

27,

491,

737

99.0

21,

959

Com

as e

t al.,

(201

0)M

AF_

5444

_04

-Li

neag

e 5

Gha

naSa

n Fr

anci

sco

79.7

57,

578,

690

98.9

21,

959

Com

as e

t al.,

(201

0)M

AF_

4141

_04

-Li

neag

e 6

Sier

ra-L

eone

San

Fran

cisc

o72

.62

7,02

7,14

398

.61

2,04

5C

omas

et a

l., (2

010)

MA

F_G

M_0

981

-Li

neag

e 6

The

Gam

bia

San

Fran

cisc

o76

.39

7,35

0,87

399

.00

2,06

5C

omas

et a

l., (2

010)

MTB

_K11

6-

M. c

anet

tiD

jibou

tiD

jibou

ti93

.01

6,54

4,25

496

.32

1,01

8C

omas

et a

l., (2

010)

1 A

ltern

ativ

e na

me

as u

sed

in H

ersh

berg

et a

l (20

08) a

nd C

omas

et a

l (20

10).

Pres

erve

s lin

k fr

om p

revi

ousl

y pu

blis

hed

and

trans

ition

to sy

stem

atic

nam

ing

conv

entio

n2 B

ased

on

H37

Rv

refe

renc

e ge

nom

e


3.3 Results

3.3.1 A globally representative 28-genome human-adapted MTBC phylogeny

To identify and extract all lineage-specific SNPs, a representative genome collection

was built from previously published and newly sequenced M. tuberculosis strains (Table

3.1). This set of genomes formed the dataset to identify the lineage-specific SNPs

analysed in this study; a subset of these strains will also be followed in Chapter 5 using a

transcriptomic approach (RNA-sequencing). The majority of the strains used in this

phylogeny were published by Comas et al. (2010), consisting of twenty-one genomes

sequenced on the Illumina platform. As previously reported, these genomes have a mean

72-fold sequence depth, with 98.9% coverage of the reference genome (Comas et al.,

2010). A further six genome sequences were downloaded from the European

Nucleotide Archive (ENA), and the last strain, N0031, was sequenced as part of this

study. Strain N0031 was included in the previous MLSA study and is therefore known to

be a rare Lineage 2 strain that is ancestral to the Beijing subgroup (Hershberg et al.,

2008). For this reason the strain was selected for sequencing to capture the greatest

possible within-lineage diversity. All strains were sequenced using the Illumina platform

with a minimum average sequence depth of 32-fold (Table 3.1).

Using the H37Rv genome as reference, a mapping assembly was built for the twenty-

eight strains using MAQ (Heng, 2008). SNPs were filtered out if they had low associated

Phred quality scores or read numbers, or if they fell within annotated repeat regions such

as PE/PPE regions (see 3.2.3). Such regions are families of genes encoding proteins

carrying Proline-Glutamic acid (PE) or Proline-Proline-Glutamic acid (PPE) motifs

found near the N-terminus (Cole et al., 1998), and are inherently difficult to map using

short read technology such as Illumina. In total 39,764 SNPs were identified in the

strains relative to the reference, and the frequency of filtered SNPs per strain is shown in

Table 3.1. Many of these SNPs are present in more than one strain, leaving a high level


of redundancy in the SNP lists. A non-redundant list of SNPs was constructed,

highlighting 13,088 nucleotide positions that were variable across the 4.4 Mb genome.

Each of these positions therefore harbours a SNP in one or more of the 28 strains, and they were

subsequently used to derive a genome wide phylogeny. A Neighbour-Joining phylogeny,

constructed using MEGA5 (Tamura et al., 2011), is shown in Figure 3.1.
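The step from redundant per-strain SNP lists to a non-redundant set of variable positions can be sketched as follows. This is only an illustration of the idea, not the pipeline used in the thesis; the strain names, positions and alleles are hypothetical toy data, and the function names are invented for the example. The resulting strings of variable positions are the kind of input a tree-building program would take.

```python
def variable_positions(snps_by_strain):
    """Return the sorted, non-redundant set of SNP positions across all strains."""
    positions = set()
    for calls in snps_by_strain.values():
        positions.update(calls)  # dictionary keys are genome positions
    return sorted(positions)


def snp_alignment(snps_by_strain, reference):
    """Build one string per strain covering only the variable positions.

    A strain without a SNP call at a position is given the reference base.
    """
    positions = variable_positions(snps_by_strain)
    return {
        strain: "".join(calls.get(pos, reference[pos]) for pos in positions)
        for strain, calls in snps_by_strain.items()
    }


# Hypothetical toy data: reference bases and filtered SNP calls per strain.
reference = {101: "A", 250: "C", 900: "G"}
snps_by_strain = {
    "strain_A": {101: "G", 250: "T"},
    "strain_B": {101: "G"},
    "strain_C": {900: "A"},
}

print(variable_positions(snps_by_strain))
print(snp_alignment(snps_by_strain, reference))
```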

Strains group into six main phylogenetic lineages, with bootstrap values indicating

strong statistical support (Figure 3.1). The phylogenetic structure and strain groupings

are congruent with the most recent whole-genome-based phylogeny (Comas et

al., 2010), and with previous MLSA and gene-deletion-based phylogenies (Comas et al.,

2010; Gagneux et al., 2006a; Hershberg et al., 2008). The lineage colouring

scheme used in previous studies is retained here and, where

applicable, throughout the thesis (Comas et al., 2010; Hershberg et al., 2008). Naming of

lineages from 1 to 6 follows the convention of Comas et al. (2010). Mycobacterium

canetti (strain K116) was used to root the phylogenetic tree, as it is the closest known

relative to the MTBC (Gutierrez et al., 2005). The number of SNPs has been artificially

reduced for M. canetti in the phylogeny (Figure 3.1). This reduction was performed for

aesthetic reasons due to the large number of singletons between M. canetti and any of

the other MTBC strains used in this study. For example, between the reconstructed most

recent common ancestor of the MTBC sequence used in this study (see section 3.2.3)

and the M. canetti genome sequence used there are 12,319 SNPs, compared to the

~1,500 SNPs between any other MTBC strain and the ancestral sequence.


Figure 3.1. Neighbour-joining phylogeny based on 13,088 variable common

nucleotide positions across 28 human-adapted MTBC genome sequences. Scale bar

shows the number of SNPs. The six lineages are coloured as defined previously

(Hershberg et al., 2008). The root has been truncated due to the large numbers of

changes that separate M. canetti from the rest of the phylogeny. Node support after

1,000 bootstrap replications with all nodes > 75. M. canetti strain K116 was used as the

phylogenetic outgroup.

[Figure 3.1 image not reproduced. Scale bar: 200 SNPs. Clade labels: Lineage 1 to Lineage 6. Node labels: bootstrap support values ranging from 77 to 100. Tip labels: MAF_11821_03, MAF_4141_04, MAF_5444_04, MAF_GM_0981, MCAN_K116, MTB_00_1695, MTB_4783_04, MTB_91_0079, MTB_95_0545, MTB_erdman, MTB_GM_1503, MTB_H37Rv, MTB_K21, MTB_K37, MTB_K49, MTB_K67, MTB_K93, MTB_KZN_605, MTB_M4100A, MTB_N0031, MTB_N0052, MTB_N0070, MTB_N0072, MTB_N0145, MTB_N0153, MTB_N0157, MTB_SG1, MTB_T17, MTB_T85.]


Table 3.2. Estimates of evolutionary divergence between strains. Pairwise SNP distances in the 28 genome phylogeny. Strain names in the matrix are the same as in Figure 3.1. Columns are numbered in the same order as the rows.

                     1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21   22   23   24   25   26   27   28   29
 1 MAF_11821_03      0
 2 MAF_4141_04    1946    0
 3 MAF_5444_04     466 1922    0
 4 MAF_GM_0981    1973  701 1958    0
 5 MTB_00_1695    1946 2026 1935 2050    0
 6 MTB_4783_04    1890 1969 1881 1989 1215    0
 7 MTB_91_0079    1941 2023 1931 2050 1134 1207    0
 8 MTB_95_0545    1899 1972 1895 1998 1833 1769 1831    0
 9 MTB_N0052      1945 2024 1938 2040  589 1206 1143 1830    0
10 MTB_erdman     1838 1904 1827 1926 1166  805 1164 1706 1163    0
11 MTB_GM_1503    1915 1991 1909 2009 1216  721 1224 1795 1217  829    0
12 MTB_H37Rv      1959 2045 1959 2065 1280  733 1271 1834 1279  862  791    0
13 MTB_K21        1986 2070 1978 2088 1909 1869 1913  900 1902 1802 1888 1937    0
14 MTB_K37        1827 1911 1825 1932 1152  562 1150 1713 1144  750  666  661 1812    0
15 MTB_K49        1928 2021 1922 2038 1113 1190  366 1823 1116 1154 1208 1263 1900 1133    0
16 MTB_K67        1954 2031 1945 2052 1884 1836 1891  868 1883 1773 1856 1910  924 1776 1872    0
17 MTB_K93        1930 1998 1915 2025 1855 1802 1862  838 1855 1733 1824 1883  892 1750 1850  317    0
18 MTB_KZN_605    1880 1956 1873 1981 1205  706 1196 1755 1208  799  468  771 1850  644 1183 1816 1789    0
19 MTB_M4100A     1934 2016 1924 2040  867 1192 1113 1811  864 1143 1210 1276 1902 1141 1097 1866 1838 1193    0
20 MTB_SG1        1994 2079 1986 2103 1191 1266  436 1892 1186 1221 1279 1330 1949 1214  390 1925 1895 1254 1170    0
21 MTB_T17        1921 2004 1915 2025 1833 1800 1845  907 1839 1731 1807 1867  967 1744 1831  934  908 1781 1833 1897    0
22 MTB_N0145      1955 2036 1950 2052  333 1226 1148 1849  599 1183 1228 1293 1922 1169 1130 1892 1867 1218  870 1203 1852    0
23 MTB_N0153      1887 1964 1873 1990 1827 1780 1832  503 1828 1714 1800 1854  877 1725 1818  836  809 1764 1801 1863  885 1838    0
24 MTB_T85        1974 2058 1972 2079  357 1242 1170 1859  621 1187 1249 1305 1946 1182 1144 1917 1884 1233  888 1220 1872  231 1857    0
25 MTB_N0157      1948 2031 1943 2054 1859 1815 1870  932 1860 1743 1827 1883  990 1761 1856  954  923 1801 1838 1908  336 1869  902 1891    0
26 MTB_N0070      1971 2044 1952 2064 1894 1848 1899  879 1896 1773 1863 1920  931 1793 1884  590  562 1829 1872 1931  940 1906  838 1928  955    0
27 MTB_N0072      1973 2050 1963 2076 1894 1859 1906  886 1891 1787 1868 1934  940 1801 1894  596  570 1842 1876 1940  953 1902  839 1926  973  349    0
28 MTB_N0031      1925 2006 1911 2023  855 1176 1111 1802  852 1132 1189 1229 1882 1117 1088 1848 1825 1169  837 1146 1809  863 1790  880 1828 1847 1848    0
29 MCAN_K116      1062 1139 1048 1156  984  935  999  976  988  892  963 1018 1055  888  983 1012  991  936  965 1038  975  992  956 1018 1004 1019 1032  963    0


Across and within-lineage genetic diversity was next investigated using the phylogeny.

A SNP distance matrix was constructed based on the number of base differences per

pairwise strain comparison, shown in Table 3.2. Across the phylogeny the average

number of SNPs per pairwise comparison is 1,544, which translates to an average of one

SNP per 2.857 kb of sequence, based on the length of the H37Rv reference genome

(4.411532 Mb). In contrast, for the phylogenetic outgroup M. canetti there was on

average one SNP per 0.358 kb of sequence, a SNP density nearly 8 times higher

than that of the MTBC.
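The densities quoted here follow from dividing the genome length by a SNP count, which a short Python sketch makes explicit. The MTBC figure uses the average of 1,544 SNPs per pairwise comparison; for M. canetti the sketch assumes the 12,319 SNPs separating it from the reconstructed ancestor, which is an assumption about how the 0.358 kb figure was derived. The helper name `kb_per_snp` is hypothetical.

```python
GENOME_BP = 4_411_532  # H37Rv reference genome length in base pairs


def kb_per_snp(n_snps, genome_bp=GENOME_BP):
    """Average spacing between SNPs in kilobases."""
    return genome_bp / n_snps / 1000


mtbc_spacing = kb_per_snp(1544)       # average MTBC pairwise comparison
canetti_spacing = kb_per_snp(12319)   # assumed source of the M. canetti figure

print(round(mtbc_spacing, 3))                     # about 2.857 kb per SNP
print(round(canetti_spacing, 3))                  # about 0.358 kb per SNP
print(round(mtbc_spacing / canetti_spacing, 2))   # fold difference in SNP density
```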

Within-lineage variation was next measured by taking the average of all pairwise

comparisons between strains of the same lineage, shown in Figure 3.2. Average within-lineage

diversity ranged from 397 SNPs (sd=36) between Lineage 3 strains to 811 SNPs

(sd=193) between Lineage 1 strains. Lineages 2 and 1 have the greatest within-

lineage variation, with standard deviations of 222 and 193 SNPs respectively. This is

nearly twice that of Lineage 4 (sd=104) and over five times the variation seen in Lineage

3 (sd=36). Lineage 1 also has the greatest number of genome sequences in the

phylogeny (9 strain genomes). This might indicate a discovery bias, where an

increasing number of genome sequences uncovers more within-lineage variation.

Whilst this cannot be ruled out, there was no significant correlation between the

number of strains per lineage and average within-lineage variation (Pearson r = 0.73, p =

0.10). Furthermore, the M. africanum lineages (Lineages 5 and 6) had the fewest

representative strains sequenced at the time of this study, owing to the

restricted number of strains available, but their diversity is still greater than that of Lineage 3, with

Lineage 6 diversity comparable to all but Lineage 1. Overall it would appear that

Lineages 1 and 2 have the greatest within-lineage SNP diversity.
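As a worked example of the within-lineage summary, the sketch below takes the three Lineage 3 pairwise distances from Table 3.2 and computes their mean and standard deviation. It assumes the sample standard deviation was used, which reproduces the reported 397 SNPs (sd=36); the function name and data layout are hypothetical.

```python
from itertools import combinations
from statistics import mean, stdev


def within_lineage_stats(strains, distances):
    """Mean and sample standard deviation of all pairwise distances in a lineage."""
    values = [distances[frozenset(pair)] for pair in combinations(strains, 2)]
    return mean(values), stdev(values)


# Lineage 3 pairwise SNP distances taken from Table 3.2.
lineage3 = ["MTB_91_0079", "MTB_K49", "MTB_SG1"]
distances = {
    frozenset(("MTB_91_0079", "MTB_K49")): 366,
    frozenset(("MTB_91_0079", "MTB_SG1")): 436,
    frozenset(("MTB_K49", "MTB_SG1")): 390,
}

avg, sd = within_lineage_stats(lineage3, distances)
print(round(avg), round(sd))
```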


Figure 3.2. Within-lineage SNP diversity. The number of SNPs per pairwise

comparison of all strains per lineage. Lineages are ordered by Ancient and Modern

groups. Error bars indicate mean and standard deviation (sd). There was not a significant

correlation between the number of strains per lineage and average within-lineage

variation (Pearson r = 0.73, p = 0.10).

[Figure 3.2 plot not reproduced. x-axis: Lineage 1, Lineage 5 and Lineage 6 (Ancient), then Lineage 2, Lineage 3 and Lineage 4 (Modern). y-axis: Number of SNPs, 0 to 1200.]


3.3.2 Identification of all lineage-specific SNPs

Using the underlying connections from the derived whole genome phylogeny, it was

possible for the first time to identify and extract all SNPs that are common to all strains

from each of the six lineages. Due to the clonal nature of the MTBC (Supply et al.,

2003), SNPs within these branches are largely exclusive to the respective lineage. All

alleles on the derived phylogeny were traced throughout the tree and the nodes for each

lineage branch were used to isolate all SNPs that contribute to this branch (Figure 3.3A).

For example, the 163 SNPs between nodes 5 and 7 define Lineage 4 strains (red lineage),

and in all but a few rare cases are exclusive to the lineage. The SNPs were subsequently

defined as lineage-specific, and form the main dataset for the following analysis; SNPs

found in more than one lineage branch represent homoplasic nucleotide positions and

are described later in section 3.3.4.
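A simplified way to picture the extraction is shown below. The thesis traced alleles through reconstructed ancestral states at each node; this sketch instead assumes a known ancestral allele per position and calls a SNP lineage-specific when its derived allele is carried by every strain of one lineage and by no other strain. For a clonal, monophyletic lineage the two views should agree, but that equivalence is an assumption here. All names and data are hypothetical.

```python
def lineage_specific_snps(genotypes, ancestral, lineages):
    """Return {lineage: [(position, derived allele), ...]}.

    genotypes: {position: {strain: allele}}
    ancestral: {position: ancestral allele}
    lineages:  {lineage name: list of member strains}
    """
    result = {name: [] for name in lineages}
    for pos, calls in genotypes.items():
        # Group the strains carrying each derived (non-ancestral) allele.
        carriers_by_allele = {}
        for strain, allele in calls.items():
            if allele != ancestral[pos]:
                carriers_by_allele.setdefault(allele, set()).add(strain)
        for allele, carriers in carriers_by_allele.items():
            for name, members in lineages.items():
                if carriers == set(members):
                    result[name].append((pos, allele))
    return result


# Hypothetical toy data: two lineages of two strains each.
lineages = {"L_x": ["s1", "s2"], "L_y": ["s3", "s4"]}
ancestral = {10: "A", 20: "C", 30: "G"}
genotypes = {
    10: {"s1": "G", "s2": "G", "s3": "A", "s4": "A"},  # fixed in L_x only
    20: {"s1": "C", "s2": "T", "s3": "C", "s4": "C"},  # singleton in s2
    30: {"s1": "G", "s2": "G", "s3": "T", "s4": "T"},  # fixed in L_y only
}

print(lineage_specific_snps(genotypes, ancestral, lineages))
```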

In total 2,794 lineage-specific SNPs were identified (Figure 3.3B), and these are

distributed throughout the genome, as shown in Figure 3.4 (the full list is shown in

Appendix B). The number of lineage-specific SNPs ranges from 124 (Lineage 2) to 698

(Lineage 5). The highest numbers of lineage-specific SNPs are in the two M. africanum

lineages (Lineages 5 and 6). In addition to the six lineages, SNPs from the relatively

long phylogenetic branch that is basal to the three modern lineages (Lineages 2, 3 and 4)

have also been included in this study (Figure 3.3B). This branch defines the three

modern lineages and consists of 319 SNPs. From here on this branch is called the

modern lineage branch.


Figure 3.3. Isolating lineage-specific SNPs from the phylogeny. A. Ancestral states

reconstructed at each node of the tree to extract SNPs belonging to the lineage branches

(so-called lineage-specific SNPs). For example, 163 SNPs between nodes 5 and 7 define

Lineage 4 strains (red lineage). B. All SNPs identified from the lineage branches of the

six lineages, including the Modern lineage branch (coloured in black), which defines the

three modern lineage strains. Arrows show the number of lineage-specific coding and

noncoding SNPs. Scale bar at bottom indicates number of SNPs.


Figure 3.4. Distribution of the lineage-specific SNPs across the genome. Genes on

forward and reverse strands shown in outer rings as blue and red respectively. Mapped

lineage-specific SNPs depicted in six inner rings, with the SNP colouring based on

lineage phylogeny colours. From the innermost ring: Lineage 4, 3, 2, 1, 6, and 5.

Genome structure and size based on H37Rv.


3.3.3 Distribution of SNPs

The most recent M. tuberculosis annotations at the time of this study were used to

classify the lineage-specific SNPs as non-coding (intergenic SNPs) and coding

(Tuberculist database release 24). The average percentage of SNPs falling into these two

regions across all the lineages is shown in Figure 3.5. The vast majority of

SNPs (86.4%) fall within annotated coding regions. This is not unexpected, as the

percentage of the M. tuberculosis genome annotated as coding is 91.3% (based on the

H37Rv reference). However, after adjusting for the difference in sequence length between

coding and noncoding regions, SNPs are not evenly distributed between the two,

with a nearly 2-fold higher SNP density in intergenic regions (1.0

SNPs per kb of intergenic sequence compared to 0.6 SNPs per kb of coding sequence; χ² test,

p < 0.0001). This may not be surprising, as SNPs in coding regions are more likely to be

removed through purifying selection; the selective pressures acting on the coding

regions are investigated later in the chapter (section 3.3.7). Coding SNPs can be further

divided into those that cause a change in the amino acid encoded by the codon (a

nonsynonymous SNP), or cause no change in the amino acid (a synonymous SNP). On

average 55% of all SNPs are nonsynonymous, shown in Figure 3.5. Table 3.3 shows the

frequency of SNP types for each lineage. Although rare, nonsynonymous SNPs were

also found to cause the introduction of a stop codon (1.3% of all SNPs), and these were

found across all lineages (Table 3.3). Conversely, three nonsynonymous SNPs removed

an existing stop codon, contributing to < 0.1% of all lineage-specific SNPs.
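The four coding categories used here (synonymous, nonsynonymous, stop gain and stop loss) can be told apart by translating the ancestral and derived codons. The sketch below illustrates that logic only; snpEff performed the actual annotation. It assumes the standard genetic code, and the codons in the usage lines are generic illustrations rather than codons taken from the dataset.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_coding_snp(ancestral_codon, derived_codon):
    """Classify the effect of a codon change relative to the ancestral codon."""
    before = CODON_TABLE[ancestral_codon]
    after = CODON_TABLE[derived_codon]
    if before == after:
        return "synonymous"
    if after == "*":
        return "stop gain"      # premature stop codon introduced
    if before == "*":
        return "stop loss"      # existing stop codon removed
    return "nonsynonymous"


print(classify_coding_snp("GAA", "GAG"))  # Glu -> Glu
print(classify_coding_snp("GAT", "GGT"))  # Asp -> Gly
print(classify_coding_snp("TGG", "TGA"))  # Trp -> stop
print(classify_coding_snp("TAA", "CAA"))  # stop -> Gln
```

Because the comparison is made against the ancestral codon rather than H37Rv, the returned category reflects the evolutionary direction of the change.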

The direction of amino acid change was determined using the reconstructed ancestral

sequence of the MTBC. This sequence is similar to the H37Rv genome structure and has

the same nucleotide length, but with H37Rv alleles substituted by those inferred from a

reconstruction of the ancestral states using the derived phylogeny (section 3.2.4).

Inference of the ancestral alleles is possible because the chromosome is effectively a

single linkage group and all descendants share characteristics of the single ancestral cell

(Comas et al., 2010). Therefore, using the ancestral sequence is advantageous as it

enables the evolutionary direction of nucleotide change to be determined, instead of

basing the change from the reference strain H37Rv, which can be problematic as it is a

Lineage 4 strain.


Figure 3.5. The average number of lineage-specific SNPs broken down into non-

coding and coding types. Coding SNPs are further subdivided into synonymous,

nonsynonymous, and nonsynonymous SNPs that affect stop codons, either through an

introduction of a stop codon in a coding sequence, (stop gain) or removal of existing

stop codons (stop loss).

Table 3.3. Summary of lineage-specific SNPs. This total includes the nonsynonymous

SNPs indicated in the table that affect stop codons, either through an introduction of a

stop codon in a coding sequence, (stop gain) or removal of existing stop codons (stop

loss).

SNP type        Lineage 1  Lineage 5  Lineage 6  Lineage 2  Lineage 3  Lineage 4  Modern lineage
Intergenic             59         90         86         16         53         18              57
Nonsynonymous         248        395        381         75        183         99             184
Stop gain               8         10          6          1          3          3               5
Stop loss               0          0          0          0          2          0               1
Synonymous            156        213        207         33        117         46              78
Total SNPs            463        698        674        124        353        163             319

[Figure 3.5 pie chart not reproduced. Segment labels: Intergenic 14%, Non-synonymous 55%, Stop gain 1%, Stop loss <1%, Synonymous 30%.]


There were 1,556 genes (38.7% of all annotated genes) with one or more lineage-specific SNPs.

Three quarters (75.1%) of the genes with a lineage-specific SNP harboured a single SNP (Figure

3.6A). The number of SNPs per gene ranged from 0 to a maximum of 8 and followed a

Poisson distribution, suggesting that there is no clustering of SNPs at the gene-specific

level (Figure 3.6B). The single gene with the highest frequency

of SNPs (Rv2524c, fas) encodes a probable fatty acid synthase and has multiple SNPs

present in Lineages 4, 5 and 6. Typical of lipid-associated genes in M. tuberculosis, fas

is quite long at 9.21 kb, compared to the average M. tuberculosis gene length of 1.0 kb.

This is likely the cause of the high number of SNPs: plotting the nucleotide length

of all genes with a lineage-specific SNP against SNP frequency revealed a positive

correlation (Pearson r = 0.43, p < 0.0001; Figure 3.6C).
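One way to compare the observed SNPs-per-gene counts with a Poisson model is sketched below. The observed counts are the bar labels of Figure 3.6A (genes with 0 to 8 SNPs); estimating the Poisson rate as the mean number of SNPs per gene is an assumption, since the fitting procedure is not stated.

```python
from math import exp, factorial

# Number of genes with 0, 1, ..., 8 lineage-specific SNPs (Figure 3.6A bar labels).
observed = [2464, 1000, 377, 109, 51, 11, 5, 2, 1]

n_genes = sum(observed)
total_snps = sum(k * count for k, count in enumerate(observed))
rate = total_snps / n_genes  # assumed estimator: mean SNPs per gene

# Expected number of genes in each class under a Poisson model with this rate.
expected = [n_genes * exp(-rate) * rate**k / factorial(k) for k in range(len(observed))]

for k, (obs, exp_count) in enumerate(zip(observed, expected)):
    print(k, obs, round(exp_count, 1))
```

Plotting the observed and expected columns on a log10 y-axis gives a comparison of the kind described for Figure 3.6B.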

Figure 3.6. Distribution of lineage SNPs per gene. A. Frequency of SNPs per gene,

with actual SNP numbers recorded at top of bars. B. Poisson model (shown in red) fitted

to the data. The y-axis is plotted as a log10 scale to better show the SNP distribution. C.

Correlation between the number of SNPs per gene and gene length.

[Figure 3.6 plots not reproduced. A. x-axis: Number of SNPs per gene (0 to 8); y-axis: Number of genes (0 to 2500); bar labels: 2464, 1000, 377, 109, 51, 11, 5, 2, 1. B. x-axis: Number of SNPs per gene (0 to 8); y-axis: Number of genes (log10, 0.1 to 10000). C. x-axis: Number of SNPs per gene (0 to 8); y-axis: Gene length (nucleotides, 0 to 15000).]


3.3.4 Monomorphic population structure and homoplasic SNPs

The MTBC displays a highly clonal population structure (Supply et al., 2003; Hirsh et

al., 2004). Consistent with this structure, a negligible degree of homoplasy was observed

in the lineages. Of the 2,794 lineage-specific SNPs identified, four were homoplasic,

corresponding to 0.14% of the lineage-specific SNPs (Table 3.4).

These SNPs have the same nucleotide change across two or more lineages, and three of

the four cause synonymous changes.

As shown in Table 3.4, the first homoplasy (SNP 1), at genomic position 1480945,

introduces a synonymous C to G mutation into codon 519 of Rv1319c, which encodes a

possible adenylate cyclase (Cole et al., 1998). This mutation occurs in all Lineage 3

and 5 strains (Figure 3.7A), indicating convergent evolution of this nucleotide position

between an ancient and a modern lineage. Interestingly, homoplasy 2 also occurs in

Rv1319c, at position 1480948, three nucleotides from the first homoplasy.

Furthermore, it occurs in the same lineages (Lineages 3 and 5), causing a

synonymous C to T mutation in the preceding codon (codon 518). Inspection of the MAQ

alignment files confirmed that this was not an artefact of poor sequencing over this region:

the surrounding 100 bp region in strains from

Lineages 3 and 5 was mapped with high confidence, shown by MAQ quality scores of

1.0 (Heng, 2008). An insertion or deletion could cause erroneous

SNPs to be called in close proximity, but this would also lower the associated

MAQ quality scores for the region, and this was not found to be the case. Together this

suggests that these two SNPs have been called with high confidence, and that the two

homoplasies are likely true.

The third and fourth homoplasic SNPs occur in Rv2082, which encodes a conserved

hypothetical protein. The two homoplasies are present in Lineages 1, 2 and 6,

introducing synonymous (A94A) and nonsynonymous (T96A) SNPs (Figure 3.7B).

Again these are in modern and ancient lineages and are located close together, this time

within four nucleotides of each other. The gene has no

known function, but independent mutation of the same allele across three lineages might

suggest biological relevance. Although not within the lineage branch, some strains from

Lineage 4, including H37Rv, also have these two homoplasies as a sub-lineage

homoplasy.
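Detecting homoplasies of this kind amounts to finding the same derived allele at the same position on more than one lineage branch. The sketch below uses only the positions listed in Tables 3.4 and 3.5, so each branch holds a handful of SNPs rather than its full set; the function name and data layout are hypothetical.

```python
from collections import defaultdict


def find_homoplasies(branch_snps):
    """Return {(position, allele): [lineages]} for SNPs present on more than one branch."""
    seen = defaultdict(list)
    for lineage, snps in branch_snps.items():
        for position, allele in snps:
            seen[(position, allele)].append(lineage)
    return {snp: sorted(found) for snp, found in seen.items() if len(found) > 1}


# Derived alleles per lineage branch, restricted to the positions in Tables 3.4 and 3.5.
branch_snps = {
    "Lineage 1": {(2338990, "G"), (2338994, "G"), (2566768, "C")},
    "Lineage 2": {(2338990, "G"), (2338994, "G")},
    "Lineage 3": {(1480945, "G"), (1480948, "T")},
    "Lineage 4": {(2566768, "A")},
    "Lineage 5": {(1480945, "G"), (1480948, "T")},
    "Lineage 6": {(2338990, "G"), (2338994, "G")},
}

for snp, found in sorted(find_homoplasies(branch_snps).items()):
    print(snp, found)
```

Position 2566768 is not reported because Lineages 1 and 4 carry different derived alleles there, which matches its separate treatment in Table 3.5.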


Table 3.4. Homoplasic nucleotide positions within the lineage branches. Independent

mutation of the same nucleotide position occurring across the phylogenetic tree. SNP

position based on the reference strain H37Rv genome coordinates.

Homoplasy  Gene     SNP position  Ancient allele  SNP allele  Mutation  Lineages  Gene product
1          Rv1319c  1480945       C               G           T519T     3, 5      adenylate cyclase
2          Rv1319c  1480948       C               T           E518E     3, 5      adenylate cyclase
3          Rv2082   2338990       C               G           A94A      1, 2, 6   hypothetical protein
4          Rv2082   2338994       A               G           T96A      1, 2, 6   hypothetical protein

Figure 3.7. Homoplasic lineage SNPs. A. Homoplasy 1 and 2 occur in Rv1319c in

Lineages 3 and 5. B. Homoplasy 3 and 4 occur in Rv2082 in Lineages 1, 2 and 6.

A.! B.!


In addition to the four homoplasic positions at the nucleotide level shown in Table 3.4,

there was one intergenic nucleotide position, 2566768, that was mutated to an

adenine in Lineage 4 but a cytosine in Lineage 1 (Table 3.5). Finally, at the amino

acid level, the residue at position 733 of Rv0339c harbours different

nonsynonymous SNPs in Lineages 3 and 5. Rv0339c encodes a transcriptional regulatory

protein, and the two SNPs result in changes to different amino acids in the two lineages

(Table 3.5).

Table 3.5. Variable genomic positions within the lineages. Two nucleotide positions

harbour different SNPs across the lineages.

SNP  Gene           SNP position  Ancient allele  SNP allele  Mutation    Lineage  Gene product
1    Rv2294-Rv2295  2566768       G               A           intergenic  4        hypothetical protein
     Rv2294-Rv2295  2566768       G               C           intergenic  1        hypothetical protein
2    Rv0339c        406251        A               G           D733G       3        transcriptional regulatory protein
     Rv0339c        406251        A               C           D733A       5        transcriptional regulatory protein


3.3.5 Creation of pseudogenes

In total, thirty-nine SNPs were found to affect stop codons. SNPs can either cause the

premature introduction of a new stop codon at any point in the annotated coding

sequence (a nonsense SNP), or more rarely remove an existing stop codon. Thirty-six of

the SNPs cause the former type of nonsense mutation. As shown previously (section

3.3.3), the majority of SNPs occur in isolation within genes, and nonsense SNPs follow

this distribution, thus leading to the potential generation of thirty-five pseudogenes in

the respective lineages (Table 3.6A). The remaining three nonsynonymous SNPs have

the reverse effect, causing the loss or removal of an existing stop codon (Table 3.6B).

Whilst all lineages have accumulated nonsense SNPs, the three ancient lineages have the

greatest frequency, with nearly two-thirds of nonsense SNPs (24 out of 39 nonsense

SNPs). To test if this is due to the longer branch lengths of these lineages compared to

the modern lineages, and so a reflection of the greater time that these lineages have had

to accumulate nonsense mutations, the number of nonsense SNPs was compared to the

total number of SNPs found in each respective lineage branch, shown in Table 3.7.

Lineage 2 has the shortest branch length and one nonsense SNP, whilst Lineage 5 has

the longest branch and the most nonsense SNPs. A significant correlation was found

between branch length and the number of pseudogenes (Pearson r = 0.8477, p = 0.0160).
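The correlation can be recomputed from the values in Table 3.7 with a few lines of Python. This is a sketch of the standard Pearson formula rather than the statistics package used in the thesis; on the tabulated values it gives a coefficient of roughly 0.85, close to the reported figure.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Table 3.7, in row order: Lineages 1, 5, 6, 2, 3, 4 and the modern lineage branch.
branch_length = [463, 698, 674, 124, 353, 163, 319]
nonsense_snps = [8, 10, 6, 1, 3, 3, 5]

print(round(pearson_r(branch_length, nonsense_snps), 2))
```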

It can be seen from Table 3.6 that a large proportion of the nonsense SNPs are within

genes annotated as encoding hypothetical proteins (21 out of 39 SNPs). Using the formal

functional gene categories defined by Tuberculist, it was tested if the nonsense SNPs

were distributed across all gene function categories. Whilst all categories were affected

by one or more nonsense SNP, as expected the hypothetical category contained the

largest proportion (15 SNPs, 38.7%). Due to the low number of nonsense SNPs, it was

not possible to stratify into functional groups by each lineage, but the distribution of

nonsense SNPs was not significantly different for any of the functional categories using

the ancient and modern lineage groupings (Table 3.8) (Mann-Whitney U test, p= 0.24).


Table 3.6 Nonsense SNPs. In total thirty-nine SNPs cause a change in the encoded stop

codon. A. Introduction of a stop codon within the coding sequence. B. Removal of an

existing stop codon. The stop codon is indicated by an asterisk (*) in column 3. Rows

are ordered by gene.

A. Stop introduction

Gene     Name    Mutation  Lineage  Gene product
Rv0064           Q862*     5        hypothetical protein
Rv0134   ephF    W152*     1        epoxide hydrolase
Rv0146           Y94*      3        hypothetical protein
Rv0325           Q75*      4        hypothetical protein
Rv0329c          R141*     6        hypothetical protein
Rv0368c          S277*     5        hypothetical protein
Rv0402c  mmpL1   R376*     5        transmembrane transport protein
Rv0457c          W119*     1        peptidase
Rv0490   senX3   R410*     6        two component sensor histidine kinase
Rv0574c          Q149*     5        hypothetical protein
Rv0610c          Q305*     1        hypothetical protein
Rv0621           W355*     Modern   hypothetical protein
Rv0836c          W218*     4        hypothetical protein
Rv0906           Q183*     1        hypothetical protein
Rv1251c          E875*     3        hypothetical protein
Rv1504c          E200*     Modern   hypothetical protein
Rv1870c          L212*     Modern   hypothetical protein
Rv1912c  fadB5   G63*      3        oxidoreductase
Rv1965   yrbE3B  W11*      5        integral membrane protein
Rv2079           Q609*     2        hypothetical protein
Rv2132           Y60*      5        hypothetical protein
Rv2187   fadD15  Y81*      6        long-chain-fatty-acid-CoA ligase
Rv2187   fadD15  W43*      1        long-chain-fatty-acid-CoA ligase
Rv2299c  htpG    Q109*     6        heat shock protein 90
Rv2339   mmpL9   S917*     5        transmembrane transport protein
Rv2690c          R658*     Modern   hypothetical protein
Rv2788   sirR    Q131*     1        transcriptional repressor
Rv2797c          Q273*     5        hypothetical protein
Rv2818c          Q304*     6        hypothetical protein
Rv2850c          R515*     5        magnesium chelatase
Rv2994           W68*      1        integral membrane protein
Rv3079c          E120*     1        hypothetical protein
Rv3373   echA18  G214*     Modern   enoyl-CoA hydratase
Rv3416   whiB3   E71*      5        transcriptional regulatory protein
Rv3729           W369*     6        transferase
Rv3898c          Q111*     4        hypothetical protein

B. Stop removal

Gene     Name    Mutation  Lineage  Gene product
Rv0257           *23R      Modern   hypothetical protein
Rv1641   infC    *202S     3        translation initiation factor IF-3
Rv1921c  lppF    *424G     3        lipoprotein

Table 3.7. Nonsense SNPs by lineage. Thirty-six lineage-specific nonsynonymous

SNPs result in the introduction of a stop codon within the coding sequence (nonsense

SNP). The number of nonsense SNPs is correlated to branch length.

Lineage   Nonsense   Branch length (SNPs)
1         8          463
5         10         698
6         6          674
2         1          124
3         3          353
4         3          163
Modern    5          319
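The correlation stated in the caption of Table 3.7 can be checked directly from the tabulated values. The sketch below computes a Pearson correlation coefficient in plain Python. The choice of Pearson's r and the helper name pearson_r are assumptions made for illustration, since the caption does not state which correlation measure was used.

```python
from math import sqrt

# Values copied from Table 3.7 (row order: Lineages 1, 5, 6, 2, 3, 4, Modern).
nonsense_snps = [8, 10, 6, 1, 3, 3, 5]
branch_length = [463, 698, 674, 124, 353, 163, 319]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


r = pearson_r(nonsense_snps, branch_length)
print(f"Pearson r = {r:.2f}")
```

With only seven branches the estimate is necessarily coarse, so it is best read as a consistency check on the table rather than as a formal test.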

Table 3.8 Nonsense SNPs grouped by functional category. Nonsense SNPs separated

by functional category of the affected gene, and into modern (Lineages 2, 3 and 4) and

ancient groups (Lineages 1, 5 and 6). Rows are ordered by descending total number of

SNPs per functional category.

                                                   Lineage
Functional category                        Total   Modern   Ancient
conserved hypotheticals                    15      7        8
cell wall and cell processes               7       3        4
lipid metabolism                           5       3        2
intermediary metabolism and respiration    3       0        3
regulatory proteins                        3       0        3
virulence, detoxification, adaptation      3       0        3
unknown                                    2       1        1
information pathways                       1       1        0

3.3.5.1 Nonsense and stop codon removal SNPs in essential genes

The thirty-eight genes harbouring the thirty-nine nonsense and stop codon removal

SNPs were next grouped by gene essentiality. These groups are based on the genome-

wide analyses of mutants that were unable to grow in vitro on Middlebrook 7H11 agar

or in the spleens of intravenously infected mice (Sassetti et al., 2003; Sassetti & Rubin,

2003).

Strikingly, all but two of the genes harbouring a SNP involved in creation or removal of

a stop codon were nonessential. Of the affected genes, 36 were nonessential and 2 were

essential, against a genome-wide background of 2,986 nonessential and 760 essential

genes (χ² test; p = 0.0362). Given that nonsense SNPs within essential genes would

very likely cause a loss of function for the encoded protein that leads to cell death, this

result is perhaps unsurprising. One of the two exceptions is in Lineage 6, an M.

africanum lineage. Here a SNP at codon 410 of senX3 (Rv0490) replaces an arginine

residue with a stop codon. senX3 encodes a predicted

secreted two-component sensor histidine kinase (Malen et al., 2007). Whilst this has the

potential to severely affect the function of the encoded protein, the precise position of

the SNP within the gene will determine the length of protein truncation, and so the likely

severity. SenX3 is 410 amino acids long, which places the new stop codon

directly adjacent to the existing ancestral stop codon, with an ensuing loss of only one

amino acid residue from the protein C-terminus; such a short truncation is likely to have

little or no effect on gene function, which may explain why the SNP is allowed to persist

in the lineage.

A similar scenario exists in the second essential gene harbouring a stop codon-affecting

SNP. infC (Rv1641) encodes the translation initiation factor IF-3, one of the three

initiation factors in bacteria (Malys & McCarthy, 2011). IF3 binds to the 30S ribosomal

subunit, and shifts the equilibrium between 70S ribosomes and their 50S and 30S

subunits by promoting dissociation of 30S from 50S, and thereby subsequent binding of

mRNA (Liveris et al., 1993); it is therefore required for the initiation of protein

biosynthesis in bacteria. Lineage 3 strains carry a nonsynonymous SNP that removes the

existing stop codon at codon position 202, and introduces a Serine residue (Table 3.6B).

This could lead to translation of infC continuing into the following intergenic region and potential

fusion to the next encoded gene, rpmI. However, 27 nucleotides downstream from the

removed stop codon is another in frame stop codon at position 1852903. Therefore infC

in Lineage 3 is 27 nucleotides longer, and the protein 9 amino acids longer, than in the

rest of the MTBC. Again, this is unlikely to be harmful to the cell.
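The essentiality comparison in this section can be sketched as a χ² goodness-of-fit test of the 36 nonessential and 2 essential affected genes against the genome-wide proportions (2,986 and 760). This is a minimal sketch under assumptions: the exact construction of the test used here (for example a contingency table or a continuity correction) is not stated, so the p-value below is not expected to reproduce the reported 0.0362 exactly. The function name chi2_goodness_of_fit is hypothetical.

```python
from math import erfc, sqrt

# Counts taken from the text: affected genes and genome-wide background.
observed = {"nonessential": 36, "essential": 2}
background = {"nonessential": 2986, "essential": 760}


def chi2_goodness_of_fit(observed, background):
    """Chi-squared goodness-of-fit for exactly two categories (1 degree of freedom)."""
    if len(observed) != 2:
        raise ValueError("the closed-form p-value below only holds for two categories")
    total_obs = sum(observed.values())
    total_bg = sum(background.values())
    stat = 0.0
    for category, obs in observed.items():
        expected = total_obs * background[category] / total_bg
        stat += (obs - expected) ** 2 / expected
    # For 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value


stat, p_value = chi2_goodness_of_fit(observed, background)
print(f"chi2 = {stat:.2f}, p = {p_value:.4f}")
```

Under this construction the depletion of stop codon changes in essential genes remains significant at the 5% level, in line with the conclusion drawn above.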

3.3.5.2 Length of protein truncation

The majority of SNPs that affect a stop codon introduce a termination

codon within the coding sequence (36 nonsense SNPs). Whilst this has the potential to

severely affect the function of the encoded protein, it was shown above

(section 3.3.5.1) that the position of the SNP within the gene should also be taken into

account. Comparing the full-length ancestral protein sequence to the truncated protein

revealed that truncations were distributed throughout the protein length (Figure 3.8A).

There was only one example of more than one nonsense SNP within a gene: fadD15

carries two (Y81* in Lineage 6 and W43* in Lineage 1; Table 3.6A), both of which would cause >85% loss of the

protein length. The most extreme truncation, in yrbE3B (Rv1965), will lead to a protein

96.3% shorter in length than the ancestral protein. Although yrbE3B encodes a protein

of unknown function, it is highly similar to other membrane proteins, and forms one of

the mammalian cell entry operons in M. tuberculosis (Mce3) (Cole et al., 1998). Overall,

14 SNPs (38.9% of all nonsense SNPs) cause the deletion of >50% of the ancestral

amino acid sequence; such a deletion might be expected to have severe effects on the

function of the gene product.
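The truncation percentages discussed above follow from the position of the new stop codon and the length of the ancestral protein. The sketch below shows that arithmetic for senX3, whose length (410 amino acids) is given in section 3.3.5.1. The function name truncation_percent is hypothetical, and protein lengths for other genes would need to be taken from the H37Rv annotation.

```python
import re


def truncation_percent(mutation, protein_length):
    """Percentage of the ancestral protein lost to a nonsense mutation.

    mutation: notation as in Table 3.6A, e.g. "R410*" (residue, codon, stop).
    protein_length: ancestral protein length in amino acids.
    """
    match = re.fullmatch(r"[A-Z](\d+)\*", mutation)
    if match is None:
        raise ValueError(f"not a nonsense mutation: {mutation!r}")
    stop_codon = int(match.group(1))
    if not 1 <= stop_codon <= protein_length:
        raise ValueError("stop codon lies outside the protein")
    # Every residue from the new stop codon onwards is lost.
    lost = protein_length - stop_codon + 1
    return 100.0 * lost / protein_length


# senX3: stop at codon 410 of a 410-residue protein, so one residue is lost.
print(round(truncation_percent("R410*", 410), 2))
```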

It can be seen in Figure 3.8A that nine of the nonsense SNPs truncate <1% of the

protein. Apart from senX3, which loses one amino acid and is

described in the above section (3.3.5.1), the remaining eight genes affected by nonsense

SNPs have 0% deletions. This is an artefact of basing the length of truncations on

the annotation of H37Rv, a Lineage 4 strain. The analysis is therefore identifying

premature stop codons introduced either in Lineage 4 or on the more basal

modern lineage branch, which have already been incorporated into the H37Rv annotation.

Interestingly, in four cases the nonsense SNPs have created two open reading frames

that have been annotated as separate genes in H37Rv: these genes are Rv0325-Rv0326,

Rv1504c-Rv1503c, Rv3373-Rv3374 and Rv3898c-Rv3897c. Whilst these are

effectively new genes in the respective lineages, all but one pair are annotated as encoding

hypothetical proteins. The single exception is echA18 (Rv3373) and echA18.1 (Rv3374),

which encode probable enoyl-CoA hydratases but previously formed a single open reading

frame (Figure 3.9).

The three SNPs that remove existing stop codons lead to proteins that are 104.3-563.6%

greater in amino acid length compared to existing annotations (Figure 3.8B). This is

based on the next in frame stop codon from the 3’ end of the annotated gene. infC has

the smallest increase in length, and was described previously (section 3.3.5.1). The

remaining two genes, lppF and Rv0257, increase by 110 (110.2% increase) and 104

(563.6%) amino acids.
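The extension caused by a stop codon removal is found by reading codons from the mutated stop until the next in-frame stop codon. The sketch below shows this scan on a toy sequence. The sequence and the function name extension_length are hypothetical and are not taken from the H37Rv genome; the toy is built so that the next stop lies 27 nucleotides downstream, mirroring the infC case described in section 3.3.5.1.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def extension_length(from_mutated_stop):
    """Number of codons translated before the next in-frame stop codon.

    from_mutated_stop: DNA starting at the first base of the mutated (former)
    stop codon, read 5' to 3' in the gene's reading frame.
    Returns None if no in-frame stop codon is found in the sequence.
    """
    seq = from_mutated_stop.upper()
    for i in range(0, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i // 3
    return None


# Toy example: former stop now encodes serine (TCA), followed by eight sense
# codons and then a new in-frame stop codon (TGA).
toy = "TCA" + "GCT" * 8 + "TGA" + "GCT"
print(extension_length(toy))  # 9 extra codons, i.e. 27 nucleotides
```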

Figure 3.8. Change in protein length due to nonsense SNPs. A. Distribution of

protein truncations due to thirty-six nonsense SNPs causing premature stop codon

introductions. Truncations expressed as percentage change based on H37Rv annotations.

Note fadD15 shown twice due to two SNPs that introduce stop codons. Black bars

indicate the deletion; grey bars are remaining protein. B. Percentage increase in protein

length from three SNPs that remove existing stop codons. Striped bars indicate new

protein sequence.

[Figure 3.8 panels. A: one bar per gene, x-axis "Percentage of protein truncated" (0 to 100); genes in plotted order: yrbE3B, fadD15, fadD15, Rv2994, htpG, Rv0457c, fadB5, Rv0146, Rv0574c, mmpL1, Rv3079c, Rv3729, Rv2797c, Rv0906, ephF, sirR, Rv0329c, Rv0368c, whiB3, Rv1251c, Rv2132, Rv0610c, Rv2818c, Rv2850c, Rv0064, Rv2079, mmpL9, senX3, Rv0325, Rv0621, Rv0836c, Rv1504c, Rv1870c, Rv2690c, echA18, Rv3898c. B: bars for Rv0257, lppF and infC, x-axis "Percentage increase in protein length" (0 to 120, with a broken scale at 300 and 600).]

Figure 3.9. Gene creation by nonsense SNPs. echA18 (Rv3373) and echA18.1

(Rv3374) form a contiguous open reading frame in the ancient sequence, but introduction

of a nonsense SNP in the modern branch led to the annotation of two genes in the

reference H37Rv, and all other modern lineage strains.

3.3.6 SNPs within genes associated with antibiotic resistance

Many drug resistance-conferring mutations have been identified in the MTBC and are

held in the publicly available TBDReaMDB database (Sandgren et al., 2009).

Identification of such mutations has been important in the development of molecular

genotypic assays for drug resistance (Boehme et al., 2011; Hillemann et al.,

2007). However, as shown in this study, many SNPs in the MTBC are phylogenetic

markers for a lineage, and so it is important to understand the underlying phylogeny in order to

distinguish SNPs within drug resistance-associated genes that are phylogenetic markers

rather than causes of drug resistance.

Using the above database, the lineage-specific SNPs were screened to identify SNPs

within genes associated with drug resistance. In total, forty-six coding SNPs were

identified, of which thirty-two were nonsynonymous and fourteen synonymous. Lineage-specific

SNPs were found in genes associated with resistance to six of the nine antibiotics used

in the treatment of tuberculosis: ethambutol (SNPs in 9 out of 13 associated

genes), ethionamide (2 of 3), fluoroquinolones (2 of 2), isoniazid (11 of 23), rifampicin

(1 of 2) and streptomycin (1 of 3) (Figure 3.10). A further two intergenic SNPs were in

potential regulatory regions (<100bp from the translational start site) of the genes ahpC

and rpoB, which are associated with isoniazid and rifampicin resistance respectively

(Ramaswamy & Musser, 1998; Sherman et al., 1996). Whilst more SNPs in drug

resistance associated genes were found in the two M. africanum lineages (11 SNPs

each), all lineages harboured at least one example (see Appendix C for details).

Figure 3.10 Lineage-specific SNPs within genes associated with drug resistance. In

total 46 SNPs were identified.

[Figure 3.10: bar chart. x-axis categories: Ethambutol, Ethionamide, Fluoroquinolones, Isoniazid, Rifampicin, Streptomycin. y-axis: "Number of SNPs" (0 to 20).]

Whilst one of the genome sequences used to construct the whole genome phylogeny is

extensively drug resistant (XDR) (Lineage 4 strain KZN 605), the inherent nature of this

study excludes SNPs only present in one strain (singleton SNPs), and so none of the

lineage-specific SNPs is likely to be directly involved in causing drug resistance. Interestingly,

nine of the forty-six lineage-specific SNPs identified above were found within the

TBDReaMDB database (19.6%), shown in Table 3.9. It is therefore likely that these

lineage-specific SNPs have been incorrectly associated with drug resistance.
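The screening step described above amounts to intersecting a list of reported resistance-associated mutations with the set of lineage-specific SNPs. The sketch below illustrates the idea. The three lineage-specific entries are copied from Table 3.9, whereas the `reported` list is a hypothetical input assembled for illustration (katG S315T is included only as an example of a mutation absent from the lineage-specific set); it is not an extract of the TBDReaMDB database. The function name is likewise hypothetical.

```python
# Lineage-specific SNPs as (gene, mutation) -> lineage; subset of Table 3.9.
lineage_snps = {
    ("embR", "C110Y"): "1",
    ("embC", "I270T"): "Modern",
    ("katG", "L463R"): "4",
}

# Hypothetical list of mutations reported in drug resistant isolates.
reported = [("embR", "C110Y"), ("katG", "S315T"), ("katG", "L463R")]


def flag_phylogenetic_markers(reported, lineage_snps):
    """Split reported mutations into lineage markers and remaining candidates."""
    markers = {m: lineage_snps[m] for m in reported if m in lineage_snps}
    candidates = [m for m in reported if m not in lineage_snps]
    return markers, candidates


markers, candidates = flag_phylogenetic_markers(reported, lineage_snps)
for (gene, mutation), lineage in markers.items():
    print(f"{gene} {mutation}: marker of Lineage {lineage}")
print("not lineage-specific:", candidates)
```

A mutation flagged as a marker is present in every strain of that lineage, drug resistant or not, which is why it is a poor candidate for a causal resistance mutation.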

It can be seen at the top of Table 3.9 that a cysteine to tyrosine mutation (C110Y) within

embR (Rv1267c) was found in a study by Srivastava et al. (2009). This SNP is present

within all strains from Lineage 1. In that study, three genes implicated in

ethambutol resistance (embB, embC and embR) were sequenced in 44 ethambutol

resistant clinical strains isolated in India (Srivastava et al., 2009). The C110Y mutation

was found within one of the study strains, which also had two mutations in embC

(G288W and V303G). The C110Y mutation therefore identifies this strain as likely

belonging to Lineage 1. Lineage 1 is not prevalent in the country from which the strains

were isolated (Gagneux et al., 2006a; Gagneux & Small, 2007), which might account for

there only being one instance of the SNP out of the 44 strains in the study. Interestingly,

Lineage 1 strains also harbour two more lineage-specific SNPs within genes involved in

ethambutol resistance, one in embA (Rv3794), a P913S mutation, and another within

embC (Rv3793), an N394D mutation, but these were not found in the study. However,

embA was not sequenced and the primers used to sequence embC did not extend beyond

the 5’ 308bp region of embC that has sequence homology to the resistance-determining

region (ERDR), and so missed the Lineage 1 SNP that is in the middle of the gene

(Sreevatsan et al., 1997b; Srivastava et al., 2009). The C110Y SNP therefore

cannot be directly involved in resistance to ethambutol.

The above study and others have reported the most common mutation in

embC to be at codon 270 (I270T) (Srivastava et al., 2006; Srivastava et al., 2009). However,

in this study the mutation was found to be a modern lineage SNP, and so is present

within Lineages 2, 3 and 4. This would make the mutation highly prevalent in the study

areas where the strains were isolated (Srivastava et al., 2006; Srivastava et al., 2009).

The mutation is typically reported as the conversion of an existing threonine residue, but

most studies treat the reference strain H37Rv as carrying the ancestral allele, and therefore the

direction of change is reported incorrectly; this agrees with the findings of Koser et al.

(2011).

Table 3.9. Putative mutations found in drug resistance studies incorrectly

associated with drug resistance. All SNPs are lineage-specific, and therefore

phylogenetic markers of the respective lineages.

Lineage  Gene            Mutation  Drug resistance  Primary reference
1        Rv1267c embR    C110Y     ethambutol       Srivastava et al., 2009
3        Rv3264c manB    D152N     ethambutol       Ramaswamy et al., 2000
Modern   Rv3793 embC     I270T     ethambutol       Sreevatsan et al., 1997b; Srivastava et al., 2009
1        Rv3793 embC     N394D     ethambutol       Ramaswamy et al., 2000
3        Rv3793 embC     R738Q     ethambutol       Ramaswamy et al., 2000
1        Rv3794 embA     P913S     ethambutol       Ramaswamy et al., 2000
Modern   Rv3795 embB     A378E     ethambutol       Srivastava et al., 2006
4        Rv1908c katG    L463R     isoniazid        Heym et al., 1995
3        Rv2242          M323T     isoniazid        Ramaswamy et al., 2003

3.3.7 Conservation and removal of lineage-specific nonsynonymous SNPs

In the following section the extent to which nonsynonymous SNPs are removed from the

lineages was analysed. The commonly used method to detect selection, which compares the

rate of nonsynonymous nucleotide changes (dN) with the rate of synonymous nucleotide

changes (dS), was applied to the lineage-specific SNPs (see 3.2.6). A dN/dS >1 indicates

positive selection, <1 indicates purifying selection and a ratio at or close to 1 is regarded

as neutral, or a balance of the two former selective forces. The rate of nonsynonymous

SNP accumulation was first compared across the six lineages. The relatively low

number of SNPs within the MTBC made calculation of dN/dS for individual genes of

questionable value and impossible for the 2,459 (61.2%) genes with no lineage-specific

SNPs. As an alternative approach the dN/dS ratio was calculated using gene

concatenates based firstly on all genes, then gene essentiality and functional categories.

The mean dN/dS for the lineages was 0.67 (ranging from 0.54-0.79), corresponding to

nearly two thirds (64.8%) of SNPs causing a change to the encoded amino acid (Table

3.10). This finding is consistent with the average dN/dS based on all SNPs identified in

21 MTBC genome sequences (dN/dS=0.59), and the sequencing of 89 genes from 108

MTBC strains (dN/dS=0.57) (Comas et al., 2010; Hershberg et al., 2008). If the lineages

are grouped into the ancient and modern categories, the mean dN/dS was 0.61 and 0.72

respectively; whilst a higher rate of nonsynonymous SNP accumulation was found in the

modern lineages, the difference between the two is not significant (Mann-Whitney U test,

p=0.2118). High dN/dS ratios are often considered to indicate a reduction in purifying

selection (He et al., 2010; Hershberg et al., 2008; Holt et al., 2008), which would

suggest here that all lineages are experiencing the same weak purifying selection.

Alternatively, signals of weak purifying selection may be due to the close relatedness of

the MTBC strains. Rocha et al. (2006) have shown that dN/dS is often higher when the

organisms compared are closely related. dN/dS therefore becomes dependent on time,

because purifying selection takes time to remove deleterious nonsynonymous mutations,

and so dN/dS is elevated over short timescales.

To test how the frequencies of nonsynonymous SNPs vary over different timescales, the

ratio of nonsynonymous to synonymous SNPs was compared in different branches of the

phylogenetic tree. No significant difference was found in the SNP ratio from the lineage

branches compared to the external branches, which includes SNPs from the twenty-eight

extant strains used in the phylogeny (Mann Whitney U test, p = 0.1033). The mean

lineage branch ratio was 1.9, whilst that of the external branches was 1.7 (Appendix D), suggesting

that nonsynonymous SNP accumulation in the MTBC is a consistent feature irrespective

of time.

Table 3.10. The rate of nonsynonymous SNP accumulation across the lineages. The

dN/dS ratio was used, which measures the accumulation of nonsynonymous SNPs

against the background rate of synonymous SNPs.

                             Lineage 5  Lineage 6  Lineage 1  Lineage 4  Lineage 2  Lineage 3  Modern
Nonsynonymous SNP            385        374        238        96         74         182        172
Synonymous SNP               213        206        156        46         33         117        78
Nonsynonymous positions (N)  2968425    2968425    2968425    2968425    2968425    2968425    2968425
Synonymous positions (S)     1052024    1052024    1052024    1052024    1052024    1052024    1052024
dN rate                      0.000130   0.000126   0.000081   0.000032   0.000025   0.000061   0.000058
dS rate                      0.000202   0.000196   0.000148   0.000044   0.000031   0.000111   0.000074
dN/dS                        0.64       0.64       0.54       0.74       0.79       0.55       0.78
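The dN/dS values in Table 3.10 follow from dividing each SNP count by the corresponding number of potential positions and taking the ratio of the two rates. The sketch below reproduces that arithmetic from the tabulated counts. It is a simplified illustration (a plain ratio of proportions, with the hypothetical function name dn_ds) and does not attempt to reproduce any additional corrections that may be described in section 3.2.6.

```python
N_SITES = 2968425  # potential nonsynonymous positions (N), Table 3.10
S_SITES = 1052024  # potential synonymous positions (S), Table 3.10

# (nonsynonymous SNPs, synonymous SNPs) per branch, copied from Table 3.10.
counts = {
    "Lineage 5": (385, 213),
    "Lineage 6": (374, 206),
    "Lineage 1": (238, 156),
    "Lineage 4": (96, 46),
    "Lineage 2": (74, 33),
    "Lineage 3": (182, 117),
    "Modern": (172, 78),
}


def dn_ds(nonsyn, syn, n_sites=N_SITES, s_sites=S_SITES):
    """Ratio of the nonsynonymous rate (dN) to the synonymous rate (dS)."""
    if syn == 0:
        raise ValueError("dN/dS is undefined when there are no synonymous SNPs")
    return (nonsyn / n_sites) / (syn / s_sites)


ratios = {branch: dn_ds(*c) for branch, c in counts.items()}
for branch, ratio in ratios.items():
    print(f"{branch}: dN/dS = {ratio:.2f}")
print(f"mean dN/dS = {sum(ratios.values()) / len(ratios):.2f}")
```

The guard against zero synonymous SNPs reflects the point made above: for a single gene with few or no lineage-specific SNPs the ratio is unstable or undefined, which is why concatenates were used.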

3.3.7.1 Nonsynonymous SNPs within essential genes

The previous method was based on total sequence concatenates which is quite a blunt

method for detecting selection, likely averaging both purifying and potential positive

selection in the sequences. Further concatenates were generated based on biologically

relevant categories. Firstly, genes were grouped by those shown to be essential for

growth by transposon mutagenesis (Sassetti et al., 2003; Sassetti & Rubin, 2003). Based

on the findings in other bacteria and evolutionary theory, fewer

nonsynonymous SNPs would be expected to accumulate within genes that are essential to the cell (Jordan et

al., 2002). There were 335 (14.0%) nonsynonymous SNPs and 212 (8.9%) synonymous

SNPs within essential genes, leaving the remaining 1215 (50.8%) nonsynonymous and

630 (26.3%) synonymous SNPs within nonessential genes. Adjusting for differences in

the nucleotide length of the two categories using the number of potential

nonsynonymous SNP positions, it was found that significantly fewer nonsynonymous

SNPs were within essential genes (χ² test, p=0.0011). The average dN/dS for essential

genes was also lower than for nonessential genes (0.56 and 0.68 respectively), indicating that essential

genes are more conserved than nonessential genes.

3.3.7.2 Nonsynonymous SNPs within functional gene categories

Gene concatenates were next generated for all gene functional categories based on the

Tuberculist database. Seven categories were tested: 1. information pathways, 2.

intermediary metabolism and respiration, 3. lipid metabolism, 4. cell wall and cell

processes, 5. conserved hypotheticals, 6. virulence, detoxification and adaptation and 7.

regulatory proteins (Lew et al., 2011). In Figure 3.11A, the dN/dS ratios across these

categories are shown. A nonparametric one-way ANOVA of the dN/dS for each lineage and functional

category found an uneven distribution (Kruskal-Wallis test, p=0.0084). Following

multiple testing correction it was seen that the dN/dS between the information pathways

and regulatory protein categories was significantly different (Dunn's Multiple

Comparison Test, p<0.05). It might be expected for the information pathways class to

have the lowest number of nonsynonymous SNPs due to the critical function of these

genes within the cell, such as in DNA replication and repair. This was confirmed by

comparison of the percentage of essential genes per functional category to the dN/dS

ratio, which found a significant correlation (Spearman r = -0.8929, p = 0.0123) (Figure

3.11B).

Whilst there was evidence that the functional categories vary in the strength of

purifying selection, only genes within the regulatory category showed strong signs of

positive selection in multiple lineages (mean dN/dS = 1.16) (Table 3.11). Stratifying the

regulatory functional category by lineage, the dN/dS was > 1 in Lineages 3, 4, 5 and 6.

Focusing on this category, 84 regulatory proteins harboured 132 lineage-specific SNPs:

101 nonsynonymous and 31 synonymous. This corresponds to a nonsynonymous to

synonymous ratio of 3.3, compared to the mean ratio of 1.9 found across all

functional categories. Potential positive selection (dN/dS >1) was also seen in the

intermediary metabolism and respiration category for just Lineage 2 (dN/dS=1.49), and

in lipid metabolism also for Lineage 2 (dN/dS=1.13) and the modern lineage branch

(dN/dS =1.39) (Figure 3.11A).

Figure 3.11. The rate of nonsynonymous SNP accumulation by functional category.

A. Lineage dN/dS by functional category. Lineages coloured as previously and bars

represent mean dN/dS. Information pathways dN/dS significantly lower than regulatory

proteins (one-way ANOVA with Dunn’s post-hoc test, p <0.05). B. Correlation between

essential genes per functional category (as percentages) and dN/dS. Spearman r =

-0.8929, p = 0.0123.

[Figure 3.11 panels. A: dN/dS (axis 0.0 to 2.0) for each functional category: information pathways; cell wall and cell processes; lipid metabolism; intermediary metabolism and respiration; conserved hypotheticals; virulence, detoxification and adaptation; regulatory proteins. B: x-axis dN/dS (0.0 to 1.5), y-axis "Percentage of essential genes in functional category" (0 to 100), with the information pathways and regulatory proteins points labelled.]

Table 3.11. The rate of nonsynonymous SNP accumulation in each functional

category. The nonsynonymous/synonymous ratio and dN/dS ratio are shown. N = all

possible nonsynonymous positions, S = all possible synonymous positions.

Functional category                      Nonsynonymous  Synonymous  Nonsynonymous/  N       S       dN/dS
                                         SNPs           SNPs        synonymous
information pathways                     96             73          1.3             202427  70831   0.46
lipid metabolism                         168            99          1.7             294422  102505  0.59
intermediary metabolism and respiration  394            237         1.7             765073  268496  0.58
cell wall and cell processes             377            197         1.9             595063  214706  0.69
conserved hypotheticals                  344            175         2.0             594619  209012  0.69
virulence, detoxification, adaptation    64             29          2.2             106294  38498   0.80
regulatory proteins                      101            31          3.3             123975  44208   1.16

3.4 Discussion

3.4.1 Strengths and limitations of this study

This study used recently published MTBC genomes sequenced by high-throughput

sequencing technology to identify for the first time all SNPs that contribute to the

background genetic variation within the six lineages of the MTBC. At the time of this

study about thirty globally representative strains from all of the lineages had been

sequenced and the genomes made publicly available. It is likely that a discovery bias

exists within this small genome set, as illustrated in Figure 3.2 where it was seen that the

lineages with the most genome sequences (Lineages 1, 2 and 4) had the greatest within-

lineage diversity. Lineages 5 and 6 only had two genome sequences available to use in

this study. However, this study was designed to capture variation within the internal

basal branches of each lineage through exploitation of the clonal population structure of

the MTBC, and this should circumvent any discovery bias. Theoretically, as backward

mutations are rare in the MTBC, genome sequences from two strains belonging to the

same lineage would capture all lineage-specific SNPs for the respective lineage, and

additional genomes will only serve to reduce the branch length and so the number of

lineage-specific SNPs. Finally, twenty-one of the genomes used to construct the genome

phylogeny were selected from a wider collection of 875 strains characterised previously

by the analysis of deletions across the genome (Comas et al., 2010; Gagneux et al.,

2006a; Hershberg et al., 2008). Therefore, whilst it is expected that future studies will

sequence ever-greater numbers of MTBC strains, the lineage-specific SNPs identified in

this study are expected to be robust.
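The argument that two genomes per lineage are sufficient, and that further genomes can only shorten the lineage branch, can be illustrated with set operations. This is a simplified sketch, assuming that each strain is represented by the set of genome positions at which it carries a derived allele and that there are no back mutations. The strain names, positions and function name are hypothetical, and the approach stands in for, rather than reproduces, the mapping of SNPs onto the branches of the phylogeny.

```python
def lineage_specific(snps_by_strain, lineage_of):
    """SNPs shared by every strain of a lineage and absent from all other strains."""
    result = {}
    for lineage in {lineage_of[s] for s in snps_by_strain}:
        members = [s for s in snps_by_strain if lineage_of[s] == lineage]
        others = [s for s in snps_by_strain if lineage_of[s] != lineage]
        shared = set.intersection(*(snps_by_strain[s] for s in members))
        elsewhere = set().union(*(snps_by_strain[s] for s in others))
        result[lineage] = shared - elsewhere
    return result


# Hypothetical toy data: derived-allele positions per strain.
lineage_of = {"A1": "A", "A2": "A", "A3": "A", "B1": "B", "B2": "B"}
two_per_lineage = {
    "A1": {101, 202, 303},
    "A2": {101, 202, 404},
    "B1": {505, 606},
    "B2": {505, 707},
}
print(lineage_specific(two_per_lineage, lineage_of))

# Adding a third, more divergent genome can only remove SNPs from the set.
with_third = dict(two_per_lineage, A3={101, 808})
print(lineage_specific(with_third, lineage_of))
```

In the toy data the lineage A set shrinks from two positions to one when the third genome is added, and no new lineage-specific SNP can appear, which is the behaviour the paragraph above relies on.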

Removal of SNPs found within repetitive regions, such as in phages, and the PE and

PPE gene families, will likely have resulted in the loss of potentially important variation

within the MTBC lineages. pe genes are characterised by the presence of a

proline-glutamic acid (PE) motif, whilst ppe genes contain a proline-proline-glutamic acid (PPE) motif; both

families are highly variable in size and contain extensive repetitiveness of their C-

terminal regions (Cole et al., 1998). Excluded regions total ~10% of the coding genome,

and recently it has been shown that the large pe and ppe families harbour about 3-fold

higher frequency of nonsynonymous SNPs compared to non-pe/ppe genes (McEvoy et

al., 2012), which would suggest that a pool of lineage-specific variation might have been

missed in this study. It was necessary to remove SNPs identified in these regions due to

inherent difficulties encountered in sequencing through repetitive regions using

second generation short read technologies, such as the Illumina technology used to sequence the strains in

this study. SNPs were detected in these regions in the lineage branches, but they would

need to be confirmed by methods beyond the scope of this study. This is a common

disadvantage of current short read sequencing technology (Loman et al., 2012), and

developments in sequencing technology with longer read lengths will likely remove this

current limitation (Branton et al., 2008).

3.4.2 General characteristics of lineage-specific diversity

Prior to identification of the lineage-specific SNPs, a 28-genome phylogeny was built

using a non-redundant set of variable nucleotide positions derived from the genome

sequences. The phylogeny was largely derived from genome sequences published

previously (Comas et al., 2010), supplemented by other recently published genome

sequences available in the EBI SRA. An additional strain (N0031), known to be a rare

Lineage 2 strain based on a previous MLSA study, was sequenced for this project to

widen diversity in this lineage (Hershberg et al., 2008). The topology of the resulting

phylogeny was highly congruent with other MTBC phylogenies based on SNPs and

other markers, such as deletions, further highlighting the clonal population structure of

the MTBC (Comas et al., 2010; Gagneux et al., 2006a).

In total 2,794 lineage-specific SNPs were identified, with each lineage differing

by an average of 400 SNPs. The ancient lineages (Lineages 1, 5 and 6) harboured the

most lineage-specific SNPs, which is likely a reflection of the greater time that these

lineages have had to accumulate mutations. On average, two-thirds of all coding SNPs

were nonsynonymous and therefore cause a change in the encoded amino acid. This is a

feature of the MTBC, and has been previously identified at the genome level

(Fleischmann et al., 2002; Hershberg et al., 2008). Nonsynonymous SNPs are more

likely than synonymous SNPs to have a functional effect, which raises the possibility

that this variation will have functional consequences in the respective MTBC lineages.

The ability to isolate the total background SNP variation that contributes to the diversity

of all strains from a particular lineage (lineage-specific SNPs) was fundamental to this

study. This was only possible due to the negligible level of recombination seen in the

MTBC (Liu et al., 2006), and because back mutations are rarely observed (Casali et al.,

2012). Therefore a SNP in the parental strain becomes a defining marker for the rest of

the progeny. It has previously been reported that homoplasic nucleotide positions, at which

a SNP cannot be explained without convergence when mapped onto the tree,

are rare in the MTBC and are typically found only in cases of drug resistance or

compensatory mutations (Casali et al., 2012; Comas et al., 2011). Similar examples

have been found in other bacterial studies, such as the sequencing of MRSA strains,

where the authors found few homoplasic SNPs, but those identified corresponded to

mutations conferring antibiotic resistance (Harris et al., 2010). In this study it was

found that there were only four cases of homoplasic SNPs (0.14% of all lineage-specific

SNPs), in which lineage-specific SNPs with the same nucleotide change were present in

more than one lineage (Table 3.4). Independent fixation of SNPs across multiple

lineages could represent signals of selective pressure acting on these positions, and this

interpretation is strengthened by the distribution of the four SNPs, whereby they cluster within two

genes and are within a few nucleotides of each other. Whilst these may have biological

significance, the respective genes are not associated with drug resistance. Further work

would be needed to confirm these SNPs and to understand if these SNPs have biological

function.
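The homoplasy screen described above looks for the same nucleotide change, at the same position, among the lineage-specific SNPs of more than one lineage. A minimal sketch of that check follows, assuming each lineage is represented by a set of (position, derived base) pairs. The positions and bases are hypothetical toy values, not those of Table 3.4, and the function name is an assumption.

```python
from collections import defaultdict


def find_homoplasies(lineage_snps):
    """Return {(position, derived_base): sorted lineages} for changes in >1 lineage."""
    seen = defaultdict(set)
    for lineage, snps in lineage_snps.items():
        for position, base in snps:
            seen[(position, base)].add(lineage)
    return {change: sorted(lins) for change, lins in seen.items() if len(lins) > 1}


# Hypothetical toy data: (genome position, derived base) per lineage.
toy = {
    "1": {(1000, "A"), (2000, "T")},
    "3": {(1000, "A"), (3000, "G")},
    "5": {(1000, "C")},  # same position, different base: not counted
}
print(find_homoplasies(toy))

# Proportion reported in the text: four homoplasic SNPs among 2,794.
print(round(100 * 4 / 2794, 2))
```

Requiring the same derived base, and not only the same position, matches the definition used in the text ("lineage-specific SNPs with the same nucleotide change").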

The lineage-specific SNPs can also be exploited in SNP typing assays to genotype

strains, either at the lineage or at any sub-lineage level (Bergval et al., 2012; Kahla et

al., 2011; Stucki et al., 2012). SNP typing is suggested to be the new gold standard of

phylogenetic classification of MTBC (Comas et al., 2009), and the majority of the SNPs

identified in the lineage branches in this study, excluding the above homoplasies, would

be applicable to such typing assays. At the epidemiological level, genotyping of strains

has also been driven by the need for rapid tests to identify drug resistant strains.

Resistance to the first-line TB drugs rifampicin and isoniazid (multidrug-resistant tuberculosis or

MDR-TB), and now also to some second-line drugs (extensively drug resistant

tuberculosis or XDR-TB) has led to a growth in molecular genotypic drug susceptibility

testing, such as the Genotype MTBDRplus (Hain Lifescience) and Xpert MTB/RIF

(Cepheid) (Boehme et al., 2011; Hillemann et al., 2007; McNerney et al., 2012). Several

SNPs were identified within drug resistance-associated genes that are not associated with

drug resistance, but act as evolutionary markers (Table 3.9). Previous studies have

identified highly prevalent mutations within drug resistant strains, but these have been

shown here to be lineage-specific markers. Other studies have also questioned some

associations of SNPs with drug resistance. A significant association of a SNP within

Rv2629 and rifampicin resistance was found based on a study of over 100 rifampicin resistant strains (Wang et al., 2007), but this was subsequently shown to be a

phylogenetic marker of Lineage 2, specifically of the Beijing group of strains (Homolka

et al., 2009). Similar approaches have shown that inhA SNPs linked to

isoniazid resistance, and embC SNPs linked to ethambutol resistance, are instead

phylogenetic markers and unlikely to be the cause of drug resistance (Projahn et al., 2011;

Ramaswamy et al., 2000). From the perspectives of typing strains for evolutionary

analysis, and linking genotype to phenotype to identify potential molecular causes of

drug resistance, it is clear that an understanding of the underlying phylogenetic structure

of the MTBC is critical.

Whilst several lineage-specific SNPs within genes associated with drug resistance are

unlikely to be direct causes of resistance, some could play an indirect role in modulating

the fitness cost of drug resistance mutations. It has been shown that strains from different

lineages but with identical rifampicin resistance mutations show different levels of

fitness cost (Gagneux et al., 2006b). In a wider context, the Beijing family of strains

within Lineage 2 is often associated with drug resistance (Borrell & Gagneux, 2009;

Parwati et al., 2010). It has been suggested therefore that strain genetic background

plays a role in the spread of drug-resistant strains (Muller et al., 2013), although the

actual molecular mechanisms of this are currently unknown. Pre-existing mutations in

genes associated with drug resistance, such as the lineage-specific SNPs found in this

study, may increase the tolerance of the cell to future drug resistance mutations through

higher baseline fitness, or epistatic interactions between the genetic background of the

strain and drug resistance mutations (Muller et al., 2013).

3.4.3 Insights into the evolution of M. tuberculosis lineages

It has been hypothesised that, due to historical human migrations and serial transmission

bottlenecks arising from the low infectious dose of tuberculosis, the MTBC has a small

effective population size (Hershberg et al., 2008). This phenomenon can lead to

increased random genetic drift compared to natural selection, limiting the removal of

potential functional mutations (Smith et al., 2006a). As discussed above, about two-thirds

of all coding SNPs cause a change in the encoded amino acid; however,

nonsynonymous SNPs that introduce or alter existing stop codons

would very likely cause a loss of function. Although rare (1.3% of all lineage-specific

SNPs), thirty-five lineage-specific pseudogenes were identified due to the introduction

of stop codons in the lineage branches. These genes may have been allowed to lose their

function either due to the genome-wide loss of selective constraint in the MTBC, or

potentially selection may have been relaxed during adaptation to a new niche in the

respective lineages. The former hypothesis is more likely, however, as no difference was

found between the frequency of pseudogene creation or functional category of affected

gene and the specific lineage. Furthermore, most genes were conserved hypotheticals

and all but one nonessential to growth; the exception was senX3, but the nonsense SNP

in Lineage 6 resulted in a modest loss of one amino acid, unlikely to affect function. The

annotated H37Rv genome sequence contains thirteen pseudogenes (Lew et al., 2011),

and it is likely that all of these pseudogenes are the result of random drift and will

eventually be removed by deletions, leaving a more tightly packed and ultimately more

reduced genome.

With so little variation in the MTBC it is not currently possible to measure selection in

each gene, although future whole genome studies employing low hundreds to thousands

of MTBC genomes may enable this. One approach to analysing selection in DNA sequence

data is to use the dN/dS ratio, which provides a measure of the accumulation of

nonsynonymous SNPs against the background of assumed silent synonymous SNPs. The

dN/dS measure has been applied to many bacterial species to understand the

evolutionary histories, including Salmonella typhi (Roumagnac et al., 2006),

Clostridium difficile (He et al., 2010) and previously in the MTBC (Hershberg et al.,

2008). However, the method was originally developed for the analysis of genetic

sequences from divergent species (Kimura, 1977), and it has recently been suggested

that it is inappropriate for the analysis of variation within a population (Kryazhimskiy

& Plotkin, 2008). The problem with such comparisons is the potentially short time scales

involved, whereby slightly deleterious mutations that have yet to be removed by

selection cannot be separated from substitutions that are fixed in the population (Rocha

et al., 2006); this has been shown to lead to high dN/dS values for closely related

bacteria, often approaching 1 (Rocha et al., 2006). If this is the case in the MTBC, it

might be expected for the external branches of the phylogeny, which include SNPs

from the extant strains, to harbour more nonsynonymous SNPs than the lineage-specific

SNPs that were the focus of this study. Mutations would be expected to decrease over

time as they are purged by purifying selection. In this study, the ratio of nonsynonymous

to synonymous SNPs did not differ between the external tips of the tree and

the lineage branches (1.9 in the lineage branches and 1.7 in the

external branches). This is in agreement with other studies (Hershberg et al., 2008), and together

these results show that nonsynonymous SNPs are not more intensely purged than synonymous

SNPs, which would suggest that the high dN/dS is not due to close relatedness of the

strains.
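
As a minimal sketch of the dN/dS measure discussed above, the following Python function computes the ratio from SNP counts and site counts, applying a Jukes-Cantor correction for multiple hits. The function names and all counts are hypothetical and purely illustrative; they are not the values or the software used in this study.

```python
import math


def jukes_cantor(p: float) -> float:
    """Correct a proportion of differing sites for multiple hits (Jukes-Cantor)."""
    if not 0 <= p < 0.75:
        raise ValueError("proportion must lie in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


def dn_ds(nonsyn_snps: int, syn_snps: int,
          nonsyn_sites: float, syn_sites: float) -> float:
    """Return dN/dS: nonsynonymous changes per nonsynonymous site divided by
    synonymous changes per synonymous site."""
    d_n = jukes_cantor(nonsyn_snps / nonsyn_sites)
    d_s = jukes_cantor(syn_snps / syn_sites)
    if d_s == 0:
        raise ValueError("dS is zero, so the ratio is undefined")
    return d_n / d_s


# Hypothetical counts, for illustration only.
ratio = dn_ds(nonsyn_snps=1300, syn_snps=700,
              nonsyn_sites=3_000_000, syn_sites=1_000_000)
print(f"dN/dS = {ratio:.2f}")
```

Because a coding sequence contains more nonsynonymous than synonymous sites, a raw nonsynonymous to synonymous count ratio above 1 can still correspond to a dN/dS below 1, as in this example.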

Previous studies of MTBC variation found genome-wide dN/dS values of 0.57

(Hershberg et al., 2008) and 0.60 (Comas et al., 2010). These suggest strongly reduced

purifying selection acting within the MTBC. It has been suggested that this

reduced selection is due to the small effective population size of the MTBC, which is a

consequence of the clonality of the MTBC and serial population bottlenecks during

transmission of TB (Hershberg et al., 2008; Smith et al., 2006a). The mean dN/dS for

the lineage branches found in this study was 0.67, with no significant difference in the

overall dN/dS per lineage. The lack of significant differences between the lineages

suggests that the hypothesised reduction in purifying selection is a general feature across

the lineages. Categorising all genes by essentiality, the effects of purifying selection

could however still be detected in the MTBC, with significantly fewer nonsynonymous

SNPs in essential genes. Furthermore, splitting genes by annotated function, the

information pathways category had the lowest dN/dS. This

category consists of genes involved in critical cellular functions, including

genes involved in transcriptional and translational machinery. At the other end of the

spectrum, the regulatory gene category had the greatest accumulation of

nonsynonymous SNPs; four of the lineages (Lineages 3, 4, 5 and 6) had dN/dS ratios >1,

indicating potential positive selection within this class, with three to five times more

nonsynonymous than synonymous SNPs.

High frequencies of nonsynonymous SNPs in regulatory genes have been detected

previously. In 2011, Schürch et al. sequenced several isolates from the Beijing family of

the MTBC, a subgroup of Lineage 2, and found overrepresentation of nonsynonymous

SNPs in the regulatory and associated signal transduction pathways (Schürch et al.,

2011). As previously discussed, in this study gene concatenates were used, which has

the disadvantage of averaging the selective forces acting on the sequences and thereby

providing a summary of the pressure acting on the sequences; it is not possible to

identify individual genes potentially under positive selection. Furthermore, analysing the

frequency of SNPs clustering within genes, it was found that no genes harboured a rate

that deviated from the expected Poisson distribution. This suggests that specific genes in

the regulatory category are not highly variable, but that the whole category is

accumulating the greatest ratio of nonsynonymous SNPs, which in turn may affect the

regulatory networks of the respective lineages. Overall, this has shown that the loss of

selective constraint is a common feature of all lineages, and functional genetic diversity

is anticipated, specifically due to the high number of amino acid changing SNPs.

Chapter 4 In silico prediction of functional

Single Nucleotide Polymorphisms

4.1 Introduction

Current knowledge on the effect of genetic variation in the M. tuberculosis Complex

(MTBC) is limited, but it has been suggested that much of the genetic variation in the

MTBC will have functional consequences due to a reduction in purifying selection

(Hershberg et al., 2008). This concept was further investigated by Hershberg et al.

through comparison of the rates of nonsynonymous SNPs, and therefore amino acid

changes, within conserved amino acid positions between the MTBC and M. canetti

(Hershberg et al., 2008). Positions were classified as conserved based on the gene

sequences of all other mycobacterial species. Reduced selection would be detected by a

difference in the number of amino acid changes falling in conserved and variable sites

between M. canetti and the MTBC. This was found to be the case, with nonsynonymous

SNPs falling in conserved amino acid positions 27% of the time in M. canetti, compared with just

over double that frequency (58%) in the MTBC.

While underscoring the reduced selective constraint in MTBC, this also raises the

possibility that much of the genetic variation could have a functional impact.

Nonsynonymous SNPs have the potential to affect gene expression or the function of the

encoded protein, which can have a range of phenotypic consequences to the cell. Most

nonsynonymous SNPs are deleterious and eventually removed through the process of

purifying selection (Balbi & Feil, 2007), but as demonstrated in this and other studies,

the capacity to remove such SNPs is diminished in the MTBC due to low levels of

purifying selection. This raises the question of how many and which nonsynonymous

SNPs actually have a functional consequence. Based on an extrapolation of the

aforementioned MLSA dataset, the actual number of functional SNPs was estimated.

Specifically, the decreased number of nonsynonymous SNPs falling in conserved

positions in M. canetti was used to estimate the number of nonsynonymous SNPs that

would have been removed in the MTBC if purifying selection were similar to that of M.

canetti, or any other Actinobacteria. It was suggested that about 40% of the amino acid

changes in the MTBC would result in functional consequences, and if the small gene set

was unbiased, genome-wide this translates to about 300 functional SNPs per average

pairwise comparison of MTBC strains; strains that diverged at a closer time point would

have few functional SNPs, whilst the most divergent strain comparisons

would have up to 500 functional SNPs (Hershberg et al., 2008).

Whilst the study represented the most complete analysis of genetic diversity at the time,

the MLSA approach assays variation within a small sample of the genome. Whole

genome sequencing datasets enable this hypothesis to be tested without risk of potential

gene selection bias, and critically all of the predicted functional SNPs can be identified

for the first time. The focus here is on the nonsynonymous SNPs identified in Chapter 3.

This is the dominant SNP type identified in the MTBC, and is more amenable to in

silico prediction methods due to the inherent property of causing amino acid change,

which can be measured by the methods described below.

The main body of research into predicting the effects of nonsynonymous SNPs has been

undertaken in eukaryotic systems, specifically in human based genetics studies (Ng &

Henikoff, 2006). SNPs constitute about 90% of human protein sequence variability

(Collins et al., 1998), and the importance of nonsynonymous SNPs in humans is

illustrated by the database containing disease-causing variants, the Human Gene

Mutation Database (HGMD) (Stenson et al., 2012). In this database, nonsynonymous

SNPs make up about half of the genetic variants that are known to cause disease

(Stenson et al., 2012). In silico methods fall into two main groups, either based on

sequence or structural information, and some hybrid methods now exist using a mix of

the two approaches (Thusberg & Vihinen, 2009). The overarching basis of all amino

acid substitution based predictions is the evidence that mutations which affect protein

function tend to occur at evolutionarily conserved positions, suggesting that predictions

could be based on sequence homology (Miller & Kumar, 2001). It was also found that

mutations had common structural features that distinguish them from neutral SNPs,

suggesting that structural features could also be used in predictions (Sunyaev et al.,

2000; Wang & Moult, 2001). In 2001, Wang & Moult used the human SNPdb database

to model disease-causing mutations onto their respective wild-type protein structures

and found that 83% of disease-causing mutations affect protein stability. These key

studies spawned the development of algorithms to differentiate between functional and

neutral SNPs. Some are based on sequence homology, such as SIFT (Ng & Henikoff,

2003) and PANTHER (Thomas et al., 2003), whilst others use structural features such as

TopoSNP (Stitziel et al., 2004). As described, some combine many predictive features,

and one example is the prediction method PolyPhen (Ramensky et al., 2002).

4.1.1 Aims

The work presented in this chapter is a comprehensive genome-wide prediction and

characterisation of MTBC lineage-specific nonsynonymous SNPs. The specific aims

were to:

• computationally predict functional nonsynonymous SNPs.

• gain insight into the impact of functional SNPs across the lineages.

• generate a focused SNP set that can be followed in experimental systems.

4.2 Materials and Methods

4.2.1 SIFT

Prediction of nonsynonymous SNPs likely to affect protein function was performed

using the Sorting Intolerant From Tolerant (SIFT) algorithm (Ng & Henikoff, 2003).

SIFT version 4.0.2 (downloaded February 2010) was installed as a stand-alone version

on a Linux server. A custom bash routine was written to analyse all SNPs in several

batches.

The SIFT prediction is based on sequence conservation and the type of amino acid

change. Briefly, SIFT looks for homologs in other bacteria of the gene of interest and 1)

scores the conservation of the positions where mutations are found, and 2) weights this

score by the nature of the amino acid change. These measures are incorporated into a

normalised probability score, with scores ≤ 0.05 indicating a functional SNP prediction.

The classification threshold was previously optimised for performance on a data set

comprising 55 LacI-related sequences, including paralogs (Ng & Henikoff, 2001).

Furthermore, if the sequence alignment over the SNP position had a depth of <3, the

prediction was excluded.

A further conservation measure was also used to prevent the prediction of mutations on

sequences that are too conserved, which would contaminate the multiple sequence alignment

and bias SIFT to predicting more functional SNPs. The recommended <3.5 conservation

score threshold was used, thereby filtering those genes and associated predictions above

this threshold. As a bacterial database to generate the protein sequence alignment, all

publicly available mycobacterial genome sequences outside of the M. tuberculosis

complex (MTBC) were used. Therefore predictions were based on mycobacterial

homologs, but not on species that are evolutionarily too close to the query sequences,

which could again contaminate the alignment with sequences likely to harbour the SNP

allele to be tested. The resulting non-MTBC database consisted of thirteen complete mycobacterial

genomes, seen in Figure 4.1 and Table 4.1.
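
The filtering and classification rules described in this section can be sketched as a single function. This is an illustrative Python sketch only: the function name, argument names and return labels are assumptions, and it does not reproduce the SIFT software itself, only the thresholds stated in the text (score ≤ 0.05, alignment depth of at least 3, conservation score below 3.5).

```python
from typing import Optional

SCORE_CUTOFF = 0.05         # SIFT scores <= 0.05 are called functional
MIN_DEPTH = 3               # fewer than 3 aligned sequences: prediction excluded
CONSERVATION_CUTOFF = 3.5   # conservation score must be below this value


def classify_snp(score: float, depth: int, conservation: float) -> Optional[str]:
    """Return 'functional', 'tolerated', or None if the prediction is filtered out."""
    if depth < MIN_DEPTH:
        return None  # too few homologs cover the SNP position
    if conservation >= CONSERVATION_CUTOFF:
        return None  # alignment too conserved for a reliable prediction
    return "functional" if score <= SCORE_CUTOFF else "tolerated"
```

For example, `classify_snp(0.00, depth=10, conservation=3.0)` returns `"functional"`, whereas the same score with `depth=2` returns `None`.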

Figure 4.1. SIFT database phylogeny. BLAST database constructed for SIFT.

Neighbour-Joining phylogeny based on concatenated 16S rRNA and rpoB nucleotide

sequences from the thirteen available mycobacterial genomes. Node support after 1000

bootstrap repetitions shown on branches. Scale bar indicates number of SNPs. The tree

is rooted using the outgroup Nocardia farcinica. The MTBC was not included to prevent

contamination of the predictions by closely related sequences; if present the MTBC

would diverge from M. leprae.

[Figure 4.1 tree. Tip labels, in extracted order: M. leprae (TN); M. leprae (Br4923); M. ulcerans (AGY99); M. marinum (M); M. avium subsp. paratuberculosis (K10); M. avium (104); M. abscessus (ATCC 19977); M. smegmatis (MC2 155); M. sp. JLS; M. sp. MCS; M. sp. KMS; M. gilvum (PYR-GCK); M. vanbaalenii (PYR-1); outgroup Nocardia farcinica (IFM 10152). Bootstrap support values on branches range from 50 to 100.]

Table 4.1. SIFT database of non-MTBC species. Thirteen complete whole genome

sequences were published at time of this study. Genomes downloaded from NCBI.

4.2.2 Indels

Short indels (ranging from 1 to about 20 nt) were identified in Lineage 1 and 2 genome

strains using the indelpe module in MAQ (Li et al., 2008). All Lineage 1 and 2 genomes

used in Chapter 3 were used in this analysis. The tab delimited output file includes: start

position, indel type (inserted/deleted nucleotides). From this file it was possible to

identify frameshift mutations as those indels whose length is not divisible by three, the codon length. Indels are

inherently difficult to identify in short read data, and so only a targeted analysis of two

lineages was performed.
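
The frameshift rule above can be sketched in a few lines of Python. The record layout used here (position, indel type, inserted or deleted nucleotides) is a hypothetical simplification and not the actual MAQ indelpe output format; the example indels are invented for illustration.

```python
from typing import List, NamedTuple


class Indel(NamedTuple):
    position: int   # start position in the reference genome
    kind: str       # "insertion" or "deletion"
    sequence: str   # inserted or deleted nucleotides


def is_frameshift(indel: Indel) -> bool:
    """An indel shifts the reading frame if its length is not a multiple of three."""
    return len(indel.sequence) % 3 != 0


def frameshifts(indels: List[Indel]) -> List[Indel]:
    """Keep only the indels that would cause a frameshift."""
    return [indel for indel in indels if is_frameshift(indel)]


# Hypothetical indels, for illustration only.
indels = [
    Indel(1200, "deletion", "A"),
    Indel(5321, "insertion", "GCC"),
    Indel(90417, "deletion", "TTGA"),
    Indel(233010, "insertion", "ACGTCA"),
]
print(frameshifts(indels))
```

The rule is only meaningful for indels that fall within coding sequence, so in practice the positions would first need to be intersected with the gene annotation.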

4.2.3 Homology modelling

Prediction of protein structure was performed using Protein Homology/analogy

Recognition Engine V 2.0 (Phyre2) (Kelley & Sternberg, 2009). Phyre2 is available at:

http://www.sbg.bio.ic.ac.uk/phyre2.

Genome                                    Description
M. leprae TN                              Causative agent of human leprosy. Leads to permanent damage to the skin, nerves, limbs and eyes if left untreated
M. leprae Br4923                          As above
M. ulcerans AGY99                         An emerging pathogen that causes Buruli ulcer
M. marinum M                              Causes a tuberculosis-like disease in cold-blooded animals, and a peripheral granulomatous disease in humans
M. avium subsp. paratuberculosis K10      Causes tuberculosis in birds and disseminated infections in immunocompromised humans
M. avium 104                              See above
M. abscessus ATCC 19977                   Environmental bacterium that causes lung, wound, and skin infections
M. smegmatis str. MC2 155                 Generally non-pathogenic, capable of causing soft tissue lesions
M. sp. JLS                                A pyrene-degrading bacterium isolated from the soil
M. sp. MCS                                As above
M. sp. KMS                                As above
M. gilvum PYR-GCK                         As above
M. vanbaalenii PYR-1                      Capable of degrading a variety of aromatic hydrocarbons

The Phyre2 server has been described in detail

previously (Bennett-Lovsey et al., 2008; Kelley & Sternberg, 2009; Mao et

al., 2012). Briefly, nine ancestral (wild-type) regulatory protein-coding sequences were

submitted to the Phyre2 server. A non-redundant fold library is constructed based on

known protein sequences mined from the Structural Classification of Proteins (SCOP)

database and Protein Data Bank (PDB). The query protein sequence is scanned against a

non-redundant sequence database, and a profile Hidden Markov model (HMM)

generated. PSI-BLAST is used to collect close and remote sequence homologues, an

alignment is constructed and secondary structure predicted. The profile HMM and the

secondary structure are then used to scan the fold library. This alignment process returns

a score on which all alignments are ranked, and an E-value is generated. The top twenty

scoring matches are then used to generate full 3-D models of each sequence and reported

to the user. For each regulatory protein, the highest confidence model (>99%) with the

greatest coverage was used in the subsequent analysis. Whilst it was possible to generate

a homology model for all regulators, for four proteins the structure did not cover the

SNP region and so these were not used in later analysis.

4.2.4 Change in protein stability

Prediction of SNPs that cause a destabilisation of the protein structure was made using

CUPSAT (Parthiban et al., 2006). The CUPSAT server is available at: http://cupsat.tu-

bs.de/. CUPSAT predicts the change in free energy of protein unfolding between wild-

type and mutant proteins (ΔΔG) using structural environment specific atom potentials

and torsion angle potentials. The prediction is based on existing PDB protein structures,

or user supplied structures. The output consists of information about mutation site, its

structural features (solvent accessibility, secondary structure and torsion angles), and

comprehensive information about changes in protein stability for nineteen possible

substitutions of a specific amino acid mutation (Parthiban et al., 2006). A mutation

is categorised as destabilising if protein stability is lost (negative ΔΔG) or stabilising if

protein stability increases (positive ΔΔG). Changes in stability with a magnitude of < 0.5 ΔΔG are not

considered significant, and are classified as neutral mutations.
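
The stability categories described above can be expressed as a small Python function. This is a sketch of the classification rule only, under the sign convention stated in the text (negative ΔΔG is destabilising); the function name and labels are assumptions and it does not call the CUPSAT server.

```python
NEUTRAL_BAND = 0.5  # changes smaller in magnitude than this are treated as neutral


def classify_stability(ddg: float) -> str:
    """Classify a predicted change in unfolding free energy (ΔΔG)."""
    if abs(ddg) < NEUTRAL_BAND:
        return "neutral"
    return "destabilising" if ddg < 0 else "stabilising"
```

For example, `classify_stability(-1.8)` returns `"destabilising"`, and `classify_stability(0.3)` returns `"neutral"`.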

4.3 Results

4.3.1 Predicting functional SNPs within control set

The Sorting Intolerant From Tolerant (SIFT) algorithm was first tested on a set of SNPs

that are highly likely to affect protein function in the MTBC. Drug resistance in the

MTBC is largely caused by SNPs (Ramaswamy & Musser, 1998; Riska et al., 2000),

and many of these drug resistance-conferring mutations have been identified and are

housed in the TBDream database (Sandgren et al., 2009) (database downloaded on 07-

06-10). In total a non-redundant set of 87 SNPs was extracted, consisting of SNPs from

the following genes: ahpC, kasA and katG (SNPs associated with isoniazid resistance),

embB (ethambutol resistance), gyrA and gyrB (fluoroquinolone resistance), pncA

(pyrazinamide resistance) and rpoB (rifampicin resistance).

In addition to the drug resistance conferring SNPs, a literature search of experimentally

determined functional SNPs in the MTBC was conducted to supplement the test set.

SNPs from two additional genes: pykA and mmaA3 were included from this search (Behr

et al., 2000; Keating et al., 2005). One of the early signs of variation amongst the

MTBC was the variation in carbon utilisation (Goldman, 1963; Winder & Brennan,

1966). A characteristic of M. bovis was the inability to grow on glycerol as a sole carbon

source, unlike M. tuberculosis, and instead requiring the addition of pyruvate to the

growth medium in vitro (Wayne, 1994). A mutation within pykA in M. bovis, encoding

pyruvate kinase, was found to render this enzyme inactive, thereby disrupting the use

of carbohydrates as an energy source (Keating et al., 2005). The nonsynonymous SNP

(E220D) is also found in strains of M. africanum and M. microti (which infects voles),

and these cultures are also supplemented with pyruvate (Keating et al., 2005; Wayne,

1994). The second nonsynonymous SNP (G98D) in mmaA3 is present within most

strains of M. bovis BCG, such as BCG-Pasteur (Behr et al., 2000). A defining

characteristic of mycobacteria is their capacity to synthesise mycolic acids, and it had

been known that some BCG strains could not synthesise methoxymycolates, one type of

mycolic acid (Minnikin et al., 1983). The G98D mutation was subsequently found to be

responsible for this difference (Behr et al., 2000; Yuan et al., 1998).

SIFT was applied to the test SNP set and the results filtered as described (section 4.2.1),

removing regions covered by <3 homologs and alignments with too little sequence

variation with which to form a reliable prediction. In total 63 SNP predictions were

made for the control set, and 48 (78.7%) of the drug resistance associated SNPs were

predicted to impact protein function, leaving the remaining 13 SNPs (21.3%) predicted

to be tolerated. The two pykA and mmaA3 SNPs were also predicted functional, both

receiving the lowest SIFT scores of 0.00. Together, nearly 80% of the SNP set was

predicted functional, which may suggest a false negative error rate of 20%. It

should be stressed, however, that not all of the SNPs within the drug resistance set are

experimentally confirmed to be involved in drug resistance; some are instead only

associated with it. Additionally, promoter mutations could also be the cause of drug resistance,

such as the inhA promoter mutations that cause isoniazid resistance (Musser et al.,

1996); non-coding SNPs can inherently not be tested in this type of analysis.
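
The percentages quoted for the control set follow directly from the counts given above, as the short Python check below illustrates. The variable names are illustrative; the counts are those stated in the text (48 functional and 13 tolerated drug resistance SNPs, plus the two pykA and mmaA3 SNPs).

```python
def percent(part: int, whole: int) -> float:
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


dr_functional = 48     # drug resistance SNPs predicted functional
dr_tolerated = 13      # drug resistance SNPs predicted tolerated
other_functional = 2   # pykA and mmaA3 SNPs, both predicted functional

dr_total = dr_functional + dr_tolerated
control_total = dr_total + other_functional

print(percent(dr_functional, dr_total))                          # drug resistance set
print(percent(dr_tolerated, dr_total))                           # apparent false negatives
print(percent(dr_functional + other_functional, control_total))  # whole control set
```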

4.3.2 Predicted functional nonsynonymous SNPs

All lineage-specific nonsynonymous SNPs identified in Chapter 3 were entered into the

dataset for this study (N=1550 SNPs). Predictions could be made for 1339 (86.4%) of

the SNPs. Removal of predictions based on genes that were highly conserved reduced

this set by 37.8% (506 SNPs), leaving 833 SNP predictions. SNPs within genes that

harboured little sequence diversity were not included as such predictions would be

biased, potentially causing increased functional mutation calls and thereby increased

false positive error rate (Ng & Henikoff, 2003).

In total, 371 nonsynonymous SNPs were predicted to affect gene function (Table 4.2).

The ancient lineages (Lineages 1, 5 and 6) were found to harbour nearly double the

number of predicted functional SNPs found in the modern lineages (246 vs 125 functional

SNPs respectively). However, the three ancient lineages also have the longest branch

lengths as shown in Chapter 3. To account for any influence of branch length, the

number of functional and tolerated SNPs was expressed as percentages (Figure 4.2). The

percentage of SNPs predicted functional, for which predictions could be made, ranged

from 40.9-48.4% across the lineages, with a mean of 44.5%. There was no significant

difference between the frequency of predicted functional and tolerated SNPs across the

lineages (Mann-Whitney, p = 0.4817). Additionally, no difference was observed

between the number of functional SNPs by the ancient and modern classification, with a

mean of 44.7% and 44.1% predicted functional SNPs respectively.
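
The normalisation by branch length can be illustrated by recomputing the percentages from the counts in Table 4.2. The Python sketch below is not the analysis code used in this study; the grouping of the "Modern branch" row with the modern lineages is an assumption based on the totals quoted in the text (246 vs 125 functional SNPs).

```python
# (tolerated, functional) counts per branch, taken from Table 4.2.
table_4_2 = {
    "L1": (79, 74),
    "L2": (25, 18),
    "L3": (52, 44),
    "L4": (33, 23),
    "L5": (111, 89),
    "L6": (118, 83),
    "Modern branch": (44, 40),
}
ANCIENT = ("L1", "L5", "L6")
MODERN = ("L2", "L3", "L4", "Modern branch")  # grouping assumed from the text

# Percentage of predictions called functional on each branch.
pct_functional = {
    name: 100 * functional / (tolerated + functional)
    for name, (tolerated, functional) in table_4_2.items()
}


def group_mean(names):
    """Unweighted mean of the per-branch percentages."""
    return sum(pct_functional[n] for n in names) / len(names)


total_functional = sum(f for _, f in table_4_2.values())
total_predictions = sum(t + f for t, f in table_4_2.values())
pooled = 100 * total_functional / total_predictions

print(f"pooled: {pooled:.1f}%")
print(f"ancient mean: {group_mean(ANCIENT):.1f}%")
print(f"modern mean: {group_mean(MODERN):.1f}%")
```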

As a further control, all genes with predicted functional SNPs were categorised as

essential or nonessential on the basis of transposon mutagenesis screens (Sassetti et al.,

2003; Sassetti & Rubin, 2003). Using these two categories, 54 (14.6%) of the

functional predictions fell within essential genes. This would suggest a 14.6% false positive error

rate for SIFT predictions, which is also close to the previously described false positive

error rate for the SIFT algorithm (~20%) (Ng & Henikoff, 2003).

Table 4.2. Predicted tolerated and functional SNPs using SIFT. SNPs with a SIFT score

≤ 0.05 are predicted functional; genes with conservation scores of 3.5 or above were

filtered out.

Lineage          Tolerated   Functional   Total predictions
L1               79          74           153
L2               25          18           43
L3               52          44           96
L4               33          23           56
L5               111         89           200
L6               118         83           201
Modern branch    44          40           84
Total            462         371          833

Figure 4.2. SIFT predictions. To account for differences in lineage branch lengths, the

percentage of SNPs predicted as being functional and tolerated is shown. Horizontal

dashed line indicates the average percentage of predicted functional SNPs (44.5%).

[Figure 4.2 plot. x-axis: Lineage 1, Lineage 5, Lineage 6, Lineage 2, Lineage 3, Lineage 4, Modern lineage; y-axis: Number of SNPs (%), 0 to 100; series: Tolerated SNP, Functional SNP.]

4.3.3 Impact of nonsynonymous SNPs outside of the human adapted MTBC

To test if the high percentage of predicted functional SNPs is restricted to the MTBC or

is a common phenomenon in mycobacteria, all SNPs were identified between the

reconstructed ancestor of the MTBC sequences and M. canetti, the closely related

outgroup of the MTBC. Out of a total of 12,319 coding SNPs, 4,245 (34.5%) were

nonsynonymous. Compared to the percentage of nonsynonymous SNPs found within the

lineage branches of the MTBC (64.8%), M. canetti has roughly half the proportion of

nonsynonymous SNPs. These nonsynonymous SNPs were screened for potential functional

impact using SIFT. Out of a total of 2,416 possible predictions, 522 (21.6%) were predicted

functional, a significantly lower proportion than in the MTBC (chi-square, p<0.0001).

This would suggest that in contrast to the MTBC, the

majority of changes in M. canetti are functionally neutral.
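
A comparison of this kind can be sketched as a chi-square test on a 2x2 table of functional versus tolerated predictions. The Python example below is an illustration under stated assumptions: it takes the MTBC counts to be the 371 functional and 462 tolerated predictions from section 4.3.2, uses no continuity correction, and is not necessarily the exact test configuration used in this study.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int):
    """Pearson chi-square test (1 degree of freedom) for the table [[a, b], [c, d]].

    Returns the test statistic and its p-value.
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


mtbc_functional, mtbc_tolerated = 371, 462   # assumed from section 4.3.2
canetti_functional = 522
canetti_tolerated = 2416 - canetti_functional

stat, p_value = chi_square_2x2(mtbc_functional, mtbc_tolerated,
                               canetti_functional, canetti_tolerated)
print(f"chi-square = {stat:.1f}, p = {p_value:.2g}")
```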

4.3.4 Clustering of functional SNPs

There was little evidence of functional SNPs clustering within specific genes, which

could be indicative of adaptive selection. The majority of genes did not harbour a

predicted functional SNP (3701 genes, 92.1%), whilst those that did carried 1-5

SNPs per gene, as shown in Figure 4.3A. The frequency of SNPs mainly followed the

distribution expected under the Poisson model fitted to the data; however, there were a

few exceptions: Rv2079, fadD15 (Rv2187) and Rv0465c. The three genes that deviate

from the expected number of SNPs had SNP numbers ranging from 4-5 per gene (Figure

4.3B).
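
The comparison with a Poisson model can be sketched as follows. The per-gene counts used here (3701, 278, 35, 3, 1 and 2 genes carrying 0 to 5 predicted functional SNPs) are an interpretation of the bar labels in Figure 4.3A, which sum to the 371 functional SNPs reported above. Estimating the Poisson rate as the mean number of SNPs per gene is also an assumption, as the fitting procedure is not specified in the text.

```python
import math

# Number of genes carrying k predicted functional SNPs (read from Figure 4.3A).
observed = {0: 3701, 1: 278, 2: 35, 3: 3, 4: 1, 5: 2}

n_genes = sum(observed.values())
n_snps = sum(k * count for k, count in observed.items())
rate = n_snps / n_genes  # mean SNPs per gene, used as the Poisson rate


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of observing exactly k events under a Poisson distribution."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


# Expected number of genes with k SNPs if SNPs fell on genes at random.
expected = {k: n_genes * poisson_pmf(k, rate) for k in observed}

for k in observed:
    print(f"{k} SNPs: observed {observed[k]:5d}, expected {expected[k]:10.3f}")
```

Under this model fewer than one gene would be expected to carry four or five predicted functional SNPs, which is why the three genes in those classes stand out.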

All three genes are above the average gene length of 1003nt, ranging from 1425-2514nt,

which could account for the increased number of predicted functional SNPs. However,

out of the fifteen nonsynonymous SNPs found within the three genes, only one was not

predicted to be functional, which would not be expected based on the genome-wide

distribution of predicted functional and tolerated SNPs (chi-square, p=0.0002).

Therefore, whilst these are relatively long genes, this does not account for the skewed

number of predicted functional nonsynonymous SNPs.

Not much is known about Rv2079, which has four predicted functional SNPs. It is a

conserved hypothetical gene of unknown function, and SNPs are found in four lineages

(1, 2, 5 and 6); in Lineage 2 a nonsynonymous SNP causes the introduction of a stop

codon. Combined with evidence that this gene is nonessential for growth based on

transposon screens (Sassetti et al., 2003; Sassetti & Rubin, 2003), it is possible that

functional mutations are accumulating as Rv2079 is either incorrectly annotated as a

gene, or in the case of Lineage 2 has become a pseudogene. The other outliers were

fadD15 and Rv0465c, which contain five predicted functional SNPs each. The genes

belong to different functional categories, lipid metabolism and regulatory proteins,

respectively. As before, fadD15 functional SNPs are across multiple lineages (1, 3, 4

and 6), and one SNP is also present in the modern lineage branch. Therefore all the

Modern lineages have one or two functional SNPs in fadD15. Furthermore, in Lineages

1 and 5, the two SNPs are nonsense and result in the introduction of stop codons in the

lineages. Function is again not known for fadD15, but it encodes a fatty-acid-CoA

synthetase and is likely involved in lipid degradation (Cole et al., 1998).

The other gene with five predicted functional SNPs, Rv0465c, is a probable

transcriptional regulator (Cole et al., 1998). It shares high sequence identity with the

RamB protein from Corynebacterium glutamicum, which is in the same phylum as M.

tuberculosis. As well as binding to its own promoter to autoregulate expression, RamB

controls isocitrate lyase (icl1) which is part of the glyoxylate cycle (Micklinghoff et al.,

2009). Although not annotated in the most current release of the Tuberculist database

(Release 26, December 2012), it has been given the gene name ramB by Micklinghoff et

al. (2009), and this has been adopted in the following sections. Characteristic of

regulators, the mycobacterial ramB has a DNA binding domain, which is in the N-

terminus of the 465 amino acid protein, including the helix-turn-helix domain (HTH),

from amino acid residues 21 to 40, as based on the PROSITE database. One of the two

predicted functional SNPs in Lineage 6 is located within the HTH domain (N36D),

which might be expected to directly affect the capacity of the protein to bind DNA. All

other functional SNPs, found in Lineages 1, 4 and 5, are located throughout the first half

of the protein, leaving only Lineages 2 and 3 with a likely functioning ramB.

Next, the distribution of the predicted functional SNPs across the genome was

calculated, shown in Figure 4.4. Functional SNPs were located across the genome, and

appear to follow the same distribution profile of the nonsynonymous SNP frequencies,

as identified in Chapter 3. On average, there is one functional SNP per 10.9kb of coding

sequence.

Figure 4.3. Distribution of predicted functional SNPs per gene. A. SNPs per gene

range from 0-5, with actual number of genes shown at top of bar. Line indicates

predicted values under a Poisson distribution fitted to the data. B. y-axis plotted on a

log10 scale to highlight deviation from the expected number at high SNP numbers per

gene.

[Figure 4.3 plot. Panel A: x-axis, number of SNPs in gene (0-5); y-axis, predicted functional SNPs (linear scale, 0-4000); bar labels 3701, 278, 35, 3, 1 and 2 for 0 to 5 SNPs respectively. Panel B: the same data with the y-axis on a log10 scale (1-1000).]
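The comparison with a Poisson expectation in Figure 4.3 can be sketched as follows. This is a minimal illustration only: the gene counts are those read from the bar labels of Figure 4.3 (3701, 278, 35, 3, 1 and 2 genes with 0 to 5 SNPs), and the fit is assumed to use the sample mean as the Poisson parameter, which may differ from the fitting procedure actually used.

```python
import math

# Number of genes carrying k predicted functional SNPs (k = 0..5),
# taken from the bar labels of Figure 4.3.
observed = [3701, 278, 35, 3, 1, 2]

n_genes = sum(observed)
n_snps = sum(k * n for k, n in enumerate(observed))
lam = n_snps / n_genes  # assumption: Poisson mean estimated as SNPs per gene


def poisson_pmf(k: int, mean: float) -> float:
    """Probability of observing exactly k events under Poisson(mean)."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


# Expected number of genes in each class under the fitted Poisson model.
expected = [n_genes * poisson_pmf(k, lam) for k in range(len(observed))]

for k, (obs, exp) in enumerate(zip(observed, expected)):
    print(f"{k} SNPs: observed {obs:4d} genes, Poisson expectation {exp:9.3f}")
```

Under these assumptions the observed counts exceed the Poisson expectation for genes with two or more SNPs, which is the deviation that panel B is designed to highlight.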


Figure 4.4. Frequency distribution of predicted functional SNPs across genome.

SNPs were placed into bins of 0.1Mb. Right y-axis predicted functional SNPs, left y-

axis nonsynonymous SNPs.

[Figure 4.4 plot. x-axis: genome position (Mb), 0-4; y-axes: predicted functional SNPs (0-20) and nonsynonymous SNPs (0-60).]
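The 0.1 Mb binning used for Figure 4.4 can be sketched as below. The input is purely illustrative: it uses the H37Rv coordinates of the eleven regulator SNPs listed in Table 4.4 rather than the full set of predicted functional SNPs, and the function name `bin_snps` is hypothetical.

```python
from collections import Counter

BIN_SIZE = 100_000  # 0.1 Mb bins, as stated in the Figure 4.4 legend

# Illustrative input only: chromosome positions of the eleven
# regulator SNPs listed in Table 4.4 (H37Rv coordinates).
positions = [
    2096430, 3447480, 3536008, 555945, 1157771, 4187063,
    940602, 455325, 331588, 1097023, 2641840,
]


def bin_snps(snp_positions, bin_size=BIN_SIZE):
    """Map each bin index (position // bin_size) to the number of SNPs in it."""
    return Counter(pos // bin_size for pos in snp_positions)


counts = bin_snps(positions)
for index in sorted(counts):
    start_mb = index * BIN_SIZE / 1e6
    print(f"{start_mb:.1f}-{start_mb + 0.1:.1f} Mb: {counts[index]} SNP(s)")
```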


4.3.5 Functional category analysis of functional SNPs

To determine if the predicted functional SNPs are within specific gene categories or

instead evenly distributed, the genes with predicted functional SNPs were grouped by

the Tuberculist functional categories (Lew et al., 2011). The percentage of functional

SNPs within each of the eight functional categories was compared to the percentage

representation of the respective category genome-wide, and is shown in Figure 4.5. In

this way, the unequal distribution of genes within specific categories was normalised

and functional SNP distribution expressed as a ratio. Ratios >1 represent functional

categories over-represented with functional SNPs, whereas <1 indicates under-

representation. Categories significantly over-represented with functional SNPs were

lipid metabolism (2.4-fold) and regulatory proteins (1.6-fold) (chi-square, false

discovery rate adjusted p < 0.05). Interestingly, information pathways was the most under-represented category, with 2.0-fold fewer predicted functional SNPs than would have been expected (chi-square, false discovery rate adjusted p = 0.04) (Table 4.3). Genes within the conserved hypotheticals category were also significantly under-represented.
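The normalisation described above can be illustrated with the gene and SNP counts reported in Table 4.3. The sketch assumes that the expected share of a category is its share of genes across the eight categories, and that under-representation is reported as a negative fold value (the reciprocal of the ratio); both conventions are inferred from Table 4.3 rather than stated explicitly.

```python
# (genes in category, predicted functional SNPs), as reported in Table 4.3
TABLE_4_3 = {
    "information pathways": (242, 12),
    "conserved hypotheticals": (1031, 63),
    "virulence, detoxification, adaptation": (238, 17),
    "intermediary metabolism and respiration": (936, 88),
    "cell wall and cell processes": (773, 91),
    "regulatory proteins": (198, 31),
    "unknown": (16, 3),
    "lipid metabolism": (271, 66),
}

TOTAL_GENES = sum(genes for genes, _ in TABLE_4_3.values())
TOTAL_SNPS = sum(snps for _, snps in TABLE_4_3.values())


def representation(genes: int, snps: int) -> float:
    """Signed fold representation: share of SNPs divided by share of genes.

    Ratios below 1 are returned as negative reciprocals (assumed convention).
    """
    ratio = (snps / TOTAL_SNPS) / (genes / TOTAL_GENES)
    return ratio if ratio >= 1 else -1 / ratio


for name, (genes, snps) in TABLE_4_3.items():
    print(f"{name:<42} {representation(genes, snps):+.1f}")
```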

Figure 4.5. Functional category representation. Values on the x-axis are ratios,

representing the deviation from the expected number of predicted functional SNPs per

category. Ratios > 1 indicate overrepresentation, <1 underrepresentation, and ~1

indicates that the number of predicted functional SNPs is on par with the expected

number. Categories are based on Tuberculist annotations. * indicates p <0.05 by

individual chi-square test followed by multiple test correction (False Discovery Rate

method) (Benjamini & Hochberg, 1995) .

[Figure 4.5 plot. x-axis: functional category representation (-3 to 3). Categories, in the order extracted: information pathways; conserved hypotheticals; virulence, detoxification, adaptation; intermediary metabolism and respiration; cell wall and cell processes; regulatory proteins; unknown; lipid metabolism. Four categories are marked with an asterisk.]


Table 4.3. Functional category representation. The number of predicted functional

SNPs within genes from each respective category. Representation of category expressed

as ratios. Independent chi-square tests performed for all categories, followed by multiple

test correction (False Discovery Rate method) (Benjamini & Hochberg, 1995).

Functional category | Gene number | Functional SNPs | Representation | chi-square (adjusted p-value)
information pathways | 242 | 12 | -2.0 | 0.04
conserved hypotheticals | 1031 | 63 | -1.6 | <0.01
virulence, detoxification, adaptation | 238 | 17 | -1.4 | 0.27
intermediary metabolism and respiration | 936 | 88 | -1.1 | 0.55
cell wall and cell processes | 773 | 91 | 1.2 | 0.18
regulatory proteins | 198 | 31 | 1.6 | 0.04
unknown | 16 | 3 | 1.9 | 0.55
lipid metabolism | 271 | 66 | 2.4 | <0.01
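The testing procedure summarised in Table 4.3 (one chi-square test per category, followed by false discovery rate correction) can be sketched as below. The exact contingency layout behind the reported p-values is not given here, so the sketch assumes a one-degree-of-freedom goodness-of-fit test of each category against all others, with expected counts proportional to gene number; the function names are hypothetical and the adjusted values are not expected to match Table 4.3 exactly.

```python
import math

# (genes in category, predicted functional SNPs), as reported in Table 4.3
TABLE_4_3 = {
    "information pathways": (242, 12),
    "conserved hypotheticals": (1031, 63),
    "virulence, detoxification, adaptation": (238, 17),
    "intermediary metabolism and respiration": (936, 88),
    "cell wall and cell processes": (773, 91),
    "regulatory proteins": (198, 31),
    "unknown": (16, 3),
    "lipid metabolism": (271, 66),
}
TOTAL_GENES = sum(genes for genes, _ in TABLE_4_3.values())
TOTAL_SNPS = sum(snps for _, snps in TABLE_4_3.values())


def chi_square_p(genes: int, snps: int) -> float:
    """Assumed test: 1-d.f. goodness of fit, one category versus all others."""
    expected_in = TOTAL_SNPS * genes / TOTAL_GENES
    expected_out = TOTAL_SNPS - expected_in
    deviation_sq = (snps - expected_in) ** 2  # identical for both cells
    statistic = deviation_sq / expected_in + deviation_sq / expected_out
    return math.erfc(math.sqrt(statistic / 2))  # chi-square upper tail, 1 d.f.


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


names = list(TABLE_4_3)
raw = [chi_square_p(*TABLE_4_3[name]) for name in names]
adjusted = dict(zip(names, benjamini_hochberg(raw)))
significant = {name for name, p in adjusted.items() if p < 0.05}
print(sorted(significant))
```

Under these assumptions the same four categories fall below the 0.05 threshold as in Table 4.3, although the individual adjusted values differ.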

An alternative method to account for the number of functional SNPs per category was also applied. This was based on the number of functional SNPs per potential nonsynonymous SNP position in each functional category. Using this method, it was again found that the information pathways category had accumulated the fewest functional SNPs (12 functional SNPs out of 202,427 potential nonsynonymous positions, 0.006%). The lipid and regulatory categories had accumulated the most functional SNPs, with 0.02% and 0.03% of all potential nonsynonymous positions harbouring a functional SNP, respectively. In summary, this method identifies the same over- and under-represented gene categories as found previously.
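One way in which potential nonsynonymous positions could be counted is sketched below. The definition used for the figures above is not given in this section, so the sketch assumes that each codon position contributes the fraction of its three possible base changes that alter the encoded residue (changes to a stop codon are counted as altering it). The function names and the short test sequence are hypothetical; only the final percentage uses values from the text.

```python
from itertools import product

# Standard amino acid assignments, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def nonsynonymous_changes(codon: str) -> int:
    """Count the single-base changes (of nine) that alter the encoded residue."""
    reference = CODON_TABLE[codon]
    count = 0
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODON_TABLE[mutant] != reference:
                count += 1
    return count


def potential_nonsynonymous_sites(cds: str) -> float:
    """Sum over codons of (nonsynonymous changes / 3); at most 3 per codon."""
    return sum(
        nonsynonymous_changes(cds[i:i + 3]) / 3
        for i in range(0, len(cds) - 2, 3)
    )


def percent_functional(n_functional: int, n_sites: float) -> float:
    """Functional SNPs as a percentage of potential nonsynonymous positions."""
    return 100 * n_functional / n_sites


# Information pathways category, using the counts given in the text.
print(f"{percent_functional(12, 202_427):.3f}%")
```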

Stratification of the predicted functional SNPs by lineage within the functional categories found no significant difference between lineages (Kruskal-Wallis test, a non-parametric one-way analysis of variance; p = 0.99). This suggests that, whilst there is a significant difference in the representation of functional SNPs within the above four gene categories, it is not driven by specific lineages but is instead a phenomenon common to the MTBC lineages.


4.3.6 Functional impairment of Lineage 1 and 2 regulatory proteins

It has been shown that two functional categories, regulatory proteins and lipid

metabolism, have accumulated a greater number of predicted functional SNPs than

expected. The following section focuses on the over-represented regulatory category,

and specifically on the predicted functional mutations within Lineages 1 and 2, which

are the focus of the transcriptomic study in Chapter 5. This provides an opportunity to

combine additional predictive information, such as structural features, with the previous

sequence based predictions, whilst also providing a reduced SNP set to initially guide

the transcriptome analysis.

Eleven genes within the two lineages harbour lineage-specific SNPs predicted by SIFT

analysis as likely to impair protein function, and a further gene harbours a nonsense

mutation (Table 4.4). Targeted analysis of insertion and deletion (indel) mutations in the

lineage branches identified a further two genes with indels that cause frameshifts (Table 4.4). The frameshift insertion in Rv3830c in Lineage 2 removes the existing stop codon, likely causing read-through and fusion with the downstream gene Rv3829c. Similarly, the two-base frameshift deletion within Rv1028c (kdpD) at chromosome position 1151486 leads to the introduction of a stop codon at codon position 235 and a

resulting 625 (72.8%) amino acid truncation of the ancestral protein. kdpD is a two

component transcriptional sensor and controls the expression of the kdpABC operon,

which in Escherichia coli is involved in potassium transport at low potassium

concentrations (Walderhaug et al., 1992). A third indel was found within mce1R

(Rv0165c), at chromosome position 194305. However, the same two-nucleotide

insertion (consisting of two CC nucleotides) was found across all Lineage 1 and 2

strains, and so was removed from the analysis as this likely represents a two base

deletion that is specific to the H37Rv sequence used in the reference based mapping.
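The SIFT-based call for the eleven regulator SNPs can be illustrated as follows. The cutoff is an assumption here: the commonly used SIFT convention of treating scores at or below 0.05 as affecting protein function is applied, and the scores are those listed in Table 4.4. The function name is hypothetical.

```python
SIFT_CUTOFF = 0.05  # assumption: scores <= 0.05 are called functional

# SIFT scores for the eleven regulator SNPs listed in Table 4.4.
sift_scores = {
    "Rv1846c": 0.05,
    "Rv3082c": 0.01,
    "Rv3167c": 0.02,
    "Rv0465c": 0.02,
    "Rv1032c": 0.01,
    "Rv3736": 0.01,
    "Rv0844c": 0.00,
    "Rv0377": 0.00,
    "Rv0275": 0.00,
    "Rv0981": 0.04,
    "Rv2359": 0.02,
}


def predicted_functional(score: float, cutoff: float = SIFT_CUTOFF) -> bool:
    """Return True when the substitution is predicted to impair function."""
    return score <= cutoff


calls = {gene: predicted_functional(score) for gene, score in sift_scores.items()}
print(sum(calls.values()), "of", len(calls), "SNPs predicted functional")
```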


Table 4.4. Transcriptional regulators with predicted functional SNPs and indels. Eleven genes carry SNPs predicted to be functional by SIFT analysis. One SNP causes a nonsense mutation (stop gain). Two indels cause frameshift mutations. n/a: not possible to predict with SIFT.

Gene | Regulator type | SNP | Mutation | Lineage | SIFT score
Rv1846c BlaI | penicillinase repressor | T 2096430 G | L57R | 1 | 0.05
Rv3082c VirS | AraC | T 3447480 G | L316R | 1 | 0.01
Rv3167c | TetR | C 3536008 A | P17Q | 1 | 0.02
Rv0465c RamB | HTH-XRE | A 555945 G | Q121R | 1 | 0.02
Rv1032c TcrS | 2-component sensor | C 1157771 G | S62C | 1 | 0.01
Rv3736 | AraC | G 4187063 A | G144R | 1 | 0.01
Rv0844c NarL | 2-component regulator | G 940602 C | G169R | 2 | 0.00
Rv0377 | LysR | G 455325 C | R302P | 2 | 0.00
Rv0275 | TetR | T 331588 C | L24S | Modern | 0.00
Rv0981 MprA | 2-component regulator | A 1097023 G | S70G | Modern | 0.04
Rv2359 Zur | Fur | G 2641840 A | R64H | Modern | 0.02
Rv2788 SirR | Fe-dependent repressor | C 3097349 | Q131X | 1 | n/a
Rv3830c | TetR | insertion: 4305063 T | S208 frameshift | 2 | n/a
Rv1028c KdpD | 2-component sensor | deletion: 1151486 AC | H67 frameshift | 1 | n/a

4.3.6.1 Change in protein stability

The sequence-based predictions of functional impairment of transcriptional regulators were refined through incorporation of structure-based information. The location of each SNP was placed in the context of protein domain information, such as identification of SNPs within the functionally important DNA binding helix-turn-helix (HTH) domain. Protein domain annotations were extracted from the Pfam database (Punta et al., 2012). These were then complemented with predictions of the protein stability (ΔG) of wild-type and mutant protein structures, enabling the change in protein stability (ΔΔG) to be calculated. Compromised protein folding and decreased stability of the protein product are major pathogenic consequences of nonsynonymous SNPs, affecting the ability of the protein to function (Wang & Moult, 2001; Yue et al., 2005).

To calculate ΔΔG it is necessary to have protein structures for each of the regulators.

Only two of the eleven regulators with predicted functional SNPs have had their protein

structures resolved and are publicly available in the Protein Data Bank (PDB) (Burley,

2013); these are BlaI (PDB ID: 2G9W) and NarL (3EUL) (Sala et al., 2009; Schnell et

al., 2008). For the remaining nine regulators, homology modeling was performed using

the Phyre2 server (Kelley & Sternberg, 2009) as described in Methods (section 4.2.3).

Following this it was still not possible to construct protein models for four of the

regulators, either due to the low quality of the model or because the SNP position was

not covered. The remaining seven regulators were entered into the analysis.

The CUPSAT server was used to predict ΔΔG (Parthiban et al., 2006). Protein stability

is categorised as destabilising (-ΔΔG), neutral (0 ΔΔG) or stabilising (+ΔΔG). Changes

in stability of less than 0.5 kcal/mol in magnitude are not considered significant (see section 4.2.4). Five of the

regulator SNPs were predicted to cause a loss of protein stability, one protein structure

increased in stability following the SNP, and one prediction of energy change was too

small to classify as either stabilising or destabilising, and so is likely neutral (Table 4.5).

Combined with the protein domain information, all five of the destabilising SNPs were

located within the HTH DNA binding domains, and likely affect the regulatory function

of the protein: Rv0275, Rv0844c (narL), Rv1846c (BlaI), Rv3082c (virS) and Rv3167c.

These were classified as having “high predictive scores” and form a reduced set of

transcriptional regulators predicted to be functionally impaired (Table 4.5). For example,

a SNP in Lineage 1 strains introduces an arginine residue into the conserved position of

the virS HTH domain, which is predicted to destabilise the structure and cause a loss of

function (Figure 4.6).
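The combination of stability and domain information described above can be sketched as below, using the ΔΔG values and Pfam domains reported in Table 4.5. The 0.5 kcal/mol cutoff follows the text; the rule that a destabilising SNP within an HTH domain receives a high predictive score is one reading of Table 4.5, and the function names are hypothetical. Nonsense and frameshift mutations, for which ΔΔG cannot be calculated, are not covered by this sketch.

```python
from collections import Counter

STABILITY_CUTOFF = 0.5  # kcal/mol; smaller changes are treated as neutral

# gene: (ΔΔG in kcal/mol, Pfam domain containing the SNP), from Table 4.5
regulators = {
    "Rv0275": (-3.18, "helix-turn-helix"),
    "Rv0844c": (-4.66, "helix-turn-helix"),
    "Rv1846c": (-8.72, "helix-turn-helix"),
    "Rv3082c": (-2.03, "helix-turn-helix"),
    "Rv3167c": (-1.21, "helix-turn-helix"),
    "Rv0981": (2.83, "cheY"),
    "Rv2359": (0.47, "helix-turn-helix"),
}


def stability_class(ddg: float, cutoff: float = STABILITY_CUTOFF) -> str:
    """Classify a ΔΔG value as destabilising, stabilising or neutral."""
    if abs(ddg) < cutoff:
        return "neutral"
    return "destabilising" if ddg < 0 else "stabilising"


def high_predictive(ddg: float, domain: str) -> bool:
    """Assumed rule: destabilising SNP located within an HTH domain."""
    return stability_class(ddg) == "destabilising" and domain == "helix-turn-helix"


classes = Counter(stability_class(ddg) for ddg, _ in regulators.values())
high = sorted(gene for gene, (ddg, dom) in regulators.items() if high_predictive(ddg, dom))
print(dict(classes))
print(high)
```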


Table 4.5. Regulatory proteins with predicted functional SNPs and indels in

Lineages 1 and 2. Sequence based predictions of functional SNPs are combined with

Pfam protein domain information and prediction of changes in protein stability (ΔΔG).

n/a: unable to calculate ΔΔG as the mutation is an indel or nonsense SNP, unkn: unable

to generate a protein structure using homology modelling.

Gene | Mutation | Lineage | Domain | Protein stability (ΔΔG; kcal/mol)
high predictive score
Rv0275 | L24S | Modern | helix-turn-helix | -3.18
Rv0844c NarL | G169R | 2 | helix-turn-helix | -4.66
Rv1028c KdpD | H67 frameshift | 1 | 2-component sensor | n/a
Rv1846c BlaI | L57R | 1 | helix-turn-helix | -8.72
Rv2788 SirR | Q131X | 1 | Fe-dependent repressor | n/a
Rv3082c VirS | L316R | 1 | helix-turn-helix | -2.03
Rv3167c | P17Q | 1 | helix-turn-helix | -1.21
Rv3830c | S208 frameshift fusion | 2 | low complexity | n/a
low predictive score
Rv0465c RamB | Q121R | 1 | low complexity | unkn
Rv0377 | R302P | 2 | low complexity | unkn
Rv0981 MprA | S70G | Modern | cheY | 2.83
Rv1032c TcrS | S62C | 1 | low complexity | unkn
Rv2359 Zur | R64H | Modern | helix-turn-helix | 0.47
Rv3736 | G144R | 1 | arabinose-binding | unkn


Figure 4.6. Predicted loss of function of virS transcriptional regulator in Lineage 1.

Homology model of wild-type VirS protein, covering amino acid residues 214 - 334.

Arrow indicates Lineage 1 SNP at amino acid position 316 within the HTH domain.

CUPSAT analysis of the ancestral and mutant protein predicts a destabilisation of the

structure (ΔΔG = -2.03 kcal/mol). Sequence conservation of region used in SIFT

prediction shown on right hand side, with the ancestral MTBC sequence shown at the

top of the sequence alignment. Standard one-letter amino acid nomenclature is used, with X indicating a gap in the alignment.

[Figure 4.6 image: homology model of VirS with the L316R position marked.]

Sequence alignment shown in the figure, starting at residue 281 (axis labels 310 and 320 in the original):

QUERY          LIERERRAQA ARYLAQPGLY LSQIAVLLGY SEQSALNRSC RRWFGMTPRQ YRAYGGVSGR *
mmi:MMAR_3320  VVDDVRREVT ERYLRDSDMT LTHLARQLGY AEQSVLSRSC QRWFGASPAS LRAXXXXXXX X
mmi:MMAR_5276  LIDEVRKETA DRYLRTTAMS LSHLARELGY AEQSVLTRSC KRWFGIGPAA YRAXXXXXXX X
mul:MUL_4350   LIDEVRKETA DRYLRTTAMS LSHLARELGY AEQSVLTRSC KRWFGIGPAA YRAXXXXXXX X
mab:MAB_3997c  LVDQIRREAA ERLLSDTDLS LDHLSRQLGY AEQSVFTRSC KRWFGTTPSA YRSXXXXXXX X
mgi:Mflv_5495  LVDQTRRDTA QRLLLDTALS LDQLACPLXX XXXXXXXXXX XXXXXXXXXX XXXXXXXXXX X
mab:MAB_3623   LLDTIRLDLA DHLVTSDRHS LTEISEMLAF SSPSNFSRWF RGHRAMSPRT WRXXXXXXXX X
mmc:Mmcs_3216  LRQSFLRERA ILRLLDRSLS VSEIAAELGY AELTNFTHAF KRWTGRSPRH FRXXXXXXXX X
mkm:Mkms_3278  LRQSFLRERA ILRLLDRSLS VSEIAAELGY AELTNFTHAF KRWTGRSPRH FRXXXXXXXX X
mjl:Mjls_3227  LRQSFLRERA ILRLLDRSLS VSEIAAELGY AELTNFTHAF KRWTGRSPRH FRXXXXXXXX X
mgi:Mflv_4594  LRQSCLRESA MMLLITRSMS ASQIATELGY GDLANFSHAF KRWTGRSPSE YRXXXXXXXX X
mab:MAB_0715c  IRDAALRTEA IKSLEDGSES LNDLSVRLGF SELSAFTRAF RRWTGASPAQ YRXXXXXXXX X
mab:MAB_2050   LRQSFLQERA ILRILDRSVS VSEIAAELGY ADLTNFTHAF KRWTGRSPRH FRXXXXXXXX X
mmi:MMAR_3156  LRQAFLRERA MLQLLDRSLS VSEIATDLGY SDLANFSHAF KRWTGRSPSE FRXXXXXXXX X


4.4 Discussion

4.4.1 Strengths and limitations of the study

The overall aim of this study was to computationally measure the impact of SNPs in the

MTBC, focusing specifically on SNPs that contribute to lineage-specific variation

identified in Chapter 3. Over 1,500 nonsynonymous SNPs were identified in the lineage

branches of MTBC, and the phenotypic effects of these are unknown. Unlike other

bacterial species, the majority of SNPs in MTBC (over two-thirds) are nonsynonymous,

and this SNP set was the focus of this computational study. Such SNPs are more

tractable to computational prediction methods than synonymous and intergenic SNPs, as

the impact of the amino acid substitutions can be measured using the properties of the

amino acid, such as residue volume change, as well as the evolutionary conservation of

the specific nucleotide position based on multiple sequence alignments. This is reflected

in the development of computational prediction methods based mainly on

nonsynonymous SNPs (Ng & Henikoff, 2006). However, clearly noncoding SNPs can

also have an impact on gene function, such as the mutation of regulatory regions found

in M. tuberculosis drug resistance (Müller et al., 2011; Riska et al., 2000). More recently

it has also been suggested that synonymous SNPs are less silent than previously

assumed (Plotkin & Kudla, 2011). Despite not having an effect on the resulting protein

sequence, synonymous SNPs, and therefore synonymous codon changes, have shaped

gene expression through the phenomenon of codon-usage bias (Plotkin & Kudla, 2011).

Differential use of synonymous codons can affect RNA processing, protein translation

and protein folding (Plotkin & Kudla, 2011); industrial applications have exploited this

to increase gene expression over 1000-fold through introduction of synonymous SNP

changes (Gustafsson et al., 2004). Furthermore, in human based studies, a synonymous

mutation has also been shown to change the substrate specificity of the multidrug-

resistance protein 1 (MDR1), although the precise mechanism is not yet understood

(Kimchi-Sarfaty et al., 2007; Komar, 2007). Together this demonstrates the potential

functional importance of all SNP types, and it is likely that future study of M. tuberculosis genomic variation will attribute instances of functional variation not only to nonsynonymous SNPs but to synonymous and noncoding SNPs as well.

Whilst experimental methods exist to characterise the functional effect of SNPs, such as site-directed mutagenesis, studying the molecular effects of mutations in the MTBC in this way is time-consuming, laborious and unfeasible at this scale. Computational methods can therefore provide useful and reliable information about the effects of amino acid substitutions at an initial stage. There are two main approaches to predicting the functional effect of coding nonsynonymous SNPs. The first relies on mapping the SNP to the three-dimensional protein structure, and the second takes a sequence-based approach, assessing the nature of the position and introduced amino acid type. At the time of writing, there were protein

structures for 259 (6.4%) of all annotated M. tuberculosis proteins in the Protein Data

Bank (Burley, 2013). This number has not increased significantly in the interim period,

and currently 314 genes have associated protein structures (December, 2012) (Burley,

2013). To ensure that this was a comprehensive study of the effects of lineage-specific

nonsynonymous SNPs, it was decided to use the latter prediction method based on

sequence homology, thus maximising the number of SNP predictions. The method

chosen was the Sorting Intolerant From Tolerant (SIFT) algorithm (Ng & Henikoff,

2003). Although SIFT relies solely on amino acid sequence to make the prediction, it

has been shown to perform similarly to methods based on different evolutionary and

structural features, and critically can be applied to many more of the lineage-specific

SNPs (Saunders & Baker, 2002; Sunyaev et al., 2001). It has been suggested that a

combination of the two main prediction methods (sequence and structural based) will

likely improve the accuracy of predictions (Bao & Cui, 2005; Thusberg & Vihinen,

2009), but the chosen method was viewed as an acceptable trade-off. More in depth

structural work can be applied at a later targeted stage, as was used in this study on the

genes within the regulatory protein category. However, even at this stage, four of the

eleven (36.4%) regulatory proteins with nonsynonymous SNPs could not be entered into

structural based predictions, owing to the lack of structural information; for the

remaining proteins only two had been experimentally determined, requiring intensive

homology modeling to increase the size of the structural dataset.

Moving on from SNPs, short insertions and deletions (indels) also have potential functional

consequences, particularly indels that are of a length not divisible by three and so lead to

a change in the reading frame. However, inference of indels from next-generation


sequence data is challenging, and so far methods for identifying these lag behind

methods for calling SNPs in terms of sensitivity and specificity (Albers et al., 2011). For

this reason, it was decided to not include a genome-wide analysis of indels, but focus on

a few potential indels in genes involved in regulatory function instead. Indels are also rarer than SNPs in the MTBC, and for these reasons the identification of SNPs has had the greatest attention in such studies so far; SNPs are effectively the lower-hanging fruit. It is likely that these issues will be resolved and indels will receive more attention as

newer algorithms to detect them are developed (Albers et al., 2011), and as potentially

longer reads from third generation sequencing technologies are utilised.
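The reading-frame rule mentioned above (an indel whose length is not divisible by three changes the reading frame) can be expressed in a few lines. The indel lengths are taken from the regulator indels described in section 4.3.6; the function name is hypothetical.

```python
def is_frameshift(indel_length: int) -> bool:
    """An indel shifts the reading frame when its length is not a multiple of 3."""
    return indel_length % 3 != 0


# Lengths (in bases) of the regulator indels described in section 4.3.6.
indels = {
    "Rv3830c insertion (T)": 1,
    "Rv1028c deletion (AC)": 2,
    "mce1R insertion (CC)": 2,
}

for name, length in indels.items():
    print(f"{name}: frameshift = {is_frameshift(length)}")
```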

4.4.2 Validation of the SIFT method

For the first time it was possible to identify all potential functional SNPs in the lineages

of MTBC. As described in Chapter 3, these SNPs represent the background variation

that contributes to the underlying lineage genetic diversity. Identification of SNPs more

likely to contribute to functional diversity focuses later analyses on predicted

phenotypically important SNPs, and on a broader scale tests the hypothesis that a high

proportion of SNPs within the MTBC will be functional, likely due to reduced purifying

selection acting within MTBC (Hershberg et al., 2008).

The SIFT algorithm was first run on a test SNP set that would be expected to be

enriched for functional SNPs, and so act as positive control for the performance of the

method. The set was based on SNPs associated with drug resistance from the current

release of the TBDReaMDB database (Sandgren et al., 2009). It was found that 79.4% of

SNPs were predicted functional by SIFT, leaving 20.6% of SNPs associated with drug

resistance predicted to be functionally neutral. This potential false negative error rate of

~20% is close to that previously described by the authors of SIFT (Ng & Henikoff,

2001; Ng & Henikoff, 2003). However, it is important to note that the majority of SNPs

in the positive control set are putative mutations found in drug resistant clinical M.

tuberculosis isolates, and so may be only incidentally associated with, rather than causally involved in, drug resistance

(Sandgren et al., 2009). This will likely mean that the control set has some SNPs that are

not functional and so is not a completely robust test of the SIFT algorithm. As an

alternative test, significantly fewer predicted functional SNPs were found within the genes previously characterised as being essential for growth, and the proportion of functional SNPs that did fall within essential genes (14.6%) is again close to the expected false positive error rate of SIFT. Together this provides confidence in the later SNP predictions.

4.4.3 Half of lineage-specific SNPs are predicted to have functional consequences

Applying SIFT to all lineage-specific SNPs, it was possible to make predictions for

>85% of the set, and strikingly it was found that just under half were predicted to have a

functional effect. The mean percentage of functional SNPs for all lineages was 44.5%

and no significant difference was found between the individual lineages, or by grouping

lineages into ancient and modern categories. This prediction is very close to the estimate

made by Hershberg et al. (2008). The authors of this former study estimated that ~40%

of the SNPs within MTBC are functional by extrapolating from the SNPs found within

the set of 89 genes sequenced in 99 human M. tuberculosis isolates (Hershberg et al.,

2008). In contrast to the high proportion of functional SNPs in the MTBC, all SNPs

between an M. canettii strain, the closely related outlier from the MTBC, and the

reconstructed M. tuberculosis ancestor were identified and it was found that only 21.6%

of the nonsynonymous SNPs were predicted to be functional, which is less than half of

the proportion seen in the MTBC. This suggests that the hypothesised reduction in

purifying selection acting within the MTBC is generating substantial diversity. Interestingly, a

similar phenomenon has been observed in humans, where recent demographic

expansions have led to the accumulation of low frequency genetic variants associated

with strong functional effects (Keinan & Clark, 2012; Tennessen et al., 2012).

Considering the tight link between the MTBC and its human host, it is interesting to

speculate that these human expansions might have had a similar effect on the genetic

diversity of the MTBC (Hershberg et al., 2008).

Although purifying selection is likely reduced in MTBC, it was still possible to detect

signals of this force through increased removal of predicted functional SNPs within

genes classed as essential for growth compared to nonessential genes and also by

clustering of SNPs beyond the expected distribution. When grouped by functional

category, genes encoding proteins involved in the information pathways category

accumulated significantly fewer predicted functional SNPs than expected. Conversely,

genes encoding proteins that perform regulatory functions and those involved in lipid

metabolism were over-represented with functional SNPs. Interestingly, it was also found

that the transcriptional regulator ramB had accumulated more functional SNPs than

expected, spanning four of the lineages. Within the regulatory protein category, the focus was placed on Lineage 1 and 2 SNPs, as these two lineages are the subject of the transcriptomic study in Chapter 5; additional mutational and structural information was integrated to identify regulators likely to be functionally impaired, for use in the subsequent study. It was found that several SNPs lie within the HTH

DNA binding domain of the regulatory proteins, such as a Lineage 1 SNP in virS. VirS

regulates its own transcription and is also a positive regulator of an adjacent divergently-

expressed MymA locus, which has experimentally been shown to be involved in

virulence in guinea pigs (Singh et al., 2003; Singh et al., 2005). Together with several

frameshift mutations arising from short indels, it is hypothesised that specific lineages

have functionally impaired regulators and this has the potential to give rise to

phenotypic diversity. Such SNPs should be detectable at the transcriptional level, and

part of the following chapter (Chapter 5) explores this hypothesis.

In summary, this study has identified a set of nonsynonymous SNPs likely to have

functional consequences in MTBC. However, it is not possible using the SIFT

predictions to predict how these mutations affect protein function. There are four

possible evolutionary fates for SNPs: The mutant is beneficial; causes a severe fitness

cost and so is lost from the population; is functionally neutral; or finally is neither

beneficial nor excessively harmful, but slightly deleterious (Balbi & Feil, 2007). Slightly

deleterious SNPs are the largest class, and in Escherichia coli it has been estimated that

for every beneficial mutation there are 10^5 slightly deleterious mutations (Kibota &

Lynch, 1996). As seen in Figure 4.7, it can be anticipated that many of the predicted

functional SNPs identified in this study will fall within this slightly deleterious category,

whilst the proportion of SNPs that have a greater impact or are “more” functional is

unknown, but likely determined by a combination of selective and stochastic forces,

such as the level of purifying selection acting within the organism.


Figure 4.7. Spectrum of functional SNPs. The consequences of nonsynonymous SNPs range from tolerated/neutral to functional; at the extreme they result in cell death and are therefore not observed in the bacterial population. In the MTBC, ~40% of SNPs were predicted to be functional in this study, but their severity is unknown.

[Figure 4.7 schematic. x-axis: increasing severity of SNP; y-axis: increasing number of SNPs. Regions labelled: tolerated SNPs (~60%); functional SNPs (~40%); “more” functional SNPs (?%); harmful, cell death.]


Chapter 5 Screening the effect of lineage-specific variation by sequence-based transcriptional profiling

5.1 Introduction

M. tuberculosis infection is defined by a typically protracted period of asymptomatic

infection followed by progression to active disease in a minority of individuals.

Throughout these stages of infection, M. tuberculosis is exposed to a range of

microenvironments, including acidic pH, reactive oxygen species, and nutrient

starvation (Barry et al., 2009). Genome sequencing of the M. tuberculosis reference

strain H37Rv by Cole et al. revealed a complex network of transcriptional regulation,

including thirteen sigma factors, eleven two-component regulators, eleven serine-

threonine protein kinases and over one hundred predicted transcription factors (Cole et

al., 1998). At the initiation of this study, the extent of transcriptional variation between

clinical isolates from the six main lineages was unknown, and the contribution of the

underlying genetic diversity to such variation was an open question.

In 2007, a microarray based study comparing H37Rv and the animal adapted M. bovis

growing under steady state conditions revealed that the human and bovine pathogens

showed differential expression of ninety-two genes, which encoded a range of functions,

including cell wall and secreted proteins, transcriptional regulators, PE/PPE proteins,

lipid metabolism and toxin–antitoxin pairs (Golby et al., 2007). It is now known that

there are on average ~1500 SNPs separating any two MTBC strains (section 3.3.1), which

raises the likelihood that human-adapted MTBC strains will also display a similar


quantity of differential expression. Shortly after identification of the main six human

adapted MTBC lineages, a microarray-based study in 2010 surveyed for the first time

differences in gene expression amongst clinical isolates of the MTBC (Homolka et al.,

2010). The study was based on a total of fifteen MTBC clinical isolates from Lineage 1,

the Beijing group of Lineage 2, two sub-lineages from Lineage 4 and Lineage 6. The

study found specific transcriptional patterns in vitro and in intracellular growth based on

the ancient and modern lineage groupings, demonstrating that strains from defined

phylogenetic groups display similar gene expression, which suggests the importance of

understanding the underlying genetic background. However, the strains used in the study were not

genome sequenced, which limited its scope: the observed expression differences could not be related

to specific genetic variation.

The previous chapters would not have been possible without the availability of whole

genome sequences, and such data now is crucial to experiments linking genotype to

phenotype. Previous transcriptomic studies have relied on microarray based methods,

but recent advances in DNA sequencing technologies have enabled the determination of

RNA expression through sequencing of cDNA prepared by reverse transcription of total

cellular RNA (RNA-seq), which provides dynamic ranges several orders of magnitude

greater than other technologies, whilst at the greatest possible resolution. The first

sequence based transcriptome of M. tuberculosis strain H37Rv was published in 2011 by

Arnvig et al., and whilst this was not a clinical isolate, this demonstrated the power of

RNA-seq to capture the complete transcriptional landscape of M. tuberculosis (Arnvig et

al., 2011).

5.1.1 Aims

The aims of this chapter were to survey the transcriptome profiles of M. tuberculosis

clinical isolates from Lineages 1 and 2, and to understand the effects of lineage-specific

variation identified in the previous chapters. Specific aims were to:

• characterise M. tuberculosis transcriptomes using a sequence based approach

• capture lineage-specific transcription profiles in the transcriptome sets

• explore the functional impact of lineage-specific SNPs identified in Chapters 3 and 4


5.2 Methods

5.2.1 Clinical isolates in study

5.2.1.1 Strains sequenced using RNA-seq

Strains are from a collection of M. tuberculosis isolates from foreign-born tuberculosis

patients in San Francisco, who contracted the infection in their country of origin

(Gagneux et al., 2006a). All strains are drug susceptible and have been typed in previous studies

(Table 5.1) (Gagneux et al., 2006a; Hershberg et al., 2008). Three strains were selected

from Lineages 1 and 2 respectively, to represent the genetic diversity in the lineages.

Figure 5.1 shows the previously described MTBC phylogeny based on MLSA analysis,

and the strains used in the RNA-seq study are highlighted (Hershberg et al., 2008). From

Lineage 1, two strains are from the large Rim of Indian Ocean subgroup (strains N0072 and N0153) and one is a representative of the Philippines subgroup (strain N0157). From Lineage 2, two Beijing strains were selected (strains N0145 and N0052) together with a less common non-Beijing strain (N0031). Figure 5.1 uses the original naming schema, but from this point

on the later adopted ‘N’ number strain naming will be referred to. To preserve the two

naming conventions both have been used in Table 5.1. All strains have been genome

sequenced in previous studies or as part of this thesis.

5.2.1.2 Additional growth curve experiment strains

Growth rate measurements for the RNA-seq study strains were supplemented with
the clinical isolates shown in Table 5.2. In total, six additional strains (three each from
Lineages 1 and 2) were included to explore potential lineage-specific differences in
exponential phase growth rate. The reference laboratory strain H37Rv was also included.


Table 5.1. Lineage 1 and 2 strains used in the RNA-seq study. All strains were

previously genome sequenced except strain N0031, which was sequenced for this thesis

in Chapter 3. This study refers to the strain names used in the Gagneux group, but

original strain names used by Hershberg et al. (2008) are shown for reference. In

addition to lineage, the region of difference (RD), which has been historically used to

type the strains, is indicated. Geographic distribution and prevalence of each lineage are based on
previous classifications (Coscolla & Gagneux, 2010).

Strain   MLSA strain name   Lineage   RD lineage   Lineage geographic distribution   Patient origin
N0153    T83                1         RD239        Rim of Indian Ocean               Vietnam
N0072    EAS053             1         RD239        Rim of Indian Ocean               India
N0157    T92                1         RD239        The Philippines                   The Philippines
N0145    T67                2         RD105        Beijing                           China
N0052    98_1833            2         RD105        Beijing                           China
N0031    94_M4241A          2         RD105        Non-Beijing                       China

Table 5.2. Additional strains used in growth curve experiment. Three additional

strains from Lineage 1 and Lineage 2 were included in the growth curve experiments in

combination with the previously described six RNA-seq study strains. All are clinical

strains and isolated as part of the San Francisco strain collection (Gagneux et al.,

2006a). Genome column indicates genome sequencing status of strain.

Strain   Strain ID   Lineage   RD lineage   Patient origin     Genome
N0043    96_4329     1         RD239        Burma              Y
N0075    EAS080      1         RD239        Vietnam            N
N0121    T17         1         RD239        The Philippines    Y
N0041    96_2104     2         RD105        Vietnam            N
N0053    98_1863     2         RD105        China              Y
N0140    T47         2         RD105        Macau              N


5.2.1.3 Additional qRT-PCR strains

Confirmation of selected lineage-specific gene expression by qRT-PCR used all
six RNA-seq strains plus four additional strains, two each from Lineages 1 and 2. These are
shown below in Table 5.3.

Table 5.3. Additional strains used in qRT-PCR confirmation. Two additional strains from each of
Lineage 1 and Lineage 2 were included in the RNA-seq confirmation. All are clinical

strains and isolated as part of the San Francisco strain collection (Gagneux et al.,

2006a). Genome column indicates genome sequencing status of strain. One strain is

currently not genome sequenced but this was not required for the aims of the qRT-PCR

study.

Strain   Strain ID   Lineage   RD lineage   Patient origin     Genome
N0043    96_4329     1         RD239        Burma              Y
N0121    T17         1         RD239        The Philippines    Y
N0041    96_2104     2         RD105        Vietnam            N
N0053    98_1863     2         RD105        China              Y


Figure 5.1. Strains sequenced in RNA-seq study. Circles indicate the six Lineage 1

and 2 strains used in the RNA-seq study. Phylogenetic tree of MTBC adapted from

(Hershberg et al., 2008). Image reproduced under the Creative Commons Attribution

License (CCAL).


5.2.2 Cluster analysis

Hierarchical cluster analysis of the transcriptomes was performed using the hclust
function in R with the complete linkage method. Distances were derived from
the dissimilarity matrix of pairwise Spearman correlations of total gene expression (N=4,015
genes), expressed as Reads Per Kilobase per Million mapped reads (RPKM). Clade
support was assessed with 1000 bootstrap replications using the R function pvclust.
Comparison of the total gene expression per strain to SNP distance was performed with
normalised read counts that were transformed using the variance stabilising
transformation (VST), as implemented in the DESeq package (Anders & Huber, 2010).
The VST is a monotonic function of the counts, applied to each sample so that the variance of
the transformed data becomes approximately independent of the mean.
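The clustering steps described above can be illustrated with a minimal Python sketch. It is not the R code used in this study: the conversion of a Spearman correlation rho into a dissimilarity as 1 - rho, the function names and the RPKM values are all assumptions made for the example, and bootstrap support (pvclust) is not reproduced.

```python
from itertools import combinations


def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def complete_linkage(names, dist):
    """Merge clusters by the smallest maximum pairwise distance.

    dist maps frozenset({a, b}) to a distance. Returns a list of
    (cluster_a, cluster_b, height) merges in the order they occur.
    """
    def linkage(c1, c2):
        return max(dist[frozenset((a, b))] for a in c1 for b in c2)

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        merges.append((c1, c2, linkage(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Hypothetical RPKM values for five genes in four samples.
rpkm = {
    "L1_a": [10, 200, 35, 4, 80],
    "L1_b": [12, 180, 30, 5, 90],
    "L2_a": [90, 20, 300, 60, 5],
    "L2_b": [85, 25, 280, 70, 6],
}
# Assumed conversion of correlation to dissimilarity: 1 - rho.
dist = {
    frozenset((a, b)): 1 - spearman(rpkm[a], rpkm[b])
    for a, b in combinations(rpkm, 2)
}
merges = complete_linkage(list(rpkm), dist)
for left, right, height in merges:
    print(left, right, round(height, 3))
```

In this toy data set the two samples of each hypothetical lineage merge first, and the two lineage clusters join last at the largest height, which is the pattern the dendrograms in this chapter are read for.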

5.2.3 Differential expression analysis

Statistical testing for the main differential expression analysis was performed using

DESeq (Anders & Huber, 2010). DESeq is a method based on the negative binomial

distribution and implemented in the R statistical environment. Raw read counts were first
normalised using DESeq to adjust for differences in library size. Reads from
technical replicates were combined and treated as one sample. Genes deleted at either
strain or lineage level were removed before the analysis (N=223 genes); deletions
were identified from genome coverage in the respective strain's genome, with a
gene defined as deleted when <90% of its length was covered. Features
(annotated genes, antisense or sRNAs) whose normalised expression overlapped between strains of different
lineages owing to strain-specific expression were filtered out, leaving 1,606 features
entered into the analysis. For the purpose of testing for lineage-specific differential
expression in DESeq, strains from the same lineage were treated as biological replicates,
and the mean expression of the two lineages was compared. Significant differential
expression was defined as adjusted p<0.05 (p-value adjusted for multiple testing using the
Benjamini-Hochberg method).
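To make the multiple testing step concrete, the following Python sketch shows the standard Benjamini-Hochberg adjustment. It illustrates the method only; it is not DESeq's implementation, and the function name and p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for five features.
raw = [0.001, 0.008, 0.039, 0.041, 0.6]
adjusted = benjamini_hochberg(raw)
significant = [p < 0.05 for p in adjusted]
print([round(p, 5) for p in adjusted], significant)
```

In this example four raw p-values are below 0.05, but only two features remain significant after adjustment, which is why the threshold in this chapter is applied to adjusted rather than raw values.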


5.2.4 Transcriptional Start Site (TSS) calling

Custom Perl scripts were written for TSS calling. Briefly, the increment in reads from

one genome position to the next consecutive base was calculated for all genomic

positions, and an increment significantly above the average background coverage
was defined as a candidate TSS. TSS peak height was taken as representative of the level
of expression from the TSS. To build a genome-wide TSS map for M. tuberculosis,
the putative TSS detected were annotated automatically according to their genomic distribution,
following a previous TSS analysis using RNA-seq data (Sharma et al., 2010b).
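The increment-based TSS calling described above can be sketched in Python as follows. This is not the Perl script used in the thesis: the text does not state how "significantly above background" was decided, so the fold-over-mean threshold, the minimum increment and all names below are assumptions chosen purely for illustration.

```python
def call_tss(coverage, fold=2.0, min_increment=10):
    """Return (position, increment) pairs for sharp rises in read coverage.

    coverage: per-base read depth along one strand, in 5' to 3' order.
    fold, min_increment: illustrative thresholds, not values from the thesis.
    """
    background = sum(coverage) / len(coverage)
    threshold = max(min_increment, fold * background)
    candidates = []
    for pos in range(1, len(coverage)):
        increment = coverage[pos] - coverage[pos - 1]
        if increment >= threshold:
            # The increment doubles as the peak height used to rank TSS.
            candidates.append((pos, increment))
    return candidates


# Hypothetical coverage: low background, then a transcript starting at index 4.
coverage = [2, 2, 3, 2, 40, 41, 39, 40, 38, 3, 2, 2]
print(call_tss(coverage))
```

A real implementation would treat the two strands separately and would need a statistically justified background model; the sketch shows only the position-to-position increment idea.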

5.3 Results


5.3.1 Growth rate in vitro

It was critical to isolate the transcriptomes of all study strains from the same

physiological state, ensuring that differential transcription is not simply a reflection of

the stage of growth. RNA was harvested at two growth phases in this study, mid-

exponential and stationary; and these were defined as an Optical Density (OD600) of 0.4

to 0.6 and one week after an OD of 1.0, respectively. A difficulty of working with
clinical strains rather than well-used reference strains is that their growth rates, which are
needed to standardise the RNA extraction process, are largely unknown.

Three representative strains from each of Lineages 1 and 2 were selected for the RNA-seq study
(section 5.2.1.1), and the growth of the six strains was monitored over a 14-day period. In
defined 7H9 medium (section 2.1.3), culture density (OD600) was measured daily from the
initial inoculation (day 0). From frozen stocks, strains were grown in 10 ml 7H9 for two
days prior to transfer into the roller bottles used for the growth curves and all RNA

extractions. At day 0, a calculated volume was transferred from the pre-culture to start

all growth curves at OD 0.01. This experiment was also used to identify any lineage

level differences between growth rates in vitro, and three additional strains from both

Lineage 1 and 2 were included to increase the sample size and so the statistical power of

the test. Additional clinical isolates are described in section 5.2.1.2. The H37Rv

laboratory strain was also included as a reference.


Figure 5.2. In vitro growth curves. A. Growth of twelve strains from Lineage 1 and 2,

plus H37Rv. B. Strains pooled by lineage. Error bars are the standard error of the mean

(SEM). All strains were grown in three independent experiments, and under the same

conditions. Strains are coloured using previously defined lineage colouring.

[Figure 5.2 plots. A: optical density (OD600, log scale, 0.001 to 10) against days from inoculum (0 to 14) for strains N0121, N0145, N0043, N0053, N0041, N0031, N0072, N0153, N0157, N0052, N0140, N0075 and H37Rv. B: optical density (OD600, log scale, 0.01 to 10) against days from inoculum (0 to 14) for Lineage 1, Lineage 2 and H37Rv.]


Growth rates of the clinical strains varied, with a trend for Lineage 2 strains to continue
in late-exponential phase for longer than Lineage 1 strains (Figure 5.2A). This was
reflected in higher OD600 readings for Lineage 2, with the Lineage 2 strain N0145
reaching an OD600 of ~10, the highest of all strains. The reference strain H37Rv fell in the
middle of the range of growth rates. By day 9 to 10 all strains had entered stationary phase.

Figure 5.2B plots all strains from the same lineage as replicates, confirming the

observation that Lineage 2 strains do continue in late-exponential phase for

comparatively longer. However, mid-exponential growth is similar for all strains

irrespective of lineage. As pre-cultures were used, all strains were in exponential growth

at day zero, the start of the growth curve. Between days three and four, strains leave

mid-logarithmic and enter late-logarithmic growth. For these experiments, mid-

logarithmic growth was defined as OD ≤ 0.6.

Strain specific doubling times are shown in Table 5.4. Exponential doubling times range

from 13.8 ± 0.2 hrs (strain N0043) to 24.2 ± 0.6 hrs (strain N0075). The
doubling times of the clinical strains can therefore differ by approximately 10 hrs, which is important when
synchronising RNA extraction experiments. Whilst there is some variability in the
specific growth rates of the strains, this was not significant at the lineage level. The
mean exponential doubling time was 18.2 ± 1.8 hrs for Lineage 1 and
16.4 ± 0.5 hrs for Lineage 2 (two-tailed Student's t-test, p=0.35).
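For readers unfamiliar with how a doubling time is obtained from OD readings, the Python sketch below fits a straight line to ln(OD) against time and divides ln 2 by the slope. The thesis does not state the exact fitting procedure used, so this least-squares approach, the function name and the data points are assumptions for illustration.

```python
import math


def doubling_time(hours, od):
    """Doubling time (hrs) from a least-squares fit of ln(OD) against time."""
    logs = [math.log(value) for value in od]
    n = len(hours)
    mean_t = sum(hours) / n
    mean_y = sum(logs) / n
    slope = (
        sum((t - mean_t) * (y - mean_y) for t, y in zip(hours, logs))
        / sum((t - mean_t) ** 2 for t in hours)
    )
    # During exponential growth OD(t) = OD(0) * exp(slope * t).
    return math.log(2) / slope


# Hypothetical culture started at OD 0.01 that doubles every 16 hours.
hours = [0, 24, 48, 72]
od = [0.01 * 2 ** (t / 16) for t in hours]
print(round(doubling_time(hours, od), 1))
```

Only readings taken during exponential growth should be passed to such a fit; including late-exponential or stationary phase points would inflate the estimated doubling time.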


Table 5.4. In vitro growth rates. Doubling times in hours are shown for exponential

phase growth with the SEM. All strains were grown in at least three independent

experiments, under conditions detailed in 2.1.4. Lineage mean doubling time also

shown. The laboratory strain H37Rv was used as a reference. Asterisks (*) identify

strains used in RNA-seq study.

Strain     Lineage   Doubling time (hrs)   Error (SEM)   Lineage mean doubling time (hrs)
N0043      1         13.8                  0.2           18.2
N0072 *    1         16.1                  0.8
N0075      1         24.2                  0.6
N0121      1         16.0                  0.4
N0153 *    1         23.2                  1.9
N0157 *    1         15.9                  0.4
N0031 *    2         16.2                  0.1           16.4
N0041      2         18.1                  2.0
N0052 *    2         16.2                  2.1
N0053      2         16.6                  2.1
N0140      2         16.9                  1.9
N0145 *    2         14.4                  0.6
H37Rv      4         18.0                  1.7           18.0


5.3.2 RNA isolation and Illumina ready libraries

Following extraction of RNA, sample concentration was measured by Nanodrop and

RNA quality by Agilent 2100 BioAnalyser. The BioAnalyser is a nanofluidics device

that performs size fractionation and quantification of DNA, RNA, or protein samples.

Only high quality samples, based on the electropherogram and expressed as an RNA

Integrity Number (RIN) ≥8, were used in later analysis (Figure 5.3A). Samples were

rigorously DNase treated to remove potential DNA contamination from the RNA

extraction, and then entered into the Illumina library construction stage. Two main cDNA
library types were constructed. The first was based on a modified Illumina strand-specific
protocol for sequencing all RNA species (section 2.5.1). In total, nine high quality cDNA
libraries were constructed, including three technical replicates. Only libraries with
concentrations >10 µg/ml and with the expected size distribution of adapter-ligated cDNA
fragments were sequenced (Figure 5.3B). The second main library method depleted

processed RNAs in the samples, and was used in Transcriptional Start Site (TSS)

mapping. RNA was sent to Vertis Biotechnologie AG and four libraries constructed

(section 2.5.2).

A. B.

Figure 5.3. Quality control of RNA-seq samples by Bioanalyser. Migration time of

sample shown in seconds on x-axis, and fluorescence units (FU) on y-axis. A. Integrity

of total RNA following RNA isolation and DNase treatment for strain N0052. The two

largest peaks are rRNA 16S and 23S, and the area under the peaks used as a metric of

RNA quality (RIN). B. Quality of strain N0052 Illumina strand-specific RNA-seq

library. Lower and upper size markers are 15 bp and 1500 bp. Distribution of cDNA

library fragments expected to be 180-200 bp, corresponding to 60-80 seconds.

[Figure 5.3 images: Bioanalyser electropherograms for panels A and B, with migration time (s) on the x-axis and fluorescence units (FU) on the y-axis.]


5.3.3 Transcriptome sequencing

All libraries were sequenced at the NIMR by the High Throughput Sequencing staff,
managed by Abdul Sesay. Single-end read sequencing was performed on Illumina

Genome Analyser (GA) and HiSeq (HS) sequencers, using a single flow cell lane per

library. The mean number of raw reads generated per run was 93.2 million (ranging

from 30.1-186.8 million). Full details of the transcriptome data are shown in Tables 5.5

to 5.6.

5.3.3.1 RNA-seq data quality control

It was first necessary to discard low quality reads from the transcriptomes to improve the
quality of the subsequent reference-based mapping. Poor quality bases were
trimmed using the SolexaQA package (Cox et al., 2010), removing bases with an
error probability p > 0.05 and discarding reads < 25 bases in length. For the nine RNA-seq
transcriptomes used in the following differential expression analysis (Table 5.5), a mean
of 14.4 million reads (ranging from 1.6 to 52.5 million) were discarded by this step.
On average, therefore, 15% of the raw reads were removed due to poor quality.
Figure 5.4A shows that the mean RNA-seq base quality for strain N0145 decreases
along the read, and that after 85 cycles (read position 85 bp) mean Phred scores
fall below 10, corresponding to less than 90% base call accuracy. Trimming removed the poor quality 3’
tails of these reads, as well as reads that were of low quality throughout (Figure 5.4B).
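The relationship between Phred score and error probability, and the effect of the two trimming thresholds, can be sketched in Python. This is an illustration rather than the SolexaQA code: it assumes that each read is cut back to its longest contiguous run of bases with error probability at or below the cutoff, and the function names and quality values are hypothetical.

```python
def error_probability(phred):
    """Convert a Phred score Q to a base-call error probability 10^(-Q/10)."""
    return 10 ** (-phred / 10)


def trim_read(quals, max_error=0.05, min_length=25):
    """Return (start, end) of the longest run of acceptable bases.

    The interval is half-open. None is returned when the longest run is
    shorter than min_length, meaning the read would be discarded.
    """
    best = (0, 0)
    start = None
    # A trailing None acts as a sentinel that closes the final run.
    for i, q in enumerate(list(quals) + [None]):
        ok = q is not None and error_probability(q) <= max_error
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            if i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best if best[1] - best[0] >= min_length else None


# Hypothetical 100-base read whose quality collapses towards the 3' end.
quals = [35] * 60 + [12] * 5 + [30] * 10 + [8] * 25
print(error_probability(10), trim_read(quals))
```

With a cutoff of p = 0.05 the lowest acceptable Phred score is 14, since a score of 13 corresponds to an error probability of about 0.0501.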

A. B.

Figure 5.4. Distribution of quality scores throughout RNA-seq read length for

strain N0145. The x-axis is the position within the read (bp); the y-axis is read
quality based on Phred scores. A. Raw reads pre-trimming. B. Post-trimming read
quality.

[Figure 5.4 plots. Two panels, each titled N0145_HS: read quality (Phred score, 0 to 40) against position in read (bp, 5 to 100).]


Table 5.5. Details of exponential phase transcriptomes used in differential

expression analysis. The nine transcriptomes were constructed using the same Illumina

RNA-seq method (section 2.5.1). Sample ID was used to track the sample through the

sequencing pipeline. Two Illumina machines were used to sequence the technical

replicates; HiSeq2000 (HS) and the Genome Analyser GAIIx (GA). Fold coverage was

calculated as the amount of sequence data mapped, excluding rRNAs (16S, 23S and 5S

subunits), divided by the genome size of H37Rv, which was used for the reference-

based assembly of the transcriptomes (4411532 bp). A. Six strains sequenced on the HS

platform. B. Technical replicates for strains N0145, N0031 and N0153.

A.

Sample                      1           2           3            4           5           6
Sample ID                   GR1         GR2         GR16         GR5         GR6         GR15
Machine                     HS          HS          HS           HS          HS          HS
Strain                      N0145       N0031       N0052        N0072       N0157       N0153
Lineage                     L2          L2          L2           L1          L1          L1
Growth phase                EXP         EXP         EXP          EXP         EXP         EXP
Read Length                 SE 101bp    SE 101bp    SE 60bp      SE 101bp    SE 101bp    SE 60bp
Mapped reads (Million)      47.476144   41.065818   152.323131   44.946409   39.208057   148.026386
Unmapped reads (Million)    30.552533   21.408397   20.090865    13.226052   9.695711    20.899070
rRNA reads                  44.530529   38.050982   142.691893   41.303614   36.843853   138.358711
Non-rRNA (Mb)               218.2       219.5       327.5        268.8       169.3       473.7
Fold coverage               49.5        49.8        74.2         60.9        38.4        107.4

B.

Sample                      7 (t)       8 (t)       9 (t)
Sample ID                   GR1         GR2         GR13
Machine                     GA          GA          HS
Strain                      N0145       N0031       N0153
Lineage                     L2          L2          L1
Growth phase                EXP         EXP         EXP
Read Length                 SE 75bp     SE 75bp     SE 101bp
Mapped reads (Million)      22.670235   21.866688   37.070182
Unmapped reads (Million)    8.067401    6.538578    26.419337
rRNA reads                  21.800412   20.914537   34.574916
Non-rRNA (Mb)               58.0        63.5        69.9
Fold coverage               13.2        14.4        15.8
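The fold coverage values in Table 5.5 follow from the definition in the caption: non-rRNA sequence mapped divided by the H37Rv genome size. The short Python check below reproduces two of the tabulated values, assuming that "Mb" means 10^6 bases; the function name is illustrative.

```python
H37RV_GENOME_BP = 4_411_532  # genome size stated in the Table 5.5 caption


def fold_coverage(non_rrna_mb):
    """Fold coverage from non-rRNA sequence mapped, given in megabases."""
    return non_rrna_mb * 1e6 / H37RV_GENOME_BP


# Non-rRNA (Mb) values taken from Table 5.5A.
for strain, megabases in [("N0145", 218.2), ("N0153", 473.7)]:
    print(strain, round(fold_coverage(megabases), 1))
```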


Table 5.6. Transcriptomes used in TSS mapping. Exponential and stationary phase

transcriptomes of strains N0153 and N0145 were generated; these were chosen as

representative Lineage 1 and 2 strains. cDNA libraries ready for sequencing were

constructed by Vertis Biotechnologie AG and sequenced at NIMR.

Sample                      13          14          15          16
ID                          GR1         GR2         145_s4      153_s4
Machine                     HS          HS          GA          GA
Strain                      N0145       N0153       N0145       N0153
Lineage                     L2          L1          L2          L1
Growth phase                EXP         EXP         STAT        STAT
Read Length                 SE 50bp     SE 76bp     SE 76bp     SE 76bp
Mapped reads (Million)      7.966042    24.649725   18.745600   21.574243
Unmapped reads (Million)    0.726967    1.839590    0.834070    1.863064
rRNA reads                  2.200098    3.144389    11.548229   10.708733
Non-rRNA (Mb)               288.3       1290.3      431.8       651.9
Fold coverage               65.4        292.0       97.9        147.8


5.3.4 Mapping reads to the H37Rv genome

A reference based mapping assembly was performed using the reference genome H37Rv

and the BWA aligner (Figure 5.5) (section 2.9.3) (Li & Durbin, 2009). The dominance

of the rRNA (16S, 23S and 5S) reads is seen in the RNA-seq plots at genomic position

~1.5Mb. The average mapped read number was 61.0 million (ranging 21.9-152.3) for

nine exponential phase transcriptomes and reads were visualised using Artemis software

(Figure 5.5). Importantly, no sequence data mapped to region of difference 3 (RD3) in
Lineage 1 strain N0153, as shown in Figure 5.6. RD3 contains the mobile prophage ϕRv1
(Hendrix et al., 1999) and is variably deleted in clinical strains of M. tuberculosis
(Parsons et al., 2002), including strain N0153.

Figure 5.5. Circular plot of mapped RNA-seq data. Moving from outer to inner

circles are the annotated CDS on the forward (blue) and reverse (red) strands. The inner

circles are the mapped reads for strain N0153 and N0145. Reads map to forward (blue)

and reverse strands (red). Read coverage per base position was sampled in 5000 bp windows
and is log2 scaled.


Figure 5.6. Representation of transcriptome plot based on Artemis. y-axis shows

sequence depth, and all plots scaled to a depth of 30 bases (scale bar on bottom plot).

Reads on forward strand in blue, and reverse strand in red. Part of RD3 is shown for

Lineage 1 strains N0157, N0072 and N0153, which is variably deleted in the MTBC.

5.3.5 Identifying strain specific gene deletions

It is known from previous studies that the number of genes within clinical strains is

variable due to gene deletions at the lineage and strain specific level (Tsolaki et al.,

2004). One region that includes sixteen gene deletions relative to the reference strain

H37Rv was shown previously in Figure 5.6. A microarray based study has identified

224 gene deletions (5.6% of all annotated genes) in a survey of one hundred clinical

isolates (Tsolaki et al., 2004). This has important ramifications for the following
differential expression analysis: deleted genes must be removed so that the absence of
a gene is not reported as a change in expression. A Perl script was written to
identify gene deletions based on the genome coverage depths for the respective
strains (Appendix A), allowing the removal of only those genes deleted in the six strains. This
also presented an opportunity to investigate the nature of the gene deletions within the
strains. In total, genome-wide scanning identified 223 genes that were
deleted in one or more strains. This was based on a cut-off threshold of <90% base
coverage per annotated gene (Figure 5.7).
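The coverage-based deletion rule can be sketched in Python as below. This is not the Perl script from Appendix A: the minimum depth for a base to count as covered, the coordinate convention, and the gene names and depths are assumptions introduced for the example.

```python
def covered_fraction(depths, min_depth=1):
    """Fraction of positions with at least min_depth reads."""
    return sum(1 for depth in depths if depth >= min_depth) / len(depths)


def deleted_genes(genome_depth, genes, threshold=0.90):
    """Return {gene: covered fraction} for genes below the coverage threshold.

    genome_depth: per-base read depth from genome sequencing of one strain.
    genes: {name: (start, end)} with 0-based, half-open coordinates.
    """
    calls = {}
    for name, (start, end) in genes.items():
        fraction = covered_fraction(genome_depth[start:end])
        if fraction < threshold:
            calls[name] = fraction
    return calls


# Hypothetical 200 bp genome with a 50 bp deletion at positions 100-149.
genome_depth = [30] * 100 + [0] * 50 + [25] * 50
genes = {"geneA": (0, 100), "geneB": (90, 160), "geneC": (150, 200)}
print(deleted_genes(genome_depth, genes))
```

Because the rule is applied per gene, a deletion that removes just over 10% of a long gene is called in the same way as a complete deletion, which is relevant to borderline cases such as infB discussed below.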

[Figure 5.6 plot. Coverage tracks for Lineage 1 strains N0157, N0072 and N0153 (depth scale 0 to 30) across the annotated genes Rv1571, Rv1572c, Rv1573, Rv1574, Rv1575 and Rv1576c.]


Figure 5.7. Distribution of gene deletions in the six RNA-seq study strains. In total

223 genes were classed as deletions. Strains are hierarchically clustered based on
deletions, and genes (rows) are clustered and grouped based on existing gene functional

categories. Deletions found within genes annotated as PE/PPE functional category were

excluded from the analysis.

[Figure 5.7 heatmap. Columns: strains N0145, N0052 and N0031 (Lineage 2) and N0157, N0153 and N0072 (Lineage 1). Row groups: regulatory proteins; conserved hypotheticals; intermediary metabolism and respiration; information pathways; cell wall and cell wall processes; insertion seqs and phages; lipid metabolism; unknown; virulence, detoxification, adaptation. Colour scale: percentage of gene deleted (0, 10, 100). Marked regions: ϕRv1 and ϕRv2.]


Hierarchical clustering based on the identified gene deletions clustered the six strains by

lineage and sub-lineage, following the known genome phylogeny. This is expected due

to the clonality of the MTBC. When the gene deletions were grouped first by functional category
and then by genomic position, the large blocks of deletions within the insertion
sequences and phages category were seen to consist largely of prophage deletions. There
are two prophages in the H37Rv genome, designated ϕRv1 (Rv1572c-Rv1588c) and
ϕRv2 (Rv2646-Rv2659c) (Cole et al., 1998; Hendrix et al., 1999). The first prophage,
ϕRv1, is absent from all Lineage 2 strains and from N0153 in Lineage 1. This is an
example of convergent evolution, whereby ϕRv1 is absent from strains of more than one
lineage, although it is not known if the phage was deleted in these strains or was never
inserted originally. Strains N0153 and N0072 also lack the second prophage
(ϕRv2). Whilst deletions were distributed throughout all gene categories, there was a

disproportionate representation of several categories based on the genome-wide number

of genes within the category (Figure 5.8). Using a χ² test followed by multiple testing

correction a significant overrepresentation of gene deletions within the insertion

sequence and phages category was identified (p=0.0009). Expressed as a ratio of the

genes within the category versus the genome-wide number, insertion sequences and

phages were 11-fold over-represented. Under-represented groups were intermediary

metabolism and respiration (p=0.004) and genes involved in information pathways

(p=0.04), which were 2-fold and 5-fold under-represented respectively.
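As an illustration of the category test described above, the Python sketch below compares one functional category against all other genes in a 2x2 table, using a χ² test with one degree of freedom. The exact contingency design used in the thesis is not stated, so this layout, the function name and the counts are assumptions; multiple testing correction across categories is not shown.

```python
import math


def category_enrichment(deleted_in, total_in, deleted_all, total_all):
    """Fold representation, chi-square statistic and p-value for one category.

    Rows of the 2x2 table: in the category / not in the category.
    Columns: deleted / not deleted.
    """
    a = deleted_in
    b = total_in - deleted_in
    c = deleted_all - deleted_in
    d = (total_all - total_in) - c
    n = a + b + c + d
    stat = 0.0
    for observed, row_total, col_total in [
        (a, a + b, a + c), (b, a + b, b + d),
        (c, c + d, a + c), (d, c + d, b + d),
    ]:
        expected = row_total * col_total / n
        stat += (observed - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(stat / 2))
    fold = (deleted_in / total_in) / (deleted_all / total_all)
    return fold, stat, p_value


# Hypothetical counts: 20 of 100 genes in a category deleted,
# against 50 of 1000 genes deleted genome-wide.
fold, stat, p_value = category_enrichment(20, 100, 50, 1000)
print(round(fold, 1), round(stat, 2), p_value < 0.001)
```

A fold value above 1 indicates over-representation of deletions in the category and a value below 1 under-representation; the p-values from all categories would then be adjusted together.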

As expected, significantly more genes classed as nonessential for growth based on

previous classifications were present in the deleted gene set (Sassetti et al., 2003;

Sassetti & Rubin, 2003) (χ² test, p<0.0001). However, ten deleted genes (6.4% of all
deletions) were defined as being essential. Four of the genes are annotated as conserved
hypotheticals, and the remaining six are involved in cell wall and cellular processes

(Rv0383c, Rv1974), lipid metabolism (fadD30, desA3), intermediary metabolism and

respiration (Rv1524), and information pathways (infB). Four of these deletions have

been identified in the previously described microarray based study of one hundred

clinical MTBC isolates (Tsolaki et al., 2004). The infB deletion has not been previously

identified, and was surprising as the encoded InfB protein is an essential initiation factor

of the protein synthesis machinery (Boelens & Gualerzi, 2002). However, the deletion
only just met the threshold used to define a gene deletion in this study, with 11.1%
of the gene deleted in strain N0153, and <6% in strain N0145.


Figure 5.8. Distribution of gene deletions grouped by gene function category. For

each category the number of deleted genes (black) and non-deleted genes (white) is

shown. Actual deleted gene numbers are shown on top of bars. Gene categories with a

statistically significant departure from the expected number of deletions are identified by

asterisk (*). Categories were tested using a χ² test followed by multiple testing

correction (False discovery rate method). * p <0.05, ** p<0.01, *** p<0.001. All

functional categories except the PE/PPE category were included in the analysis.

[Figure 5.8 bar chart. x-axis: number of genes (0 to 1200). Bars show deleted and non-deleted genes for each category: virulence detoxification and adaptation; lipid metabolism; information pathways; cell wall and cell processes; intermediary metabolism and respiration; unknown; regulatory proteins; conserved hypotheticals; insertion seqs and phages. Deleted-gene counts printed on the bars: 76, 2 (*), 24, 18 (**), 2, 8, 28, 62 (***).]


5.3.6 Clustering of strains at the total sample level

The transcriptomic data was first clustered at the total sample level rather than at the

level of individual genes. This provides a meaningful analysis of overall expression

patterns in the samples, allowing the stratification of strains based only on expression.

This provided the first broad analysis of how closely related strains belonging to the

same lineage were in terms of transcription, and how the genetic diversity between

Lineage 1 and 2 is reflected in functional expression. Clustering was based first on gene

expression, or the messenger RNA (mRNA), and then antisense expression, which is

transcription that is complementary to the mRNA, and so from the noncoding strand of

DNA.

5.3.6.1 Clustering of strains by gene expression

Gene expression was analysed for all annotated genes, excluding those identified as deleted in section
5.3.5. To enable the comparison of different expression levels, data were normalised as
reads per kilobase per million mapped reads (RPKM). The RPKM measure was designed to
reflect the molar concentration of a transcript in the starting sample by normalising for
RNA length and the total number of reads in the transcriptome data set (Mortazavi et al.,
2008). Pairwise Spearman correlations of RPKM-normalised gene expression were
calculated for all samples and converted into dissimilarities, and the samples were then
grouped by hierarchical clustering. The resulting dendrogram is shown in
Figure 5.9. Branches show bootstrap confidence following 1000 bootstrap replicates,

indicating high statistical support. Strains from the same lineage were more closely

related than those from a different lineage. The three transcriptomes sequenced as

technical replicates, strains N0153 (HS 2), N0145 (GA) and N0031 (GA), were highly

related to their respective replicate transcriptome. Technical replicates were from the

same source of total RNA for each respective strain, but were from separate cDNA

library construction. In the case of Lineage 2 strains N0145 and N0031, cDNA was

sequenced as part of an earlier run on an Illumina Genome Analyser (indicated by the

GA suffix), as opposed to later Illumina HiSeq runs (indicated by the HS suffix) for all

other strains.
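The RPKM normalisation referred to in this section can be written out as a short Python function following the definition of Mortazavi et al. (2008). The function name and the example numbers are illustrative only.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads on a 1,000 bp gene in a library of
# 10 million mapped reads.
print(rpkm(500, 1000, 10_000_000))

# The same read count on a gene twice as long gives half the RPKM.
print(rpkm(500, 2000, 10_000_000))
```

Dividing by gene length and library size is what allows genes of different lengths, and samples sequenced to different depths such as the GA and HS runs, to be compared on one scale.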

In addition to the sequenced clinical strains, three transcriptomes of the reference strain

H37Rv were included in clustering analysis. The transcriptomes were previously


published by Arnvig et al. (2011). The H37Rv cultures were grown in the same growth medium as in this
study, and RNA was extracted from exponential phase growth. The method used to generate
cDNA libraries was not the same as in this study, but cDNA was sequenced using the
Illumina platform. The three transcriptomes were from three biological replicates, which
have similar gene expression, as shown in Figure 5.9. Interestingly, whilst the three H37Rv
transcriptomes cluster tightly together, they are clearly distinct from all Lineage 1 and 2 strains,

suggesting the laboratory strain is an outlier relative to the clinical strains with respect to

its transcriptional profile.

Comparing back to the underlying genotype, transcriptome diversity parallels genome

diversity for the clinical strains (Figure 5.10), whilst the H37Rv transcriptomes do not fit

within the expected topology. Based on the genome phylogeny, H37Rv would be

expected to cluster alongside Lineage 2 strains and form part of the modern lineages;

instead it is clearly outside of Lineage 2 as well as Lineage 1. Whilst the clinical strains
did cluster by lineage, the parallel to genotype broke down at the sub-lineage level. For
example, Lineage 2 Beijing strain N0145 clustered with the non-Beijing strain N0031, despite being
genetically closer to N0052, the other Beijing strain in the study.


Figure 5.9. Unsupervised hierarchical clustering of total gene expression. Reads

normalised as RPKMs for all annotated genes not previously identified as gene

deletions. Technical replicates are also shown: strains N0153 (HS 2), N0145 (GA) and N0031
(GA). Node support after 1000 bootstrap replications is shown on each branch. Exponential phase
H37Rv transcriptomes are shown as reference. Top scale bar indicates Spearman correlation.
Branches are coloured using the previous lineage classification.

[Figure 5.9 dendrogram. Leaves: three H37Rv transcriptomes, N0153 (HS 2), N0153 (HS 1), N0072 (HS), N0157 (HS), N0052 (HS), N0145 (HS), N0145 (GA), N0031 (HS) and N0031 (GA). Scale bar: 0.0 to 0.4. Bootstrap support values are printed on the branches.]


Figure 5.10. Relationship of genotypic to transcriptomic diversity. Left hand side

image shows the 28-genome MTBC phylogeny constructed in Chapter 3. Right hand

side image shows the unsupervised hierarchical clustering of gene expression. The

transcriptome diversity parallels genome diversity for the clinical strains.


5.3.6.2 Relationship of SNP distance to gene expression

Gene expression was next compared to SNP distance to explore the effect of genetic

diversity at the total sample level. Normalised read counts were transformed using the variance-stabilising
transformation (VST) so that they were approximately homoscedastic and suitable as input
to the distance calculation (Anders & Huber, 2010). All genes therefore have roughly
equal influence on the distance, independent of expression strength, which
prevents a few strongly expressed genes from dominating the result. All strains used in

this transcriptome study have been previously genome sequenced, and the total number

of SNPs that separate these strains was calculated previously (Chapter 3). Figure 5.11

shows the correlation of gene expression to SNP distance. Lineage 2 strain N0145 was

used as the reference. A clear positive correlation between the gene expression and SNP

distance was observed (Spearman r=0.93, p=0.02). The correlation did not include the

reference strain N0145 in the calculation. Using each of the other strains in turn as the

reference gave the same significant correlation in the other five comparisons (Spearman

r=0.93 to 0.99, p<0.05).
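The comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the analysis code used in this study: the VST values, strain labels and SNP distances are invented toy numbers, and the helper names (`euclidean`, `ranks`, `spearman`) are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two expression profiles on the VST scale."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def ranks(values):
    """1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = average_rank
        i = j + 1
    return out

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Toy VST-transformed expression values (four genes per strain); invented data.
vst = {
    "ref": [8.0, 6.0, 7.0, 5.0],
    "s1": [8.1, 6.2, 7.0, 5.1],
    "s2": [8.5, 6.9, 6.2, 5.0],
    "s3": [9.5, 4.0, 8.8, 6.5],
}
# Toy SNP distances of each strain from the reference; invented data.
snp_distance = {"s1": 300, "s2": 1200, "s3": 2000}

# The reference strain itself is excluded from the correlation, as in the text.
strains = sorted(snp_distance)
expr_distance = [euclidean(vst["ref"], vst[s]) for s in strains]
snp = [snp_distance[s] for s in strains]
rho = spearman(snp, expr_distance)
```

Because every gene contributes on the same variance-stabilised scale, no single highly expressed gene dominates the distance in this sketch.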

Figure 5.11. Correlation of SNP distance to gene expression. The x-axis is the

number of SNPs of all strains relative to N0145. The y-axis is the Euclidean distance

between each transcriptome relative to strain N0145; distances were calculated from the

variance-stabilising transformation of the count data. The line shows the linear

regression fit (slope = 0.024 ± 0.005). Spearman r = 0.93. Strains coloured by

lineage.

[Figure 5.11 plot. x-axis: SNP distance; y-axis: genome-wide expression distance (VST). Labelled points: N0052, N0031, N0157, N0072, N0153.]


5.3.7 Clustering of strains by antisense expression

A similar sample level analysis was also performed for the antisense transcriptomes. The

hierarchical clustering of total antisense per transcriptome sample is shown in Figure

5.12. As seen previously for gene expression, strains from the same lineage clustered

closer together based on transcriptional expression than those from the other lineage.

However, within-lineage comparisons again did not follow the finer sub-lineage

structure of the genome phylogeny. Lineage 1 strains N0153 and N0072 did not cluster

together, despite being genetically closer than the third Lineage 1 strain N0157 based on

SNP distance.

Interestingly, the two technical replicates for Lineage 2, N0145 (GA) and N0031 (GA),

clustered based on gene expression (Figure 5.9), but not by antisense expression (Figure

5.12). The two replicates share the property of having been sequenced on the Illumina GA

machine (GA), which might affect the level of antisense detected in these

transcriptomes. The two replicates had the lowest number of mapped reads (21.9 to

22.6 million reads) compared to the mean of 78.8 million reads for the six non-technical

replicates. This may suggest that less abundant and rare antisense transcripts

were not detected in these technical replicates because of their lower sequencing depth,

and that this shared property was picked up by the clustering.


Figure 5.12. Unsupervised hierarchical clustering of total antisense expression.

Reads normalised as RPKMs for all annotated genes not previously identified as gene

deletions. Strain replicates also shown, strain N0153 (HS 2), N0145 (GA) and N0031

(GA). Node support after 1000 bootstrap replications on branch. Top scale bar indicates

Spearman correlation. Branches coloured using previous classification.


5.3.8 Testing for differential expression in RNA-seq data

Measurement of transcription by sequencing is a recent development and currently there

is no clear consensus on a standard method to test for differential expression from

generated RNA-seq data (Dillies et al., 2012). Normalisation is necessary to ensure that

expression levels are comparable across samples (different cDNA libraries) and also

across annotated features to enable valid inferences about the differential expression of

features within or across samples (Robinson & Oshlack, 2010). Importantly,

normalisation must account for the fact that read counts arising from a transcript are proportional to

the length of the transcript and the total sequencing depth of the sample. In section 5.3.6 the RPKM

method was used to normalise the data at the total sample level, but Robinson and

Oshlack argue that RPKM may not be appropriate for normalisation between libraries of

different biological conditions (Robinson & Oshlack, 2010). Central to their argument is

that the total number of reads in a sequencing experiment is limited, and this sequencing

real estate is competed for by highly expressed genes, leaving less available for the

remaining genes. Thus if one sample contains highly expressed genes that are not

expressed in other samples, this sampling artifact can skew the differential analysis,

giving rise to higher false positive rates and less power to detect true differences.

Scaling the libraries with RPKM will not solve the problem due to the assumption that

the unknown total RNA is the same for all libraries.
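The composition effect described above can be made concrete with a small worked example. This is a sketch with invented counts and gene lengths, not data from this study; the function name `rpkm` is illustrative. Genes A and B are assumed to have the same true abundance in both samples, while gene C is expressed only in the second sample and takes up half of its reads.

```python
def rpkm(count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    return count * 1e9 / (gene_length_bp * total_mapped_reads)

# Invented gene lengths (bp) and raw counts; both libraries contain 1000 reads.
lengths = {"geneA": 1000, "geneB": 2000, "geneC": 1500}
sample1 = {"geneA": 400, "geneB": 600, "geneC": 0}
sample2 = {"geneA": 200, "geneB": 300, "geneC": 500}

total1 = sum(sample1.values())
total2 = sum(sample2.values())

rpkm_a_1 = rpkm(sample1["geneA"], lengths["geneA"], total1)
rpkm_a_2 = rpkm(sample2["geneA"], lengths["geneA"], total2)

# geneA appears 2-fold lower in sample 2 purely because geneC consumes
# half of the available reads, not because geneA itself has changed.
apparent_fold = rpkm_a_1 / rpkm_a_2
```

Scaling by total library size cannot remove this apparent difference, which is the argument made by Robinson and Oshlack.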

There are now a number of methods which make a better assumption that the RNA

output of a core set of genes is similar between samples, and this is used to create a

scaling factor for the samples; several R Bioconductor packages implement this, such as

DESeq (Anders & Huber, 2010), baySeq (Hardcastle & Kelly, 2010) and edgeR

(Robinson et al., 2010b). All three methods were tested on the transcriptome set. Raw

count data for each annotated feature in the technical transcriptome replicates were

combined, generating six RNA-seq samples to use in the analysis. To test for differential

expression at the lineage level, strains from the same lineage were treated as biological

replicates and tested with the three methods above. A statistical cut-off of p<0.05 (false

discovery rate corrected) was used to identify statistically significant differential expression. Figure

5.13 shows the number of genes identified as differentially expressed. There was

considerable overlap in the genes identified as displaying lineage-specific gene

expression, although the number of statistically significant genes ranged from 76 genes

using baySeq, to 336 genes identified with edgeR; edgeR identified all of the same

genes as DESeq (112 genes). For this study all differential expression analysis was

performed using DESeq, which was chosen as the best compromise between

sensitivity and the number of differentially expressed genes identified.
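The scaling-factor idea can be illustrated with a median-of-ratios calculation in the spirit of the approach described by Anders & Huber (2010). The sketch below is a simplified pure-Python illustration, not the DESeq implementation itself; the counts are invented and the function name `size_factors` is hypothetical.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors.

    counts maps sample name -> list of raw counts (same gene order in every
    sample). Genes with a zero count in any sample are skipped, because their
    geometric mean across samples is zero.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    log_geo_means = []
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_geo_means.append(None)
    factors = {}
    for s in samples:
        log_ratios = [
            math.log(counts[s][g]) - log_geo_means[g]
            for g in range(n_genes)
            if log_geo_means[g] is not None
        ]
        factors[s] = math.exp(median(log_ratios))
    return factors

# Invented counts for genes A, B and C; gene C is expressed only in sample2.
counts = {
    "sample1": [400, 600, 0],
    "sample2": [200, 300, 500],
}
factors = size_factors(counts)

# After scaling, gene A is equal in both samples, as expected for a gene whose
# true abundance is assumed to be unchanged.
norm_a_1 = counts["sample1"][0] / factors["sample1"]
norm_a_2 = counts["sample2"][0] / factors["sample2"]
```

Because the factor is taken from the median gene rather than from the library total, a single strongly expressed gene such as gene C does not distort the scaling of the remaining genes.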

Figure 5.13. Venn diagram comparing edgeR, DESeq and baySeq differential

expression methods. The number of genes identified as differentially expressed using

three methods for identifying significantly different gene expression between Lineage 1

and 2 strains. Significance defined as p <0.05 following multiple testing correction.

5.3.9 Lineage-specific gene expression

A total of 112 genes were identified as having a lineage-specific pattern of differential

gene expression (based on a statistical cut-off of p<0.05); 88 (78.6%) were higher in

Lineage 1, and 24 (21.4%) more highly expressed in Lineage 2 strains (Figure 5.14).

The complete list is shown in Appendix E. Differentially expressed genes were present

in all Tuberculist functional categories. Twenty-six of the genes were identified as

differentially expressed in previous microarray comparisons of ancient versus modern

lineages or M. tuberculosis H37Rv versus M. bovis (a subgroup within Lineage 6)

(Golby et al., 2007; Homolka et al., 2010). The greatest significant fold change in gene

expression was galK (Rv0620), which is involved in galactose metabolism, and was 39-

fold higher in Lineage 1 strains. Antisense expression of Rv0842, a conserved integral

membrane protein of unknown function, was 197-fold higher in Lineage 2 strains.

Differential expression of galK was not detected in the previous microarray analysis,

whilst antisense expression could not be measured with that experimental design

(Homolka et al., 2010).


Figure 5.14. Heatmap of 112 differentially expressed genes. Expression is based on

normalised reads using the DESeq scaling factor method. Colouring is relative, based on

the minimum and maximum expression for each gene (row), moving from lower

expression (blue) to higher expression (red); scale at bottom of heat map. Strains are

hierarchically clustered using Spearman’s rank correlation. Genes (rows) are grouped by

Tuberculist functional categories.

[Figure 5.14 heat map. Row groups (Tuberculist functional categories): conserved hypotheticals; intermediary metabolism and respiration; regulatory proteins; lipid metabolism; virulence, detoxification and adaptation; information pathways; PE/PPE; cell wall and cell processes; unknown. Colour scale: row min to row max.]


5.3.9.1 Transcriptional regulators

Eight transcriptional regulators were identified in Chapter 4 to harbour lineage-specific

SNPs or indel mutations predicted to impair the function of the encoded regulatory

protein. Three of the regulatory proteins with predicted functional SNPs, Rv0275c, virS

(Rv3082c) and Rv3167c were identified in the set of lineage-specific differentially

expressed genes. These are shown in Table 5.7.

VirS has previously been shown to act as an inhibitor of its own transcription and as a

positive regulator of the adjacent divergently-expressed MymA locus (Rv3083-3085,

Rv3086-3089) (Singh et al., 2003). Consistent with its predicted functional impairment

by substitution of arginine for leucine within the helix-turn-helix (HTH) DNA-binding

domain, virS expression was 17-fold higher in Lineage 1 than Lineage 2, but with no

effect on expression of MymA. Targets of transcriptional regulators Rv0275c and

Rv3167c are unknown, but the proximity of transcriptional start sites (TSS), identified

by additional 5’ enriched RNA-seq transcriptomes (section 2.5.2), suggested that

binding of the regulators to upstream sequences would repress transcription of the

adjacent divergent genes Rv0276 and Rv3168 (Figure 5.15). Expression of Rv0276

followed Rv0275c in being 10-fold higher in Lineage 2, although it fell outside of the

statistical cut-off (p=0.08), whilst Rv3168 expression was 5-fold higher in Lineage 1

(p=0.12).

Table 5.7. Differential expression associated with lineage-specific amino acid

mutations. SNPs: Three out of the eight transcriptional regulators with predicted

functional lineage-specific SNPs were differentially expressed at the lineage level. Fold

change relative to Lineage 1. Modern lineage represents SNPs in Lineage 2, 3, and 4.

Frameshift indels: a single base insertion in the regulator Rv3830c was predicted to

impair function. Expression of the adjacent genes, Rv3829c and Rv3831, was higher in

the predicted intact regulator.

Gene           Function                    Fold change   Mutation           SNP lineage
Predicted functional SNPs
Rv0275c        transcriptional regulator   0.4           L24S               Modern
Rv3082c virS   transcriptional regulator   12.0          L316R              1
Rv3167c        transcriptional regulator   3.7           P17Q               1
Frameshift indels
Rv3829c        phytoene dehydrogenase      0.05          inactive Rv3830c   2
Rv3831         hypothetical                0.1           inactive Rv3830c   2


Figure 5.15. TSS mapping for differential expression of divergently regulated

genes. Reads mapping to forward strand in blue, and reads corresponding to reverse

strand in red. Scale bar indicating maximum read depth at right of trace. For two of the

strains (Lineage 1 N0153 and Lineage 2 N0145), additional RNA-seq was performed

after a 5’ phosphate-dependent exonuclease digestion step to facilitate mapping of

transcriptional start sites (TSS). A. Differential expression of Rv0276 due to predicted

impaired Rv0275c regulator. Overlapping TSS suggest that Rv0275c acts as a repressor

of Rv0276. B. Differential expression of Rv3168 due to the predicted impaired

Rv3167c regulator. Again TSS show some overlap.

In addition to mutations introduced by SNPs, a frameshift insertion mutation in Lineage

2 was predicted to inactivate Rv3830c due to the resulting fusion with Rv3829c (Table

5.7). Although no significant change was observed in expression of Rv3830c itself, a 14-

fold and 21-fold increase in expression of Rv3829c and Rv3831 in Lineage 2 suggested

that the functional protein may act as a repressor of the two flanking genes, and that this

regulation is lost in the case of the mutant allele.

It was not possible to identify a lineage-specific transcriptional signature for the

remaining four regulators, NarL, BlaI, SirR and KdpD, which were also predicted to be

functionally impaired (Chapter 4). This may be due to incorrect predictions, or

alternatively culture conditions other than routine exponential growth may be required to

uncover defects in associated regulatory responses.


5.3.9.2 Expression of the DosR regulon in strains from the Beijing family

It has previously been reported that genes belonging to the DosR regulon are expressed

during exponential growth in strains belonging to the Beijing family (Homolka et al.,

2010; Reed et al., 2007), but this elevation was not found at the lineage level in this

analysis. Only Rv1733c met statistical criteria for up-regulation in Lineage 2 (p=0.023),

but an enhanced DosR response was clearly seen in strains N0145 and N0052 (Figure

5.16). The outlying strain, N0031, belongs to a basal branch of Lineage 2 that diversified

prior to expansion of the major Beijing branches represented by N0145 and N0052.

A 350 kb genomic duplication that includes the DosR operon has been identified in the

Beijing strains and has been suggested to contribute to constitutive expression of the

DosR regulon (Domenech et al., 2010; Weiner et al., 2012). This duplication is present

in N0145 and N0031, but absent from N0052, and therefore cannot account for the

observed differential pattern of DosR expression in our study (Figure 5.17).


Figure 5.16. Heat map of dosR regulon. Normalised expression represented as fold

change relative to the mean. Note: * Rv0571c and Rv0572c are deleted in strain N0157. Hierarchical clustering

of strains based on the dosR regulon separates Beijing strains N0145 and N0052. N=48 plus

small RNA MTS1338. Black indicates no expression. Scale at bottom.


Figure 5.17. Duplication of dosR region. Read depth across the genome for

the six strains used for RNA-seq. Genome depth maximum height cut-off at 255 bases. Arrows

show the position of the dosR operon within the duplicated region in Lineage 2 strains

N0145 (a Beijing strain) and N0031 (non-Beijing strain). The previously published

genomic duplication that includes the dosR operon extends from 3.5 to 3.8 Mb

(Domenech et al., 2010), although in strain N0031 the duplication is shorter in length.

Amino acid changing mutations that might alter the function of DosR or related

regulatory components were not found. However, a Beijing-specific synonymous SNP

(C 3500149 T) was identified within Rv3134c, which encodes a Universal Stress protein

(USP) domain, and is itself a member of the DosR regulon (Gerasimova et al., 2011).

Rv3134c is immediately upstream of dosR and the SNP generates a TAnnnT -10

consensus motif that is characteristic of actinomycete promoters (where n represents

any base) (Figure 5.18A). The classical prokaryotic promoter structure that is recognised

by σ70 sigma factors has been defined based on studies of numerous E. coli promoters

(Hawley & McClure, 1983), and similar sequences have been identified in other bacteria

(Newton-Foot & Gey van Pittius, 2012). The promoter sequence determines the level of

expression of a gene and is recognised as the DNA sequence between 10 and 35 bases

upstream of the TSS. The TSS is usually a purine base (A or G base), and the -10

sequence, also known as the Pribnow box, is a highly conserved hexamer centered about


-10 bp upstream of the TSS. As described above, in actinomycetes this sequence

consists of highly conserved T and A residues at the first and second positions

respectively, and in all cases a T residue in the final position; some variability is found

in the three central positions of the motif (Newton-Foot & Gey van Pittius, 2012). The

-10 motif has been found to be associated with ~73% of all TSS mapped in M.

tuberculosis H37Rv (T. Cortes, unpublished; Newton-Foot & Gey van Pittius, 2012;

Zheng et al., 2011).
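The TAnnnT rule lends itself to a simple pattern check. The following is a minimal sketch, assuming a hexamer is supplied in the 5’ to 3’ direction together with the 0-based offset of the SNP within it; the function names and the example hexamers are illustrative and are not part of the analysis pipeline.

```python
import re

# -10 consensus: T, A, any three bases, T.
MINUS10 = re.compile(r"^TA[ACGT]{3}T$")

def is_minus10(hexamer):
    """True if the six-base string matches the TAnnnT consensus."""
    return bool(MINUS10.match(hexamer.upper()))

def snp_creates_minus10(ref_hexamer, offset, alt_base):
    """True if substituting alt_base at the 0-based offset turns a hexamer
    that does not match TAnnnT into one that does."""
    alt_hexamer = ref_hexamer[:offset] + alt_base + ref_hexamer[offset + 1:]
    return (not is_minus10(ref_hexamer)) and is_minus10(alt_hexamer)

# Illustrative hexamer: a C-to-T change at the final position creates TAnnnT.
created = snp_creates_minus10("tacaac", 5, "t")

# A change in one of the three variable central positions of an existing
# motif does not create a new motif.
central_change = snp_creates_minus10("tataat", 2, "c")
```

A check of this kind only tests the sequence consensus; whether a motif is actually used as a promoter still has to be confirmed against mapped TSS data.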

In Figure 5.18A it can be seen that the Beijing-specific SNP is located seven nucleotides

upstream of a novel TSS, and the TSS is expressed in both exponential and stationary

phase samples of Beijing strain N0145 (Figure 5.18B). The new TSS is distinct from the

standard Rv3134c intergenic TSS associated with growth-phase induction of the DosR

regulon and from secondary promoters identified within the Rv3134c gene of M.

tuberculosis H37Rv (Bagchi et al., 2005). The resulting transcript is clearly seen in the

total RNA profiles and runs through dosR in the two Beijing strains (Figure 5.18A).

A second Beijing-specific SNP (C 3509626 A) similarly generates a TAnnnT consensus

motif and associated TSS for the two-component sensor protein encoded by Rv3143.

Increased expression was evident in total transcriptome profiles from the two Beijing

strains, but in this case downstream targets of the regulator are unknown.


Figure 5.18. DosR regulon and SNP-associated TSS. A. Mapped RNA-seq reads in

Lineage 2 strains over the DosR region. Reads mapping to forward strand in blue, and

reads corresponding to reverse strand in red; plots are shown at an identical scale with scale

bar indicating maximum read depth included in the bottom panel. The C/T SNP in Beijing

strains is indicated with an asterisk (*), and the new TSS, 7 nucleotides from the newly created -10

box, is highlighted. Numbering is based on the M. tuberculosis H37Rv genome. B. RNA-seq

TSS mapping for Beijing strain N0145 (Lineage 2) and strain N0153 (Lineage 1) grown in

exponential and stationary phase conditions; TSS shown with arrows. The Beijing-

specific TSS within Rv3134c is expressed in exponential and stationary phase.

[Figure 5.18 panels. A: read-depth tracks for Lineage 2 strains N0052, N0145 and N0031 over Rv3134c, dosR and dosS; annotated sequence 3'-GccgagtTcacatgtacgcggttggccgacgggacaagctgc-5' with the -10 box, position 3500142 and the SNP C 3500149 T marked. B: TSS mapping tracks over Rv3134c for N0153 and N0145 in exponential and stationary phase.]


5.3.9.3 SNP-associated TSS

The influence of SNP-associated TSS in generating transcriptional diversity was next

explored in the rest of the gene set with lineage-specific transcriptional profiles.

Alignment of lineage-specific SNPs with a total transcriptome map of M. tuberculosis

(strain H37Rv, T. Cortes unpublished) identified ninety-four instances (1.2% of 7601

TSS) in which a SNP fell within the 30-nucleotide region upstream of a TSS. The

frequency was markedly higher amongst the 168 differentially expressed genes and

antisense transcripts identified in this study, with 23 of the respective TSS harbouring one or more

SNPs in this upstream region (χ2, p<0.0001). In ten cases, lineage-specific SNPs

generated a new TAnnnT consensus motif linked to a new TSS (Table 5.8).
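The enrichment reported above can be approximated with a 2x2 chi-square test. The sketch below is an illustration under stated assumptions rather than the exact test performed in this study: it treats the 168 differentially expressed features and the 7601 genome-wide TSS as two independent groups and applies no continuity correction. The function name `chi_square_2x2` is hypothetical.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) and p-value
    (1 degree of freedom) for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Counts taken from the text: 23 of 168 differentially expressed features and
# 94 of 7601 mapped TSS carry a SNP in the 30-nucleotide upstream region.
de_with_snp, de_total = 23, 168
all_with_snp, all_total = 94, 7601

stat, p_value = chi_square_2x2(
    de_with_snp, de_total - de_with_snp,
    all_with_snp, all_total - all_with_snp,
)
```

Under these assumptions the p-value falls well below the 0.0001 threshold quoted in the text, although the exact statistic depends on how the comparison groups are defined.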

Table 5.8. Ten differentially expressed genes associated with change in promoter

sequence. Fold change relative to Lineage 1, with >1 higher Lineage 1 expression, <1

higher Lineage 2 expression. Modern lineage includes Lineages 2, 3 and 4. The final

mutation column shows the nucleotide change and genomic position as based on H37Rv

coordinates, and in brackets the -10 motif created (SNP allele in uppercase). Sequences read

in the 5’ to 3’ direction.

Differentially expressed gene   Function                    Fold change   Lineage with SNP   Mutation
Rv0469 umaA                     mycolic acid modification   2.2           1                  C 560664 T (tacaaT)
Rv0557 mgtA                     mannosyltransferase         3.2           1                  C 649345 T (tatgcT)
Rv0724A -                       methyltransferase           3.6           1                  C 817696 T (tattcT)
Rv1781c malQ                    glucanotransferase          2.2           1                  T 2017560 A (tAcggt)
Rv2051c ppm1                    ppm synthase                2.2           1                  C 2309356 T (taccaT) & T467I
Rv2765 -                        dienelactone hydrolase      7.4           1                  C 3074830 T (tactaT)
Rv3366 spoU                     tRNA methyltransferase      0.2           2                  G 3778011 A (taccAg)
                                                                          2                  G 3778012 T (taccaT)
Rv3679 & Rv3680 -               anion transport ATPase      0.1/0.2       Modern             C 4119246 T (tatgaT)
Rv3812 PE_PGRS62                hypothetical                2.5           1                  C 4276306 T (Taatgt)


For three of the differentially-expressed genes (MalQ, Rv3680, PE_PGRS62) the new

TSS was located within 542 nucleotides of the predicted translational start, either within

an intergenic region or the upstream gene. Figure 5.19A shows a SNP in Lineage 1

strains that creates a -10 sequence, with a resulting novel Lineage 1 TSS. Expression of

malQ is 2-fold higher in Lineage 1. The remaining six new TSS (umaA, mgtA, Rv0724A,

ppm1, Rv2765, spoU) were located within the differentially-expressed gene itself and,

if translated, would give rise to truncated protein products. Two SNPs in spoU replace

guanine nucleotides to generate the TSS motif. Rv2051c encodes a bifunctional protein

Ppm1. Shown in Figure 5.19B, the novel internal transcript is initiated in the middle of

the gene and includes the C-terminal polyprenyl phosphomannose synthase domain. The

SNP also introduces a T467I mutation at the amino acid level, which was predicted by

previous SIFT analysis in Chapter 4 to impair the function of the N-terminal

apolipoprotein N-acyltransferase domain. A second internal TSS was present in all

strains (Lineage 1 and 2) at position 2309159, suggesting that the option of dissociating

the two activities is not unique to Lineage 1.

SNPs that alter residues outside of the -10 motif may also influence promoter activity. For example, a

G 4092921 T SNP was associated with a 100-fold increase in reads mapping to a TSS

upstream of PE_PGRS60. This mutation changes an existing -10 TAnnnT

motif to an “extended -10” TGnTAnnnT consensus (Newton-Foot & Gey van Pittius,

2012). Interestingly, this change is similar to that generated by a SNP that drives

increased promoter activity and inhA expression in isoniazid-resistant strains of M.

tuberculosis (Ramaswamy & Musser, 1998; Ramaswamy et al., 2003).


Figure 5.19. SNP-associated TSS leading to differential gene expression. A. Lineage

1 SNP (T 2017560 A) is associated with a new TSS and 2.2-fold increased expression of

malQ in the respective strains. B. Internal coding TSS: A Lineage 1 SNP (C 2309356 T)

within ppm1 is associated with a new TSS and 2.2-fold up-regulation of ppm1

transcription. The nonsynonymous SNP is predicted to impair lipoprotein N-

acyltransferase activity. A second internal TSS present in all strains is also indicated in

the TSS mapping inset.


5.3.9.4 Differential antisense expression

A parallel analysis of antisense transcription identified similar conservation by lineage,

with a differential expression pattern for 56 genes; 23 were higher in Lineage 1, and 33

in Lineage 2 (Appendix E). Antisense RNAs are transcripts encoded on the strand that is

complementary to protein-coding genes. The transcripts are generated either from

internal TSS, or from overlapping 3’ untranslated regions (UTRs) in convergent gene

pairs, as identified previously in the transcriptome of H37Rv (Arnvig et al.,

2011). Three of the differentially expressed 3’ UTR antisense transcripts (pcaA, Rv1898

and ribD) were associated with SNPs that create a new TAnnnT-linked forward TSS in

the adjacent divergent gene (Table 5.9). In the case of pcaA, shown in Figure 5.20A, a 2-

fold increase in umaA gene expression and 4-fold increase in pcaA antisense expression

was detected. For a further six antisense transcripts (Rv0552, Rv0842, Rv0874c, deaD,

Rv2672 and FadE20), introduction of a TAnnnT motif on the reverse strand was

associated with new TSS arising within the gene itself. In the case of deaD, a Lineage 2

C to T SNP (a modern branch SNP) creates a new motif and TSS on both the forward

and reverse strands of DNA, and is associated with a 41-fold increase in antisense expression in Lineage 2 (Figure 5.20B).
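A single substitution can create a TAnnnT hexamer on both strands, as described for deaD. The sketch below shows how such a case could be detected by scanning the reference and variant sequence on each strand. The ten-base sequence is synthetic and is not the deaD locus; the function names are illustrative.

```python
import re

# Lookahead so that overlapping TAnnnT hexamers are all reported.
MINUS10 = re.compile(r"(?=(TA[ACGT]{3}T))")
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA sequence, returned in uppercase."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def motif_starts(seq):
    """0-based start positions of every TAnnnT hexamer, read 5' to 3'."""
    return [m.start() for m in MINUS10.finditer(seq.upper())]

def gained_motifs(ref, alt):
    """Motif start positions present in alt but absent from ref, per strand.
    Reverse-strand positions are given in reverse-complement coordinates."""
    fwd = set(motif_starts(alt)) - set(motif_starts(ref))
    rev = set(motif_starts(reverse_complement(alt))) - set(
        motif_starts(reverse_complement(ref))
    )
    return {"forward": sorted(fwd), "reverse": sorted(rev)}

# Synthetic example: a C-to-T substitution at 0-based position 4.
ref = "AGGCCACGGT"
alt = "AGGCTACGGT"
gained = gained_motifs(ref, alt)
```

In this synthetic case the new T serves as the first base of the forward-strand motif, while its complementary A serves as the second base of the reverse-strand motif.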

Table 5.9. Nine differentially expressed antisense transcripts associated with introduction of

SNP-associated TSS. Mutation column shows SNP lineage (1, 2 or Modern). Where

appropriate, predicted functional amino acid changes are shown and the sequence of

new -10 and extended motifs with the SNP allele indicated in uppercase. Nucleotide

positions are based on H37Rv genome. Sequences read in the 5’ to 3’ direction.

Gene             Function                    Fold change   Lineage with SNP   Mutation
Rv0470c pcaA     mycolic acid modification   4.1           1                  C 560664 T (tacaaT)
Rv0552           hydrolase                   8.6           1                  C 643483 T (tacacT)
Rv0842           membrane protein            0.01          Modern             C 938246 T (taggcT)
Rv0874c          hypothetical                0.2           2                  C 972980 T (taggcT)
Rv1253 deaD      RNA helicase                0.02          Modern             C 1400396 T (tatcaT)
Rv1898           hypothetical                0.1           Modern             C 2145878 T (tacccT)
Rv2671/72 ribD   riboflavin biosynthesis     82.2/8.2      1                  C 2987918 T (tacacT)
Rv2724c fadE20   acyl-CoA dehydrogenase      0.1           2                  C 3036826 T (tagcaT)


Figure 5.20. SNP-associated TSS leading to differential antisense expression. A. A

SNP-associated TSS in the 3’ region of umaA in Lineage 1 strains is associated with higher

umaA gene expression (2.2-fold) and pcaA antisense expression (4.1-fold). B. A SNP

within deaD in all Lineage 2 strains is associated with a new TSS and 41.2-fold increase

in antisense transcription. The SNP also creates a -10 consensus on the forward strand;

this is associated with a new TSS but has no significant impact on the level of sense

transcription.


Although not a lineage-specific example, a highly expressed antisense transcript within ino1,

an essential gene involved in virulence (Movahedzadeh et al., 2004), in Lineage 1 strain

N0157 is also associated with a new TAnnnT motif created by a C 50557 T SNP. Interestingly, this is a

homoplasic SNP; such SNPs are rare in M. tuberculosis (Comas et al., 2009; Schürch et al.,

2011). The SNP is present in a sub-branch of Lineage 4, including strain H37Rv, which

also expresses the antisense transcript (Arnvig & Young, 2012).

5.3.10 Enrichment of toxin-antitoxins

It was not possible to identify direct SNP-associations for the remainder of the genes

showing lineage-specific patterns of differential expression. It is likely that their

differential expression reflects downstream consequences of primary mutations.

Analysis of the panel of differentially-expressed genes according to functional category

identified a 2-fold over-representation of proteins involved in virulence, detoxification

and adaptation. This was found to be driven by ten toxin-antitoxin (TA) genes, and a

separate classification of all TA genes as an independent category revealed a 2.9-fold

over-representation in the differentially expressed set compared to the genome representation

(χ2, p=0.03) (Figure 5.21). The full table is shown in Appendix F. Six of the TA were

chosen and the pattern of differential gene expression seen by RNA-seq was confirmed

by quantitative RT-PCR (Figure 5.22). Additional strains were included in this analysis

to widen the lineage set (section 5.2.1.3). vapB10 fell outside of the RNA-seq

statistical significance cut-off (p=0.06), but qRT-PCR showed that it was also

differentially expressed.
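The over-representation ratio used in this section is the share of a functional category within the differentially expressed set divided by its share of the genome. The sketch below shows the arithmetic only. The counts are placeholders chosen for readability and are not the values underlying the 2.9-fold figure, and the function name `fold_enrichment` is hypothetical.

```python
def fold_enrichment(hits_in_set, set_size, hits_in_genome, genome_size):
    """Share of a category in the gene set divided by its share of the genome.
    Values above 1 indicate over-representation, values below 1 indicate
    under-representation."""
    return (hits_in_set / set_size) / (hits_in_genome / genome_size)

# Placeholder counts, NOT taken from this study: 5 of 50 genes in the set
# belong to the category, compared with 100 of 4000 genes genome-wide.
enrichment = fold_enrichment(
    hits_in_set=5, set_size=50, hits_in_genome=100, genome_size=4000
)
```

A ratio of this kind says nothing about significance on its own, which is why it is reported alongside a χ2 test in the text.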


Figure 5.21. Over-representation of differentially expressed toxin-antitoxins. Ratio

of significant differential gene expression grouped by functional category, compared to

the genome-wide representation of the category. Values on the x-axis represent the

difference as fold-change: positive fold-change indicates over-representation of a

particular functional category and negative values under-representation, whereas a

fold-change of one indicates no difference. As a separate toxin-antitoxin category, there were

2.9-fold more toxin-antitoxins than expected (χ2, p=0.03).

Figure 5.22. Validation of selected RNA-seq differentially expressed toxin-antitoxins

(solid bars) by qRT-PCR (striped bars). Fold change relative to Lineage 1 expression

on y-axis (log10 scale), and bars coloured by lineage with higher expression. Error bars

for qRT-PCR indicate the standard deviation of three biological replicates.


Table 5.10. Ten differentially expressed toxin-antitoxins (TA). Mutation column

shows SNP lineage (1, 2 or Modern). Where appropriate, predicted functional amino

acid changes are shown and the sequence of new -10 and extended motifs with the SNP

allele indicated in uppercase. Nucleotide positions are based on H37Rv genome.

Gene              Function    Fold change   Lineage with SNP   Mutation
Rv1103c mazE3     antitoxin   2.2
Rv1397c vapC10    toxin       0.1           Modern             G103D
Rv2063 mazE7      antitoxin   15.7
Rv2063A mazF7     toxin       4.9           1                  R101P
Rv2274A mazE8     antitoxin   3.2
Rv2526 vapB17     antitoxin   0.2
Rv2527 vapC17     toxin       0.1
Rv2596 vapC40     toxin       2.2
Rv2758c vapB21    antitoxin   2.5
Rv2830c vapB22    antitoxin   2.5           Modern             G 3137237 A

Transcription of TA modules is generally repressed by binding of the cognate toxin-

antitoxin complex to the promoter region, and activated when the antitoxin is degraded

in response to signals associated with environmental stress (Buts et al., 2005).

Differential expression could result from mutations that affect stability or repressor

activity of the toxin-antitoxin complex, mutations that alter promoter sequences, or

mutations that alter proteolytic activity in the cell. Two differentially-expressed toxins

have nonsynonymous lineage-specific SNPs: VapC10 (Lineage 2, G103D) and MazF7

(Lineage 1, R101P) (Table 5.10), but the SIFT algorithm was unable to predict

functional consequences for these mutations. All TA pairs with detectable transcripts

were expressed from a single major TSS. In two cases the TSS was located within the

annotated coding sequence, suggesting that the translational start sites are annotated

incorrectly. In the majority of cases (31 out of 51 expressed TA pairs; 60.8%), the TA

pairs were encoded by leaderless mRNAs. A single TSS-associated SNP was identified:
position -1 of the VapB22 (Rv2830c) TSS is switched from G to A in Lineage 2
strains, with an associated decrease in expression.


Due to the lack of direct SNP associations it could be concluded that differential

expression of TA genes reflects general differences in regulatory networks between the

two lineages. A series of genes that are preferentially expressed in Lineage 1 strains

have previously been implicated in the H37Rv response to acid stress and cell wall

damage, including ahpC and ahpD, fabD and lpqS (Fisher et al., 2002). Up-regulation of

these genes may be associated with the stress-related sigma factor sigB (Rv2710), which

has 2-fold higher expression in Lineage 1, but falls outside the statistical cut-off (p=0.06).


5.4 Discussion

5.4.1 Strengths and limitations of the study

The aim of this study was to identify the lineage-specific expression profiles of Lineage

1 and 2 and to relate this back to the underlying genotype of the respective lineages. For

the first time the total RNA expression of clinical MTBC strains was uncovered using a

sequence-based approach. The RNA-seq data generated has intrinsic advantages over
previous transcriptional analysis methods that rely on hybridisation of targeted
oligonucleotides to specific loci (qRT-PCR), hybridisation of cDNA to multiple probes
(microarray) or binding of labelled probes to RNA (Northern blotting) (Croucher &
Thomson, 2010). Firstly, RNA-seq is not biased by probe design, as it does not rely on
prior knowledge of the sequence; all transcripts are therefore studied, including
mRNA, antisense and non-coding transcripts. Secondly, as the method is

sequence based, the resolution is more precise than hybridisation, effectively sampling

all positions within the transcripts, and non-specific hybridisation is not an issue (Kane

et al., 2000). Finally, the dynamic range of RNA-seq is effectively unlimited, and

defined by the amount of sequence coverage that can be generated in the experiment,

whereas the detection of fluorescence or radioactivity can become saturated using

microarrays. Ultimately, the transcriptome data generated in this study is more

discriminatory at high and low expression levels, and provides an unbiased view of

transcription in the MTBC strains.

Whilst one of the advantages of the RNA-seq method is the sampling of all RNA

species, this can also become a drawback through dominance of the transcriptome data

by highly expressed transcripts, such as ribosomal RNA. In this study, about 90% of the

total sequence data was attributed to rRNA, effectively saturating the dataset by
out-competing mRNA and other transcripts for sequence data. Exclusion of such transcripts is

more difficult than with microarray experiments, where rRNA probes can simply be


omitted from the chip design. Several methods exist to remove abundant transcripts,

including the use of terminator exonucleases that specifically degrade transcripts with a

5’-monophosphate group (Sharma et al., 2010a), or hybridization of magnetic beads

linked to oligonucleotides complementary to rRNAs (Camarena et al., 2010; Yoder-Himes
et al., 2009). Although such methods are attractive, their significantly increased cost,
the potential for sample degradation and introduced bias (Croucher et al.,
2009; Yi et al., 2011), and the availability of the high sequence output from the Illumina
HiSeq2000 sequencer at NIMR rendered these options unnecessary for the differential

expression analysis performed in this study. However, the former terminator

exonuclease method was used in transcriptional start site (TSS) mapping analysis, which

effectively biased the sequence coverage to the 5’ end of transcripts thus facilitating the

accurate mapping of TSS. As larger studies wish to sequence more strains, it may

become necessary to use a depletion step to enable multiplexing of cDNA from multiple

strains into a single Illumina flowcell lane. For example, the Epicentre
ScriptSeq v2 preparation kit (Cat. No. RSBC10948), released in 2012, allows up to twelve
indexed cDNA libraries to be pooled into one lane, decreasing the
cost of sequencing and providing a rapid increase in potential experiment size.
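The trade-off between rRNA depletion and multiplexing described above can be illustrated with simple arithmetic. The sketch below is not part of the analysis performed in this study. The lane yield of 150 million reads and the post-depletion rRNA fraction of 10% are hypothetical values chosen for illustration; only the roughly 90% rRNA fraction and the twelve-library pooling limit come from the text.

```python
def informative_reads_per_library(lane_reads, rrna_fraction, n_libraries):
    """Expected number of non-rRNA reads per library when one lane is
    shared equally between n_libraries indexed libraries."""
    if not 0.0 <= rrna_fraction <= 1.0:
        raise ValueError("rrna_fraction must be between 0 and 1")
    if n_libraries < 1:
        raise ValueError("n_libraries must be at least 1")
    reads_per_library = lane_reads / n_libraries
    return reads_per_library * (1.0 - rrna_fraction)


LANE_READS = 150_000_000  # hypothetical lane yield, not a figure from this study

# One undepleted library per lane, with about 90% of reads lost to rRNA.
undepleted_single = informative_reads_per_library(LANE_READS, 0.90, 1)

# Twelve pooled libraries after depletion; the 10% residual rRNA is an assumption.
depleted_pooled = informative_reads_per_library(LANE_READS, 0.10, 12)

print(f"undepleted, 1 library per lane:  {undepleted_single:,.0f} non-rRNA reads")
print(f"depleted, 12 libraries per lane: {depleted_pooled:,.0f} non-rRNA reads")
```

Under these assumed values, each of twelve pooled, depleted libraries would retain a non-rRNA read depth of the same order as a single undepleted library occupying a whole lane, which is the reasoning behind the suggestion that depletion may become worthwhile in larger studies.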

The experimental design of this study was to identify the lineage-specific expression

profiles of two MTBC lineages. The RNA-seq data was therefore mapped to a common

reference genome sequence using M. tuberculosis genome annotations based on H37Rv;

in this case the sequence was the reconstructed ancestor of the MTBC determined from

the phylogeny in Chapter 3, and the annotations were based on Tuberculist annotations

(Lew et al., 2011). This is advantageous as the number of genes is common to the

dataset, allowing comparison of expression levels across all strains. However, a

disadvantage of using a reference-based mapping process is that it cannot detect the
expression of any novel transcripts present in the samples. About a quarter (23.3%) of

the filtered high quality transcriptome data did not map to the reference genome, which

could suggest that some highly expressed transcripts are not detected in this analysis.

The mapping algorithm (BWA) (Li & Durbin, 2009) and parameters used could

accommodate gaps of up to three mismatches, and therefore larger indels may account

for some of these unmapped sequences. Future non-reference-based de novo
assembly of the sequences has the potential to identify novel transcripts not present in
the reference strain H37Rv, although such analyses are computationally very expensive

and would be more effective using paired-end reads instead of the single-end reads

generated in this study (Schulz et al., 2012).


5.4.2 Lineage-specific expression

Clustering analysis of the transcriptome samples identified significant correlation in

transcription between strains of the same lineage based on both sense and antisense

expression, suggesting that the underlying lineage-specific variation is functional and

results in differential transcription. This was strengthened by the positive correlation

between the number of diverging SNPs and gene expression distance (Figure 5.11). At

the gene level, differential analysis identified a total of 112 genes with significant

lineage-specific patterns of expression. A quarter of the genes (26 out of 112 genes)

were identified as differentially expressed in previous microarray comparisons of

ancient versus modern lineages or M. tuberculosis H37Rv versus Mycobacterium bovis

(within Lineage 6), suggesting that the RNA-seq method is concordant with other gene

expression methods (Golby et al., 2007; Homolka et al., 2010). Furthermore, qRT-PCR
analysis of a select number of differentially expressed genes found the same
direction of fold change as the RNA-seq data, despite the addition of
strains not used in the RNA-seq study. This strengthens the case that there is no
selection bias in the strains used, and that the lineage-specific patterns of expression are
a general property of the respective lineages.

A parallel analysis of antisense transcription identified similar conservation by lineage

with a differential expression pattern for 56 genes. Pervasive expression of antisense

transcripts has been recognised as a common feature of bacterial transcriptomes (Lasa et

al., 2011; Raghavan et al., 2012). Comparison of upstream sequences in Escherichia

coli and Salmonella typhimurium suggests that selective pressure for conservation of

antisense promoters is lower than in the case of sense promoters (Raghavan et al., 2012).

Parallel sequencing of the above genomes by Raghavan et al. identified only eight
highly expressed antisense transcripts common to both species, out of the approximately
one hundred antisense transcripts from orthologous gene pairs found between the two
species. This could have

been due to a species-specific function of the antisense, but no evidence of conservation

was found within strains of E. coli either (Raghavan et al., 2012). In contrast, this study

found a broadly similar pattern of sense and antisense diversity in the MTBC lineage

comparison, which could reflect the reduced purifying selection and increased genetic

drift within MTBC (Hershberg et al., 2008). Currently, the biological significance of

antisense transcripts is unknown, and of the thousands of proposed antisense transcripts in E. coli

only a few have been functionally characterised (Fozo et al., 2008; Kawano et al.,

2007). The conservation of antisense in a lineage-specific pattern in the MTBC is


interesting and suggests a functional role. It is possible that double-stranded RNA

molecules differ from single-stranded mRNAs in their efficiency of translation and

susceptibility to degradation, which could add another layer of regulation (Thomason &
Storz, 2010) that should not be ignored in future studies of MTBC diversity.

5.4.3 Linking genotype to phenotypic consequences at the transcriptional level

Bioinformatic analyses in Chapters 3 and 4 suggested a high percentage of

nonsynonymous SNPs identified across the MTBC were likely to impair protein

function. In this study, three mechanisms by which transcriptome diversity is generated

were identified and these are discussed in the following sections.

5.4.3.1 Transcriptional regulators

Focusing on Lineages 1 and 2, functional impairment of eight transcriptional regulators

was predicted in Chapter 4. Transcriptional profiling provided confirmatory evidence in

four of these cases, virS, Rv0275c, Rv3167c, and Rv3830c. Increased transcription was

observed for three regulatory proteins with mutations affecting the helix-turn-helix

motif, consistent with a loss of autorepression. Elevated expression of VirS in Lineage 1

recapitulates results of a previous microarray comparison of modern and ancient

lineages (Homolka et al., 2010), with the absence of activation of the associated MymA

regulon providing further indication that the mutant VirS lacks functional activity.

Differential expression of virS has also been observed in the comparison of M.

tuberculosis and M. bovis transcriptomes, with 10-fold higher virS expression in M.

bovis (Golby et al., 2007); interestingly another virS lineage-specific SNP was found at

amino acid residue 322, six amino acids away from the above Lineage 1 SNP, and this

defines all animal-adapted MTBC strains, leading to a change in amino acid also

predicted to be functional by SIFT (R322C). Experimental deletion of VirS in M.

tuberculosis H37Rv resulted in pleiotropic cell wall defects and reduced growth in the

spleen of guinea pigs (Singh et al., 2005), raising the possibility that this mutation may

reduce the virulence of Lineage 1 strains. Transcription of Rv0275c and Rv3167c was

similarly upregulated in strains carrying the mutant allele. Neither of these proteins has

been characterised, but RNA-seq profiles were consistent with the functional proteins

acting as autorepressors and inhibitors of adjacent genes. Predicted inactivation of

Rv3830c by a frameshift mutation causing fusion to an adjacent protein did not result in


a significant change in expression, but flanking genes (phytoene dehydrogenase

Rv3829c, and Rv3131 with unknown function) were markedly upregulated in Lineage 2.

Whilst for the remaining four transcriptional regulators no detectable transcriptional

phenotype was found in this study, analysis of the response to specific stimuli other than

in exponential phase culture may uncover functional defects. For example, the BlaI

regulator is activated in the presence of beta-lactams (Sala et al., 2009), and therefore

the predicted impaired BlaI in Lineage 1 may only be identified in these conditions.

Similarly, low potassium may uncover functional defects of KdpD in Lineage 1 strains;

kdpE is a sensor protein of the Kdp potassium transport system (Steyn et al., 2003;

Walderhaug et al., 1992).

5.4.3.2 SNP-associated TSS

In addition to amino acid changes in regulatory proteins, genes with lineage-specific

patterns of differential expression were characterised by a high frequency of SNPs

associated with transcriptional start sites (TSS). A striking observation was that SNPs

generating a -10 consensus motif (TAnnnT) were frequently associated with the

emergence of a new TSS. SNP-created TAnnnT motifs could account for 19 of the 168

(11%) lineage-specific differentially expressed genes and antisense, and also for

exponential phase expression of the DosR regulon in the Beijing family. SNPs falling

outside of the -10 motif may also affect promoter activity. Creation of an “extended” -10

consensus (TGnTAnnnT) resulted in enhanced expression, and changes at the -1

position were associated with higher TSS activity.
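The idea of a SNP creating a -10 consensus motif can be sketched as a comparison of motif positions before and after the substitution. The function names and the short sequence below are invented for illustration and are not taken from the MTBC genome or from the analysis pipeline used in this study. The sketch scans a single strand only, so an antisense search would additionally need the reverse complement.

```python
import re

# Lookahead pattern so that overlapping TAnnnT matches are all reported.
MINUS10_LOOKAHEAD = re.compile(r"(?=TA[ACGT]{3}T)")


def motif_starts(seq):
    """Return the set of 0-based start positions of TAnnnT in seq."""
    return {m.start() for m in MINUS10_LOOKAHEAD.finditer(seq.upper())}


def snp_creates_minus10(ancestral, snp_index, derived_base):
    """Return sorted start positions of TAnnnT motifs that are present in
    the derived sequence but absent from the ancestral sequence."""
    ancestral = ancestral.upper()
    derived = ancestral[:snp_index] + derived_base.upper() + ancestral[snp_index + 1:]
    return sorted(motif_starts(derived) - motif_starts(ancestral))


# Invented example: a G to A transition at index 3 turns TGCGGT into TACGGT.
toy_ancestral = "CCTGCGGTCC"
new_motifs = snp_creates_minus10(toy_ancestral, 3, "A")
print(new_motifs)
```

A motif found this way is only a candidate: in this study a new TSS was called from the RNA-seq data, with the motif providing a plausible explanation rather than proof of promoter activity.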

In addition to their effect on expression of downstream genes, as in the case of

Rv3134c/DosR for example, TSS arising within coding regions may also play a role in

generating functionally active truncated proteins. Ppm1 (Rv2051c) is a bifunctional

enzyme, fusing an N-terminal apolipoprotein N-acyltransferase with a polyprenyl

phosphomannose synthase that are encoded by separate genes in other mycobacteria

(Gurcha et al., 2002). Combination of the two activities in a single polypeptide is likely
to assist in coordination of the final steps in post-translational modification of
glycosylated lipoproteins: the N-acyltransferase completes the tri-acyl lipid tail, and
polyprenyl mannose provides the sugar donor for glycosylation. An internal TSS provides
the option of separating the two

activities, freeing the polyprenyl phosphomannose synthase to participate in other

glycosylation pathways. The presence of a conserved internal TSS suggests that this


option is retained by all members of the MTBC, with additional flexibility in Lineage 1

provided by a SNP that is associated with a new TSS and predicted impairment of N-

acyltransferase activity. It has been proposed that changes in the mannosylation of cell

surface components have an important impact on recognition of mycobacteria by

receptors on innate immune cells (Torrelles & Schlesinger, 2010), and redistribution of

mannose between lipoglycans and lipoproteins represents an attractive hypothesis to

account for the differential inflammatory response to Lineage 1 and Lineage 2 strains

(Portevin et al., 2011). Enhanced Lineage 1 transcription of mgtA (Rv0557, previously

also referred to as “PimB”) could also contribute to differences in macrophage

phenotype (Torrelles et al., 2009).

New TSS associated with SNP-generated TAnnnT motifs were also observed at a

similar frequency in antisense orientation. The biological significance of antisense

transcripts is unknown; it is possible that double-stranded RNA molecules differ from

single-stranded mRNAs in their efficiency of translation and susceptibility to

degradation. Identification of a Lineage 1 SNP associated with a new TSS in UmaA that

generates antisense to the adjacent pcaA raises the intriguing possibility of a mechanism

for co-ordinated regulation of the two genes. Both proteins are involved in modification

of mycolic acids and lineage-specific differential expression could again contribute to

variation in innate immune reactivity (Rao et al., 2006; Barkan et al., 2012).

More generally, this study has uncovered a potentially important mechanism of

generating transcriptional diversity through SNP-associated TSS. Mutation drives

evolution and adaptation on which selection acts, but mutation is not a completely

stochastic process, and several biases exist (Hershberg & Petrov, 2010). It has been

shown that mutation is AT-biased in clonal organisms including M. tuberculosis, and is

dominated by nucleotide transitions from C or G to T or A (Hershberg & Petrov, 2010).

This was also found to be the case in the Lineage-specific SNPs identified in Chapter 3

(Figure 5.23), with a mean of 64.5% of all SNPs resulting in a G to A or C to T

transition. Together this suggests the potential for many other SNP-associated TSS
within the MTBC, which should be an initial focus of subsequent transcriptome
studies, along with the predicted functional mutations at the amino acid level described
earlier.
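The mutation spectrum summarised above can be tallied directly from a list of ancestral and derived alleles. The sketch below uses a small invented SNP list, not data from this study, to show the calculation. It also checks that the 64.5% quoted in the text is the unweighted mean of the three category percentages given in the caption of Figure 5.23.

```python
from collections import Counter

# The two substitution types counted as G/C to A/T transitions.
GC_TO_AT_TRANSITIONS = {("G", "A"), ("C", "T")}


def gc_to_at_transition_percent(snps):
    """snps is an iterable of (ancestral, derived) base pairs.
    Return the percentage that are G>A or C>T transitions."""
    counts = Counter((anc.upper(), der.upper()) for anc, der in snps)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no SNPs supplied")
    hits = sum(n for pair, n in counts.items() if pair in GC_TO_AT_TRANSITIONS)
    return 100.0 * hits / total


# Invented toy data: three of the five substitutions are G>A or C>T.
toy_snps = [("G", "A"), ("C", "T"), ("G", "A"), ("A", "G"), ("G", "C")]
toy_percent = gc_to_at_transition_percent(toy_snps)

# Percentages for nonsynonymous, synonymous and intergenic SNPs (Figure 5.23).
category_percentages = [56.7, 76.2, 60.7]
mean_percent = sum(category_percentages) / len(category_percentages)

print(f"toy data: {toy_percent:.1f}% G/C to A/T transitions")
print(f"unweighted mean across categories: {mean_percent:.1f}%")
```

Because the mean is unweighted, it treats the three categories equally regardless of how many SNPs each contains; a pooled percentage over all SNPs could differ.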


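Figure 5.23. Rates of the types of nucleotide mutations across A. nonsynonymous,
B. synonymous and C. intergenic regions. G/C to A/T transitions account for 56.7% of
all nonsynonymous SNPs, 76.2% of all synonymous SNPs and 60.7% of all intergenic
SNPs. Each panel plots the percentage of SNPs (y-axis) for each of the twelve
substitution types (x-axis: G/C, C/G, G/T, C/A, A/T, T/A, A/C, T/G, G/A, C/T, A/G, T/C).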

5.4.3.3 Landscaping of toxin antitoxins

For the remaining differentially expressed genes no direct genotypic link was identified,

and it is presumed that they reflect secondary adaptive responses. The most striking

feature was the over-representation of toxin-antitoxin (TA) gene pairs, contributing to

ten percent of the total set of differentially expressed genes. Differential expression of

TAs is also a feature of previous microarray studies comparing M. bovis with M.

tuberculosis (Golby et al., 2007), and “ancient” with “modern” strains (Homolka et al.,

2010). TA systems were originally identified by their role in plasmid maintenance, but

they are now recognised as a common feature of bacterial genomes (Pandey & Gerdes,

2005). With 62 TA pairs in the current Tuberculist database (Lew et al., 2011), M.

tuberculosis has more TAs than any other intracellular bacterium (Makarova et al.,

2009; Pandey & Gerdes, 2005). The toxin component is typically an endonuclease, with

activity directed towards ribosome-associated mRNAs, rRNAs and tmRNA, resulting in


blockage of translation. An attractive hypothesis is that the role of TAs in M.

tuberculosis is to drive the bacteria into reversible growth arrest in unfavourable

environments, by responding to changes in antitoxin stability and proteolytic activities.

Based on this model, the differential expression of TA genes is interpreted as a read-out

of lineage differences in environmental sensing. Comparison of the overall TA

transcription response suggests that the core lineage pattern is overlaid by strain-specific

responses, and it can be envisaged that variability in the combined proteolytic and

transcriptional regulatory network could readily generate heterogeneity within clonal

populations.


Chapter 6 Final discussion

In this thesis, M. tuberculosis, the principal etiologic agent of tuberculosis in humans,

was investigated at the population level using genomic and transcriptomic approaches

made possible through use of new DNA sequencing technologies. Prior to this study, it

had been hypothesised that a high percentage of genetic diversity in the MTBC would be
functional due to reduced purifying selection (Hershberg et al., 2008). The

MTBC is known to exist as six major lineages, and the overarching aim of this study

was to explore the nature of the genetic diversity at the lineage level and identify the

extent to which this has translated into functional diversity at the transcriptional level.

The results of these studies, their impact, and avenues for future work are discussed in

the following section.

The potential to further our understanding of MTBC diversity was underscored by the

defining study in 2010 by Comas et al. which provided the first representative genome-

wide phylogeny of global genetic diversity at the single nucleotide resolution (Comas et

al., 2010). Twenty-one isolates were selected from a global collection of strains, creating

a robust genomic framework on which to base future analyses. In Chapter 3, the clonal

population structure of the MTBC was exploited to reveal for the first time all lineage-

specific SNPs, which were captured using an expanded 28-genome phylogeny. This was

only possible due to the absence of horizontal gene transfer and recombination in the

MTBC, resulting in the situation whereby the MTBC evolves by descent. This property

was underscored by the extremely low level of homoplasic SNPs, with only 0.14% of all

lineage-specific SNPs present in more than one lineage. SNPs are the most abundant

form of genetic diversity in the MTBC and as such this variation is anticipated to

significantly contribute to the genetic background of the lineages. Accounting for

potential discovery bias in the set of genomes used, the 2,794 SNP set is robust and is
not expected to change significantly in future studies. From a mechanistic point of


view, the SNPs identified in this study are directly applicable to SNP based typing

assays, as has been demonstrated recently (Stucki & Gagneux, 2012). Previously,

deletions identified in the M. tuberculosis genome have proved useful targets for typing

(Kong et al., 2006), but efforts are moving towards SNP typing as genome sequencing

costs decrease (Comas et al., 2009). In addition to typing newly isolated strains,

knowledge of the underlying background genetic variation is important for distinguishing
phylogenetically informative SNPs from those associated with drug resistance,

demonstrated by the presence of lineage-specific SNPs identified in this study that were

also present in the database housing the largest collection of mutations causally linked to

drug resistance (Sandgren et al., 2009).

From an evolutionary perspective, the hypothesised reduced selective constraint in the

MTBC might be assumed to create a situation whereby nonsynonymous SNPs are

accumulating in genes without regard to biological function (Hershberg et al.,
2008). The degree of purifying selection was first tested in the lineage-specific set using
the dN/dS measure, and similarly low levels of purifying selection were
found in all lineages. As a validation of these results, the dN/dS ratios were congruent
with those found in different MTBC SNP datasets that focused on either a restricted

number of genes (Hershberg et al., 2008) or more generally on all identified SNPs

(Comas et al., 2010). Due to the genome-wide nature of this study, it was possible to
examine individual gene functional categories and ask whether purifying selection is
uniform across them, and so whether all categories are experiencing the same random
genetic drift. This was not the case, with a gradient in the removal of nonsynonymous
SNPs; the information pathways category harboured the fewest SNPs, whilst a

significant accumulation of amino acid changing SNPs in the regulatory category was

observed. Interestingly, it has previously been found that genes involved in essential
functions experience a greater level of purifying selection (Comas et al., 2010). This was
also observed in this study and is likely to be the factor driving the observed
gradient: the information pathways category has the highest proportion of essential
genes, whilst the regulatory category has the lowest. It has been previously reported that genome

sequencing of strains from the Beijing group of Lineage 2 found an overrepresentation

of nonsynonymous SNPs in regulatory coding genes (Schürch et al., 2011). Together

this suggests that firstly, whilst low purifying selection is acting across all lineages and

gene categories, removal of potentially deleterious SNPs is still detectable, and secondly,

the enrichment of nonsynonymous SNPs in genes with a regulatory function could result

in alterations in the response to environmental signals between the lineages.
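The dN/dS measure discussed above compares the rate of nonsynonymous change per nonsynonymous site with the rate of synonymous change per synonymous site. The sketch below shows only this basic ratio, without the corrections for multiple substitutions that dedicated methods apply, and it is not the implementation used in Chapter 3. The counts in the example are invented.

```python
def uncorrected_dn_ds(nonsyn_snps, syn_snps, nonsyn_sites, syn_sites):
    """Ratio of nonsynonymous SNPs per nonsynonymous site to
    synonymous SNPs per synonymous site (no multiple-hit correction)."""
    if nonsyn_sites <= 0 or syn_sites <= 0:
        raise ValueError("site counts must be positive")
    if syn_snps <= 0:
        raise ValueError("ratio is undefined without synonymous SNPs")
    p_n = nonsyn_snps / nonsyn_sites  # proportion of nonsynonymous sites changed
    p_s = syn_snps / syn_sites        # proportion of synonymous sites changed
    return p_n / p_s


# Invented counts for one hypothetical gene category.
toy_ratio = uncorrected_dn_ds(nonsyn_snps=60, syn_snps=40,
                              nonsyn_sites=3000, syn_sites=1000)
print(f"uncorrected dN/dS = {toy_ratio:.2f}")
```

A ratio well below 1 indicates that nonsynonymous changes are being removed by purifying selection, whereas a ratio approaching 1 is what would be expected under relaxed constraint; comparing the ratio between functional categories is the kind of analysis described in this paragraph.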


Whilst the lineage-specific SNP set is an important pool of genetic diversity, the

identification of nearly three thousand SNPs is difficult to manage from a phenotypic

point of view. As part of a need to generate a focused SNP set for later phenotypic

analysis, and to further understand the genome-wide effect of the observed high

nonsynonymous SNP frequency in Chapter 3, a predictive computational approach was

undertaken in Chapter 4. Based on evolutionary information it was found that nearly half

of all nonsynonymous SNPs introduce an amino acid change at positions conserved in

all other mycobacteria, and therefore are likely to have a functional effect. This confirms

a previous expectation for a high number of functional SNPs based on a restricted

MLSA dataset (Hershberg et al., 2008), and strengthens the observation that this is a

phenomenon specific to the MTBC and not mycobacteria-wide; the same method

applied to the MTBC outlier, M. canettii, found half the level of predicted functional

SNPs. Together this suggests a significant potential for functional diversity in the

MTBC due to nonsynonymous SNPs. The MTBC is thought to have originated in

Africa, and the association with humans over a long time frame has likely resulted in

interactions between human genetic diversity and MTBC variation (Gagneux, 2012).

Interestingly, a similar phenomenon to that found in this study has also been observed in

humans, where recent demographic expansions have distorted basic principles of

population genetics and led to the accumulation of low frequency genetic variants

associated with strong functional effects (Keinan & Clark, 2012; Tennessen et al.,

2012).

In light of the current slew of genome sequencing studies and corresponding explosion

in growth of databases including dbSNP and DGV (Iafrate et al., 2004; Sherry et al.,

1999), on the human genetics side, and Tuberculist, TBDB and PATRIC on the MTBC

side (Lew et al., 2011; Gillespie et al., 2011; Reddy et al., 2009), it can be envisaged

that the field is rapidly approaching a catalogue of the majority of genetic variation. The
activity of this field is demonstrated by a simple PubMed search for “whole genome
sequencing” and “SNPs”, which returned 501 research articles and reviews over the
course of this thesis (2009 to 2013). It is therefore reasonable to state that we now have a

good understanding of what the genetic differences are in the MTBC at the lineage level.

As a side note, the MTBC field is struggling to keep up with the growth in genome

sequencing projects in terms of database curation and access, and is in need of a new

online resource to house and integrate recently identified genetic variation (Stucki &

Gagneux, 2012). Keeping with the human genetics theme, it is estimated that 90% of


sequence variants in humans are SNPs (Collins et al., 1998), with each person thought to

be heterozygous for 24,000-40,000 nonsynonymous SNPs (Cargill et al., 1999), whilst

this study found an average pairwise difference of ~1000 nonsynonymous SNPs
between any two MTBC strains. However, there is a much less complete picture of what

these variants do. In response to this, the computational approaches used in this study in

Chapter 4 have been largely developed to facilitate human genetics research with a need

to filter potentially deleterious SNPs from those that are neutral. Ultimately it is

anticipated that genomics will translate into real world clinical settings, informing

diagnostics and treatment in personalised medicine (Evans & Relling, 1999; Laing et al.,

2011). It was interesting to see in Chapter 5 that some of the functional genetic variation

was due to nucleotide level changes that were not nonsynonymous and therefore not the

focus of the computational tools. Nonsynonymous SNPs are classically thought of as

having a higher potential to affect function than synonymous SNPs, which are usually

regarded as neutral. Here the synonymous mutations are shown to give rise to novel

TSS; the SNP predicted to be involved in constitutive expression of DosR in Beijing

strains is synonymous. This stresses the importance of appreciating diversity outside of

the classical focus on nonsynonymous SNPs which are the focus of most computational

resources for predicting the functional effects of SNPs (Mooney, 2005).

One of the most exciting aspects of this thesis has been to combine multidisciplinary

methods to strengthen and further our understanding of MTBC diversity. Bioinformatic analyses

in Chapters 3 and 4 suggested that a high percentage of nonsynonymous SNPs identified

across the MTBC likely impair protein function. Chapter 5 explored the potential effects

of genetic variation within the total transcriptomes of clinical MTBC isolates. Prior to

this study, there were no examples of an integrated MTBC genome and transcriptome

analysis, and whilst one recent microarray based study used a rational approach to

selecting strains from different lineages, the underlying genotype was unknown

(Homolka et al., 2010). The aims of the chapter were firstly to survey the transcriptome

profiles of M. tuberculosis clinical isolates from Lineages 1 and 2 using a sequence

based approach, and secondly to understand the effects of the identified lineage-specific

variation. The importance of this study was therefore to establish direct links between

genetic differences observed amongst clinical isolates of the MTBC and phenotypic

consequences at the level of transcription. Transitioning from large-scale whole genome

sequencing of strains to transcriptome analysis using high throughput sequencing is one

of the next frontiers in the understanding of MTBC diversity, and it can be anticipated

that as throughput continues to increase, thanks to improvements in sequencing


technology, and costs decrease from economies of scale, transcriptome sequencing will

be feasible for many more clinical strains.

The work undertaken in this thesis is positioned at the interface of genomic and

transcriptional systems. Genomic diversity that is specific to each of the MTBC lineages

was identified and the effects of this variation screened in the next biological level -

transcription. Analysis was guided by predictions of potential functional mutations using

an in silico approach. One example of a predicted functional SNP within a regulatory

protein with a detectable phenotype at the transcriptional level was virS. The

hypothesised functional defect with virS in Lineage 1 strains correlates with another virS

SNP found in M. bovis strains and evidence of a similar transcriptional phenotype
observed in a previous microarray study (Golby et al., 2007). Verification of the
predicted defective virS in Lineage 1 strains and M. bovis is under investigation at

NIMR, with purified recombinant virS protein currently undergoing DNase footprinting

to ascertain the virS binding site in addition to in vitro transcription assays. A second

example under further investigation is the cause of constitutive DosR expression in the

Beijing sub-family of Lineage 2. It was hypothesised in this thesis that this is due to a

synonymous SNP within all Beijing strains that was seen in the RNA-seq data to

introduce a new transcriptional start site (TSS). Following verification that the SNP is

the cause of the new TSS and associated increased DosR transcription, it would then be

necessary to follow this into the level of translation through measurement of protein

abundance. In a wider context, the relevance of increased DosR to virulence in mouse

models and ultimately epidemiology in humans is not clear (Bartek et al., 2009; Boon &

Dick, 2012). The understanding of tuberculosis at all biological levels is currently an

active field of research, with large collaborations utilising a systems biology approach in

the United States (TB Systems Biology - Stanford University and the Broad Institute)

and Europe (SysteMTb). While these projects are largely based on the reference MTBC

strain H37Rv, it is anticipated that use of clinical strains, such as those used in this study,

will provide important biological insight into the impact of MTBC genetic variation. At

NIMR the approach used in this thesis is also being applied at the proteomic and

metabolomic levels.

In conclusion, this thesis has for the first time captured the genetic diversity that

separates the MTBC lineages, and has demonstrated that such diversity generates

transcriptional diversity between the two MTBC lineages focused on in this study, and it

is highly likely that similar mechanisms occur in the other lineages. This underpins the


importance of the holistic scientific approach that was undertaken in this thesis, which is in

contrast to the gene-centric focus of reductionism. This study's strength comes from the

power to analyse all SNPs across the genome, uncovering examples of functional SNPs

in a data-driven approach, and the potential pool of additional functional SNPs predicted

across all functional categories. To understand MTBC diversity, genomic data should

not be interpreted in isolation, but instead integrated with other biological systems, as

suggested for DosR above. An example of the importance of not treating mutations in

isolation is demonstrated by the phenomenon of epistasis, whereby the phenotypic effect

of one mutation differs depending on the presence of another mutation (Lehner, 2011).

A role for epistasis in M. tuberculosis has recently been reported for the evolution of

drug-resistant strains (Borrell et al., 2013), but epistasis has been implicated in many

other biological processes, including pathway organization, mutational load, and

genomic complexity (Breen et al., 2012). Therefore, the lineage-specific SNPs identified

in this thesis provide a framework on which further studies of the effects of MTBC

genomic diversity can be based; firstly as an approach to interrogating genome datasets,

secondly in demonstrating a mechanistic way of generating diversity, and finally as a

resource to the TB community.

Finally, whilst genetic diversity has been uncovered in this thesis, it remains to be shown

whether this has biological consequences during infection. Both lineages are highly

successful pathogens with proven ability to maintain transmission cycles over tens of

thousands of years, and it is likely that phenotypic diversity will reflect adaptation to

different circumstances rather than loss or gain of ability to cause disease. The

differences detected here suggest that strains from the two lineages may present

alternative ligand repertoires to host cells, and respond differently to environmental

changes generated by the host immune response. This in turn may confer varying

degrees of fitness in different epidemiological settings. Studies of the message layer

between a cell and its genome, such as those undertaken in this thesis,

will help connect genotype and phenotype, and are needed along with integration of

other biological systems to provide a full understanding of the nature and phenotypic

consequence of MTBC diversity in relation to human TB disease. It is also

important to note that this thesis focused on the common underlying genetic differences

between the MTBC lineages, reflecting events occurring 40,000 to 60,000 years ago

(Hershberg et al., 2008). It has been hypothesised that the MTBC and humans have been

co-evolving and are thus shaped by this longstanding association (Gagneux, 2012); it is

therefore interesting to speculate that focusing on different evolutionary timescales, such


as the last two hundred years, might reveal selective pressures in the MTBC associated

with the great expansion in human population numbers over this period of time. As well

as providing an opportunity to discern the ongoing evolution of the MTBC population,

such timescales could highlight the response to pressures associated with HIV and drug-

resistance and ultimately help design better tools and effective control strategies for one

of the world’s oldest human diseases.


References

Achtman, M. (2008). Evolution, population structure, and phylogeography of

genetically monomorphic bacterial pathogens. Annu Rev Microbiol 62, 53-70.

Albers, C. A., Lunter, G., MacArthur, D. G., McVean, G., Ouwehand, W. H. &

Durbin, R. (2011). Dindel: accurate indel calls from short-read data. Genome Res 21,

961-973.

Alexander, K. A., Laver, P. N., Michel, A. L., Williams, M., van Helden, P. D.,

Warren, R. M. & Gey van Pittius, N. C. (2010). Novel Mycobacterium tuberculosis

complex pathogen, M. mungi. Emerg Infect Dis 16, 1296-1299.

Anders, S. & Huber, W. (2010). Differential expression analysis for sequence count

data. Genome Biol 11, R106.

Arnvig, K. & Young, D. (2012). Non-coding RNA and its potential role in

Mycobacterium tuberculosis pathogenesis. RNA Biol 9.

Arnvig, K. B., Comas, I., Thomson, N. R., Houghton, J., Boshoff, H. I., Croucher,

N. J., Rose, G., Perkins, T. T., Parkhill, J., Dougan, G. & Young, D. B. (2011).

Sequence-based analysis uncovers an abundance of non-coding RNA in the total

transcriptome of Mycobacterium tuberculosis. PLoS Pathog 7, e1002342.

Atlas, R. M. & Snyder, J. W. (2006). Handbook of media for clinical microbiology:

CRC.


Bagchi, G., Chauhan, S., Sharma, D. & Tyagi, J. S. (2005). Transcription and

autoregulation of the Rv3134c-devR-devS operon of Mycobacterium tuberculosis.

Microbiology 151, 4045-4053.

Baker, L., Brown, T., Maiden, M. C. & Drobniewski, F. (2004). Silent nucleotide

polymorphisms and a phylogeny for Mycobacterium tuberculosis. Emerg Infect Dis

10, 1568-1577.

Balbi, K. J. & Feil, E. J. (2007). The rise and fall of deleterious mutation. Res

Microbiol 158, 779-786.

Bao, L. & Cui, Y. (2005). Prediction of the phenotypic effects of non-synonymous

single nucleotide polymorphisms using structural and evolutionary information.

Bioinformatics 21, 2185-2190.

Barkan, D., Hedhli, D., Yan, H. G., Huygen, K. & Glickman, M. S. (2012).

Mycobacterium tuberculosis lacking all mycolic acid cyclopropanation is viable but

highly attenuated and hyperinflammatory in mice. Infect Immun 80, 1958-1968.

Barry, C. E., Boshoff, H. I., Dartois, V., Dick, T., Ehrt, S., Flynn, J., Schnappinger,

D., Wilkinson, R. J. & Young, D. B. (2009). The spectrum of latent tuberculosis:

rethinking the biology and intervention strategies. Nat Rev Microbiol 7, 845-855.

Barry, C. E., 3rd (2001). Interpreting cell wall 'virulence factors' of Mycobacterium

tuberculosis. Trends Microbiol 9, 237-241.

Bartek, I. L., Rutherford, R., Gruppo, V., Morton, R. A., Morris, R. P., Klein, M.

R., Visconti, K. C., Ryan, G. J., Schoolnik, G. K., Lenaerts, A. & Voskuil, M. I.

(2009). The DosR regulon of M. tuberculosis and antibacterial tolerance. Tuberculosis

(Edinb) 89, 310-316.

Behr, M. A., Schroeder, B. G., Brinkman, J. N., Slayden, R. A. & Barry, C. E.

(2000). A point mutation in the mma3 gene is responsible for impaired methoxymycolic

acid production in Mycobacterium bovis BCG strains obtained after 1927. J Bacteriol

182, 3394-3399.


Bellamy, R., Beyers, N., McAdam, K. P., Ruwende, C., Gie, R., Samaai, P., Bester,

D., Meyer, M., Corrah, T., Collin, M., Camidge, D. R., Wilkinson, D., Hoal-Van

Helden, E., Whittle, H. C., Amos, W., van Helden, P. & Hill, A. V. (2000). Genetic

susceptibility to tuberculosis in Africans: a genome-wide scan. Proc Natl Acad Sci U S

A 97, 8005-8009.

Benjamini, Y. & Hochberg, Y. (1995). Controlling the false discovery rate: a practical

and powerful approach to multiple testing. J R Stat Soc Series B Stat Methodol 57, 289-300.

Bennett-Lovsey, R. M., Herbert, A. D., Sternberg, M. J. & Kelley, L. A. (2008).

Exploring the extremes of sequence/structure space with ensemble fold recognition in

the program Phyre. Proteins 70, 611-625.

Bentley, D. R., Balasubramanian, S., Swerdlow, H. P., Smith, G. P., Milton, J.,

Brown, C. G., Hall, K. P., Evers, D. J., Barnes, C. L., Bignell, H. R., Boutell, J. M.,

Bryant, J., Carter, R. J., Keira Cheetham, R., Cox, A. J., Ellis, D. J., Flatbush, M.

R., Gormley, N. A., Humphray, S. J., Irving, L. J., Karbelashvili, M. S., Kirk, S.

M., Li, H., Liu, X., Maisinger, K. S., Murray, L. J., Obradovic, B., Ost, T.,

Parkinson, M. L., Pratt, M. R., Rasolonjatovo, I. M., Reed, M. T., Rigatti, R.,

Rodighiero, C., Ross, M. T., Sabot, A., Sankar, S. V., Scally, A., Schroth, G. P.,

Smith, M. E., Smith, V. P., Spiridou, A., Torrance, P. E., Tzonev, S. S., Vermaas, E.

H., Walter, K., Wu, X., Zhang, L., Alam, M. D., Anastasi, C., Aniebo, I. C., Bailey,

D. M., Bancarz, I. R., Banerjee, S., Barbour, S. G., Baybayan, P. A., Benoit, V. A.,

Benson, K. F., Bevis, C., Black, P. J., Boodhun, A., Brennan, J. S., Bridgham, J. A.,

Brown, R. C., Brown, A. A., Buermann, D. H., Bundu, A. A., Burrows, J. C.,

Carter, N. P., Castillo, N., Chiara, E. C. M., Chang, S., Neil Cooley, R., Crake, N.

R., Dada, O. O., Diakoumakos, K. D., Dominguez-Fernandez, B., Earnshaw, D. J.,

Egbujor, U. C., Elmore, D. W., Etchin, S. S., Ewan, M. R., Fedurco, M., Fraser, L.

J., Fuentes Fajardo, K. V., Scott Furey, W., George, D., Gietzen, K. J., Goddard, C.

P., Golda, G. S., Granieri, P. A., Green, D. E., Gustafson, D. L., Hansen, N. F.,

Harnish, K., Haudenschild, C. D., Heyer, N. I., Hims, M. M., Ho, J. T., Horgan, A.

M., Hoschler, K., Hurwitz, S., Ivanov, D. V., Johnson, M. Q., James, T., Huw

Jones, T. A., Kang, G. D., Kerelska, T. H., Kersey, A. D., Khrebtukova, I.,

Kindwall, A. P., Kingsbury, Z., Kokko-Gonzales, P. I., Kumar, A., Laurent, M. A.,


Lawley, C. T., Lee, S. E., Lee, X., Liao, A. K., Loch, J. A., Lok, M., Luo, S.,

Mammen, R. M., Martin, J. W., McCauley, P. G., McNitt, P., Mehta, P., Moon, K.

W., Mullens, J. W., Newington, T., Ning, Z., Ling Ng, B., Novo, S. M., O'Neill, M.

J., Osborne, M. A., Osnowski, A., Ostadan, O., Paraschos, L. L., Pickering, L.,

Pike, A. C., Chris Pinkard, D., Pliskin, D. P., Podhasky, J., Quijano, V. J., Raczy,

C., Rae, V. H., Rawlings, S. R., Chiva Rodriguez, A., Roe, P. M., Rogers, J., Rogert

Bacigalupo, M. C., Romanov, N., Romieu, A., Roth, R. K., Rourke, N. J., Ruediger,

S. T., Rusman, E., Sanches-Kuiper, R. M., Schenker, M. R., Seoane, J. M., Shaw, R.

J., Shiver, M. K., Short, S. W., Sizto, N. L., Sluis, J. P., Smith, M. A., Ernest Sohna

Sohna, J., Spence, E. J., Stevens, K., Sutton, N., Szajkowski, L., Tregidgo, C. L.,

Turcatti, G., Vandevondele, S., Verhovsky, Y., Virk, S. M., Wakelin, S., Walcott, G.

C., Wang, J., Worsley, G. J., Yan, J., Yau, L., Zuerlein, M., Mullikin, J. C., Hurles,

M. E., McCooke, N. J., West, J. S., Oaks, F. L., Lundberg, P. L., Klenerman, D.,

Durbin, R. & Smith, A. J. (2008). Accurate whole human genome sequencing using

reversible terminator chemistry. Nature 456, 53-59.

Bentley, S. (2010). Taming the next-gen beast. Nat Rev Microbiol 8, 161.

Bentley, S. D., Comas, I., Bryant, J. M., Walker, D., Smith, N. H., Harris, S. R.,

Thurston, S., Gagneux, S., Wood, J., Antonio, M., Quail, M. A., Gehre, F.,

Adegbola, R. A., Parkhill, J. & de Jong, B. C. (2012). The genome of Mycobacterium

africanum West African 2 reveals a lineage-specific locus and genome erosion common

to the M. tuberculosis complex. PLoS Negl Trop Dis 6, e1552.

Bergval, I., Sengstake, S., Brankova, N., Levterova, V., Abadia, E., Tadumaze, N.,

Bablishvili, N., Akhalaia, M., Tuin, K., Schuitema, A., Panaiotov, S., Bachiyska, E.,

Kantardjiev, T., de Zwaan, R., Schurch, A., van Soolingen, D., van 't Hoog, A.,

Cobelens, F., Aspindzelashvili, R., Sola, C., Klatser, P. & Anthony, R. (2012).

Combined species identification, genotyping, and drug resistance detection of

Mycobacterium tuberculosis cultures by MLPA on a bead-based array. PLoS One 7,

e43240.

Boehme, C. C., Nicol, M. P., Nabeta, P., Michael, J. S., Gotuzzo, E., Tahirli, R.,

Gler, M. T., Blakemore, R., Worodria, W., Gray, C., Huang, L., Caceres, T.,

Mehdiyev, R., Raymond, L., Whitelaw, A., Sagadevan, K., Alexander, H., Albert,

H., Cobelens, F., Cox, H., Alland, D. & Perkins, M. D. (2011). Feasibility, diagnostic


accuracy, and effectiveness of decentralised use of the Xpert MTB/RIF test for diagnosis

of tuberculosis and multidrug resistance: a multicentre implementation study. Lancet

377, 1495-1505.

Boelens, R. & Gualerzi, C. O. (2002). Structure and function of bacterial initiation

factors. Current Protein and Peptide Science 3, 107-119.

Boon, C. & Dick, T. (2012). How Mycobacterium tuberculosis goes to sleep: the

dormancy survival regulator DosR a decade later. Future Microbiol 7, 513-518.

Borrell, S. & Gagneux, S. (2009). Infectiousness, reproductive fitness and evolution of

drug-resistant Mycobacterium tuberculosis. Int J Tuberc Lung Dis 13, 1456-1466.

Borrell, S., Teo, Y., Giardina, F., Streicher, E. M., Klopper, M., Feldmann, J.,

Muller, B., Victor, T. C. & Gagneux, S. (2013). Epistasis between antibiotic resistance

mutations drives the evolution of extensively drug-resistant tuberculosis. EMPH, 65-74.

Branton, D., Deamer, D. W., Marziali, A., Bayley, H., Benner, S. A., Butler, T., Di

Ventra, M., Garaj, S., Hibbs, A., Huang, X., Jovanovich, S. B., Krstic, P. S.,

Lindsay, S., Ling, X. S., Mastrangelo, C. H., Meller, A., Oliver, J. S., Pershin, Y. V.,

Ramsey, J. M., Riehn, R., Soni, G. V., Tabard-Cossa, V., Wanunu, M., Wiggin, M.

& Schloss, J. A. (2008). The potential and challenges of nanopore sequencing. Nat

Biotechnol 26, 1146-1153.

Breen, M. S., Kemena, C., Vlasov, P. K., Notredame, C. & Kondrashov, F. A.

(2012). Epistasis as the primary factor in molecular evolution. Nature 490, 535-538.

Brosch, R., Gordon, S. V., Marmiesse, M., Brodin, P., Buchrieser, C., Eiglmeier,

K., Garnier, T., Gutierrez, C., Hewinson, G., Kremer, K., Parsons, L. M., Pym, A.

S., Samper, S., van Soolingen, D. & Cole, S. T. (2002). A new evolutionary scenario

for the Mycobacterium tuberculosis complex. Proc Natl Acad Sci U S A 99, 3684-3689.

Brudey, K., Driscoll, J. R., Rigouts, L., Prodinger, W. M., Gori, A., Al-Hajoj, S. A.,

Allix, C., Aristimuno, L., Arora, J., Baumanis, V., Binder, L., Cafrune, P., Cataldi,

A., Cheong, S., Diel, R., Ellermeier, C., Evans, J. T., Fauville-Dufaux, M.,

Ferdinand, S., Garcia de Viedma, D., Garzelli, C., Gazzola, L., Gomes, H. M.,


Guttierez, M. C., Hawkey, P. M., van Helden, P. D., Kadival, G. V., Kreiswirth, B.

N., Kremer, K., Kubin, M., Kulkarni, S. P., Liens, B., Lillebaek, T., Ho, M. L.,

Martin, C., Mokrousov, I., Narvskaia, O., Ngeow, Y. F., Naumann, L., Niemann, S.,

Parwati, I., Rahim, Z., Rasolofo-Razanamparany, V., Rasolonavalona, T., Rossetti,

M. L., Rusch-Gerdes, S., Sajduda, A., Samper, S., Shemyakin, I. G., Singh, U. B.,

Somoskovi, A., Skuce, R. A., van Soolingen, D., Streicher, E. M., Suffys, P. N.,

Tortoli, E., Tracevska, T., Vincent, V., Victor, T. C., Warren, R. M., Yap, S. F.,

Zaman, K., Portaels, F., Rastogi, N. & Sola, C. (2006). Mycobacterium tuberculosis

complex genetic diversity: mining the fourth international spoligotyping database

(SpolDB4) for classification, population genetics and epidemiology. BMC Microbiol 6,

23.

Burley, S. K. (2013). PDB40: The Protein Data Bank celebrates its 40th birthday.

Biopolymers 99, 165-169.

Buts, L., Lah, J., Dao-Thi, M. H., Wyns, L. & Loris, R. (2005). Toxin-antitoxin

modules as bacterial metabolic stress managers. Trends Biochem Sci 30, 672-679.

Camarena, L., Bruno, V., Euskirchen, G., Poggio, S. & Snyder, M. (2010).

Molecular mechanisms of ethanol-induced pathogenesis revealed by RNA-sequencing.

PLoS Pathog 6, e1000834.

Cargill, M., Altshuler, D., Ireland, J., Sklar, P., Ardlie, K., Patil, N., Shaw, N.,

Lane, C. R., Lim, E. P., Kalyanaraman, N., Nemesh, J., Ziaugra, L., Friedland, L.,

Rolfe, A., Warrington, J., Lipshutz, R., Daley, G. Q. & Lander, E. S. (1999).

Characterization of single-nucleotide polymorphisms in coding regions of human genes.

Nat Genet 22, 231-238.

Carver, T., Berriman, M., Tivey, A., Patel, C., Böhme, U., Barrell, B. G., Parkhill,

J. & Rajandream, M. A. (2008). Artemis and ACT: viewing, annotating and

comparing sequences stored in a relational database. Bioinformatics

24, 2672-2676.

Casali, N., Nikolayevskyy, V., Balabanova, Y., Ignatyeva, O., Kontsevaya, I.,

Harris, S. R., Bentley, S. D., Parkhill, J., Nejentsev, S., Hoffner, S. E., Horstmann,


R. D., Brown, T. & Drobniewski, F. (2012). Microevolution of extensively drug-

resistant tuberculosis in Russia. Genome Res 22, 735-745.

Caws, M., Thwaites, G., Stepniewska, K., Nguyen, T. N., Nguyen, T. H., Nguyen, T.

P., Mai, N. T., Phan, M. D., Tran, H. L., Tran, T. H., van Soolingen, D., Kremer,

K., Nguyen, V. V., Nguyen, T. C. & Farrar, J. (2006). Beijing genotype of

Mycobacterium tuberculosis is significantly associated with human immunodeficiency

virus infection and multidrug resistance in cases of tuberculous meningitis. J Clin

Microbiol 44, 3934-3939.

Caws, M., Thwaites, G., Dunstan, S., Hawn, T. R., Lan, N. T., Thuong, N. T.,

Stepniewska, K., Huyen, M. N., Bang, N. D., Loc, T. H., Gagneux, S., van

Soolingen, D., Kremer, K., van der Sande, M., Small, P., Anh, P. T., Chinh, N. T.,

Quy, H. T., Duyen, N. T., Tho, D. Q., Hieu, N. T., Torok, E., Hien, T. T., Dung, N.

H., Nhu, N. T., Duy, P. M., van Vinh Chau, N. & Farrar, J. (2008). The influence of

host and bacterial genotype on the development of disseminated disease with

Mycobacterium tuberculosis. PLoS Pathog 4, e1000034.

Chesne-Seck, M. L., Barilone, N., Boudou, F., Gonzalo Asensio, J., Kolattukudy, P.

E., Martin, C., Cole, S. T., Gicquel, B., Gopaul, D. N. & Jackson, M. (2008). A point

mutation in the two-component regulator PhoP-PhoR accounts for the absence of

polyketide-derived acyltrehaloses but not that of phthiocerol dimycocerosates in

Mycobacterium tuberculosis H37Ra. J Bacteriol 190, 1329-1334.

Cingolani, P., Platts, A., Wang le, L., Coon, M., Nguyen, T., Wang, L., Land, S. J.,

Lu, X. & Ruden, D. M. (2012). A program for annotating and predicting the effects of

single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila

melanogaster strain w1118; iso-2; iso-3. Fly (Austin) 6, 80-92.

Coar, T. (1982). The aphorisms of Hippocrates with a Translation into Latin, and

English. Birmingham: Gryphon Editions.

Cock, P. J., Fields, C. J., Goto, N., Heuer, M. L. & Rice, P. M. (2009). The Sanger

FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ

variants. Nucleic Acids Res 38, 1767-1771.


Cohan, F. M. (2002). What are bacterial species? Annu Rev Microbiol 56, 457-487.

Cole, S. T., Brosch, R., Parkhill, J., Garnier, T., Churcher, C., Harris, D., Gordon,

S. V., Eiglmeier, K., Gas, S., Barry, C. E., Tekaia, F., Badcock, K., Basham, D.,

Brown, D., Chillingworth, T., Connor, R., Davies, R., Devlin, K., Feltwell, T.,

Gentles, S., Hamlin, N., Holroyd, S., Hornsby, T., Jagels, K., Krogh, A., McLean,

J., Moule, S., Murphy, L., Oliver, K., Osborne, J., Quail, M. A., Rajandream, M.

A., Rogers, J., Rutter, S., Seeger, K., Skelton, J., Squares, R., Squares, S., Sulston,

J. E., Taylor, K., Whitehead, S. & Barrell, B. G. (1998). Deciphering the biology of

Mycobacterium tuberculosis from the complete genome sequence. Nature 393, 537-544.

Collins, F. S., Brooks, L. D. & Chakravarti, A. (1998). A DNA polymorphism

discovery resource for research on human genetic variation. Genome Res 8, 1229-1231.

Comas, I. & Gagneux, S. (2009). The past and future of tuberculosis research. PLoS

Pathog 5, e1000600.

Comas, I., Homolka, S., Niemann, S. & Gagneux, S. (2009). Genotyping of

genetically monomorphic bacteria: DNA sequencing in Mycobacterium tuberculosis

highlights the limitations of current methodologies. PLoS One 4, e7815.

Comas, I., Chakravartti, J., Small, P., Galagan, J., Niemann, S., Kremer, K., Ernst,

J. & Gagneux, S. (2010). Human T cell epitopes of Mycobacterium tuberculosis are

evolutionarily hyperconserved. Nat Genet 42, 498-503.

Comas, I., Borrell, S., Roetzer, A., Rose, G., Malla, B., Kato-Maeda, M., Galagan,

J., Niemann, S. & Gagneux, S. (2011). Whole-genome sequencing of rifampicin-

resistant Mycobacterium tuberculosis strains identifies compensatory mutations in RNA

polymerase genes. Nat Genet 44, 106-110.

Constant, P., Perez, E., Malaga, W., Laneelle, M. A., Saurel, O., Daffe, M. &

Guilhot, C. (2002). Role of the pks15/1 gene in the biosynthesis of phenolglycolipids in

the Mycobacterium tuberculosis complex. Evidence that all strains synthesize

glycosylated p-hydroxybenzoic methyl esters and that strains devoid of

phenolglycolipids harbor a frameshift mutation in the pks15/1 gene. J Biol Chem 277,

38148-38158.


Coscolla, M. & Gagneux, S. (2010). Does M. tuberculosis genomic diversity explain

disease diversity? Drug Discov Today Dis Mech 7, e43-e59.

Cowley, D., Govender, D., February, B., Wolfe, M., Steyn, L., Evans, J., Wilkinson,

R. J. & Nicol, M. P. (2008). Recent and rapid emergence of W-Beijing strains of

Mycobacterium tuberculosis in Cape Town, South Africa. Clin Infect Dis 47, 1252-

1259.

Cox, M. P., Peterson, D. A. & Biggs, P. J. (2010). SolexaQA: At-a-glance quality

assessment of Illumina second-generation sequencing data. BMC Bioinformatics 11,

485.

Croucher, N. J., Fookes, M. C., Perkins, T. T., Turner, D. J., Marguerat, S. B.,

Keane, T., Quail, M. A., He, M., Assefa, S., Bahler, J., Kingsley, R. A., Parkhill, J.,

Bentley, S. D., Dougan, G. & Thomson, N. R. (2009). A simple method for directional

transcriptome sequencing using Illumina technology. Nucleic Acids Res 37, e148.

Croucher, N. J. & Thomson, N. R. (2010). Studying bacterial transcriptomes using

RNA-seq. Curr Opin Microbiol 13, 619-624.

Daniel, T. M. (1997). Captain of death: the story of tuberculosis. Rochester, NY:

University of Rochester Press.

de Jong, B. C., Hill, P. C., Aiken, A., Awine, T., Antonio, M., Adetifa, I. M.,

Jackson-Sillah, D. J., Fox, A., Deriemer, K., Gagneux, S., Borgdorff, M. W.,

McAdam, K. P., Corrah, T., Small, P. M. & Adegbola, R. A. (2008). Progression to

active tuberculosis, but not transmission, varies by Mycobacterium tuberculosis lineage

in The Gambia. J Infect Dis 198, 1037-1043.

de Jong, B. C., Antonio, M., Awine, T., Ogungbemi, K., de Jong, Y. P., Gagneux, S.,

DeRiemer, K., Zozio, T., Rastogi, N., Borgdorff, M., Hill, P. C. & Adegbola, R. A.

(2009). Use of spoligotyping and large sequence polymorphisms to study the population

structure of the Mycobacterium tuberculosis complex in a cohort study of consecutive

smear-positive tuberculosis cases in The Gambia. J Clin Microbiol 47, 994-1001.


de Jong, B. C., Antonio, M. & Gagneux, S. (2010). Mycobacterium africanum--review

of an important cause of human tuberculosis in West Africa. PLoS Negl Trop Dis 4,

e744.

de la Rua-Domenech, R. (2006). Human Mycobacterium bovis infection in the United

Kingdom: Incidence, risks, control measures and review of the zoonotic aspects of

bovine tuberculosis. Tuberculosis (Edinb) 86, 77-109.

Dillies, M. A., Rau, A., Aubert, J., Hennequet-Antier, C., Jeanmougin, M., Servant,

N., Keime, C., Marot, G., Castel, D., Estelle, J., Guernec, G., Jagla, B., Jouneau, L.,

Laloe, D., Le Gall, C., Schaeffer, B., Le Crom, S., Guedj, M. & Jaffrezic, F. (2012).

A comprehensive evaluation of normalization methods for Illumina high-throughput

RNA sequencing data analysis. Brief Bioinform.

Domenech, P. & Reed, M. B. (2009). Rapid and spontaneous loss of phthiocerol

dimycocerosate (PDIM) from Mycobacterium tuberculosis grown in vitro: implications

for virulence studies. Microbiology 155, 3532-3543.

Domenech, P., Kolly, G. S., Leon-Solis, L., Fallow, A. & Reed, M. B. (2010).

Massive gene duplication event among clinical isolates of the Mycobacterium

tuberculosis W/Beijing family. J Bacteriol 192, 4562-4570.

Donoghue, H. D., Spigelman, M., Greenblatt, C. L., Lev-Maor, G., Bar-Gal, G. K.,

Matheson, C., Vernon, K., Nerlich, A. G. & Zink, A. R. (2004). Tuberculosis: from

prehistory to Robert Koch, as revealed by ancient DNA. Lancet Infect Dis 4, 584-592.

Ellis, R. C. & Zabrowarny, L. A. (1993). Safer staining method for acid fast bacilli. J

Clin Pathol 46, 559-560.

Evans, J. T., Smith, E. G., Banerjee, A., Smith, R. M., Dale, J., Innes, J. A., Hunt,

D., Tweddell, A., Wood, A., Anderson, C., Hewinson, R. G., Smith, N. H., Hawkey,

P. M. & Sonnenberg, P. (2007). Cluster of human tuberculosis caused by

Mycobacterium bovis: evidence for person-to-person transmission in the UK. Lancet

369, 1270-1276.


Evans, W. E. & Relling, M. V. (1999). Pharmacogenomics: translating functional

genomics into rational therapeutics. Science 286, 487-491.

Filiatrault, M. J., Stodghill, P. V., Myers, C. R., Bronstein, P. A., Butcher, B. G.,

Lam, H., Grills, G., Schweitzer, P., Wang, W., Schneider, D. J. & Cartinhour, S. W.

(2011). Genome-wide identification of transcriptional start sites in the plant pathogen

Pseudomonas syringae pv. tomato str. DC3000. PLoS One 6, e29335.

Filliol, I., Driscoll, J. R., van Soolingen, D., Kreiswirth, B. N., Kremer, K.,

Valetudie, G., Dang, D. A., Barlow, R., Banerjee, D., Bifani, P. J., Brudey, K.,

Cataldi, A., Cooksey, R. C., Cousins, D. V., Dale, J. W., Dellagostin, O. A.,

Drobniewski, F., Engelmann, G., Ferdinand, S., Gascoyne-Binzi, D., Gordon, M.,

Gutierrez, M. C., Haas, W. H., Heersma, H., Kassa-Kelembho, E., Ho, M. L.,

Makristathis, A., Mammina, C., Martin, G., Mostrom, P., Mokrousov, I.,

Narbonne, V., Narvskaya, O., Nastasi, A., Niobe-Eyangoh, S. N., Pape, J. W.,

Rasolofo-Razanamparany, V., Ridell, M., Rossetti, M. L., Stauffer, F., Suffys, P. N.,

Takiff, H., Texier-Maugein, J., Vincent, V., de Waard, J. H., Sola, C. & Rastogi, N.

(2003). Snapshot of moving and expanding clones of Mycobacterium tuberculosis and

their global distribution assessed by spoligotyping in an international study. J Clin

Microbiol 41, 1963-1970.

Firdessa, R., Berg, S., Hailu, E., Schelling, E., Gumi, B., Erenso, G., Gadisa, E.,

Kiros, T., Habtamu, M., Hussein, J., Zinsstag, J., Robertson, B. D., Ameni, G.,

Lohan, A., Loftus, B., Comas, I., Gagneux, S., Tschopp, R., Yamuah, L., Hewinson,

G., Gordon, S. V., Young, D. B. & Aseffa, A. (2013). Mycobacterial lineages causing

pulmonary and extrapulmonary tuberculosis, Ethiopia. Emerg Infect Dis 19, 460-463.

Fisher, M. A., Plikaytis, B. B. & Shinnick, T. M. (2002). Microarray analysis of the

Mycobacterium tuberculosis transcriptional response to the acidic conditions found in

phagosomes. J Bacteriol 184, 4025-4032.

Fleischmann, R. D., Alland, D., Eisen, J. A., Carpenter, L., White, O., Peterson, J.,

DeBoy, R., Dodson, R., Gwinn, M., Haft, D., Hickey, E., Kolonay, J. F., Nelson, W.

C., Umayam, L. A., Ermolaeva, M., Salzberg, S. L., Delcher, A., Utterback, T.,

Weidman, J., Khouri, H., Gill, J., Mikula, A., Bishai, W., Jacobs, W. R., Jr.,


Venter, J. C. & Fraser, C. M. (2002). Whole-genome comparison of Mycobacterium

tuberculosis clinical and laboratory strains. J Bacteriol 184, 5479-5490.

Forrellad, M. A., Klepp, L. I., Gioffre, A., Sabio, Y. G. J., Morbidoni, H. R.,

Santangelo, M. D., Cataldi, A. A. & Bigi, F. (2012). Virulence factors of the

Mycobacterium tuberculosis complex. Virulence 4.

Fozo, E. M., Hemm, M. R. & Storz, G. (2008). Small toxic proteins and the antisense

RNAs that repress them. Microbiol Mol Biol Rev 72, 579-589.

Gagneux, S., DeRiemer, K., Van, T., Kato-Maeda, M., de Jong, B. C., Narayanan,

S., Nicol, M., Niemann, S., Kremer, K., Gutierrez, M. C., Hilty, M., Hopewell, P. C.

& Small, P. M. (2006a). Variable host-pathogen compatibility in Mycobacterium

tuberculosis. Proc Natl Acad Sci U S A 103, 2869-2873.

Gagneux, S., Long, C. D., Small, P. M., Van, T., Schoolnik, G. K. & Bohannan, B.

J. M. (2006b). The competitive cost of antibiotic resistance in Mycobacterium

tuberculosis. Science 312, 1944-1946.

Gagneux, S. & Small, P. M. (2007). Global phylogeography of Mycobacterium

tuberculosis and implications for tuberculosis product development. Lancet Infect Dis 7,

328-337.

Gagneux, S. (2012). Host-pathogen coevolution in human tuberculosis. Philos Trans R

Soc Lond B Biol Sci 367, 850-859.

Gao, Q., Kripke, K. E., Saldanha, A. J., Yan, W., Holmes, S. & Small, P. M. (2005).

Gene expression diversity among Mycobacterium tuberculosis clinical isolates.

Microbiology 151, 5-14.

Garnier, T., Eiglmeier, K., Camus, J. C., Medina, N., Mansoor, H., Pryor, M.,

Duthoy, S., Grondin, S., Lacroix, C., Monsempe, C., Simon, S., Harris, B., Atkin,

R., Doggett, J., Mayes, R., Keating, L., Wheeler, P. R., Parkhill, J., Barrell, B. G.,

Cole, S. T., Gordon, S. V. & Hewinson, R. G. (2003). The complete genome sequence

of Mycobacterium bovis. Proc Natl Acad Sci U S A 100, 7877-7882.


Gerasimova, A., Kazakov, A. E., Arkin, A. P., Dubchak, I. & Gelfand, M. S. (2011).

Comparative genomics of the dormancy regulons in mycobacteria. J Bacteriol 193,

3446-3452.

Gillespie, J. J., Wattam, A. R., Cammer, S. A., Gabbard, J. L., Shukla, M. P.,

Dalay, O., Driscoll, T., Hix, D., Mane, S. P., Mao, C., Nordberg, E. K., Scott, M.,

Schulman, J. R., Snyder, E. E., Sullivan, D. E., Wang, C., Warren, A., Williams, K.

P., Xue, T., Yoo, H. S., Zhang, C., Zhang, Y., Will, R., Kenyon, R. W. & Sobral, B.

W. (2011). PATRIC: the comprehensive bacterial bioinformatics resource with a focus

on human pathogenic species. Infect Immun 79, 4286-4298.

Glynn, J. R., Whiteley, J., Bifani, P. J., Kremer, K. & van Soolingen, D. (2002).

Worldwide occurrence of Beijing/W strains of Mycobacterium tuberculosis: a

systematic review. Emerg Infect Dis 8, 843-849.

Golby, P., Hatch, K. A., Bacon, J., Cooney, R., Riley, P., Allnutt, J., Hinds, J.,

Nunez, J., Marsh, P. D., Hewinson, R. G. & Gordon, S. V. (2007). Comparative

transcriptomics reveals key gene expression differences between the human and bovine

pathogens of the Mycobacterium tuberculosis complex. Microbiology 153, 3323-3336.

Goldman, D. S. (1963). Enzyme Systems in the Mycobacteria. Xv. Initial Steps in the

Metabolism of Glycerol. J Bacteriol 86, 30-37.

Grange, J. M. (2001). Mycobacterium bovis infection in human beings. Tuberculosis

(Edinb) 81, 71-77.

Grissa, I., Vergnaud, G. & Pourcel, C. (2008). CRISPRcompar: a website to compare

clustered regularly interspaced short palindromic repeats. Nucleic Acids Res 36, W145-

148.

Gurcha, S. S., Baulard, A. R., Kremer, L., Locht, C., Moody, D. B., Muhlecker, W.,

Costello, C. E., Crick, D. C., Brennan, P. J. & Besra, G. S. (2002). Ppm1, a novel

polyprenol monophosphomannose synthase from Mycobacterium tuberculosis. Biochem

J 365, 441-450.

Gustafsson, C., Govindarajan, S. & Minshull, J. (2004). Codon bias and heterologous

protein expression. Trends Biotechnol 22, 346-353.

Gutierrez, M. C., Brisse, S., Brosch, R., Fabre, M., Omaïs, B., Marmiesse, M.,

Supply, P. & Vincent, V. (2005). Ancient origin and gene mosaicism of the progenitor

of Mycobacterium tuberculosis. PLoS Path 1, e5.

Hardcastle, T. J. & Kelly, K. A. (2010). baySeq: empirical Bayesian methods for

identifying differential expression in sequence count data. BMC Bioinformatics 11, 422.

Harris, S. R., Feil, E. J., Holden, M. T., Quail, M. A., Nickerson, E. K., Chantratita,

N., Gardete, S., Tavares, A., Day, N., Lindsay, J. A., Edgeworth, J. D., de

Lencastre, H., Parkhill, J., Peacock, S. J. & Bentley, S. D. (2010). Evolution of

MRSA during hospital transmission and intercontinental spread. Science (New York, NY)

327, 469-474.

Hawley, D. K. & McClure, W. R. (1983). Compilation and analysis of Escherichia coli

promoter DNA sequences. Nucleic Acids Res 11, 2237-2255.

He, M., Sebaihia, M., Lawley, T. D., Stabler, R. A., Dawson, L. F., Martin, M. J.,

Holt, K. E., Seth-Smith, H. M., Quail, M. A., Rance, R., Brooks, K., Churcher, C.,

Harris, D., Bentley, S. D., Burrows, C., Clark, L., Corton, C., Murray, V., Rose, G.,

Thurston, S., van Tonder, A., Walker, D., Wren, B. W., Dougan, G. & Parkhill, J.

(2010). Evolutionary dynamics of Clostridium difficile over short and long time scales.

Proceedings of the National Academy of Sciences of the United States of America.

Hendrix, R. W., Smith, M. C., Burns, R. N., Ford, M. E. & Hatfull, G. F. (1999).

Evolutionary relationships among diverse bacteriophages and prophages: all the world's

a phage. Proceedings of the National Academy of Sciences of the United States of

America 96, 2192-2197.

Heng, L. (2008). MAQ: Mapping and Assembly with Qualities.

Hershberg, R., Lipatov, M., Small, P. M., Sheffer, H., Niemann, S., Homolka, S.,

Roach, J. C., Kremer, K., Petrov, D. A., Feldman, M. W. & Gagneux, S. (2008).

High functional diversity in Mycobacterium tuberculosis driven by genetic drift and

human demography. PLoS Biol 6, e311.

Hershberg, R. & Petrov, D. A. (2010). Evidence that mutation is universally biased

towards AT in bacteria. PLoS Genet 6.

Heym, B., Alzari, P. M., Honore, N. & Cole, S. T. (1995). Missense mutations in the

catalase-peroxidase gene, katG, are associated with isoniazid resistance in

Mycobacterium tuberculosis. Mol Microbiol 15, 235-245.

Hillemann, D., Rusch-Gerdes, S. & Richter, E. (2007). Evaluation of the GenoType

MTBDRplus assay for rifampin and isoniazid susceptibility testing of Mycobacterium

tuberculosis strains and clinical specimens. J Clin Microbiol 45, 2635-2640.

Hirsh, A. E., Tsolaki, A. G., DeRiemer, K., Feldman, M. W. & Small, P. M. (2004).

Stable association between strains of Mycobacterium tuberculosis and their human host

populations. Proc Natl Acad Sci U S A 101, 4871-4876.

Ho, D. D., Neumann, A. U., Perelson, A. S., Chen, W., Leonard, J. M. &

Markowitz, M. (1995). Rapid turnover of plasma virions and CD4 lymphocytes in

HIV-1 infection. Nature 373, 123-126.

Holt, K. E., Parkhill, J., Mazzoni, C. J., Roumagnac, P., Weill, F. X., Goodhead, I.,

Rance, R., Baker, S., Maskell, D. J., Wain, J., Dolecek, C., Achtman, M. & Dougan,

G. (2008). High-throughput sequencing provides insights into genome variation and

evolution in Salmonella Typhi. Nat Genet 40, 987-993.

Homolka, S., Köser, C., Archer, J., Rüsch-Gerdes, S. & Niemann, S. (2009). Single-

nucleotide polymorphisms in Rv2629 are specific for Mycobacterium tuberculosis

genotypes Beijing and Ghana but not associated with rifampin resistance. J Clin

Microbiol 47, 223-226.

Homolka, S., Niemann, S., Russell, D. G. & Rohde, K. H. (2010). Functional genetic

diversity among Mycobacterium tuberculosis complex clinical isolates: delineation of

conserved core and lineage-specific transcriptomes during intracellular survival. PLoS

Path 6, e1000988.

Iafrate, A. J., Feuk, L., Rivera, M. N., Listewnik, M. L., Donahoe, P. K., Qi, Y.,

Scherer, S. W. & Lee, C. (2004). Detection of large-scale variation in the human

genome. Nat Genet 36, 949-951.

Ioerger, T. R., Feng, Y., Ganesula, K., Chen, X., Dobos, K. M., Fortune, S., Jacobs,

W. R., Mizrahi, V., Parish, T., Rubin, E., Sassetti, C. & Sacchettini, J. C. (2010).

Variation among genome sequences of H37Rv strains of Mycobacterium tuberculosis

from multiple laboratories. J Bacteriol 192, 3645-3653.

Jones, T. F., Craig, A. S., Valway, S. E., Woodley, C. L. & Schaffner, W. (1999).

Transmission of tuberculosis in a jail. Ann Intern Med 131, 557-563.

Jordan, I. K., Rogozin, I. B., Wolf, Y. I. & Koonin, E. V. (2002). Essential genes are

more evolutionarily conserved than are nonessential genes in bacteria. Genome Res 12,

962-968.

Kahla, I. B., Henry, M., Boukadida, J. & Drancourt, M. (2011). Pyrosequencing

assay for rapid identification of Mycobacterium tuberculosis complex species. BMC Res

Notes 4, 423.

Kane, M. D., Jatkoe, T. A., Stumpf, C. R., Lu, J., Thomas, J. D. & Madore, S. J.

(2000). Assessment of the sensitivity and specificity of oligonucleotide (50mer)

microarrays. Nucleic Acids Res 28, 4552-4557.

Kaplan, G., Post, F. A., Moreira, A. L., Wainwright, H., Kreiswirth, B. N.,

Tanverdi, M., Mathema, B., Ramaswamy, S. V., Walther, G., Steyn, L. M., Barry,

C. E., 3rd & Bekker, L. G. (2003). Mycobacterium tuberculosis growth at the cavity

surface: a microenvironment with failed immunity. Infect Immun 71, 7099-7108.

Kawano, M., Aravind, L. & Storz, G. (2007). An antisense RNA controls synthesis of

an SOS-induced toxin evolved from an antitoxin. Mol Microbiol 64, 738-754.

Keating, L. A., Wheeler, P. R., Mansoor, H., Inwald, J. K., Dale, J., Hewinson, R.

G. & Gordon, S. V. (2005). The pyruvate requirement of some members of the

Mycobacterium tuberculosis complex is due to an inactive pyruvate kinase: implications

for in vivo growth. Mol Microbiol 56, 163-174.

Keinan, A. & Clark, A. G. (2012). Recent explosive human population growth has

resulted in an excess of rare genetic variants. Science 336, 740-743.

Kelley, L. A. & Sternberg, M. J. (2009). Protein structure prediction on the Web: a

case study using the Phyre server. Nat Protoc 4, 363-371.

Kibota, T. T. & Lynch, M. (1996). Estimate of the genomic mutation rate deleterious

to overall fitness in E. coli. Nature 381, 694-696.

Kimchi-Sarfaty, C., Oh, J. M., Kim, I. W., Sauna, Z. E., Calcagno, A. M.,

Ambudkar, S. V. & Gottesman, M. M. (2007). A "silent" polymorphism in the MDR1

gene changes substrate specificity. Science 315, 525-528.

Kimura, M. (1977). Preponderance of synonymous changes as evidence for the neutral

theory of molecular evolution. Nature 267, 275-276.

Komar, A. A. (2007). Silent SNPs: impact on gene function and phenotype.

Pharmacogenomics 8, 1075-1080.

Kong, Y., Cave, M. D., Yang, D., Zhang, L., Marrs, C. F., Foxman, B., Bates, J. H.,

Wilson, F., Mukasa, L. N. & Yang, Z. H. (2005). Distribution of insertion- and

deletion-associated genetic polymorphisms among four Mycobacterium tuberculosis

phospholipase C genes and associations with extrathoracic tuberculosis: a population-

based study. J Clin Microbiol 43, 6048-6053.

Kong, Y., Cave, M. D., Zhang, L., Foxman, B., Marrs, C. F., Bates, J. H. & Yang,

Z. H. (2006). Population-based study of deletions in five different genomic regions of

Mycobacterium tuberculosis and possible clinical relevance of the deletions. J Clin

Microbiol 44, 3940-3946.

Kong, Y., Cave, M. D., Zhang, L., Foxman, B., Marrs, C. F., Bates, J. H. & Yang,

Z. H. (2007). Association between Mycobacterium tuberculosis Beijing/W lineage strain

infection and extrathoracic tuberculosis: Insights from epidemiologic and clinical

characterization of the three principal genetic groups of M. tuberculosis clinical isolates.

J Clin Microbiol 45, 409-414.

Korber, B. (2000). HIV Signature and Sequence Variation Analysis. In Computational

Analysis of HIV Molecular Sequences, pp. 55-72. Edited by A. G. Rodrigo & G. H.

Learn: Kluwer Academic Publishers, Dordrecht, Netherlands.

Koser, C. U., Summers, D. K. & Archer, J. A. (2011). Thr270Ile in embC (Rv3793) is

not a marker for ethambutol resistance in the Mycobacterium tuberculosis complex.

Antimicrob Agents Chemother 55, 1825.

Kryazhimskiy, S. & Plotkin, J. B. (2008). The population genetics of dN/dS. PLoS

Genet 4, e1000304.

Kumar, A., Toledo, J. C., Patel, R. P., Lancaster, J. R., Jr. & Steyn, A. J. (2007).

Mycobacterium tuberculosis DosS is a redox sensor and DosT is a hypoxia sensor. Proc

Natl Acad Sci U S A 104, 11568-11573.

Laing, R. E., Hess, P., Shen, Y., Wang, J. & Hu, S. X. (2011). The role and impact of

SNPs in pharmacogenomics and personalized medicine. Curr Drug Metab 12, 460-486.

Lasa, I., Toledo-Arana, A., Dobin, A., Villanueva, M., de los Mozos, I. R., Vergara-

Irigaray, M., Segura, V., Fagegaltier, D., Penades, J. R., Valle, J., Solano, C. &

Gingeras, T. R. (2011). Genome-wide antisense transcription drives mRNA processing

in bacteria. Proc Natl Acad Sci U S A 108, 20172-20177.

Lehner, B. (2011). Molecular mechanisms of epistasis within and between genes.

Trends Genet 27, 323-331.

Lew, J. M., Kapopoulou, A., Jones, L. M. & Cole, S. T. (2011). TubercuList--10 years

after. Tuberculosis (Edinb) 91, 1-7.

Li, H., Ruan, J. & Durbin, R. (2008). Mapping short DNA sequencing reads and

calling variants using mapping quality scores. Genome Res 18, 1851-1858.

Li, H. & Durbin, R. (2009). Fast and accurate short read alignment with Burrows-

Wheeler transform. Bioinformatics 25, 1754-1760.

Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G.,

Abecasis, G. & Durbin, R. (2009). The Sequence Alignment/Map format and

SAMtools. Bioinformatics 25, 2078-2079.

Lindstedt, B. A. (2005). Multiple-locus variable number tandem repeats analysis for

genetic fingerprinting of pathogenic bacteria. Electrophoresis 26, 2567-2582.

Liu, X., Gutacker, M. M., Musser, J. M. & Fu, Y. X. (2006). Evidence for

recombination in Mycobacterium tuberculosis. J Bacteriol 188, 8169-8177.

Liveris, D., Schwartz, J. J., Geertman, R. & Schwartz, I. (1993). Molecular cloning

and sequencing of infC, the gene encoding translation initiation factor IF3, from four

enterobacterial species. FEMS Microbiol Lett 112, 211-216.

Loman, N., Constantinidou, C., Chan, J. Z., Halachev, M., Sergeant, M., Penn, C.,

Robinson, E. & Pallen, M. (2012). High-throughput bacterial genome sequencing: an

embarrassment of choice, a world of opportunity. Nature reviews Microbiology.

Madigan, M. T., Martinko, J. M. & Parker, J. (2003). Brock Biology of

Microorganisms, 10th ed. Pearson Education.

Makarova, K. S., Wolf, Y. I. & Koonin, E. V. (2009). Comprehensive comparative-

genomic analysis of type 2 toxin-antitoxin systems and related mobile stress response

systems in prokaryotes. Biol Direct 4, 19.

Malen, H., Berven, F. S., Fladmark, K. E. & Wiker, H. G. (2007). Comprehensive

analysis of exported proteins from Mycobacterium tuberculosis H37Rv. Proteomics 7,

1702-1718.

Malys, N. & McCarthy, J. E. (2011). Translation initiation: variations in the

mechanism can be anticipated. Cell Mol Life Sci 68, 991-1003.

Manca, C., Tsenova, L., Barry, C. E., 3rd, Bergtold, A., Freeman, S., Haslett, P. A.,

Musser, J. M., Freedman, V. H. & Kaplan, G. (1999). Mycobacterium tuberculosis

CDC1551 induces a more vigorous host response in vivo and in vitro, but is not more

virulent than other clinical isolates. J Immunol 162, 6740-6746.

Manca, C., Tsenova, L., Bergtold, A., Freeman, S., Tovey, M., Musser, J. M.,

Barry, C. E., 3rd, Freedman, V. H. & Kaplan, G. (2001). Virulence of a

Mycobacterium tuberculosis clinical isolate in mice is determined by failure to induce

Th1 type immunity and is associated with induction of IFN-alpha/beta. Proc Natl Acad

Sci U S A 98, 5752-5757.

Manca, C., Tsenova, L., Freeman, S., Barczak, A. K., Tovey, M., Murray, P. J.,

Barry, C. & Kaplan, G. (2005). Hypervirulent M. tuberculosis W/Beijing strains

upregulate type I IFNs and increase expression of negative regulators of the Jak-Stat

pathway. J Interferon Cytokine Res 25, 694-701.

Mao, C., Shukla, M., Larrouy-Maumus, G., Dix, F. L., Kelley, L. A., Sternberg, M.

J., Sobral, B. W. & de Carvalho, L. P. (2012). Functional assignment of

Mycobacterium tuberculosis proteome revealed by genome-scale fold-recognition.

Tuberculosis (Edinb).

Marguerat, S. & Bähler, J. (2010). RNA-seq: from technology to biology. Cell Mol

Life Sci.

McEvoy, C. R., Cloete, R., Muller, B., Schurch, A. C., van Helden, P. D., Gagneux,

S., Warren, R. M. & Gey van Pittius, N. C. (2012). Comparative analysis of

Mycobacterium tuberculosis pe and ppe genes reveals high sequence variation and an

apparent absence of selective constraints. PLoS One 7, e30593.

McNerney, R., Maeurer, M., Abubakar, I., Marais, B., McHugh, T. D., Ford, N.,

Weyer, K., Lawn, S., Grobusch, M. P., Memish, Z., Squire, S. B., Pantaleo, G.,

Chakaya, J., Casenghi, M., Migliori, G. B., Mwaba, P., Zijenah, L., Hoelscher, M.,

Cox, H., Swaminathan, S., Kim, P. S., Schito, M., Harari, A., Bates, M., Schwank,

S., O'Grady, J., Pletschette, M., Ditui, L., Atun, R. & Zumla, A. (2012).

Tuberculosis diagnostics and biomarkers: needs, challenges, recent advances, and

opportunities. J Infect Dis 205 Suppl 2, S147-158.

Meyer, M. & Kircher, M. (2010). Illumina sequencing library preparation for highly

multiplexed target capture and sequencing. Cold Spring Harb Protoc 2010,

pdb.prot5448.

Micklinghoff, J. C., Breitinger, K. J., Schmidt, M., Geffers, R., Eikmanns, B. J. &

Bange, F. C. (2009). Role of the transcriptional regulator RamB (Rv0465c) in the

control of the glyoxylate cycle in Mycobacterium tuberculosis. J Bacteriol 191, 7260-

7269.

Miller, M. P. & Kumar, S. (2001). Understanding human disease mutations through

the use of interspecific genetic variation. Hum Mol Genet 10, 2319-2328.

Minnikin, D. E., Minnikin, S. M., Dobson, G., Goodfellow, M., Portaels, F., van den

Breen, L. & Sesardic, D. (1983). Mycolic acid patterns of four vaccine strains of

Mycobacterium bovis BCG. J Gen Microbiol 129, 889-891.

Mitchison, D. A., Wallace, J. G., Bhatia, A. L., Selkon, J. B., Subbaiah, T. V. &

Lancaster, M. C. (1960). A comparison of the virulence in guinea-pigs of South Indian

and British tubercle bacilli. Tubercle 41, 1-22.

Mitchison, D. A., Selkon, J. B. & Lloyd, J. (1963). Virulence in the Guinea-Pig,

Susceptibility to Hydrogen Peroxide, and Catalase Activity of Isoniazid-Sensitive

Tubercle Bacilli from South Indian and British Patients. J Pathol Bacteriol 86, 377-386.

Mooney, S. (2005). Bioinformatics approaches and resources for single nucleotide

polymorphism functional analysis. Brief Bioinformatics 6, 44-56.

Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L. & Wold, B. (2008).

Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5, 621-

628.

Movahedzadeh, F., Smith, D. A., Norman, R. A., Dinadayala, P., Murray-Rust, J.,

Russell, D. G., Kendall, S. L., Rison, S. C., McAlister, M. S., Bancroft, G. J.,

McDonald, N. Q., Daffe, M., Av-Gay, Y. & Stoker, N. G. (2004). The Mycobacterium

tuberculosis ino1 gene is essential for growth and virulence. Mol Microbiol 51, 1003-

1014.

Muller, B., Borrell, S., Rose, G. & Gagneux, S. (2013). The heterogeneous evolution

of multidrug-resistant Mycobacterium tuberculosis. Trends Genet.

Müller, B., Streicher, E. M., Hoek, K. G., Tait, M., Trollip, A., Bosman, M. E.,

Coetzee, G. J., Chabula-Nxiweni, E. M., Hoosain, E., Gey van Pittius, N. C., Victor,

T. C., van Helden, P. D. & Warren, R. M. (2011). inhA promoter mutations: a

gateway to extensively drug-resistant tuberculosis in South Africa? The international

journal of tuberculosis and lung disease : the official journal of the International Union

against Tuberculosis and Lung Disease 15, 344-351.

Musser, J. M., Kapur, V., Williams, D. L., Kreiswirth, B. N., van Soolingen, D. &

van Embden, J. D. (1996). Characterization of the catalase-peroxidase gene (katG) and

inhA locus in isoniazid-resistant and -susceptible strains of Mycobacterium tuberculosis

by automated DNA sequencing: restricted array of mutations associated with drug

resistance. J Infect Dis 173, 196-202.

Musser, J. M., Amin, A. & Ramaswamy, S. (2000). Negligible genetic diversity of

Mycobacterium tuberculosis host immune system protein targets: evidence of limited

selective pressure. Genetics 155, 7-16.

Nerlich, A. G., Haas, C. J., Zink, A., Szeimies, U. & Hagedorn, H. G. (1997).

Molecular evidence for tuberculosis in an ancient Egyptian mummy. Lancet 350, 1404.

Newton-Foot, M. & Gey van Pittius, N. C. (2012). The complex architecture of

mycobacterial promoters. Tuberculosis (Edinb).

Ng, P. C. & Henikoff, S. (2001). Predicting deleterious amino acid substitutions.

Genome Res 11, 863-874.

Ng, P. C. & Henikoff, S. (2003). SIFT: Predicting amino acid changes that affect

protein function. Nucleic Acids Res 31, 3812-3814.

Ng, P. C. & Henikoff, S. (2006). Predicting the effects of amino acid substitutions on

protein function. Annual review of genomics and human genetics 7, 61-80.

Nicol, M. P. & Wilkinson, R. J. (2008). The clinical consequences of strain diversity in

Mycobacterium tuberculosis. Trans R Soc Trop Med Hyg 102, 955-965.

Pandey, D. P. & Gerdes, K. (2005). Toxin-antitoxin loci are highly abundant in free-

living but lost from host-associated prokaryotes. Nucleic Acids Res 33, 966-976.

Parish, T. & Stoker, N. G. (2001). Mycobacterium tuberculosis protocols. Totowa, NJ:

Humana Press.

Parsons, L. M., Brosch, R., Cole, S. T., Somoskovi, A., Loder, A., Bretzel, G., Van

Soolingen, D., Hale, Y. M. & Salfinger, M. (2002). Rapid and simple approach for

identification of Mycobacterium tuberculosis complex isolates by PCR-based genomic

deletion analysis. J Clin Microbiol 40, 2339-2345.

Parsons, S., Smith, S. G., Martins, Q., Horsnell, W. G., Gous, T. A., Streicher, E.

M., Warren, R. M., van Helden, P. D. & Gey van Pittius, N. C. (2008). Pulmonary

infection due to the dassie bacillus (Mycobacterium tuberculosis complex sp.) in a free-

living dassie (rock hyrax-Procavia capensis) from South Africa. Tuberculosis (Edinb)

88, 80-83.

Parthiban, V., Gromiha, M. M. & Schomburg, D. (2006). CUPSAT: prediction of

protein stability upon point mutations. Nucleic Acids Res 34, W239-242.

Parwati, I., van Crevel, R. & van Soolingen, D. (2010). Possible underlying

mechanisms for successful emergence of the Mycobacterium tuberculosis Beijing

genotype strains. Lancet Infect Dis 10, 103-111.

Perkins, T. T., Kingsley, R. A., Fookes, M. C., Gardner, P. P., James, K. D., Yu, L.,

Assefa, S. A., He, M., Croucher, N. J., Pickard, D. J., Maskell, D. J., Parkhill, J.,

Choudhary, J., Thomson, N. R. & Dougan, G. (2009). A strand-specific RNA-Seq

analysis of the transcriptome of the typhoid bacillus Salmonella typhi. PLoS Genet 5,

e1000569.

Plotkin, J. B. & Kudla, G. (2011). Synonymous but not the same: the causes and

consequences of codon bias. Nat Rev Genet 12, 32-42.

Portevin, D., Gagneux, S., Comas, I. & Young, D. (2011). Human macrophage

responses to clinical isolates from the Mycobacterium tuberculosis complex discriminate

between ancient and modern lineages. PLoS Pathog 7, e1001307.

Projahn, M., Koser, C., Homolka, S., Summers, D., Archer, J. & Niemann, S.

(2011). Polymorphisms in Isoniazid and Prothionamide Resistance Genes of the

Mycobacterium tuberculosis Complex. Antimicrobial agents and chemotherapy 55,

4408-4411.

Punta, M., Coggill, P. C., Eberhardt, R. Y., Mistry, J., Tate, J., Boursnell, C., Pang,

N., Forslund, K., Ceric, G., Clements, J., Heger, A., Holm, L., Sonnhammer, E. L.,

Eddy, S. R., Bateman, A. & Finn, R. D. (2012). The Pfam protein families database.

Nucleic Acids Res 40, D290-301.

Qi, W., Kaser, M., Roltgen, K., Yeboah-Manu, D. & Pluschke, G. (2009). Genomic

diversity and evolution of Mycobacterium ulcerans revealed by next-generation

sequencing. PLoS Pathog 5, e1000580.

Qian, L., Van Embden, J. D., Van Der Zanden, A. G., Weltevreden, E. F., Duanmu,

H. & Douglas, J. T. (1999). Retrospective analysis of the Beijing family of

Mycobacterium tuberculosis in preserved lung tissues. J Clin Microbiol 37, 471-474.

Quinlan, A. R. & Hall, I. M. (2010). BEDTools: a flexible suite of utilities for

comparing genomic features. Bioinformatics 26, 841-842.

Raghavan, R., Sloan, D. B. & Ochman, H. (2012). Antisense transcription is pervasive

but rarely conserved in enteric bacteria. MBio 3.

Ramaswamy, S. & Musser, J. M. (1998). Molecular genetic basis of antimicrobial

agent resistance in Mycobacterium tuberculosis: 1998 update. Tuber Lung Dis 79, 3-29.

Ramaswamy, S. V., Amin, A. G., Goksel, S., Stager, C. E., Dou, S. J., El Sahly, H.,

Moghazeh, S. L., Kreiswirth, B. N. & Musser, J. M. (2000). Molecular genetic

analysis of nucleotide polymorphisms associated with ethambutol resistance in human

isolates of Mycobacterium tuberculosis. Antimicrob Agents Chemother 44, 326-336.

Ramaswamy, S. V., Reich, R., Dou, S. J., Jasperse, L., Pan, X., Wanger, A.,

Quitugua, T. & Graviss, E. A. (2003). Single nucleotide polymorphisms in genes

associated with isoniazid resistance in Mycobacterium tuberculosis. Antimicrob Agents

Chemother 47, 1241-1250.

Ramensky, V., Bork, P. & Sunyaev, S. (2002). Human non-synonymous SNPs: server

and survey. Nucleic Acids Res 30, 3894.

Rao, V., Gao, F., Chen, B., Jacobs, W. R., Jr. & Glickman, M. S. (2006). Trans-

cyclopropanation of mycolic acids on trehalose dimycolate suppresses Mycobacterium

tuberculosis-induced inflammation and virulence. J Clin Invest 116, 1660-1667.

Reddy, T. B., Riley, R., Wymore, F., Montgomery, P., DeCaprio, D., Engels, R.,

Gellesch, M., Hubble, J., Jen, D., Jin, H., Koehrsen, M., Larson, L., Mao, M.,

Nitzberg, M., Sisk, P., Stolte, C., Weiner, B., White, J., Zachariah, Z. K., Sherlock,

G., Galagan, J. E., Ball, C. A. & Schoolnik, G. K. (2009). TB database: an integrated

platform for tuberculosis research. Nucleic Acids Res 37, D499-508.

Reed, M. B., Domenech, P., Manca, C., Su, H., Barczak, A. K., Kreiswirth, B. N.,

Kaplan, G. & Barry, C. E., 3rd (2004). A glycolipid of hypervirulent tuberculosis

strains that inhibits the innate immune response. Nature 431, 84-87.

Reed, M. B., Gagneux, S., Deriemer, K., Small, P. M. & Barry, C. E., 3rd (2007).

The W-Beijing lineage of Mycobacterium tuberculosis overproduces triglycerides and

has the DosR dormancy regulon constitutively upregulated. J Bacteriol 189, 2583-2589.

Reed, M. B., Pichler, V. K., McIntosh, F., Mattia, A., Fallow, A., Masala, S.,

Domenech, P., Zwerling, A., Thibert, L., Menzies, D., Schwartzman, K. & Behr, M.

A. (2009). Major Mycobacterium tuberculosis lineages associate with patient country of

origin. J Clin Microbiol 47, 1119-1128.

Riska, P. F., Jacobs, W. R., Jr. & Alland, D. (2000). Molecular determinants of drug

resistance in tuberculosis. Int J Tuberc Lung Dis 4, S4-10.

Robinson, D. A., Falush, D. & Feil, E. J. (2010a). Bacterial population genetics in

infectious disease. Hoboken, N.J.: Wiley-Blackwell.

Robinson, M. D., McCarthy, D. J. & Smyth, G. K. (2010b). edgeR: a Bioconductor

package for differential expression analysis of digital gene expression data.

Bioinformatics 26, 139-140.

Robinson, M. D. & Oshlack, A. (2010). A scaling normalization method for

differential expression analysis of RNA-seq data. Genome Biol 11, R25.

Rocha, E. P., Smith, J. M., Hurst, L. D., Holden, M. T., Cooper, J. E., Smith, N. H.

& Feil, E. J. (2006). Comparisons of dN/dS are time dependent for closely related

bacterial genomes. J Theor Biol 239, 226-235.

Roumagnac, P., Weill, F. X., Dolecek, C., Baker, S., Brisse, S., Chinh, N. T., Le, T.

A., Acosta, C. J., Farrar, J., Dougan, G. & Achtman, M. (2006). Evolutionary history

of Salmonella typhi. Science 314, 1301-1304.

Rozen, S. & Skaletsky, H. (2000). Primer3 on the WWW for general users and for

biologist programmers. Methods Mol Biol 132, 365-386.

Russell, D. G., Barry, C. E. & Flynn, J. L. (2010). Tuberculosis: what we don't know

can, and does, hurt us. Science (New York, NY) 328, 852-856.

Sala, C., Haouz, A., Saul, F., Miras, I., Rosenkrands, I., Alzari, P. & Cole, S. T.

(2009). Genome-wide regulon and crystal structure of BlaI (Rv1846c) from

Mycobacterium tuberculosis. Mol Microbiol 71, 1102-1116.

Salo, W. L., Aufderheide, A. C., Buikstra, J. & Holcomb, T. A. (1994). Identification

of Mycobacterium tuberculosis DNA in a pre-Columbian Peruvian mummy. Proc Natl

Acad Sci U S A 91, 2091-2094.

Sandgren, A., Strong, M., Muthukrishnan, P., Weiner, B. K., Church, G. M. &

Murray, M. B. (2009). Tuberculosis drug resistance mutation database. PLoS Med 6,

e2.

Sassetti, C. M., Boyd, D. H. & Rubin, E. J. (2003). Genes required for mycobacterial

growth defined by high density mutagenesis. Mol Microbiol 48, 77-84.

Sassetti, C. M. & Rubin, E. J. (2003). Genetic requirements for mycobacterial survival

during infection. Proc Natl Acad Sci U S A 100, 12989-12994.

Saunders, C. T. & Baker, D. (2002). Evaluation of structural and evolutionary

contributions to deleterious mutation prediction. J Mol Biol 322, 891-901.

Schena, M., Heller, R. A., Theriault, T. P., Konrad, K., Lachenmeier, E. & Davis,

R. W. (1998). Microarrays: biotechnology's discovery platform for functional genomics.

Trends Biotechnol 16, 301-306.

Schnell, R., Agren, D. & Schneider, G. (2008). 1.9 A structure of the signal receiver

domain of the putative response regulator NarL from Mycobacterium tuberculosis. Acta

Crystallogr Sect F Struct Biol Cryst Commun 64, 1096-1100.

Schulz, M. H., Zerbino, D. R., Vingron, M. & Birney, E. (2012). Oases: robust de

novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics

28, 1086-1092.

Schürch, A. C., Kremer, K., Warren, R. M., Hung, N. V., Zhao, Y., Wan, K.,

Boeree, M. J., Siezen, R. J., Smith, N. H. & van Soolingen, D. (2011). Mutations in

the regulatory network underlie the recent clonal expansion of a dominant subclone of

the Mycobacterium tuberculosis Beijing genotype. Infect, Genet Evol 11, 587-597.

Sharma, C. M., Hoffmann, S., Darfeuille, F., Reignier, J., Findeiss, S., Sittka, A.,

Chabas, S., Reiche, K., Hackermüller, J., Reinhardt, R., Stadler, P. F. & Vogel, J.

(2010a). The primary transcriptome of the major human pathogen Helicobacter pylori.

Nature 464, 250-255.

Sharma, C. M., Hoffmann, S., Darfeuille, F., Reignier, J., Findeiss, S., Sittka, A.,

Chabas, S., Reiche, K., Hackermuller, J., Reinhardt, R., Stadler, P. F. & Vogel, J.

(2010b). The primary transcriptome of the major human pathogen Helicobacter pylori.

Nature 464, 250-255.

Shendure, J. & Ji, H. (2008). Next-generation DNA sequencing. Nat Biotechnol 26,

1135-1145.

Sherman, D. R., Mdluli, K., Hickey, M. J., Arain, T. M., Morris, S. L., Barry, C. E.,

3rd & Stover, C. K. (1996). Compensatory ahpC gene expression in isoniazid-resistant

Mycobacterium tuberculosis. Science 272, 1641-1643.

Sherry, S. T., Ward, M. & Sirotkin, K. (1999). dbSNP-database for single nucleotide

polymorphisms and other classes of minor genetic variation. Genome Res 9, 677-679.

Singh, A., Jain, S., Gupta, S., Das, T. & Tyagi, A. K. (2003). mymA operon of

Mycobacterium tuberculosis: its regulation and importance in the cell envelope. FEMS

Microbiol Lett 227, 53-63.

Singh, A., Gupta, R., Vishwakarma, R. A., N, P. C., Ramanathan, V. D. & Tyagi, A.

K. (2005). Requirement of the mymA operon for appropriate cell wall ultrastructure and

persistence of Mycobacterium tuberculosis in the spleens of guinea pigs. J Bacteriol

187, 4173-4186.

Sinsimer, D., Huet, G., Manca, C., Tsenova, L., Koo, M. S., Kurepina, N., Kana, B.,

Mathema, B., Marras, S. A., Kreiswirth, B. N., Guilhot, C. & Kaplan, G. (2008).

The phenolic glycolipid of Mycobacterium tuberculosis differentially modulates the

early host cytokine response but does not in itself confer hypervirulence. Infect Immun

76, 3027-3036.

Smith, N. H., Gordon, S. V., de la Rua-Domenech, R., Clifton-Hadley, R. S. &

Hewinson, R. G. (2006a). Bottlenecks and broomsticks: the molecular evolution of

Mycobacterium bovis. Nat Rev Microbiol 4, 670-681.

Smith, N. H., Kremer, K., Inwald, J., Dale, J., Driscoll, J. R., Gordon, S. V., van

Soolingen, D., Hewinson, R. G. & Smith, J. M. (2006b). Ecotypes of the

Mycobacterium tuberculosis complex. J Theor Biol 239, 220-225.

Sreevatsan, S., Pan, X., Stockbauer, K. E., Connell, N. D., Kreiswirth, B. N.,

Whittam, T. S. & Musser, J. M. (1997a). Restricted structural gene polymorphism in

the Mycobacterium tuberculosis complex indicates evolutionarily recent global

dissemination. Proc Natl Acad Sci U S A 94, 9869-9874.

Sreevatsan, S., Stockbauer, K. E., Pan, X., Kreiswirth, B. N., Moghazeh, S. L.,

Jacobs, W. R., Jr., Telenti, A. & Musser, J. M. (1997b). Ethambutol resistance in

Mycobacterium tuberculosis: critical role of embB mutations. Antimicrob Agents

Chemother 41, 1677-1681.

Srivastava, S., Garg, A., Ayyagari, A., Nyati, K. K., Dhole, T. N. & Dwivedi, S. K.

(2006). Nucleotide polymorphism associated with ethambutol resistance in clinical

isolates of Mycobacterium tuberculosis. Curr Microbiol 53, 401-405.

Srivastava, S., Ayyagari, A., Dhole, T. N., Nyati, K. K. & Dwivedi, S. K. (2009). emb

nucleotide polymorphisms and the role of embB306 mutations in Mycobacterium

tuberculosis resistance to ethambutol. Int J Med Microbiol 299, 269-280.

Stahl, D. A. & Urbance, J. W. (1990). The division between fast- and slow-growing

species corresponds to natural relationships among the mycobacteria. J Bacteriol 172,

116-124.

Steenken, W. (1935). Lysis of tubercle bacilli in vitro. Proc Soc Exptl Biol Med 33,

253-255.

Stenson, P. D., Ball, E. V., Mort, M., Phillips, A. D., Shaw, K. & Cooper, D. N.

(2012). The Human Gene Mutation Database (HGMD) and its exploitation in the fields

of personalized genomics and molecular evolution. Curr Protoc Bioinformatics Chapter

1, Unit1 13.

Steyn, A. J. C., Joseph, J. & Bloom, B. R. (2003). Interaction of the sensor module of

Mycobacterium tuberculosis H37Rv KdpD with members of the Lpr family. Mol

Microbiol 47, 1075-1089.

Stitziel, N. O., Binkowski, T. A., Tseng, Y. Y., Kasif, S. & Liang, J. (2004). topoSNP:

a topographic database of non-synonymous single nucleotide polymorphisms with and

without known disease association. Nucleic Acids Res 32, D520-522.

Stucki, D. & Gagneux, S. (2012). Single nucleotide polymorphisms in Mycobacterium

tuberculosis and the need for a curated database. Tuberculosis (Edinb).

Stucki, D., Malla, B., Hostettler, S., Huna, T., Feldmann, J., Yeboah-Manu, D.,

Borrell, S., Fenner, L., Comas, I., Coscolla, M. & Gagneux, S. (2012). Two new

rapid SNP-typing methods for classifying Mycobacterium tuberculosis complex into the

main phylogenetic lineages. PLoS One 7, e41253.

Sunyaev, S., Ramensky, V. & Bork, P. (2000). Towards a structural basis of human

non-synonymous single nucleotide polymorphisms. Trends Genet 16, 198-200.

Sunyaev, S., Ramensky, V., Koch, I., Lathe, W., Kondrashov, A. S. & Bork, P.

(2001). Prediction of deleterious human alleles. Hum Mol Genet 10, 591-597.

Supply, P., Lesjean, S., Savine, E., Kremer, K., van Soolingen, D. & Locht, C.

(2001). Automated high-throughput genotyping for study of global epidemiology of

Mycobacterium tuberculosis based on mycobacterial interspersed repetitive units. J Clin

Microbiol 39, 3563-3571.

Supply, P., Warren, R. M., Banuls, A. L., Lesjean, S., Van Der Spuy, G. D., Lewis,

L. A., Tibayrenc, M., Van Helden, P. D. & Locht, C. (2003). Linkage disequilibrium

between minisatellite loci supports clonal evolution of Mycobacterium tuberculosis in a

high tuberculosis incidence area. Mol Microbiol 47, 529-538.

Supply, P., Marceau, M., Mangenot, S., Roche, D., Rouanet, C., Khanna, V.,

Majlessi, L., Criscuolo, A., Tap, J., Pawlik, A., Fiette, L., Orgeur, M., Fabre, M.,

Parmentier, C., Frigui, W., Simeone, R., Boritsch, E. C., Debrie, A. S., Willery, E.,

Walker, D., Quail, M. A., Ma, L., Bouchier, C., Salvignol, G., Sayes, F.,

Cascioferro, A., Seemann, T., Barbe, V., Locht, C., Gutierrez, M. C., Leclerc, C.,

Bentley, S. D., Stinear, T. P., Brisse, S., Medigue, C., Parkhill, J., Cruveiller, S. &

Brosch, R. (2013). Genomic analysis of smooth tubercle bacilli provides insights into

ancestry and pathoadaptation of Mycobacterium tuberculosis. Nat Genet 45, 172-179.

Tamura, K., Peterson, D., Peterson, N., Stecher, G., Nei, M. & Kumar, S. (2011).

MEGA5: molecular evolutionary genetics analysis using maximum likelihood,

evolutionary distance, and maximum parsimony methods. Mol Biol Evol 28, 2731-2739.

REFERENCES

204

Team_RDC (2008). R Development Core Team. Vienna: R Foundation for Statistical

Computing.

Tennessen, J. A., Bigham, A. W., O'Connor, T. D., Fu, W., Kenny, E. E., Gravel, S.,

McGee, S., Do, R., Liu, X., Jun, G., Kang, H. M., Jordan, D., Leal, S. M., Gabriel,

S., Rieder, M. J., Abecasis, G., Altshuler, D., Nickerson, D. A., Boerwinkle, E.,

Sunyaev, S., Bustamante, C. D., Bamshad, M. J. & Akey, J. M. (2012). Evolution

and functional impact of rare coding variation from deep sequencing of human exomes.

Science 337, 64-69.

Thomas, P. D., Campbell, M. J., Kejariwal, A., Mi, H., Karlak, B., Daverman, R.,

Diemer, K., Muruganujan, A. & Narechania, A. (2003). PANTHER: a library of

protein families and subfamilies indexed by function. Genome Res 13, 2129-2141.

Thomason, M. & Storz, G. (2010). Bacterial antisense RNAs: how many are there, and

what are they doing? Annu Rev Genet 44, 167-188.

Thusberg, J. & Vihinen, M. (2009). Pathogenic or not? And if so, then how? Studying

the effects of missense mutations using bioinformatics methods. Hum Mutat 30, 703-

714.

Torrelles, J. B., DesJardin, L. E., MacNeil, J., Kaufman, T. M., Kutzbach, B.,

Knaup, R., McCarthy, T. R., Gurcha, S. S., Besra, G. S., Clegg, S. & Schlesinger, L.

S. (2009). Inactivation of Mycobacterium tuberculosis mannosyltransferase pimB

reduces the cell wall lipoarabinomannan and lipomannan content and increases the rate

of bacterial-induced human macrophage cell death. Glycobiology 19, 743-755.

Torrelles, J. B. & Schlesinger, L. S. (2010). Diversity in Mycobacterium tuberculosis

mannosylated cell wall determinants impacts adaptation to the host. Tuberculosis

(Edinb) 90, 84-93.

Tsolaki, A. G., Hirsh, A. E., DeRiemer, K., Enciso, J. A., Wong, M. Z., Hannan, M.,

Goguet de la Salmoniere, Y. O., Aman, K., Kato-Maeda, M. & Small, P. M. (2004).

Functional and evolutionary genomics of Mycobacterium tuberculosis: insights from

REFERENCES

205

genomic deletions in 100 strains. Proceedings of the National Academy of Sciences of

the United States of America 101, 4865-4870.

Valway, S. E., Sanchez, M. P., Shinnick, T. F., Orme, I., Agerton, T., Hoy, D.,

Jones, J. S., Westmoreland, H. & Onorato, I. M. (1998). An outbreak involving

extensive transmission of a virulent strain of Mycobacterium tuberculosis. N Engl J Med

338, 633-639.

van Embden, J. D., Cave, M. D., Crawford, J. T., Dale, J. W., Eisenach, K. D.,

Gicquel, B., Hermans, P., Martin, C., McAdam, R., Shinnick, T. M. & et al. (1993).

Strain identification of Mycobacterium tuberculosis by DNA fingerprinting:

recommendations for a standardized methodology. J Clin Microbiol 31, 406-409.

van Soolingen, D., Hermans, P. W., de Haas, P. E., Soll, D. R. & van Embden, J. D.

(1991). Occurrence and stability of insertion sequences in Mycobacterium tuberculosis

complex strains: evaluation of an insertion sequence-dependent DNA polymorphism as

a tool in the epidemiology of tuberculosis. J Clin Microbiol 29, 2578-2586.

van Soolingen, D., Hoogenboezem, T., de Haas, P. E., Hermans, P. W., Koedam, M.

A., Teppema, K. S., Brennan, P. J., Besra, G. S., Portaels, F., Top, J., Schouls, L.

M. & van Embden, J. D. (1997). A novel pathogenic taxon of the Mycobacterium

tuberculosis complex, Canetti: characterization of an exceptional isolate from Africa. Int

J Syst Bacteriol 47, 1236-1245.

Van Soolingen, D. (2001). Molecular epidemiology of tuberculosis and other

mycobacterial infections: main methodologies and achievements. J Intern Med 249, 1-

26.

Walderhaug, M. O., Polarek, J. W., Voelkner, P., Daniel, J. M., Hesse, J. E.,

Altendorf, K. & Epstein, W. (1992). KdpD and KdpE, proteins that control expression

of the kdpABC operon, are members of the two-component sensor-effector class of

regulators. J Bacteriol 174, 2152-2159.

Wang, Q., Yue, J., Zhang, L., Xu, Y., Chen, J., Zhang, M., Zhu, B. & Wang, H.

(2007). A newly identified 191A/C mutation in the Rv2629 gene that was significantly

REFERENCES

206

associated with rifampin resistance in Mycobacterium tuberculosis. J Proteome Res 6,

4564-4571.

Wang, Z. & Moult, J. (2001). SNPs, protein structure, and disease. Hum Mutat 17,

263-270.

Wayne, L. G. (1994). Tuberculosis: Pathogenesis, Protection, and Control.

Washington, D.C: American Society for Microbiology Press.

Weiner, B., Gomez, J., Victor, T. C., Warren, R. M., Sloutsky, A., Plikaytis, B. B.,

Posey, J. E., van Helden, P. D., Gey van Pittius, N. C., Koehrsen, M., Sisk, P.,

Stolte, C., White, J., Gagneux, S., Birren, B., Hung, D., Murray, M. & Galagan, J.

(2012). Independent large scale duplications in multiple M. tuberculosis lineages

overlapping the same genomic region. PLoS One 7, e26038.

Weniger, T., Krawczyk, J., Supply, P., Niemann, S. & Harmsen, D. (2010). MIRU-

VNTRplus: a web tool for polyphasic genotyping of Mycobacterium tuberculosis

complex bacteria. Nucleic Acids Res 38, W326-331.

WHO (2012).WHO 2012 Global tuberculosis control—surveillance, planning,

financing. Geneva.

Winder, F. G. & Brennan, P. J. (1966). Initial steps in the metabolism of glycerol by

Mycobacterium tuberculosis. J Bacteriol 92, 1846-1847.

Yi, H., Cho, Y. J., Won, S., Lee, J. E., Jin Yu, H., Kim, S., Schroth, G. P., Luo, S. &

Chun, J. (2011). Duplex-specific nuclease efficiently removes rRNA for prokaryotic

RNA-seq. Nucleic Acids Res.

Yoder-Himes, D. R., Chain, P. S., Zhu, Y., Wurtzel, O., Rubin, E. M., Tiedje, J. M.

& Sorek, R. (2009). Mapping the Burkholderia cenocepacia niche response via high-

throughput sequencing. Proc Natl Acad Sci U S A 106, 3976-3981.

Yuan, Y., Zhu, Y., Crane, D. D. & Barry, C. E., 3rd (1998). The effect of oxygenated

mycolic acid composition on cell wall function and macrophage growth in

Mycobacterium tuberculosis. Mol Microbiol 29, 1449-1458.

REFERENCES

207

Yue, P., Li, Z. & Moult, J. (2005). Loss of protein structure stability as a major

causative factor in monogenic disease. J Mol Biol 353, 459-473.

Zheng, X., Hu, G., She, Z. & Zhu, H. (2011). Leaderless genes in bacteria: clue to the

evolution of translation initiation mechanisms in prokaryotes. BMC Genomics 12, 361.

APPENDIX

Appendix: A-G

Appendix A: genomeDeletions.pl


Appendix A

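Perl script (genomeDeletions.pl) to identify large deletions within genome sequencing data. The script takes three arguments: a per-base coverage file in the Artemis genome coverage format, a gene annotation file, and a percentage threshold. It reports each gene whose proportion of zero-coverage positions is at or above that threshold.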

genomeDeletions.pl

#!/usr/bin/perl -w

#################################################################
# Find deletions using Artemis per base coverage file           #
#                                                               #
# usage: perl genomeDeletions.pl [artemis coverage file]        #
#        [annotation file] [% threshold]                        #
#                                                               #
# Percentage cutoff set by command argument 3                   #
#                                                               #
# Graham Rose 05.2011                                           #
#                                                               #
#################################################################

use strict;

################# Arguments from commandline ####################

if (@ARGV != 3) {
    print "\nusage: perl genomeDeletions.pl [artemis coverage file] [H37Rv annotation file] [% deletion threshold eg: 50]\n\n";
    exit;
}

# Per-base coverage: each line is read as the coverage at one genome position
open(my $coverage_fh, '<', $ARGV[0]) or die "Can't open $ARGV[0]: $!\n";
my @genomeCoverage = <$coverage_fh>;
close($coverage_fh);

# Annotation lines: start, end, direction, gene name (whitespace separated)
open(my $annotation_fh, '<', $ARGV[1]) or die "Can't open $ARGV[1]: $!\n";
my @annotations = <$annotation_fh>;
close($annotation_fh);

my $threshold = $ARGV[2];

########################## Main logic ###########################

foreach my $line_in_annotations (@annotations)
{
    chomp($line_in_annotations);

    # Skip lines that do not match the four-column layout
    next unless $line_in_annotations =~ /(\w+)\s+(\w+)\s+(\w+)\s+(\w+)/;

    my $geneStart  = $1 - 1;    # convert 1-based start to 0-based array index
    my $geneEnd    = $2;
    my $geneName   = $4;        # $3 (direction) is not used
    my $geneLength = $geneEnd - $geneStart;

    next if $geneLength <= 0;   # avoid division by zero on malformed coordinates

    # Count positions with zero read coverage within the gene
    my $zeros = 0;
    for my $position ($geneStart .. $geneEnd - 1)
    {
        $zeros++ if $genomeCoverage[$position] == 0;
    }

    my $percentZero = ($zeros / $geneLength) * 100;
    my $rounded = sprintf "%.2f", $percentZero;

    ############################## Output ###########################

    if ($rounded >= $threshold)
    {
        print "Deletion: $geneName (% deleted: $rounded)\n";
    }
}

print "\ncomplete\n\n";
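The core calculation in genomeDeletions.pl can be illustrated on toy data. The following is a minimal sketch, not part of the original script: the subroutine name percent_zero_coverage and the ten-position coverage array are hypothetical and chosen only for illustration. It assumes, as the script does, that gene coordinates are 1-based and inclusive.

```perl
use strict;
use warnings;

# Percentage of positions with zero coverage in the 1-based,
# inclusive interval [$start, $end] of a per-base coverage array.
# (Hypothetical helper for illustration only.)
sub percent_zero_coverage {
    my ($coverage, $start, $end) = @_;
    my $length = $end - $start + 1;
    return 0 if $length <= 0;
    # grep in scalar context returns the number of matching elements
    my $zeros = grep { $_ == 0 } @{$coverage}[$start - 1 .. $end - 1];
    return 100 * $zeros / $length;
}

# Toy data: ten genome positions; positions 4-8 have no mapped reads.
my @coverage = (12, 15, 9, 0, 0, 0, 0, 0, 7, 11);

# A "gene" spanning positions 3-8 has 5 of 6 positions uncovered.
printf "%.2f\n", percent_zero_coverage(\@coverage, 3, 8);
```

With a threshold of 50, a gene such as this toy example would be reported as deleted, because more than half of its positions have zero coverage.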

Appendix B. Lineage-specific SNPs


Appendix B

Lineage-specific SNPs

All lineage-specific MTBC SNPs, ordered by genomic position. Alleles are given relative to the coding strand; for SNPs in intergenic regions, alleles are given relative to the forward strand. The ancestral allele is based on the reconstructed most recent common ancestor of the MTBC. The Mutation column shows the codon position and amino acid change for SNPs in coding regions (for synonymous SNPs the amino acid is unchanged, e.g. L161L); intergenic SNPs are marked with a dash.
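Each row of the table consists of seven whitespace-separated fields, so it can be read programmatically. The following is a minimal sketch, assuming exactly that layout; the subroutine name parse_snp_row and the hash keys are hypothetical and not taken from the thesis. It also splits a coding-region entry in the Mutation column (for example N381D) into the ancestral residue, the codon position, and the derived residue, with X denoting a stop codon as in the table.

```perl
use strict;
use warnings;

# Split one whitespace-delimited table row into named fields.
# (Hypothetical helper for illustration only.)
sub parse_snp_row {
    my ($row) = @_;
    my @fields = split ' ', $row;
    return unless @fields == 7;    # not a complete table row

    my %snp;
    @snp{qw(lineage position ancestral derived type gene mutation)} = @fields;

    # Coding SNPs, e.g. N381D: ancestral residue, codon number, derived residue.
    # Intergenic rows carry "-" here and are left without these keys.
    if ($snp{mutation} =~ /^([A-Z])(\d+)([A-Z])$/) {
        @snp{qw(aa_from codon aa_to)} = ($1, $2, $3);
    }
    return \%snp;
}

my $snp = parse_snp_row('5 3192 A G nonsynonymous Rv0002 N381D');
print "$snp->{gene} codon $snp->{codon}: $snp->{aa_from} -> $snp->{aa_to}\n";
```

Keeping the fields in a hash rather than a positional list makes later filtering, for example by lineage or mutation type, easier to read.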


Lineage  Genomic position  Ancestral allele  Derived allele  Mutation type  Gene  Mutation

modern 2532 C T synonymous Rv0002 L161L
5 3192 A G nonsynonymous Rv0002 N381D
3 3446 C T nonsynonymous Rv0003 A56V
5 3452 T C nonsynonymous Rv0003 L58S
1 6112 G C nonsynonymous Rv0005 M330I
1 8452 C T nonsynonymous Rv0006 A384V
6 8493 C T nonsynonymous Rv0006 L398F
modern 9143 C T synonymous Rv0006 I614I
5 9566 C T synonymous Rv0006 Y755Y
2 11820 C G intergenic - -
3 12204 C T synonymous Rv0008c L36L
1 13298 C G nonsynonymous Rv0010c I87M
modern 13460 C T synonymous Rv0010c D33D
6 13482 C T nonsynonymous Rv0010c A26V
5 13579 C T intergenic - -
modern 14401 G A nonsynonymous Rv0012 E105K
2 14861 G T nonsynonymous Rv0012 G258V
4 15117 G C nonsynonymous Rv0013 M68I
5 16720 G C nonsynonymous Rv0014c V251L
modern 21819 G T nonsynonymous Rv0018c A455S
1 22961 A C nonsynonymous Rv0018c D74A
modern 23174 T G nonsynonymous Rv0018c L3R
6 24780 A G nonsynonymous Rv0020c D222G
5 25386 G A nonsynonymous Rv0020c G20D
6 26053 G C nonsynonymous Rv0021c A277P
1 26347 G C nonsynonymous Rv0021c D179H
5 26783 G A synonymous Rv0021c A33A
3 26957 C G intergenic - -
modern 27469 G A intergenic - -
3 27487 G A intergenic - -
5 27947 C T nonsynonymous Rv0023 A118V
1 27996 T C synonymous Rv0023 Y134Y
5 31807 C T synonymous Rv0028 R98R
5 32776 G T synonymous Rv0029 T240T
6 33137 C G nonsynonymous Rv0029 L361V
2 36008 G C nonsynonymous Rv0032 D572H
6 36304 T G synonymous Rv0032 A670A
modern 36538 C T synonymous Rv0032 S748S
2 37305 C G nonsynonymous Rv0035 S16W
2 39158 G C synonymous Rv0036c R224R
modern 39758 C T synonymous Rv0036c H24H
5 39786 G C nonsynonymous Rv0036c S15T
5 40177 G A synonymous Rv0037c L342L
6 41241 G A intergenic - -
4 42281 T G nonsynonymous Rv0039c F24C
modern 43945 G A synonymous Rv0041 V128V
5 46297 T C synonymous Rv0041 Y912Y
5 47877 A C nonsynonymous Rv0043c D75A
5 50059 A T nonsynonymous Rv0046c I356F


6 51113 C T synonymous Rv0046c H4H
5 51892 A G nonsynonymous Rv0048c Y269C
3 53422 G A intergenic - -
3 54842 G T nonsynonymous Rv0050 A394S
3 56001 G A synonymous Rv0051 Q102Q
5 58875 C T nonsynonymous Rv0054 T97I
5 60059 C T nonsynonymous Rv0057 T55M
5 60300 G A synonymous Rv0057 V135V
6 62367 G A nonsynonymous Rv0058 V658I
2 63146 G T intergenic - -
1 64028 C T synonymous Rv0060 P40P
3 65083 G A synonymous Rv0061 G31G
1 65159 G A nonsynonymous Rv0061 A57T
1 65663 C G nonsynonymous Rv0062 H38D
3 66632 C T nonsynonymous Rv0062 P361S
1 66892 C G intergenic - -
3 67012 C T synonymous Rv0063 T30T
1 68174 T C nonsynonymous Rv0063 S418P
3 69984 C A synonymous Rv0064 A455A
4 70267 T G nonsynonymous Rv0064 F550V
5 71203 C G nonsynonymous Rv0064 Q862E
5 71203 C T stopgain Rv0064 Q862X
3 72549 G A nonsynonymous Rv0066c G655S
5 73148 C T nonsynonymous Rv0066c A455V
6 74161 G C nonsynonymous Rv0066c K117N
1 74737 A C nonsynonymous Rv0067c E154D
5 75313 A C nonsynonymous Rv0068 T5P
6 76147 C G nonsynonymous Rv0068 Q283E
5 77327 T G nonsynonymous Rv0069c I99S
6 78103 A G nonsynonymous Rv0070c E265G
1 79479 T C intergenic - -
6 84238 T G synonymous Rv0075 G81G
5 86587 C A synonymous Rv0078 I20I
2 87468 G A nonsynonymous Rv0078A E112K
5 87499 C T synonymous Rv0078A F101F
5 87973 C T intergenic - -
6 89113 T G nonsynonymous Rv0080 V31G
modern 89200 T G nonsynonymous Rv0080 V60G
5 89474 G A synonymous Rv0080 T151T
6 89535 C T intergenic - -
1 89871 C T synonymous Rv0081 D99D
6 91001 C T nonsynonymous Rv0083 P201L
6 92016 T G nonsynonymous Rv0083 D539E
5 95867 G T nonsynonymous Rv0087 A152S
modern 97696 C T intergenic - -
1 98966 G C nonsynonymous Rv0090 A163P
6 100589 A G nonsynonymous Rv0092 T3A
modern 103600 C T nonsynonymous Rv0093c R22C
6 104712 T C intergenic - -
6 111651 G A nonsynonymous Rv0101 D551N
5 113059 C A nonsynonymous Rv0101 P1020H
1 115499 T G synonymous Rv0101 R1833R
6 116901 T G nonsynonymous Rv0101 C2301G


3 117389 C T synonymous Rv0101 T2463T
6 121248 C T nonsynonymous Rv0103c P309L
3 123198 T C synonymous Rv0104 P294P
4 123520 C T nonsynonymous Rv0104 H402Y
3 123745 G A nonsynonymous Rv0104 G477R
6 123842 C T intergenic - -
modern 126803 C T nonsynonymous Rv0107c P1247S
3 129576 T G synonymous Rv0107c A322A
5 131232 G T intergenic - -
6 134014 C T nonsynonymous Rv0111 A22V
5 134555 C A synonymous Rv0111 G202G
6 135398 G A nonsynonymous Rv0111 M483I
5 137085 A G nonsynonymous Rv0112 D266G
6 137185 G A synonymous Rv0112 Q299Q
1 137233 C T synonymous Rv0112 A315A
modern 139756 C T intergenic - -
1 139954 A C intergenic - -
6 140644 G A nonsynonymous Rv0116c G127R
5 140875 G A nonsynonymous Rv0116c V50M
6 141261 C T nonsynonymous Rv0117 A21V
5 141516 C T nonsynonymous Rv0117 S106F
4 143207 G A nonsynonymous Rv0118c G224S
6 144345 C A synonymous Rv0119 R99R
1 144564 C T synonymous Rv0119 I172I
6 144570 C G nonsynonymous Rv0119 F174L
1 146788 C T synonymous Rv0120c V328V
1 146872 G A synonymous Rv0120c Q300Q
5 146893 G A synonymous Rv0120c L293L
2 147262 C A nonsynonymous Rv0120c D170E
6 147650 A G nonsynonymous Rv0120c E41G
5 148187 C T synonymous Rv0121c T52T
1 154191 A G intergenic - -
5 155478 C A nonsynonymous Rv0127 A416E
3 157129 G A nonsynonymous Rv0129c G158S
1 160976 C G nonsynonymous Rv0133 N36K
1 162226 G A stopgain Rv0134 W152X
6 162622 G A synonymous Rv0134 V284V
6 162948 C T nonsynonymous Rv0135c T101I
5 163148 C T synonymous Rv0135c R34R
5 164936 G A nonsynonymous Rv0137c D109N
1 167986 C T synonymous Rv0142 D92D
6 168529 G A synonymous Rv0142 P273P
1 168787 G A nonsynonymous Rv0143c V466I
6 170083 T C nonsynonymous Rv0143c F34L
3 170671 G A nonsynonymous Rv0144 A130T
3 172492 C G stopgain Rv0146 Y94X
3 181090 C T intergenic - -
6 183575 C T intergenic - -
5 188856 C T intergenic - -
2 190816 A C synonymous Rv0161 S70S
6 191470 G A synonymous Rv0161 L288L
6 195315 T C nonsynonymous Rv0166 F108S
modern 195360 C T nonsynonymous Rv0166 A123V


1 196874 C T nonsynonymous Rv0167 T5I
6 198313 C A nonsynonymous Rv0168 D218E
3 198401 G T nonsynonymous Rv0168 G248C
4 199470 G T nonsynonymous Rv0169 A313S
5 199734 C G nonsynonymous Rv0169 P401A
6 201567 G C nonsynonymous Rv0171 E212D
5 202229 C G nonsynonymous Rv0171 A433G
6 203639 G A synonymous Rv0172 L388L
5 204315 A G nonsynonymous Rv0173 K84R
4 206481 G C synonymous Rv0174 P417P
4 206484 T G synonymous Rv0174 G418G
3 207079 G C nonsynonymous Rv0175 R89P
5 208299 C T nonsynonymous Rv0176 P283L
modern 208318 C T synonymous Rv0176 I289I
modern 208320 G A nonsynonymous Rv0176 S290N
modern 208321 C G nonsynonymous Rv0176 S290R
modern 208321 C T synonymous Rv0176 S290S
6 208403 C T nonsynonymous Rv0176 P318S
6 210442 G A nonsynonymous Rv0179c R124H
3 210624 G A synonymous Rv0179c L63L
1 211993 G T nonsynonymous Rv0180c K86N
1 215238 C T synonymous Rv0184 D90D
4 217201 C T synonymous Rv0186 N311N
3 218599 T C intergenic - -
modern 223752 G C nonsynonymous Rv0192A G49A
3 224338 C T nonsynonymous Rv0192 P259S
5 224414 A G intergenic - -
6 225416 T C nonsynonymous Rv0193c W386R
modern 225668 A G nonsynonymous Rv0193c S302G
1 226676 G A intergenic - -
5 227468 C T synonymous Rv0194 V197V
5 229448 G C nonsynonymous Rv0194 L857F
2 230170 C T nonsynonymous Rv0194 P1098L
5 230197 C T nonsynonymous Rv0194 T1107I
modern 233358 C A synonymous Rv0197 V376V
modern 233364 C G nonsynonymous Rv0197 S378R
5 233377 C T nonsynonymous Rv0197 H383Y
4 234493 C G nonsynonymous Rv0197 L755V
5 240032 G T nonsynonymous Rv0202c A421S
modern 243598 G A nonsynonymous Rv0205 R72H
1 244550 T C synonymous Rv0206c R923R
6 245921 C G nonsynonymous Rv0206c D466E
6 246169 T A nonsynonymous Rv0206c F384I
5 248946 C T intergenic - -
4 249522 C T nonsynonymous Rv0209 A162V
5 251176 C A nonsynonymous Rv0210 L353M
4 251575 A G nonsynonymous Rv0210 T486A
modern 251669 C T intergenic - -
5 253046 A C nonsynonymous Rv0211 K422T
6 254508 G C nonsynonymous Rv0212c G45R
1 254903 G T nonsynonymous Rv0213c G350C
3 255373 A G nonsynonymous Rv0213c D193G
5 256001 G T intergenic - -


3 257071 C T synonymous Rv0214 Y336Y
5 258470 T C synonymous Rv0215c G129G
5 258561 T G nonsynonymous Rv0215c L99R
modern 260282 A C nonsynonymous Rv0217c T184P
6 260610 G A synonymous Rv0217c L74L
1 263149 A C nonsynonymous Rv0220 E113A
2 264129 G A nonsynonymous Rv0221 M21I
1 264298 C A nonsynonymous Rv0221 Q78K
1 264984 C G synonymous Rv0221 T306T
3 264992 T C nonsynonymous Rv0221 L309P
3 266405 G A nonsynonymous Rv0223c G454S
6 272306 G T synonymous Rv0227c A178A
1 272678 G A synonymous Rv0227c A54A
6 273558 C G synonymous Rv0228 S168S
3 274463 G T nonsynonymous Rv0229c R175L
5 274584 T C nonsynonymous Rv0229c S135P
6 275367 C A nonsynonymous Rv0230c D199E
3 276539 G C nonsynonymous Rv0231 G161A
5 277865 G C intergenic - -
5 281405 C A nonsynonymous Rv0235c P404T
6 282537 C A synonymous Rv0235c I26I
2 282892 G A synonymous Rv0236c T1320T
5 285653 G A nonsynonymous Rv0236c R400Q
3 289253 C T synonymous Rv0239 D50D
1 290374 A G nonsynonymous Rv0241c H94R
5 291830 A C nonsynonymous Rv0242c D67A
5 295645 C T intergenic - -
3 301341 G T synonymous Rv0249c P105P
6 301687 C T intergenic - -
5 306201 A G nonsynonymous Rv0254c H50R
5 309681 C G intergenic - -
modern 309765 T C stoplost Rv0257 X23R
6 310129 T C intergenic - -
6 310132 G A intergenic - -
6 315069 G T nonsynonymous Rv0263c R233L
modern 320180 C T synonymous Rv0266c I325I
6 323306 A G intergenic - -
3 324812 C T synonymous Rv0270 D82D
4 325505 C T synonymous Rv0270 V313V
1 326002 A G nonsynonymous Rv0270 D479G
6 327312 A G nonsynonymous Rv0271c K384E
3 328569 G C intergenic - -
5 331309 T C nonsynonymous Rv0275c L117P
modern 331588 T C nonsynonymous Rv0275c L24S
modern 333212 C G intergenic - -
modern 333292 G A intergenic - -
6 333394 G C synonymous Rv0277A S8S
3 339230 G C intergenic - -
4 342146 C A nonsynonymous Rv0282 A6E
3 342873 C T synonymous Rv0282 V248V
modern 343281 C G synonymous Rv0282 A384A
5 344258 G A synonymous Rv0283 V79V
1 344288 C G synonymous Rv0283 S89S


6 344957 A G nonsynonymous Rv0283 I312M
5 345317 G C synonymous Rv0283 V432V
6 352058 G T nonsynonymous Rv0288 A71S
5 352646 G A synonymous Rv0289 P166P
3 353197 C T nonsynonymous Rv0290 R39C
2 353309 G A nonsynonymous Rv0290 S76N
2 353365 G A nonsynonymous Rv0290 A95T
5 357538 A G nonsynonymous Rv0293c H176R
1 357582 C T synonymous Rv0293c H161H
3 358473 G C nonsynonymous Rv0294 W101C
6 360585 G A nonsynonymous Rv0296c E191K
3 363563 G A nonsynonymous Rv0299 A30T
5 376237 G A synonymous Rv0306 L108L
6 378032 G A synonymous Rv0309 A34A
1 378357 T G nonsynonymous Rv0309 S143A
6 378404 G A synonymous Rv0309 P158P
1 378939 C G synonymous Rv0310c R70R
5 379687 C T synonymous Rv0311 V172V
modern 381030 G A nonsynonymous Rv0312 G159S
6 382243 C T nonsynonymous Rv0312 P563L
6 382489 C T intergenic - -
4 392261 C T stopgain Rv0325 Q75X
1 393941 G A nonsynonymous Rv0327c M35I
6 394900 C T stopgain Rv0329c R141X
1 396750 G A nonsynonymous Rv0331 V184I
3 402836 C T synonymous Rv0337c Y109Y
6 402881 C T synonymous Rv0337c D94D
6 405274 C A nonsynonymous Rv0338c L190I
6 405854 C G intergenic - -
1 406274 A G synonymous Rv0339c A725A
5 408006 C A nonsynonymous Rv0339c P148Q
5 408935 C T nonsynonymous Rv0340 A101V
5 409079 G A nonsynonymous Rv0340 G149E
modern 412280 G T nonsynonymous Rv0342 Q481H
5 414876 C G synonymous Rv0344c G22G
1 420405 C T nonsynonymous Rv0350 R191C
3 422678 G A nonsynonymous Rv0352 R76H
1 424250 A G intergenic - -
modern 438271 A G nonsynonymous Rv0359 T252A
3 438470 G A synonymous Rv0360c K90K
1 439711 T C intergenic - -
3 440365 G A synonymous Rv0362 S165S
3 440878 C T synonymous Rv0362 T336T
5 441062 G T nonsynonymous Rv0362 A398S
6 441891 A G nonsynonymous Rv0363c I137V
5 442468 G A nonsynonymous Rv0364 G25D
5 443471 G A nonsynonymous Rv0365c A243T
6 443897 C T synonymous Rv0365c L101L
5 445696 C A stopgain Rv0368c S277X
6 445737 G A synonymous Rv0368c P263P
4 445780 A G nonsynonymous Rv0368c H249R
modern 447442 A G nonsynonymous Rv0370c E201G
1 447525 C G nonsynonymous Rv0370c I173M


3 447642 C G synonymous Rv0370c L134L
1 452288 C T intergenic - -
6 452730 A G nonsynonymous Rv0375c D142G
5 454418 T G intergenic - -
3 455024 G A nonsynonymous Rv0377 V202I
modern 455325 G C nonsynonymous Rv0377 R302P
modern 455329 C T synonymous Rv0377 G303G
1 456511 C T synonymous Rv0380c T103T
5 456731 G A nonsynonymous Rv0380c R30H
6 457372 C A nonsynonymous Rv0381c P151Q
5 458116 G A nonsynonymous Rv0382c D89N
5 464480 G A nonsynonymous Rv0386 R357Q
5 464958 C T synonymous Rv0386 G516G
5 466175 A C nonsynonymous Rv0386 E922A
modern 468357 G A nonsynonymous Rv0389 G8E
1 469042 C T synonymous Rv0389 N236N
1 471666 T C nonsynonymous Rv0392c M325T
6 476582 G T intergenic - -
2 477234 A C nonsynonymous Rv0398c E29D
6 477634 A G nonsynonymous Rv0399c E308G
5 477988 A C nonsynonymous Rv0399c H190P
5 479350 A G nonsynonymous Rv0400c D135G
5 480239 C G intergenic - -
5 482106 C T stopgain Rv0402c R376X
3 484504 C G synonymous Rv0404 A176A
2 484596 C T nonsynonymous Rv0404 P207L
1 485230 C T synonymous Rv0404 H418H
3 485561 A C nonsynonymous Rv0404 I529L
1 485785 T G nonsynonymous Rv0405 L19V
5 487463 A G nonsynonymous Rv0405 D578G
5 489024 C T synonymous Rv0405 S1098S
6 489514 G A nonsynonymous Rv0405 G1262R
6 490398 C T nonsynonymous Rv0406c T103M
6 491668 A G nonsynonymous Rv0407 K296E
4 491742 C T synonymous Rv0407 F320F
4 492150 C G nonsynonymous Rv0408 A122G
1 492655 G A synonymous Rv0408 A290A
6 494915 C A nonsynonymous Rv0409 D355E
6 495108 C A nonsynonymous Rv0410c T736K
5 495322 G A nonsynonymous Rv0410c A665T
1 495473 G A synonymous Rv0410c S614S
2 497491 C T synonymous Rv0411c D270D
4 498531 C T synonymous Rv0412c A363A
5 500223 G T nonsynonymous Rv0413 V171L
5 501517 G A nonsynonymous Rv0415 E124K
4 505974 A G nonsynonymous Rv0419 T297A
6 507989 G A synonymous Rv0422c L189L
5 512659 G A nonsynonymous Rv0425c S888N
6 514098 G A synonymous Rv0425c A408A
modern 514657 C A nonsynonymous Rv0425c A222E
4 517358 G A nonsynonymous Rv0428c G149D
3 517389 G T nonsynonymous Rv0428c V139F
modern 517411 C T synonymous Rv0428c R131R


6 518166 C A synonymous Rv0429c R77R
modern 519185 G T nonsynonymous Rv0431 G38V
6 519331 A G nonsynonymous Rv0431 T87A
5 519872 C T synonymous Rv0432 F91F
modern 522081 G A nonsynonymous Rv0434 A190T
3 523654 G A nonsynonymous Rv0435c A294T
6 524802 C T nonsynonymous Rv0436c P197S
5 525205 C T synonymous Rv0436c R62R
5 525540 A C synonymous Rv0437c A181A
6 526255 C T nonsynonymous Rv0438c P369L
6 526406 C T nonsynonymous Rv0438c P319S
modern 527316 C G nonsynonymous Rv0438c I15M
modern 528354 A C intergenic - -
6 529147 C G synonymous Rv0440 T180T
6 532927 G C intergenic - -
5 534205 G A synonymous Rv0445c R64R
modern 534427 A T intergenic - -
5 536070 G A synonymous Rv0447c R146R
6 538762 G A synonymous Rv0450c L910L
5 539019 C T nonsynonymous Rv0450c H825Y
2 542014 C G intergenic - -
4 546357 C T synonymous Rv0456c T149T
6 547013 G A intergenic - -
3 548326 A G nonsynonymous Rv0457c T428A
1 549251 G A stopgain Rv0457c W119X
5 554297 C T synonymous Rv0463 D94D
5 554493 G T synonymous Rv0464c A131A
5 555621 A C nonsynonymous Rv0465c K229T
1 555945 A G nonsynonymous Rv0465c Q121R
4 555991 C T nonsynonymous Rv0465c R106C
1 556035 C A nonsynonymous Rv0465c P91Q
6 556089 T C nonsynonymous Rv0465c V73A
6 556201 A G nonsynonymous Rv0465c N36D
5 558750 C T synonymous Rv0467 T408T
1 560664 C T synonymous Rv0469 N259N
1 560666 A G nonsynonymous Rv0469 K260R
6 560857 G C synonymous Rv0470c L285L
3 562064 G C nonsynonymous Rv0470A W77C
3 562066 T C nonsynonymous Rv0470A W77R
5 562322 A C nonsynonymous Rv0471c E131A
5 563965 C T synonymous Rv0473 G134G
1 564723 C T nonsynonymous Rv0473 T387M
modern 565404 T G synonymous Rv0474 R128R
1 568693 C A nonsynonymous Rv0479c A92E
5 569841 C T intergenic - -
5 573190 C T synonymous Rv0484c G204G
modern 573384 C A nonsynonymous Rv0484c L140I
3 579284 T C intergenic - -
1 580336 C A synonymous Rv0490 R330R
6 580576 C T stopgain Rv0490 R410X
4 584171 G A nonsynonymous Rv0493c G174S
modern 584511 A C synonymous Rv0493c T60T
5 588160 G T nonsynonymous Rv0497 A262S


5 588733 C G nonsynonymous Rv0498 P137A
5 589808 A C nonsynonymous Rv0499 D209A
6 590622 C G synonymous Rv0500 G180G
modern 590763 C G nonsynonymous Rv0500 D227E
5 591470 G A intergenic - -
1 591965 G A synonymous Rv0501 A104A
6 592494 G A nonsynonymous Rv0501 G281R
1 595501 C T nonsynonymous Rv0505c A362V
6 598244 C T nonsynonymous Rv0507 T349I
6 598723 G T nonsynonymous Rv0507 D509Y
6 599363 C G nonsynonymous Rv0507 S722C
4 599868 G A synonymous Rv0507 R890R
5 601315 G A nonsynonymous Rv0509 R292H
5 609003 C G nonsynonymous Rv0517 I86M
5 610787 G A synonymous Rv0518 E200E
6 611373 A G synonymous Rv0519c Q234Q
3 611977 G A nonsynonymous Rv0519c G33D
5 613957 G T nonsynonymous Rv0522 R307L
1 615938 G A synonymous Rv0524 E368E
1 621390 G A nonsynonymous Rv0530 V162I
5 622361 G A synonymous Rv0531 R11R
1 627350 C G nonsynonymous Rv0536 P35A
2 627485 G A nonsynonymous Rv0536 V80I
2 628864 A G nonsynonymous Rv0537c T290A
6 628906 C T synonymous Rv0537c L276L
3 629714 C T synonymous Rv0537c D6D
6 630018 A G intergenic - -
6 631296 G A synonymous Rv0538 P419P
3 635139 C G synonymous Rv0542c G122G
5 635775 G A nonsynonymous Rv0543c R34H
6 640129 C T intergenic - -
2 640954 A G intergenic - -
1 643483 G A nonsynonymous Rv0552 G199S
6 644439 C T synonymous Rv0552 A517A
5 644472 T C synonymous Rv0552 G528G
5 646048 C G synonymous Rv0554 P194P
1 646531 A T synonymous Rv0555 T78T
5 648756 C T nonsynonymous Rv0557 T74M
4 648856 C T synonymous Rv0557 G107G
1 649345 C T synonymous Rv0557 A270A
5 649446 G A nonsynonymous Rv0557 G304D
3 652950 T C synonymous Rv0562 R60R
6 654603 A G nonsynonymous Rv0563 E242G
6 655382 A G nonsynonymous Rv0564c T190A
1 655707 C T synonymous Rv0564c V81V
4 655986 G T intergenic - -
6 656432 G A nonsynonymous Rv0565c G347S
2 657142 G A nonsynonymous Rv0565c R110H
modern 657578 C T synonymous Rv0566c D154D
6 658923 C G nonsynonymous Rv0567 F201L
6 659019 C T synonymous Rv0567 D233D
4 659341 C T intergenic - -
5 660153 T G nonsynonymous Rv0568 L235R


4 662911 C T synonymous Rv0570 A539A
5 666713 T G nonsynonymous Rv0573c L177R
5 667950 C T stopgain Rv0574c Q149X
6 669225 A G nonsynonymous Rv0575c Y174C
6 669231 A C nonsynonymous Rv0575c E172A
5 669406 G A nonsynonymous Rv0575c E114K
4 670545 A G nonsynonymous Rv0576 H233R
1 678440 G A synonymous Rv0583c T212T
5 678934 G T nonsynonymous Rv0583c A48S
modern 684290 G A intergenic - -
3 684376 T C intergenic - -
1 685955 G A nonsynonymous Rv0588 A10T
3 686123 C A nonsynonymous Rv0588 L66M
5 686146 C T synonymous Rv0588 G73G
6 687074 C T nonsynonymous Rv0589 P85L
6 688260 G A nonsynonymous Rv0590 V77I
modern 690248 A C nonsynonymous Rv0591 N397T
4 690450 C A synonymous Rv0591 A464A
2 696917 G T intergenic - -
5 697196 C T synonymous Rv0598c D124D
modern 700776 C T nonsynonymous Rv0604 P180S
1 704997 C T stopgain Rv0610c Q305X
modern 705602 G A nonsynonymous Rv0610c S103N
6 705988 G C nonsynonymous Rv0611c E119D
2 707334 A G nonsynonymous Rv0613c T728A
3 708056 A G nonsynonymous Rv0613c H487R
5 708263 T C nonsynonymous Rv0613c L418S
6 709150 C T synonymous Rv0613c D122D
4 713310 C T nonsynonymous Rv0620 R199C
6 713802 C T nonsynonymous Rv0620 R363C
modern 715266 G A stopgain Rv0621 W355X
6 717062 C T intergenic - -
5 717558 C G synonymous Rv0625c G112G
modern 717588 C T synonymous Rv0625c V102V
1 720863 C A synonymous Rv0629c A290A
1 722852 C T nonsynonymous Rv0630c P721L
1 726498 G T synonymous Rv0631c L603L
5 728532 A C intergenic - -
6 729114 C T synonymous Rv0632c G55G
3 729685 G A nonsynonymous Rv0633c R161Q
1 730087 C T nonsynonymous Rv0633c A27V
5 731750 G C nonsynonymous Rv0634B L13F
1 734116 T C nonsynonymous Rv0638 M127T
5 735135 G A nonsynonymous Rv0640 M38I
6 735252 C T synonymous Rv0640 A77A
6 736919 C A nonsynonymous Rv0642c F95L
6 738820 G T nonsynonymous Rv0644c R114L
5 738899 G A nonsynonymous Rv0644c V88I
5 740038 C G synonymous Rv0645c A50A
6 742633 G A intergenic - -
5 745835 C G synonymous Rv0648 G1039G
1 749968 T C intergenic - -
1 752046 C T nonsynonymous Rv0655 A177V


3 753174 G T nonsynonymous Rv0656c W65L
1 753668 C T intergenic - -
1 754387 C T nonsynonymous Rv0658c T8I
6 754754 C T synonymous Rv0659c R80R
2 757139 C A nonsynonymous Rv0663 R335S
4 757182 G A nonsynonymous Rv0663 G349D
3 759746 C T intergenic - -
6 760969 C T nonsynonymous Rv0667 S388L
6 761723 A C nonsynonymous Rv0667 E639D
3 762434 T G synonymous Rv0667 G876G
4 763031 C T synonymous Rv0667 A1075A
1 763884 C T nonsynonymous Rv0668 A172V
1 763886 C A synonymous Rv0668 R173R
3 767339 A G intergenic - -
1 767609 A G intergenic - -
5 769406 G A synonymous Rv0669c L64L
6 772596 T C synonymous Rv0672 R371R
5 773021 G A nonsynonymous Rv0672 R513Q
4 776100 T C nonsynonymous Rv0676c I794T
6 781075 G C nonsynonymous Rv0681 A119P
1 786137 A C intergenic - -
3 788615 G C nonsynonymous Rv0688 G226R
1 797597 C T nonsynonymous Rv0697 T222M
2 798355 G C nonsynonymous Rv0697 A475P
3 798779 T C intergenic - -
3 798934 C A synonymous Rv0698 R34R
1 800357 C A intergenic - -
5 801959 C T synonymous Rv0702 R166R
6 807012 G T nonsynonymous Rv0711 W226C
3 807405 C T synonymous Rv0711 S357S
1 810287 C G synonymous Rv0713 T114T
1 811492 C G synonymous Rv0714 V40V
2 811753 C T synonymous Rv0715 H4H
1 812502 C T synonymous Rv0716 V148V
modern 815236 C T nonsynonymous Rv0723 T16I
modern 815851 G A synonymous Rv0724 R63R
6 816732 G A nonsynonymous Rv0724 S357N
6 816862 G T synonymous Rv0724 V400V
5 817489 T C synonymous Rv0724 T609T
1 817696 C T nonsynonymous Rv0725c P250L
1 819213 A G nonsynonymous Rv0726c E143G
3 820734 T C nonsynonymous Rv0728c V248A
4 820752 A G nonsynonymous Rv0728c H242R
modern 821907 C T nonsynonymous Rv0729 P134L
1 829719 G A intergenic - -
modern 832246 G A synonymous Rv0740 Q157Q
3 834857 G A nonsynonymous Rv0744c M30I
6 841095 T G nonsynonymous Rv0748 V50G
5 841139 C T synonymous Rv0748 L65L
1 841494 C G synonymous Rv0749 L89L
1 841495 A G nonsynonymous Rv0749 M90V
6 841629 G T synonymous Rv0749 S134S
6 843751 G T nonsynonymous Rv0752c A222S
4 847995 C T intergenic - -
5 850047 C T intergenic - -
1 850985 T C nonsynonymous Rv0756c I161T
6 851104 G A synonymous Rv0756c A121A
6 851562 G A intergenic - -
3 857643 G C nonsynonymous Rv0764c G132A
3 858464 T C nonsynonymous Rv0765c V134A
5 861279 A C nonsynonymous Rv0768 E123A
6 862664 C T synonymous Rv0769 L85L
5 863975 G T synonymous Rv0770 R240R
1 865761 C T synonymous Rv0772 H392H
6 866448 G A synonymous Rv0773c T314T
6 867745 A G nonsynonymous Rv0774c T203A
5 869036 C T nonsynonymous Rv0776c T243I
modern 871271 C A nonsynonymous Rv0777 P422T
5 872863 A C nonsynonymous Rv0779c T144P
5 883072 C T nonsynonymous Rv0788 R105C
1 885689 C A nonsynonymous Rv0791c T51K
6 886178 C A nonsynonymous Rv0792c L157I
5 892659 C T synonymous Rv0799c T205T
modern 894888 C T synonymous Rv0801 S86S
1 895082 T C nonsynonymous Rv0802c F183L
1 895120 G A nonsynonymous Rv0802c R170H
3 896979 T C nonsynonymous Rv0803 V387A
6 900065 C A synonymous Rv0806c T422T
6 901327 C G nonsynonymous Rv0806c P2A
5 901358 G A intergenic - -
5 904367 G C nonsynonymous Rv0809 D215H
5 905344 G T nonsynonymous Rv0811c G333C
3 906742 T C nonsynonymous Rv0812 V107A
5 907906 G A nonsynonymous Rv0813c R38H
5 908033 G C intergenic - -
1 910015 G T nonsynonymous Rv0816c A7S
5 910282 C A synonymous Rv0817c S187S
modern 911261 C T nonsynonymous Rv0818 P97L
modern 913274 G C synonymous Rv0820 S183S
modern 916046 C A nonsynonymous Rv0822c P89H
6 916350 G A intergenic - -
6 916714 C A synonymous Rv0823c L311L
6 917259 G T nonsynonymous Rv0823c G130C
5 919007 T G nonsynonymous Rv0825c C183G
6 919382 T C nonsynonymous Rv0825c F58L
6 919384 A G nonsynonymous Rv0825c Y57C
3 919551 G A synonymous Rv0825c V1V
6 920333 C T nonsynonymous Rv0826 P234S
5 921429 C T nonsynonymous Rv0828c P62L
4 931123 C T synonymous Rv0835 Y57Y
6 932252 C A intergenic - -
4 932280 G A stopgain Rv0836c W218X
modern 933699 A C synonymous Rv0837c G111G
4 934230 G C intergenic - -
4 934611 T G intergenic - -
6 937614 C A nonsynonymous Rv0841 H8N
modern 938246 G A synonymous Rv0842 L45L
2 940602 G C nonsynonymous Rv0844c G169R
6 941054 C T nonsynonymous Rv0844c P18L
4 941845 A C nonsynonymous Rv0845 E219A
5 941849 A C nonsynonymous Rv0845 E220D
3 942616 C A intergenic - -
6 944725 C T nonsynonymous Rv0847 A128V
3 945238 G T nonsynonymous Rv0848 A101S
modern 948294 G A nonsynonymous Rv0851c G59S
3 950116 G C nonsynonymous Rv0853c A335P
modern 951142 A C intergenic - -
1 952597 C T synonymous Rv0855 I322I
5 954131 G A nonsynonymous Rv0858c V264M
5 955631 C T synonymous Rv0859 G185G
3 957306 C T synonymous Rv0860 A338A
6 959369 G A synonymous Rv0861c S261S
4 960367 C T nonsynonymous Rv0862c S749L
6 962133 C A nonsynonymous Rv0862c D160E
5 964400 C A nonsynonymous Rv0867c T379K
3 964969 A G synonymous Rv0867c A189A
5 965648 G T intergenic - -
6 972484 C T intergenic - -
2 972980 G A nonsynonymous Rv0874c G243S
6 975915 G C nonsynonymous Rv0876c R8P
1 976043 G T intergenic - -
6 982363 C A synonymous Rv0884c G64G
1 987601 C T synonymous Rv0888 G123G
6 988043 T C nonsynonymous Rv0888 Y271H
5 991740 A G nonsynonymous Rv0890c Q286R
3 991939 C T nonsynonymous Rv0890c R220C
6 994678 C G nonsynonymous Rv0892 L276V
1 996219 C T nonsynonymous Rv0893c A26V
3 996263 C T synonymous Rv0893c T11T
6 996284 G A synonymous Rv0893c E4E
6 1000732 A G nonsynonymous Rv0896 T421A
3 1002172 C T nonsynonymous Rv0897c R82W
1 1002342 A G nonsynonymous Rv0897c Y25C
5 1004177 A C nonsynonymous Rv0901 E74A
3 1007198 T C nonsynonymous Rv0904c L328P
6 1007708 G A nonsynonymous Rv0904c G158E
3 1008460 C T nonsynonymous Rv0905 S85F
1 1009490 C T stopgain Rv0906 Q183X
3 1009500 C T nonsynonymous Rv0906 P186L
6 1009957 G A synonymous Rv0906 L338L
3 1012815 C G nonsynonymous Rv0908 A362G
6 1013635 C T synonymous Rv0908 V635V
modern 1014815 G T nonsynonymous Rv0909 Q45H
2 1022003 A C intergenic - -
1 1022613 G A nonsynonymous Rv0917 R176Q
3 1023911 C G intergenic - -
4 1024346 G A nonsynonymous Rv0918 G46S
5 1025135 T C nonsynonymous Rv0919 L151P
3 1029586 C T nonsynonymous Rv0923c P331L
1 1029997 T C nonsynonymous Rv0923c V194A
1 1032524 A G nonsynonymous Rv0925c D37G
6 1034238 T C synonymous Rv0927c L132L
3 1034381 C T nonsynonymous Rv0927c A84V
5 1038813 C T nonsynonymous Rv0931c H368Y
modern 1040706 G T nonsynonymous Rv0932c A115S
1 1043136 C T nonsynonymous Rv0934 T341I
6 1043169 T C nonsynonymous Rv0934 V352A
1 1048102 G A nonsynonymous Rv0938 R656H
6 1049460 A T nonsynonymous Rv0939 D350V
5 1050523 G T intergenic - -
5 1053653 G A synonymous Rv0943c R28R
6 1054136 C T synonymous Rv0944 T124T
4 1054784 G C nonsynonymous Rv0945 G180R
5 1058309 A G nonsynonymous Rv0949 D17G
1 1061386 A G nonsynonymous Rv0950c E90G
modern 1063765 A G nonsynonymous Rv0952 K209R
1 1063922 C T synonymous Rv0952 G261G
1 1066038 A G synonymous Rv0954 *304*
6 1069146 C T synonymous Rv0957 G314G
3 1071349 G C nonsynonymous Rv0959 G32A
1 1072342 A G nonsynonymous Rv0959 D363G
1 1075169 A G intergenic - -
5 1077102 G A synonymous Rv0965c A32A
1 1077754 G A nonsynonymous Rv0966c A28T
4 1080192 A G nonsynonymous Rv0969 N484D
1 1083755 T C nonsynonymous Rv0973c S666P
1 1086648 G T nonsynonymous Rv0974c R233L
6 1095053 G T nonsynonymous Rv0979A K56N
modern 1097023 A G nonsynonymous Rv0981 S70G
5 1097633 C T synonymous Rv0982 A42A
4 1098523 A T nonsynonymous Rv0982 H339L
2 1102468 C A synonymous Rv0986 G222G
5 1104499 A G nonsynonymous Rv0987 D653G
4 1104690 G T nonsynonymous Rv0987 V717F
modern 1105284 A G nonsynonymous Rv0988 I57V
5 1105557 G T nonsynonymous Rv0988 A148S
3 1106099 C T synonymous Rv0988 I328I
5 1107024 A G nonsynonymous Rv0989c D120G
6 1107897 C T nonsynonymous Rv0990c A68V
4 1107940 G T nonsynonymous Rv0990c A54S
modern 1109163 C G nonsynonymous Rv0992c I3M
3 1110721 T C synonymous Rv0994 R151R
modern 1110956 G T nonsynonymous Rv0994 G230C
1 1111518 T C nonsynonymous Rv0994 V417A
3 1111852 G T nonsynonymous Rv0995 D81Y
modern 1113290 C G nonsynonymous Rv0996 Q303E
5 1114129 T G intergenic - -
6 1117308 C T synonymous Rv1001 L42L
1 1117405 C T nonsynonymous Rv1001 T74I
6 1118270 A G synonymous Rv1001 V362V
6 1119597 G C nonsynonymous Rv1002c V115L
1 1119739 G A synonymous Rv1002c L67L
6 1122175 C G intergenic - -
modern 1123597 C T nonsynonymous Rv1005c S1L
modern 1131300 A G nonsynonymous Rv1012 N58S
3 1139089 G C synonymous Rv1020 T41T
1 1139222 G A nonsynonymous Rv1020 A86T
5 1139497 C T synonymous Rv1020 T177T
5 1141069 G T synonymous Rv1020 A701A
modern 1143832 C A nonsynonymous Rv1022 P33T
3 1144409 T G nonsynonymous Rv1022 I225S
4 1144585 G A nonsynonymous Rv1023 G8R
5 1145442 C G synonymous Rv1023 G293G
4 1148259 G A intergenic - -
6 1149547 G A nonsynonymous Rv1028c A714T
6 1150490 G C synonymous Rv1028c S399S
6 1150803 G A nonsynonymous Rv1028c G295D
1 1151490 C G nonsynonymous Rv1028c T66R
3 1152805 T A nonsynonymous Rv1029 L265Q
3 1152863 A G synonymous Rv1029 Q284Q
3 1153388 C T synonymous Rv1029 N459N
6 1153920 C T nonsynonymous Rv1030 T66I
6 1154634 T C nonsynonymous Rv1030 V304A
1 1155700 C T synonymous Rv1030 I659I
5 1155819 T C nonsynonymous Rv1030 V699A
5 1156224 C T synonymous Rv1031 G124G
6 1156704 A G nonsynonymous Rv1032c T418A
1 1157771 C G nonsynonymous Rv1032c S62C
3 1164571 A G intergenic - -
modern 1165521 T A intergenic - -
2 1168776 A C nonsynonymous Rv1046c R151S
2 1175343 C T intergenic - -
6 1177815 A G nonsynonymous Rv1056 D63G
1 1184826 G T nonsynonymous Rv1061 R271L
5 1186287 C A synonymous Rv1063c A179A
5 1190588 G A intergenic - -
5 1192641 C A nonsynonymous Rv1069c H545N
1 1192830 C T nonsynonymous Rv1069c R482W
5 1196194 C T intergenic - -
6 1197169 G C intergenic - -
1 1199019 A G nonsynonymous Rv1074c T119A
2 1201581 A C nonsynonymous Rv1076 Q272P
5 1203264 C T intergenic - -
modern 1203824 C T nonsynonymous Rv1078 T171I
4 1211369 C A nonsynonymous Rv1086 R259S
5 1213925 C T intergenic - -
5 1215581 A C nonsynonymous Rv1089A H22P
6 1220180 C T nonsynonymous Rv1092c R3W
modern 1220570 T G intergenic - -
6 1224174 C T nonsynonymous Rv1095 A393V
6 1225198 C T nonsynonymous Rv1096 P272S
6 1226021 C A nonsynonymous Rv1097c Q42K
1 1228116 T C nonsynonymous Rv1099c I156T
4 1230778 T C nonsynonymous Rv1102c I65T
5 1232089 C T intergenic - -
6 1233275 C T synonymous Rv1106c L228L
modern 1235446 G A synonymous Rv1108c S5S
3 1236433 A C nonsynonymous Rv1110 E83D
1 1237403 C T nonsynonymous Rv1111c P264S
6 1238483 G A nonsynonymous Rv1112 V77I
5 1239649 C T nonsynonymous Rv1114 R14C
5 1240578 C T nonsynonymous Rv1115 P131L
3 1240744 C A nonsynonymous Rv1115 N186K
3 1241572 A G intergenic - -
5 1242007 G A synonymous Rv1118c R275R
5 1242416 A G nonsynonymous Rv1118c Q139R
5 1243724 T C synonymous Rv1121 G6G
6 1245781 C T nonsynonymous Rv1122 T218I
6 1247306 C G synonymous Rv1124 V60V
3 1247391 A C nonsynonymous Rv1124 S89R
4 1248382 G A nonsynonymous Rv1125 G101S
4 1248936 C G synonymous Rv1125 P285P
5 1250131 T C nonsynonymous Rv1127c V425A
4 1250340 C T synonymous Rv1127c A355A
modern 1250357 C A nonsynonymous Rv1127c P350T
5 1251071 T G nonsynonymous Rv1127c S112A
6 1253028 C T intergenic - -
4 1254562 G A nonsynonymous Rv1130 G3D
5 1255685 C T synonymous Rv1130 F377F
6 1256012 G T synonymous Rv1130 R486R
5 1256176 A G synonymous Rv1131 K15K
5 1256895 A C nonsynonymous Rv1131 E255A
5 1257823 G A nonsynonymous Rv1132 A167T
1 1262230 C T intergenic - -
3 1265828 C T synonymous Rv1138c L221L
6 1265913 T C synonymous Rv1138c T192T
3 1271187 C G nonsynonymous Rv1144 T11S
6 1275025 T A synonymous Rv1147 G42G
6 1275084 C T nonsynonymous Rv1147 P62L
5 1275333 A G nonsynonymous Rv1147 H145R
6 1279184 G A nonsynonymous Rv1151c R145Q
6 1281685 C G nonsynonymous Rv1155 A86G
4 1281771 C T nonsynonymous Rv1155 P115S
3 1281984 G A intergenic - -
5 1283821 T C synonymous Rv1157c N117N
6 1283851 G A synonymous Rv1157c S107S
5 1284479 C G synonymous Rv1158c T128T
5 1284931 C T intergenic - -
6 1286582 A G intergenic - -
modern 1287112 C T intergenic - -
1 1287160 A C intergenic - -
3 1287372 G C synonymous Rv1161 L15L
5 1288251 C T synonymous Rv1161 C308C
5 1288630 A C nonsynonymous Rv1161 S435R
5 1305657 G A intergenic - -
6 1306281 T G nonsynonymous Rv1175c I649S
3 1308317 A G nonsynonymous Rv1176c H159R
modern 1310316 T C nonsynonymous Rv1178 V318A
5 1313128 C A synonymous Rv1179c R58R
5 1313131 C A nonsynonymous Rv1179c R57S
6 1313726 T C nonsynonymous Rv1180 V1A
6 1314261 G A synonymous Rv1180 S179S
1 1314617 C T nonsynonymous Rv1180 A298V
5 1316651 A G nonsynonymous Rv1181 Q473R
2 1317655 C T synonymous Rv1181 L808L
6 1318990 C T nonsynonymous Rv1181 L1253F
1 1320508 G A synonymous Rv1182 V158V
1 1320614 C G nonsynonymous Rv1182 L194V
3 1325650 G A intergenic - -
2 1329234 C T synonymous Rv1186c D24D
5 1331789 C T nonsynonymous Rv1188 R257C
3 1340784 T C synonymous Rv1197 G42G
modern 1341040 A C nonsynonymous Rv1198 D12A
1 1344857 G A nonsynonymous Rv1201c G105S
3 1345016 G A nonsynonymous Rv1201c A52T
6 1347173 G A synonymous Rv1204c L484L
6 1347264 C T nonsynonymous Rv1204c T454M
1 1348520 G T synonymous Rv1204c L35L
1 1348521 T C nonsynonymous Rv1204c L35P
4 1351172 G A intergenic - -
5 1352566 C T synonymous Rv1208 H141H
6 1355937 C T synonymous Rv1213 F34F
1 1356648 C T synonymous Rv1213 D271D
5 1358934 A C nonsynonymous Rv1215c S171R
modern 1358940 T G nonsynonymous Rv1215c S169A
5 1359908 T G nonsynonymous Rv1216c V80G
5 1360604 G A nonsynonymous Rv1217c V400I
6 1366736 T C nonsynonymous Rv1223 F288L
6 1367208 G A nonsynonymous Rv1223 S445N
4 1367484 G T nonsynonymous Rv1224 G8W
6 1368133 G T nonsynonymous Rv1225c A197S
modern 1368947 C T nonsynonymous Rv1226c A450V
1 1369389 G A nonsynonymous Rv1226c G303S
1 1369735 C T synonymous Rv1226c Y187Y
3 1371470 G A nonsynonymous Rv1228 R184H
6 1372002 G A synonymous Rv1229c L316L
5 1372975 C G nonsynonymous Rv1230c A408G
6 1373576 T C nonsynonymous Rv1230c W208R
1 1374578 G A nonsynonymous Rv1231c R96H
1 1374639 A T nonsynonymous Rv1231c N76Y
6 1375349 G T nonsynonymous Rv1232c G274W
5 1377185 C G synonymous Rv1234 L70L
3 1377568 A G synonymous Rv1235 V15V
5 1383185 A C intergenic - -
5 1383970 A C nonsynonymous Rv1240 D253A
6 1384188 C A nonsynonymous Rv1240 L326I
6 1384255 G T intergenic - -
1 1387211 G A nonsynonymous Rv1244 A119T
6 1387580 C A nonsynonymous Rv1244 Q242K
6 1388517 G A nonsynonymous Rv1245c V38I
1 1389866 G A nonsynonymous Rv1248c G1063S
6 1390089 G A synonymous Rv1248c A988A
4 1390763 A G nonsynonymous Rv1248c M764V
6 1391728 G A nonsynonymous Rv1248c R442H
modern 1395010 G A nonsynonymous Rv1250 G278R
3 1396618 G T stopgain Rv1251c E875X
modern 1397201 C T synonymous Rv1251c N680N
1 1397215 G C nonsynonymous Rv1251c G676R
5 1397633 G A synonymous Rv1251c K536K
modern 1400396 G A nonsynonymous Rv1253 V143M
3 1401033 C T nonsynonymous Rv1253 S355L
5 1403266 G C nonsynonymous Rv1255c A41P
6 1404738 C T synonymous Rv1257c L449L
6 1406685 T C nonsynonymous Rv1258c V219A
5 1407273 A T nonsynonymous Rv1258c D23V
modern 1410062 G C nonsynonymous Rv1262c R104P
5 1413242 G T intergenic - -
6 1414870 C T nonsynonymous Rv1266c T324I
6 1416633 C G nonsynonymous Rv1267c L239V
1 1417019 G A nonsynonymous Rv1267c C110Y
modern 1417554 C G intergenic - -
1 1417793 C T synonymous Rv1268c T188T
5 1419373 A G nonsynonymous Rv1270c T126A
5 1422079 G C nonsynonymous Rv1272c R76P
3 1422666 G A nonsynonymous Rv1273c G462E
3 1422667 G A nonsynonymous Rv1273c G462R
modern 1424699 C T nonsynonymous Rv1274 P168L
1 1426928 C T synonymous Rv1277 F255F
5 1436284 T C synonymous Rv1283c G278G
6 1438981 C T synonymous Rv1286 L25L
3 1440090 C T nonsynonymous Rv1286 T395I
3 1441545 G A synonymous Rv1288 A66A
6 1442734 C G intergenic - -
1 1443354 C T synonymous Rv1289 A196A
1 1445977 C T intergenic - -
3 1450316 C T synonymous Rv1294 A314A
6 1452717 C T nonsynonymous Rv1296 P241S
6 1453680 C T synonymous Rv1297 T159T
6 1454811 C T synonymous Rv1297 N536N
modern 1458144 A C nonsynonymous Rv1301 Q196P
6 1461251 G T synonymous Rv1305 A69A
5 1463143 C T nonsynonymous Rv1307 S434L
5 1465542 A C nonsynonymous Rv1309 Y220S
6 1467735 G A synonymous Rv1312 G16G
3 1478357 C A synonymous Rv1317c P254P
4 1479085 G A nonsynonymous Rv1317c V12I
1 1480972 A G synonymous Rv1319c E510E
5 1481038 G A synonymous Rv1319c L488L
3 1481563 G A synonymous Rv1319c E313E
5 1482978 G A nonsynonymous Rv1320c A414T
6 1486647 G C nonsynonymous Rv1323 Q262H
5 1487674 G A nonsynonymous Rv1324 A172T
6 1490140 G A nonsynonymous Rv1326c A725T
4 1490905 C T nonsynonymous Rv1326c P470S
modern 1490911 T C nonsynonymous Rv1326c S468P
modern 1492194 G A nonsynonymous Rv1326c G40D
3 1495836 G A nonsynonymous Rv1328 V425M
modern 1496964 T G nonsynonymous Rv1328 S801A
5 1501448 G T nonsynonymous Rv1332 V175L
modern 1505194 C T synonymous Rv1338 G40G
5 1505806 C T synonymous Rv1338 P244P
modern 1505973 A G synonymous Rv1339 R19R
6 1507308 G C nonsynonymous Rv1340 R185P
5 1507920 G A synonymous Rv1341 V116V
6 1508682 A C nonsynonymous Rv1343c E81A
5 1509093 C G synonymous Rv1344 L42L
3 1513189 C T nonsynonymous Rv1348 A48V
1 1514010 G T nonsynonymous Rv1348 V322F
6 1515003 G A nonsynonymous Rv1348 A653T
5 1518271 G A nonsynonymous Rv1351 R14K
5 1518280 C T nonsynonymous Rv1351 S17F
5 1518681 C G intergenic - -
1 1521526 A G synonymous Rv1354c E117E
6 1521892 A G nonsynonymous Rv1355c H714R
1 1522862 G A nonsynonymous Rv1355c V391I
modern 1523175 G T synonymous Rv1355c A286A
modern 1523791 C T nonsynonymous Rv1355c P81L
1 1525160 G A intergenic - -
6 1529346 C T nonsynonymous Rv1358 A912V
1 1534548 C T synonymous Rv1362c I21I
3 1534551 G A synonymous Rv1362c E20E
1 1535643 C T intergenic - -
6 1536183 C T nonsynonymous Rv1364c R488C
6 1537926 G A nonsynonymous Rv1365c E82K
4 1540141 G A nonsynonymous Rv1367c V169I
3 1540484 G C synonymous Rv1367c L54L
4 1544255 T C synonymous Rv1371 R299R
5 1545472 C T synonymous Rv1372 N216N
5 1545720 A G nonsynonymous Rv1372 D299G
4 1546703 T C nonsynonymous Rv1373 L231P
modern 1548087 A G nonsynonymous Rv1375 R86G
6 1548149 G A synonymous Rv1375 P106P
6 1549854 G C nonsynonymous Rv1376 R236P
5 1550945 A C synonymous Rv1377c G91G
5 1555432 C T synonymous Rv1381 T415T
5 1556030 C T synonymous Rv1383 G20G
2 1556787 T C nonsynonymous Rv1383 S273P
modern 1559562 A C nonsynonymous Rv1384 E821A
6 1560088 G T synonymous Rv1384 L996L
1 1560912 G A synonymous Rv1385 E156E
1 1563686 G T intergenic - -
1 1568178 G A synonymous Rv1393c P470P
5 1568891 C G nonsynonymous Rv1393c L233V
modern 1574206 G A nonsynonymous Rv1397c G103D
1 1575793 T C nonsynonymous Rv1399c V6A
2 1577241 G A synonymous Rv1401 P104P
5 1578212 A C synonymous Rv1402 A200A
1 1580181 C T nonsynonymous Rv1403c P81L
1 1581377 G A nonsynonymous Rv1405c G198D
6 1581727 C T synonymous Rv1405c P81P
modern 1584379 C A nonsynonymous Rv1407 L427I
5 1585032 C T synonymous Rv1408 V178V
modern 1585283 C A nonsynonymous Rv1409 N30K
modern 1585404 A G nonsynonymous Rv1409 T71A
6 1585900 C T nonsynonymous Rv1409 P236L
1 1589739 C T synonymous Rv1413 G118G
1 1590555 C T synonymous Rv1415 T53T
5 1591039 G A nonsynonymous Rv1415 A215T
5 1593536 T C nonsynonymous Rv1419 L11P
6 1593652 G A nonsynonymous Rv1419 A50T
5 1593762 C A synonymous Rv1419 P86P
5 1593762 C T synonymous Rv1419 P86P
6 1598404 C A nonsynonymous Rv1423 L167I
3 1599557 G C nonsynonymous Rv1424c R33T
modern 1600685 T C nonsynonymous Rv1425 L343P
1 1601528 G A nonsynonymous Rv1426c C265Y
5 1601722 G A synonymous Rv1426c P200P
5 1603637 A G nonsynonymous Rv1427c N98D
1 1604290 T A nonsynonymous Rv1428c V157D
5 1605016 C T synonymous Rv1429 L47L
5 1605569 T C nonsynonymous Rv1429 V231A
4 1608276 C A nonsynonymous Rv1431 T65N
5 1611024 C T synonymous Rv1432 T392T
modern 1611283 C T intergenic - -
3 1614143 G A synonymous Rv1436 L279L
1 1616831 A G intergenic - -
1 1617833 C T intergenic - -
6 1620178 A C nonsynonymous Rv1442 M130L
6 1620179 T C nonsynonymous Rv1442 M130T
1 1625259 T G nonsynonymous Rv1446c I36S
6 1629058 C T nonsynonymous Rv1449c S381L
modern 1639418 G A synonymous Rv1453 L346L
modern 1639643 C T synonymous Rv1453 R421R
5 1639699 T G nonsynonymous Rv1454c V321G
modern 1640442 A G synonymous Rv1454c G73G
2 1643864 A G nonsynonymous Rv1458c T133A
1 1644250 C G nonsynonymous Rv1458c A4G
6 1646982 C T nonsynonymous Rv1460 A266V
1 1647830 G A nonsynonymous Rv1461 G281D
5 1648089 G T synonymous Rv1461 L367L
5 1648224 C T synonymous Rv1461 H412H
5 1649265 G A synonymous Rv1461 L759L
3 1650406 C A nonsynonymous Rv1462 T294N
2 1651308 A G nonsynonymous Rv1463 E198G
5 1658030 C T synonymous Rv1469 I356I
1 1658535 G T nonsynonymous Rv1469 V525F
6 1659318 C T synonymous Rv1470 G113G
1 1659902 T C nonsynonymous Rv1472 V47A
modern 1659994 T C nonsynonymous Rv1472 S78P
6 1662719 C G synonymous Rv1474c A162A
6 1666671 G A synonymous Rv1476 V156V
6 1666796 C A intergenic - -
5 1669296 G A nonsynonymous Rv1479 G5D
5 1672707 C T nonsynonymous Rv1482c T198M
5 1673338 G A intergenic - -
6 1674434 T C nonsynonymous Rv1484 V78A
6 1676742 C A nonsynonymous Rv1486c H48N
3 1676880 T C nonsynonymous Rv1486c W2R
5 1678639 C G nonsynonymous Rv1489 P30A
2 1678706 A C nonsynonymous Rv1489 K52T
6 1679008 A G nonsynonymous Rv1489A K23E
3 1681147 G A intergenic - -
5 1686414 C T synonymous Rv1494 D48D
6 1686538 C T nonsynonymous Rv1494 R90W
5 1686737 C T synonymous Rv1495 P56P
4 1688300 C T synonymous Rv1497 F120F
6 1690345 T C nonsynonymous Rv1498A M1T
6 1690970 G T nonsynonymous Rv1500 D41Y
1 1691520 T G nonsynonymous Rv1500 L224R
6 1692061 C G nonsynonymous Rv1501 R58G
6 1694199 G A nonsynonymous Rv1503c E116K
modern 1694547 G T stopgain Rv1504c E200X
modern 1695674 T C synonymous Rv1505c Y91Y
2 1695796 G A nonsynonymous Rv1505c A51T
6 1696132 G A synonymous Rv1506c A104A
5 1696941 C A nonsynonymous Rv1507c A161E
5 1696942 G A nonsynonymous Rv1507c A161T
6 1699448 G T nonsynonymous Rv1508c W149C
4 1699849 T C nonsynonymous Rv1508c S16P
6 1701212 A C intergenic - -
6 1702423 G A nonsynonymous Rv1510 A377T
6 1702803 G A intergenic - -
5 1703786 A C nonsynonymous Rv1511 Q238P
modern 1706746 T G nonsynonymous Rv1515c F261V
3 1708792 T C intergenic - -
5 1710085 G C nonsynonymous Rv1518 V148L
3 1710767 T A nonsynonymous Rv1519 L12H
5 1711221 C T nonsynonymous Rv1520 T65I
5 1711619 G A nonsynonymous Rv1520 A198T
6 1711637 G T nonsynonymous Rv1520 A204S
2 1711670 C T nonsynonymous Rv1520 R215C
6 1712522 G T nonsynonymous Rv1521 S74I
3 1713923 A G nonsynonymous Rv1521 H541R
6 1714678 A G nonsynonymous Rv1522c T979A
4 1716472 C T nonsynonymous Rv1522c P381S
5 1717141 G C nonsynonymous Rv1522c A158P
1 1718444 T C synonymous Rv1523 R264R
1 1721213 T G nonsynonymous Rv1526c I283S
1 1721490 G C nonsynonymous Rv1526c A191P
5 1721987 G A nonsynonymous Rv1526c R25Q
2 1722228 T G nonsynonymous Rv1527c L2061R
6 1725768 C T nonsynonymous Rv1527c T881I
modern 1726816 A C nonsynonymous Rv1527c T532P
6 1728324 G T nonsynonymous Rv1527c G29V
1 1728615 A C intergenic - -
5 1731312 G A intergenic - -
1 1731563 A T nonsynonymous Rv1530 H64L
5 1733563 A G intergenic - -
5 1733618 C T synonymous Rv1533 T3T
1 1735903 A G intergenic - -
5 1735926 T G intergenic - -
1 1739390 A G nonsynonymous Rv1536 T958A
1 1745446 T A nonsynonymous Rv1543 L128H
5 1746404 T C nonsynonymous Rv1544 L104S
3 1748439 C T nonsynonymous Rv1547 T249I
1 1754299 A C nonsynonymous Rv1550 E195A
5 1754786 G A synonymous Rv1550 G357G
1 1754983 C T nonsynonymous Rv1550 A423V
5 1757727 G A nonsynonymous Rv1552 G16D
1 1758790 C T synonymous Rv1552 P370P
modern 1760923 C T synonymous Rv1555 T124T
modern 1761789 C G nonsynonymous Rv1557 L16V
3 1763482 G A nonsynonymous Rv1559 G19R
3 1764225 C T synonymous Rv1559 A266A
6 1774116 A C nonsynonymous Rv1566c T169P
6 1775066 G A nonsynonymous Rv1567c V27I
6 1775167 G A intergenic - -
5 1776614 C T nonsynonymous Rv1568 A408V
6 1776663 G A synonymous Rv1568 S424S
6 1778197 G A synonymous Rv1570 L113L
3 1789933 G A intergenic - -
1 1791823 T C nonsynonymous Rv1591 L85S
modern 1797577 G A nonsynonymous Rv1596 D64N
5 1799921 C A synonymous Rv1599 G113G
3 1802047 C G nonsynonymous Rv1601 Q5E
6 1805289 C A nonsynonymous Rv1605 P146Q
6 1805670 G A synonymous Rv1606 K6K
5 1810169 T G intergenic - -
6 1810253 C T nonsynonymous Rv1611 T5I
5 1811243 A C nonsynonymous Rv1612 E39D
6 1811964 C A synonymous Rv1612 R280R
3 1812448 C T synonymous Rv1613 Y30Y
6 1816587 C G synonymous Rv1617 V133V
3 1818286 C T synonymous Rv1618 G224G
6 1824107 G C nonsynonymous Rv1622c M98I
modern 1826054 G A nonsynonymous Rv1624c A178T
1 1826343 G A synonymous Rv1624c R81R
5 1826577 C A nonsynonymous Rv1624c H3Q
1 1826624 C T nonsynonymous Rv1625c A441V
3 1827468 G T nonsynonymous Rv1625c V160L
1 1827553 G C nonsynonymous Rv1625c L131F
6 1827946 G A intergenic - -
6 1828389 C T synonymous Rv1626 R70R
6 1828506 G C nonsynonymous Rv1626 M109I
1 1829576 G A synonymous Rv1627c E166E
6 1829836 C T nonsynonymous Rv1627c P80S
2 1831220 A C nonsynonymous Rv1629 T186P
2 1831226 A G nonsynonymous Rv1629 R188G
2 1831288 C T synonymous Rv1629 P208P
3 1832509 C G synonymous Rv1629 T615T
1 1832642 G T nonsynonymous Rv1629 G660C
1 1832643 G C nonsynonymous Rv1629 G660A
3 1833025 C T synonymous Rv1629 D787D
2 1834177 A C synonymous Rv1630 R212R
3 1836417 G T synonymous Rv1632c G138G
6 1837168 G T nonsynonymous Rv1633 A32S
modern 1839260 G T synonymous Rv1634 L31L
1 1839329 G A synonymous Rv1634 R54R
4 1839759 C G nonsynonymous Rv1634 R198G
6 1840543 C T nonsynonymous Rv1634 S459F
6 1841538 G A synonymous Rv1635c A235A
5 1846092 C T synonymous Rv1638 Y784Y
6 1846552 G T nonsynonymous Rv1638 A938S
5 1846854 C T synonymous Rv1638A Y40Y
1 1848147 G T synonymous Rv1639c V104V
5 1848963 G T nonsynonymous Rv1640c D1025Y
1 1849191 A G nonsynonymous Rv1640c I949V
4 1849609 G A synonymous Rv1640c R809R
5 1849814 T C nonsynonymous Rv1640c V741A
3 1852877 A C stoplost Rv1641 X202S
5 1853565 G A nonsynonymous Rv1643 A128T
1 1853974 C T synonymous Rv1644 G123G
6 1855260 G A synonymous Rv1645c S65S
4 1859559 A C nonsynonymous Rv1649 D276A
1 1859989 C T nonsynonymous Rv1650 R78W
5 1862099 C A nonsynonymous Rv1650 T781N
6 1867707 C T synonymous Rv1653 P359P
6 1870031 A C nonsynonymous Rv1656 Q37P
5 1870194 G C nonsynonymous Rv1656 L91F
modern 1872959 G A synonymous Rv1659 R107R
3 1873700 G A synonymous Rv1659 L354L
1 1873954 G T nonsynonymous Rv1659 R439L
5 1874985 A C nonsynonymous Rv1660 T276P
6 1875585 C T synonymous Rv1661 T94T
1 1876739 T G nonsynonymous Rv1661 V479G
2 1877744 A C nonsynonymous Rv1661 E814A
6 1882180 C T synonymous Rv1662 A159A
6 1885481 C G nonsynonymous Rv1662 R1260G
6 1886077 G A synonymous Rv1662 A1458A
1 1886263 C G nonsynonymous Rv1662 H1520Q
1 1887284 G A nonsynonymous Rv1663 R258Q
modern 1889073 C G nonsynonymous Rv1664 P350A
5 1890948 T G nonsynonymous Rv1664 L975V
2 1897608 G A synonymous Rv1672c L200L
1 1897646 C G nonsynonymous Rv1672c P188A
6 1900021 C T intergenic - -
modern 1900800 C T nonsynonymous Rv1675c A59V
1 1902156 G A nonsynonymous Rv1677 G137R
6 1903173 C T synonymous Rv1678 I259I
modern 1906336 G A intergenic - -
1 1907177 C T nonsynonymous Rv1682 R259W
6 1907794 G A synonymous Rv1683 V67V
3 1908598 G A synonymous Rv1683 R335R
5 1909456 C T synonymous Rv1683 A621A
5 1911301 G A synonymous Rv1685c A33A
6 1912024 G T nonsynonymous Rv1686c V20F
6 1912582 C T synonymous Rv1687c A113A
6 1912617 G A nonsynonymous Rv1687c D102N
6 1914570 A C synonymous Rv1689 A323A
modern 1920120 T G synonymous Rv1696 P146P
5 1923633 C T nonsynonymous Rv1698 H297Y
1 1923985 G A nonsynonymous Rv1699 V53I
modern 1924959 C T synonymous Rv1699 G377G
3 1925136 G A synonymous Rv1699 V436V
3 1926029 T C nonsynonymous Rv1700 Y150H
modern 1931470 A G intergenic - -
modern 1935695 G T nonsynonymous Rv1707 A272S
modern 1936525 A G nonsynonymous Rv1708 T56A
6 1937727 G T nonsynonymous Rv1709 A139S
5 1939007 A C nonsynonymous Rv1711 Q57P
1 1940307 G T nonsynonymous Rv1713 W7L
5 1942121 C T nonsynonymous Rv1714 A90V
5 1946975 C G intergenic - -
3 1946999 T G intergenic - -
5 1951438 C T nonsynonymous Rv1725c P105L
6 1951764 G A intergenic - -
modern 1952160 C T synonymous Rv1726 T103T
3 1952743 G A nonsynonymous Rv1726 V298M
5 1955339 A C synonymous Rv1729c A77A
6 1955686 G A intergenic - -
2 1955941 C G nonsynonymous Rv1730c D435E
1 1956930 A C nonsynonymous Rv1730c T106P
6 1958423 C T synonymous Rv1731 I249I
5 1961730 C T nonsynonymous Rv1735c A20V
1 1963383 C T synonymous Rv1736c Y268Y
modern 1963957 G A nonsynonymous Rv1736c G77D
5 1964984 G A synonymous Rv1737c R129R
6 1965434 A G intergenic - -
2 1967543 G T nonsynonymous Rv1739c G32V
6 1968116 A C nonsynonymous Rv1741 K67T
5 1968284 C T nonsynonymous Rv1742 R38W
6 1970407 G T synonymous Rv1743 P468P
6 1970432 G A nonsynonymous Rv1743 A477T
4 1971725 G C synonymous Rv1745c R89R
5 1971965 A C synonymous Rv1745c P9P
3 1972901 C T nonsynonymous Rv1746 A255V
6 1975960 G T synonymous Rv1747 S777S
3 1977646 C T synonymous Rv1749c A80A
1 1978807 C A nonsynonymous Rv1750c A254D
1 1979026 G A nonsynonymous Rv1750c S181N
2 1980652 G T synonymous Rv1751 P344P
5 1989054 A T nonsynonymous Rv1758 I5F
5 1989057 G A nonsynonymous Rv1758 G6R
1 1989370 C T nonsynonymous Rv1758 P110L
5 1989553 G C nonsynonymous Rv1758 G171A
6 1992683 C T intergenic - -
3 1993561 T C nonsynonymous Rv1760 C137R
6 1993683 C T synonymous Rv1760 V177V
5 1997460 G A synonymous Rv1765c R352R
3 2003252 C T synonymous Rv1769 A209A
5 2005152 G A synonymous Rv1770 E425E
5 2005607 T G nonsynonymous Rv1771 S149R
6 2005758 G C nonsynonymous Rv1771 E200Q
6 2006954 C T intergenic - -
5 2007015 C G intergenic - -
5 2008140 A C synonymous Rv1774 P103P
4 2010614 A G intergenic - -
3 2010880 T G synonymous Rv1777 G75G
3 2011568 G C nonsynonymous Rv1777 E305Q
1 2013943 C G nonsynonymous Rv1779c D179E
modern 2017291 G A synonymous Rv1781c R62R
1 2017560 A T intergenic - -
5 2017860 C T nonsynonymous Rv1782 R41C
6 2017861 G A nonsynonymous Rv1782 R41H
5 2018883 G C nonsynonymous Rv1782 G382R
modern 2019236 G T synonymous Rv1782 P499P
1 2023211 G T nonsynonymous Rv1784 V860F
6 2025032 C T intergenic - -
6 2032898 C T nonsynonymous Rv1795 P220L
6 2033021 G A nonsynonymous Rv1795 G261D
5 2033307 T C synonymous Rv1795 A356A
3 2034676 C T synonymous Rv1796 I316I
1 2035937 G A nonsynonymous Rv1797 R152H
5 2047454 G A intergenic - -
6 2049907 G A intergenic - -
6 2053439 C T intergenic - -
6 2053762 C T nonsynonymous Rv1811 A107V
3 2056184 C T intergenic - -
5 2060377 C T synonymous Rv1817 Y261Y
6 2060557 C T synonymous Rv1817 F321F
5 2060606 C T nonsynonymous Rv1817 H338Y
modern 2062922 G A nonsynonymous Rv1819c V603I
6 2063121 A T synonymous Rv1819c P536P
1 2066471 T C synonymous Rv1821 G5G
5 2069546 T C nonsynonymous Rv1822 V156A
6 2071192 C T nonsynonymous Rv1825 R53W
6 2071410 C T synonymous Rv1825 D125D
modern 2072190 C A nonsynonymous Rv1826 A80E
2 2072313 C A nonsynonymous Rv1826 T121K
5 2080594 C T synonymous Rv1834 R255R
1 2083124 A G nonsynonymous Rv1836c S505G
5 2087453 T C nonsynonymous Rv1838c I67T
3 2087652 G A nonsynonymous Rv1838c V1M
1 2090306 C T nonsynonymous Rv1841c P138L
5 2090366 T C nonsynonymous Rv1841c M118T
6 2090776 T C nonsynonymous Rv1842c L437S
1 2092391 C A synonymous Rv1843c I436I
1 2092970 G A synonymous Rv1843c L243L
modern 2093715 C T intergenic - -
5 2095234 G T nonsynonymous Rv1845c R312L
1 2096094 G A synonymous Rv1845c T25T
1 2096430 T G nonsynonymous Rv1846c L57R
modern 2097144 C G nonsynonymous Rv1847 L90V
5 2099402 T G nonsynonymous Rv1850 V481G
5 2099631 C T synonymous Rv1850 G557G
5 2101921 G A synonymous Rv1854c S374S
5 2103112 C A intergenic - -
1 2104779 G A synonymous Rv1856c G15G
5 2107050 G A synonymous Rv1859 V159V
6 2107511 C T nonsynonymous Rv1859 T313M
6 2108374 C T synonymous Rv1860 D213D
4 2108890 C A intergenic - -
5 2108980 A T intergenic - -
modern 2110365 C A synonymous Rv1862 T274T
6 2115064 C T nonsynonymous Rv1866 A642V
6 2115776 C T nonsynonymous Rv1867 P5S
modern 2120796 T A stopgain Rv1870c L212X
6 2122380 C G nonsynonymous Rv1872c L258V
5 2122443 G T nonsynonymous Rv1872c A237S
modern 2122625 C T nonsynonymous Rv1872c A176V
4 2122976 C G nonsynonymous Rv1872c A59G
5 2123146 G A synonymous Rv1872c A2A
6 2124926 T C intergenic - -
5 2125054 C T intergenic - -
6 2125863 C T intergenic - -
6 2127646 C T synonymous Rv1877 T581T
modern 2128372 G A synonymous Rv1878 T117T
modern 2129281 A C synonymous Rv1878 A420A
modern 2130529 G A intergenic - -
5 2130784 G A synonymous Rv1880c P358P
1 2132062 G C nonsynonymous Rv1881c G90R
5 2132077 A G nonsynonymous Rv1881c I85V
5 2136642 C T nonsynonymous Rv1887 L129F
1 2138767 C G nonsynonymous Rv1889c A84G
5 2140748 G C synonymous Rv1894c A374A
3 2142250 C T intergenic - -
6 2143839 A G nonsynonymous Rv1896c E203G
modern 2145878 C T nonsynonymous Rv1899c P123L
6 2150754 A G nonsynonymous Rv1903 T131A
modern 2151678 A C nonsynonymous Rv1905c T240P
3 2151780 C A nonsynonymous Rv1905c Q206K
3 2153184 T G intergenic - -
4 2154724 T G nonsynonymous Rv1908c L463R
6 2155503 C T synonymous Rv1908c T203T
modern 2156868 A C synonymous Rv1910c G144G
4 2158109 G A nonsynonymous Rv1912c G328D
5 2158190 T G nonsynonymous Rv1912c I301S
3 2158905 G T stopgain Rv1912c G63X
6 2159337 G T synonymous Rv1913 A49A
6 2167564 G T intergenic - -
6 2172012 G C nonsynonymous Rv1920 M130I
2 2172380 A C nonsynonymous Rv1920 E253A
3 2172526 T G stoplost Rv1921c X424G
5 2173728 C A nonsynonymous Rv1921c A23D
5 2176006 C A synonymous Rv1923 G278G
5 2176648 T G nonsynonymous Rv1924c W95G
modern 2177073 C T intergenic - -
1 2177968 G T synonymous Rv1925 A294A
5 2178941 G A nonsynonymous Rv1925 V619M
6 2181541 C A synonymous Rv1929c P122P
1 2185358 C T synonymous Rv1934c T277T
2 2185674 T C nonsynonymous Rv1934c I172T
3 2186127 C T nonsynonymous Rv1934c A21V
3 2186236 G A synonymous Rv1935c P308P
5 2186371 C T synonymous Rv1935c A263A
5 2186421 A C nonsynonymous Rv1935c T247P
5 2195637 T G nonsynonymous Rv1944c Y100D
6 2195922 A G nonsynonymous Rv1944c T5A
6 2195923 T G nonsynonymous Rv1944c D4E
4 2199052 C G nonsynonymous Rv1948c R5G
6 2199061 A C nonsynonymous Rv1948c T2P
modern 2199416 C T nonsynonymous Rv1949c L207F
5 2206970 G A intergenic - -
5 2208538 G A stopgain Rv1965 W11X
4 2209465 A G nonsynonymous Rv1966 T47A
5 2210198 A G nonsynonymous Rv1966 Y291C
3 2216345 C A synonymous Rv1971 P363P
1 2216370 G C nonsynonymous Rv1971 G372R
modern 2218012 G A synonymous Rv1974 A118A
1 2218488 G C nonsynonymous Rv1975 S146T
3 2220947 A G nonsynonymous Rv1978 M14V
6 2221313 G A nonsynonymous Rv1978 A136T
modern 2222308 G A nonsynonymous Rv1979c G286D
5 2223902 T A nonsynonymous Rv1980c I43N
6 2225175 T G nonsynonymous Rv1981c L5R
4 2229801 C G synonymous Rv1985c P34P
modern 2231486 G A intergenic - -
1 2237497 C G nonsynonymous Rv1993c L27V
1 2238930 A C intergenic - -
1 2239055 C T nonsynonymous Rv1996 P18S
1 2240062 C G intergenic - -
1 2241646 C T synonymous Rv1997 A496A
modern 2241742 G A synonymous Rv1997 P528P
1 2242808 G A nonsynonymous Rv1997 A884T
modern 2243034 C A nonsynonymous Rv1998c R230S
6 2244343 C A synonymous Rv1999c R266R
3 2245916 T G synonymous Rv2000 A236A
5 2246459 C A synonymous Rv2000 A417A
modern 2246960 C G synonymous Rv2001 V43V
1 2249035 T C nonsynonymous Rv2003c M129T
1 2255942 G T nonsynonymous Rv2006 R1314L


4 2260100 T C intergenic - -
6 2261693 G C intergenic - -
6 2265993 A C nonsynonymous Rv2019 Q2P
6 2266051 C T synonymous Rv2019 G21G
2 2267015 A G synonymous Rv2021c A32A
6 2267976 C A synonymous Rv2023c R45R
2 2268627 C G nonsynonymous Rv2023A Q34E
5 2268887 C G nonsynonymous Rv2024c L452V
6 2269376 C T nonsynonymous Rv2024c R289C
6 2274463 T G nonsynonymous Rv2027c L16V
3 2275764 C T nonsynonymous Rv2029c L221F
6 2275771 A G synonymous Rv2029c A218A
6 2276918 T C synonymous Rv2030c H523H
5 2278426 C A nonsynonymous Rv2030c R21S
6 2281289 G T intergenic - -
modern 2282376 T C nonsynonymous Rv2036 V93A
modern 2282377 T C synonymous Rv2036 V93V
5 2283293 C G synonymous Rv2037c T143T
3 2284456 A C nonsynonymous Rv2038c E114A
3 2285558 C T synonymous Rv2039c C28C
2 2288085 C G synonymous Rv2042c A199A
5 2290062 G A nonsynonymous Rv2045c A387T
5 2291331 C A synonymous Rv2046 G21G
5 2291331 C T synonymous Rv2046 G21G
2 2294007 A G nonsynonymous Rv2047c T174A
3 2296876 G C nonsynonymous Rv2048c G3371R
1 2297766 C T nonsynonymous Rv2048c A3074V
modern 2301089 G A synonymous Rv2048c L1966L
5 2304017 G A synonymous Rv2048c L990L
3 2306472 A T nonsynonymous Rv2048c Y172F
3 2306472 A G nonsynonymous Rv2048c Y172C
3 2309203 C T nonsynonymous Rv2051c A518V
1 2309356 C T nonsynonymous Rv2051c T467I
5 2313815 G A nonsynonymous Rv2054 A231T
1 2321358 G C nonsynonymous Rv2063A R101P
5 2323291 C T synonymous Rv2066 A39A
6 2323880 T C synonymous Rv2066 L236L
3 2325320 A G nonsynonymous Rv2067c D184G
1 2327904 A C nonsynonymous Rv2070c I108L
5 2328420 G A synonymous Rv2071c T186T
4 2328543 G A nonsynonymous Rv2071c M145I
6 2328627 G A synonymous Rv2071c A117A
5 2328820 A G nonsynonymous Rv2071c D53G
6 2329466 G A synonymous Rv2072c P227P
1 2331255 C T nonsynonymous Rv2074 A88V
4 2331620 G T synonymous Rv2075c G420G
4 2331789 A C nonsynonymous Rv2075c Q364P
3 2333215 G A nonsynonymous Rv2076c C25Y
3 2335080 T C synonymous Rv2078 L8L
1 2335500 C T nonsynonymous Rv2079 A49V
5 2335650 G A nonsynonymous Rv2079 G99D
6 2336985 G C nonsynonymous Rv2079 G544A
2 2337179 C T stopgain Rv2079 Q609X


6 2337373 T C nonsynonymous Rv2080 V23A
5 2338773 G A nonsynonymous Rv2082 R22Q
2 2338810 T C synonymous Rv2082 R34R
2 2338811 A G nonsynonymous Rv2082 K35E
5 2338961 G A nonsynonymous Rv2082 V85I
3 2339240 G A nonsynonymous Rv2082 G178S
4 2339255 G A nonsynonymous Rv2082 A183T
5 2339605 A G synonymous Rv2082 P299P
modern 2341030 A G nonsynonymous Rv2083 T54A
3 2345085 A C synonymous Rv2088 A225A
3 2346929 G A synonymous Rv2089c L132L
6 2347616 C A nonsynonymous Rv2090 L82I
1 2348482 G T nonsynonymous Rv2090 E370D
3 2348708 C T synonymous Rv2091c V195V
6 2349116 G A synonymous Rv2091c P59P
5 2349418 T C synonymous Rv2092c A879A
modern 2350186 C T synonymous Rv2092c R623R
5 2350534 A G synonymous Rv2092c E507E
6 2351522 C T nonsynonymous Rv2092c T178M
3 2353385 C G nonsynonymous Rv2095c Q311E
5 2361174 C T nonsynonymous Rv2101 A312V
4 2369186 G C nonsynonymous Rv2109c G182R
6 2370902 G A intergenic - -
6 2372951 G C nonsynonymous Rv2113 G108R
1 2374442 G A synonymous Rv2114 S203S
modern 2376425 G A intergenic - -
modern 2379743 G C intergenic - -
6 2379997 T C nonsynonymous Rv2121c S222P
5 2382645 C T synonymous Rv2124c G1141G
5 2385408 G T nonsynonymous Rv2124c E220D
5 2388205 A G intergenic - -
4 2388641 A G nonsynonymous Rv2127 D9G
6 2389698 C A nonsynonymous Rv2127 N361K
modern 2390299 G A intergenic - -
5 2393590 C G stopgain Rv2132 Y60X
1 2397760 C G nonsynonymous Rv2138 A144G
2 2399734 G A nonsynonymous Rv2139 G339S
1 2400031 C T nonsynonymous Rv2140c R100C
3 2402765 G A intergenic - -
5 2408524 C T nonsynonymous Rv2150c S334L
4 2413246 T G nonsynonymous Rv2153c L36V
5 2414989 G C nonsynonymous Rv2155c G469A
6 2415351 G T synonymous Rv2155c V348V
5 2419044 T C nonsynonymous Rv2158c V522A
modern 2419142 C T synonymous Rv2158c V489V
4 2421816 C T nonsynonymous Rv2160c A63V-Rv2160A
5 2422502 G A synonymous Rv2161c L212L
modern 2424864 G A intergenic - -
1 2425097 T G nonsynonymous Rv2163c V664G
4 2425471 G A synonymous Rv2163c R539R
6 2427828 C G synonymous Rv2164c G137G
6 2428953 G A nonsynonymous Rv2165c G106D


5 2432185 A G intergenic - -
3 2434749 C T intergenic - -
6 2435582 G A nonsynonymous Rv2173 A246T
modern 2437259 T G nonsynonymous Rv2174 S451A
6 2437837 C T nonsynonymous Rv2175c P17L
6 2438094 T G nonsynonymous Rv2176 S52A
3 2440935 C G synonymous Rv2178c S262S
6 2443508 G T nonsynonymous Rv2181 L69F
5 2445414 C A intergenic - -
6 2447150 C A synonymous Rv2185c P117P
6 2447426 C A synonymous Rv2185c I25I
modern 2447539 G A intergenic - -
1 2448288 G A stopgain Rv2187 W43X
6 2448402 C A stopgain Rv2187 Y81X
4 2448458 T C nonsynonymous Rv2187 I100T
modern 2449295 A G nonsynonymous Rv2187 E379G
3 2449826 C G nonsynonymous Rv2187 S556W
1 2450045 C T nonsynonymous Rv2188c T369M
1 2451081 G C nonsynonymous Rv2188c E24Q
3 2452452 C T nonsynonymous Rv2190c T274M
5 2452657 G T nonsynonymous Rv2190c V206F
6 2453933 C A nonsynonymous Rv2191 R39S
6 2458234 A C nonsynonymous Rv2194 K228Q
6 2461545 G T nonsynonymous Rv2197c A202S
5 2463455 G A synonymous Rv2199c R66R
5 2465721 T C nonsynonymous Rv2201 V242A
1 2470485 G T intergenic - -
4 2470591 C A intergenic - -
3 2472029 T G nonsynonymous Rv2207 W207G
modern 2472956 C T nonsynonymous Rv2208 S155L
6 2474271 G A nonsynonymous Rv2209 G291E
5 2477562 C T synonymous Rv2212 L125L
3 2477984 G A synonymous Rv2212 S265S
6 2478180 C A nonsynonymous Rv2212 L331I
3 2478619 G A synonymous Rv2213 L94L
6 2478967 C G nonsynonymous Rv2213 F210L
3 2480809 C G nonsynonymous Rv2214c A298G
5 2485956 G A synonymous Rv2218 E228E
5 2488898 C A nonsynonymous Rv2220 D428E
5 2489855 G A synonymous Rv2221c T833T
5 2490116 A C synonymous Rv2221c A746A
5 2493513 A G nonsynonymous Rv2222c Q77R
3 2494430 G C nonsynonymous Rv2223c G324R
modern 2495500 G A synonymous Rv2224c L508L
5 2498200 G T nonsynonymous Rv2225 M153I
5 2500610 G A intergenic - -
6 2500697 C T intergenic - -
1 2500892 C T intergenic - -
modern 2501148 C T nonsynonymous Rv2227 A73V
6 2501401 G T synonymous Rv2227 P157P
6 2501668 G A synonymous Rv2228c L357L
5 2503257 C T synonymous Rv2229c I72I
5 2503491 A C nonsynonymous Rv2230c E373A


1 2503549 G T nonsynonymous Rv2230c A354S
3 2504177 G A synonymous Rv2230c E144E
1 2508395 G A synonymous Rv2235 P253P
5 2508857 C T synonymous Rv2236c A173A
1 2509181 G C synonymous Rv2236c V65V
6 2509362 C T nonsynonymous Rv2236c T5I
2 2510350 C G intergenic - -
modern 2511712 A C nonsynonymous Rv2240c K259T
5 2512359 C T synonymous Rv2240c A43A
6 2514867 G A nonsynonymous Rv2241 A777T
3 2516271 T C nonsynonymous Rv2242 M323T
5 2516804 A C synonymous Rv2243 A6A
modern 2518132 T C synonymous Rv2245 T6T
6 2520466 G A synonymous Rv2246 A357A
2 2521428 A G nonsynonymous Rv2247 D229G
6 2522284 G A intergenic - -
5 2522650 T C synonymous Rv2248 R97R
6 2522878 C A synonymous Rv2248 L173L
5 2525534 C T nonsynonymous Rv2250A R45C
6 2526709 T C nonsynonymous Rv2251 M382T
1 2528931 C T synonymous Rv2254c N15N
5 2530101 A T nonsynonymous Rv2257c D241V
1 2530434 C A nonsynonymous Rv2257c P130H
5 2531033 C G nonsynonymous Rv2258c P289A
5 2531035 T C nonsynonymous Rv2258c V288A
5 2532788 A G nonsynonymous Rv2259 T182A
3 2536312 C T synonymous Rv2263 R224R
1 2538793 G A nonsynonymous Rv2265 G32S
3 2540554 T C nonsynonymous Rv2266 S151P
3 2541477 C G intergenic - -
3 2542543 C T nonsynonymous Rv2267c R90C
2 2543395 A G synonymous Rv2268c E294E
6 2544466 G A synonymous Rv2269c R52R
1 2544979 T A nonsynonymous Rv2270 H94Q
1 2547274 A G nonsynonymous Rv2275 D131G
2 2548700 C T nonsynonymous Rv2276 H318Y
5 2549057 C A intergenic - -
6 2550019 G T nonsynonymous Rv2277c R4L
1 2553682 C T synonymous Rv2281 Y170Y
5 2561261 T C nonsynonymous Rv2287 V520A
modern 2562783 C T nonsynonymous Rv2290 A62V
5 2562933 T C nonsynonymous Rv2290 I112T
modern 2563958 C A nonsynonymous Rv2291 A262E
6 2566596 G C intergenic - -
1 2569593 A G nonsynonymous Rv2298 H171R
6 2571678 C T stopgain Rv2299c Q109X
5 2572854 A T nonsynonymous Rv2300c K52M
1 2573434 G T nonsynonymous Rv2301 W140C
1 2574598 G A nonsynonymous Rv2303c S141N
5 2574950 A G nonsynonymous Rv2303c I24V
6 2576251 G A nonsynonymous Rv2305 G148D
5 2576863 G C nonsynonymous Rv2305 S352T
modern 2577246 G A nonsynonymous Rv2306A V47I


6 2577994 C T nonsynonymous Rv2307c R235W
modern 2581109 A C synonymous Rv2308 R231R
1 2582324 G A intergenic - -
2 2586076 G C synonymous Rv2314c R405R
6 2590122 G A synonymous Rv2317 V142V
3 2591172 G A nonsynonymous Rv2318 A219T
6 2592510 C T nonsynonymous Rv2319c R73C
6 2593621 C A nonsynonymous Rv2320c A178E
6 2596056 G A nonsynonymous Rv2323c V72I
5 2598899 C A nonsynonymous Rv2326c T350N
1 2602575 C T nonsynonymous Rv2329c P296L
5 2603523 G C intergenic - -
modern 2605293 T G nonsynonymous Rv2332 D62E
modern 2608488 C T intergenic - -
5 2609302 G C nonsynonymous Rv2334 K169N
1 2611704 A C synonymous Rv2336 R290R
6 2614882 G A nonsynonymous Rv2339 A64T
3 2615413 A G nonsynonymous Rv2339 T241A
5 2615969 C A nonsynonymous Rv2339 A426E
5 2616527 G T nonsynonymous Rv2339 R612L
5 2617442 C A stopgain Rv2339 S917X
4 2619271 C T intergenic - -
1 2622508 G A synonymous Rv2344c L415L
5 2622927 C G nonsynonymous Rv2344c P276A
6 2623603 A G synonymous Rv2344c A50A
6 2623917 T C nonsynonymous Rv2345 S33P
6 2624945 C A nonsynonymous Rv2345 D375E
3 2624986 T G nonsynonymous Rv2345 V389G
4 2625924 G A synonymous Rv2346c A83A
5 2626018 A G nonsynonymous Rv2346c E52G
3 2626095 G C synonymous Rv2346c A26A
modern 2626108 C G nonsynonymous Rv2346c A22G
modern 2626189 A C intergenic - -
modern 2626191 T C intergenic - -
3 2626513 A T nonsynonymous Rv2347c T3S
3 2626514 A C synonymous Rv2347c A2A
3 2626600 G A intergenic - -
modern 2631641 C A synonymous Rv2351c P145P
3 2632362 T C intergenic - -
6 2632373 C A intergenic - -
modern 2632500 G A intergenic - -
1 2637088 C T intergenic - -
6 2641813 C T nonsynonymous Rv2359 A55V
6 2641828 T C nonsynonymous Rv2359 V60A
modern 2641840 G A nonsynonymous Rv2359 R64H
1 2643653 C T synonymous Rv2362c G202G
5 2645780 G A synonymous Rv2364c L298L
3 2652254 C G nonsynonymous Rv2372c A191G
1 2652908 G C nonsynonymous Rv2373c E360D
5 2656136 T A intergenic - -
3 2656635 C T nonsynonymous Rv2378c P357S
1 2660319 C G nonsynonymous Rv2379c D589E
1 2660319 C T synonymous Rv2379c D589D


3 2661039 C T synonymous Rv2379c I349I
1 2663210 C T synonymous Rv2380c A1302A
5 2663463 G T nonsynonymous Rv2380c R1218L
6 2672906 C T nonsynonymous Rv2383c L978F
2 2673818 G C nonsynonymous Rv2383c V674L
5 2682158 C T nonsynonymous Rv2388c L329F
5 2683729 A G nonsynonymous Rv2390c S180G
1 2688225 G C nonsynonymous Rv2394 M72I
3 2688700 C T nonsynonymous Rv2394 P231S
6 2688726 T C synonymous Rv2394 A239A
6 2689193 G A nonsynonymous Rv2394 R395Q
3 2690160 A G nonsynonymous Rv2395 N30S
modern 2691713 C T nonsynonymous Rv2395 P548S
6 2692608 A G intergenic - -
1 2696977 C G synonymous Rv2400c A246A
6 2697218 G T nonsynonymous Rv2400c R166L
3 2700222 T C nonsynonymous Rv2402 L565P
5 2701940 C T nonsynonymous Rv2404c P437S
5 2702166 C A synonymous Rv2404c R361R
5 2702403 T C synonymous Rv2404c L282L
5 2702612 G T nonsynonymous Rv2404c G213C
6 2703018 C A synonymous Rv2404c G77G
5 2703964 G A intergenic - -
1 2704291 C T synonymous Rv2406c I49I
modern 2704884 A T nonsynonymous Rv2407 H63L
6 2705145 C A nonsynonymous Rv2407 T150K
modern 2709795 C T synonymous Rv2411c A57A
6 2710422 G A nonsynonymous Rv2413c A294T
2 2711722 A C synonymous Rv2414c P385P
3 2712328 G A synonymous Rv2414c P183P
modern 2712913 T G nonsynonymous Rv2415c L291R
6 2719057 G A intergenic - -
3 2720069 G A nonsynonymous Rv2423 G158E
3 2720444 C T nonsynonymous Rv2423 S283F
4 2723506 G A synonymous Rv2426c L226L
5 2724331 A G nonsynonymous Rv2427c I383V
1 2726051 C T nonsynonymous Rv2427A L13F
3 2726105 G A intergenic - -
1 2727037 A T nonsynonymous Rv2429 M78L
6 2730360 C T synonymous Rv2434c V67V
5 2730711 C T nonsynonymous Rv2435c A680V
1 2731741 C T synonymous Rv2435c L337L
6 2733100 G T intergenic - -
2 2734482 A T nonsynonymous Rv2437 Y36F
3 2738221 C T synonymous Rv2439c I9I
6 2739242 T C nonsynonymous Rv2440c S149P
4 2740693 C T intergenic - -
2 2741209 G A synonymous Rv2443 L167L
5 2741269 C A synonymous Rv2443 G187G
5 2744225 T C synonymous Rv2444c L254L
1 2745739 G A intergenic - -
1 2745839 C T synonymous Rv2446c A100A
5 2748366 G T nonsynonymous Rv2448c E620D


6 2751300 C T synonymous Rv2449c T91T
5 2752132 T C synonymous Rv2450c L17L
5 2753821 C G nonsynonymous Rv2454c A309G
3 2753869 T C nonsynonymous Rv2454c V293A
1 2755112 C T synonymous Rv2455c I531I
6 2757464 G T synonymous Rv2456c L243L
1 2759534 C G intergenic - -
1 2764206 T C synonymous Rv2462c D362D
modern 2764939 T C nonsynonymous Rv2462c L118P
6 2770011 T G synonymous Rv2467 A342A
3 2771383 A G nonsynonymous Rv2467 S800G
6 2772741 G T nonsynonymous Rv2469c A99S
6 2772760 C G synonymous Rv2469c S92S
6 2772954 A G nonsynonymous Rv2469c S28G
1 2773955 G C nonsynonymous Rv2471 S131T
3 2782498 C T synonymous Rv2477c D515D
5 2784162 C A nonsynonymous Rv2478c D149E
6 2789237 A G nonsynonymous Rv2482c D16G
2 2789798 C A synonymous Rv2483c R409R
modern 2790458 G T nonsynonymous Rv2483c G189C
4 2791098 A G nonsynonymous Rv2484c D466G
1 2791257 T A nonsynonymous Rv2484c I413N
modern 2791475 C T synonymous Rv2484c A340A
3 2798595 G A synonymous Rv2488c A762A
1 2799493 T G nonsynonymous Rv2488c I463S
4 2807486 A C nonsynonymous Rv2492 D70A
5 2808296 G A nonsynonymous Rv2493 E72K
1 2809895 C T synonymous Rv2495c L15L
5 2810816 A G nonsynonymous Rv2496c E56G
5 2811013 G A nonsynonymous Rv2497c E362K
6 2813515 C T synonymous Rv2499c L72L
6 2817056 G A synonymous Rv2502c A473A
1 2817158 G T nonsynonymous Rv2502c M439I
3 2817747 C T nonsynonymous Rv2502c A243V
5 2819093 C G nonsynonymous Rv2503c A12G
6 2819183 C T nonsynonymous Rv2504c R230W
5 2820743 T C nonsynonymous Rv2505c V285A
5 2822701 A C synonymous Rv2507 P88P
6 2823743 C A nonsynonymous Rv2508c A284E
1 2824432 C T synonymous Rv2508c T54T
4 2825466 A G synonymous Rv2509 K263K
1 2828104 C T intergenic - -
6 2831046 C T nonsynonymous Rv2514c A98V
2 2833329 C T nonsynonymous Rv2516c A62V
6 2835261 G A nonsynonymous Rv2518c M25I
5 2839648 G A nonsynonymous Rv2523c V95I
4 2841022 C T nonsynonymous Rv2524c R2771C
5 2843482 C T nonsynonymous Rv2524c P1951S
5 2844125 G T synonymous Rv2524c P1736P
6 2844335 G T synonymous Rv2524c T1666T
4 2847281 C T synonymous Rv2524c D684D
6 2847318 G C nonsynonymous Rv2524c G672A
6 2847737 C T synonymous Rv2524c I532I


6 2848800 C T nonsynonymous Rv2524c A178V
6 2851746 A G intergenic - -
5 2852798 C T intergenic - -
5 2854669 T C synonymous Rv2530c D6D
6 2854864 C T nonsynonymous Rv2530A A15V
5 2855231 T C nonsynonymous Rv2531c F851L
6 2855422 T C nonsynonymous Rv2531c I787T
5 2855959 C T nonsynonymous Rv2531c P608L
modern 2858669 A C nonsynonymous Rv2533c D19A
5 2859147 G A synonymous Rv2534c K48K
3 2867254 A G nonsynonymous Rv2544 H44R
2 2867298 C A nonsynonymous Rv2544 H59N
2 2867347 A G nonsynonymous Rv2544 Q75R
2 2867401 A C nonsynonymous Rv2544 N93T
2 2867756 T C synonymous Rv2544 I211I
5 2868769 G A nonsynonymous Rv2547 G55D
1 2869242 T C intergenic - -
modern 2870386 T C intergenic - -
6 2871717 G A nonsynonymous Rv2552c R100Q
5 2874162 C T nonsynonymous Rv2555c A775V
5 2875717 A C nonsynonymous Rv2555c I257L
6 2875808 G C nonsynonymous Rv2555c K226N
modern 2878980 A G nonsynonymous Rv2559c D317G
1 2881244 A G intergenic - -
3 2881337 C T intergenic - -
3 2881569 A G nonsynonymous Rv2561 E54G
5 2881938 G C nonsynonymous Rv2562 G61R
6 2886400 G A nonsynonymous Rv2566 G10S
4 2886570 G A synonymous Rv2566 E66E
6 2886640 C G nonsynonymous Rv2566 L90V
6 2887964 C T nonsynonymous Rv2566 A531V
6 2891366 G A synonymous Rv2567 A524A
6 2892917 C T synonymous Rv2568c L185L
6 2894322 C G nonsynonymous Rv2569c D29E
6 2894458 C T intergenic - -
1 2894594 G A nonsynonymous Rv2570 G28D
1 2894642 A G nonsynonymous Rv2570 E44G
3 2895473 T C nonsynonymous Rv2571c F163S
6 2896260 C T nonsynonymous Rv2572c A515V
1 2897528 C T synonymous Rv2572c A92A
3 2897660 A G synonymous Rv2572c A48A
modern 2897871 G A intergenic - -
1 2899890 G T nonsynonymous Rv2575 Q184H
6 2900967 G A nonsynonymous Rv2577 G17E
3 2903050 C T nonsynonymous Rv2578c T161I
6 2904550 G A intergenic - -
6 2904864 T C nonsynonymous Rv2580c M410T
5 2910483 C A nonsynonymous Rv2584c L140I
3 2910852 G C nonsynonymous Rv2584c A17P
6 2912815 G A synonymous Rv2586c G399G
5 2919947 T C nonsynonymous Rv2590 F693L
5 2921513 C G intergenic - -
5 2921541 A C intergenic - -


3 2925462 T A intergenic - -
1 2925683 G A synonymous Rv2595 L64L
4 2925962 C T nonsynonymous Rv2596 R77C
5 2926445 G A nonsynonymous Rv2597 A31T
1 2926882 C T synonymous Rv2597 R176R
5 2927086 G A nonsynonymous Rv2598 R34Q
modern 2927511 T G nonsynonymous Rv2599 I12S
6 2927864 G A nonsynonymous Rv2599 G130S
5 2934398 C G synonymous Rv2607 T67T
6 2939177 T C synonymous Rv2611c Y262Y
1 2940608 G A nonsynonymous Rv2612c S2N
3 2941179 C T synonymous Rv2613c R6R
6 2945042 C G intergenic - -
5 2945389 G A synonymous Rv2616 A20A
2 2948230 C T nonsynonymous Rv2621c A110V
3 2948524 A T nonsynonymous Rv2621c E12V
1 2948650 T C synonymous Rv2622 R5R
3 2949251 G A nonsynonymous Rv2622 V206I
modern 2953307 T G intergenic - -
6 2954318 C T nonsynonymous Rv2627c T144M
modern 2955233 T C nonsynonymous Rv2628 L59S
5 2955343 G A nonsynonymous Rv2628 A96T
5 2958044 G T nonsynonymous Rv2631 G158V
3 2958693 G A synonymous Rv2631 A374A
3 2959257 A T intergenic - -
3 2959265 A T intergenic - -
modern 2959324 G A intergenic - -
3 2964594 G A nonsynonymous Rv2638 A64T
5 2964876 G C intergenic - -
6 2968468 G C intergenic - -
2 2969197 A G intergenic - -
modern 2970017 C G intergenic - -
modern 2970019 A G intergenic - -
6 2972107 C T intergenic - -
5 2976579 C T intergenic - -
modern 2980970 G T nonsynonymous Rv2660c C74F
5 2981030 C G nonsynonymous Rv2660c P54R
5 2981688 A G synonymous Rv2662 A69A
6 2984105 G A nonsynonymous Rv2667 M70I
modern 2985216 A G nonsynonymous Rv2668 K162E
1 2987918 G A synonymous Rv2672 R79R
5 2988374 C T synonymous Rv2672 A231A
4 2988630 G C nonsynonymous Rv2672 D317H
5 2991646 G T synonymous Rv2675c V97V
1 2992564 C T nonsynonymous Rv2676c S22L
1 2993523 A G nonsynonymous Rv2677c Q157R
4 2994187 G A synonymous Rv2678c L292L
5 2998287 G C nonsynonymous Rv2682c G561A
modern 3000362 C T nonsynonymous Rv2683 S84L
5 3001754 G T nonsynonymous Rv2684 A381S
4 3003115 G C nonsynonymous Rv2685 G378A
5 3004427 G T nonsynonymous Rv2687c V108L
modern 3006898 T C synonymous Rv2689c Y55Y


modern 3007238 C T stopgain Rv2690c R658X
5 3009738 A C nonsynonymous Rv2691 Q132P
5 3009759 G C nonsynonymous Rv2691 W139S
3 3010014 G A nonsynonymous Rv2691 G224E
4 3010420 G A nonsynonymous Rv2692 V133I
2 3010993 G A nonsynonymous Rv2693c G126R
6 3011566 C A nonsynonymous Rv2694c R68S
3 3011837 A G intergenic - -
6 3011903 G A intergenic - -
6 3014016 G A synonymous Rv2697c L44L
1 3015379 G A synonymous Rv2700 P59P
5 3015639 A G nonsynonymous Rv2700 Q146R
5 3015834 T C nonsynonymous Rv2700 I211T
5 3016149 C G nonsynonymous Rv2701c A196G
6 3016608 A C nonsynonymous Rv2701c D43A
6 3022369 G T intergenic - -
2 3024021 C A synonymous Rv2711 R153R
1 3025431 C T intergenic - -
3 3027548 C T nonsynonymous Rv2714 P162S
modern 3027606 G C nonsynonymous Rv2714 W181S
4 3027798 C T nonsynonymous Rv2714 A245V
6 3029177 C T synonymous Rv2716 A2A
6 3029360 C T synonymous Rv2716 T63T
4 3031168 C T nonsynonymous Rv2719c H124Y
1 3031285 T A nonsynonymous Rv2719c L85M
2 3033189 C T synonymous Rv2721c P477P
5 3035033 C T nonsynonymous Rv2723 S42L
2 3036826 G A nonsynonymous Rv2724c V156M
5 3037048 G A nonsynonymous Rv2724c E82K
5 3037196 G A synonymous Rv2724c A32A
6 3037234 C T nonsynonymous Rv2724c R20C
5 3039020 G A nonsynonymous Rv2726c V261I
6 3039842 T A nonsynonymous Rv2727c W310R
3 3040344 G A synonymous Rv2727c E142E
3 3043700 C T synonymous Rv2731 A225A
6 3043960 C G nonsynonymous Rv2731 A312G
6 3049728 A C nonsynonymous Rv2737c Q566P
5 3050362 G T nonsynonymous Rv2737c D355Y
5 3051911 C T synonymous Rv2738c G34G
5 3052223 T G nonsynonymous Rv2739c W323G
5 3056742 C A nonsynonymous Rv2743c T164K
6 3057309 C T nonsynonymous Rv2744c P252L
6 3057375 G A nonsynonymous Rv2744c R230Q
3 3059791 G C intergenic - -
6 3068710 G T nonsynonymous Rv2756c K458N
5 3068778 G A nonsynonymous Rv2756c E436K
3 3069566 G A nonsynonymous Rv2756c G173D
modern 3069805 G C synonymous Rv2756c V93V
3 3072285 T G nonsynonymous Rv2761c L119R
1 3074830 C T synonymous Rv2765 Y65Y
3 3076172 G A nonsynonymous Rv2766c D67N
6 3085752 C T synonymous Rv2778c R144R
1 3086261 A G nonsynonymous Rv2779c Q165R


6 3086728 G A nonsynonymous Rv2779c M9I
5 3086788 C T intergenic - -
5 3087187 C G nonsynonymous Rv2780 A123G
5 3087190 A C nonsynonymous Rv2780 D124A
modern 3088625 C T synonymous Rv2781c V120V
3 3089299 G C nonsynonymous Rv2782c G355R
5 3096576 G A nonsynonymous Rv2787 R489Q
1 3097349 C T stopgain Rv2788 Q131X
5 3098714 A G synonymous Rv2789c S75S
6 3103497 G T nonsynonymous Rv2794c W148C
4 3104189 C T synonymous Rv2795c C241C
modern 3105144 T G nonsynonymous Rv2796c F159C
5 3106231 G A synonymous Rv2797c A359A
5 3106491 C T stopgain Rv2797c Q273X
5 3108299 T G nonsynonymous Rv2799 S178A
3 3111280 C T nonsynonymous Rv2802c L182F
2 3111476 G A synonymous Rv2802c S116S
3 3112700 G A nonsynonymous Rv2804c G132D
4 3112877 A G nonsynonymous Rv2805 D4G-Rv2804c
1 3114814 G A intergenic - -
5 3115108 G A synonymous Rv2808 E21E
5 3116253 C A nonsynonymous Rv2811 Q39K
modern 3118449 G A nonsynonymous Rv2813 V76I
6 3119277 G C intergenic - -
modern 3119513 T C intergenic - -
modern 3119737 G C intergenic - -
modern 3119740 T G intergenic - -
modern 3119741 T A intergenic - -
6 3119769 A G intergenic - -
1 3120212 G A intergenic - -
5 3121880 G A intergenic - -
5 3122621 C G intergenic - -
modern 3122949 C T intergenic - -
1 3122954 C T intergenic - -
3 3123247 T A intergenic - -
6 3123291 C T intergenic - -
5 3125087 T G nonsynonymous Rv2818c I353S
6 3125180 G C nonsynonymous Rv2818c R322P
6 3125235 C T stopgain Rv2818c Q304X
6 3127466 A C synonymous Rv2820c R269R
5 3128667 C T synonymous Rv2821c L99L
5 3130150 C G nonsynonymous Rv2823c R542G
1 3132956 C A nonsynonymous Rv2825c A195E
1 3132975 G C nonsynonymous Rv2825c V189L
4 3133054 C G nonsynonymous Rv2825c C162W
4 3133055 G C nonsynonymous Rv2825c C162S
6 3134839 C T synonymous Rv2827c R215R
1 3135852 C A nonsynonymous Rv2828c A161E
1 3135950 G C synonymous Rv2828c S128S
6 3137026 T G nonsynonymous Rv2830c S67A
modern 3137237 C T intergenic - -
1 3137681 G A synonymous Rv2831 V137V


6 3146953 A G synonymous Rv2839c E307E
6 3147243 C T nonsynonymous Rv2839c P211S
6 3148174 T C nonsynonymous Rv2840c Y29H
6 3148356 G A intergenic - -
5 3148511 A C nonsynonymous Rv2841c E306D
6 3149678 G A nonsynonymous Rv2842c R100H
5 3151813 C T nonsynonymous Rv2845c R380C
6 3152837 G C synonymous Rv2845c A38A
6 3155164 C T synonymous Rv2847c S236S
5 3157540 A G nonsynonymous Rv2849c Q202R
5 3158512 C T stopgain Rv2850c R515X
5 3159164 C T synonymous Rv2850c D297D
5 3165203 A G intergenic - -
3 3165807 G A synonymous Rv2855 V201V
3 3168492 G A intergenic - -
1 3169993 T C nonsynonymous Rv2858c L244P
6 3172564 C A nonsynonymous Rv2860c T146N
6 3173645 G C nonsynonymous Rv2861c G125R
3 3174013 C T nonsynonymous Rv2861c P2L
4 3174496 C T nonsynonymous Rv2862c R50C
5 3175460 G A synonymous Rv2864c L602L
4 3180988 T G nonsynonymous Rv2869c F259V
5 3184670 T C intergenic - -
6 3187539 C T synonymous Rv2875 V170V
5 3187718 G A nonsynonymous Rv2876 S19N
6 3187792 T C nonsynonymous Rv2876 W44R
5 3188332 C T synonymous Rv2877c V180V
1 3188428 G A synonymous Rv2877c P148P
modern 3188769 C T nonsynonymous Rv2877c H35Y
4 3189242 C T synonymous Rv2878c A52A
modern 3189580 C T intergenic - -
1 3190342 C T nonsynonymous Rv2880c P113S
5 3193575 C T synonymous Rv2884 A61A
1 3197917 C T synonymous Rv2888c G123G
5 3199103 C T nonsynonymous Rv2889c A2V
modern 3200304 G A synonymous Rv2891 V13V
3 3200478 G A synonymous Rv2891 L71L
6 3202515 G A synonymous Rv2893 A32A
1 3202629 C T synonymous Rv2893 H70H
3 3202731 A C synonymous Rv2893 G104G
5 3205077 T C synonymous Rv2895c D52D
3 3208600 C T synonymous Rv2899c Y269Y
3 3212723 G T synonymous Rv2902c A78A
6 3213255 G A synonymous Rv2903c K200K
5 3214120 T C nonsynonymous Rv2904c V45A
1 3214481 G T intergenic - -
3 3219500 A T nonsynonymous Rv2912c K121M
1 3229692 C T nonsynonymous Rv2918c A330V
1 3233605 C T synonymous Rv2921c L179L
3 3236442 C T synonymous Rv2922c H455H
3 3236497 T A nonsynonymous Rv2922c V437E
6 3236716 T G nonsynonymous Rv2922c L364R
5 3238190 C A nonsynonymous Rv2923c L104M


1 3241244 C T nonsynonymous Rv2927c T239I
5 3242131 G A intergenic - -
6 3243312 C G intergenic - -
6 3244091 C A nonsynonymous Rv2930 A132E
3 3244113 T C synonymous Rv2930 P139P
modern 3244414 G A nonsynonymous Rv2930 V240M
1 3247089 G A nonsynonymous Rv2931 G549S
3 3247298 C A synonymous Rv2931 G618G
3 3247319 C T synonymous Rv2931 G625G
3 3247340 G A synonymous Rv2931 V632V
6 3247579 G A nonsynonymous Rv2931 R712Q
1 3254758 A C synonymous Rv2932 P1229P
modern 3254880 T G nonsynonymous Rv2932 L1270R
6 3255169 A G synonymous Rv2932 A1366A
5 3265806 A G nonsynonymous Rv2934 I1187V
4 3266030 G A synonymous Rv2934 S1261S
modern 3271037 G A nonsynonymous Rv2935 D1101N
3 3273107 C A synonymous Rv2936 A298A
modern 3273138 G C nonsynonymous Rv2936 D309H
1 3274545 G A synonymous Rv2938 L158L
5 3275857 G C synonymous Rv2939 L303L
2 3276703 A C nonsynonymous Rv2940c T2005P
6 3277599 C G nonsynonymous Rv2940c A1706G
1 3280132 C T nonsynonymous Rv2940c R862W
6 3281634 C T nonsynonymous Rv2940c A361V
6 3283592 G C synonymous Rv2941 G86G
2 3284855 C T synonymous Rv2941 I507I
6 3286107 C A synonymous Rv2942 V346V
6 3286566 C T synonymous Rv2942 A499A
5 3286789 G A nonsynonymous Rv2942 G574S
1 3293423 T C synonymous Rv2946c D977D
1 3293601 A G nonsynonymous Rv2946c Q918R
1 3295124 G A synonymous Rv2946c R410R
modern 3296721 C G nonsynonymous Rv2947c R374G
5 3296934 G T nonsynonymous Rv2947c A303S
5 3296935 G T synonymous Rv2947c L302L
5 3297989 A G nonsynonymous Rv2948c I656V
5 3298691 A C synonymous Rv2948c R422R
1 3299413 C T nonsynonymous Rv2948c A181V
5 3300479 T G nonsynonymous Rv2949c V31G
3 3302589 A C intergenic - -
modern 3302683 T C intergenic - -
2 3304966 G A nonsynonymous Rv2952 G176R
1 3306169 T C synonymous Rv2953 T297T
1 3306175 G C synonymous Rv2953 A299A
3 3306441 G A nonsynonymous Rv2953 R388Q
1 3306594 T G intergenic - -
1 3308446 A G nonsynonymous Rv2955c T34A
5 3309071 G A nonsynonymous Rv2956 G135D
5 3309916 C A nonsynonymous Rv2957 F149L
modern 3311119 G C synonymous Rv2958c V294V
3 3312620 G T nonsynonymous Rv2959c E73D
1 3312942 G A intergenic - -


4 3314412 C T synonymous Rv2962c A237A
6 3317795 G C intergenic - -
5 3320271 G A nonsynonymous Rv2967c A926T
5 3325336 C T nonsynonymous Rv2970c A123V
1 3326150 A G nonsynonymous Rv2971 D17G
4 3326554 C A nonsynonymous Rv2971 H152N
1 3328495 C G synonymous Rv2973c R484R
1 3336528 T A intergenic - -
5 3345427 T C nonsynonymous Rv2988c S217P
6 3346980 G A nonsynonymous Rv2990c G247E
6 3348258 G A nonsynonymous Rv2991 A93T
3 3348536 C T intergenic - -
3 3349917 C T nonsynonymous Rv2992c H121Y
5 3351172 G A intergenic - -
1 3351472 G A stopgain Rv2994 W68X
6 3353082 C G synonymous Rv2995c T129T
3 3355949 A C nonsynonymous Rv2997 D284A
6 3356517 T C synonymous Rv2997 P473P
5 3356624 G C intergenic - -
3 3357464 C G intergenic - -
6 3363185 A T intergenic - -
5 3363584 G A synonymous Rv3004 T79T
1 3365841 C T nonsynonymous Rv3007c P204S
1 3366420 G C nonsynonymous Rv3007c G11R
modern 3369869 T G intergenic - -
modern 3371260 G A nonsynonymous Rv3011c V59I
1 3378828 G T nonsynonymous Rv3019c W58C
6 3383287 A G nonsynonymous Rv3024c K201R
3 3385218 G A nonsynonymous Rv3026c G287E
6 3389840 C A nonsynonymous Rv3030 T247N
5 3391138 G A nonsynonymous Rv3031 S406N
6 3393311 C T intergenic - -
3 3393640 C T synonymous Rv3033 N87N
5 3395038 G A intergenic - -
3 3395654 G A synonymous Rv3035 L92L
6 3395847 G T nonsynonymous Rv3035 V157F
5 3399945 A C nonsynonymous Rv3039c E80A
6 3400476 G T nonsynonymous Rv3040c A195S
1 3401850 C T synonymous Rv3041c D23D
1 3406798 G A nonsynonymous Rv3045 A172T
1 3407028 C T synonymous Rv3045 N248N
3 3413785 G A nonsynonymous Rv3051c V128I
modern 3415332 G A intergenic - -
5 3416432 G A nonsynonymous Rv3055 V118I
5 3416630 G A nonsynonymous Rv3055 G184S
4 3420825 G A nonsynonymous Rv3059 G445D
1 3424462 G A synonymous Rv3061c L322L
6 3425523 T C intergenic - -
6 3425952 C A synonymous Rv3062 G123G
5 3426279 C A nonsynonymous Rv3062 S232R
1 3427632 C T synonymous Rv3063 A130A
3 3428897 G C nonsynonymous Rv3063 R552P
6 3429605 G A intergenic - -


1 3431407 G T intergenic - -
5 3431529 C T synonymous Rv3067 P34P
5 3433326 G A synonymous Rv3068c L99L
3 3437007 G C nonsynonymous Rv3074 A77P
6 3438386 C T synonymous Rv3075c I196I
modern 3440542 G A synonymous Rv3077 G334G
1 3442240 G T stopgain Rv3079c E120X
5 3445777 G C nonsynonymous Rv3080c R71P
1 3447480 T G nonsynonymous Rv3082c L316R
1 3448714 G C nonsynonymous Rv3083 D71H
1 3453382 C T nonsynonymous Rv3087 A153V
4 3454263 G C nonsynonymous Rv3087 V447L
5 3457858 C A intergenic - -
3 3459081 G C nonsynonymous Rv3090 A291P
1 3460765 G C synonymous Rv3091 P550P
6 3463724 G T synonymous Rv3094c A56A
modern 3464629 G A nonsynonymous Rv3096 G28E
4 3467465 C G synonymous Rv3098c A66A
1 3474597 C A synonymous Rv3106 V197V
modern 3475159 G A nonsynonymous Rv3106 D385N
5 3478253 C A nonsynonymous Rv3109 A202E
5 3478767 G A intergenic - -
1 3479561 G A nonsynonymous Rv3111 D131N
6 3479798 G A synonymous Rv3112 G33G
4 3480789 C T nonsynonymous Rv3114 P11S
2 3487108 C T synonymous Rv3121 A200A
3 3488122 G A nonsynonymous Rv3122 G12R
6 3488556 G T nonsynonymous Rv3122 R156S
6 3488687 G A nonsynonymous Rv3123 R40Q
6 3489340 C G intergenic - -
3 3489665 C T nonsynonymous Rv3124 P54S
6 3493823 T C nonsynonymous Rv3128c L120S
5 3496002 A C nonsynonymous Rv3130c E122A
1 3497586 G C synonymous Rv3132c A560A
modern 3498418 C T nonsynonymous Rv3132c T283I
5 3499247 G A nonsynonymous Rv3132c V7I
3 3499497 C G nonsynonymous Rv3133c A140G
5 3503284 A G intergenic - -
5 3504184 T A intergenic - -
5 3504410 C A nonsynonymous Rv3138 H72Q
modern 3505005 C A nonsynonymous Rv3138 P271T
5 3506470 G A nonsynonymous Rv3139 A370T
6 3508970 A G synonymous Rv3141 E292E
modern 3509091 C G intergenic - -
3 3509231 C T nonsynonymous Rv3142c R106C
3 3509301 C G synonymous Rv3142c V82V
5 3511335 T G intergenic - -
3 3515467 C T nonsynonymous Rv3150 P19L
modern 3521044 G A nonsynonymous Rv3153 A180T
1 3526986 A G synonymous Rv3158 A399A
1 3530145 G T nonsynonymous Rv3161c D332Y
4 3530955 C G nonsynonymous Rv3161c L62V
1 3533759 C T synonymous Rv3164c P49P


1 3536008 C A nonsynonymous Rv3167c P17Q
3 3539353 G A synonymous Rv3170 Q283Q
2 3542049 G A intergenic - -
6 3546678 G A nonsynonymous Rv3178 A81T
5 3552581 T G intergenic - -
6 3554217 C T intergenic - -
6 3554298 A G nonsynonymous Rv3188 M1V
modern 3555699 C T nonsynonymous Rv3190c A330V
3 3557253 C T intergenic - -
5 3558733 C G intergenic - -
modern 3560645 A G nonsynonymous Rv3193c Q843R
3 3562338 C A nonsynonymous Rv3193c L279M
6 3564897 G A nonsynonymous Rv3195 M178I
5 3566143 C G nonsynonymous Rv3196 A119G
6 3568004 C A synonymous Rv3197 R327R
6 3570843 C T synonymous Rv3198c S123S
6 3571742 G C nonsynonymous Rv3199c A268P
modern 3571834 C A nonsynonymous Rv3199c P237Q
5 3573080 A G nonsynonymous Rv3200c H197R
modern 3573636 G A nonsynonymous Rv3200c D12N
2 3577497 G A nonsynonymous Rv3202c E902K
modern 3577958 C T nonsynonymous Rv3202c A748V
5 3580275 G C intergenic - -
6 3586551 C A nonsynonymous Rv3209 P93Q
5 3591082 C T synonymous Rv3213c N137N
6 3591488 C T nonsynonymous Rv3213c T2I
5 3591661 C T nonsynonymous Rv3214 H6Y
5 3592709 C T synonymous Rv3215 G152G
5 3596354 A G nonsynonymous Rv3220c D394G
6 3596407 G A synonymous Rv3220c V376V
2 3597249 G C nonsynonymous Rv3220c G96R
1 3597682 G A nonsynonymous Rv3221c V29I
6 3599099 G T nonsynonymous Rv3223c E151D
6 3603178 C T nonsynonymous Rv3226c H49Y
5 3603523 G A synonymous Rv3227 A49A
6 3617228 G C nonsynonymous Rv3239c G126R
6 3622618 G C nonsynonymous Rv3243c L158F
modern 3624486 G A nonsynonymous Rv3244c G142D
5 3630235 C T synonymous Rv3249c R51R
6 3635117 T C nonsynonymous Rv3255c V384A
4 3638093 T C nonsynonymous Rv3257c L206S
3 3645524 G A nonsynonymous Rv3264c D152N
6 3646030 A C synonymous Rv3265c S285S
6 3646033 T G nonsynonymous Rv3265c S284R
modern 3647041 C T nonsynonymous Rv3266c P257S
1 3647591 T C synonymous Rv3266c N73N
5 3648267 G T nonsynonymous Rv3267 G128V
5 3651849 T G nonsynonymous Rv3270 S442A
3 3656206 C A nonsynonymous Rv3273 R524S
5 3656289 C G nonsynonymous Rv3273 D551E
5 3658226 C T nonsynonymous Rv3275c A138V
modern 3658266 C G nonsynonymous Rv3275c R125G
1 3661802 G A nonsynonymous Rv3279c G71S


4 3670040 A G synonymous Rv3289c A124A 1 3670118 C A synonymous Rv3289c G98G 1 3671532 C G nonsynonymous Rv3290c A88G 1 3671843 A C intergenic - - 1 3672105 C G nonsynonymous Rv3291c H65D 6 3673210 G T nonsynonymous Rv3292 G295C 2 3674157 T A nonsynonymous Rv3293 C186S 5 3674194 G C nonsynonymous Rv3293 G198A 1 3678091 G A synonymous Rv3296 P439P 3 3678094 C T synonymous Rv3296 A440A 2 3678249 A C nonsynonymous Rv3296 K492T 3 3679764 G A nonsynonymous Rv3296 S997N modern 3679949 G A nonsynonymous Rv3296 E1059K 6 3681349 C A synonymous Rv3297 T10T 4 3681548 C A synonymous Rv3297 R77R 1 3683237 C A synonymous Rv3299c A909A 1 3683715 A C nonsynonymous Rv3299c Q750P 5 3684169 G T nonsynonymous Rv3299c D599Y 5 3685487 G A synonymous Rv3299c L159L 3 3685510 A G nonsynonymous Rv3299c S152G 1 3687372 T C nonsynonymous Rv3301c L69S 6 3688648 C T synonymous Rv3302c R265R 4 3690016 C T nonsynonymous Rv3303c S308L 6 3693548 G A nonsynonymous Rv3306c A148T 4 3693681 G T synonymous Rv3306c A103A modern 3696181 G A nonsynonymous Rv3308 G440S 3 3697585 C T nonsynonymous Rv3310 L130F modern 3697708 G A nonsynonymous Rv3310 A171T 5 3701552 G A synonymous Rv3313c E211E 6 3702543 C T nonsynonymous Rv3314c R309W 1 3704261 C A nonsynonymous Rv3316 L54I 6 3705098 G A synonymous Rv3318 V33V 5 3705776 C T synonymous Rv3318 R259R 6 3706343 C T synonymous Rv3318 N448N 5 3708317 G A intergenic - - 5 3708768 C T synonymous Rv3322c A95A 3 3714639 G A nonsynonymous Rv3329 R83H 3 3715775 C T intergenic - - 3 3729342 C T nonsynonymous Rv3342 A240V modern 3743549 G A intergenic - - 1 3753414 T G intergenic - - modern 3753415 G C intergenic - - 6 3755443 G A intergenic - - 6 3767368 T G nonsynonymous Rv3351c V258G 1 3770325 G A nonsynonymous Rv3356c A109T 3 3771009 T G synonymous Rv3357 S79S modern 3771628 C A synonymous Rv3359 I95I modern 3772616 A G intergenic - - 1 3774506 C T synonymous Rv3364c G123G 2 3775409 A G nonsynonymous Rv3365c Q698R 2 3775441 T G nonsynonymous Rv3365c S687R 6 3775639 C 
A synonymous Rv3365c G621G 6 3776265 C A synonymous Rv3365c R413R


2 3778011 G A nonsynonymous Rv3366 R92Q 2 3778012 G T synonymous Rv3366 R92R 5 3778148 G A nonsynonymous Rv3366 A138T 2 3778396 C T intergenic - - 5 3780715 A G nonsynonymous Rv3368c T89A modern 3785946 G A nonsynonymous Rv3371 G339R modern 3786033 G A nonsynonymous Rv3371 V368I 3 3787466 G C nonsynonymous Rv3372 E385Q modern 3788365 G T stopgain Rv3373 G214X 6 3789077 C T nonsynonymous Rv3375 P153S 5 3790652 C T nonsynonymous Rv3376 T166M 5 3790693 G T nonsynonymous Rv3376 G180C 6 3792262 G T nonsynonymous Rv3377c G31V 1 3793634 T C nonsynonymous Rv3379c C412R 3 3799512 G A nonsynonymous Rv3384c A42T modern 3808103 G A nonsynonymous Rv3392c S112N 6 3811327 G A nonsynonymous Rv3395c V104M modern 3811629 T C nonsynonymous Rv3395c V3A 2 3811672 G T intergenic - - 5 3812009 C A nonsynonymous Rv3395A D97E 3 3821503 T G nonsynonymous Rv3402c L130R 5 3822047 C T intergenic - - 6 3823494 G T nonsynonymous Rv3403c A124S 6 3829664 C T nonsynonymous Rv3410c H83Y 2 3830349 G A nonsynonymous Rv3411c A391T 1 3830566 C T synonymous Rv3411c S318S 4 3830695 C T synonymous Rv3411c A275A 5 3835102 G T stopgain Rv3416 E71X 5 3836728 T C nonsynonymous Rv3417c V55A 5 3840932 C T synonymous Rv3423c N163N 6 3841662 T C intergenic - - 6 3841663 C T intergenic - - 5 3850261 G T intergenic - - 1 3851084 T C nonsynonymous Rv3432c I224T 4 3851887 T G nonsynonymous Rv3433c S443A 4 3851888 G A synonymous Rv3433c A442A 1 3854899 T C intergenic - - 5 3858894 C T nonsynonymous Rv3439c L257F 3 3860216 C T synonymous Rv3441c A385A 1 3861914 G A nonsynonymous Rv3442c A13T 6 3862148 C T synonymous Rv3443c P81P 1 3864041 C T nonsynonymous Rv3446c A164V 5 3864816 G A synonymous Rv3447c S1141S 5 3865398 C A synonymous Rv3447c V947V 5 3866953 T C nonsynonymous Rv3447c V429A 5 3867027 G A synonymous Rv3447c T404T 1 3868738 C A synonymous Rv3448 I129I 3 3869355 T C nonsynonymous Rv3448 I335T 5 3870238 T G nonsynonymous Rv3449 S163A 4 3871246 G A synonymous Rv3450c G417G 5 3872171 C T nonsynonymous Rv3450c 
A109V 6 3872797 A G nonsynonymous Rv3451 I61V 3 3874745 G C intergenic - -


5 3875633 A G nonsynonymous Rv3454 Y271C 5 3876305 C G nonsynonymous Rv3455c T214S 1 3876953 C T synonymous Rv3456c V160V 3 3880175 C A intergenic - - 5 3882025 G A synonymous Rv3464 L63L 6 3883278 A G nonsynonymous Rv3465 T149A 1 3883467 A G intergenic - - 6 3893290 C T nonsynonymous Rv3476c A144V 4 3893480 T C nonsynonymous Rv3476c F81L 4 3895727 A C intergenic - - 6 3898522 G A synonymous Rv3479 S901S modern 3898869 A C nonsynonymous Rv3479 N1017T 1 3899654 G T nonsynonymous Rv3480c R250L 5 3902782 G C nonsynonymous Rv3483c A11P 3 3908062 C T intergenic - - 4 3909235 C G nonsynonymous Rv3490 L334V 5 3909589 G A nonsynonymous Rv3490 G452R 1 3913737 C T synonymous Rv3495c A266A 3 3918649 A G synonymous Rv3499c G184G 5 3919261 G A nonsynonymous Rv3500c A268T 1 3920109 C A synonymous Rv3501c I251I 1 3921094 G A nonsynonymous Rv3502c R316H modern 3922836 C T synonymous Rv3504 D122D 5 3923267 A C nonsynonymous Rv3504 E266A 6 3939062 C G nonsynonymous Rv3510c L66V 5 3939405 G T intergenic - - 5 3945304 C G nonsynonymous Rv3513c Q149E 5 3951036 C G nonsynonymous Rv3515c Q479E 5 3953660 A C nonsynonymous Rv3517 Q77P 3 3954222 G C nonsynonymous Rv3517 E264D 5 3956535 G A synonymous Rv3520c V278V 5 3957514 C T intergenic - - 1 3958007 C T nonsynonymous Rv3521 H163Y 5 3964234 A T intergenic - - 6 3968375 C G synonymous Rv3531c S190S 6 3968618 C T synonymous Rv3531c T109T 2 3970594 C G intergenic - - 1 3974142 C T nonsynonymous Rv3535c P120L 2 3979990 G C nonsynonymous Rv3540c V224L 6 3980437 T G nonsynonymous Rv3540c S75A 6 3981329 C T nonsynonymous Rv3542c H218Y 4 3984321 T C synonymous Rv3545c H375H 1 3984926 C T synonymous Rv3545c L174L 5 3986987 G T synonymous Rv3547 L48L 5 3987180 G A nonsynonymous Rv3547 D113N 1 3989107 C T intergenic - - modern 3989914 T G nonsynonymous Rv3551 S7A 3 3990093 C T synonymous Rv3551 V66V 2 3993058 G A nonsynonymous Rv3554 G125E 3 3994101 A G nonsynonymous Rv3554 I473V modern 3994898 C T nonsynonymous Rv3555c R268W 6 3995060 G T 
nonsynonymous Rv3555c V214F 6 3999774 C T nonsynonymous Rv3559c A221V


5 3999805 A G nonsynonymous Rv3559c T211A 2 4001622 T C intergenic - - 6 4001813 G A synonymous Rv3561 V59V 6 4002847 C A nonsynonymous Rv3561 T404N 1 4003645 C T nonsynonymous Rv3562 A162V 6 4004907 A T nonsynonymous Rv3563 E206V 4 4005114 C G nonsynonymous Rv3563 S275W 5 4006943 G A synonymous Rv3565 V248V 3 4007272 G A nonsynonymous Rv3565 R358Q 6 4007432 G A nonsynonymous Rv3566c E251K 5 4008252 G A nonsynonymous Rv3566A G61D 4 4008747 C T nonsynonymous Rv3567c T179I 5 4008863 A G synonymous Rv3567c S140S 5 4011992 C T synonymous Rv3570c N93N modern 4012219 G A nonsynonymous Rv3570c D18N 5 4012274 C T intergenic - - 3 4012286 C T intergenic - - 1 4013076 G A synonymous Rv3571 G220G 1 4014431 G A synonymous Rv3573c P594P 1 4019103 G T intergenic - - 6 4021757 C A synonymous Rv3579c R213R 1 4022652 C T synonymous Rv3580c S384S 1 4024079 A G nonsynonymous Rv3581c Q90R 5 4026414 G A intergenic - - 1 4026800 G C synonymous Rv3585 A119A 4 4028752 G A nonsynonymous Rv3586 A288T 6 4031202 C A nonsynonymous Rv3589 A237D 5 4033260 C T intergenic - - 3 4033711 T C nonsynonymous Rv3591c V111A 6 4035242 C A nonsynonymous Rv3593 F297L 1 4040517 T C nonsynonymous Rv3596c V63A 5 4040824 G A intergenic - - 2 4041581 G A nonsynonymous Rv3598c A454T 5 4041899 A C nonsynonymous Rv3598c I348L 3 4044872 C T synonymous Rv3602c G113G 5 4045844 C T nonsynonymous Rv3603c T92I 6 4046218 G A intergenic - - 6 4051853 T C nonsynonymous Rv3610c V344A 3 4052349 A C nonsynonymous Rv3610c K179Q modern 4054637 C T synonymous Rv3614c G20G 4 4056416 A C intergenic - - 2 4056693 G T intergenic - - 3 4057036 A G intergenic - - 1 4058711 T C nonsynonymous Rv3618 L5S 3 4059186 G T synonymous Rv3618 A163A 5 4060201 C T nonsynonymous Rv3619c S23L 5 4062582 C G nonsynonymous Rv3623 A19G 6 4064918 G A nonsynonymous Rv3626c G329D 6 4067044 C T nonsynonymous Rv3627c A81V modern 4067152 T C nonsynonymous Rv3627c V45A 5 4067386 C T intergenic - - 6 4069598 T C nonsynonymous Rv3630 S142P 1 4069797 C T 
nonsynonymous Rv3630 S208L


5 4074437 G C synonymous Rv3635 P268P 6 4074919 C T nonsynonymous Rv3635 P429L 5 4075626 C T intergenic - - 6 4078292 A C nonsynonymous Rv3639c K53N 5 4080434 G A nonsynonymous Rv3641c D43N 1 4081987 C G synonymous Rv3644c A245A 1 4081996 C G synonymous Rv3644c P242P 1 4083360 T G nonsynonymous Rv3645 I185S 6 4084010 G A nonsynonymous Rv3645 G402S 1 4085200 C G nonsynonymous Rv3646c D686E 5 4086604 C T synonymous Rv3646c Y218Y 6 4086697 C A synonymous Rv3646c A187A 3 4087495 A G intergenic - - 6 4087670 G A synonymous Rv3647c V173V 3 4087880 C T synonymous Rv3647c P103P 4 4089058 C T nonsynonymous Rv3649 P93L 6 4090238 C T synonymous Rv3649 G486G 3 4090453 C G nonsynonymous Rv3649 A558G 3 4092376 T C nonsynonymous Rv3651 I179T 1 4092921 G T intergenic - - 4 4095295 G A synonymous Rv3655c E2E 1 4096190 C T synonymous Rv3658c A250A 3 4096636 G A nonsynonymous Rv3658c G102S 3 4097569 C A synonymous Rv3659c A142A 6 4098514 A G nonsynonymous Rv3660c D212G 6 4099060 C T nonsynonymous Rv3660c P30L 1 4101018 C T intergenic - - modern 4105757 A G nonsynonymous Rv3666c Q443R 6 4106075 G A nonsynonymous Rv3666c R337H 1 4106154 G A nonsynonymous Rv3666c E311K 4 4107074 G A nonsynonymous Rv3666c R4Q 5 4109151 G A nonsynonymous Rv3667 D454N 6 4110626 C T intergenic - - 4 4112429 G A nonsynonymous Rv3671c V363I 5 4113005 G A nonsynonymous Rv3671c A171T 1 4115029 C T synonymous Rv3673c D43D 3 4115952 C T intergenic - - modern 4116682 C T synonymous Rv3676 L69L 5 4117097 A C nonsynonymous Rv3676 E207A modern 4117161 A C intergenic - - 5 4119114 C G synonymous Rv3679 G113G modern 4119246 C T synonymous Rv3679 D157D 5 4120451 G T synonymous Rv3680 A219A 5 4123685 C A nonsynonymous Rv3682 D590E 5 4123724 G A synonymous Rv3682 A603A 5 4124983 C T synonymous Rv3683 R189R 1 4126087 T G nonsynonymous Rv3684 S217A 6 4127009 C A intergenic - - 6 4128152 G A nonsynonymous Rv3685c V192I 5 4128879 T C nonsynonymous Rv3686c W69R 6 4130604 G A nonsynonymous Rv3689 S83N modern 4130711 C G 
nonsynonymous Rv3689 L119V 3 4132509 G A intergenic - -


2 4133316 A T nonsynonymous Rv3691 T267S 5 4133466 T C synonymous Rv3691 L317L 1 4133907 G A nonsynonymous Rv3692 R131H 5 4134341 G A nonsynonymous Rv3692 V276I 5 4134401 G A nonsynonymous Rv3692 A296T 5 4137136 G A intergenic - - 1 4137190 T C intergenic - - modern 4138377 C T nonsynonymous Rv3696c A460V 3 4138622 C T synonymous Rv3696c R378R 5 4139131 G A nonsynonymous Rv3696c E209K 6 4141285 C T nonsynonymous Rv3698 R265C 3 4142192 G A nonsynonymous Rv3699 G50E 1 4142689 G A nonsynonymous Rv3699 E216K 5 4144371 A G synonymous Rv3701c V182V 4 4145737 C T synonymous Rv3703c Y385Y 6 4148162 G T nonsynonymous Rv3704c A9S 5 4153687 C T intergenic - - 1 4155266 C G synonymous Rv3710 G469G modern 4156239 C T nonsynonymous Rv3711c A164V 4 4156503 A G nonsynonymous Rv3711c D76G 6 4158032 A C nonsynonymous Rv3712 K351T modern 4158361 G A synonymous Rv3713 P45P 1 4160536 C T synonymous Rv3716c G126G 5 4161854 G T synonymous Rv3718c V135V 1 4163558 G C nonsynonymous Rv3719 S418T 3 4166290 C T nonsynonymous Rv3721c H148Y 2 4167656 T C nonsynonymous Rv3722c M158T 1 4169719 C T synonymous Rv3724B A38A 6 4172173 A C nonsynonymous Rv3726 R251S 6 4173849 C T nonsynonymous Rv3727 R299C 2 4174131 G C nonsynonymous Rv3727 G393R 4 4179089 T C nonsynonymous Rv3729 S269P 6 4179391 G A stopgain Rv3729 W369X 2 4179832 G C nonsynonymous Rv3729 Q516H 2 4182387 G C synonymous Rv3731 V210V 6 4183288 G A nonsynonymous Rv3732 G119S 6 4183602 C T synonymous Rv3732 A223A 6 4186050 G T intergenic - - 6 4186864 C T synonymous Rv3736 I77I 1 4187063 G A nonsynonymous Rv3736 G144R 5 4189191 G T nonsynonymous Rv3737 R498L modern 4190532 C A intergenic - - 1 4190596 A C intergenic - - 5 4190639 G A intergenic - - 6 4192341 C T synonymous Rv3741c R171R 5 4192797 A G synonymous Rv3741c Q19Q 6 4193449 G C nonsynonymous Rv3743c G642A 6 4193641 T C nonsynonymous Rv3743c V578A 6 4195182 C T synonymous Rv3743c C64C 3 4195390 A T intergenic - - 3 4195799 C A synonymous Rv3744 G120G 6 4195897 C T synonymous 
Rv3745c L68L modern 4197189 G A intergenic - -


modern 4200220 C T nonsynonymous Rv3753c T1M 5 4200686 C T nonsynonymous Rv3754 T89M 1 4201105 G A nonsynonymous Rv3754 D229N 6 4201352 C T synonymous Rv3755c T179T 5 4201535 G A synonymous Rv3755c V118V 5 4201728 C T nonsynonymous Rv3755c P54L modern 4202383 G T synonymous Rv3756c V77V 5 4205237 G A nonsynonymous Rv3759c G46E 2 4205325 T C nonsynonymous Rv3759c W17R 5 4208331 G T synonymous Rv3762c A182A modern 4210876 G A nonsynonymous Rv3764c R45H 6 4212174 T C intergenic - - 5 4214206 G C nonsynonymous Rv3768 R46P 6 4215001 C A intergenic - - 4 4215484 G C nonsynonymous Rv3770c A98P 4 4217557 G A nonsynonymous Rv3772 A142T 6 4219820 G C nonsynonymous Rv3775 A46P 6 4220427 C T nonsynonymous Rv3775 P248L 3 4222131 C T nonsynonymous Rv3776 A348V 6 4225401 C T synonymous Rv3779 T139T 3 4226275 C T nonsynonymous Rv3779 P431S 6 4228997 C T synonymous Rv3782 C217C 5 4230965 G T nonsynonymous Rv3784 G237V 5 4231528 A G nonsynonymous Rv3785 Y70C modern 4232327 G A synonymous Rv3785 R336R 5 4233541 G A synonymous Rv3786c T19T 6 4236150 C T synonymous Rv3790 G124G 6 4236891 C A synonymous Rv3790 P371P 5 4239843 A C nonsynonymous Rv3792 K638Q modern 4240671 T C nonsynonymous Rv3793 I270T 1 4241042 A G nonsynonymous Rv3793 N394D 6 4241843 C A nonsynonymous Rv3793 L661I 3 4242075 G A nonsynonymous Rv3793 R738Q 6 4244379 C T nonsynonymous Rv3794 P383S 5 4244635 T C nonsynonymous Rv3794 V468A 5 4245147 C T nonsynonymous Rv3794 P639S 1 4245969 C T nonsynonymous Rv3794 P913S 6 4246864 C T synonymous Rv3795 V117V modern 4247646 C A nonsynonymous Rv3795 A378E 4 4251297 C G synonymous Rv3797 G71G 1 4254347 T A intergenic - - 2 4254431 C T synonymous Rv3799c D506D 3 4258447 C T synonymous Rv3800c R900R modern 4261499 T C nonsynonymous Rv3801c V523A 5 4262256 A G nonsynonymous Rv3801c I271V modern 4263279 G A intergenic - - 3 4266647 T C synonymous Rv3804c V4V 2 4267647 A G nonsynonymous Rv3805c D397G 6 4269351 C T synonymous Rv3806c A161A modern 4269387 C A nonsynonymous Rv3806c 
D149E 6 4269522 C T synonymous Rv3806c T104T modern 4269606 C T synonymous Rv3806c R76R 6 4270171 C A nonsynonymous Rv3807c A56E


5 4271348 C T nonsynonymous Rv3808c A311V 6 4271498 T C nonsynonymous Rv3808c L261P 6 4272211 G A synonymous Rv3808c V23V 5 4275241 C A synonymous Rv3811 L148L 1 4275935 A G nonsynonymous Rv3811 M380V 1 4276306 C T synonymous Rv3811 G503G 1 4280441 G T synonymous Rv3815c P116P 1 4281143 G C synonymous Rv3816c V143V 1 4281272 C T synonymous Rv3816c Y100Y 2 4284429 C T nonsynonymous Rv3820c P466L 3 4286826 A G nonsynonymous Rv3822 K36E 4 4287164 G A synonymous Rv3822 G148G modern 4287361 T C nonsynonymous Rv3822 V214A 6 4289216 G A nonsynonymous Rv3823c G772S modern 4289953 C T nonsynonymous Rv3823c A526V 2 4290135 G A synonymous Rv3823c L465L modern 4290564 A C synonymous Rv3823c A322A modern 4290827 C G nonsynonymous Rv3823c R235G 6 4292095 G T synonymous Rv3824c A360A 5 4292317 C T synonymous Rv3824c F286F 3 4292941 C T synonymous Rv3824c H78H 5 4293133 G C nonsynonymous Rv3824c W14C 5 4296229 A G nonsynonymous Rv3825c D1126G 1 4296381 C A synonymous Rv3825c T1075T 2 4301075 G C nonsynonymous Rv3826 E422Q 1 4303407 T G nonsynonymous Rv3829c S534R 6 4303554 C A synonymous Rv3829c G485G 6 4303675 T C nonsynonymous Rv3829c L445P 5 4304824 C G nonsynonymous Rv3829c P62R 3 4305243 C T nonsynonymous Rv3830c P148L 5 4306059 C T synonymous Rv3831 A101A 1 4306339 C A nonsynonymous Rv3832c A158E 1 4307344 C G nonsynonymous Rv3833 L160V 2 4308395 C T synonymous Rv3834c L174L 6 4308991 G A intergenic - - 4 4313128 T C nonsynonymous Rv3839 S122P 5 4313357 G A nonsynonymous Rv3839 R198H 1 4314843 C T synonymous Rv3842c D240D 6 4316322 G A nonsynonymous Rv3843c R92H modern 4316566 C G nonsynonymous Rv3843c R11G 5 4317750 C T intergenic - - modern 4318425 G C intergenic - - 5 4319352 T C synonymous Rv3845 A24A 2 4319985 G C intergenic - - 6 4320299 G A intergenic - - 5 4322042 G A nonsynonymous Rv3847 A169T modern 4323006 A G synonymous Rv3848 A227A 6 4326465 A G nonsynonymous Rv3854c I337V 5 4326928 C T synonymous Rv3854c G182G 5 4327103 G A nonsynonymous Rv3854c G124D 3 4328492 
C T nonsynonymous Rv3856c A306V 5 4328644 T C synonymous Rv3856c R255R 4 4329782 A G intergenic - -


6 4330238 G A nonsynonymous Rv3858c R423H 6 4333284 A G nonsynonymous Rv3859c N933S modern 4334425 G C nonsynonymous Rv3859c E553Q modern 4336597 A C intergenic - - 3 4336991 C T nonsynonymous Rv3860 A72V 1 4337574 A G synonymous Rv3860 E266E 1 4338603 G A intergenic - - 5 4339610 G A synonymous Rv3863 L254L 6 4339880 C T synonymous Rv3863 A344A 5 4340964 A C nonsynonymous Rv3864 D232A 5 4340966 C A nonsynonymous Rv3864 L233I 5 4340999 G T nonsynonymous Rv3864 E244L 5 4341000 A T nonsynonymous Rv3864 E244V 5 4343653 G A nonsynonymous Rv3868 G114S 3 4343784 G A synonymous Rv3868 K157K 1 4344058 T C nonsynonymous Rv3868 S249P 5 4345036 A C intergenic - - 3 4345548 G A nonsynonymous Rv3869 M170I 6 4346843 C A synonymous Rv3870 T121T 5 4347337 C A nonsynonymous Rv3870 A286D 5 4357657 C T nonsynonymous Rv3879c A709V 1 4357773 C T synonymous Rv3879c T670T 2 4357804 A C nonsynonymous Rv3879c E660A 1 4357946 G A nonsynonymous Rv3879c G613S 6 4358866 C T nonsynonymous Rv3879c P306L 6 4359202 C T nonsynonymous Rv3879c S194F 5 4361250 T C nonsynonymous Rv3881c Y226H 6 4362384 G C nonsynonymous Rv3882c G346A 2 4362568 A C synonymous Rv3882c R285R 6 4364323 C T synonymous Rv3883c D145D 6 4365212 A G nonsynonymous Rv3884c N543D modern 4367649 C T synonymous Rv3885c Y291Y 1 4369499 C T synonymous Rv3886c P224P 5 4371331 G T nonsynonymous Rv3887c K118N 6 4372275 G A synonymous Rv3888c G144G 4 4372353 G C synonymous Rv3888c R118R modern 4374228 G A nonsynonymous Rv3891c A49T modern 4377033 G C synonymous Rv3894c S1140S 2 4378504 A G nonsynonymous Rv3894c D650G 6 4378608 C T synonymous Rv3894c Y615Y 6 4382296 A G nonsynonymous Rv3896c I186V 5 4382553 C T nonsynonymous Rv3896c A100V 1 4383442 A G nonsynonymous Rv3897c I67V 4 4383655 C T stopgain Rv3898c Q111X 4 4384007 G C intergenic - - modern 4385187 C T nonsynonymous Rv3899c P65S 6 4386257 C G nonsynonymous Rv3900c P18A 6 4386625 G A nonsynonymous Rv3901c V64M 5 4386746 A G nonsynonymous Rv3901c I23M 5 4387392 C T synonymous 
Rv3902c F168F 5 4387423 C T nonsynonymous Rv3902c S158F 5 4388976 C T nonsynonymous Rv3903c P486L 1 4390380 T C nonsynonymous Rv3903c V18A


6 4393838 C T synonymous Rv3908 L130L modern 4394210 C G nonsynonymous Rv3909 R7G modern 4395387 G A nonsynonymous Rv3909 S399N 3 4396495 C A synonymous Rv3909 G768G 5 4397110 G A nonsynonymous Rv3910 V172M 5 4397374 G A nonsynonymous Rv3910 A260T 5 4397763 C A synonymous Rv3910 P389P 6 4398223 C T nonsynonymous Rv3910 R543W 1 4398732 G A synonymous Rv3910 L712L 5 4399422 G A synonymous Rv3910 A942A 6 4400663 C G nonsynonymous Rv3911 R160G 5 4400947 C G nonsynonymous Rv3912 D26E 1 4401400 A G synonymous Rv3912 P177P 3 4401509 C T nonsynonymous Rv3912 R214W 4 4407588 G A synonymous Rv3919c A205A 1 4407873 G T synonymous Rv3919c V110V 6 4408570 C T nonsynonymous Rv3920c R110W modern 4408920 G A intergenic - - 4 4408923 T C intergenic - - 3 4409954 C G nonsynonymous Rv3921c A39G 1 4410386 C A synonymous Rv3922c R10R 1 4411016 G A intergenic - -
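The Appendix B listing above is a continuous run of whitespace-separated fields. A minimal Python sketch for splitting such a run back into individual SNP records is given below. It assumes that every record has exactly seven fields in the order lineage, genomic position, ancestral allele, derived allele, mutation type, gene, mutation (the Appendix C column order without the final DR column), with intergenic SNPs carrying "-" for gene and mutation. The names `SnpRecord` and `parse_snp_records` are illustrative and do not come from the thesis.

```python
from typing import List, NamedTuple


class SnpRecord(NamedTuple):
    """One lineage-specific SNP (field order assumed from the appendix layout)."""
    lineage: str          # "1" to "6", or "modern"
    position: int         # genomic coordinate
    ancestral: str        # ancestral allele
    derived: str          # derived allele
    mutation_type: str    # e.g. nonsynonymous, synonymous, intergenic, stopgain
    gene: str             # locus tag, or "-" for intergenic SNPs
    mutation: str         # amino acid change, or "-" for intergenic SNPs


FIELDS_PER_RECORD = 7  # assumption: every record has exactly seven fields


def parse_snp_records(text: str) -> List[SnpRecord]:
    """Split a flattened, whitespace-separated listing into SNP records."""
    tokens = text.split()
    if len(tokens) % FIELDS_PER_RECORD != 0:
        raise ValueError("token count is not a multiple of seven")
    records = []
    for start in range(0, len(tokens), FIELDS_PER_RECORD):
        lineage, position, anc, der, mtype, gene, mutation = tokens[
            start:start + FIELDS_PER_RECORD
        ]
        records.append(
            SnpRecord(lineage, int(position), anc, der, mtype, gene, mutation)
        )
    return records


# Three records copied from the listing above.
sample = (
    "1 3536008 C A nonsynonymous Rv3167c P17Q "
    "3 3539353 G A synonymous Rv3170 Q283Q "
    "2 3542049 G A intergenic - -"
)
records = parse_snp_records(sample)
for record in records:
    print(record)
```

Because the split relies only on whitespace, it is unaffected by where the page breaks fell in the extracted text, provided the running headers and page numbers have been removed first.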

Appendix C: Lineage-specific SNPs within genes associated with drug resistance


Appendix C

Lineage-specific SNPs located within genes associated with M. tuberculosis drug resistance, as identified in the TBDreaMDB database. The last column (DR) indicates the drug with which resistance is associated: EMB, ethambutol; ETH, ethionamide; FLQ, fluoroquinolones; INH, isoniazid; RIF, rifampicin; SM, streptomycin.

Lineage  Genomic position  Ancestral allele  Derived allele  Mutation type  Gene    Mutation  DR
5        408935            C                 T               nonsynonymous  Rv0340  A101V     EMB
5        409079            G                 A               nonsynonymous  Rv0340  G149E     EMB
6        1416633           G                 C               nonsynonymous  embR    L239V     EMB
1        1417019           C                 T               nonsynonymous  embR    C110Y     EMB
5        1507920           G                 A               synonymous     Rv1341  V116V     EMB
3        3489665           C                 T               nonsynonymous  moaR1   P54S      EMB
3        3645524           C                 T               nonsynonymous  manB    D152N     EMB
1        3647591           A                 G               synonymous     rmlD    N73N      EMB
modern   3647041           G                 A               nonsynonymous  rmlD    P257S     EMB
modern   4240671           T                 C               nonsynonymous  embC    I270T     EMB
1        4241042           A                 G               nonsynonymous  embC    N394D     EMB
6        4241843           C                 A               nonsynonymous  embC    L661I     EMB
3        4242075           G                 A               nonsynonymous  embC    R738Q     EMB
6        4244379           C                 T               nonsynonymous  embA    P383S     EMB
5        4244635           T                 C               nonsynonymous  embA    V468A     EMB
5        4245147           C                 T               nonsynonymous  embA    P639S     EMB
1        4245969           C                 T               nonsynonymous  embA    P913S     EMB
6        4246864           C                 T               synonymous     embB    V117V     EMB
modern   4247646           C                 A               nonsynonymous  embB    A378E     EMB
6        1674434           T                 C               nonsynonymous  inhA    V78A      ETH
5        4326928           G                 A               synonymous     ethA    G182G     ETH
6        4326465           T                 C               nonsynonymous  ethA    I337V     ETH
5        4327103           C                 T               nonsynonymous  ethA    G124D     ETH
1        6112              G                 C               nonsynonymous  gyrB    M330I     FLQ
modern   9143              C                 T               synonymous     gyrA    I614I     FLQ
5        9566              C                 T               synonymous     gyrA    Y755Y     FLQ
1        8452              C                 T               nonsynonymous  gyrA    A384V     FLQ
6        8493              C                 T               nonsynonymous  gyrA    L398F     FLQ
3        157129            C                 T               nonsynonymous  fbpC    G158S     INH
modern   412280            G                 T               nonsynonymous  iniA    Q481H     INH
5        2101921           C                 T               synonymous     ndh     S374S     INH
6        2155503           G                 A               synonymous     katG    T203T     INH
4        2154724           A                 C               nonsynonymous  katG    L463R     INH
3        2516271           T                 C               nonsynonymous  Rv2242  M323T     INH
5        2516804           A                 C               synonymous     fabD    A6A       INH
modern   2518132           T                 C               synonymous     kasA    T6T       INH


2        2521428           A                 G               nonsynonymous  accD6   D229G     INH
1        2726051           G                 A               nonsynonymous  oxyR'   L13F      INH
5        3506470           G                 A               nonsynonymous  fadE24  A370T     INH
6        4007432           C                 T               nonsynonymous  nat     E251K     INH
3        2726105           G                 A               intergenic     #N/A    -         oxyR/ahpC upstream
3        762434            T                 G               synonymous     rpoB    G876G     RIF
4        763031            C                 T               synonymous     rpoB    A1075A    RIF
6        760969            C                 T               nonsynonymous  rpoB    S388L     RIF
6        761723            A                 C               nonsynonymous  rpoB    E639D     RIF
3        759746            C                 T               intergenic     #N/A    -         rpoB upstream
4        4407588           C                 T               synonymous     gid     A205A     SM
1        4407873           C                 A               synonymous     gid     V110V     SM
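The Mutation column uses the notation reference amino acid, codon position, derived amino acid (for example A101V), and the Mutation type column follows from it: identical residues are synonymous, different residues are nonsynonymous, and a derived "X" marks a stop gain (as in G214X in Appendix B). The Python sketch below illustrates that reading of the notation. The function name `classify_mutation` is hypothetical, and the rule is inferred from the table entries rather than taken from the annotation software used in the thesis.

```python
import re

# Single-letter reference residue, codon position, single-letter derived residue.
_MUTATION_PATTERN = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def classify_mutation(mutation: str) -> str:
    """Classify an amino acid change such as 'A101V' by comparing its residues."""
    match = _MUTATION_PATTERN.match(mutation)
    if match is None:
        raise ValueError(f"unrecognised mutation notation: {mutation!r}")
    reference, _position, derived = match.groups()
    if derived == "X":
        # In these tables 'X' denotes a premature stop codon.
        return "stopgain"
    if reference == derived:
        return "synonymous"
    return "nonsynonymous"


# Entries taken from Appendices B and C.
for example in ("A101V", "V116V", "G214X"):
    print(example, classify_mutation(example))
```

Intergenic SNPs carry "-" in the Mutation column and are deliberately rejected by this sketch, since they have no codon to classify.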

Appendix D: Nonsynonymous/synonymous SNP ratio


Appendix D

Nonsynonymous to synonymous SNP ratios for the 28 genomes used in the study. A. Ratio based on lineage branch SNPs. B. Ratio based on all SNPs within each of the 28 strains used in the study. This ratio therefore includes the singleton SNPs present in the extant strains.

A. Lineage-specific SNPs (internal branches)

Lineage        Nonsynonymous SNP  Synonymous SNP  Nonsynonymous / Synonymous
1              238                156             1.5
5              385                213             1.8
6              374                206             1.8
2              74                 33              2.2
3              182                117             1.6
4              96                 46              2.1
Modern branch  172                78              2.2
Average        -                  -               1.9
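The ratios in table A can be recomputed directly from the two count columns. The short Python sketch below does so as an illustration. The variable and function names are hypothetical, and the way the average was obtained is an assumption: the tabulated 1.9 is consistent with the mean of the seven per-branch ratios, whereas dividing the summed counts (1521 / 849) would give 1.8.

```python
# (nonsynonymous, synonymous) SNP counts per branch, copied from table A.
branch_counts = {
    "1": (238, 156),
    "5": (385, 213),
    "6": (374, 206),
    "2": (74, 33),
    "3": (182, 117),
    "4": (96, 46),
    "Modern branch": (172, 78),
}


def ns_ratio(nonsynonymous: int, synonymous: int) -> float:
    """Ratio of nonsynonymous to synonymous SNP counts (a plain count ratio)."""
    return nonsynonymous / synonymous


# Per-branch ratios, rounded to one decimal place as in the table.
ratios = {
    branch: round(ns_ratio(n, s), 1) for branch, (n, s) in branch_counts.items()
}

# Assumed averaging rule: mean of the unrounded per-branch ratios.
unrounded = [ns_ratio(n, s) for n, s in branch_counts.values()]
average = round(sum(unrounded) / len(unrounded), 1)

for branch, ratio in ratios.items():
    print(branch, ratio)
print("Average", average)
```

Note that this is a simple ratio of SNP counts, not a dN/dS estimate normalised by the number of nonsynonymous and synonymous sites.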


B. SNPs in external branches

Strain        Lineage  Nonsynonymous SNP  Synonymous SNP  Nonsynonymous / Synonymous
MTB_95_0545   1        473                296             1.7
MTB_K21       1        525                325             1.7
MTB_K67       1        495                313             1.6
MTB_K93       1        486                303             1.7
MTB_T17       1        464                317             1.4
MTB_T83       1        464                307             1.5
MTB_T92       1        506                314             1.7
MTB_N0070     1        489                331             1.4
MTB_N0072     1        491                341             1.4
MAF_11821_03  5        546                321             1.5
MAF_5444_04   5        523                329             1.2
MAF_4141_04   6        590                353             1.5
MAF_GM_0981   6        605                353             1.6
MTB_00_1695   2        527                259             2.0
MTB_98_1833   2        518                271             1.9
MTB_M4100A    2        506                279             1.8
MTB_T67       2        530                273             1.9
MTB_T85       2        534                273             1.9
MTB_N0031     2        514                256             2.0
MTB_91_0079   3        492                295             1.7
MTB_K49       3        490                282             1.9
MTB_SG1       3        525                301             1.9
MTB_4783_04   4        475                262             1.8
MTB_erdman    4        445                229             1.9
MTB_GM_1503   4        472                282             1.6
MTB_H37Rv     4        504                283             1.7
MTB_K37       4        440                242             1.8
MTB_KZN_605   4        461                249             1.8
Average       -        -                  -               1.7

Appendix E: RNA-seq differential expression


Appendix E

Genes, antisense transcripts and sRNAs differentially expressed between Lineages 1 and 2. A. Sense transcription (gene expression). B. Antisense transcription. C. sRNAs.

A. Sense transcription (N=112)

Gene              Lineage 1  Lineage 2  Fold change  p-value   Functional category
Rv0027            76.2       28.5       2.7          1.23E-02  conserved hypotheticals
Rv0028            58.1       23.5       2.5          1.51E-02  conserved hypotheticals
Rv0082            126.5      35.7       3.5          9.27E-03  intermediary metabolism and respiration
Rv0130 htdZ       121.0      48.1       2.5          1.11E-02  intermediary metabolism and respiration
Rv0157A           144.8      66.1       2.2          1.85E-02  conserved hypotheticals
Rv0193c           206.1      78.7       2.6          2.28E-02  conserved hypotheticals
Rv0250c           322.7      780.7      0.4          3.71E-02  conserved hypotheticals
Rv0275c           131.3      359.4      0.4          3.67E-02  regulatory proteins
Rv0469 umaA       2628.7     1217.0     2.2          3.75E-02  lipid metabolism
Rv0553 menC       361.9      57.8       6.3          1.79E-03  intermediary metabolism and respiration
Rv0554 bpoC       542.4      239.8      2.3          1.88E-02  virulence, detoxification, adaptation
Rv0557 mgtA       410.6      127.6      3.2          3.14E-05  lipid metabolism
Rv0619 galTb      66.8       9.7        6.9          5.12E-07  intermediary metabolism and respiration
Rv0620 galK       37.7       1.0        39.3         5.12E-07  intermediary metabolism and respiration
Rv0653c           156.2      53.2       2.9          1.34E-02  regulatory proteins
Rv0686            159.6      477.5      0.3          5.00E-04  cell wall and cell processes
Rv0724A           210.0      58.6       3.6          4.84E-03  conserved hypotheticals
Rv0783c emrB      544.3      244.1      2.2          3.58E-02  cell wall and cell processes
Rv0847 lpqS       236.7      46.7       5.1          1.00E-02  cell wall and cell processes
Rv0877            991.9      431.9      2.3          4.03E-02  conserved hypotheticals
Rv0890c           1871.3     872.9      2.1          1.65E-02  regulatory proteins
Rv1044            79.3       27.5       2.9          7.62E-03  conserved hypotheticals
Rv1075c           755.3      298.4      2.5          1.85E-02  cell wall and cell processes
Rv1103c mazE3     126.3      57.7       2.2          3.57E-02  virulence, detoxification, adaptation
Rv1233c           5340.6     1249.2     4.3          3.08E-02  cell wall and cell processes
Rv1397c vapC10    86.0       862.7      0.1          6.10E-04  virulence, detoxification, adaptation
Rv1433            1469.9     414.0      3.6          4.09E-04  cell wall and cell processes
Rv1440 secG       990.7      376.2      2.6          1.99E-02  cell wall and cell processes


Rv1503c           462.9      55.1       8.4          3.35E-04  conserved hypotheticals
Rv1504c           355.2      30.8       11.5         6.04E-15  conserved hypotheticals
Rv1505c           1726.9     137.5      12.6         2.53E-19  conserved hypotheticals
Rv1506c           475.6      205.2      2.3          4.06E-03  unknown
Rv1508c           1550.0     731.3      2.1          4.45E-02  cell wall and cell processes
Rv1530 adh        162.8      71.4       2.3          3.65E-02  intermediary metabolism and respiration
Rv1541c lprI      283.8      135.2      2.1          2.20E-02  cell wall and cell processes
Rv1551 plsB1      401.3      90.9       4.4          1.25E-06  lipid metabolism
Rv1592c           3281.5     701.5      4.7          1.12E-06  conserved hypotheticals
Rv1661 pks7       389.2      957.0      0.4          2.07E-03  lipid metabolism
Rv1699 pyrG       1516.8     790.4      1.9          3.12E-02  intermediary metabolism and respiration
Rv1733c           24.0       60.9       0.4          2.28E-02  cell wall and cell processes
Rv1749c           767.2      404.9      1.9          2.84E-02  cell wall and cell processes
Rv1778c           1019.2     404.1      2.5          2.14E-03  conserved hypotheticals
Rv1781c malQ      211.9      94.3       2.2          3.08E-02  intermediary metabolism and respiration
Rv1895            275.9      74.4       3.7          7.17E-05  intermediary metabolism and respiration
Rv1912c fadB5     969.7      451.6      2.1          1.65E-02  lipid metabolism
Rv1918c PPE35     536.3      1278.3     0.4          1.51E-02  PE/PPE
Rv1925 fadD31     728.0      2531.8     0.3          3.80E-05  lipid metabolism
Rv1926c mpt63     12021.3    3363.3     3.6          4.29E-02  cell wall and cell processes
Rv1929c           1032.0     430.0      2.4          1.11E-02  conserved hypotheticals
Rv1979c           250.3      540.7      0.5          1.28E-02  cell wall and cell processes
Rv1980c mpt64     914.6      2468.6     0.4          1.28E-02  cell wall and cell processes
Rv1981c nrdF1     4224.9     971.0      4.4          1.10E-07  information pathways
Rv2051c ppm1      1963.9     901.7      2.2          3.97E-02  cell wall and cell processes
Rv2063 mazE7      1210.6     77.3       15.7         2.38E-03  virulence, detoxification, adaptation
Rv2063A mazF7     153.6      31.1       4.9          8.05E-07  virulence, detoxification, adaptation
Rv2080 lppJ       358.4      64.7       5.5          5.36E-09  cell wall and cell processes
Rv2090            224.9      92.5       2.4          2.40E-02  information pathways
Rv2144c           4074.8     1755.5     2.3          1.56E-02  cell wall and cell processes
Rv2161c           62.6       802.9      0.1          1.50E-13  intermediary metabolism and respiration
Rv2189c           51.9       159.3      0.3          6.10E-04  conserved hypotheticals
Rv2211c gcvT      2460.5     1131.1     2.2          3.61E-02  intermediary metabolism and respiration
Rv2243 fabD       1747.5     756.7      2.3          3.44E-03  lipid metabolism
Rv2274A mazE8     62.8       19.4       3.2          8.15E-03  virulence, detoxification, adaptation
Rv2331            182.0      59.7       3.0          4.86E-03  conserved hypotheticals
Rv2428 ahpC       1089.1     326.6      3.3          6.31E-05  virulence, detoxification, adaptation
Rv2429 ahpD       307.3      129.9      2.4          1.49E-02  virulence, detoxification, adaptation
Rv2478c           171.7      54.6       3.1          3.58E-02  conserved hypotheticals
Rv2497c bkdA      2822.7     1525.3     1.9          3.58E-02  intermediary metabolism and respiration
Rv2518c ldtB      1508.5     383.2      3.9          2.99E-03  cell wall and cell processes


Rv2525c           1076.0     520.7      2.1          1.77E-02  conserved hypotheticals
Rv2526 vapB17     233.2      1505.4     0.2          3.34E-09  virulence, detoxification, adaptation
Rv2527 vapC17     60.2       553.3      0.1          3.02E-10  virulence, detoxification, adaptation
Rv2528c mrr       113.5      49.2       2.3          1.37E-02  information pathways
Rv2573            50.5       14.5       3.5          2.48E-03  conserved hypotheticals
Rv2596 vapC40     215.1      96.2       2.2          4.57E-02  virulence, detoxification, adaptation
Rv2697c dut       2014.8     642.7      3.1          6.63E-04  intermediary metabolism and respiration
Rv2707            1353.7     592.0      2.3          1.02E-02  conserved hypotheticals
Rv2719c           358.6      130.3      2.8          3.14E-04  cell wall and cell processes
Rv2729c           292.2      125.2      2.3          3.58E-02  cell wall and cell processes
Rv2758c vapB21    599.5      242.1      2.5          1.37E-02  virulence, detoxification, adaptation
Rv2765            413.4      55.8       7.4          2.21E-11  intermediary metabolism and respiration
Rv2809            763.7      361.3      2.1          2.96E-02  conserved hypotheticals
Rv2830c vapB22    290.5      101.5      2.9          1.81E-03  virulence, detoxification, adaptation
Rv2843            215.4      88.0       2.4          2.71E-02  cell wall and cell processes
Rv2870c dxr       1241.2     419.3      3.0          2.63E-03  intermediary metabolism and respiration
Rv2938 drrC       708.9      264.5      2.7          2.99E-03  cell wall and cell processes
Rv2952            1325.5     666.5      2.0          3.08E-02  intermediary metabolism and respiration
Rv3082c virS      780.3      45.8       17.0         4.09E-20  regulatory proteins
Rv3167c           92.0       24.8       3.7          5.19E-05  regulatory proteins
Rv3168            1695.7     343.5      4.9          1.25E-06  conserved hypotheticals
Rv3196A           134.0      342.0      0.4          3.75E-02  conserved hypotheticals
Rv3198c uvrD2     510.1      1224.9     0.4          1.44E-02  information pathways
Rv3233c           159.0      747.3      0.2          1.27E-06  lipid metabolism
Rv3242c           19.2       54.7       0.4          2.81E-02  conserved hypotheticals
Rv3350c PPE56     818.8      318.3      2.6          2.66E-02  PE/PPE
Rv3366 spoU       42.5       201.8      0.2          4.35E-06  information pathways
Rv3389c htdY      1481.1     504.3      2.9          2.07E-03  intermediary metabolism and respiration
Rv3415c           307.7      89.3       3.4          4.15E-04  conserved hypotheticals
Rv3435c           2667.0     826.7      3.2          4.09E-04  cell wall and cell processes
Rv3446c           153.0      13.7       11.2         3.58E-06  conserved hypotheticals
Rv3500c yrbE4B    1711.8     469.8      3.6          5.19E-05  virulence, detoxification, adaptation
Rv3540c ltp2      1111.1     355.3      3.1          6.08E-03  lipid metabolism
Rv3652 PE_PGRS60  1200.2     74.1       16.2         1.27E-16  PE/PPE
Rv3679            413.8      4344.3     0.1          2.55E-04  cell wall and cell processes
Rv3680            570.5      2840.3     0.2          1.25E-06  cell wall and cell processes
Rv3695            58.3       268.2      0.2          1.60E-05  cell wall and cell processes
Rv3741c           35.7       10.3       3.5          1.85E-02  intermediary metabolism and respiration
Rv3742c           90.7       23.8       3.8          1.05E-02  intermediary metabolism and respiration


Rv3810 pirG       1826.3     879.2      2.1          2.61E-02  cell wall and cell processes
Rv3812 PE_PGRS62  219.3      86.9       2.5          6.39E-03  PE/PPE
Rv3829c           409.9      8412.8     0.05         6.32E-05  intermediary metabolism and respiration
Rv3831            52.4       739.0      0.1          1.02E-02  conserved hypotheticals
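The Fold change column appears to be the Lineage 1 value divided by the Lineage 2 value, so values above 1 indicate higher expression in Lineage 1 and values below 1 higher expression in Lineage 2. The Python sketch below illustrates that relationship. It is an interpretation of the tabulated numbers, not the thesis pipeline, and the function name `fold_change` is hypothetical. The tabulated fold changes were presumably computed from unrounded expression values, so recomputing them from the rounded entries agrees only approximately for some genes (for example Rv0620: 37.7 / 1.0 = 37.7, against a tabulated 39.3).

```python
import math


def fold_change(lineage1: float, lineage2: float) -> float:
    """Lineage 1 expression divided by Lineage 2 expression.

    A zero denominator yields infinity, mirroring the 'inf' entry in the
    antisense table (Rv0682 rpsL: 14.4 vs 0.0). Zero over zero is undefined.
    """
    if lineage2 == 0:
        return math.inf if lineage1 > 0 else math.nan
    return lineage1 / lineage2


# Values copied from the sense and antisense tables.
examples = {
    "Rv0027": (76.2, 28.5),          # tabulated fold change 2.7
    "Rv0250c": (322.7, 780.7),       # tabulated fold change 0.4
    "Rv0682 (antisense)": (14.4, 0.0),  # tabulated fold change inf
}
for gene, (l1, l2) in examples.items():
    print(gene, round(fold_change(l1, l2), 1))
```

A log2 transform of these ratios would place up- and down-regulation on a symmetric scale, but the appendix reports the untransformed ratio, so the sketch does the same.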

B. Antisense transcription (N=56)

Gene              Lineage 1  Lineage 2  Fold change  p-value   Functional category
Rv0213c           1389.4     539.0      2.6          3.64E-02  intermediary metabolism and respiration
Rv0345            241.8      787.6      0.3          3.82E-02  conserved hypotheticals
Rv0354c PPE7      168.2      600.4      0.3          1.04E-02  PE/PPE
Rv0423c thiC      43.2       7.7        5.6          1.98E-02  intermediary metabolism and respiration
Rv0440 groEL2     154.3      14.0       11.0         2.59E-07  virulence, detoxification, adaptation
Rv0470c pcaA      844.3      205.3      4.1          8.88E-04  lipid metabolism
Rv0482 murB       19.2       67.9       0.3          4.10E-02  cell wall and cell processes
Rv0524 hemL       31.2       123.2      0.3          3.95E-02  intermediary metabolism and respiration
Rv0552            744.1      86.1       8.6          8.11E-08  conserved hypotheticals
Rv0557 mgtA       161.8      23.9       6.8          2.19E-04  lipid metabolism
Rv0635 hadA       36.2       3.1        11.8         3.70E-02  intermediary metabolism and respiration
Rv0682 rpsL       14.4       0.0        inf          2.29E-02  information pathways
Rv0689c           56.2       243.1      0.2          1.81E-02  conserved hypotheticals
Rv0842            3.6        528.6      0.01         6.42E-06  cell wall and cell processes
Rv0870c           209.3      32.2       6.5          1.52E-04  cell wall and cell processes
Rv0874c           120.3      590.0      0.2          6.52E-04  conserved hypotheticals
Rv0970            30.3       109.0      0.3          3.82E-02  cell wall and cell processes
Rv1087A           68.1       285.9      0.2          2.21E-02  cell wall and cell processes
Rv1093 glyA1      119.7      8.2        14.7         5.96E-03  intermediary metabolism and respiration
Rv1253 deaD       15.1       624.0      0.02         2.67E-16  information pathways
Rv1453            73.4       222.7      0.3          4.10E-02  regulatory proteins
Rv1477 ripA       243.7      87.4       2.8          4.77E-02  virulence, detoxification, adaptation
Rv1505c           31.1       121.8      0.3          1.98E-02  conserved hypotheticals
Rv1567c           109.9      353.7      0.3          2.30E-02  cell wall and cell processes
Rv1700            161.4      3.2        49.7         2.50E-02  information pathways
Rv1898            10.6       138.7      0.1          1.09E-02  conserved hypotheticals
Rv1900c lipJ      27.9       99.7       0.3          4.62E-02  intermediary metabolism and respiration

Rv1926c mpt63  48.9  362.4  0.1  3.06E-05  cell wall and cell processes
Rv1982c vapC36  474.8  92.7  5.1  3.68E-05  virulence, detoxification, adaptation
Rv2038c  68.8  13.1  5.2  2.29E-02  cell wall and cell processes
Rv2228c  38.6  9.6  4.0  3.95E-02  information pathways
Rv2247 accD6  18.7  2.8  6.8  4.92E-02  lipid metabolism
Rv2397c cysA1  54.3  210.9  0.3  1.36E-02  cell wall and cell processes
Rv2413c  119.1  406.0  0.3  4.10E-02  conserved hypotheticals
Rv2528c mrr  74.9  1055.9  0.1  1.17E-09  information pathways
Rv2671 ribD  242.3  2.9  82.2  2.67E-16  intermediary metabolism and respiration
Rv2672  365.0  44.4  8.2  1.01E-07  intermediary metabolism and respiration
Rv2724c fadE20  13.3  95.5  0.1  9.45E-05  lipid metabolism
Rv2831 echA16  283.7  8.2  34.5  3.86E-04  lipid metabolism
Rv2995c leuB  26.5  95.3  0.3  3.70E-02  intermediary metabolism and respiration
Rv3078 hab  1.8  18.9  0.1  2.76E-02  intermediary metabolism and respiration
Rv3143  26.4  139.3  0.2  5.46E-03  regulatory proteins
Rv3196A  7.1  39.2  0.2  4.73E-02  conserved hypotheticals
Rv3209  84.0  317.2  0.3  2.50E-02  conserved hypotheticals
Rv3216  92.6  436.9  0.2  4.10E-02  intermediary metabolism and respiration
Rv3235  417.0  1729.1  0.2  5.96E-03  conserved hypotheticals
Rv3254  63.0  235.8  0.3  3.42E-02  conserved hypotheticals
Rv3290c lat  51.8  188.7  0.3  3.70E-02  intermediary metabolism and respiration
Rv3587c  115.4  321.9  0.4  4.45E-02  cell wall and cell processes
Rv3652_mpr  280.5  11.0  25.6  9.54E-05  sRNA
Rv3673c  23.8  2.2  11.0  1.98E-02  intermediary metabolism and respiration
Rv3708c asd  108.0  1.7  64.3  2.29E-02  intermediary metabolism and respiration
Rv3797 fadE35  39.9  139.8  0.3  3.95E-02  lipid metabolism
Rv3830c  46.1  223.6  0.2  1.04E-02  regulatory proteins
Rv3832c  115.7  2393.6  0.05  1.10E-02  conserved hypotheticals
Rv3842c glpQ1  2852.1  883.9  3.2  3.03E-02  intermediary metabolism and respiration

C. sRNA transcription (N=3)

sRNA  Lineage 1  Lineage 2  Fold change  p-value  Functional category
MTS0900  326.8  2678.2  0.1  3.89E-02  NA
MTS1338  71.5  911.3  0.1  1.10E-02  NA
MTS2458  108.5  364.2  0.3  3.70E-02  NA
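
The fold change values tabulated in this appendix are consistent with the ratio of the Lineage 1 value to the Lineage 2 value, with "inf" reported where the Lineage 2 value is zero. The following is a minimal sketch of that calculation under this assumption; the function names are illustrative and are not taken from the analysis pipeline used in the thesis.

```python
import math


def fold_change(lineage1: float, lineage2: float) -> float:
    """Ratio of the Lineage 1 value to the Lineage 2 value (assumed definition)."""
    if lineage2 == 0:
        # No signal in Lineage 2: the ratio is unbounded, matching the
        # "inf" entry in the table; 0/0 is left undefined.
        return math.inf if lineage1 > 0 else math.nan
    return lineage1 / lineage2


def log2_fold_change(lineage1: float, lineage2: float) -> float:
    """Symmetric version: positive means higher in Lineage 1."""
    ratio = fold_change(lineage1, lineage2)
    if ratio == 0:
        return -math.inf
    return math.log2(ratio)


# Three rows transcribed from the tables in this appendix.
rows = [
    ("Rv2938 drrC", 708.9, 264.5),
    ("Rv0682 rpsL", 14.4, 0.0),
    ("Rv1253 deaD", 15.1, 624.0),
]

for gene, lin1, lin2 in rows:
    print(gene, round(fold_change(lin1, lin2), 2), round(log2_fold_change(lin1, lin2), 2))
```

Reporting the log2 ratio alongside the raw ratio can make up- and down-regulation easier to compare, since a fold change of 0.02 and one of 50 then sit at similar distances from zero.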

Appendix F: Functional categories

Appendix F

Functional category representation among differentially expressed genes. Toxin-antitoxin genes were the only class found to be significantly over-represented (2.9-fold, adjusted p-value 0.03).

Functional class  Number of genes annotated in genome  Differentially expressed genes  Differentially expressed genes (%)  Representation (fold change from expected)  χ2 (adjusted p-value)
information pathways  243  5  4.5  0.7  0.97
intermediary metabolism and respiration  925  20  17.9  0.8  0.47
PE/PPE  168  4  3.6  0.9  0.97
regulatory proteins  198  5  4.5  0.9  0.97
conserved hypotheticals  1042  27  24.1  0.9  0.97
lipid metabolism  271  9  8.0  1.2  0.47
cell wall and cell processes  773  27  24.1  1.3  0.97
virulence, detoxification, adaptation  112  4  3.6  1.3  0.97
unknown  16  1  0.9  2.2  0.97
toxin-antitoxins  124  10  8.9  2.9  0.03
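
As a rough illustration of how the percentage and representation columns can be derived, the sketch below recomputes them from the counts in the table. It assumes that the expected number of differentially expressed genes in a class is proportional to the class's share of annotated genes, and it takes the background to be the sum of the ten listed classes. The background used in the thesis is not stated here, so the recomputed fold values may differ from the tabulated ones in the last digit. The χ2 test and p-value adjustment are not reproduced.

```python
# Counts transcribed from the table above:
# functional class -> (genes annotated in genome, differentially expressed genes)
COUNTS = {
    "information pathways": (243, 5),
    "intermediary metabolism and respiration": (925, 20),
    "PE/PPE": (168, 4),
    "regulatory proteins": (198, 5),
    "conserved hypotheticals": (1042, 27),
    "lipid metabolism": (271, 9),
    "cell wall and cell processes": (773, 27),
    "virulence, detoxification, adaptation": (112, 4),
    "unknown": (16, 1),
    "toxin-antitoxins": (124, 10),
}


def representation(counts):
    """Return {class: (percent of DE genes, fold change from expected)}."""
    # Assumption: the background is the sum of the listed classes.
    total_annotated = sum(annotated for annotated, _ in counts.values())
    total_de = sum(de for _, de in counts.values())
    summary = {}
    for name, (annotated, de) in counts.items():
        percent_de = 100.0 * de / total_de
        # Expected DE count if DE genes were spread in proportion to class size.
        expected = total_de * annotated / total_annotated
        summary[name] = (percent_de, de / expected)
    return summary


summary = representation(COUNTS)
for name, (percent_de, fold) in summary.items():
    print(f"{name}: {percent_de:.1f}% of DE genes, {fold:.2f}-fold of expected")
```

A fold value above 1 indicates that a class contributes more differentially expressed genes than its size alone would predict; whether that excess is significant still requires a test such as the χ2 test reported in the table.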

Appendix G: Publications

Appendix G

List of Publications

Rose, G., Cortes, T., Comas, I., Coscolla, M., Gagneux, S. & Young, D. B. (2013). Mapping genotype-phenotype diversity amongst clinical isolates of Mycobacterium tuberculosis by sequence based profiling. Under review.

Cortes, T., Schubert, O., Rose, G., Arnvig, K. B., Comas, I., Aebersold, R. & Young, D. B. (2013). Genome-wide mapping of transcriptional start sites defines an extensive leaderless transcriptome in Mycobacterium tuberculosis. Under review.

Kato-Maeda, M., Ho, C., Passarelli, B., Banaei, N., Grinsdale, J., Flores, L., Anderson, J., Murray, M., Rose, G., Kawamura, L. M., Pourmand, N., Tariq, M. A., Gagneux, S. & Hopewell, P. C. (2013). Use of Whole Genome Sequencing to Determine the Microevolution of Mycobacterium tuberculosis during an Outbreak. PLoS One 8(3) e58235.

Muller, B., Borrell, S., Rose, G. & Gagneux, S. (2012). The heterogeneous evolution of multidrug-resistant Mycobacterium tuberculosis. Trends Genet 29(3) 160-169.

Comas, I., Borrell, S., Roetzer, A., Rose, G., Malla, B., Kato-Maeda, M., Galagan, J., Niemann, S. & Gagneux, S. (2011). Whole-genome sequencing of rifampicin-resistant Mycobacterium tuberculosis strains identifies compensatory mutations in RNA polymerase genes. Nat Genet 44(1) 106-110.

Arnvig, K. B., Comas, I., Thomson, N. R., Houghton, J., Boshoff, H. I., Croucher, N. J., Rose, G., Perkins, T. T., Parkhill, J., Dougan, G. & Young, D. B. (2011). Sequence-based analysis uncovers an abundance of non-coding RNA in the total transcriptome of Mycobacterium tuberculosis. PLoS Pathog 7(11) e1002342.