Single-cell genomics shedding light on marine Thaumarchaeota diversification


SHORT COMMUNICATION


Haiwei Luo1, Bradley B Tolar1, Brandon K Swan2, Chuanlun L Zhang1,3, Ramunas Stepanauskas2, Mary Ann Moran1 and James T Hollibaugh1

1Department of Marine Sciences, University of Georgia, Athens, GA, USA; 2Bigelow Laboratory for Ocean Sciences, East Boothbay, ME, USA and 3State Key Laboratory of Marine Geology, Tongji University, Shanghai, China

Previous studies based on analysis of amoA, 16S ribosomal RNA or accA gene sequences have established that marine Thaumarchaeota fall into two phylogenetically distinct groups corresponding to shallow- and deep-water clades, but it is not clear how water depth interacts with other environmental factors, including light, temperature and location, to affect this pattern of diversification. Earlier studies focused on single-gene distributions were not able to link phylogenetic structure to other aspects of functional adaptation. Here, we analyzed the genome content of 46 uncultivated single Thaumarchaeota cells sampled from epi- and mesopelagic waters of subtropical, temperate and polar oceans. Phylogenomic analysis showed that populations diverged by depth, as expected, and that mesopelagic populations from different locations were well mixed. Functional analysis showed that some traits, including putative DNA photolyase and catalase genes that may be related to adaptive mechanisms to reduce light-induced damage, were found exclusively in members of the epipelagic clade. Our analysis of partial genomes has thus confirmed the depth differentiation of Thaumarchaeota populations observed previously, consistent with the distribution of putative mechanisms to reduce light-induced damage in shallow- and deep-water populations.

The ISME Journal (2014) 8, 732–736; doi:10.1038/ismej.2013.202; published online 7 November 2013

Subject Category: Integrated genomics and post-genomics approaches in microbial ecology

Keywords: marine group I Archaea; Thaumarchaeota; single-cell genomics; nitrification; ammonia oxidation

Thaumarchaeota are abundant in both marine and terrestrial environments and have a significant role in the global nitrogen cycle (Leininger et al., 2006; Wuchter et al., 2006). Oceanic Thaumarchaeota are distributed throughout the water column, with their relative abundance increasing with depth to up to 50% of mesopelagic prokaryotic cells (Karner et al., 2001). Cultivation of marine Thaumarchaeota has proven to be difficult. Only one strain, Nitrosopumilus maritimus SCM1 (Konneke et al., 2005), has been successfully brought into pure culture, while several others have been cultivated as enrichment cultures (Blainey et al., 2011; Mosier et al., 2012a,b; Park et al., 2012a,b). Therefore, our understanding of their diversity and ecological function has been gained largely through culture-independent approaches. Using a few marker genes (16S ribosomal RNA, amoA, accA), many studies have shown that marine Thaumarchaeota fall into shallow- and deep-water phylogenetic clades (Francis et al., 2005; Hallam et al., 2006; Nicol et al., 2011). In addition to depth, light, temperature, latitude and dissolved oxygen have been identified as important correlates of Thaumarchaeota diversity and distributions in the ocean (Prosser and Nicol, 2008; Biller et al., 2012). These single-gene-based analyses have outlined the phylogenetic distribution and diversity of marine Thaumarchaeota, but not provided insights into the adaptive mechanisms giving rise to specific ecotypes.

We turned to single amplified genomes (SAGs) in order to link additional metabolic capabilities to taxonomy and to reconstruct phylogeny using characters sampled across the genome. Forty-six single cells related to Thaumarchaeota populations in a variety of marine environments (Supplementary Figure S1) were obtained from epi- and mesopelagic waters of the Southern Ocean, temperate north Atlantic, subtropical north Pacific and south Atlantic (Supplementary Table S1). Genomes of these cells were recovered with variable success, with a mean of 32% (±12%) relative to the Nitrosopumilus maritimus SCM1 genome (1.65 Mbp). A maximum likelihood phylogenomic tree constructed using a concatenated amino-acid sequence of 97 single-copy orthologous genes (Supplementary Table S2) with 27 041 sites strongly supported separating the SAGs into two phylogenetically coherent groups corresponding to epi- and mesopelagic clades (Figure 1). The within-clade average nucleotide identity (ANI) is 89.0% (±8.9%) and 86.8% (±5.4%) for epi- and mesopelagic clades, respectively, whereas the between-clade ANI is 75.4% (±4.1%). When the composite genomes from enrichment cultures or from metagenomic assembly were included in the phylogenomic analysis, the surface water SAGs appeared to be evolutionarily separated from all the cultured marine Thaumarchaeota (Supplementary Figure S2). Among the eight Antarctic SAGs sampled, three were obtained from 80 m, a depth associated with the Winter Water (WW) water mass (Church et al., 2003), whereas the remaining five were sampled from 400 m in the Circumpolar Deep Water (CDW) water mass. Although separated by only ~300-m depth, the SAGs from these two water masses were confidently assigned to the epi- and mesopelagic clades, respectively (Figure 1). Conversely, among the mesopelagic cells collected thousands of km apart from CDW and subtropical waters, SAGs were intermingled within the mesopelagic clade (Figure 1).

Correspondence: H Luo or JT Hollibaugh, Department of Marine Sciences, University of Georgia, Athens, GA 30602, USA. E-mail: [email protected] or [email protected]
Received 15 August 2013; revised 27 September 2013; accepted 6 October 2013; published online 7 November 2013

© 2014 International Society for Microbial Ecology. All rights reserved 1751-7362/14. www.nature.com/ismej
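The within- and between-clade ANI summary reported above can be computed from any table of pairwise identities. The Python sketch below is illustrative only: the function names, the dictionary layout and the toy values are assumptions rather than the authors' pipeline or data, and it reports a mean with a sample standard deviation on the assumption that the ± values in the text are standard deviations.

```python
from itertools import combinations
from statistics import mean, stdev


def summarize_ani(ani, clade_of):
    """Split pairwise ANI values into within-clade and between-clade groups.

    ani      -- dict mapping frozenset({genome_a, genome_b}) to percent identity
    clade_of -- dict mapping each genome name to its clade label
    """
    within = {}   # clade label -> list of ANI values for pairs inside that clade
    between = []  # ANI values for pairs that span two clades
    for a, b in combinations(sorted(clade_of), 2):
        value = ani.get(frozenset((a, b)))
        if value is None:  # pair not computed, e.g. too little shared sequence
            continue
        if clade_of[a] == clade_of[b]:
            within.setdefault(clade_of[a], []).append(value)
        else:
            between.append(value)
    return within, between


def mean_sd(values):
    """Mean and sample standard deviation (0.0 when only one value exists)."""
    return mean(values), (stdev(values) if len(values) > 1 else 0.0)


# Hypothetical toy data: two epipelagic and two mesopelagic genomes.
toy_clades = {"epi1": "epi", "epi2": "epi", "meso1": "meso", "meso2": "meso"}
toy_ani = {
    frozenset(("epi1", "epi2")): 90.0,
    frozenset(("meso1", "meso2")): 86.0,
    frozenset(("epi1", "meso1")): 75.0,
    frozenset(("epi1", "meso2")): 76.0,
    frozenset(("epi2", "meso1")): 74.0,
    frozenset(("epi2", "meso2")): 75.0,
}
toy_within, toy_between = summarize_ani(toy_ani, toy_clades)
```

Because SAGs are partial genomes, some pairs may share too little sequence for a reliable ANI estimate; the sketch simply skips pairs that are missing from the table.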

Previous studies have shown that marine Thaumarchaeota are inhibited by light (Merbt et al., 2012) and suggested that sensitivity to photoinhibition might be a key factor determining their depth distribution (Mincer et al., 2007; Church et al., 2010; Hu et al., 2011; Merbt et al., 2012). Light was also implicated as the factor determining the seasonal dynamics of Thaumarchaeota in the Southern Ocean, where they are abundant during winter but nearly absent in summer (Kalanetra et al., 2009). Yet no direct evidence from field populations has been reported to support these hypotheses. Our analyses of genomes from uncultivated Thaumarchaeota showed that a homolog to deoxyribodipyrimidine photolyase, a key gene in the pathway to repair ultraviolet-induced DNA damage (Goosen and Moolenaar, 2008), was present in a SAG sampled from Gulf of Maine surface water (1-m depth), but absent from all of the 42 mesopelagic SAGs (Supplementary Table S3). The occurrence of putative DNA photolyases in surface water Thaumarchaeota was reinforced by conducting a rigorous search for DNA photolyase genes associated with Thaumarchaeota reads in the Global Ocean Survey (GOS) surface water metagenomes. Our identification of seven putative Thaumarchaeota photolyase genes in the GOS data (Figure 2a) is consistent with the hypothesis that light is an important factor structuring marine Thaumarchaeota populations by depth, and suggests that surface water members have evolved effective mechanisms to cope with ultraviolet-induced DNA damage. It remains unknown, however, whether the process of ammonia oxidation is indeed subject to photoinhibition, as has been suggested previously (Mincer et al., 2007). DNA photolyase was not found in the other three epipelagic clade SAGs that were associated with the Antarctic WW water mass and collected from 80 m, but this absence is inconclusive because of the low recovery of genome sequence from these cells (coverage <35% relative to N. maritimus SCM1).
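A back-of-the-envelope calculation shows why absence from three low-coverage SAGs is weak evidence while absence from 42 SAGs is much stronger. The sketch below is not part of the published analysis; it assumes a single-copy gene and that each SAG recovers a random, independent fraction of its genome, which is a simplification because MDA coverage is uneven. The function name is hypothetical.

```python
def prob_missed_in_all(recoveries):
    """Probability that a gene present in every cell is missing from every SAG.

    recoveries -- fraction of each genome recovered (0..1), one value per SAG.
    Assumes a single-copy gene and independent, uniformly random recovery.
    """
    p = 1.0
    for r in recoveries:
        if not 0.0 <= r <= 1.0:
            raise ValueError("recovery must be a fraction between 0 and 1")
        p *= 1.0 - r  # chance that this SAG failed to capture the gene
    return p


# Three SAGs, each at the 35% upper bound quoted in the text.
p_three = prob_missed_in_all([0.35, 0.35, 0.35])

# Forty-two SAGs, each assumed to sit at the 32% study-wide mean recovery.
p_forty_two = prob_missed_in_all([0.32] * 42)
```

Under these assumptions a gene carried by all three Winter Water cells would still go undetected more than a quarter of the time, whereas missing it in all 42 mesopelagic SAGs by chance alone would be extremely unlikely.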

Figure 1 Maximum likelihood phylogenomic analysis of 47 genomes of the marine Thaumarchaeota. The tree was constructed using the RAxML v7.3.0 software (Stamatakis, 2006) using a concatenated amino-acid sequence of 97 genes with 27 041 sites, with a data partition model determined by the PartitionFinder software (Lanfear et al., 2012). Values at the nodes show the number of times the clade defined by that node appeared in the 100 bootstrapped data sets. Two Crenarchaeota outgroup species are not shown. Details of tree construction can be found in Supplementary Material. The epi- and mesopelagic clades are indicated by shading. Single-cell genomes from different water masses/locations/depths are marked with different colors as identified in the legend inset.

Legend: 770 m, South Atlantic Gyre; 800 m, North Pacific Gyre; 400 m, Antarctic Circumpolar Deep Water; 80 m, Antarctic Winter Water; surface, Gulf of Maine. Shaded groups: mesopelagic clade; epipelagic clade. Scale bar: 0.05.

Tip labels: AAA288-N15, AAA008-O05, AAA007-N23, AAA007-E15, AAA008-N07, AAA007-O23, AAA007-N19, AAA008-E02, AAA288-J14, AAA008-P02, AAA001-A19, AAA008-E15, AAA008-O18, AAA008-M23, AB-663-P07, AAA288-I14, AAA288-D22, AB-663-G14, AAA288-O22, AAA008-P23, AAA288-C17, AAA288-P18, AB-663-F14, AB-663-N18, AAA288-N23, AAA007-C21, AAA007-M20, AAA288-D03, AAA008-G03, AAA288-E09, AAA288-M04, AAA288-G05, AAA288-M23, AAA288-K05, AAA288-K20, AAA007-G17, AAA008-E17, AAA288-O17, AAA288-P02, AAA288-K02, AAA008-M21, AB-663-O07, AB-661-I02, AB-661-L21, AB-661-M19, AAA160-J20, Nitrosopumilus maritimus.
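The Figure 1 caption defines each node value as the number of bootstrapped data sets, out of 100, in which the clade at that node reappeared. A minimal sketch of that counting step follows. It is not RAxML's implementation: it assumes rooted trees that have already been reduced to sets of clades, each clade being the set of tip names below a node, and every name in it is hypothetical.

```python
def clade_support(reference_clades, bootstrap_trees):
    """Count, for each reference clade, the bootstrap trees that contain it.

    reference_clades -- iterable of clades, each a frozenset of tip names
    bootstrap_trees  -- list of trees, each given as a set of frozenset clades
    """
    return {
        clade: sum(1 for tree in bootstrap_trees if clade in tree)
        for clade in reference_clades
    }


# Hypothetical toy example with four tips and three bootstrap replicates.
AB = frozenset("AB")
ABC = frozenset("ABC")
toy_replicates = [
    {AB, ABC},                      # replicate 1 recovers both clades
    {AB, frozenset("ABD")},         # replicate 2 recovers only AB
    {frozenset("AC"), ABC},         # replicate 3 recovers only ABC
]
toy_support = clade_support([AB, ABC], toy_replicates)
```

With 100 replicates, as in Figure 1, the raw count is directly the percentage support shown at each node.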

Diversification of marine Thaumarchaeota. H Luo et al


Members of the epi- and mesopelagic clades also appear to differ in their capabilities for reducing oxidative stress. Genes encoding superoxide dismutase, which catalyzes the dismutation of superoxide into oxygen and hydrogen peroxide, are equally abundant in epi- and mesopelagic clades (Supplementary Table S3; χ² test; P > 0.05). Hydrogen peroxide is subsequently converted to water by the enzymes peroxiredoxin (also known as alkyl hydroperoxide reductase) and catalase. Although peroxiredoxin gene families occur with comparable frequency in both clades (Supplementary Table S3; χ² test; P > 0.05), a gene with high homology to catalase was found exclusively in two WW SAGs of the epipelagic clade (Supplementary Table S3). A key difference between these two types of antioxidant enzymes is that peroxiredoxin is 100- to 1000-fold less efficient than catalase; the latter becomes crucial once the former is saturated with hydrogen peroxide (Parsonage et al., 2008). Further, there is evidence that catalase is critical in minimizing ultraviolet-induced oxidative damage in bacteria (Costa et al., 2010). Phylogenetic analysis suggested that this gene was acquired through horizontal gene transfer (Figure 2b), which is further substantiated by a homology search that found no such gene in the genomes of any of the seven cultured marine Thaumarchaeota sequenced to date, all of which are members of the epipelagic ecotype (Supplementary Figure S2). These results are consistent with microbes in epipelagic waters experiencing a stronger oxidative stress because of photochemical and photosynthetic production of reactive oxygen species compared with those in deep water where biological activity is the single source of superoxide (Diaz et al., 2013).
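The gene-frequency comparisons above rest on χ² tests of presence/absence counts per clade. The sketch below shows one common form of that test, a Pearson χ² on a 2×2 table with one degree of freedom and no continuity correction. The source does not state which variant was used, and the counts here are hypothetical, not taken from Supplementary Table S3.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-square test (1 d.f., no continuity correction) on a 2x2 table.

                 gene present   gene absent
    clade 1          a              b
    clade 2          c              d
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs a non-zero total")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Upper tail of chi-square with 1 d.f.: P(X > stat) = erfc(sqrt(stat / 2))
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical counts: gene found in 3 of 4 SAGs in one clade, 30 of 42 in the other.
similar_stat, similar_p = chi2_2x2(3, 1, 30, 12)

# Hypothetical extreme case: gene present in all of one group and none of the other.
extreme_stat, extreme_p = chi2_2x2(10, 0, 0, 10)
```

With only four epipelagic SAGs the expected counts in some cells are small, so an exact test such as Fisher's may be a safer choice in practice; partial genome recovery also means that an observed absence is not proof of true absence.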

When gene functions were assigned more broadly using either COG (Tatusov et al., 1997) or arCOG (Wolf et al., 2012) categories, we found that the genome content of the epipelagic clade was significantly different from the mesopelagic clade (χ² test; P < 0.001), with the signal transduction functional category significantly enriched in the epipelagic clade (Rodriguez-Brito et al., 2006). The ability of generalist marine bacteria to respond to a changing environment has been attributed to differences in the sophistication of regulatory machinery (Lauro et al., 2009; Luo et al., 2013), and this reasoning may apply to selection pressures operating on epipelagic Thaumarchaeota compared with those inhabiting the more stable mesopelagic waters. By contrast, we found a higher abundance of urease genes in SAGs from the mesopelagic compared with the epipelagic clade (Supplementary Table S3), consistent with a recent report of the depth distribution of Thaumarchaeota urease genes in polar oceans (Alonso-Saez et al., 2012).

In conclusion, differentiation of epi- and mesopelagic Thaumarchaeota populations first detected by analysis of single genes (Francis et al., 2005; Hallam et al., 2006) is supported by phylogenomic analysis of partial genomes retrieved from uncultivated cells. The exclusive presence of putative DNA photolyase and catalase in SAGs from the epipelagic is strong evidence that light or light-driven photochemistry is a major factor structuring marine Thaumarchaeota by depth.

Figure 2 Bayesian phylogenetic tree of (a) photolyase and (b) catalase amino-acid sequences. The trees were constructed using the MrBayes v3.1.2 software (Ronquist and Huelsenbeck, 2003) using the WAG + G4 model. Values at the nodes are posterior probabilities of the internal branches. Details of tree construction can be found in Supplementary Material. The distinct phylogenetic groups (Thaumarchaeota, Euryarchaeota, Crenarchaeota, Bacteria) are indicated by shading. The trees consist of sequences from single cells (filled star), reference taxa with sequence id (NCBI gi/accession/locus tag) given in parentheses, and homologs in GOS metagenomes with sequence id in the format of ‘JCVI_READ_XXXX’.


Conflict of Interest

The authors declare no conflict of interest.

Acknowledgements

We thank the Georgia Advanced Computing Resource Center at the University of Georgia for providing computational resources. This research was funded by grants from NSF (OPP 08-38996; OCE-1232982, EF-826924 and OCE-821374) and the Gordon and Betty Moore Foundation, with sequencing support from the US Department of Energy Joint Genome Institute Community Supported Program grants 2010-77 and 2011-387.

References

Alonso-Saez L, Waller AS, Mende DR, Bakker K, Farnelid H, Yager PL et al. (2012). Role for urea in nitrification by polar marine Archaea. Proc Natl Acad Sci USA 109: 17989–17994.

Biller SJ, Mosier AC, Wells GF, Francis CA. (2012). Global biodiversity of aquatic ammonia-oxidizing Archaea is partitioned by habitat. Front Microbiol 3: 252.

Blainey PC, Mosier AC, Potanina A, Francis CA, Quake SR. (2011). Genome of a low-salinity ammonia-oxidizing archaeon determined by single-cell and metagenomic analysis. PLoS One 6: e16626.

Church MJ, Wai B, Karl DM, DeLong EF. (2010). Abundances of crenarchaeal amoA genes and transcripts in the Pacific Ocean. Environ Microbiol 12: 679–688.

Church MJ, DeLong EF, Ducklow HW, Karner MB, Preston CM, Karl DM. (2003). Abundance and distribution of planktonic Archaea and Bacteria in the waters west of the Antarctic Peninsula. Limnol Oceanogr 48: 1893–1902.

Costa CS, Pezzoni M, Fernandez RO, Pizarro RA. (2010). Role of the quorum sensing mechanism in the response of Pseudomonas aeruginosa to lethal and sublethal UVA irradiation. Photochem Photobiol 86: 1334–1342.

Diaz JM, Hansel CM, Voelker BM, Mendes CM, Andeer PF, Zhang T. (2013). Widespread production of extracellular superoxide by heterotrophic bacteria. Science 340: 1223–1226.

Francis CA, Roberts KJ, Beman JM, Santoro AE, Oakley BB. (2005). Ubiquity and diversity of ammonia-oxidizing Archaea in water columns and sediments of the ocean. Proc Natl Acad Sci USA 102: 14683–14688.

Goosen N, Moolenaar GF. (2008). Repair of UV damage in bacteria. DNA Repair 7: 353–379.

Hallam SJ, Mincer TJ, Schleper C, Preston CM, Roberts K, Richardson PM et al. (2006). Pathways of carbon assimilation and ammonia oxidation suggested by environmental genomic analyses of marine Crenarchaeota. PLoS Biol 4: e95.

Hu A, Jiao N, Zhang R, Yang Z. (2011). Niche partitioning of marine group I Crenarchaeota in the euphotic and upper mesopelagic zones of the East China Sea. Appl Environ Microbiol 77: 7469–7478.

Kalanetra KM, Bano N, Hollibaugh JT. (2009). Ammonia-oxidizing Archaea in the Arctic Ocean and Antarctic coastal waters. Environ Microbiol 11: 2434–2445.

Karner MB, DeLong EF, Karl DM. (2001). Archaeal dominance in the mesopelagic zone of the Pacific Ocean. Nature 409: 507–510.

Konneke M, Bernhard AE, de la Torre JR, Walker CB, Waterbury JB, Stahl DA. (2005). Isolation of an autotrophic ammonia-oxidizing marine archaeon. Nature 437: 543–546.

Lanfear R, Calcott B, Ho SYW, Guindon S. (2012). PartitionFinder: combined selection of partitioning schemes and substitution models for phylogenetic analyses. Mol Biol Evol 29: 1695–1701.

Lauro FM, McDougald D, Thomas T, Williams TJ, Egan S, Rice S et al. (2009). The genomic basis of trophic strategy in marine bacteria. Proc Natl Acad Sci USA 106: 15527–15533.

Leininger S, Urich T, Schloter M, Schwark L, Qi J, Nicol GW et al. (2006). Archaea predominate among ammonia-oxidizing prokaryotes in soils. Nature 442: 806–809.

Luo H, Csűrös M, Hughes AL, Moran MA. (2013). Evolution of divergent life history strategies in marine Alphaproteobacteria. mBio 4: e00373–13.

Merbt SN, Stahl DA, Casamayor EO, Martí E, Nicol GW, Prosser JI. (2012). Differential photoinhibition of bacterial and archaeal ammonia oxidation. FEMS Microbiol Lett 327: 41–46.

Mincer TJ, Church MJ, Taylor LT, Preston C, Karl DM, DeLong EF. (2007). Quantitative distribution of presumptive archaeal and bacterial nitrifiers in Monterey Bay and the North Pacific Subtropical Gyre. Environ Microbiol 9: 1162–1175.

Mosier AC, Allen EE, Kim M, Ferriera S, Francis CA. (2012a). Genome sequence of ‘Candidatus Nitrosoarchaeum limnia’ BG20, a low-salinity ammonia-oxidizing archaeon from the San Francisco Bay estuary. J Bacteriol 194: 2119–2120.

Mosier AC, Allen EE, Kim M, Ferriera S, Francis CA. (2012b). Genome sequence of ‘Candidatus Nitrosopumilus salaria’ BD31, an ammonia-oxidizing archaeon from the San Francisco Bay estuary. J Bacteriol 194: 2121–2122.

Nicol GW, Leininger S, Schleper C. (2011). Distribution and activity of ammonia-oxidizing Archaea in natural environments. In Ward BB, Arp DJ, Klotz MJ (eds) Nitrification. ASM Press: Washington, DC, pp 157–178.

Park S-J, Kim J-G, Jung M-Y, Kim S-J, Cha I-T, Kwon K et al. (2012a). Draft genome sequence of an ammonia-oxidizing Archaeon, ‘Candidatus Nitrosopumilus koreensis’ AR1, from marine sediment. J Bacteriol 194: 6940–6941.

Park S-J, Kim J-G, Jung M-Y, Kim S-J, Cha I-T, Ghai R et al. (2012b). Draft genome sequence of an ammonia-oxidizing Archaeon, ‘Candidatus Nitrosopumilus sediminis’ AR2, from Svalbard in the Arctic Circle. J Bacteriol 194: 6948–6949.

Parsonage D, Karplus PA, Poole LB. (2008). Substrate specificity and redox potential of AhpC, a bacterial peroxiredoxin. Proc Natl Acad Sci USA 105: 8209–8214.

Prosser JI, Nicol GW. (2008). Relative contributions of Archaea and bacteria to aerobic ammonia oxidation in the environment. Environ Microbiol 10: 2931–2941.

Rodriguez-Brito B, Rohwer F, Edwards R. (2006). An application of statistics to comparative metagenomics. BMC Bioinformatics 7: 162.


Ronquist F, Huelsenbeck JP. (2003). MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19: 1572–1574.

Stamatakis A. (2006). RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22: 2688–2690.

Tatusov RL, Koonin EV, Lipman DJ. (1997). A genomic perspective on protein families. Science 278: 631–637.

Wolf Y, Makarova K, Yutin N, Koonin E. (2012). Updated clusters of orthologous genes for Archaea: a complex ancestor of the Archaea and the byways of horizontal gene transfer. Biol Direct 7: 46.

Wuchter C, Abbas B, Coolen MJL, Herfort L, van Bleijswijk J, Timmers P et al. (2006). Archaeal nitrification in the ocean. Proc Natl Acad Sci USA 103: 12317–12322.

Supplementary Information accompanies this paper on The ISME Journal website (http://www.nature.com/ismej)


Supplemental Material:

Single Cell Genomes of Marine Thaumarchaeota Reveal Insights into Population Differentiation by Depth

Haiwei Luo, Bradley B. Tolar, Brandon K. Swan, Chuanlun L. Zhang, Ramunas Stepanauskas,

Mary Ann Moran, James T. Hollibaugh

Supplemental Methods

Single cell sample collection and construction of single amplified genome (SAG) libraries

Water samples for single cell analyses were collected and replicate, 1 mL subsamples

were cryopreserved with 6% glycine betaine (Sigma) and stored at –80 ºC (Cleland et al., 2004).

Prior to cell sorting, samples with prokaryote cell abundances above 5 × 10⁵ mL⁻¹ were diluted

10× with filter-sterilized field samples and screened through a 70 µm mesh-size cell strainer

(BD). For heterotrophic prokaryote detection, diluted subsamples (1-3 mL) were incubated for

10-120 min with SYTO-9 DNA stain (5 µM; Invitrogen). Cell sorting was performed with a

MoFlo™ (Beckman Coulter) flow cytometer using a 488 nm argon laser for excitation, a 70 µm

nozzle orifice and a CyClone™ robotic arm for droplet deposition into microplates. The

cytometer was triggered on side scatter. The “single 1 drop” mode was used for maximal sort

purity. Prokaryote cells were separated from eukaryotes, viruses, and detritus based on SYTO-9

fluorescence (proxy to nucleic acid content) and light side scatter (proxy to particle size) (del

Giorgio et al., 1996). Synechococcus cells were excluded, based on their autofluorescence

signal. Target cells were deposited into 384-well plates containing 600 nL per well of either a)

1x TE buffer or b) prepGEM™ Bacteria (Zygem) reaction mix and stored at –80 ºC until further

processing. Of the 384 wells, 315 were dedicated for single cells, 66 were used as negative

controls (no droplet deposited) and 3 received 10 cells each (positive controls).

The accuracy of droplet deposition was determined by depositing 10 µm fluorescent

beads into 384-well plates and microscopically verifying the

presence of beads in the plate wells. Of the 2-3 plates examined each sort day, with one bead

deposited per well, fewer than 2% of wells were found to contain no bead and 0.4% to contain

more than one bead. The latter is most likely caused by co-deposition of two beads attached to

each other, which at certain orientation may have similar optical properties to single beads.

Cells sorted into TE buffer were lysed and their DNA was denatured using cold

KOH (Raghunathan et al., 2005). Genomic DNA from the lysed cells was amplified using

multiple displacement amplification (MDA) (Dean et al., 2002; Raghunathan et al., 2005) in 10

µL final volume. The MDA reactions contained 2 U/µL Repliphi polymerase (Epicentre), 1x

reaction buffer (Epicentre), 0.4 mM each dNTP (Epicentre), 2 mM DTT (Epicentre), 50 mM

phosphorylated random hexamers (IDT) and 1 µM SYTO-9 (Invitrogen) (all final

concentration). The MDA reactions were run at 30 °C for 12-16 h, then inactivated by a 15 min

incubation at 65 °C. Amplified genomic DNA was stored at -80 °C until further processing. We

refer to the MDA products originating from individual cells as single amplified genomes

(SAGs).

Prior to cell sorting, the instrument and the workspace were decontaminated for DNA as

previously described (Stepanauskas and Sieracki, 2007). High molecular weight DNA

contaminants were removed from all MDA reagents by a UV treatment in Stratalinker

(Stratagene) (Woyke et al., 2011). During UV treatment, reagents were placed on ice to avoid

overheating. An empirical optimization of the UV exposure was performed to remove all

detectable contaminants without inactivating the reaction. Cell sorting and MDA setup were

performed in a HEPA-filtered environment. As a quality control, the kinetics of all MDA

reactions was monitored by measuring the SYTO-9 fluorescence using either LightCycler 480

(Roche) or FLUOstar Omega (BMG). The critical point (Cp) was determined for each MDA

reaction as the time required to produce half of the maximal fluorescence. The Cp is inversely

correlated to the amount of DNA template (Zhang and Fang, 2006).
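The critical point defined above, the time needed to reach half of the maximal fluorescence, can be read off a kinetic curve as in the following sketch. The function name is hypothetical, the linear interpolation between reads is an assumption, and the readings are assumed to be baseline-subtracted and non-negative; this is not the LightCycler or FLUOstar software's own method.

```python
def critical_point(times, fluorescence):
    """Time at which fluorescence first reaches half of its maximum.

    times        -- increasing sample times (e.g. hours)
    fluorescence -- SYTO-9 readings at those times, same length as times
    Interpolates linearly between the two reads that bracket the half-maximum.
    """
    if len(times) != len(fluorescence) or not times:
        raise ValueError("need matching, non-empty time and fluorescence series")
    half = max(fluorescence) / 2.0
    for i, value in enumerate(fluorescence):
        if value >= half:
            if i == 0:
                return times[0]  # already at or above half-maximum at the first read
            t0, t1 = times[i - 1], times[i]
            f0, f1 = fluorescence[i - 1], value
            return t0 + (half - f0) * (t1 - t0) / (f1 - f0)
    raise ValueError("no reading reaches half of the maximum (negative readings?)")


# Hypothetical MDA curve sampled every 2 h.
toy_cp = critical_point([0, 2, 4, 6, 8], [0, 0, 10, 30, 40])
```

A lower Cp indicates more starting template, so wells with very early Cp values relative to the single-cell population may warrant a check for multiple cells or contamination.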

PCR screening of SAG libraries

MDA products were diluted 50-fold in TE buffer and 500 nL aliquots of diluted MDA

product served as the template DNA in 5 µL final volume real-time PCR screens. All PCR

reactions were performed using LightCycler 480 SYBR Green I Master Mix (Roche) and the

Roche LightCycler® 480 II real-time thermal cycler. PCR amplification of Archaeal SSU rRNA

from SAGs was done using primers Arch_344F (ACG GGG YGC AGC AGG CGC GA) and

Arch_915R (GTG CTC CCC CGC CAA TTC CT) (Lane et al. 1991). Forward (5´–

GTAAAACGACGGCCAGT–3´) and reverse (5´–CAGGAAACAGCTATGACC–3´) M13

sequencing primers were added to the 5´ ends of each target primer pair to aid direct sequencing

of PCR products. All PCR reactions were run for 40 cycles at the appropriate annealing

temperature, followed by melting curve analysis performed as follows: 95°C for 5 s, 52°C for 1

min, and a continuous temperature ramp (0.11°C/s) from 52 to 97°C. Real-time PCR kinetics and

amplicon melting curves served as proxies for detecting SAGs positive for target genes. New,

20 µL PCR reactions were set up for all PCR-positive SAGs and amplicons were sequenced

from both ends using Sanger technology by Beckman Coulter Genomics.
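The primer design described above, M13 sequencing tails on the 5´ ends of the archaeal SSU rRNA primers, can be written out explicitly. In the sketch below the four sequences are copied from the text with spaces removed; pairing the forward M13 tail with Arch_344F and the reverse tail with Arch_915R is an assumption, and the helper names are hypothetical.

```python
from itertools import product

# IUPAC nucleotide codes commonly seen in degenerate primers.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "N": "ACGT",
}

M13_FORWARD = "GTAAAACGACGGCCAGT"
M13_REVERSE = "CAGGAAACAGCTATGACC"
ARCH_344F = "ACGGGGYGCAGCAGGCGCGA"
ARCH_915R = "GTGCTCCCCCGCCAATTCCT"


def add_tail(tail, primer):
    """Attach a sequencing tail to the 5' end of a primer (both written 5'->3')."""
    return tail + primer


def expand_degenerate(seq):
    """List every unambiguous sequence encoded by an IUPAC-degenerate primer."""
    choices = (IUPAC[base] for base in seq.upper())
    return ["".join(bases) for bases in product(*choices)]


tailed_forward = add_tail(M13_FORWARD, ARCH_344F)  # assumed pairing
tailed_reverse = add_tail(M13_REVERSE, ARCH_915R)  # assumed pairing
forward_variants = expand_degenerate(tailed_forward)
```

Arch_344F contains a single degenerate position (Y, meaning C or T), so the tailed forward primer corresponds to two physical oligonucleotides.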

Single cell sorting, whole genome amplification, real-time PCR screens and PCR product

sequence analyses were performed at the Bigelow Laboratory Single Cell Genomics Center

following protocols described on their web site (www.bigelow.org/scgc). Antarctic SAGs were

also screened at the University of Georgia for the presence of Archaeal amoA genes using

primers and qPCR conditions described in Francis et al. (2005) and Wuchter et al. (2006). PCR

products were sequenced at the Georgia Genomics Facility to verify amplification of the target

gene.

SAG sequencing and analysis

A total of 46 Thaumarchaeota SAGs were chosen for whole genome sequencing based on

multiple displacement amplification (MDA) kinetics, presence of metabolic genes from PCR

screening and geographic location of the sampling site. Three approaches were used for

sequencing marine Thaumarchaeota SAGs: 1) A combination of Illumina and 454 shotgun

sequencing (AAA007-O23), or Illumina only (AB-661-I02, AB-661-L21, AB-661-M19, AB-

663-F14, AB-663-G14, AB-663-N18, AB-663-O07, AB-663-P07, AAA160-J20, AAA001-

A19), as described in Swan et al. (2011); 2) a combination of Illumina and PacBio long read

sequence data (AAA007-N19, AAA288-I14, and AAA288-J14) as described in Martinez-Garcia

et al. (2012) and assembled using Velvet-SC (Chitsaz et al., 2011) and PBcR (Koren et al.,

2012); and 3) 454 shotgun sequencing of Nextera-prepared libraries followed by dual assembly

with Newbler v2.4 and Geneious Pro v.5.5.6 (Drummond et al., 2011) (all remaining SAGs; total

of 32). For each of these 32 SAGs, raw 454 sequences were trimmed in Geneious Pro v5.5.6 and

any remaining transposons were removed using TagCleaner v0.11 (Schmieder et al., 2010).

Sequences were then assembled separately in Newbler v.2.4 (Roche) using default settings and

Geneious using the high-sensitivity setting. The Newbler-assembled sequences were imported

into Geneious and co-assembled with both the Geneious-assembled contigs and the unused

reads. The dual assembled contigs and all other contigs longer than 300 bp were pooled and

annotated. Nextera-prepared sequencing libraries were generated using the Roche Titanium-

Compatible kit with MDA product as the input DNA, following the manufacturer’s instructions

(Adey et al., 2010). A total of 32 Nextera sequencing libraries constructed from SAGs were

barcoded and sequenced (454 FLX Titanium chemistry) on 1/2 microtiter plate. Whole-genome

sequence data for all Thaumarchaeota SAGs are available in IMG under accession numbers

listed in Supplementary Table S1.

SAG whole genome sequence quality control

Each raw sequence data set was screened against all finished bacterial and archaeal

genome sequences (downloaded from NCBI) and the human genome to identify potential

contamination in the sample. Reads were mapped against reference genomes with bwa version

0.5.9 (Li and Durbin, 2009) using default parameters (96% identity threshold). None of the

libraries showed significant contamination. Additionally, gene sequences of the final assemblies

(see below) were compared against the GenBank nr database by BLASTX and taxonomically

classified using MEGAN (57).

To further verify the absence of contaminating sequences in the assemblies, tetramer

frequencies were extracted from all scaffolds using two alternative settings: 1) sliding window of

1000 bp and 100 bp step size and 2) sliding window of 5000 bp and 500 bp step size. Reverse-

complementary tetramers were combined and the frequencies represented as an N×136 feature

matrix, where N is the number of windows and each column of the matrix corresponds to the

frequency of one of the 136 possible tetramers. Principal component analysis (PCA) was then

used to extract the most important components of this high dimensional feature matrix. The

analysis produced a unimodal distribution along the first four PCs for the majority of SAGs,

suggesting homogeneous DNA sources. Scaffolds representing extremes on the first four PCs

were identified and manually examined for their closest TBLASTX hits against the NCBI nt

database.
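
The feature matrix described above can be sketched as follows. This is a minimal illustration, not the authors' script: the function and variable names are hypothetical, and the handling of ambiguous bases (skipped) and the normalization to relative frequencies are assumptions. It merges each tetramer with its reverse complement, which yields the 136 columns mentioned in the text, and slides a window along a scaffold to produce one row per window.

```python
from itertools import product

# Complement map for unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """Represent a k-mer and its reverse complement by one shared key."""
    return min(kmer, reverse_complement(kmer))


# 256 tetramers collapse to 136 classes once reverse complements are merged.
CANONICAL_TETRAMERS = sorted(
    {canonical("".join(p)) for p in product("ACGT", repeat=4)}
)
INDEX = {kmer: i for i, kmer in enumerate(CANONICAL_TETRAMERS)}


def window_frequencies(seq, window=1000, step=100):
    """Return one 136-column frequency row per sliding window (N rows)."""
    seq = seq.upper()
    rows = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        counts = [0] * len(CANONICAL_TETRAMERS)
        for i in range(len(chunk) - 3):
            key = canonical(chunk[i:i + 4])
            if key in INDEX:  # assumption: tetramers with N or other codes are skipped
                counts[INDEX[key]] += 1
        total = sum(counts)
        rows.append([c / total if total else 0.0 for c in counts])
    return rows
```

The second setting in the text corresponds to calling `window_frequencies(scaffold, window=5000, step=500)`; the resulting rows could then be passed to any PCA routine.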

SAG annotation

The gene modeling program Prodigal (http://prodigal.ornl.gov/) was run on the draft

single cell genomes, using default settings that permit overlapping genes and using ATG, GTG,

and TTG as potential starts. The resulting protein translations were compared to the GenBank

non-redundant database (NR), the Swiss-Prot/TrEMBL, Pfam, TIGRFam, Interpro, KEGG, and

COGs databases using BLASTP or HMMER. From these results, product assignments were

made. Initial criteria for automated functional assignment set priority based on TIGRFam, Pfam,

COG, Interpro profiles, pairwise BLAST versus Swiss-Prot/TrEMBL, and KO groups. The

annotation was imported into the Joint Genome Institute Integrated Microbial Genomes (IMG;

http://img.jgi.doe.gov/cgi-bin/pub/main.cgi) (Markowitz et al., 2010).

Phylogenomic tree construction

We compiled two data sets for phylogenomic analyses of marine Thaumarchaeota. The

first data set used sequence data from 46 single cell genomes and the single cultured isolate

Nitrosopumilus maritimus SCM1 genome. The second set included all 8 published composite

Thaumarchaeota genomes in addition to N. maritimus and the 46 single cell genomes used in the

first compilation. These composite genomes are Candidatus Nitrosoarchaeum koreensis MY1,

Candidatus Nitrosoarchaeum limnia SFB1, Candidatus Cenarchaeum symbiosum A, Candidatus

Nitrosoarchaeum limnia BG20, Candidatus Nitrosopumilus salaria BD31, Candidatus

Nitrosopumilus koreensis AR1, Candidatus Nitrosopumilus sediminis AR2, and Candidatus

Nitrososphaera gargensis Ga9.2. Genome sequences from 2 Crenarchaeota, Pyrobaculum

islandicum DSM 4184 and Sulfolobus acidocaldarius DSM 639, were included as outgroups.

These two data sets were analyzed separately, because it is not clear how composite genomes

may affect the phylogenomic reconstruction.

The two data sets were processed in an identical way based on the following procedure.

Orthologous gene families were identified using the OrthoMCL software (Li et al., 2003).

Inparalog copies in a gene family were discarded, and gene members assigned to different COGs

were also discarded. For the remaining single copy orthologous families, only those found in

genomes from at least 25 (in the first data set) or 33 (in the second data set) Thaumarchaeota and

1 outgroup member were retained. This resulted in retention of 97 (in the first data set) or 83 (in

the second data set) gene families. Members in each gene family were aligned at the amino acid

level using MAFFT (Katoh et al., 2005) and the alignments were trimmed using TrimAl

(Capella-Gutiérrez et al., 2009) with the criteria of “-automated1 -resoverlap 0.55 -seqoverlap

60”. Then the trimmed alignments were concatenated, with missing sequences treated as gaps.

To account for heterogeneity in the evolutionary processes among different genes, we applied a

data partition model during phylogenetic construction using the RAxML v7.3.0 software

(Stamatakis, 2006). The PartitionFinder software (Lanfear et al., 2012) grouped the 97 proteins

into 16 partitions and grouped the 83 proteins into 14 partitions, respectively, and estimated the

best-fit substitution matrix for each partition using a maximum likelihood framework. Gamma

distribution of rate variation was also applied in RAxML analysis. Genomes obtained from

single cells have many missing genes and taxa with insufficient phylogenetic signal may become

rogues that take uncertain positions in a phylogenetic tree. We applied the RogueNaRok software

(Aberer et al., 2013) and identified one rogue, SCGC AAA008-M23. Another RAxML

phylogenomic tree was constructed with this genome excluded, but the bootstrap support for

unresolved branches was only slightly improved compared to the original tree. Therefore, only

the original RAxML tree containing sequences from all SAGs is presented. Orthologous protein

sequences are available upon request.
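
The concatenation step, in which a genome lacking a gene family is represented by gaps, can be sketched as below. This is an illustrative sketch under stated assumptions: alignments are held as plain dictionaries mapping a taxon name to its trimmed, aligned sequence, and the function name and the returned partition coordinates are hypothetical conveniences rather than part of the published workflow.

```python
def concatenate_alignments(alignments, taxa):
    """Concatenate per-family alignments into one supermatrix.

    alignments: list of dicts, taxon -> aligned sequence (equal length per dict).
    taxa: ordered list of all taxa to include in the supermatrix.
    Returns (supermatrix dict, list of 1-based inclusive partition ranges).
    """
    pieces = {taxon: [] for taxon in taxa}
    partitions = []
    position = 0
    for alignment in alignments:
        lengths = {len(seq) for seq in alignment.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in one alignment must have equal length")
        n_sites = lengths.pop()
        for taxon in taxa:
            # A taxon missing from this family contributes a block of gaps.
            pieces[taxon].append(alignment.get(taxon, "-" * n_sites))
        partitions.append((position + 1, position + n_sites))
        position += n_sites
    supermatrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return supermatrix, partitions
```

The partition ranges are the kind of per-gene coordinates that a partitioned analysis needs as input; how they were grouped into the 16 or 14 partitions reported above was decided by PartitionFinder, not by this sketch.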

Comparative analysis of genome content

All of the predicted amino acid sequences from the 46 SAGs and Nitrosopumilus

maritimus SCM1 were clustered into orthologous gene families using the OrthoMCL software

(Li et al., 2003). The occurrence rate of each family was then calculated separately for the 4

epipelagic clade SAGs and the 42 mesopelagic clade SAGs. Ecologically relevant gene families

with a higher occurrence rate in one clade than in the other were identified and are listed in

Table S3.
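
As a minimal sketch of the occurrence-rate calculation, assume that each orthologous family has already been reduced to the set of genomes in which it was found. The data structure and function name below are hypothetical, and the rate is taken to be the fraction of genomes in a clade that carry at least one member of the family.

```python
def occurrence_rates(family_genomes, epipelagic, mesopelagic):
    """Compute per-clade occurrence rates for each gene family.

    family_genomes: dict, family id -> set of genome ids with >= 1 member.
    epipelagic, mesopelagic: sets of genome ids belonging to each clade.
    Returns dict, family id -> (epipelagic rate, mesopelagic rate).
    """
    rates = {}
    for family, genomes in family_genomes.items():
        epi_rate = len(genomes & epipelagic) / len(epipelagic)
        meso_rate = len(genomes & mesopelagic) / len(mesopelagic)
        rates[family] = (epi_rate, meso_rate)
    return rates
```

Because single-cell assemblies are incomplete, a low rate in one clade may reflect missing sequence rather than true absence, so rates from this kind of calculation are best read as comparative indicators.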

Analysis of photolyase and catalase

Inferred amino acid sequences closely related to homologs of photolyase and catalase

were identified in the Global Ocean Survey (GOS) metagenomic database using a three-step

procedure. Firstly, the GOS DNA read sequences were translated to amino acid sequences using

all 6 reading frames. Peptide fragments with at least 60 amino acids were retained. Next, the

photolyase and catalase amino acid sequences identified in the SAGs were used as query

sequences to search against GOS using the BLASTp program. The criteria to retain GOS hits for

further analyses were similarity scores ≥60, alignment lengths ≥100, and bit scores ≥100 for

photolyase, and similarity scores ≥75, alignment lengths ≥310, and bit scores ≥500 for

catalase. These parameter values were estimated based on preliminary phylogenetic analyses

showing that sequences recovered using more relaxed criteria were not related to

Thaumarchaeota. Finally, GOS DNA sequences identified as encoding photolyase or catalase peptide

fragments were extracted and searched against the NCBI non-redundant database using the

BLASTx program to confirm that these GOS reads encoded photolyase or catalase.
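
The first step of this procedure, six-frame translation followed by a length filter, can be illustrated with the sketch below. It assumes the standard genetic code and assumes that "peptide fragments" means stretches between stop codons; neither detail is stated in the text, and the function names are hypothetical. Codons containing ambiguous bases are translated as X.

```python
# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate a DNA string codon by codon; unknown codons become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_peptides(read, min_len=60):
    """Return (strand, frame, peptide) for stop-free fragments >= min_len."""
    read = read.upper()
    reverse = read.translate(COMPLEMENT)[::-1]
    peptides = []
    for strand, seq in (("+", read), ("-", reverse)):
        for frame in range(3):
            # Split at stop codons and keep only sufficiently long fragments.
            for fragment in translate(seq[frame:]).split("*"):
                if len(fragment) >= min_len:
                    peptides.append((strand, frame, fragment))
    return peptides
```

The retained peptides would then form the database searched with the SAG photolyase and catalase queries, after which the similarity, alignment length and bit score thresholds given above are applied to the hits.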

Phylogenetic analysis of the photolyase and catalase sequences we retrieved followed an

identical procedure. Since the homologous sequences are very divergent, 7 alignment methods

were used and compared to better account for alignment uncertainty. These methods include

ClustalW (Larkin et al., 2007), MAFFT (Katoh et al., 2005), MUSCLE (Edgar, 2004), T-coffee

(Notredame et al., 2000), DIALIGN (Morgenstern, 2004), Kalign (Lassmann and Sonnhammer,

2005), and OPAL (Wheeler and Kececioglu, 2007). The qualities of the alignments were

compared using the TrimAl software (Capella-Gutiérrez et al., 2009); the best alignment was

selected according to the consistency score calculated by TrimAl. Next, the amino acid

substitution model was determined using the ProtTest v3 software (Darriba et al., 2011). A

phylogenetic tree was constructed using the MrBayes v3.1.2 software (Ronquist and

Huelsenbeck, 2003). One cold and three heated Markov chain Monte Carlo (MCMC) chains

were run for 1,000,000 generations with trees sampled every 100 generations. Two independent

runs of MCMC were performed. The first 25% of all runs were discarded as ‘burn-in’. A 50%

majority-rule consensus tree was constructed from the post-burn-in trees. The average standard

deviation of split frequencies reached <0.01, indicative of convergence.
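
The burn-in and consensus steps can be illustrated with a small sketch. It is not a reimplementation of MrBayes: each sampled tree is assumed to be represented simply as a set of splits (bipartitions), the function names are hypothetical, and a split is kept in the consensus only when its frequency strictly exceeds 50%.

```python
from collections import Counter


def discard_burnin(samples, fraction=0.25):
    """Drop the first `fraction` of the sampled trees from one run."""
    n_burnin = int(len(samples) * fraction)
    return samples[n_burnin:]


def majority_rule_splits(trees, threshold=0.5):
    """Return splits found in more than `threshold` of the trees.

    trees: list of sets of splits; each split is a hashable object such as
    a frozenset of taxon names on one side of a branch.
    Returns dict, split -> frequency among the supplied trees.
    """
    counts = Counter(split for tree in trees for split in tree)
    n_trees = len(trees)
    return {
        split: count / n_trees
        for split, count in counts.items()
        if count / n_trees > threshold
    }
```

With trees sampled every 100 generations over 1,000,000 generations, each run yields on the order of 10,000 samples, so a 25% burn-in removes roughly the first 2,500 before the post-burn-in trees from both runs are pooled for the consensus.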

Supplemental Figure Legends

Figure S1. Maximum likelihood phylogenetic tree of marine Thaumarchaeota 16S rRNA

genes. The tree was constructed using the RAxML v7.3.0 software using the GTR substitution

model with Gamma distributed rate heterogeneity among sites. Values at the nodes show the

number of times the clade defined by that node appeared in the 100 bootstrapped datasets.

Bootstrap values below 50 are not shown. The taxa included in the tree are the Thaumarchaeota

SAGs, cultures, and a few environmental sequences, with 5 soil and hot spring sequences as

outgroups. The epi- and mesopelagic clades are indicated by shading.

Figure S2. Maximum likelihood phylogenomic analysis of 55 Thaumarchaeota genomes.

The tree was constructed using the RAxML v7.3.0 software using a concatenated amino acid

sequence of 83 genes with 24,061 sites, with a data partition model determined by the

PartitionFinder software. Values at the nodes show the number of times the clade defined by that

node appeared in the 100 bootstrapped datasets. Two Crenarchaeota outgroup species are not

shown. Details of tree construction can be found in Supplemental Material. The epi- and

mesopelagic clades are indicated by shading. Single cell genomes from different water

masses/locations/depths are marked with different colors as identified in the legend inset.

Supplemental Table Legends

Table S1. Accession numbers and environmental characteristics of the 46 marine

Thaumarchaeota single-cell amplified genomes used in this study.

Table S2. COG annotations of the 97 proteins used for phylogenomic analysis.

Table S3. Examples of gene families distributed in the epi- and mesopelagic clades of

marine Thaumarchaeota.

References

Aberer, A.J., Krompass, D., and Stamatakis, A. (2013). Pruning rogue taxa

improves phylogenetic accuracy: an efficient algorithm and webservice. Syst Biol 62:

162-166.

Adey A, Morrison H, Asan, Xun X, Kitzman J, Turner E et al. (2010). Rapid,

low-input, low-bias construction of shotgun fragment libraries by high-density in vitro

transposition. Genome Biology 11: R119.

Capella-Gutiérrez, S., Silla-Martínez, J.M., and Gabaldón, T. (2009). trimAl: a

tool for automated alignment trimming in large-scale phylogenetic analyses.

Bioinformatics 25: 1972-1973.

Chitsaz H, Yee-Greenbaum JL, Tesler G, Lombardo M-J, Dupont CL, Badger JH

et al. (2011). Efficient de novo assembly of single-cell bacterial genomes from short-read

data sets. Nat Biotechnol 29: 915–921.

Cleland D, Krader P, McCree C, Tang J, Emerson D (2004). Glycine betaine as a

cryoprotectant for prokaryotes. Journal of Microbiological Methods 58: 31-38.

Darriba, D., Taboada, G.L., Doallo, R., and Posada, D. (2011). ProtTest 3: fast

selection of best-fit models of protein evolution. Bioinformatics 27: 1164-1165.

Dean FB, Hosono S, Fang L, Wu X, Faruqi AF, Bray-Ward P et al. (2002).

Comprehensive human genome amplification using multiple displacement amplification.

Proceedings of the National Academy of Sciences of the United States of America 99:

5261-5266.

del Giorgio PA, Bird DF, Prairie YT, Planas D (1996). Flow cytometric

determination of bacterial abundance in lake plankton with the green nucleic acid stain SYTO 13. Limnol Oceanogr 41:

783–789.

Drummond AJ, Ashton B, Buxton S, Cheung M, Cooper A, Duran C et al.

(2011). Geneious v5.4, Available from http://www.geneious.com/.

Edgar, R.C. (2004). MUSCLE: multiple sequence alignment with high accuracy

and high throughput. Nucleic Acids Res 32: 1792-1797.

Francis, C. A., K. J. Roberts, J. M. Beman, A. E. Santoro and B. B. Oakley

(2005). "Ubiquity and diversity of ammonia-oxidizing Archaea in water columns and

sediments of the ocean." Proceedings of the National Academy of Sciences of the USA

102(41): 14683-14688.

Katoh, K., Kuma, K.-i., Toh, H., and Miyata, T. (2005). MAFFT version 5:

improvement in accuracy of multiple sequence alignment. Nucleic Acids Res 33: 511-

518.

Koren S, Schatz MC, Walenz BP, Martin J, Howard JT, Ganapathy G et al.

(2012). Hybrid error correction and de novo assembly of single-molecule sequencing

reads. Nat Biotechnol 30: 693–700.

Lane, D. J. 1991. 16S/23S rRNA sequencing. In E. Stackebrandt and M.

Goodfellow (ed.), Nucleic acid techniques in bacterial systematics. John Wiley,

Chichester, UK.

Lanfear, R., Calcott, B., Ho, S.Y.W., and Guindon, S. (2012). PartitionFinder:

combined selection of partitioning schemes and substitution models for phylogenetic

analyses. Mol Biol Evol 29: 1695-1701.

Larkin, M.A., Blackshields, G., Brown, N.P., Chenna, R., McGettigan, P.A.,

McWilliam, H. et al. (2007). Clustal W and Clustal X version 2.0. Bioinformatics 23:

2947-2948.

Lassmann, T., and Sonnhammer, E. (2005). Kalign - an accurate and fast multiple

sequence alignment algorithm. BMC Bioinformatics 6: 298.

Li H, Durbin R (2009). Fast and accurate short read alignment with Burrows–

Wheeler transform. Bioinformatics 25: 1754–1760.

Li, L., Stoeckert, C.J., and Roos, D.S. (2003). OrthoMCL: identification of

ortholog groups for eukaryotic genomes. Genome Res 13: 2178-2189.

Markowitz VM, Chen I-MA, Palaniappan K, Chu K, Szeto E, Grechkin Y et al.

(2010). The integrated microbial genomes system: an expanding comparative analysis

resource. Nucleic Acids Res 38: D382–D390.

Martinez-Garcia M, Brazel DM, Swan BK, Arnosti C, Chain PSG, Reitenga KG

et al. (2012). Capturing single cell genomes of active polysaccharide degraders: An

unexpected contribution of Verrucomicrobia. PLoS ONE 7: e35314.

Morgenstern, B. (2004). DIALIGN: multiple DNA and protein sequence

alignment at BiBiServ. Nucleic Acids Res 32: W33-W36.

Notredame, C., Higgins, D., and Heringa, J. (2000). T-coffee: a novel method for

fast and accurate multiple sequence alignment. J Mol Biol 302: 205-217.

Raghunathan A, Ferguson HR, Jr., Bornarth CJ, Song W, Driscoll M, Lasken RS

(2005). Genomic DNA amplification from a single bacterium. Applied and

Environmental Microbiology 71: 3342-3347.

Ronquist, F., and Huelsenbeck, J.P. (2003). MrBayes 3: Bayesian phylogenetic

inference under mixed models. Bioinformatics 19: 1572-1574.

Schmieder R, Lim YW, Rohwer F, Edwards R (2010). TagCleaner: Identification

and removal of tag sequences from genomic and metagenomic datasets. BMC

Bioinformatics 11: 341.

Stamatakis, A. (2006). RAxML-VI-HPC: maximum likelihood-based

phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22: 2688

- 2690.

Stepanauskas R, Sieracki ME (2007). Matching phylogeny and metabolism in the

uncultured marine bacteria, one cell at a time. Proceedings of the National Academy of

Sciences 104: 9052-9057.

Swan BK, Martinez-Garcia M, Preston CM, Sczyrba A, Woyke T, Lamy D et al.

(2011). Potential for chemolithoautotrophy among ubiquitous bacteria lineages in the

dark ocean. Science 333: 1296–1300.

Wheeler, T.J., and Kececioglu, J.D. (2007). Multiple alignment by aligning

alignments. Bioinformatics 23: i559-i568.

Woyke T, Sczyrba A, Lee J, Rinke C, Tighe D, Clingenpeel S et al. (2011).

Decontamination of MDA reagents for single cell whole genome amplification. PLoS

ONE 6: e26161.

Wuchter, C., B. Abbas, M. J. L. Coolen, L. Herfort, J. van Bleijswijk, P.

Timmers, M. Strous, E. Teira, G. J. Herndl, J. J. Middelburg, S. Schouten and J. S.

Sinninghe Damste (2006). "Archaeal nitrification in the ocean." Proceedings of the

National Academy of Sciences of the USA 103(33): 12317-12322.

Zhang T, Fang H (2006). Applications of real-time polymerase chain reaction for

quantification of microorganisms in environmental samples. Applied Microbiology and

Biotechnology 70: 281-289.

[Figure S1, Luo et al. Tree image: scale bar 0.04; shaded groups labeled Mesopelagic clade and Epipelagic clade.]

[Figure S2, Luo et al. Tree image: scale bar 0.1; shaded groups labeled Mesopelagic clade and Epipelagic clade. Legend inset: 770 m, South Atlantic Gyre; 800 m, North Pacific Gyre; 400 m, Antarctic Circumpolar Deep Water; 80 m, Antarctic Winter Water; Surface, Gulf of Maine.]

Table S1. Characteristics of 46 marine Thaumarchaeota single-cell amplified genomes.

SAG ID (SCGC-)  IMG Taxon ID  Location1  Latitude  Longitude  Date  Depth (m)  Temperature (°C)  Salinity (PSU)  Assembly size2 (Mbp)  No. contigs  No. genes
AB-661-I02  2524023096  WW  64°24.16′S  64°55.9′W  11Jan2011  80  -0.51  33.93  0.52  70  685
AB-661-L21  2524023084  WW  64°24.16′S  64°55.9′W  11Jan2011  80  -0.51  33.93  0.51  51  646
AB-661-M19  2524023095  WW  64°24.16′S  64°55.9′W  11Jan2011  80  -0.51  33.93  0.58  57  740
AB-663-F14  2523533633  CDW  64°24.16′S  64°55.9′W  11Jan2011  400  1.4  34.64  0.57  107  776
AB-663-G14  2524023085  CDW  64°24.16′S  64°55.9′W  11Jan2011  400  1.4  34.64  0.58  78  755
AB-663-N18  2524023093  CDW  64°24.16′S  64°55.9′W  11Jan2011  400  1.4  34.64  0.36  53  450
AB-663-O07  2524023092  CDW  64°24.16′S  64°55.9′W  11Jan2011  400  1.4  34.64  0.48  37  598
AB-663-P07  2524023094  CDW  64°24.16′S  64°55.9′W  11Jan2011  400  1.4  34.64  0.82  38  1010
AAA160-J20  2529292698  GOM  43°50′39.87′′N  69°38′27.49′′W  16Sep2009  1  22.3  30  0.56  99  726
AAA007-O23  2527291500  SA  12°29′41.4″S  4°59′55.2″W  01Dec2007  800  4.8  34.5  1.13  67  1393
AAA001-A19  2513237067  SA  12°29′41.4″S  4°59′55.2″W  01Dec2007  800  4.8  34.5  0.76  223  1114
AAA007-N19  2513237068  SA  12°29′41.4″S  4°59′55.2″W  01Dec2007  800  4.8  34.5  0.89  194  1295
AAA007-C21  2524023106  SA  12°29′41.4″S  4°59′55.2″W  01Dec2007  800  4.8  34.5  0.28  185  451
AAA007-E15  2524023107  SA  12°29′41.4″S  4°59′55.2″W  01Dec2007  800  4.8  34.5  0.45  244  704
AAA007-G17  2524023108  SA  12°29′41.4″S  4°59′55.2″W  01Dec2007  800  4.8  34.5  0.34  189  581
AAA007-M20  2524023109  SA  12°29′41.4″S  4°59′55.2″W  01Dec2007  800  4.8  34.5  0.30  253  503
AAA007-N23  2524023110  SA  12°29′41.4″S  4°59′55.2″W  01Dec2007  800  4.8  34.5  0.17  123  298
AAA008-E02  2524023111  SA  12°29′41.4″S  4°59′55.2″W  01Dec2007  800  4.8  34.5  0.53  283  838
AAA008-E15  2524023112  SA  12°29′41.4″S  4°59′55.2″W  01Dec2007  800  4.8  34.5  0.46  236  706
AAA008-E17  2524023113  SA  12°29′41.4″S  4°59′55.2″W  01Dec2007  800  4.8  34.5  0.43  257  730
AAA008-G03  2524023114  SA  12°29′41.4″S  4°59′55.2″W  01Dec2007  800  4.8  34.5  0.53  236  852
AAA008-M21  2524023115  SA  12°29′41.4″S  4°59′55.2″W  01Dec2007  800  4.8  34.5  0.36  197  606
AAA008-M23  2524023116  SA  12°29′41.4″S  4°59′55.2″W  01Dec2007  800  4.8  34.5  0.35  171  558
AAA008-N07  2524023117  SA  12°29′41.4″S  4°59′55.2″W  01Dec2007  800  4.8  34.5  0.28  163  489
AAA008-O05  2524023118  SA  12°29′41.4″S  4°59′55.2″W  01Dec2007  800  4.8  34.5  0.35  150  557
AAA008-O18  2524023119  SA  12°29′41.4″S  4°59′55.2″W  01Dec2007  800  4.8  34.5  0.39  234  653
AAA008-P02  2524023120  SA  12°29′41.4″S  4°59′55.2″W  01Dec2007  800  4.8  34.5  0.44  195  735
AAA008-P23  2524023121  SA  12°29′41.4″S  4°59′55.2″W  01Dec2007  800  4.8  34.5  0.50  230  801
AAA288-I14  2513237066  NP  22°45′ N  158°00′ W  09Sep2009  770  4.7  34.3  1.06  153  1534
AAA288-J14  2513237065  NP  22°45′ N  158°00′ W  09Sep2009  770  4.7  34.3  0.68  253  1073
AAA288-C17  2524023122  NP  22°45′ N  158°00′ W  09Sep2009  770  4.7  34.3  0.59  269  953
AAA288-D03  2524023123  NP  22°45′ N  158°00′ W  09Sep2009  770  4.7  34.3  0.53  335  880
AAA288-D22  2524023124  NP  22°45′ N  158°00′ W  09Sep2009  770  4.7  34.3  0.55  234  859
AAA288-E09  2524023125  NP  22°45′ N  158°00′ W  09Sep2009  770  4.7  34.3  0.55  300  887
AAA288-G05  2524023126  NP  22°45′ N  158°00′ W  09Sep2009  770  4.7  34.3  0.71  242  1063
AAA288-K02  2524023127  NP  22°45′ N  158°00′ W  09Sep2009  770  4.7  34.3  0.66  319  1053
AAA288-K05  2524023128  NP  22°45′ N  158°00′ W  09Sep2009  770  4.7  34.3  0.49  279  803
AAA288-K20  2524023097  NP  22°45′ N  158°00′ W  09Sep2009  770  4.7  34.3  0.57  288  888
AAA288-M04  2524023098  NP  22°45′ N  158°00′ W  09Sep2009  770  4.7  34.3  0.56  327  917
AAA288-M23  2524023099  NP  22°45′ N  158°00′ W  09Sep2009  770  4.7  34.3  0.51  320  847
AAA288-N15  2524023100  NP  22°45′ N  158°00′ W  09Sep2009  770  4.7  34.3  0.27  170  500
AAA288-N23  2524023101  NP  22°45′ N  158°00′ W  09Sep2009  770  4.7  34.3  0.49  248  768
AAA288-O17  2524023102  NP  22°45′ N  158°00′ W  09Sep2009  770  4.7  34.3  0.13  127  249
AAA288-O22  2524023103  NP  22°45′ N  158°00′ W  09Sep2009  770  4.7  34.3  0.37  178  621
AAA288-P02  2524023104  NP  22°45′ N  158°00′ W  09Sep2009  770  4.7  34.3  0.86  307  1274
AAA288-P18  2524023105  NP  22°45′ N  158°00′ W  09Sep2009  770  4.7  34.3  0.68  335  1091

1WW (Antarctic Winter Water), CDW (Antarctic Circumpolar Deep Water), GOM (Gulf of Maine), SA (South Atlantic gyre), NP (North Pacific gyre).

2The cultured reference strain Nitrosopumilus maritimus SCM1 has a genome size of 1.65 Mbp.

Table S2. COG annotation of the 97 proteins used for phylogenomic analysis.

COG id Biological function

COG0541 Signal recognition particle GTPase

COG2511 Archaeal Glu-tRNAGln amidotransferase subunit E (contains GAD domain)

COG0252 L-asparaginase/archaeal Glu-tRNAGln amidotransferase subunit D

COG5257 Translation initiation factor 2, gamma subunit (eIF-2gamma; GTPase)

COG0152 Phosphoribosylaminoimidazolesuccinocarboxamide (SAICAR) synthase

COG1797 Cobyrinic acid a,c-diamide synthase

COG2138 Uncharacterized conserved protein

COG2082 Precorrin isomerase

COG0096 Ribosomal protein S8

COG0113 Delta-aminolevulinic acid dehydratase

COG1881 Phospholipid-binding protein

COG2109 ATP:corrinoid adenosyltransferase

COG0468 RecA/RadA recombinase

COG0126 3-phosphoglycerate kinase

COG0520 Selenocysteine lyase

COG1093 Translation initiation factor 2, alpha subunit (eIF-2alpha)

COG0615 Cytidylyltransferase

COG0097 Ribosomal protein L6P/L9E

COG0128 5-enolpyruvylshikimate-3-phosphate synthase

COG2260 Predicted Zn-ribbon RNA-binding protein

COG0093 Ribosomal protein L14

COG0396 ABC-type transport system involved in Fe-S cluster assembly, ATPase component

COG0090 Ribosomal protein L2

COG0092 Ribosomal protein S3

COG1471 Ribosomal protein S4E

COG0034 Glutamine phosphoribosylpyrophosphate amidotransferase

COG1339 Transcriptional regulator of a riboflavin/FAD biosynthetic operon

COG2090 Uncharacterized protein conserved in archaea

COG1675 Transcription initiation factor IIE, alpha subunit

COG0048 Ribosomal protein S12

COG2125 Ribosomal protein S6E (S10)

COG0225 Peptide methionine sulfoxide reductase

COG0504 CTP synthase (UTP-ammonia lyase)

COG0169 Shikimate 5-dehydrogenase

COG1303 Uncharacterized protein conserved in archaea

COG2139 Ribosomal protein L21E

COG1324 Uncharacterized protein involved in tolerance to divalent cations

COG2262 GTPases

COG0088 Ribosomal protein L4

COG0100 Ribosomal protein S11

COG0010 Arginase/agmatinase/formimionoglutamate hydrolase, arginase family

COG1958 Small nuclear ribonucleoprotein (snRNP) homolog

COG1646 Predicted phosphate-binding enzymes, TIM-barrel fold

COG0189 Glutathione synthase/Ribosomal protein S6 modification enzyme (glutaminyl transferase)

COG4830 Ribosomal protein S26

COG1547 Uncharacterized conserved protein

COG1258 Predicted pseudouridylate synthase

COG0265 Trypsin-like serine proteases, typically periplasmic, contain C-terminal PDZ domain

COG1412 Uncharacterized proteins of PilT N-term./Vapc superfamily

COG0199 Ribosomal protein S14

COG0287 Prephenate dehydrogenase

COG1798 Diphthamide biosynthesis methyltransferase

COG0094 Ribosomal protein L5

COG1903 Cobalamin biosynthesis protein CbiD

COG2890 Methylase of polypeptide chain release factors

COG2073 Cobalamin biosynthesis protein CbiG

COG1867 N2,N2-dimethylguanosine tRNA methyltransferase

COG2875 Precorrin-4 methylase

COG0667 Predicted oxidoreductases (related to aryl-alcohol dehydrogenases)

COG1491 Predicted RNA-binding protein

COG2429 Uncharacterized conserved protein

COG1985 Pyrimidine reductase, riboflavin biosynthesis

COG0358 DNA primase (bacterial type)

COG0057 Glyceraldehyde-3-phosphate dehydrogenase/erythrose-4-phosphate dehydrogenase

COG1024 Enoyl-CoA hydratase/carnitine racemase

COG1537 Predicted RNA-binding proteins

COG0671 Membrane-associated phospholipid phosphatase

COG4221 Short-chain alcohol dehydrogenase of unknown specificity

COG0087 Ribosomal protein L3

COG0186 Ribosomal protein S17

COG1460 Uncharacterized protein conserved in archaea

COG0091 Ribosomal protein L22

COG0644 Dehydrogenases (flavoproteins)

COG1587 Uroporphyrinogen-III synthase

COG0049 Ribosomal protein S7

COG0030 Dimethyladenosine transferase (rRNA methylation)

COG2241 Precorrin-6B methylase 1

COG0185 Ribosomal protein S19

COG0863 DNA modification methylase

COG1180 Pyruvate-formate lyase-activating enzyme

COG0195 Transcription elongation factor

COG3253 Uncharacterized conserved protein

COG1254 Acylphosphatases

COG1378 Predicted transcriptional regulators

COG1439 Predicted nucleic acid-binding protein, consists of a PIN domain and a Zn-ribbon module

COG1522 Transcriptional regulators

COG1964 Predicted Fe-S oxidoreductases

COG1599 Single-stranded DNA-binding replication protein A (RPA), large (70 kD) subunit and related ssDNA-binding proteins

COG1940 Transcriptional regulator/sugar kinase

COG0054 Riboflavin synthase beta-chain

COG0255 Ribosomal protein L29

COG2242 Precorrin-6B methylase 2

COG1409 Predicted phosphohydrolases

COG1382 Prefoldin, chaperonin cofactor

COG1703 Putative periplasmic protein kinase ArgK and related GTPases of G3E family

COG3185 4-hydroxyphenylpyruvate dioxygenase and related hemolysins

COG0805 Sec-independent protein secretion pathway component TatC

Table S3. Examples of gene families distributed in the epi- and mesopelagic clades of marine Thaumarchaeota.

| Family id¹ | Gene | Epipelagic SAGs (N=4) | Nmar² (N=1) | Mesopelagic SAGs (N=42) |
|---|---|---|---|---|
| OR3550 | urease subunit alpha | 0 | 0 | 2 |
| OR3005 | urease subunit beta | 0 | 0 | 3 |
| OR3010 | urease accessory protein UreD | 0 | 0 | 3 |
| OR2375 | urease accessory protein | 0 | 0 | 7 |
| OR2177 | urease accessory protein | 0 | 0 | 8 |
| OR2326 | urea amidohydrolase subunit gamma | 0 | 0 | 8 |
| OR2294 | urease accessory protein | 0 | 0 | 8 |
| OR2270 | urease accessory protein UreD | 0 | 0 | 9 |
| OR1967 | urea active transporter | 0 | 0 | 10 |
| OR2098 | urease subunit alpha | 0 | 0 | 10 |
| OR3671 | Deoxyribodipyrimidine photolyase | 1 | 0 | 0 |
| OR3847 | twin-arginine translocation pathway signal | 2 | 0 | 0 |
| OR2460 | universal stress protein | 3 | 1 | 0 |
| OR3848 | catalase/peroxidase HPI | 2 | 0 | 0 |
| OR2257 | ammonia monooxygenase, subunit A | 2 | 1 | 6 |
| OR2022 | ammonium transporter | 2 | 1 | 8 |
| OR1861 | superoxide dismutase | 3 | 1 | 11 |
| OR2115 | ammonia monooxygenase operon-associated hypothetical protein | 2 | 1 | 9 |
| OR2077 | ammonia monooxygenase subunit B | 2 | 1 | 10 |
| OR1607 | blue (type1) copper domain-containing protein | 2 | 1 | 17 |
| OR1653 | multicopper oxidase type 3 | 2 | 1 | 17 |
| OR1554 | ammonia monooxygenase/methane monooxygenase subunit C | 2 | 1 | 19 |
| OR1042 | DSBA oxidoreductase | 3 | 1 | 27 |
| OR2116 | peroxiredoxin | 2 | 1 | 9 |
| OR1393 | peroxiredoxin | 1 | 1 | 22 |
| OR1993 | peroxiredoxin | 0 | 1 | 13 |
| OR1659 | peroxiredoxin | 0 | 1 | 18 |
| OR2405 | peroxiredoxin | 0 | 0 | 5 |
| OR1050 | blue (type1) copper domain-containing protein | 3 | 1 | 30 |
| OR1688 | blue (type1) copper domain-containing protein | 1 | 1 | 18 |
| OR1569 | blue (type1) copper domain-containing protein | 0 | 1 | 18 |

¹ These are orthologous gene families identified by the OrthoMCL software; the family IDs are arbitrary.

² Nmar: Nitrosopumilus maritimus SCM1, the only marine Thaumarchaeota strain in pure culture.
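The counts in Table S3 can be summarized per clade with a short script. The following is a minimal sketch, not the authors' analysis: it copies a handful of rows from the table, labels each family by the SAG clade(s) in which it was observed, and divides each count by the group size given in the column headers. It assumes that each count is the number of genomes in which the family was detected, which the table does not state explicitly. The names `ROWS`, `classify` and `summarize` are hypothetical. Because the SAGs are partial genomes, a zero count indicates that a family was not observed, not that it is necessarily absent.

```python
# Group sizes as given in the Table S3 column headers.
N_EPI, N_MESO = 4, 42

# A few rows copied from Table S3:
# family id -> (gene, epipelagic SAGs, Nmar, mesopelagic SAGs).
# Assumption: each count is the number of genomes carrying the family.
ROWS = {
    "OR3550": ("urease subunit alpha", 0, 0, 2),
    "OR1967": ("urea active transporter", 0, 0, 10),
    "OR3671": ("deoxyribodipyrimidine photolyase", 1, 0, 0),
    "OR3848": ("catalase/peroxidase HPI", 2, 0, 0),
    "OR2257": ("ammonia monooxygenase, subunit A", 2, 1, 6),
    "OR1861": ("superoxide dismutase", 3, 1, 11),
}


def classify(epi: int, meso: int) -> str:
    """Label a family by the SAG clade(s) in which it was observed."""
    if epi > 0 and meso > 0:
        return "both clades"
    if epi > 0:
        return "epipelagic SAGs only"
    if meso > 0:
        return "mesopelagic SAGs only"
    return "neither SAG clade"


def summarize(rows: dict) -> dict:
    """Return the observation pattern and per-clade fractions for each family."""
    summary = {}
    for family, (gene, epi, nmar, meso) in rows.items():
        summary[family] = {
            "gene": gene,
            "pattern": classify(epi, meso),
            "epi_frac": epi / N_EPI,      # fraction of the 4 epipelagic SAGs
            "meso_frac": meso / N_MESO,   # fraction of the 42 mesopelagic SAGs
            "in_nmar": nmar > 0,          # detected in N. maritimus SCM1
        }
    return summary


if __name__ == "__main__":
    for family, info in summarize(ROWS).items():
        print(
            f"{family}\t{info['gene']}\t{info['pattern']}\t"
            f"{info['epi_frac']:.2f}\t{info['meso_frac']:.2f}\t{info['in_nmar']}"
        )
```

Dividing by group size matters here because the two SAG groups differ tenfold in size (4 versus 42), so raw counts alone overstate how widespread a family is among the mesopelagic SAGs compared with the epipelagic ones.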