Is the whole Genome Represented? An ... - ESCMID

Is the whole Genome Represented?

An Investigation to Determine if Plasmids Are Adequately

Represented In The NCTC3000 WGS Project

Sarah Alexander

@NCTC_3000

2

No Conflicts of Interest to Report

3

National Collection of Type Cultures

• A unique bacterial strain collection founded in 1920

• Clinical strains of veterinary and medical importance

• Dynamic collection of modern and historical strains

• Type and reference strains: ~5200 in total

• Supplied freeze-dried, lenticulated or in DNA format

• Awarded funding – Wellcome Trust Sanger Institute – to sequence 3000 NCTC strains

4

NCTC3000 – Aims & Methods

“Generate reference genomes for 3000 bacterial strains within the collection and embed these genomes in an accessible resource which will enhance the scientific value of the collection”

• Community Resource Project

• High molecular weight DNA extracted from NCTC strains

• DNA quality profile checked on the TapeStation

• DNA sent to the WTSI for PacBio sequencing

5

NCTC3000 - Analysis Pipeline

HGAP Assembly → Circlator → Quiver → Annotation → 16S Check → Web/FTP → ENA/NCBI

6

NCTC3000 – Data Sharing

• Data are regularly uploaded to the WTSI website

• http://www.sanger.ac.uk/resources/downloads/bacteria/nctc/#t_2

@NCTC_3000

7

WGS: Is the Whole Genome Represented?

• Plasmid status of NCTC strains is unknown

• Important because plasmids often encode important phenotypic traits

• Paucity of data on plasmid loss/coverage using PacBio sequencing

• Multiple stages where plasmids can be lost:

  • Culture and isolation

  • DNA extraction

  • Library preparation

  • In silico, during assembly

8

WGS: Is the Whole Genome Represented?

• Aim: to determine which NCTC strains may be missing plasmid data from the WGS

• Examined TapeStation electropherogram profiles of DNA extracts for evidence of plasmids

• Reviewed the plasmid number reported by WGS for each strain

• Compared the two datasets for 783 NCTC bacterial strains covering 169 different species
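
The per-strain comparison described above can be sketched roughly as follows. This is a minimal illustration only, not the project's actual analysis code: the function name `classify`, the label strings and the placeholder strains are assumptions. It compares presence or absence of plasmids (0 versus ≥1) in each dataset.

```python
from collections import Counter


def classify(tapestation_peaks: int, wgs_plasmids: int) -> str:
    """Label one strain by comparing the two plasmid indicators (0 vs >=1)."""
    has_peak = tapestation_peaks >= 1
    has_wgs = wgs_plasmids >= 1
    if has_peak and has_wgs:
        return "concordant: plasmid in both"
    if not has_peak and not has_wgs:
        return "concordant: no plasmid"
    if has_wgs:
        return "discordant: WGS only"
    return "discordant: TapeStation only"


# Records are (strain, TapeStation plasmid peaks, WGS plasmid count).
# Only NCTC13532 comes from the slides; strain_A..C are made-up placeholders.
records = [
    ("NCTC13532", 2, 2),
    ("strain_A", 0, 0),
    ("strain_B", 0, 1),
    ("strain_C", 1, 0),
]

labels = {strain: classify(peaks, plasmids) for strain, peaks, plasmids in records}
summary = Counter(labels.values())

for strain, label in labels.items():
    print(f"{strain}: {label}")
```

Strains labelled "discordant: TapeStation only" would be the ones at risk of missing plasmid data in the WGS.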

9

NCTC13532: TapeStation Electropherogram Profile

[Electropherogram trace with labelled peaks: Lower Marker, Chromosomal DNA, 2 Plasmids]

10

NCTC13532: WGS Plasmid Output

Species   Strain      Sample      Runs        Manual Assembly   Chromosome Contig No.   Plasmid No.
E. coli   NCTC13532   ERS605481   ERR832412   GFF               2                       2
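
As a rough illustration of how contig counts might be read from a GFF assembly file, the sketch below lists the sequences declared in GFF3 `##sequence-region` directives. It assumes the file carries those directives, which is not confirmed by the slides; the function name and the example header are made up and do not reproduce the real NCTC13532 file. Deciding which contigs are chromosomal and which are plasmids was a manual review step in the project and is not attempted here.

```python
import io


def sequence_regions(gff_handle):
    """Return {seqid: length} from GFF3 '##sequence-region' directives."""
    regions = {}
    for line in gff_handle:
        if line.startswith("##FASTA"):
            break  # embedded sequence follows; stop scanning for directives
        if line.startswith("##sequence-region"):
            _, seqid, start, end = line.split()
            regions[seqid] = int(end) - int(start) + 1
    return regions


# Made-up GFF3 header for illustration only
example = io.StringIO(
    "##gff-version 3\n"
    "##sequence-region contig_1 1 4900000\n"
    "##sequence-region contig_2 1 95000\n"
    "##sequence-region contig_3 1 4100\n"
)

regions = sequence_regions(example)
for seqid, length in regions.items():
    print(f"{seqid}\t{length} bp")
```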

11

NCTC13532: TapeStation Electropherogram vs WGS

Species   Strain      Sample      Runs        Manual Assembly   Chromosome Contig No.   Plasmid No.
E. coli   NCTC13532   ERS605481   ERR832412   GFF               2                       2

Concordant: 2 plasmids by both TapeStation and WGS

12

TapeStation versus WGS Data - Plasmids

No. Strains (n = 783)   No. Peaks on TapeStation   No. Plasmids Identified by WGS
473                     0                          0
19                      ≥1                         ≥1
205                     0                          ≥1
86                      ≥1                         0

• 60% of strains (473/783) showed no evidence of plasmids by either the TapeStation or WGS

• 2.4% of strains (19/783) had concordant plasmid data between TapeStation and WGS
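
The percentages follow directly from the four counts in the table. A minimal sketch of that arithmetic, with the dictionary layout and variable names chosen here purely for illustration:

```python
# Keys are (TapeStation peaks, WGS plasmids); values are strain counts from the table.
counts = {
    ("0", "0"): 473,
    (">=1", ">=1"): 19,
    ("0", ">=1"): 205,
    (">=1", "0"): 86,
}

total = sum(counts.values())

# Share of all strains in each category, as a percentage to one decimal place
percent = {key: round(100 * n / total, 1) for key, n in counts.items()}

for (peaks, plasmids), pct in percent.items():
    print(f"TapeStation {peaks:>3}, WGS {plasmids:>3}: {pct}%")
```

The four categories are mutually exclusive, so the counts sum to the 783 strains compared.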

15

TapeStation versus WGS Data - Plasmids

• 26.2% of strains (205/783) had no TapeStation peaks but had detectable plasmids in the WGS

• The average plasmid size in this group of strains was determined to be 85 kb

16

TapeStation versus WGS Data - Plasmids

• 11% of strains (86/783) had TapeStation peaks (<10 kb) but no detectable plasmids in the WGS

• It is likely that, in a small minority of bacterial strains, plasmids are not represented in the WGS

• These strains were analysed further

17

Plasmid Discordant Strains

• 52 of the 86 plasmid-discordant strains were investigated to resolve their plasmid status

• Plasmid miniprep extractions were performed on all 52 strains

• One or more plasmids were recovered from 43 strains (43/52 = 83%)

• Evidence that small plasmids are not represented in the WGS

Further Work

• Determine whether plasmid loss occurs during library preparation or in silico

• Ensure plasmid data are represented in the WGS

18

Conclusions

• NCTC3000 is generating reference genomes for the scientific community

• Sequencing 3000 bacterial strains from 86 different families

• Comparison of WGS with TapeStation profiles: 11% (86/783) of strains are missing plasmid data from the final NCTC3000 dataset

• The smallest plasmids (average 4.1 kb) appear to be at the highest risk of being lost during WGS library construction and sequence assembly

• Further work will be performed for all strains within this group to ensure that complete genomic data are available

19

Acknowledgments

Culture Collections Team

• Ana Deheer-Graham

• NCTC operational team

• Mohammed Abbas Fazal

• Julie E. Russell

• Lusubilo Malakbungu

WTSI

• Karen Oliver

• Nick Grayson

• Julian Parkhill

• Poster – 5427

• Stand – 34A

• http://www.sanger.ac.uk/resources/downloads/bacteria/nctc/#t_2